Take some time to look deeper into research done at MERL.
Identified partially replicated training examples from the full TANGO model.
For each generated example we show the top match found in the training set under both similarity measures explored in the paper: CLAP and mel. While the generated sounds are not identical to the training data, they show striking similarities in features such as event onsets, which appear to be replicated from the training data.
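The top-match search itself can be sketched as a nearest-neighbor lookup over precomputed feature vectors (CLAP embeddings or flattened mel spectrograms). The function name and shapes below are illustrative, not the paper's code:

```python
import numpy as np

def top_training_match(gen_emb, train_embs):
    """Return (index, score) of the training item most similar to a
    generated example, using cosine similarity.

    gen_emb: (d,) feature vector (e.g., a CLAP embedding or a
             flattened mel spectrogram) of the generated sound.
    train_embs: (n, d) matrix of the same features for the training set.
    """
    gen = gen_emb / np.linalg.norm(gen_emb)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    scores = train @ gen                      # cosine similarity to each training item
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

A high score for some training item is what flags a generated sound as a candidate (partial) replication.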
A method that can be used for a wide variety of photorealistic conditional image generation tasks, including image colorization, super-resolution, semantic generation, identity replication, and text-guided editing.
Capitalizing on the power of diffusion models that have been trained with unlabeled data for unconditional image generation, our work enables them to be repurposed for conditional image synthesis without the need for any retraining. Since no additional training is required for conditional generation, we call this zero-shot conditional image generation. Previous approaches utilizing diffusion models for zero-shot conditional generation can perform either label-based generation or fine-grained conditional generation. In . . .
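Schematically, zero-shot conditioning interleaves the pretrained unconditional denoising step with a gradient step toward consistency with the observation. Everything below (the toy "denoiser", the linear operator A, the step sizes) is an illustrative stand-in, not the paper's method:

```python
import numpy as np

def guided_generation(denoise, A, y, x0, steps=200, lam=0.5):
    """Schematic zero-shot conditioning: alternate the pretrained
    unconditional denoising step with a gradient step that pulls the
    sample toward consistency with the observation y = A @ x.
    Every name here is an illustrative stand-in, not the paper's API.
    """
    x = x0.copy()
    for _ in range(steps):
        x = denoise(x)                        # unconditional prior step
        x = x - lam * A.T @ (A @ x - y)       # gradient of 0.5*||Ax - y||^2
    return x

# Toy setup: the "denoiser" nudges x toward a prior mean, and A observes
# only the first four coordinates (an inpainting-like condition).
mu = np.ones(8)                               # stand-in for the learned prior

def denoise(x):
    return x + 0.1 * (mu - x)

A = np.eye(8)[:4]
y = np.full(4, 2.0)
x_hat = guided_generation(denoise, A, y, np.zeros(8))
```

Unobserved coordinates settle at the prior mean while observed coordinates are pulled toward the conditioning data, which is the essence of repurposing an unconditional model for conditional synthesis.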
A new paradigm for connected and automated driving.
Intelligent transportation is a key component to enable Smart Cities. However, vehicular systems are highly complex and dynamic, where mobile vehicles, pedestrians, road conditions, driver characteristics and weather all play important roles. Connected and automated vehicles (CAVs), with the assistance of infrastructure units, allow for sensing and control information to be communicated and acted upon to achieve intelligent transportation goals.
Edge-assisted IoV reduces communication latency for real-time operation and utilizes the advanced features and data collection methods of the connected and automated vehicles to realize smart mobility functions. How to best . . .
Pre-shot learning techniques to read your mind from biosignals for calibration-free brain-computer interface (BCI) and human-machine interaction (HMI).
Realizing sci-fi scenes in which an intelligent robot can read your thoughts may no longer be a far-future dream, thanks to rapid progress in robotics, sensors, and artificial intelligence (AI). Biosignal processing to analyze humans' physiological states is a key enabling technology for mind sensing in HMI and BCI systems. When machine intelligence can collaboratively support human intelligence without conflict, HMI systems will achieve breakthroughs in various scenarios, including teleworking, maintaining remote facilities, disaster response, epidemic care, . . .
Improving natural-robust accuracy tradeoff for adversary-resilient deep learning.
Deep learning is widely applied, yet incredibly vulnerable to adversarial examples, i.e., virtually imperceptible perturbations that fool deep neural networks (DNNs). We aim at developing robust machine learning technology: practical defenses that yield deep learning-based systems that are resilient to adversarial examples, through better theoretical understanding of the fragility of conventional DNNs.
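The flavor of such virtually imperceptible perturbations is captured by the classic fast gradient sign method (FGSM), sketched here on a logistic-regression stand-in for a DNN. This illustrates the attack being defended against, not a defense from this project:

```python
import numpy as np

def fgsm_attack(x, w, b, y, eps):
    """Fast Gradient Sign Method on a logistic-regression stand-in for a
    DNN: perturb the input by eps in the direction that most increases
    the cross-entropy loss. For a real network the input gradient would
    come from backpropagation; here it is analytic.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))    # predicted P(y = 1)
    grad_x = (p - y) * w                      # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)
```

Even a small eps can flip the model's decision, which is exactly the fragility that robust training aims to remove.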
This research tackles the problem of automatically detecting unusual activity in video sequences.
This research tackles the problem of automatically detecting unusual activity in video sequences. To solve the problem, an algorithm is first given video sequences from a fixed camera showing normal activity. A model representing normal activity is created and used to evaluate new video sequences from the same fixed camera. Any parts of the testing video that do not match the model formed from normal video are considered anomalous.
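At its simplest, the "does this match the normal model" test can be sketched as a nearest-neighbor distance in some feature space. The features and scoring here are illustrative stand-ins, not the algorithms evaluated in the paper:

```python
import numpy as np

def anomaly_scores(normal_feats, test_feats):
    """Score each test-video feature by its distance to the nearest
    exemplar extracted from the normal video; large distances flag
    activity unlike anything seen during normal operation.
    Both inputs are (n, d) arrays of illustrative feature vectors.
    """
    d = np.linalg.norm(test_feats[:, None, :] - normal_feats[None, :, :], axis=2)
    return d.min(axis=1)                      # nearest-neighbor distance per test item
```

Thresholding these scores marks the anomalous parts of the test video.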
We describe two variations of a novel algorithm for video anomaly detection which we evaluate along with two previously published algorithms on the Street Scene dataset (described later).
. . .
A new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words.
We describe a new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words. The system can also understand multiple people speaking either the same or different languages simultaneously.
A novel neural network architecture that fuses multimodal information using a modality-dependent attention mechanism.
Understanding scenes through sensed information is a fundamental challenge for human-machine interfaces. We aim to develop methods for learning semantic representations from multimodal information, including both visual and audio data, as the basis for intelligent communication and interfacing with machines. Towards this goal, we invented a modality-dependent attention mechanism for video captioning based on encoder-decoder sentence generation using recurrent neural networks (RNNs).
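The core idea of weighting modalities differently at each decoding step can be sketched as follows. The function names, the single projection `W`, and the scoring scheme are illustrative simplifications, not the architecture from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_modalities(features, state, W):
    """Modality-dependent attention, sketched: score each modality's
    feature vector against the current decoder state, then fuse the
    modalities with the resulting softmax weights. `W` projects a
    feature into the state space; all names here are illustrative.
    """
    scores = np.array([state @ (W @ f) for f in features])
    alpha = softmax(scores)                   # one attention weight per modality
    fused = sum(a * f for a, f in zip(alpha, features))
    return fused, alpha
```

Because the weights depend on the decoder state, the model can lean on audio for some words and on visual features for others.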
NMF meets Kalman filter dynamics for high-quality speech enhancement in non-stationary noise.
Non-negative data arise in a variety of important signal processing domains, such as power spectra of signals, pixels in images, and count data. We introduce a novel non-negative dynamical system for sequences of such data, and describe its application to modeling speech and audio power spectra.
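As a baseline, the static (dynamics-free) part of such a model is plain NMF with multiplicative updates; the Kalman-style temporal dynamics of the proposed model would be layered on top of the activations H. This sketch is the generic textbook algorithm, not the paper's dynamical system:

```python
import numpy as np

def nmf(V, k, iters=500, eps=1e-9, seed=0):
    """Plain NMF with multiplicative updates (Lee-Seung, Frobenius
    objective ||V - W @ H||): the static baseline on top of which
    temporal dynamics over the activations H can be layered.
    V: (n, m) non-negative data, e.g., a power spectrogram.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps              # non-negative basis spectra
    H = rng.random((k, m)) + eps              # non-negative activations
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # updates preserve non-negativity
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For speech, the columns of W act as spectral templates and the rows of H as their time-varying gains, which is exactly where a temporal model pays off.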
We describe a recurrent deep network for detecting actions in video sequences.
This research attempts to solve the problem of finding particular actions occurring in a video. Much of the past work in this field has looked at the related problem of action recognition. In action recognition, the algorithm is given a short video clip of an action and asked to classify which action is present. In contrast, the problem of action detection requires the algorithm to look through a long video and find the start and stop points of all instances of each known action. We consider action detection to be a more difficult, but much more useful problem to solve in practice.
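The difference from recognition is concrete: detection must emit start and stop points. Reduced to its simplest form, given per-frame action scores (from any classifier; illustrative here), detection becomes interval extraction:

```python
def detect_intervals(scores, thresh):
    """Turn per-frame action scores into detected (start, stop) frame
    intervals: frames whose score clears `thresh` are grouped into
    maximal runs. A minimal stand-in for temporal action detection.
    """
    intervals, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals
```

A recurrent network, as in this work, supplies scores that depend on temporal context rather than single frames.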
Training deep discriminative embeddings to solve the cocktail party problem.
The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of party goers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering, producing unprecedented speaker-independent single-channel separation performance on two-speaker and three-speaker mixtures.
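The clustering half of that pipeline can be sketched as k-means over per-time-frequency-bin embedding vectors; the embeddings themselves would come from the trained network. The deterministic farthest-point initialization below is a simplification for illustration:

```python
import numpy as np

def masks_from_embeddings(emb, n_src=2, iters=50):
    """Deep-clustering-style mask estimation: each time-frequency bin
    carries an embedding vector, k-means groups the bins by source, and
    the assignments become binary separation masks. `emb` has shape
    (n_bins, d); this sketch covers only the clustering step.
    """
    # deterministic farthest-point initialization of the centers
    centers = [emb[0]]
    for _ in range(n_src - 1):
        d = np.min([np.linalg.norm(emb - c, axis=1) for c in centers], axis=0)
        centers.append(emb[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_src):
            if (assign == k).any():
                centers[k] = emb[assign == k].mean(axis=0)
    return np.eye(n_src)[assign]              # (n_bins, n_src) binary masks
```

Applying each mask to the mixture spectrogram and inverting yields the separated signals; because clustering is speaker-agnostic, the same model handles unseen speakers.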
Real-time 3D reconstruction using an RGB-D sensor on a tablet.
We present a real-time 3D reconstruction system using an RGB-D sensor on a hand-held tablet. The main novelty of the system is a simultaneous localization and mapping (SLAM) algorithm that uses both point and plane features as primitives. Planes are the most common structures in man-made indoor and outdoor scenes.
As the core of the algorithm, we show that it is possible to register 3D data in two different coordinate systems using any combination of three point/plane features (3 planes, 2 planes and 1 point, 1 plane and 2 points, and 3 points). We use the minimal set of features in a RANSAC framework to robustly compute correspondences and estimate the . . .
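For the 3-point case, the minimal-sample alignment inside each RANSAC iteration is the standard SVD-based Kabsch/Procrustes solution, sketched below on its own (without the RANSAC loop or the plane cases):

```python
import numpy as np

def register_points(P, Q):
    """Rigid registration from 3D point correspondences (the 3-point
    case of the minimal point/plane families): find R, t such that
    q_i ~ R @ p_i + t, via the SVD-based Kabsch/Procrustes solution.
    Inside RANSAC this would run on minimal 3-correspondence samples.
    P, Q: (n, 3) arrays of corresponding points, n >= 3, not collinear.
    """
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q0 - R @ p0
    return R, t
```

The plane-based cases follow the same pattern with plane normals and offsets in place of point coordinates.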
mmWave Beam-SNR Fingerprinting (mmBSF) for Precise Indoor Localization using Commercial-Off-The-Shelf (COTS) Routers.
We describe our in-house dataset and an approach of fingerprinting-based indoor localization using COTS mmWave WiFi routers compliant with the IEEE802.11ad standards.
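The fingerprint-matching step can be sketched as weighted k-nearest-neighbor regression over a database of beam-SNR vectors tagged with ground-truth positions. The names and the inverse-distance weighting are illustrative, not the mmBSF method itself:

```python
import numpy as np

def locate(query_snr, fingerprints, positions, k=3):
    """Fingerprinting-style localization sketch: compare a query
    beam-SNR vector against a database of (fingerprint, position)
    pairs and return a distance-weighted average of the k nearest
    positions. All names and the weighting scheme are illustrative.
    """
    d = np.linalg.norm(fingerprints - query_snr, axis=1)
    nn = np.argsort(d)[:k]                    # indices of the k best matches
    w = 1.0 / (d[nn] + 1e-6)                  # closer fingerprints weigh more
    return (w[:, None] * positions[nn]).sum(axis=0) / w.sum()
```

Beam SNRs are a byproduct of the 802.11ad beam-training procedure, which is what makes this feasible on COTS routers without extra hardware.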