Take some time to look deeper into research done at MERL.
Identified partially replicated training examples from the full TANGO model.
For each generated example we show the top match found in the training set under both similarity measures explored in the paper: CLAP and mel. While the generated sounds are not identical to the training data, they show striking similarities in features such as event onsets, which appear to be replicated from the training data.
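The top-match search itself can be sketched as a nearest-neighbor lookup over precomputed feature vectors (CLAP embeddings or flattened mel spectrograms). The function name and shapes below are illustrative, not the paper's code:

```python
import numpy as np

def top_training_match(gen_emb, train_embs):
    """Return (index, score) of the training item most similar to a
    generated example, using cosine similarity.

    gen_emb: (d,) feature vector (e.g., a CLAP embedding or a
             flattened mel spectrogram) of the generated sound.
    train_embs: (n, d) matrix of the same features for the training set.
    """
    gen = gen_emb / np.linalg.norm(gen_emb)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    scores = train @ gen                      # cosine similarity to each training item
    best = int(np.argmax(scores))
    return best, float(scores[best])
```

A high score for some training item is what flags a generated sound as a candidate (partial) replication.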
A method that can be used for a wide variety of photorealistic conditional image generation tasks, including image colorization, super-resolution, semantic generation, identity replication, and text-guided editing.
Capitalizing on the power of diffusion models that have been trained with unlabeled data for unconditional image generation, our work enables them to be repurposed for conditional image synthesis without the need for any retraining. Since no additional training is required for conditional generation, we call this zero-shot conditional image generation. Previous approaches utilizing diffusion models for zero-shot conditional generation can perform either label-based generation or fine-grained conditional generation. In . . .
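Schematically, zero-shot conditioning interleaves the pretrained unconditional denoising step with a gradient step toward consistency with the observation. Everything below (the toy "denoiser", the linear operator A, the step sizes) is an illustrative stand-in, not the paper's method:

```python
import numpy as np

def guided_generation(denoise, A, y, x0, steps=200, lam=0.5):
    """Schematic zero-shot conditioning: alternate the pretrained
    unconditional denoising step with a gradient step that pulls the
    sample toward consistency with the observation y = A @ x.
    Every name here is an illustrative stand-in, not the paper's API.
    """
    x = x0.copy()
    for _ in range(steps):
        x = denoise(x)                        # unconditional prior step
        x = x - lam * A.T @ (A @ x - y)       # gradient of 0.5*||Ax - y||^2
    return x

# Toy setup: the "denoiser" nudges x toward a prior mean, and A observes
# only the first four coordinates (an inpainting-like condition).
mu = np.ones(8)                               # stand-in for the learned prior

def denoise(x):
    return x + 0.1 * (mu - x)

A = np.eye(8)[:4]
y = np.full(4, 2.0)
x_hat = guided_generation(denoise, A, y, np.zeros(8))
```

Unobserved coordinates settle at the prior mean while observed coordinates are pulled toward the conditioning data, which is the essence of repurposing an unconditional model for conditional synthesis.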
A new paradigm for connected and automated driving.
Intelligent transportation is a key component to enable Smart Cities. However, vehicular systems are highly complex and dynamic, where mobile vehicles, pedestrians, road conditions, driver characteristics and weather all play important roles. Connected and automated vehicles (CAVs), with the assistance of infrastructure units, allow for sensing and control information to be communicated and acted upon to achieve intelligent transportation goals.
Edge-assisted IoV reduces communication latency for real-time operation and utilizes the advanced features and data collection methods of the connected and automated vehicles to realize smart mobility functions. How to best . . .
Pre-shot learning techniques to read your mind from biosignals for calibration-free brain-computer interface (BCI) and human-machine interaction (HMI).
Realizing sci-fi scenes in which an intelligent robot can read your thoughts may no longer be a far-future dream, thanks to rapid progress in robotics, sensors, and artificial intelligence (AI). Biosignal processing to analyze humans' physiological states is a key enabling technology for mind sensing in HMI and BCI systems. When machine intelligence can collaboratively support human intelligence without conflict, HMI systems will achieve breakthroughs in various scenarios, including teleworking, maintaining remote facilities, disaster response, epidemic care, . . .
Improving natural-robust accuracy tradeoff for adversary-resilient deep learning.
Deep learning is widely applied, yet incredibly vulnerable to adversarial examples, i.e., virtually imperceptible perturbations that fool deep neural networks (DNNs). We aim at developing robust machine learning technology: practical defenses that yield deep learning-based systems that are resilient to adversarial examples, through better theoretical understanding of the fragility of conventional DNNs.
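The flavor of such virtually imperceptible perturbations is captured by the classic fast gradient sign method (FGSM), sketched here on a logistic-regression stand-in for a DNN. This illustrates the attack being defended against, not a defense from this project:

```python
import numpy as np

def fgsm_attack(x, w, b, y, eps):
    """Fast Gradient Sign Method on a logistic-regression stand-in for a
    DNN: perturb the input by eps in the direction that most increases
    the cross-entropy loss. For a real network the input gradient would
    come from backpropagation; here it is analytic.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))    # predicted P(y = 1)
    grad_x = (p - y) * w                      # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)
```

Even a small eps can flip the model's decision, which is exactly the fragility that robust training aims to remove.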
This research tackles the problem of automatically detecting unusual activity in video sequences.
This research tackles the problem of automatically detecting unusual activity in video sequences. To solve the problem, an algorithm is first given video sequences from a fixed camera showing normal activity. A model representing normal activity is created and used to evaluate new video sequences from the same fixed camera. Any parts of the testing video that do not match the model formed from normal video are considered anomalous.
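At its simplest, the "does this match the normal model" test can be sketched as a nearest-neighbor distance in some feature space. The features and scoring here are illustrative stand-ins, not the algorithms evaluated in the paper:

```python
import numpy as np

def anomaly_scores(normal_feats, test_feats):
    """Score each test-video feature by its distance to the nearest
    exemplar extracted from the normal video; large distances flag
    activity unlike anything seen during normal operation.
    Both inputs are (n, d) arrays of illustrative feature vectors.
    """
    d = np.linalg.norm(test_feats[:, None, :] - normal_feats[None, :, :], axis=2)
    return d.min(axis=1)                      # nearest-neighbor distance per test item
```

Thresholding these scores marks the anomalous parts of the test video.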
We describe two variations of a novel algorithm for video anomaly detection which we evaluate along with two previously published algorithms on the Street Scene dataset (described later).
. . .
A new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words.
We describe a new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words. The system can also understand multiple people speaking either the same or different languages simultaneously.
A novel neural network architecture that fuses multimodal information using a modality-dependent attention mechanism.
Understanding scenes through sensed information is a fundamental challenge for human-machine interfaces. We aim to develop methods for learning semantic representations from multimodal information, including both visual and audio data, as the basis for intelligent communication and interfacing with machines. Towards this goal, we invented a modality-dependent attention mechanism for video captioning based on encoder-decoder sentence generation using recurrent neural networks (RNNs).
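The core idea of weighting modalities differently at each decoding step can be sketched as follows. The function names, the single projection `W`, and the scoring scheme are illustrative simplifications, not the architecture from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_modalities(features, state, W):
    """Modality-dependent attention, sketched: score each modality's
    feature vector against the current decoder state, then fuse the
    modalities with the resulting softmax weights. `W` projects a
    feature into the state space; all names here are illustrative.
    """
    scores = np.array([state @ (W @ f) for f in features])
    alpha = softmax(scores)                   # one attention weight per modality
    fused = sum(a * f for a, f in zip(alpha, features))
    return fused, alpha
```

Because the weights depend on the decoder state, the model can lean on audio for some words and on visual features for others.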
NMF meets Kalman filter dynamics for high-quality speech enhancement in non-stationary noise.
Non-negative data arise in a variety of important signal processing domains, such as power spectra of signals, pixels in images, and count data. We introduce a novel non-negative dynamical system for sequences of such data, and describe its application to modeling speech and audio power spectra.
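As a baseline, the static (dynamics-free) part of such a model is plain NMF with multiplicative updates; the Kalman-style temporal dynamics of the proposed model would be layered on top of the activations H. This sketch is the generic textbook algorithm, not the paper's dynamical system:

```python
import numpy as np

def nmf(V, k, iters=500, eps=1e-9, seed=0):
    """Plain NMF with multiplicative updates (Lee-Seung, Frobenius
    objective ||V - W @ H||): the static baseline on top of which
    temporal dynamics over the activations H can be layered.
    V: (n, m) non-negative data, e.g., a power spectrogram.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps              # non-negative basis spectra
    H = rng.random((k, m)) + eps              # non-negative activations
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # updates preserve non-negativity
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For speech, the columns of W act as spectral templates and the rows of H as their time-varying gains, which is exactly where a temporal model pays off.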
We describe a recurrent deep network for detecting actions in video sequences.
This research attempts to solve the problem of finding particular actions occurring in a video. Much of the past work in this field has looked at the related problem of action recognition. In action recognition, the algorithm is given a short video clip of an action and asked to classify which action is present. In contrast, the problem of action detection requires the algorithm to look through a long video and find the start and stop points of all instances of each known action. We consider action detection to be a more difficult, but much more useful problem to solve in practice.
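The difference from recognition is concrete: detection must emit start and stop points. Reduced to its simplest form, given per-frame action scores (from any classifier; illustrative here), detection becomes interval extraction:

```python
def detect_intervals(scores, thresh):
    """Turn per-frame action scores into detected (start, stop) frame
    intervals: frames whose score clears `thresh` are grouped into
    maximal runs. A minimal stand-in for temporal action detection.
    """
    intervals, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals
```

A recurrent network, as in this work, supplies scores that depend on temporal context rather than single frames.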
Training deep discriminative embeddings to solve the cocktail party problem.
The human auditory system gives us the extraordinary ability to converse in the midst of a noisy throng of party goers. Solving this so-called cocktail party problem has proven extremely challenging for computers, and separating and recognizing speech in such conditions has been the holy grail of speech processing for more than 50 years. Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering, producing unprecedented speaker-independent single-channel separation performance on two-speaker and three-speaker mixtures.
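The clustering half of that pipeline can be sketched as k-means over per-time-frequency-bin embedding vectors; the embeddings themselves would come from the trained network. The deterministic farthest-point initialization below is a simplification for illustration:

```python
import numpy as np

def masks_from_embeddings(emb, n_src=2, iters=50):
    """Deep-clustering-style mask estimation: each time-frequency bin
    carries an embedding vector, k-means groups the bins by source, and
    the assignments become binary separation masks. `emb` has shape
    (n_bins, d); this sketch covers only the clustering step.
    """
    # deterministic farthest-point initialization of the centers
    centers = [emb[0]]
    for _ in range(n_src - 1):
        d = np.min([np.linalg.norm(emb - c, axis=1) for c in centers], axis=0)
        centers.append(emb[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(n_src):
            if (assign == k).any():
                centers[k] = emb[assign == k].mean(axis=0)
    return np.eye(n_src)[assign]              # (n_bins, n_src) binary masks
```

Applying each mask to the mixture spectrogram and inverting yields the separated signals; because clustering is speaker-agnostic, the same model handles unseen speakers.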
Real-time 3D reconstruction using an RGB-D sensor on a tablet.
We present a real-time 3D reconstruction system using an RGB-D sensor on a hand-held tablet. The main novelty of the system is a simultaneous localization and mapping (SLAM) algorithm that uses both point and plane features as primitives. Planes are the most common structures in man-made indoor and outdoor scenes.
As the core of the algorithm, we show that it is possible to register 3D data in two different coordinate systems using any combination of three point/plane features (3 planes, 2 planes and 1 point, 1 plane and 2 points, and 3 points). We use the minimal set of features in a RANSAC framework to robustly compute correspondences and estimate the . . .
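For the 3-point case, the minimal-sample alignment inside each RANSAC iteration is the standard SVD-based Kabsch/Procrustes solution, sketched below on its own (without the RANSAC loop or the plane cases):

```python
import numpy as np

def register_points(P, Q):
    """Rigid registration from 3D point correspondences (the 3-point
    case of the minimal point/plane families): find R, t such that
    q_i ~ R @ p_i + t, via the SVD-based Kabsch/Procrustes solution.
    Inside RANSAC this would run on minimal 3-correspondence samples.
    P, Q: (n, 3) arrays of corresponding points, n >= 3, not collinear.
    """
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p0).T @ (Q - q0)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q0 - R @ p0
    return R, t
```

The plane-based cases follow the same pattern with plane normals and offsets in place of point coordinates.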
mmWave Beam-SNR Fingerprinting (mmBSF) for Precise Indoor Localization using Commercial-Off-The-Shelf (COTS) Routers.
We describe our in-house dataset and an approach of fingerprinting-based indoor localization using COTS mmWave WiFi routers compliant with the IEEE802.11ad standards.
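The fingerprint-matching step can be sketched as weighted k-nearest-neighbor regression over a database of beam-SNR vectors tagged with ground-truth positions. The names and the inverse-distance weighting are illustrative, not the mmBSF method itself:

```python
import numpy as np

def locate(query_snr, fingerprints, positions, k=3):
    """Fingerprinting-style localization sketch: compare a query
    beam-SNR vector against a database of (fingerprint, position)
    pairs and return a distance-weighted average of the k nearest
    positions. All names and the weighting scheme are illustrative.
    """
    d = np.linalg.norm(fingerprints - query_snr, axis=1)
    nn = np.argsort(d)[:k]                    # indices of the k best matches
    w = 1.0 / (d[nn] + 1e-6)                  # closer fingerprints weigh more
    return (w[:, None] * positions[nn]).sum(axis=0) / w.sum()
```

Beam SNRs are a byproduct of the 802.11ad beam-training procedure, which is what makes this feasible on COTS routers without extra hardware.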