TR2022-140

Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation


Abstract:

There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (ii) predicting the 3D visual motion of a sounding source using its separated audio. Towards this end, we present Audio Separator and Motion Predictor (ASMP) – a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation. At the heart of ASMP is a 2.5D scene graph capturing various objects in the video and their pseudo-3D spatial proximities. This graph is constructed by registering together 2.5D monocular depth predictions from the 2D video frames and associating the 2.5D scene regions with the outputs of an object detector applied on those frames. The ASMP task is then mathematically modeled as the joint problem of: (i) recursively segmenting the 2.5D scene graph into several sub-graphs, each associated with a constituent sound in the input audio mixture (which is then separated) and (ii) predicting the 3D motions of the corresponding sound sources from the separated audio. To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods.
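To make the scene-graph construction concrete, below is a minimal sketch of how one might lift 2D detections into a pseudo-3D (2.5D) graph. This is an illustration under stated assumptions, not the ASMP implementation: the depth estimator and detector are generic stand-ins (any monocular depth network and a torchvision Faster R-CNN), and the k-nearest-neighbor edge rule is a simplification of the paper's spatial-proximity association.

```python
# Minimal sketch: lifting 2D detections into a pseudo-3D (2.5D) scene graph.
# Illustrative only -- the depth model, detector, and k-NN edge rule are
# stand-ins, not the ASMP implementation.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def build_25d_scene_graph(frame, depth_model, detector, k_nearest=3):
    """frame: float tensor (3, H, W) in [0, 1]; returns (nodes, edges)."""
    with torch.no_grad():
        depth = depth_model(frame.unsqueeze(0)).squeeze()  # (H, W) pseudo-depth map
        dets = detector([frame])[0]                        # dict: boxes, labels, scores

    nodes = []
    for box in dets["boxes"]:
        x1, y1, x2, y2 = box.int().tolist()
        # Lift each detection to a pseudo-3D point: box center + median region depth.
        z = depth[y1:y2, x1:x2].median()
        nodes.append(torch.tensor([(x1 + x2) / 2.0, (y1 + y2) / 2.0, float(z)]))

    # Connect each object to its k nearest neighbors in pseudo-3D space; the
    # distance serves as the edge's spatial-proximity weight.
    edges = []
    if nodes:
        pts = torch.stack(nodes)
        dist = torch.cdist(pts, pts)
        for i in range(len(nodes)):
            for j in dist[i].argsort()[1:k_nearest + 1].tolist():
                edges.append((i, j, dist[i, j].item()))
    return nodes, edges

# Example wiring (weights API assumes torchvision >= 0.13):
# detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
```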

 

  • Related News & Events

    •  NEWS    MERL researchers presenting five papers at NeurIPS 2022
      Date: November 29, 2022 - December 9, 2022
      Where: NeurIPS 2022
      MERL Contacts: Moitreya Chatterjee; Anoop Cherian; Michael J. Jones; Suhas Lohit
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio
      Brief
      • MERL researchers are presenting five papers at the NeurIPS Conference, which will be held in New Orleans from Nov 29 to Dec 1, with virtual presentations in the following week. NeurIPS is one of the most prestigious and competitive international conferences in machine learning.

        MERL papers in NeurIPS 2022:

        1. “AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments” by Sudipta Paul, Amit Roy-Chowdhury, and Anoop Cherian

        This work proposes a unified multimodal task for audio-visual embodied navigation where the navigating agent can also interact with and seek help from a human/oracle in natural language when it is uncertain of its navigation actions. We propose a multimodal deep hierarchical reinforcement learning framework for solving this challenging task that allows the agent to learn when to seek help and how to use the language instructions. AVLEN agents can interact anywhere in the 3D navigation space and demonstrate state-of-the-art performance when the audio-goal is sporadic or when distractor sounds are present.
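        As a rough illustration of the “when to seek help” idea, the sketch below pairs a high-level ask-or-navigate head with a low-level navigation head over a recurrent audio-visual state. The names and architecture here are assumptions made for exposition, not the paper's implementation, whose hierarchical policy is more elaborate.

        ```python
        # Toy sketch of a hierarchical help-seeking policy (assumed names, not AVLEN's code).
        import torch
        import torch.nn as nn

        class HelpSeekingPolicy(nn.Module):
            def __init__(self, obs_dim, hidden=256, num_nav_actions=4):
                super().__init__()
                self.encoder = nn.GRUCell(obs_dim, hidden)          # recurrent audio-visual state
                self.ask_head = nn.Linear(hidden, 2)                # high level: navigate vs. query oracle
                self.nav_head = nn.Linear(hidden, num_nav_actions)  # low level: which way to move

            def forward(self, obs, h):
                h = self.encoder(obs, h)                            # fold fused features into the state
                return self.ask_head(h), self.nav_head(h), h

        # One rollout step: decide whether to ask first, and navigate otherwise.
        # ask_logits, nav_logits, h = policy(obs, h)
        # ask = torch.distributions.Categorical(logits=ask_logits).sample()
        ```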

        2. “Learning Partial Equivariances From Data” by David W. Romero and Suhas Lohit

        Group equivariance serves as a good prior that improves data efficiency and generalization in deep neural networks, especially in settings with data or memory constraints. However, if the symmetry group is misspecified, equivariance can be overly restrictive and hurt performance. This paper shows how to build partial group convolutional neural networks that learn, directly from data, the level of equivariance at each layer that is suitable for the task at hand. This improves performance while approximately retaining equivariance properties.
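        A toy version of this idea can be written as a convolution that keeps one learnable logit per group element (here the four 90-degree rotations of C4) and softly downweights rotations that hurt the task. This is only a softened sketch; the paper's construction, which samples group elements from learned distributions, is more general.

        ```python
        # Toy "partial" C4-equivariant convolution: a learned softmax over the four
        # 90-degree rotations interpolates between an ordinary conv (one-hot weights)
        # and a fully C4-averaged one (uniform weights). Illustrative, not the paper's method.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PartialC4Conv(nn.Module):
            def __init__(self, in_ch, out_ch, k=3):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)
                self.rot_logits = nn.Parameter(torch.zeros(4))   # one logit per rotation

            def forward(self, x):
                probs = torch.softmax(self.rot_logits, dim=0)    # learned emphasis per rotation
                out = 0.0
                for r in range(4):
                    w_r = torch.rot90(self.weight, r, dims=(2, 3))   # rotate the filter r times
                    out = out + probs[r] * F.conv2d(x, w_r, padding=self.weight.shape[-1] // 2)
                return out
        ```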

        3. “Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation” by Moitreya Chatterjee, Narendra Ahuja, and Anoop Cherian

        There often exist strong correlations between the 3D motion dynamics of a sounding source and the sound that is heard, especially when the source is moving towards or away from the microphone. In this paper, we propose an audio-visual scene graph that learns and leverages such correlations for improved visually-guided audio separation from an audio mixture, while also enabling prediction of the sound source's direction of motion.
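        The most familiar instance of such a correlation is the Doppler effect: a source approaching the microphone is heard at a higher pitch than a receding one. A back-of-the-envelope computation (standard textbook physics, not code from the paper) makes the size of the cue concrete.

        ```python
        # Doppler shift at a stationary microphone: f_heard = f_src * c / (c - v),
        # where v > 0 means the source approaches (standard formula, not paper code).
        c = 343.0          # speed of sound in air, m/s
        f_src = 440.0      # emitted frequency, Hz
        for v in (+10.0, -10.0):                 # approaching vs. receding at 10 m/s
            print(f"v = {v:+.0f} m/s -> heard at {f_src * c / (c - v):.1f} Hz")
        # v = +10 m/s -> heard at 453.2 Hz; v = -10 m/s -> heard at 427.5 Hz
        ```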

        4. “What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective” by Huan Wang, Suhas Lohit, Michael Jones, and Yun Fu

        This paper presents theoretical and practical results for understanding what makes a particular data augmentation technique (DA) suitable for knowledge distillation (KD). We design a simple metric that works very well in practice to predict the effectiveness of DA for KD. Based on this metric, we also propose a new data augmentation technique that outperforms other methods for knowledge distillation in image recognition networks.
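        To make the setup concrete, the sketch below writes out a standard Hinton-style distillation loss with an augmentation in the loop, together with a hypothetical proxy score for ranking augmentations (variability of the teacher's correct-class confidence across augmented views). The proxy is an assumption for illustration; the paper's actual metric differs in its precise form.

        ```python
        # Knowledge distillation with a data augmentation (DA) in the loop, plus a
        # hypothetical proxy score for "how good is this DA for KD" (an assumption,
        # not the paper's metric).
        import torch
        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, T=4.0):
            # Standard Hinton-style distillation: match softened teacher probabilities.
            p_t = F.softmax(teacher_logits / T, dim=1)
            log_p_s = F.log_softmax(student_logits / T, dim=1)
            return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

        def da_proxy_score(teacher, augment, images, labels, n=8):
            # Hypothetical score: std. dev. of the teacher's correct-class confidence
            # across n augmented views of the same batch.
            with torch.no_grad():
                confs = []
                for _ in range(n):
                    probs = F.softmax(teacher(augment(images)), dim=1)
                    confs.append(probs.gather(1, labels.unsqueeze(1)).squeeze(1))
                return torch.stack(confs).std(dim=0).mean().item()
        ```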

        5. “FeLMi : Few shot Learning with hard Mixup” by Aniket Roy, Anshul Shah, Ketul Shah, Prithviraj Dhar, Anoop Cherian, and Rama Chellappa

        Learning from only a few examples is a fundamental challenge in machine learning. Recent approaches show benefits by learning a feature extractor on the abundant and labeled base examples and transferring it to the fewer novel examples. However, the latter stage is often prone to overfitting due to the small size of few-shot datasets. In this paper, we propose a novel uncertainty-based criterion to synthetically produce “hard” and useful data by mixing up real data samples. Our approach leads to state-of-the-art results on various computer vision few-shot benchmarks.
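        The sketch below illustrates the flavor of the approach: mix pairs of labeled few-shot samples, score the mixtures by model uncertainty, and keep only the “hard” (most uncertain) ones. The entropy-based selection rule here is an assumption for illustration and not necessarily the paper's exact criterion.

        ```python
        # Illustrative hard-mixup selection for few-shot learning (assumed selection
        # rule: prediction entropy; not necessarily FeLMi's exact criterion).
        import torch
        import torch.nn.functional as F

        def hard_mixup_samples(model, x, y, num_classes, keep_frac=0.5, alpha=1.0):
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            perm = torch.randperm(x.size(0))
            x_mix = lam * x + (1 - lam) * x[perm]          # mix pairs of few-shot samples
            y_mix = lam * F.one_hot(y, num_classes).float() \
                  + (1 - lam) * F.one_hot(y[perm], num_classes).float()

            with torch.no_grad():
                probs = F.softmax(model(x_mix), dim=1)
                entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)

            k = max(1, int(keep_frac * x.size(0)))
            hard = entropy.topk(k).indices                 # keep the most uncertain mixes
            return x_mix[hard], y_mix[hard]
        ```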
  • Related Publication

  •  Chatterjee, M., Ahuja, N., Cherian, A., "Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation", arXiv, October 2022.
    BibTeX:
    @article{Chatterjee2022oct2,
      author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      title = {Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation},
      journal = {arXiv},
      year = 2022,
      month = oct,
      url = {http://arxiv.org/abs/2210.16472}
    }