Speech & Audio

Audio source separation, recognition, and understanding.

Our current research focuses on the application of machine learning to estimation and inference problems in speech and audio processing. Topics include end-to-end speech recognition and enhancement, acoustic modeling and analysis, statistical dialog systems, as well as natural language understanding and adaptive multimodal interfaces.

  • Researchers

  • Awards

    •  AWARD   Best Poster Award and Best Video Award at the International Society for Music Information Retrieval Conference (ISMIR) 2020
      Date: October 15, 2020
      Awarded to: Ethan Manilow, Gordon Wichern, Jonathan Le Roux
      MERL Contacts: Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • Former MERL intern Ethan Manilow and MERL researchers Gordon Wichern and Jonathan Le Roux won the Best Poster Award and the Best Video Award at the 2020 International Society for Music Information Retrieval Conference (ISMIR 2020) for the paper "Hierarchical Musical Source Separation". The conference was held October 11-14 in a virtual format, and both awards were determined by popular vote among the conference attendees.

        The paper proposes a new method for isolating individual sounds in an audio mixture that accounts for the hierarchical relationship between sound sources. Many sounds of interest are hierarchical in nature: during a music performance, for example, a hi-hat note is one of many hi-hat notes, which are one of several parts of a drum kit, itself one of many instruments in a band, which might be playing in a bar alongside other sounds. Inspired by this, the paper reframes audio source separation as a hierarchical problem, combining similar sounds together at certain levels while separating them at other levels, and shows on a musical instrument separation task that a hierarchical approach outperforms non-hierarchical models while also requiring less training data. The paper, poster, and video are available on the paper's page on the ISMIR website.
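
        For illustration, the sketch below shows one hypothetical way (in PyTorch) to express such a hierarchical constraint in a mask-based separation network; the module and variable names are our own and are not taken from the paper.

        # Hypothetical sketch of hierarchical mask-based separation (not the ISMIR paper's code).
        # Idea: estimate masks for fine-grained sources (e.g., individual drum parts) and tie the
        # coarse-level source (e.g., the full drum kit) to the sum of its children's estimates.
        import torch
        import torch.nn as nn

        class HierarchicalSeparator(nn.Module):
            def __init__(self, n_freq=257, hidden=256, n_children=3):
                super().__init__()
                self.rnn = nn.LSTM(n_freq, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
                # One mask head per leaf (child) source; masks in [0, 1] via sigmoid.
                self.child_masks = nn.Linear(2 * hidden, n_children * n_freq)
                self.n_children, self.n_freq = n_children, n_freq

            def forward(self, mix_mag):
                # mix_mag: (batch, time, freq) magnitude spectrogram of the mixture
                h, _ = self.rnn(mix_mag)
                m = torch.sigmoid(self.child_masks(h))
                m = m.view(mix_mag.shape[0], mix_mag.shape[1], self.n_children, self.n_freq)
                children = m * mix_mag.unsqueeze(2)   # leaf-level source estimates
                parent = children.sum(dim=2)          # coarse level = sum of its children
                return children, parent

        # Training would apply reconstruction losses at both levels, so that the coarse target
        # (e.g., "drums") and the fine targets (e.g., "kick", "snare", "hi-hat") stay consistent.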
    •  AWARD   Best Paper Award at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
      Date: December 18, 2019
      Awarded to: Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL researcher Jonathan Le Roux and co-authors Xuankai Chang, Shinji Watanabe (Johns Hopkins University), Wangyou Zhang, and Yanmin Qian (Shanghai Jiao Tong University) won the Best Paper Award at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019) for the paper "MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition". MIMO-Speech is a fully neural end-to-end framework that can transcribe the speech of multiple speakers speaking simultaneously from multi-channel input. The system consists of a monaural masking network, a multi-source neural beamformer, and a multi-output speech recognition model, which are jointly optimized solely via an automatic speech recognition (ASR) criterion. The award was received by lead author Xuankai Chang during the conference, which was held in Sentosa, Singapore, from December 14-18, 2019.
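
        As a rough illustration of how such a pipeline can be chained so that an ASR loss alone drives every component, the following is a hypothetical, heavily simplified PyTorch sketch; in particular, the neural beamformer is reduced here to a mask-weighted channel average, and all module names and shapes are assumptions rather than the paper's code.

        # Hypothetical, simplified sketch of an end-to-end multi-channel multi-speaker ASR chain
        # (not the MIMO-Speech implementation). Because masking, "beamforming", and recognition
        # are all differentiable, the whole chain can be trained from an ASR criterion alone.
        import torch
        import torch.nn as nn

        class MaskNet(nn.Module):
            """Predicts one time-frequency mask per speaker from a reference channel."""
            def __init__(self, n_freq=257, n_spk=2, hidden=256):
                super().__init__()
                self.rnn = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_spk * n_freq)
                self.n_spk, self.n_freq = n_spk, n_freq

            def forward(self, ref_mag):                       # (B, T, F)
                h, _ = self.rnn(ref_mag)
                m = torch.sigmoid(self.out(h))
                return m.view(ref_mag.shape[0], -1, self.n_spk, self.n_freq)  # (B, T, S, F)

        class TinyASR(nn.Module):
            """Stand-in for the multi-output recognition model; emits per-frame token logits."""
            def __init__(self, n_freq=257, n_tokens=30, hidden=256):
                super().__init__()
                self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_tokens)

            def forward(self, feats):                         # (B, T, F)
                h, _ = self.rnn(feats)
                return self.out(h)                            # (B, T, n_tokens)

        def separate_and_recognize(multichannel_mag, mask_net, asr):
            # multichannel_mag: (B, C, T, F). In the real system the masks drive an MVDR
            # beamformer on complex spectra; here we simply average channels weighted by the masks.
            masks = mask_net(multichannel_mag[:, 0])          # masks from a reference channel
            enhanced = (multichannel_mag.unsqueeze(3) * masks.unsqueeze(1)).mean(dim=1)  # (B, T, S, F)
            return [asr(enhanced[:, :, s]) for s in range(masks.shape[2])]  # one output per speaker

        # During training, an ASR loss (e.g., joint CTC/attention) would be computed on these
        # per-speaker outputs, with the speaker-to-reference assignment resolved by permutation
        # invariant training, and backpropagated through the entire chain.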
    •  AWARD   Best Student Paper Award at IEEE ICASSP 2018
      Date: April 17, 2018
      Awarded to: Zhong-Qiu Wang
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Former MERL intern Zhong-Qiu Wang (Ph.D. candidate at Ohio State University) received a Best Student Paper Award at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) for the paper "Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation" by Zhong-Qiu Wang, Jonathan Le Roux, and John Hershey. The paper presents work performed during Zhong-Qiu's internship at MERL in the summer of 2017, extending MERL's pioneering Deep Clustering framework for speech separation to a multi-channel setup. The award was received on behalf of Zhong-Qiu by MERL researcher and co-author Jonathan Le Roux during the conference, held in Calgary, April 15-20, 2018.
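
        For background, deep clustering learns an embedding for each time-frequency (T-F) bin such that bins dominated by the same speaker end up close together; separation is then obtained by clustering the embeddings (e.g., with k-means) and masking the mixture, and the multi-channel extension appends spatial features such as inter-channel phase differences to the spectral input. The sketch below is a hypothetical illustration of this inference procedure in PyTorch and scikit-learn, not the awarded paper's code.

        # Hypothetical sketch of deep-clustering-style inference (not the awarded paper's code).
        # The embedding network is assumed to have been trained with the deep clustering objective,
        # which pulls together embeddings of T-F bins dominated by the same source.
        import numpy as np
        import torch
        import torch.nn as nn
        from sklearn.cluster import KMeans

        class EmbeddingNet(nn.Module):
            def __init__(self, n_feat=514, n_freq=257, emb_dim=20, hidden=300):
                super().__init__()
                self.rnn = nn.LSTM(n_feat, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
                self.proj = nn.Linear(2 * hidden, n_freq * emb_dim)
                self.n_freq, self.emb_dim = n_freq, emb_dim

            def forward(self, feats):                         # (B, T, n_feat)
                h, _ = self.rnn(feats)
                v = self.proj(h).view(feats.shape[0], -1, self.n_freq, self.emb_dim)
                return nn.functional.normalize(v, dim=-1)     # unit-norm embedding per T-F bin

        def separate(mix_mag, spatial_feat, net, n_spk=2):
            # mix_mag: (T, F) reference-channel magnitude; spatial_feat: (T, F) spatial features,
            # e.g., inter-channel phase differences (the multi-channel part of the input).
            feats = torch.tensor(np.concatenate([mix_mag, spatial_feat], axis=-1),
                                 dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                emb = net(feats)[0].reshape(-1, net.emb_dim).numpy()      # (T*F, emb_dim)
            labels = KMeans(n_clusters=n_spk, n_init=10).fit_predict(emb)
            masks = [(labels == k).reshape(mix_mag.shape) for k in range(n_spk)]
            return [mix_mag * m for m in masks]               # per-speaker magnitude estimates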

    See All Awards for Speech & Audio
  • News & Events

    •  EVENT   Prof. Melanie Zeilinger of ETH to give keynote at MERL's Virtual Open House
      Date & Time: Thursday, December 9, 2021; 1:00pm - 5:30pm EST
      Speaker: Prof. Melanie Zeilinger, ETH
      Location: Virtual Event
      Research Areas: Applied Physics, Artificial Intelligence, Communications, Computational Sensing, Computer Vision, Control, Data Analytics, Dynamical Systems, Electric Systems, Electronic and Photonic Devices, Machine Learning, Multi-Physical Modeling, Optimization, Robotics, Signal Processing, Speech & Audio, Digital Video, Human-Computer Interaction, Information Security
      Brief
      • MERL is excited to announce the second keynote speaker for our Virtual Open House 2021:
        Prof. Melanie Zeilinger from ETH.

        Our virtual open house will take place on December 9, 2021, 1:00pm - 5:30pm (EST).

        Join us to learn more about who we are, what we do, and discuss our internship and employment opportunities. Prof. Zeilinger's talk is scheduled for 3:15pm - 3:45pm (EST).

        Registration: https://mailchi.mp/merl/merlvoh2021

        Keynote Title: Control Meets Learning - On Performance, Safety and User Interaction

        Abstract: With increasing sensing and communication capabilities, physical systems today are becoming one of the largest generators of data, making learning a central component of autonomous control systems. While this paradigm shift offers tremendous opportunities to address new levels of system complexity, variability and user interaction, it also raises fundamental questions of learning in a closed-loop dynamical control system. In this talk, I will present some of our recent results showing how even safety-critical systems can leverage the potential of data. I will first briefly present concepts for using learning for automatic controller design and for a new safety framework that can equip any learning-based controller with safety guarantees. The second part will then discuss how expert and user information can be utilized to optimize system performance, where I will particularly highlight an approach developed together with MERL for personalizing the motion planning in autonomous driving to the individual driving style of a passenger.
    •  EVENT   Prof. Ashok Veeraraghavan of Rice University to give keynote at MERL's Virtual Open House
      Date & Time: Thursday, December 9, 2021; 1:00pm - 5:30pm EST
      Speaker: Prof. Ashok Veeraraghavan, Rice University
      Location: Virtual Event
      Research Areas: Applied Physics, Artificial Intelligence, Communications, Computational Sensing, Computer Vision, Control, Data Analytics, Dynamical Systems, Electric Systems, Electronic and Photonic Devices, Machine Learning, Multi-Physical Modeling, Optimization, Robotics, Signal Processing, Speech & Audio, Digital Video, Human-Computer Interaction, Information Security
      Brief
      • MERL is excited to announce the first keynote speaker for our Virtual Open House 2021:
        Prof. Ashok Veeraraghavan from Rice University.

        Our virtual open house will take place on December 9, 2021, 1:00pm - 5:30pm (EST).

        Join us to learn more about who we are, what we do, and discuss our internship and employment opportunities. Prof. Veeraraghavan's talk is scheduled for 1:15pm - 1:45pm (EST).

        Registration: https://mailchi.mp/merl/merlvoh2021

        Keynote Title: Computational Imaging: Beyond the limits imposed by lenses.

        Abstract: The lens has long been a central element of cameras, since its early use in the mid-nineteenth century by Niepce, Talbot, and Daguerre. The role of the lens, from the Daguerrotype to modern digital cameras, is to refract light to achieve a one-to-one mapping between a point in the scene and a point on the sensor. This effect enables the sensor to compute a particular two-dimensional (2D) integral of the incident 4D light-field. We propose a radical departure from this practice and the many limitations it imposes. In the talk we focus on two inter-related research projects that attempt to go beyond lens-based imaging.

        First, we discuss our lab’s recent efforts to build flat, extremely thin imaging devices by replacing the lens in a conventional camera with an amplitude mask and computational reconstruction algorithms. These lensless cameras, called FlatCams, can be less than a millimeter in thickness and enable applications where size, weight, thickness or cost are the driving factors. Second, we discuss high-resolution, long-distance imaging using Fourier Ptychography, where the need for a large-aperture aberration-corrected lens is replaced by a camera array and associated phase retrieval algorithms, resulting again in order-of-magnitude reductions in size, weight and cost. Finally, I will spend a few minutes discussing how the holistic computational imaging approach can be used to create ultra-high-resolution wavefront sensors.

    See All News & Events for Speech & Audio
  • Research Highlights

  • Internships

    • SA1686: Multimodal scene understanding

      We are looking for a graduate student interested in helping advance the field of multi-modal scene understanding, with a focus on detailed captioning of a scene using natural language. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidate would be a senior Ph.D. student with experience in deep learning for audio-visual, signal, and natural language processing. The expected duration of the internship is 3-6 months, and the start date is flexible. The internship is preferably onsite at MERL, but may be performed remotely from where you live if the COVID pandemic makes it necessary.

    • CV1722: Multimodal Embodied AI

      MERL is looking for a self-motivated intern to work on problems at the intersection of video understanding, audio processing, and language models. The ideal candidate would be a senior PhD student with a strong background in machine learning and computer vision (as demonstrated via top-tier publications). The candidate must have prior experience in developing deep learning methods for audio-visual-language data. Expertise in popular embodied AI environments as well as a strong background in reinforcement learning will be beneficial. The intern is expected to collaborate with researchers on MERL's computer vision and speech teams to develop algorithms and prepare manuscripts for scientific publications. This internship requires work that can only be done at MERL.

    • SA1689: Audio source separation and sound event detection

      We are seeking a graduate student interested in helping advance the fields of source separation, speech enhancement, robust ASR, and sound event detection/localization in challenging multi-source and far-field scenarios. The intern will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. The ideal candidate would be a senior Ph.D. student with experience in some of the following: audio signal processing, microphone array processing, probabilistic modeling, sequence-to-sequence models, and deep learning techniques, in particular those involving minimal supervision (e.g., unsupervised, weakly-supervised, self-supervised, or few-shot learning). The internship will take place during spring/summer 2022 with an expected duration of 3-6 months and a flexible start date. The internship is preferably onsite at MERL, but may be performed remotely from where you live if the COVID pandemic makes it necessary.


    See All Internships for Speech & Audio
  • Openings


    See All Openings at MERL
  • Recent Publications

    •  Wang, Z.-Q., Wichern, G., Le Roux, J., "On The Compensation Between Magnitude and Phase in Speech Separation", IEEE Signal Processing Letters, November 2021.
      BibTeX TR2021-137 PDF
      • @article{Wang2021nov2,
      • author = {Wang, Zhong-Qiu and Wichern, Gordon and Le Roux, Jonathan},
      • title = {On The Compensation Between Magnitude and Phase in Speech Separation},
      • journal = {IEEE Signal Processing Letters},
      • year = 2021,
      • month = nov,
      • url = {https://www.merl.com/publications/TR2021-137}
      • }
    •  Wang, Z.-Q., Wichern, G., Le Roux, J., "Convolutive Prediction for Reverberant Speech Separation", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), October 2021.
      BibTeX TR2021-127 PDF
      • @inproceedings{Wang2021oct4,
      • author = {Wang, Zhong-Qiu and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Convolutive Prediction for Reverberant Speech Separation},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2021,
      • month = oct,
      • url = {https://www.merl.com/publications/TR2021-127}
      • }
    •  Wichern, G., Chakrabarty, A., Wang, Z.-Q., Le Roux, J., "Anomalous sound detection using attentive neural processes", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), October 2021.
      BibTeX TR2021-129 PDF
      • @inproceedings{Wichern2021oct,
      • author = {Wichern, Gordon and Chakrabarty, Ankush and Wang, Zhong-Qiu and Le Roux, Jonathan},
      • title = {Anomalous sound detection using attentive neural processes},
      • booktitle = {IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
      • year = 2021,
      • month = oct,
      • url = {https://www.merl.com/publications/TR2021-129}
      • }
    •  Chatterjee, M., Le Roux, J., Ahuja, N., Cherian, A., "Visual Scene Graphs for Audio Source Separation", IEEE International Conference on Computer Vision (ICCV), October 2021.
      BibTeX TR2021-095 PDF
      • @inproceedings{Chatterjee2021oct,
      • author = {Chatterjee, Moitreya and Le Roux, Jonathan and Ahuja, Narendra and Cherian, Anoop},
      • title = {Visual Scene Graphs for Audio Source Separation},
      • booktitle = {IEEE International Conference on Computer Vision (ICCV)},
      • year = 2021,
      • month = oct,
      • url = {https://www.merl.com/publications/TR2021-095}
      • }
    •  Higuchi, Y., Moritz, N., Le Roux, J., Hori, T., "Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-571, September 2021, pp. 726-730.
      BibTeX TR2021-103 PDF
      • @inproceedings{Higuchi2021sep,
      • author = {Higuchi, Yosuke and Moritz, Niko and Le Roux, Jonathan and Hori, Takaaki},
      • title = {Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {726--730},
      • month = sep,
      • doi = {10.21437/Interspeech.2021-571},
      • url = {https://www.merl.com/publications/TR2021-103}
      • }
    •  Hori, T., Moritz, N., Hori, C., Le Roux, J., "Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-1643, August 2021, pp. 2097-2101.
      BibTeX TR2021-100 PDF
      • @inproceedings{Hori2021aug3,
      • author = {Hori, Takaaki and Moritz, Niko and Hori, Chiori and Le Roux, Jonathan},
      • title = {Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {2097--2101},
      • month = aug,
      • doi = {10.21437/Interspeech.2021-1643},
      • url = {https://www.merl.com/publications/TR2021-100}
      • }
    •  Hori, C., Hori, T., Le Roux, J., "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-1975, August 2021, pp. 586-590.
      BibTeX TR2021-093 PDF
      • @inproceedings{Hori2021aug2,
      • author = {Hori, Chiori and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {586--590},
      • month = aug,
      • publisher = {ISCA},
      • doi = {10.21437/Interspeech.2021-1975},
      • url = {https://www.merl.com/publications/TR2021-093}
      • }
    •  Moritz, N., Hori, T., Le Roux, J., "Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition", Annual Conference of the International Speech Communication Association (Interspeech), DOI: 10.21437/Interspeech.2021-1693, August 2021, pp. 1822-1826.
      BibTeX TR2021-094 PDF
      • @inproceedings{Moritz2021aug,
      • author = {Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition},
      • booktitle = {Annual Conference of the International Speech Communication Association (Interspeech)},
      • year = 2021,
      • pages = {1822--1826},
      • month = aug,
      • doi = {10.21437/Interspeech.2021-1693},
      • url = {https://www.merl.com/publications/TR2021-094}
      • }

    See All Publications for Speech & Audio
  • Videos

  • Software Downloads