TR2016-003

Deep clustering: Discriminative embeddings for segmentation and separation


Abstract:

We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary numbers of sources and arbitrary source classes, such "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram, in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster inference. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures of multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model performs surprisingly well on three-speaker mixtures.
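
To make the objective concrete: writing V for the N x D matrix of unit-norm embeddings (one row per time-frequency bin) and Y for the N x C matrix of one-hot source assignments, so that YY^T is the ideal affinity matrix, the network is trained to minimize ||VV^T - YY^T||_F^2. This expands to ||V^T V||_F^2 - 2||V^T Y||_F^2 + ||Y^T Y||_F^2, so the N x N affinity matrices never need to be formed. The NumPy/scikit-learn sketch below illustrates this objective and the test-time K-means decoding step; the function names and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||VV^T - YY^T||_F^2, evaluated via the
    low-rank expansion so the N x N affinity matrices are never formed.

    V: (N, D) unit-norm embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot source-membership indicators.
    """
    vtv = V.T @ V  # (D, D)
    vty = V.T @ Y  # (D, C)
    yty = Y.T @ Y  # (C, C)
    return np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2)

def decode_masks(V, n_sources, tf_shape):
    """Test-time decoding: cluster the embeddings with K-means and
    reshape the labels into one binary time-frequency mask per source."""
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    return [(labels == k).reshape(tf_shape) for k in range(n_sources)]

# Toy usage with random stand-ins for a trained network's output
# (hypothetical sizes: T frames, F frequency bins, D-dim embeddings).
T, F, D, C = 100, 129, 40, 2
V = np.random.randn(T * F, D)
V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit-norm rows
Y = np.eye(C)[np.random.randint(C, size=T * F)]  # random one-hot labels
loss = deep_clustering_loss(V, Y)
masks = decode_masks(V, n_sources=C, tf_shape=(T, F))
```

In a full system, each binary mask would be applied to the mixture spectrogram and the masked spectrograms inverted back to waveforms; note that the number of clusters passed to K-means at test time need not match the number of sources seen in training, which is what allows a two-speaker model to be applied to three-speaker mixtures.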


  • Related News & Events

    •  NEWS    Jonathan Le Roux discusses MERL's audio source separation work on popular machine learning podcast
      Date: January 24, 2022
      Where: The TWIML AI Podcast
      MERL Contact: Jonathan Le Roux
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL Speech & Audio Senior Team Leader Jonathan Le Roux was featured in an extended interview on the popular TWIML AI Podcast, presenting MERL's work towards solving the "cocktail party problem". Humans have the extraordinary ability to focus on particular sounds of interest within a complex acoustic scene, such as a cocktail party. MERL's Speech & Audio Team has been at the forefront of the field's effort to develop algorithms that give machines similar abilities. Jonathan talked with host Sam Charrington about the group's decade-long journey on this topic, from early pioneering work using deep learning for speech enhancement and speech separation, to recent work on weakly-supervised separation and hierarchical sound separation, as well as the separation of real-world soundtracks into speech, music, and sound effects (a.k.a. the "cocktail fork problem").

        The TWIML AI Podcast, formerly known as This Week in Machine Learning & AI, was created in 2016 and is followed by more than 10,000 subscribers on YouTube and Twitter. Jonathan's interview marks the 555th episode of the podcast.
    •  NEWS    MERL's speech research featured in NPR's All Things Considered
      Date: February 5, 2018
      Where: National Public Radio (NPR)
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • MERL's speech separation technology was featured in NPR's All Things Considered, as part of an episode of All Tech Considered on artificial intelligence, "Can Computers Learn Like Humans?". An example separating the overlapped speech of two of the show's hosts was played on the air.
        The technology is based on a proprietary deep learning method called Deep Clustering. It is the world's first technology that separates in real time the simultaneous speech of multiple unknown speakers recorded with a single microphone. It is a key step towards building machines that can interact in noisy environments, in the same way that humans can have meaningful conversations in the presence of many other conversations.
        A live demonstration was featured in Mitsubishi Electric Corporation's Annual R&D Open House last year, and was also covered in international media at the time.

        (Photo credit: Sam Rowe for NPR)

        Links:
        "Can Computers Learn Like Humans?" (NPR, All Things Considered)
        MERL Deep Clustering Demo
    •  NEWS    MERL's breakthrough speech separation technology featured in Mitsubishi Electric Corporation's Annual R&D Open House
      Date: May 24, 2017
      Where: Tokyo, Japan
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • Mitsubishi Electric Corporation announced that it has created the world's first technology that separates in real time the simultaneous speech of multiple unknown speakers recorded with a single microphone. It is a key step towards building machines that can interact in noisy environments, in the same way that humans can have meaningful conversations in the presence of many other conversations. In tests, the simultaneous speech of two and three people was separated with up to 90 and 80 percent accuracy, respectively.

        The novel technology, realized with Mitsubishi Electric's proprietary "Deep Clustering" method based on artificial intelligence (AI), is expected to contribute to more intelligible voice communications and more accurate automatic speech recognition. A characteristic feature of this approach is its versatility: voices can be separated regardless of the speakers' language or gender.

        A live speech separation demonstration that took place on May 24 in Tokyo, Japan, was widely covered by the Japanese media, with reports by three of the main Japanese TV stations and multiple articles in print and online newspapers. The technology is based on recent research by MERL's Speech and Audio team.
        Links:
        Mitsubishi Electric Corporation Press Release
        MERL Deep Clustering Demo

        Media Coverage:

        Fuji TV, News, "Minna no Mirai" (Japanese)
        The Nikkei (Japanese)
        Nikkei Technology Online (Japanese)
        Sankei Biz (Japanese)
        EE Times Japan (Japanese)
        ITpro (Japanese)
        Nikkan Sports (Japanese)
        Nikkan Kogyo Shimbun (Japanese)
        Dempa Shimbun (Japanese)
        Il Sole 24 Ore (Italian)
        IEEE Spectrum (English)
    •  EVENT    John Hershey Invited to Speak at Deep Learning Summit 2016 in Boston
      Date: Thursday, May 12, 2016 - Friday, May 13, 2016
      Location: Deep Learning Summit, Boston, MA
      Research Area: Speech & Audio
      Brief
      • MERL Speech and Audio Senior Team Leader John Hershey is among the high-profile researchers invited to speak at the Deep Learning Summit 2016 in Boston on May 12-13, 2016. John will present the team's groundbreaking work on general sound separation using a novel deep learning framework called Deep Clustering. For the first time, an artificial intelligence is able to crack the half-century-old "cocktail party problem", that is, to isolate the speech of a single person from a mixture of multiple unknown speakers, as humans do when having a conversation in a loud crowd.
    •  NEWS    MERL researchers present 12 papers at ICASSP 2016
      Date: March 20, 2016 - March 25, 2016
      Where: Shanghai, China
      MERL Contacts: Petros T. Boufounos; Chiori Hori; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Philip V. Orlik; Anthony Vetro
      Research Areas: Computational Sensing, Digital Video, Speech & Audio, Communications, Signal Processing
      Brief
      • MERL researchers presented 12 papers at the recent IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), held in Shanghai, China, from March 20-25, 2016. ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on research advances and the latest technological developments in signal and information processing, with more than 1200 papers presented and over 2000 participants.
    •  NEWS    John Hershey gives invited talk at Johns Hopkins University on MERL's "Deep Clustering" breakthrough
      Date: March 4, 2016
      Where: Johns Hopkins Center for Language and Speech Processing
      MERL Contact: Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • MERL researcher and Speech and Audio team leader John Hershey was invited by the Center for Language and Speech Processing at Johns Hopkins University to give a talk on MERL's breakthrough audio separation work, known as "Deep Clustering". The talk, entitled "Speech Separation by Deep Clustering: Towards Intelligent Audio Analysis and Understanding," was given on March 4, 2016.

        This work was conducted by MERL researchers John Hershey, Jonathan Le Roux, and Shinji Watanabe, together with MERL interns Zhuo Chen of Columbia University and Yusef Isik of Sabanci University.
  • Related Research Highlights