TR2022-021

Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR

- Chang, X., Moritz, N., Hori, T., Watanabe, S., Le Roux, J., "Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP43922.2022.9747375, April 2022, pp. 7322-7326.
  BibTeX:
  @inproceedings{Chang2022apr,
  author = {Chang, Xuankai and Moritz, Niko and Hori, Takaaki and Watanabe, Shinji and {Le Roux}, Jonathan},
  title = {{Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR}},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = 2022,
  pages = {7322--7326},
  month = apr,
  publisher = {IEEE},
  doi = {10.1109/ICASSP43922.2022.9747375},
  url = {https://www.merl.com/publications/TR2022-021}
  }
MERL Contact:
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Graph-based temporal classification (GTC), a generalized form of the connectionist temporal classification loss, was recently proposed to improve automatic speech recognition (ASR) systems using graph-based supervision. For example, GTC was first used to encode an N-best list of pseudo-label sequences into a graph for semi-supervised learning. In this paper, we propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network, which can be applied to a wider range of tasks. As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task. The transcriptions and speaker information of multi-speaker speech are represented by a graph, where the speaker information is associated with the transitions and ASR outputs with the nodes. Using GTC-e, multi-speaker ASR modeling becomes very similar to single-speaker ASR modeling, in that tokens from multiple speakers are recognized as a single merged sequence in chronological order. For evaluation, we perform experiments on a simulated multi-speaker speech dataset derived from LibriSpeech, obtaining promising results with performance close to classical benchmarks for the task.
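
To make the "single merged sequence in chronological order" idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each token comes with a hypothetical start time, and that each transition label is the speaker of the token it leads into; both are assumptions made for illustration only. The names `TimedToken`, `merge_chronologically`, and `node_and_transition_labels` are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TimedToken:
    token: str      # ASR output unit (would label a graph node)
    start: float    # hypothetical start time in seconds (assumed available)
    speaker: int    # speaker index (would label the incoming transition)


def merge_chronologically(
    utterances: List[List[Tuple[str, float]]]
) -> List[TimedToken]:
    """Merge per-speaker (token, start_time) lists into one time-ordered list.

    The position of each inner list in `utterances` is used as the speaker index.
    """
    merged = [
        TimedToken(tok, start, spk)
        for spk, utt in enumerate(utterances)
        for tok, start in utt
    ]
    # Python's sort is stable, so tokens with equal start times keep speaker order.
    merged.sort(key=lambda t: t.start)
    return merged


def node_and_transition_labels(
    merged: List[TimedToken],
) -> Tuple[List[str], List[int]]:
    """Split the merged sequence into node labels (tokens) and
    transition labels (speaker of the token each transition leads into)."""
    nodes = [t.token for t in merged]
    transitions = [t.speaker for t in merged]
    return nodes, transitions


# Toy two-speaker mixture with made-up timings.
utterances = [
    [("hello", 0.0), ("world", 0.9)],   # speaker 0
    [("good", 0.4), ("day", 1.3)],      # speaker 1
]
nodes, transitions = node_and_transition_labels(merge_chronologically(utterances))
print(nodes)        # ['hello', 'good', 'world', 'day']
print(transitions)  # [0, 1, 0, 1]
```

In this toy view, the token sequence reads like a single-speaker transcript, while the speaker identity is carried separately alongside each step, which mirrors the node/transition split described in the abstract.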

Related News & Events

NEWS MERL presenting 8 papers at ICASSP 2022
Date: May 22, 2022 - May 27, 2022
Where: Singapore
MERL Contacts: Anoop Cherian; Chiori Hori; Toshiaki Koike-Akino; Jonathan Le Roux; Tim K. Marks; Philip V. Orlik; Kuan-Chuan Peng; Pu (Perry) Wang; Gordon Wichern
Research Areas: Artificial Intelligence, Computer Vision, Signal Processing, Speech & Audio
Brief
- MERL researchers are presenting 8 papers at the IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which is being held in Singapore from May 22-27, 2022. A week of virtual presentations also took place earlier this month.
  
  Topics to be presented include recent advances in speech recognition, audio processing, scene understanding, computational sensing, and classification.
  
  ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year.

Related Publication

Chang, X., Moritz, N., Hori, T., Watanabe, S., Le Roux, J., "Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR", arXiv, DOI: 10.48550/arXiv.2203.00232, March 2022.

BibTeX:

@article{Chang2022mar,
author = {Chang, Xuankai and Moritz, Niko and Hori, Takaaki and Watanabe, Shinji and {Le Roux}, Jonathan},
title = {{Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR}},
journal = {arXiv},
year = 2022,
month = mar,
doi = {10.48550/arXiv.2203.00232},
url = {https://arxiv.org/abs/2203.00232}
}

MERL Contact:

Jonathan Le Roux
