Seamless Speech Recognition

A new multilingual speech recognition technology that simultaneously identifies the language spoken and recognizes the words.

Takaaki Hori, Jonathan Le Roux, Bret Harsham, Niko Moritz, Gordon Wichern
Joint work with: Hiroshi Seki (Toyohashi University of Technology)



This research tackles the challenging task of multilingual multi-speaker automatic speech recognition (ASR) using an all-in-one end-to-end system. Several multilingual ASR systems were recently proposed based on a monolithic neural network architecture without language-dependent modules, showing that modeling of multiple languages is well within the capabilities of an end-to-end framework. There has also been growing interest in multi-speaker speech recognition, which enables generation of multiple label sequences from single-channel mixed speech. In particular, a multi-speaker end-to-end ASR system that can directly model one-to-many mappings without additional auxiliary clues was recently proposed.
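
The multi-speaker systems in this line of work are trained with a permutation-free objective: the network produces one output stream per speaker in the mixture, and the training loss is computed under the assignment of output streams to reference transcripts that fits best, so no arbitrary speaker ordering is imposed. Below is a minimal sketch of this objective, assuming a generic per-pair sequence loss pair_loss as a stand-in for the CTC and attention losses used in the actual systems; it is an illustration, not MERL's implementation.

# Minimal sketch of a permutation-free training objective for multi-speaker
# end-to-end ASR. `pair_loss` is a hypothetical stand-in for the per-pair
# sequence loss (e.g., CTC or attention cross-entropy in the actual systems).
from itertools import permutations

def permutation_free_loss(outputs, references, pair_loss):
    """outputs: list of S per-speaker output streams;
    references: list of S reference label sequences;
    pair_loss(output, reference) -> float.
    Returns the total loss under the best stream-to-reference assignment."""
    num_speakers = len(references)
    assert len(outputs) == num_speakers
    best = float("inf")
    for perm in permutations(range(num_speakers)):
        total = sum(pair_loss(outputs[s], references[perm[s]])
                    for s in range(num_speakers))
        best = min(best, total)
    return best

if __name__ == "__main__":
    # Toy pair loss: per-position mismatches plus a length penalty.
    def toy_pair_loss(hyp, ref):
        return sum(h != r for h, r in zip(hyp, ref)) + abs(len(hyp) - len(ref))

    outputs = [list("hi"), list("bye")]
    references = [list("bye"), list("hi")]  # references given in swapped order
    print(permutation_free_loss(outputs, references, toy_pair_loss))  # -> 0

Because the minimum runs over all possible assignments, this exact formulation is only practical for small numbers of simultaneous speakers, which is the regime considered in this line of work.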

We propose an unprecedented all-in-one end-to-end multilingual multi-speaker ASR system that integrates these end-to-end approaches to multilingual and multi-speaker ASR. Such a system can provide a seamless ASR experience, in particular improving the accessibility of speech interfaces that serve a diverse set of users.
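
On the multilingual side, joint language identification and recognition is obtained by augmenting the character-level target sequence with language-ID symbols, so that a single network predicts both the languages and the transcription, including code-switching, without any prior language setting; in the integrated system, each speaker's output stream carries such augmented labels. The following is a minimal sketch of this label augmentation, assuming hypothetical tag strings such as "[EN]" and "[JA]" rather than the exact symbol inventory used in the cited papers.

# Minimal sketch of language-augmented target labels for joint language
# identification and recognition. The tag strings "[EN]" and "[JA]" are
# hypothetical placeholders, not the symbols used in the cited papers.
def augment_with_language_tags(segments):
    """segments: list of (language_tag, text) pairs in spoken order.
    Returns a single character-level label sequence in which a language-ID
    symbol is emitted at every language switch."""
    labels = []
    previous_tag = None
    for tag, text in segments:
        if tag != previous_tag:      # tag only where the language changes
            labels.append(tag)
            previous_tag = tag
        labels.extend(text)          # character-level targets
    return labels

if __name__ == "__main__":
    # A code-switched utterance, e.g., a mixed Japanese/English airport query.
    utterance = [("[JA]", "搭乗口は"), ("[EN]", "gate 12"), ("[JA]", "ですか")]
    print(augment_with_language_tags(utterance))
    # ['[JA]', '搭', '乗', '口', 'は', '[EN]', 'g', 'a', 't', 'e', ' ', '1', '2', '[JA]', 'で', 'す', 'か']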

As an example of a potential application, we developed a live demonstration of a multilingual guidance system for an airport. The system realizes a speech interface that can guide users to various locations within an airport. It can recognize multilingual speech with code-switching, as well as simultaneous speech by multiple speakers in different languages, without any prior language setting, and it provides the appropriate guidance for each query in the corresponding language. The demonstration was presented during a press event in February 2019 in Tokyo, Japan. It was widely covered by the Japanese media, with reports by all six main Japanese TV stations and multiple articles in print and online newspapers, including Japan's top newspaper, the Asahi Shimbun.


MERL Publications

  •  Seki, H., Hori, T., Watanabe, S., Le Roux, J., Hershey, J.R., "A Purely End-to-end System for Multi-speaker Speech Recognition", arXiv, July 2018.
     @article{Seki2018jul2,
       author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan and Hershey, John},
       title = {A Purely End-to-end System for Multi-speaker Speech Recognition},
       journal = {arXiv},
       year = 2018,
       month = jul,
       url = {https://arxiv.org/abs/1805.05826}
     }
  •  Seki, H., Watanabe, S., Hori, T., Le Roux, J., Hershey, J.R., "An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP.2018.8462180, April 2018, pp. 4919-4923.
     @inproceedings{Seki2018apr,
       author = {Seki, Hiroshi and Watanabe, Shinji and Hori, Takaaki and Le Roux, Jonathan and Hershey, John R.},
       title = {An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech},
       booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
       year = 2018,
       pages = {4919--4923},
       month = apr,
       doi = {10.1109/ICASSP.2018.8462180},
       url = {https://www.merl.com/publications/TR2018-002}
     }
  •  Settle, S., Le Roux, J., Hori, T., Watanabe, S., Hershey, J.R., "End-to-End Multi-Speaker Speech Recognition", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP.2018.8461893, April 2018, pp. 4819-4823.
     @inproceedings{Settle2018apr,
       author = {Settle, Shane and Le Roux, Jonathan and Hori, Takaaki and Watanabe, Shinji and Hershey, John R.},
       title = {End-to-End Multi-Speaker Speech Recognition},
       booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
       year = 2018,
       pages = {4819--4823},
       month = apr,
       doi = {10.1109/ICASSP.2018.8461893},
       url = {https://www.merl.com/publications/TR2018-001}
     }
  •  Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T., "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition", IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2017.2763455, Vol. 11, No. 8, pp. 1240-1253, October 2017.
     @article{Watanabe2017oct,
       author = {Watanabe, Shinji and Hori, Takaaki and Kim, Suyoun and Hershey, John R. and Hayashi, Tomoki},
       title = {Hybrid CTC/Attention Architecture for End-to-End Speech Recognition},
       journal = {IEEE Journal of Selected Topics in Signal Processing},
       year = 2017,
       volume = 11,
       number = 8,
       pages = {1240--1253},
       month = oct,
       doi = {10.1109/JSTSP.2017.2763455},
       issn = {1941-0484},
       url = {https://www.merl.com/publications/TR2017-190}
     }
  •  Watanabe, S., Hori, T., Hershey, J.R., "Language Independent End-to-End Architecture For Joint Language and Speech Recognition", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), DOI: 10.1109/ASRU.2017.8268945, December 2017.
     @inproceedings{Watanabe2017dec,
       author = {Watanabe, Shinji and Hori, Takaaki and Hershey, John R.},
       title = {Language Independent End-to-End Architecture For Joint Language and Speech Recognition},
       booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
       year = 2017,
       month = dec,
       doi = {10.1109/ASRU.2017.8268945},
       url = {https://www.merl.com/publications/TR2017-182}
     }
  •  Hori, T., Watanabe, S., Zhang, Y., Chan, W., "Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM", Interspeech, August 2017.
     @inproceedings{Hori2017aug,
       author = {Hori, Takaaki and Watanabe, Shinji and Zhang, Yu and Chan, William},
       title = {Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM},
       booktitle = {Interspeech},
       year = 2017,
       month = aug,
       url = {https://www.merl.com/publications/TR2017-132}
     }
  •  Hori, T., Watanabe, S., Hershey, J.R., "Joint CTC/attention decoding for end-to-end speech recognition", Association for Computational Linguistics (ACL), DOI: 10.18653/v1/P17-1048, July 2017, pp. 518-529.
     @inproceedings{Hori2017jul,
       author = {Hori, Takaaki and Watanabe, Shinji and Hershey, John R.},
       title = {Joint CTC/attention decoding for end-to-end speech recognition},
       booktitle = {Association for Computational Linguistics (ACL)},
       year = 2017,
       pages = {518--529},
       month = jul,
       doi = {10.18653/v1/P17-1048},
       url = {https://www.merl.com/publications/TR2017-103}
     }
  •  Kim, S., Hori, T., Watanabe, S., "Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-task Learning", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2017.
     @inproceedings{Kim2017mar,
       author = {Kim, Suyoun and Hori, Takaaki and Watanabe, Shinji},
       title = {Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-task Learning},
       booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
       year = 2017,
       month = mar,
       url = {https://www.merl.com/publications/TR2017-016}
     }