TR2019-102

Vectorized Beam Search for CTC-Attention-based Speech Recognition


    •  Seki, H., Hori, T., Watanabe, S., Moritz, N., Le Roux, J., "Vectorized Beam Search for CTC-Attention-based Speech Recognition", Interspeech, DOI: 10.21437/Interspeech.2019-2860, September 2019, pp. 3825-3829.
      @inproceedings{Seki2019sep2,
        author = {Seki, Hiroshi and Hori, Takaaki and Watanabe, Shinji and Moritz, Niko and Le Roux, Jonathan},
        title = {Vectorized Beam Search for CTC-Attention-based Speech Recognition},
        booktitle = {Interspeech},
        year = 2019,
        pages = {3825--3829},
        month = sep,
        doi = {10.21437/Interspeech.2019-2860},
        url = {https://www.merl.com/publications/TR2019-102}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

This paper investigates efficient beam search techniques for end-to-end automatic speech recognition (ASR) with an attention-based encoder-decoder architecture. We accelerate the decoding process by vectorizing multiple hypotheses during the beam search, replacing the per-hypothesis score accumulation steps with vector-matrix operations over the vectorized hypotheses. This modification allows us to take advantage of the parallel computing capabilities of multi-core CPUs and GPUs, resulting in significant speedups and also enabling us to process multiple utterances simultaneously in a batch. Moreover, we extend the decoding method to incorporate recurrent neural network language model (RNNLM) and connectionist temporal classification (CTC) scores, which typically improve ASR accuracy but have not previously been investigated in combination with such parallelized decoding algorithms. Experiments on the LibriSpeech and Corpus of Spontaneous Japanese datasets demonstrate that the vectorized beam search achieves a 1.8x speedup on a CPU and a 33x speedup on a GPU compared with the original CPU implementation. When using joint CTC/attention decoding with an RNNLM, we also achieved an 11x speedup on a GPU while maintaining the benefits of CTC and RNNLM. With these benefits, we achieved near-real-time processing with a small latency of 0.1x real-time without a streaming search process.
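
The core idea in the abstract, replacing per-hypothesis score accumulation with array operations over all hypotheses at once, can be illustrated with a minimal NumPy sketch of a single beam-search step. This is not the authors' implementation: the function names `beam_step_loop` and `beam_step_vectorized` are hypothetical, the per-token log-probabilities are random stand-ins for decoder output, and CTC and RNNLM scores are omitted (one would presumably add them as further weighted terms to `log_probs` before the top-k selection).

```python
import numpy as np


def beam_step_loop(scores, log_probs, beam):
    """Reference version: expand each hypothesis one at a time."""
    candidates = []
    for h, (s, row) in enumerate(zip(scores, log_probs)):
        for v, lp in enumerate(row):
            # Accumulate the score of hypothesis h extended by token v.
            candidates.append((s + lp, h, v))
    candidates.sort(key=lambda c: -c[0])
    return candidates[:beam]


def beam_step_vectorized(scores, log_probs, beam):
    """Vectorized version: expand all hypotheses with array operations.

    scores:    (B,)   accumulated log-score of each active hypothesis
    log_probs: (B, V) next-token log-probabilities for each hypothesis
    Returns the top `beam` scores, their parent hypothesis indices,
    and the token indices that extend them.
    """
    vocab = log_probs.shape[1]
    # Broadcasting adds each hypothesis score to its whole row at once.
    total = scores[:, None] + log_probs          # shape (B, V)
    flat = total.ravel()                         # shape (B * V,)
    # Select the `beam` best candidates, then order them best-first.
    top = np.argpartition(-flat, beam - 1)[:beam]
    top = top[np.argsort(-flat[top])]
    return flat[top], top // vocab, top % vocab


# Toy example: 3 active hypotheses, vocabulary of 5 tokens (made-up data).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
scores = np.array([-0.5, -1.2, -2.0])

best_scores, parents, tokens = beam_step_vectorized(scores, log_probs, beam=3)
```

Both functions select the same candidates; the vectorized form simply moves the double loop into one broadcasted addition and one top-k selection, which is the kind of operation that multi-core CPUs and GPUs parallelize well. Adding a leading utterance dimension to `scores` and `log_probs` would extend the same pattern to batches of utterances.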

 

  • Related News & Events

    •  NEWS    MERL Speech & Audio Researchers Presenting 7 Papers and a Tutorial at Interspeech 2019
      Date: September 15, 2019 - September 19, 2019
      Where: Graz, Austria
      MERL Contacts: Chiori Hori; Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      • MERL Speech & Audio Team researchers will be presenting 7 papers at the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), which is being held in Graz, Austria, from September 15-19, 2019. Topics to be presented include recent advances in end-to-end speech recognition, speech separation, and audio-visual scene-aware dialog. Takaaki Hori is also co-presenting a tutorial on end-to-end speech processing.

        Interspeech is the world's largest and most comprehensive conference on the science and technology of spoken language processing. It gathers around 2000 participants from all over the world.