Sequence Transduction with Graph-based Supervision


The recurrent neural network transducer (RNN-T) objective plays a major role in building today’s best automatic speech recognition (ASR) systems for production. Like the connectionist temporal classification (CTC) objective, the RNN-T loss uses specific rules that define how a set of alignments is generated to form a lattice for full-sum training. However, it remains largely unknown whether these rules are optimal and lead to the best possible ASR results. In this work, we present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels, thus providing a flexible and efficient framework to manipulate training lattices, e.g., for studying different transition rules, implementing different transducer losses, or restricting alignments. We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T, while also ensuring a strictly monotonic alignment, which allows the decoding procedure to be better optimized. For example, the proposed CTC-like transducer achieves a relative improvement of 4.8% on the test-other condition of LibriSpeech compared to an equivalent RNN-T based system.
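As a rough illustration of the kind of label graph involved, the sketch below constructs the standard CTC topology for a label sequence as a plain list of arcs in pure Python. This is only a hedged example of a "CTC-like" supervision graph (blank self-loops, forward transitions, and skips between distinct labels); the representation, function names, and symbol conventions are illustrative assumptions, not the paper's actual implementation.

```python
BLANK = 0  # assumed conventional id for the blank symbol

def ctc_graph(labels):
    """Return (num_states, arcs) for the standard CTC topology.

    States alternate blank / label positions: for L labels there are
    2L + 1 states. Each state carries a self-loop (emit the same symbol
    again), each state advances to the next position, and a blank state
    between two *distinct* labels may be skipped.
    """
    num_states = 2 * len(labels) + 1
    arcs = []  # each arc is (src_state, dst_state, emitted_symbol)
    for s in range(num_states):
        sym = BLANK if s % 2 == 0 else labels[s // 2]
        arcs.append((s, s, sym))  # self-loop: repeat symbol on next frame
        if s + 1 < num_states:
            nxt = BLANK if (s + 1) % 2 == 0 else labels[(s + 1) // 2]
            arcs.append((s, s + 1, nxt))  # advance one position
        # skip the intermediate blank when consecutive labels differ
        if s % 2 == 1 and s + 2 < num_states and labels[s // 2] != labels[s // 2 + 1]:
            arcs.append((s, s + 2, labels[s // 2 + 1]))
    return num_states, arcs

# e.g. hypothetical token ids for a sequence with a repeated label
num_states, arcs = ctc_graph([3, 3, 5])
```

Because every arc either stays in place or moves forward, any path through such a graph is strictly monotonic in the label positions, which is the property the abstract highlights for the CTC-like transducer.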


  • Related Publication

  •  Moritz, N., Hori, T., Watanabe, S., Le Roux, J., "Sequence Transduction with Graph-based Supervision", arXiv, DOI: 10.48550/arXiv.2111.01272, December 2021.
    BibTeX

    @article{Moritz2021dec,
      author = {Moritz, Niko and Hori, Takaaki and Watanabe, Shinji and Le Roux, Jonathan},
      title = {Sequence Transduction with Graph-based Supervision},
      journal = {arXiv},
      year = 2021,
      month = dec,
      doi = {10.48550/arXiv.2111.01272},
      url = {}
    }