TR2024-073

Exploring Keyword Enrollment for Japanese End-to-End Automatic Speech Recognition using Contextual Biasing


    •  Mitsui, Y., Aihara, R., Hori, T., Le Roux, J., Taguchi, S., "Exploring Keyword Enrollment for Japanese End-to-End Automatic Speech Recognition using Contextual Biasing", OTOGAKU Symposium, June 2024.
      @inproceedings{Mitsui2024jun,
        author = {Mitsui, Yoshiki and Aihara, Ryo and Hori, Takaaki and Le Roux, Jonathan and Taguchi, Shinya},
        title = {Exploring Keyword Enrollment for Japanese End-to-End Automatic Speech Recognition using Contextual Biasing},
        booktitle = {OTOGAKU Symposium},
        year = 2024,
        month = jun,
        url = {https://www.merl.com/publications/TR2024-073}
      }
  • Research Area: Speech & Audio

Abstract:

End-to-end (E2E) automatic speech recognition (ASR), which has emerged with the development of deep learning, generally outperforms conventional modular ASR methods. However, E2E ASR has the drawback that it is difficult to enroll keywords for specific domains, which was easy in conventional ASR. Contextual biasing has been proposed as a keyword enrollment method for E2E ASR, but for Japanese ASR its performance is insufficient when the enrolled keywords do not appear in the training data. To overcome this problem, we propose an updated keyword enrollment method that uses phonetic letter notations such as katakana or hiragana to recognize enrolled keywords and converts them back to their original notations in a postprocessing step. Additionally, we propose an improved E2E ASR model training method that strengthens the connection between acoustic features obtained from input speech and phonetic letter notations by replacing some words in their original notation with their phonetic letter notation. With the proposed methods, we observed higher keyword enrollment performance for keywords longer than five moras.
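
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the enrollment table, the function names `restore_notation` and `phonetic_augment`, the replacement rate, and the use of plain string replacement are all assumptions made for illustration, and the ASR model and contextual biasing itself are out of scope here.

```python
import random

# Hypothetical enrollment table: original notation -> katakana reading.
# The entries are illustrative only.
ENROLLED = {"三菱電機": "ミツビシデンキ", "音学": "オトガク"}


def restore_notation(hypothesis: str, enrolled: dict) -> str:
    """Postprocessing: map katakana readings in an ASR hypothesis back to
    the original notation of the enrolled keywords."""
    # Replace longer readings first so a short reading cannot overwrite
    # part of a longer one.
    pairs = sorted(enrolled.items(), key=lambda kv: len(kv[1]), reverse=True)
    for original, reading in pairs:
        hypothesis = hypothesis.replace(reading, original)
    return hypothesis


def phonetic_augment(words, readings, rate, rng):
    """Training-text augmentation: swap each word that has a known reading
    for that reading with probability `rate` (an assumed hyperparameter)."""
    return [
        readings[w] if w in readings and rng.random() < rate else w
        for w in words
    ]


if __name__ == "__main__":
    print(restore_notation("ミツビシデンキのオトガク", ENROLLED))
    print(phonetic_augment(["三菱電機", "の", "製品"], ENROLLED, 0.5, random.Random(0)))
```

Plain substring replacement is the simplest possible postprocessing choice. It would also rewrite a reading that happens to occur inside an unrelated word, so a practical system would presumably need word boundaries or biasing-aware alignment.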