TR2025-097

Single- and Multi-Channel Speech Enhancement and Separation for Far-Field Conversation Recognition


    •  Masuyama, Y., "Single- and Multi-Channel Speech Enhancement and Separation for Far-Field Conversation Recognition," Tech. Rep. TR2025-097, Jelinek Summer Workshop on Speech and Language Technology (JSALT), June 2025.
      @techreport{Masuyama2025jun,
          author = {{Masuyama, Yoshiki}},
          title = {{Single- and Multi-Channel Speech Enhancement and Separation for Far-Field Conversation Recognition}},
          institution = {Jelinek Summer Workshop on Speech and Language Technology (JSALT)},
          year = 2025,
          month = jun,
          url = {https://www.merl.com/publications/TR2025-097}
      }
  • Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

While automatic speech recognition (ASR) achieves superhuman performance on clean benchmarks, it struggles in real-world scenarios such as meeting transcription, where word error rates exceed 35% compared with under 3% on clean data. This lecture examines the challenges of robust ASR for conversational speech, including noise, reverberation, multiple speakers, and overlapped speech (more than 15% of meeting duration). The lecture covers evaluation methodologies for long-form multi-speaker audio, including the concatenated minimum-permutation word error rate (cpWER), and surveys key datasets from AMI to current benchmarks such as CHiME-7/8 and NOTSOFAR1. Technical approaches are categorized into front-end methods (speech separation, beamforming, target speaker extraction) and back-end methods (self-supervised features, serialized output training, target-speaker ASR). Robust ASR remains an active research area with significant opportunities, particularly as large language models enable new applications such as automated meeting summarization. Key challenges include speaker tracking, training-inference mismatches, and integrating speech separation, diarization, and recognition components.
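
For illustration, here is a minimal sketch of how cpWER can be computed: each speaker's utterances are first concatenated into a single word stream per speaker, and the word error rate is then taken under the assignment of hypothesis speakers to reference speakers that minimizes the total word-level edit distance. The function names and the toy transcripts below are hypothetical, and the sketch assumes equal numbers of reference and hypothesis speakers with a brute-force permutation search; practical metric toolkits handle unequal speaker counts and use optimal assignment algorithms instead.

    from itertools import permutations

    def word_edit_distance(ref_words, hyp_words):
        # Word-level Levenshtein distance (substitutions, insertions,
        # deletions) computed with a rolling one-row DP table.
        m, n = len(ref_words), len(hyp_words)
        d = list(range(n + 1))
        for i in range(1, m + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                cur = d[j]
                cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
                d[j] = min(d[j] + 1,      # deletion
                           d[j - 1] + 1,  # insertion
                           prev + cost)   # substitution or match
                prev = cur
        return d[n]

    def cpwer(ref_streams, hyp_streams):
        # ref_streams / hyp_streams: dict mapping speaker id -> transcript,
        # where each transcript is the time-ordered concatenation of that
        # speaker's utterances. Assumes equally many speakers on both sides.
        refs = [r.split() for r in ref_streams.values()]
        hyps = [h.split() for h in hyp_streams.values()]
        total_ref_words = sum(len(r) for r in refs)
        min_errors = min(
            sum(word_edit_distance(r, h) for r, h in zip(refs, perm))
            for perm in permutations(hyps)
        )
        return min_errors / total_ref_words

    # Toy example: the hypothesis speaker labels are swapped relative to
    # the reference, but cpWER is permutation-invariant, so only the single
    # word error ("thank" for "thanks") counts: 1 error / 6 reference words.
    ref = {"spk1": "hello how are you", "spk2": "fine thanks"}
    hyp = {"A": "fine thank", "B": "hello how are you"}
    print(f"cpWER = {cpwer(ref, hyp):.3f}")  # prints cpWER = 0.167

The brute-force search over speaker permutations is factorial in the number of speakers, which is acceptable for the handful of speakers in a typical meeting but is replaced by an optimal assignment solver in production metric implementations.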