Low-Latency Streaming Scene-aware Interaction Using Audio-Visual Transformers


To apply scene-aware interaction technology to real-time dialog systems, we propose an online low-latency response generation framework for scene-aware interaction using a video question answering setup. This paper extends our prior work on low-latency video captioning into a novel approach that optimizes the timing of answer generation under a trade-off between generation latency and answer quality. For video QA, the timing detector is now in charge of finding the timing of the question-relevant event, instead of determining when the system has seen enough to generate a general caption as in the video captioning case. Our audio-visual scene-aware dialog system built for the 10th Dialog System Technology Challenge was extended with this low-latency function. Experiments on the MSRVTT-QA and AVSD datasets show that our approach achieves between 97% and 99% of the answer quality of the upper bound given by a pre-trained Transformer using the entire video clips, while using less than 40% of the frames from the beginning of each clip.
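The latency/quality trade-off described above can be sketched as an early-exit loop over streamed frames: a timing detector scores whether enough question-relevant evidence has been seen, and the answer is generated as soon as that score clears a threshold rather than after the whole clip. This is a minimal illustrative sketch, not the paper's actual architecture; the function names (`score_frames`, `generate_answer`) and the threshold value are assumptions.

```python
def answer_when_ready(frames, question, score_frames, generate_answer,
                      threshold=0.9):
    """Return (answer, frames_used) for a streamed video.

    score_frames(seen, question) -> confidence in [0, 1] that the
    question-relevant event has been observed so far (a stand-in for
    the timing detector); generate_answer(seen, question) -> str
    (a stand-in for the answer generator).
    """
    seen = []
    for frame in frames:
        seen.append(frame)
        # Timing detector: decide whether enough evidence has arrived.
        if score_frames(seen, question) >= threshold:
            break  # stop early: save latency, keep (most of) the quality
    return generate_answer(seen, question), len(seen)
```

With a detector whose confidence grows as frames accumulate, the loop exits after only a fraction of the clip, mirroring the reported behavior of answering from under 40% of the frames.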