TR2022-025

DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning


    •  Shah, A.P., Hori, T., Le Roux, J., Hori, C., "DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning", February 2022.

      @book{Shah2022feb,
        author = {Shah, Ankit Parag and Hori, Takaaki and Le Roux, Jonathan and Hori, Chiori},
        title = {DSTC10-AVSD Submission System with Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning},
        year = 2022,
        month = feb,
        url = {https://www.merl.com/publications/TR2022-025}
      }
  • Research Areas: Artificial Intelligence, Computer Vision, Human-Computer Interaction, Speech & Audio

Abstract:

We participated in the third challenge for the Audio-Visual Scene-Aware Dialog (AVSD) task in DSTC10. The task was updated with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence in the video that supports each answer. A baseline system built on an AV-transformer was released along with the new DSTC10-AVSD dataset, which includes temporal-reasoning annotations. This paper introduces a new system that extends the baseline with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performance on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one based on attention and one based on a time-domain region proposal network (RPN). We confirmed that our system outperforms both the baseline system and the previous state of the art on the AVSD test sets for DSTC7, DSTC8, and DSTC10, and that the RPN-based temporal reasoning outperforms the attention-based method of the baseline system.
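
The report itself contains no code, but the joint student-teacher learning (JSTL) component lends itself to a short illustration. The PyTorch sketch below shows one common way such an objective is formulated: a cross-entropy term on the answer tokens plus a soft-target term that pulls the student (which has no access to the human-created description) toward a teacher that does. The names (jstl_loss, alpha, temperature) and the exact weighting are illustrative assumptions, not the formulation used in the submission.

    import torch
    import torch.nn.functional as F

    def jstl_loss(student_logits, teacher_logits, target_ids,
                  alpha=0.5, temperature=1.0, pad_id=0):
        # Hypothetical sketch of a joint student-teacher objective; the actual
        # loss used in the DSTC10 submission may differ.
        vocab = student_logits.size(-1)
        # Standard answer-generation loss for the student decoder.
        ce = F.cross_entropy(student_logits.view(-1, vocab),
                             target_ids.view(-1), ignore_index=pad_id)
        # Soft-target matching between the student distribution and the
        # (frozen) teacher distribution, scaled as in standard distillation.
        kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                      F.softmax(teacher_logits.detach() / temperature, dim=-1),
                      reduction="batchmean") * (temperature ** 2)
        return (1.0 - alpha) * ce + alpha * kl

In such a setup, the teacher would be run with the description as an additional input and the student without it, and only the student parameters would be updated through this loss; this is a generic distillation-style reading of JSTL, offered only as background for the abstract above.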