TR2023-068

BEATs-based Audio Captioning Model with Instructor Embedding Supervision and ChatGPT Mix-up


    •  Wu, S.-L., Chang, X., Wichern, G., Jung, J.-W., Germain, F., Le Roux, J., Watanabe, S., "BEATs-based Audio Captioning Model with Instructor Embedding Supervision and ChatGPT Mix-up," Tech. Rep. TR2023-068, DCASE2023 Challenge, May 2023.
      BibTeX:
      @techreport{Wu2023may,
        author = {Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Francois and Le Roux, Jonathan and Watanabe, Shinji},
        title = {BEATs-based Audio Captioning Model with Instructor Embedding Supervision and ChatGPT Mix-up},
        institution = {DCASE2023 Challenge},
        year = 2023,
        month = may,
        url = {https://www.merl.com/publications/TR2023-068}
      }
  • MERL Contacts: François Germain; Jonathan Le Roux; Gordon Wichern
  • Research Areas: Artificial Intelligence, Speech & Audio

Abstract:

DCASE 2023 Task 6A, automated audio captioning (AAC), aims at generating informative descriptions for various sounds from nature and/or human activities. Our AAC system follows the sequence-to-sequence (seq2seq) architecture. The audio encoder stack is comprised of a frozen BEATs Transformer followed by a 2-layer Conformer. The BEATs module, which has been pretrained on both masked audio token prediction and audio event classification, extracts fine-grained (i.e., ≈ 50 Hz) audio features, while the Conformer downsamples and summarizes the audio features before they are cross-attended by the BART text decoder. Besides the autoregressive negative log-likelihood (NLL) loss computed on decoder outputs, we simultaneously apply an audio-text contrastive loss on our encoder output to infuse language modality knowledge into it. Specifically, we feed ground-truth captions into INSTRUCTOR Transformer, a state-of-the-art text embedding model, and teach our audio encoder to predict the INSTRUCTOR text embeddings through InfoNCE loss. In addition, we leverage ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of our training data. During inference, we employ nucleus sampling and a hybrid reranking algorithm that considers both likelihood and audio-caption representation similarity. Combining our efforts, our best single model and ensemble system achieve 0.326 and 0.336 SPIDEr-FL scores, respectively, on the Clotho (V2) evaluation split.
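The following is a minimal PyTorch sketch (not the authors' released code) of the encoder-side supervision described in the abstract: a pooled audio-encoder embedding is projected and trained to match the INSTRUCTOR embedding of the ground-truth caption via an InfoNCE loss, which is added to the decoder's NLL loss. The mean-pooling, projection size, temperature, and loss weight are illustrative assumptions; the report's exact choices are not stated here.

    import torch
    import torch.nn.functional as F

    def info_nce(audio_emb, text_emb, temperature=0.07):
        """InfoNCE over a batch: each clip's audio embedding should match its own caption embedding."""
        audio_emb = F.normalize(audio_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
        # symmetric cross-entropy: audio-to-text and text-to-audio directions
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def total_loss(encoder_out, instructor_emb, nll_loss, proj, alpha=1.0):
        """
        encoder_out    : (B, T, D) features from the frozen BEATs + Conformer stack
        instructor_emb : (B, 768) INSTRUCTOR embeddings of the ground-truth captions
        nll_loss       : autoregressive NLL from the BART decoder
        proj           : e.g. torch.nn.Linear(D, 768), mapping audio features to the text space
        alpha          : contrastive loss weight (placeholder value)
        """
        pooled = encoder_out.mean(dim=1)      # simple mean-pooling over time (assumption)
        audio_emb = proj(pooled)
        return nll_loss + alpha * info_nce(audio_emb, instructor_emb)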

 
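The inference-time procedure mentioned in the abstract can be pictured with a similarly hedged sketch: nucleus sampling produces several candidate captions, and each candidate is scored by a weighted combination of its length-normalized decoder log-likelihood and the cosine similarity between the audio embedding and the candidate's text embedding. The mixing weight and normalization choices below are placeholders, not the report's exact recipe.

    import torch
    import torch.nn.functional as F

    def rerank(candidates, log_likelihoods, audio_emb, caption_embs, lam=0.5):
        """
        candidates      : list of N caption strings drawn with nucleus sampling
        log_likelihoods : (N,) length-normalized decoder log-probabilities
        audio_emb       : (D,) pooled audio-encoder embedding of the clip
        caption_embs    : (N, D) text embeddings of the candidates (e.g., from INSTRUCTOR)
        """
        sim = F.cosine_similarity(caption_embs, audio_emb.unsqueeze(0), dim=-1)  # (N,)
        scores = lam * log_likelihoods + (1.0 - lam) * sim   # hybrid likelihood/similarity score
        return candidates[int(torch.argmax(scores))]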

  • Related News & Events

    •  AWARD    Joint CMU-MERL team wins DCASE2023 Challenge on Automated Audio Captioning
      Date: June 1, 2023
      Awarded to: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, Francois Germain, Jonathan Le Roux, Shinji Watanabe
      MERL Contacts: François Germain; Jonathan Le Roux; Gordon Wichern
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • A joint team consisting of members of CMU Professor and MERL alumnus Shinji Watanabe's WavLab and members of MERL's Speech & Audio team ranked 1st out of 11 teams in the DCASE2023 Challenge's Task 6A "Automated Audio Captioning". The team was led by student Shih-Lun Wu and also featured Ph.D. candidate Xuankai Chang, postdoctoral research associate Jee-weon Jung, Prof. Shinji Watanabe, and MERL researchers Gordon Wichern, Francois Germain, and Jonathan Le Roux.

        The IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE Challenge), started in 2013, has been organized yearly since 2016, and gathers challenges on multiple tasks related to the detection, analysis, and generation of sound events. This year, the DCASE2023 Challenge received over 428 submissions from 123 teams across seven tasks.

        The CMU-MERL team competed in the Task 6A track, Automated Audio Captioning, which aims at generating informative descriptions for various sounds from nature and/or human activities. The team's system made strong use of large pretrained models, namely a BEATs transformer as part of the audio encoder stack, an Instructor Transformer encoding ground-truth captions to derive an audio-text contrastive loss on the audio encoder, and ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mixtures, increase not only the amount but also the complexity and diversity of the training data (a sketch of this augmentation appears below). The team's best submission obtained a SPIDEr-FL score of 0.327 on the hidden test set, largely outperforming the second-best team's 0.315.
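        As a rough illustration of the ChatGPT mix-up augmentation referenced above, the sketch below mixes two waveforms with a simple energy-scaled sum and asks ChatGPT to merge the two corresponding captions into one grammatical, compact sentence. The prompt wording, the SNR-based mixing scheme, and the use of the pre-1.0 openai Python client are assumptions for illustration; the report does not publish its exact prompt or mixing procedure.

        import numpy as np
        import openai  # pre-1.0 style client (2023-era); assumes openai.api_key has been set

        def mix_audio(wav1, wav2, snr_db=0.0):
            """Energy-scaled sum of two mono waveforms (illustrative mixing scheme)."""
            n = min(len(wav1), len(wav2))
            wav1, wav2 = wav1[:n], wav2[:n]
            e1 = np.sum(wav1 ** 2) + 1e-8
            e2 = np.sum(wav2 ** 2) + 1e-8
            gain = np.sqrt(e1 / (e2 * 10 ** (snr_db / 10)))  # scale wav2 to the target SNR
            return wav1 + gain * wav2

        def mix_captions(cap1, cap2):
            """Ask ChatGPT to combine two captions into one compact sentence (hypothetical prompt)."""
            prompt = (
                "Combine these two audio captions into one short, grammatical sentence "
                f"describing both sounds occurring together:\n1. {cap1}\n2. {cap2}"
            )
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp["choices"][0]["message"]["content"].strip()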