Sameer Khurana

  • Biography

    Sameer's research interests include multimodal, transfer, and self-supervised learning applied to the speech and audio domains. He conducted his Ph.D. research in the Spoken Language Systems Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), where he developed transfer learning methods for spoken language processing applications.

  • Recent News & Events

    •  EVENT    MERL Contributes to ICASSP 2024
      Date: Sunday, April 14, 2024 - Friday, April 19, 2024
      Location: Seoul, South Korea
      MERL Contacts: Petros T. Boufounos; François Germain; Chiori Hori; Sameer Khurana; Toshiaki Koike-Akino; Jonathan Le Roux; Hassan Mansour; Kieran Parsons; Joshua Rapp; Anthony Vetro; Pu (Perry) Wang; Gordon Wichern
      Research Areas: Artificial Intelligence, Computational Sensing, Machine Learning, Robotics, Signal Processing, Speech & Audio
      Brief
      • MERL has made numerous contributions to both the organization and technical program of ICASSP 2024, which is being held in Seoul, Korea from April 14-19, 2024.

        Sponsorship and Awards

        MERL is proud to be a Bronze Patron of the conference and will participate in the student job fair on Thursday, April 18. Please join this session to learn more about employment opportunities at MERL, including openings for research scientists, post-docs, and interns.

        MERL is pleased to be the sponsor of two IEEE Awards that will be presented at the conference. We congratulate Prof. Stéphane G. Mallat, the recipient of the 2024 IEEE Fourier Award for Signal Processing, and Prof. Keiichi Tokuda, the recipient of the 2024 IEEE James L. Flanagan Speech and Audio Processing Award.

        Jonathan Le Roux, MERL Speech and Audio Senior Team Leader, will also be recognized during the Awards Ceremony for his recent elevation to IEEE Fellow.

        Technical Program

        MERL will present 13 papers in the main conference on a wide range of topics including automated audio captioning, speech separation, audio generative models, speech and sound synthesis, spatial audio reproduction, multimodal indoor monitoring, radar imaging, depth estimation, physics-informed machine learning, and integrated sensing and communications (ISAC). Three workshop papers have also been accepted for presentation on audio-visual speaker diarization, music source separation, and music generative models.

        Perry Wang is the co-organizer of the Workshop on Signal Processing and Machine Learning Advances in Automotive Radars (SPLAR), held on Sunday, April 14. It features keynote talks from leaders in both academia and industry, peer-reviewed workshop papers, and lightning talks from ICASSP regular tracks on signal processing and machine learning for automotive radar and, more generally, radar perception.

        Gordon Wichern will present an invited keynote talk on analyzing and interpreting audio deep learning models at the Workshop on Explainable Machine Learning for Speech and Audio (XAI-SA), held on Monday, April 15. He will also appear in a panel discussion on interpretable audio AI at the workshop.

        Perry Wang also co-organizes a two-part special session on Next-Generation Wi-Fi Sensing (SS-L9 and SS-L13) which will be held on Thursday afternoon, April 18. The special session includes papers on PHY-layer oriented signal processing and data-driven deep learning advances, and supports upcoming 802.11bf WLAN Sensing Standardization activities.

        Petros Boufounos is participating as a mentor in ICASSP’s Micro-Mentoring Experience Program (MiME).

        About ICASSP

        ICASSP is the flagship conference of the IEEE Signal Processing Society and the world's largest and most comprehensive technical conference focused on research advances and the latest technological developments in signal and information processing. The event attracts more than 3,000 participants.
    •  TALK    [MERL Seminar Series 2024] Greta Tuckute presents a talk titled "Computational models of human auditory and language processing"
      Date & Time: Wednesday, January 31, 2024; 12:00 PM
      Speaker: Greta Tuckute, MIT
      MERL Host: Sameer Khurana
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Abstract
      • Advances in machine learning have led to powerful models for audio and language, proficient in tasks like speech recognition and fluent language generation. Beyond their immense utility in engineering applications, these models offer valuable tools for cognitive science and neuroscience. In this talk, I will demonstrate how these artificial neural network models can be used to understand how the human brain processes language. The first part of the talk will cover how audio neural networks serve as computational accounts for brain activity in the auditory cortex. The second part will focus on the use of large language models, such as those in the GPT family, to non-invasively control brain activity in the human language system.

  • Awards

    •  AWARD    MERL team wins the Audio-Visual Speech Enhancement (AVSE) 2023 Challenge
      Date: December 16, 2023
      Awarded to: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux
      MERL Contacts: François Germain; Chiori Hori; Sameer Khurana; Jonathan Le Roux; Gordon Wichern; Yoshiki Masuyama
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL's Speech & Audio team ranked 1st out of 12 teams in the 2nd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSE). The team was led by Zexu Pan, and also included Gordon Wichern, Yoshiki Masuyama, Francois Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux.

        The AVSE challenge aims to design better speech enhancement systems by harnessing the visual aspects of speech (such as lip movements and gestures) in a manner similar to the brain's multi-modal integration strategies. MERL's system was a scenario-aware audio-visual TF-GridNet that incorporates the face recording of a target speaker as a conditioning factor and recognizes whether the predominant interference signal is speech or background noise. In addition to outperforming all competing systems on the objective metrics by a wide margin, MERL's model achieved the best overall word intelligibility score in a listening test: 84.54%, compared to 57.56% for the baseline and 80.41% for the next best team. The Fisher's least significant difference (LSD) was 2.14%, indicating that the model offered statistically significant speech intelligibility improvements over all other systems.
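        The significance comparison above amounts to a few lines of arithmetic. The following is a minimal sketch, not the challenge's actual evaluation code; the scores and LSD value are those reported in the brief:

        ```python
        # Listening-test word intelligibility scores (%) from the AVSE 2023 results above.
        scores = {"MERL": 84.54, "next best team": 80.41, "baseline": 57.56}
        lsd = 2.14  # Fisher's least significant difference, in percentage points

        # Two systems differ significantly if their mean scores differ by more than the LSD.
        for name, score in scores.items():
            if name != "MERL":
                diff = scores["MERL"] - score
                print(f"MERL vs {name}: diff = {diff:.2f} pts, significant = {diff > lsd}")
        ```

        Since the smallest gap (4.13 points over the next best team) already exceeds the LSD, the improvement is significant against every other system.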
  • Research Highlights

  • Internships with Sameer

    • SA0045: Internship - Universal Audio Compression and Generation

      We are seeking graduate students interested in helping advance the fields of universal audio compression and generation. We aim to build a single generative model that can perform multiple audio generation tasks conditioned on multimodal context. The interns will collaborate with MERL researchers to derive and implement new models and optimization methods, conduct experiments, and prepare results for publication. Internships regularly lead to one or more publications in top-tier venues, which can later become part of the intern's doctoral work. The ideal candidates are Ph.D. students with experience in some of the following: deep generative modeling, large language models, and neural audio codecs. The internship typically lasts 3-6 months.

  • MERL Publications

    •  Saijo, K., Ebbers, J., Germain, F.G., Khurana, S., Wichern, G., Le Roux, J., "Leveraging Audio-Only Data for Text-Queried Target Sound Extraction", arXiv, September 2024.
      BibTeX arXiv
      • @article{Saijo2024sep3,
      • author = {Saijo, Kohei and Ebbers, Janek and Germain, François G and Khurana, Sameer and Wichern, Gordon and Le Roux, Jonathan},
      • title = {Leveraging Audio-Only Data for Text-Queried Target Sound Extraction},
      • journal = {arXiv},
      • year = 2024,
      • month = sep,
      • url = {https://arxiv.org/abs/2409.13152v1}
      • }
    •  Khurana, S., Hori, C., Laurent, A., Wichern, G., Le Roux, J., "ZeroST: Zero-Shot Speech Translation", Interspeech, DOI: 10.21437/Interspeech.2024-1088, September 2024, pp. 392-396.
      BibTeX TR2024-122 PDF
      • @inproceedings{Khurana2024sep,
      • author = {Khurana, Sameer and Hori, Chiori and Laurent, Antoine and Wichern, Gordon and Le Roux, Jonathan},
      • title = {ZeroST: Zero-Shot Speech Translation},
      • booktitle = {Interspeech},
      • year = 2024,
      • pages = {392--396},
      • month = sep,
      • doi = {10.21437/Interspeech.2024-1088},
      • issn = {2958-1796},
      • url = {https://www.merl.com/publications/TR2024-122}
      • }
    •  Kambara, M., Hori, C., Sugiura, K., Ota, K., Jha, D.K., Khurana, S., Jain, S., Corcodel, R., Romeres, D., Le Roux, J., "Human Action Understanding-based Robot Planning using Multimodal LLM", IEEE International Conference on Robotics and Automation (ICRA) Workshop, June 2024.
      BibTeX TR2024-066 PDF
      • @inproceedings{Kambara2024jun,
      • author = {Kambara, Motonari and Hori, Chiori and Sugiura, Komei and Ota, Kei and Jha, Devesh K. and Khurana, Sameer and Jain, Siddarth and Corcodel, Radu and Romeres, Diego and Le Roux, Jonathan},
      • title = {Human Action Understanding-based Robot Planning using Multimodal LLM},
      • booktitle = {IEEE International Conference on Robotics and Automation (ICRA) Workshop},
      • year = 2024,
      • month = jun,
      • url = {https://www.merl.com/publications/TR2024-066}
      • }
    •  Koo, J., Wichern, G., Germain, F.G., Khurana, S., Le Roux, J., "SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers", arXiv, April 2024.
      BibTeX arXiv
      • @article{Koo2024apr2,
      • author = {Koo, Junghyun and Wichern, Gordon and Germain, François G and Khurana, Sameer and Le Roux, Jonathan},
      • title = {SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers},
      • journal = {arXiv},
      • year = 2024,
      • month = apr,
      • url = {https://arxiv.org/abs/2404.02252}
      • }
    •  Koo, J., Wichern, G., Germain, F.G., Khurana, S., Le Roux, J., "Understanding and Controlling Generative Music Transformers by Probing Individual Attention Heads", IEEE ICASSP Satellite Workshop on Explainable Machine Learning for Speech and Audio (XAI-SA), April 2024.
      BibTeX TR2024-032 PDF
      • @inproceedings{Koo2024apr,
      • author = {Koo, Junghyun and Wichern, Gordon and Germain, François G and Khurana, Sameer and Le Roux, Jonathan},
      • title = {Understanding and Controlling Generative Music Transformers by Probing Individual Attention Heads},
      • booktitle = {IEEE ICASSP Satellite Workshop on Explainable Machine Learning for Speech and Audio (XAI-SA)},
      • year = 2024,
      • month = apr,
      • url = {https://www.merl.com/publications/TR2024-032}
      • }
  • Other Publications

    •  Yuan Gong, Sameer Khurana, Leonid Karlinsky and James Glass, "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers", Interspeech 2023, 2023.
      BibTeX
      • @Inproceedings{gong2023whisper,
      • author = {Gong, Yuan and Khurana, Sameer and Karlinsky, Leonid and Glass, James},
      • title = {Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers},
      • booktitle = {Interspeech 2023},
      • year = 2023
      • }
    •  Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote and James Glass, "Improved Cross-Lingual Transfer Learning For Automatic Speech Translation", Preprint 2023, 2023.
      BibTeX
      • @Article{khurana2023improved,
      • author = {Khurana, Sameer and Dawalatabad, Nauman and Laurent, Antoine and Vicente, Luis and Gimeno, Pablo and Mingote, Victoria and Glass, James},
      • title = {Improved Cross-Lingual Transfer Learning For Automatic Speech Translation},
      • journal = {Preprint 2023},
      • year = 2023
      • }
    •  Sameer Khurana, "Transfer Learning For Spoken Language Processing", Ph.D. thesis, Massachusetts Institute of Technology, 2023.
      BibTeX
      • @Phdthesis{khurana2023transfer,
      • author = {Khurana, Sameer},
      • title = {Transfer Learning For Spoken Language Processing},
      • school = {Massachusetts Institute of Technology},
      • year = 2023
      • }
    •  Antoine Laurent, Souhir Gahbiche, Ha Nguyen, Haroun Elleuch, Fethi Bougares, Antoine Thiol, Hugo Riguidel, Salima Mdhaffar, Gaëlle Laperrière, Lucas Maison and others, "ON-TRAC consortium systems for the IWSLT 2023 dialectal and low-resource speech translation tasks", IWSLT 2023, 2023.
      BibTeX
      • @Inproceedings{laurent2023trac,
      • author = {Laurent, Antoine and Gahbiche, Souhir and Nguyen, Ha and Elleuch, Haroun and Bougares, Fethi and Thiol, Antoine and Riguidel, Hugo and Mdhaffar, Salima and Laperri{\`e}re, Ga{\"e}lle and Maison, Lucas and others},
      • title = {ON-TRAC consortium systems for the IWSLT 2023 dialectal and low-resource speech translation tasks},
      • booktitle = {IWSLT 2023},
      • year = 2023
      • }
    •  Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent and Jarod Duret, "Direct Text to Speech Translation System Using Acoustic Units", IEEE Signal Processing Letters, 2023.
      BibTeX
      • @Article{mingote2023direct,
      • author = {Mingote, Victoria and Gimeno, Pablo and Vicente, Luis and Khurana, Sameer and Laurent, Antoine and Duret, Jarod},
      • title = {Direct Text to Speech Translation System Using Acoustic Units},
      • journal = {IEEE Signal Processing Letters},
      • year = 2023,
      • publisher = {IEEE}
      • }
    •  Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury and James Glass, "Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages", Interspeech 2023, 2023.
      BibTeX
      • @Inproceedings{rouditchenko2023comparison,
      • author = {Rouditchenko, Andrew and Khurana, Sameer and Thomas, Samuel and Feris, Rogerio and Karlinsky, Leonid and Kuehne, Hilde and Harwath, David and Kingsbury, Brian and Glass, James},
      • title = {Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages},
      • booktitle = {Interspeech 2023},
      • year = 2023
      • }
    •  Nauman Dawalatabad, Yuan Gong, Sameer Khurana, Rhoda Au and James Glass, "Detecting Dementia from Long Neuropsychological Interviews", EMNLP 2022, 2022.
      BibTeX
      • @Inproceedings{dawalatabad2022detecting,
      • author = {Dawalatabad, Nauman and Gong, Yuan and Khurana, Sameer and Au, Rhoda and Glass, James},
      • title = {Detecting Dementia from Long Neuropsychological Interviews},
      • booktitle = {EMNLP 2022},
      • year = 2022
      • }
    •  Nauman Dawalatabad, Sameer Khurana, Antoine Laurent and James Glass, "On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration", ICASSP 2023, 2022.
      BibTeX
      • @Inproceedings{dawalatabad2022unsupervised,
      • author = {Dawalatabad, Nauman and Khurana, Sameer and Laurent, Antoine and Glass, James},
      • title = {On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration},
      • booktitle = {ICASSP 2023},
      • year = 2022
      • }
    •  Yuan Gong, Sameer Khurana, Andrew Rouditchenko and James Glass, "CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification", Preprint 2022, 2022.
      BibTeX
      • @Article{gong2022cmkd,
      • author = {Gong, Yuan and Khurana, Sameer and Rouditchenko, Andrew and Glass, James},
      • title = {CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification},
      • journal = {Preprint 2022},
      • year = 2022
      • }
    •  Sameer Khurana, Antoine Laurent and James Glass, "Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0", ICASSP 2022, 2022.
      BibTeX
      • @Inproceedings{khurana2022magic,
      • author = {Khurana, Sameer and Laurent, Antoine and Glass, James},
      • title = {Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0},
      • booktitle = {ICASSP 2022},
      • year = 2022,
      • organization = {IEEE}
      • }
    •  Sameer Khurana, Antoine Laurent and James Glass, "SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation", IEEE Journal of Selected Topics in Signal Processing, 2022.
      BibTeX
      • @Article{khurana2022samu,
      • author = {Khurana, Sameer and Laurent, Antoine and Glass, James},
      • title = {SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation},
      • journal = {IEEE Journal of Selected Topics in Signal Processing},
      • year = 2022
      • }
    •  Anthony Larcher, Yannick Estève, Mickael Rouvier, Natalia Tomashenko, Jarod Duret, Gaelle Laperriere, Santosh Kesijaru, Marek Sarvas, Renata Kohlova, Henry Li and others, "Multi-lingual Speech to Speech Translation for Under-Resourced Languages", 2022.
      BibTeX
      • @Inproceedings{larcher2022multi,
      • author = {Larcher, Anthony and Est{\`e}ve, Yannick and Rouvier, Mickael and Tomashenko, Natalia and Duret, Jarod and Laperriere, Gaelle and Kesijaru, Santosh and Sarvas, Marek and Kohlova, Renata and Li, Henry and others},
      • title = {Multi-lingual Speech to Speech Translation for Under-Resourced Languages},
      • year = 2022,
      • organization = {Jelinek Summer Workshop on Speech and Language Technology 2022}
      • }
    •  Sameer Khurana, Niko Moritz, Takaaki Hori and Jonathan Le Roux, "Unsupervised domain adaptation for speech recognition via uncertainty driven self-training", ICASSP 2021, 2021.
      BibTeX
      • @Inproceedings{khurana2021unsupervised,
      • author = {Khurana, Sameer and Moritz, Niko and Hori, Takaaki and Le Roux, Jonathan},
      • title = {Unsupervised domain adaptation for speech recognition via uncertainty driven self-training},
      • booktitle = {ICASSP 2021},
      • year = 2021,
      • organization = {IEEE}
      • }
    •  Cheng-I Jeff Lai, Yang Zhang, Alexander H Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox and Jim Glass, "Parp: Prune, adjust and re-prune for self-supervised speech recognition", NeurIPS 2021, 2021.
      BibTeX
      • @Inproceedings{lai2021parp,
      • author = {Lai, Cheng-I Jeff and Zhang, Yang and Liu, Alexander H and Chang, Shiyu and Liao, Yi-Lun and Chuang, Yung-Sung and Qian, Kaizhi and Khurana, Sameer and Cox, David and Glass, Jim},
      • title = {Parp: Prune, adjust and re-prune for self-supervised speech recognition},
      • booktitle = {NeurIPS 2021},
      • year = 2021
      • }
    •  Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer and James Glass, "A convolutional deep markov model for unsupervised speech representation learning", Interspeech 2020, 2020.
      BibTeX
      • @Inproceedings{khurana2020convolutional,
      • author = {Khurana, Sameer and Laurent, Antoine and Hsu, Wei-Ning and Chorowski, Jan and Lancucki, Adrian and Marxer, Ricard and Glass, James},
      • title = {A convolutional deep markov model for unsupervised speech representation learning},
      • booktitle = {Interspeech 2020},
      • year = 2020
      • }
    •  Sameer Khurana, Antoine Laurent and James Glass, "Cstnet: Contrastive speech translation network for self-supervised speech representation learning", Preprint, 2020.
      BibTeX
      • @Article{khurana2020cstnet,
      • author = {Khurana, Sameer and Laurent, Antoine and Glass, James},
      • title = {Cstnet: Contrastive speech translation network for self-supervised speech representation learning},
      • journal = {Preprint},
      • year = 2020
      • }
    •  Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans JGA Dolfing, Sameer Khurana, Tanel Alumäe and Antoine Laurent, "Robust training of vector quantized bottleneck models", IJCNN 2020, 2020.
      BibTeX
      • @Inproceedings{lancucki2020robust,
      • author = {{\L}a{\'n}cucki, Adrian and Chorowski, Jan and Sanchez, Guillaume and Marxer, Ricard and Chen, Nanxin and Dolfing, Hans JGA and Khurana, Sameer and Alum{\"a}e, Tanel and Laurent, Antoine},
      • title = {Robust training of vector quantized bottleneck models},
      • booktitle = {IJCNN 2020},
      • year = 2020,
      • organization = {IEEE}
      • }
    •  Sameer Khurana, Ahmed Ali and James Glass, "DARTS: Dialectal Arabic transcription system", Preprint, 2019.
      BibTeX
      • @Article{khurana2019darts,
      • author = {Khurana, Sameer and Ali, Ahmed and Glass, James},
      • title = {DARTS: Dialectal Arabic transcription system},
      • journal = {Preprint},
      • year = 2019
      • }
    •  Sameer Khurana, Shafiq Rayhan Joty, Ahmed Ali and James Glass, "A Factorial Deep Markov Model For Unsupervised Disentangled Representation Learning From Speech", ICASSP 2019, 2019.
      BibTeX
      • @Inproceedings{khurana2019factorial,
      • author = {Khurana, Sameer and Joty, Shafiq Rayhan and Ali, Ahmed and Glass, James},
      • title = {A Factorial Deep Markov Model For Unsupervised Disentangled Representation Learning From Speech},
      • booktitle = {ICASSP 2019},
      • year = 2019
      • }
    •  Sameer Khurana, Reda Rawi, Khalid Kunji, Gwo-Yu Chuang, Halima Bensmail, Raghvendra Mall and Alfonso Valencia, "DeepSol: A Deep Learning Framework for Sequence-Based Protein Solubility Prediction", Bioinformatics, 2018.
      BibTeX
      • @Article{khurana2018deepsol,
      • author = {Khurana, Sameer and Rawi, Reda and Kunji, Khalid and Chuang, Gwo-Yu and Bensmail, Halima and Mall, Raghvendra and Valencia, Alfonso},
      • title = {DeepSol: A Deep Learning Framework for Sequence-Based Protein Solubility Prediction},
      • journal = {Bioinformatics},
      • year = 2018
      • }
    •  Maryam Najafian, Sameer Khurana, Suwon Shon, Ahmed Ali and James Glass, "Exploiting convolutional neural networks for phonotactic based dialect identification", ICASSP 2018, 2018.
      BibTeX
      • @Inproceedings{najafian2018exploiting,
      • author = {Najafian, Maryam and Khurana, Sameer and Shon, Suwon and Ali, Ahmed and Glass, James},
      • title = {Exploiting convolutional neural networks for phonotactic based dialect identification},
      • booktitle = {ICASSP 2018},
      • year = 2018,
      • organization = {IEEE}
      • }
  • Software & Data Downloads