Moitreya Chatterjee

  • Biography

    Moitreya's research interests are in computer vision and multimodal machine learning, with a particular emphasis on learning from audio-visual data. His PhD work received the Joan and Lalit Bahl Fellowship and the Thomas and Margaret Huang Research Award. Earlier, he earned an M.S. degree in Computer Science from the University of Southern California (USC), during which he received an Outstanding Paper Award from the ACM International Conference on Multimodal Interaction (ICMI).

  • Recent News & Events

    •  NEWS    MERL Papers and Workshops at CVPR 2024
      Date: June 17, 2024 - June 21, 2024
      Where: Seattle, WA
      MERL Contacts: Petros T. Boufounos; Moitreya Chatterjee; Anoop Cherian; Michael J. Jones; Toshiaki Koike-Akino; Jonathan Le Roux; Suhas Lohit; Tim K. Marks; Pedro Miraldo; Jing Liu; Kuan-Chuan Peng; Pu (Perry) Wang; Ye Wang; Matthew Brand
      Research Areas: Artificial Intelligence, Computational Sensing, Computer Vision, Machine Learning, Speech & Audio
      Brief
      • MERL researchers are presenting 5 conference papers and 3 workshop papers, and are co-organizing two workshops, at the CVPR 2024 conference, which will be held in Seattle, June 17-21. CVPR is one of the most prestigious and competitive international conferences in computer vision. Details of MERL contributions are provided below.

        CVPR Conference Papers:

        1. "TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models" by H. Ni, B. Egger, S. Lohit, A. Cherian, Y. Wang, T. Koike-Akino, S. X. Huang, and T. K. Marks

        This work enables a pretrained text-to-video (T2V) diffusion model to be additionally conditioned on an input image (first video frame), yielding a text+image to video (TI2V) model. Other than using the pretrained T2V model, our method requires no ("zero") training or fine-tuning. The paper uses a "repeat-and-slide" method and diffusion resampling to synthesize videos from a given starting image and text describing the video content.

        Paper: https://www.merl.com/publications/TR2024-059
        Project page: https://merl.com/research/highlights/TI2V-Zero

        2. "Long-Tailed Anomaly Detection with Learnable Class Names" by C.-H. Ho, K.-C. Peng, and N. Vasconcelos

        This work aims to identify defects across various classes without relying on hard-coded class names. We introduce the concept of long-tailed anomaly detection, addressing challenges like class imbalance and dataset variability. Our proposed method combines reconstruction and semantic modules, learning pseudo-class names and utilizing a variational autoencoder for feature synthesis to improve performance in long-tailed datasets, outperforming existing methods in experiments.

        Paper: https://www.merl.com/publications/TR2024-040

        3. "Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling" by X. Liu, Y-W. Tai, C-K. Tang, P. Miraldo, S. Lohit, and M. Chatterjee

        This work presents a new strategy for rendering dynamic scenes from novel viewpoints. Our approach stratifies the scene into regions according to their extent of motion, which is determined automatically. Regions with higher motion are given denser spatio-temporal sampling for more faithful rendering of the scene. Additionally, to the best of our knowledge, ours is the first work to enable tracking of objects in the scene from novel views, with the object of interest specified by a user's click.

        Paper: https://www.merl.com/publications/TR2024-042

        4. "SIRA: Scalable Inter-frame Relation and Association for Radar Perception" by R. Yataka, P. Wang, P. T. Boufounos, and R. Takahashi

        To overcome limitations of radar feature extraction such as low spatial resolution, multipath reflection, and motion blur, this paper proposes SIRA (Scalable Inter-frame Relation and Association) for scalable radar perception with two designs: 1) extended temporal relation, which generalizes the existing temporal relation layer from two frames to multiple inter-frames with temporally regrouped window attention for scalability; and 2) motion consistency track, which uses a pseudo-tracklet generated from observational data for better object association.

        Paper: https://www.merl.com/publications/TR2024-041

        5. "RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation" by Z. Yang, J. Liu, P. Chen, A. Cherian, T. K. Marks, J. Le Roux, and C. Gan

        We leverage Large Language Models (LLMs) for zero-shot semantic audio-visual navigation. Specifically, by employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment by adaptively evaluating and dismissing inaccurate perceptual descriptions.

        Paper: https://www.merl.com/publications/TR2024-043

        CVPR Workshop Papers:

        1. "CoLa-SDF: Controllable Latent StyleSDF for Disentangled 3D Face Generation" by R. Dey, B. Egger, V. Boddeti, Y. Wang, and T. K. Marks

        This paper proposes a new method for generating 3D faces and rendering them to images by combining the controllability of nonlinear 3DMMs with the high fidelity of implicit 3D GANs. Inspired by StyleSDF, our model uses a similar architecture but enforces the latent space to match the interpretable and physical parameters of the nonlinear 3D morphable model MOST-GAN.

        Paper: https://www.merl.com/publications/TR2024-045

        2. "Tracklet-based Explainable Video Anomaly Localization" by A. Singh, M. J. Jones, and E. Learned-Miller

        This paper describes a new method for localizing anomalous activity in video of a scene given sample videos of normal activity from the same scene. The method is based on detecting and tracking objects in the scene and estimating high-level attributes of the objects such as their location, size, short-term trajectory and object class. These high-level attributes can then be used to detect unusual activity as well as to provide a human-understandable explanation for what is unusual about the activity.

        Paper: https://www.merl.com/publications/TR2024-057

        3. "SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision Models" by X. Chen, J. Liu, Y. Wang, P. Wang, M. Brand, G. Wang, and T. Koike-Akino

        This paper proposes a generalized framework called SuperLoRA that unifies and extends different variants of low-rank adaptation (LoRA). By introducing new options for grouping, folding, shuffling, projection, and tensor decomposition, SuperLoRA offers high flexibility and demonstrates superior performance, with up to a 10-fold gain in parameter efficiency, on transfer learning tasks.

        Paper: https://www.merl.com/publications/TR2024-062

        MERL co-organized workshops:

        1. "Multimodal Algorithmic Reasoning Workshop" by A. Cherian, K-C. Peng, S. Lohit, M. Chatterjee, H. Zhou, K. Smith, T. K. Marks, J. Mathissen, and J. Tenenbaum

        Workshop link: https://marworkshop.github.io/cvpr24/index.html

        2. "The 5th Workshop on Fair, Data-Efficient, and Trusted Computer Vision" by K-C. Peng, et al.

        Workshop link: https://fadetrcv.github.io/2024/
    •  TALK    [MERL Seminar Series 2023] Dr. Tanmay Gupta presents talk titled Visual Programming - A compositional approach to building General Purpose Vision Systems
      Date & Time: Tuesday, October 31, 2023; 2:00 PM
      Speaker: Tanmay Gupta, Allen Institute for Artificial Intelligence
      MERL Host: Moitreya Chatterjee
      Research Areas: Artificial Intelligence, Computer Vision, Machine Learning
      Abstract
      • Building General Purpose Vision Systems (GPVs) that can perform a huge variety of tasks has been a long-standing goal for the computer vision community. However, end-to-end training of these systems to handle different modalities and tasks has proven to be extremely challenging. In this talk, I will describe a compelling neuro-symbolic alternative to the common end-to-end learning paradigm called Visual Programming. Visual Programming is a general framework that leverages the code-generation abilities of LLMs, existing neural models, and non-differentiable programs to enable powerful applications. Some of these applications remain elusive for the current generation of end-to-end trained GPVs.

  • MERL Publications

    •  Liu, X., Tai, Y.-W., Tang, C.-K., Miraldo, P., Lohit, S., Chatterjee, M., "Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), May 2024.
      BibTeX TR2024-042 PDF Videos
      • @inproceedings{Liu2024may,
      • author = {Liu, Xinhang and Tai, Yu-wing and Tang, Chi-Keung and Miraldo, Pedro and Lohit, Suhas and Chatterjee, Moitreya},
      • title = {Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling},
      • booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      • year = 2024,
      • month = may,
      • url = {https://www.merl.com/publications/TR2024-042}
      • }
    •  Liu, X., Paul, S., Chatterjee, M., Cherian, A., "CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments", AAAI Conference on Artificial Intelligence, DOI: 10.1609/aaai.v38i4.28167, December 2023, pp. 3765-3773.
      BibTeX TR2023-154 PDF
      • @inproceedings{Liu2023dec2,
      • author = {Liu, Xiulong and Paul, Sudipta and Chatterjee, Moitreya and Cherian, Anoop},
      • title = {CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments},
      • booktitle = {Proceedings of the 38th AAAI Conference on Artificial Intelligence},
      • year = 2023,
      • pages = {3765--3773},
      • month = dec,
      • doi = {10.1609/aaai.v38i4.28167},
      • url = {https://www.merl.com/publications/TR2023-154}
      • }
    •  Sharma, M., Chatterjee, M., Peng, K.-C., Lohit, S., Jones, M.J., "Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection", IEEE International Conference on Computer Vision Workshops (ICCV), October 2023, pp. 924-932.
      BibTeX TR2023-125 PDF Presentation
      • @inproceedings{Sharma2023oct,
      • author = {Sharma, Manish and Chatterjee, Moitreya and Peng, Kuan-Chuan and Lohit, Suhas and Jones, Michael J.},
      • title = {Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection},
      • booktitle = {IEEE International Conference on Computer Vision Workshops (ICCV)},
      • year = 2023,
      • pages = {924--932},
      • month = oct,
      • url = {https://www.merl.com/publications/TR2023-125}
      • }
    •  Liu, X., Paul, S., Chatterjee, M., Cherian, A., "Active Sparse Conversations for Improved Audio-Visual Embodied Navigation", arXiv, June 2023.
      BibTeX arXiv
      • @inproceedings{Liu2023jun,
      • author = {Liu, Xiulong and Paul, Sudipta and Chatterjee, Moitreya and Cherian, Anoop},
      • title = {Active Sparse Conversations for Improved Audio-Visual Embodied Navigation},
      • booktitle = {arXiv},
      • year = 2023,
      • month = jun,
      • url = {https://arxiv.org/abs/2306.04047}
      • }
    •  Chatterjee, M., Ahuja, N., Cherian, A., "Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation", Advances in Neural Information Processing Systems (NeurIPS), November 2022.
      BibTeX TR2022-140 PDF Presentation
      • @inproceedings{Chatterjee2022nov,
      • author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      • title = {Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation},
      • booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
      • year = 2022,
      • month = nov,
      • url = {https://www.merl.com/publications/TR2022-140}
      • }
  • Other Publications

    •  Moitreya Chatterjee, Narendra Ahuja and Anoop Cherian, "A hierarchical variational neural uncertainty model for stochastic video prediction", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9751-9761.
      BibTeX
      • @Inproceedings{chatterjee2021hierarchical,
      • author = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
      • title = {A hierarchical variational neural uncertainty model for stochastic video prediction},
      • booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
      • year = 2021,
      • pages = {9751--9761}
      • }
    •  Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja and Anoop Cherian, "Visual scene graphs for audio source separation", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1204-1213.
      BibTeX
      • @Inproceedings{chatterjee2021visual,
      • author = {Chatterjee, Moitreya and Le Roux, Jonathan and Ahuja, Narendra and Cherian, Anoop},
      • title = {Visual scene graphs for audio source separation},
      • booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
      • year = 2021,
      • pages = {1204--1213}
      • }
    •  Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li and Anoop Cherian, "Dynamic graph representation learning for video dialog via multi-modal shuffled transformers", Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 1415-1423.
      BibTeX
      • @Inproceedings{geng2021dynamic,
      • author = {Geng, Shijie and Gao, Peng and Chatterjee, Moitreya and Hori, Chiori and Le Roux, Jonathan and Zhang, Yongfeng and Li, Hongsheng and Cherian, Anoop},
      • title = {Dynamic graph representation learning for video dialog via multi-modal shuffled transformers},
      • booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
      • year = 2021,
      • volume = 35,
      • number = 2,
      • pages = {1415--1423}
      • }
    •  Moitreya Chatterjee and Anoop Cherian, "Sound2sight: Generating visual dynamics from sound and context", European Conference on Computer Vision, 2020, pp. 701-719.
      BibTeX
      • @Inproceedings{chatterjee2020sound2sight,
      • author = {Chatterjee, Moitreya and Cherian, Anoop},
      • title = {Sound2sight: Generating visual dynamics from sound and context},
      • booktitle = {European Conference on Computer Vision},
      • year = 2020,
      • pages = {701--719},
      • organization = {Springer}
      • }
    •  Abhimanyu Dubey, Moitreya Chatterjee and Narendra Ahuja, "Coreset-based neural network compression", Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 454-470.
      BibTeX
      • @Inproceedings{dubey2018coreset,
      • author = {Dubey, Abhimanyu and Chatterjee, Moitreya and Ahuja, Narendra},
      • title = {Coreset-based neural network compression},
      • booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
      • year = 2018,
      • pages = {454--470}
      • }
    •  Arulkumar Subramaniam, Moitreya Chatterjee and Anurag Mittal, "Deep neural networks with inexact matching for person re-identification", Advances in neural information processing systems, Vol. 29, 2016.
      BibTeX
      • @Article{subramaniam2016deep,
      • author = {Subramaniam, Arulkumar and Chatterjee, Moitreya and Mittal, Anurag},
      • title = {Deep neural networks with inexact matching for person re-identification},
      • journal = {Advances in neural information processing systems},
      • year = 2016,
      • volume = 29
      • }
    •  Moitreya Chatterjee, Sunghyun Park, Louis-Philippe Morency and Stefan Scherer, "Combining two perspectives on classifying multimodal data for recognizing speaker traits", Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, 2015, pp. 7-14.
      BibTeX
      • @Inproceedings{chatterjee2015combining,
      • author = {Chatterjee, Moitreya and Park, Sunghyun and Morency, Louis-Philippe and Scherer, Stefan},
      • title = {Combining two perspectives on classifying multimodal data for recognizing speaker traits},
      • booktitle = {Proceedings of the 2015 ACM on International Conference on Multimodal Interaction},
      • year = 2015,
      • pages = {7--14}
      • }
  • Videos