TR2025-002

Temporally Grounding Instructional Diagrams in Unconstrained Videos


    Zhang, J., Zhang, F., Rodriguez, C., Ben-Shabat, I., Cherian, A., Gould, S., "Temporally Grounding Instructional Diagrams in Unconstrained Videos", IEEE Winter Conference on Applications of Computer Vision (WACV), December 2024.
      @inproceedings{Zhang2024dec,
        author = {Zhang, Jiahao and Zhang, Frederic and Rodriguez, Cristian and Ben-Shabat, Itzik and Cherian, Anoop and Gould, Stephen},
        title = {Temporally Grounding Instructional Diagrams in Unconstrained Videos},
        booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
        year = 2024,
        month = dec,
        url = {https://www.merl.com/publications/TR2025-002}
      }
Research Areas:

Artificial Intelligence, Computer Vision, Machine Learning

Abstract:

We study the challenging problem of simultaneously localizing a sequence of instructional diagram queries in a video. This requires understanding not only the individual diagram queries but also their interrelationships. However, most existing methods ground one query at a time, ignoring the inherent structure among queries, such as their general mutual exclusiveness and temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, harming accuracy. In this paper, we tackle this issue by grounding a sequence of step diagrams simultaneously. Specifically, we propose composite queries, constructed by exhaustively pairing the visual content features of the step diagrams with a fixed number of learnable positional embeddings. Our insight is that, under self-attention, composite queries carrying different content features suppress each other, which reduces timespan overlaps in predictions, while cross-attention corrects temporal misalignment through joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and on the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while grounding multiple queries simultaneously.
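
The exhaustive pairing behind composite queries can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_composite_queries`, the toy feature values, and the choice to combine a content feature with a positional embedding by element-wise addition are all assumptions made for illustration, since the abstract does not say how the two are combined. In a real model the positional embeddings would be learned parameters and the resulting queries would be fed to a transformer decoder.

```python
from itertools import product


def build_composite_queries(content_feats, pos_embeds):
    """Pair every step-diagram content feature with every positional embedding.

    content_feats: list of M feature vectors, one per step diagram.
    pos_embeds:    list of K positional embeddings (learnable in a real model).

    Returns M * K tuples (diagram_index, position_index, query_vector).
    Element-wise addition is an assumption of this sketch, not a detail
    taken from the paper.
    """
    queries = []
    pairs = product(enumerate(content_feats), enumerate(pos_embeds))
    for (i, content), (j, position) in pairs:
        if len(content) != len(position):
            raise ValueError("content and positional vectors must have equal length")
        queries.append((i, j, [c + p for c, p in zip(content, position)]))
    return queries


# Toy example: 3 step diagrams, 2 positional embeddings, feature size 2.
content_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pos_embeds = [[0.5, 0.5], [-0.5, -0.5]]

queries = build_composite_queries(content_feats, pos_embeds)
print(len(queries))  # 3 diagrams x 2 positions
```

Each diagram contributes one query per positional embedding, so queries that share a content feature differ only in their positional component. That is the structure on which the self-attention and cross-attention described in the abstract would operate.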