TR2026-083

WISE: Weighted Iterative Society-of-Experts for Multimodal Multi-Agent Debate with Probabilistic Consensus


    •  Cherian, A., Lohit, S., Peng, K.-C., "WISE: Weighted Iterative Society-of-Experts for Multimodal Multi-Agent Debate with Probabilistic Consensus", ICML SCALE AI Workshop, June 2026.
      @inproceedings{Cherian2026jun,
        author = {Cherian, Anoop and Lohit, Suhas and Peng, Kuan-Chuan},
        title = {{WISE: Weighted Iterative Society-of-Experts for Multimodal Multi-Agent Debate with Probabilistic Consensus}},
        booktitle = {ICML SCALE AI Workshop},
        year = 2026,
        month = jun,
        url = {https://www.merl.com/publications/TR2026-083}
      }
  • Research Areas:

    Artificial Intelligence, Computer Vision, Machine Learning

Abstract:

Multi-agent debate (MAD) is a powerful paradigm for combining multiple large language models (LLMs) to achieve robust reasoning, but prior work has largely focused on language-only settings, leaving its multimodal potential underexplored. We present Weighted Iterative Society-of-Experts (WISE), a generalized MAD framework that systematically integrates heterogeneous multimodal LLMs to address challenging vision-and-language tasks in a zero-shot setting. Our key idea is to factor agents into three roles based on their multimodal capabilities: Solvers, which process multimodal inputs and generate candidate solutions; Reflectors, which may or may not access multimodal inputs but evaluate solutions, provide feedback, and assign weights; and an Orchestrator, which operates unimodally to reason over solutions and feedback and produce directives that guide subsequent reasoning. To account for varying agent reliability, we introduce an unsupervised probabilistic aggregation method, termed WISE–Dawid–Skene, which leverages the weighting scheme in WISE-MAD to adaptively combine agent outputs. We evaluate WISE on several challenging mathematical reasoning datasets and show that it consistently outperforms state-of-the-art methods across diverse LLM configurations, demonstrating its effectiveness as a general and scalable multimodal reasoning framework.
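
For readers unfamiliar with Dawid-Skene aggregation, the following is a minimal sketch of the classic Dawid-Skene EM estimator, which infers a consensus answer and a per-agent confusion matrix without ground-truth labels. It is not the WISE–Dawid–Skene method from the paper, whose details the abstract does not give. The function name `dawid_skene`, the `weights` parameter (an illustrative way per-agent weights might enter the likelihood), the `smoothing` constant, and the toy answer matrix are all assumptions made for this example.

```python
import numpy as np

def dawid_skene(labels, n_classes, weights=None, n_iter=50, smoothing=1e-2):
    """Classic Dawid-Skene EM on an (n_items, n_agents) integer answer matrix.

    `weights` is a hypothetical per-agent scaling of each agent's
    log-likelihood term; it is an assumption, not the paper's scheme.
    """
    labels = np.asarray(labels)
    n_items, n_agents = labels.shape
    w = np.ones(n_agents) if weights is None else np.asarray(weights, dtype=float)

    # onehot[i, a, k] = 1 if agent a gave answer k on item i.
    onehot = np.eye(n_classes)[labels]

    # Initialise the posterior over true answers with a smoothed weighted vote.
    post = (onehot * w[None, :, None]).sum(axis=1) + smoothing
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: smoothed class prior and per-agent confusion matrices,
        # where conf[a, t, k] estimates P(agent a answers k | true answer t).
        prior = (post.sum(axis=0) + smoothing) / (n_items + n_classes * smoothing)
        conf = np.einsum("it,iak->atk", post, onehot) + smoothing
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over the true answer of each item.
        log_post = np.log(prior)[None, :] + np.einsum(
            "iak,atk,a->it", onehot, np.log(conf), w
        )
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf

# Toy data (made up): agents 0 and 1 always agree, agent 2 is noisy.
answers = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
post, conf = dawid_skene(answers, n_classes=2)
consensus = post.argmax(axis=1)
```

Here `post` holds a probability distribution over answers for each item and `conf` holds the estimated reliability of each agent. Agents whose confusion matrix is close to the identity have more influence on the consensus than noisy ones, which is the general idea behind reliability-aware aggregation.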