AI Agents: From Language to Multimodal Reasoning

AI Agents are rapidly evolving from language-only models to sophisticated multimodal reasoners. This post outlines our research journey in advancing AI Agents over the past year, accompanying my presentation at the ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence.

The talk follows up on my CVPR 2024 tutorial on Language-based Agents and Large Action Models; this post summarizes its content and consolidates all relevant resources.


From Language-Only to Multimodality

Our journey began with building top-performing Language-only Agents. The core of our research focuses on training Large Action Models (LAMs) to serve as the agent’s ‘brain’. We found two key ingredients for achieving this: a data pipeline that unifies expert trajectories from diverse sources, and a fine-tuning pipeline that combines supervised fine-tuning (SFT), LoRA, and DPO.
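
As a concrete illustration of the first ingredient, here is a minimal sketch of converting heterogeneous expert trajectories into one unified schema before fine-tuning. The schema, field names, and converter are illustrative assumptions, not the actual xLAM pipeline.

```python
# A minimal, hypothetical sketch of trajectory unification (not the actual
# xLAM pipeline): source-specific converters map diverse expert logs into
# one schema, which is then flattened into text for SFT.
from dataclasses import dataclass, field


@dataclass
class UnifiedStep:
    thought: str        # the agent's reasoning before acting
    action: str         # tool/function name invoked
    arguments: dict     # JSON-serializable call arguments
    observation: str    # tool or environment response


@dataclass
class UnifiedTrajectory:
    task: str                                   # natural-language instruction
    tools: list                                 # available tool schemas
    steps: list = field(default_factory=list)   # ordered UnifiedSteps

    def to_training_text(self) -> str:
        """Flatten the trajectory into a single SFT training string."""
        lines = [f"Task: {self.task}", f"Tools: {self.tools}"]
        for s in self.steps:
            lines += [f"Thought: {s.thought}",
                      f"Action: {s.action}({s.arguments})",
                      f"Observation: {s.observation}"]
        return "\n".join(lines)


def from_react_log(log: dict) -> UnifiedTrajectory:
    """One of several per-source converters; the keys here are made up."""
    traj = UnifiedTrajectory(task=log["question"], tools=log["tools"])
    for turn in log["turns"]:
        traj.steps.append(UnifiedStep(
            thought=turn.get("rationale", ""),
            action=turn["tool"],
            arguments=turn.get("args", {}),
            observation=str(turn.get("result", "")),
        ))
    return traj
```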

In our recent xLAM paper [1], we demonstrated how to achieve this effectively, and xLAM became the top-performing model on the Berkeley Function Calling Leaderboard v2.

Building on this success, we then explored how these core training principles apply to Multimodal AI Agents. Our LATTE paper [2] shows that we can train highly effective Multimodal AI Agents for Visual Reasoning by collecting high-quality agent trajectories and fine-tuning a pre-trained VLM with this agentic data.
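
For intuition, here is a hypothetical example of one such multimodal agent trajectory; all field names, tool names, and values are illustrative, not LATTE’s actual data format.

```python
# One hypothetical multimodal training record: the VLM learns to delegate
# perception to vision specialists and then reason over their outputs.
example = {
    "image": "kitchen.jpg",
    "question": "How many mugs are left of the sink?",
    "trajectory": [
        {"thought": "I need object locations before I can compare positions.",
         "action": "object_detector",          # illustrative vision specialist
         "arguments": {"classes": ["mug", "sink"]},
         "observation": [{"label": "mug",  "box": [40, 120, 90, 180]},
                         {"label": "mug",  "box": [300, 110, 350, 170]},
                         {"label": "sink", "box": [200, 100, 280, 200]}]},
        {"thought": "Exactly one mug's box lies entirely left of the sink's.",
         "action": "final_answer",
         "arguments": {"answer": "1"}},
    ],
}
```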


Exploring Self-Improvement via Simulation

These are indeed exciting results. However, a key limitation of both approaches is their reliance on carefully curated expert trajectories for training the agent’s ‘brain’. This poses two significant challenges: data collection and curation demand extensive labor, and, critically, the model’s performance is ultimately capped by the quality of the ‘expert’ input.

This motivated us to explore more advanced learning mechanisms, in particular automatic exploration in simulation, which allows the agent to discover new problem-solving strategies.

We study this in our LAM Simulator paper [3]: we place a language-based agent in an RL-style learning loop that automatically generates trajectories instead of relying on experts, and automatically grades them with rewards computed by evaluation functions.
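
A minimal sketch of this loop follows; the `simulator.sample_task`, `agent.rollout`, and evaluation-function interfaces are assumptions for illustration, not the LAM Simulator API.

```python
# Illustrative explore-and-grade loop (not the LAM Simulator implementation):
# sample rollouts from the current agent, score them with programmatic
# evaluation functions, and keep high-reward trajectories for further training.
def exploration_round(agent, simulator, eval_fns, num_rollouts=64, threshold=1.0):
    kept = []
    for _ in range(num_rollouts):
        task = simulator.sample_task()       # assumed simulator interface
        trajectory = agent.rollout(task)     # agent acts until the task ends
        # Rewards come from automatic checks (e.g., did the final tool call
        # satisfy the task spec?), not from human or expert labels.
        reward = sum(fn(task, trajectory) for fn in eval_fns)
        if reward >= threshold:
            kept.append((trajectory, reward))
    return kept  # fodder for the next SFT/DPO-style update of the LAM
```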

We then extended this idea of self-exploration and self-improvement to Multimodal AI Agents. In our ViUniT paper [4], we showed that visual programming agents can improve their robustness and performance by leveraging automatically generated visual unit tests. Similarly, in related work, we trained a VLM-based agent [5] with reinforcement learning and again saw substantial performance improvements over baseline VLM agents.
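
Below is a minimal sketch of how such a visual unit test might gate a generated visual program; the executor interface and matching rule are illustrative assumptions, not the paper’s implementation.

```python
# Illustrative visual unit testing in the spirit of ViUniT: run a generated
# visual program on image/answer pairs whose answers are known by
# construction, and accept the program only if every test passes.
def passes_unit_tests(program, unit_tests, executor):
    """unit_tests: list of (image, expected_answer) pairs."""
    for image, expected in unit_tests:
        try:
            result = executor.run(program, image)   # assumed executor API
        except Exception:
            return False                            # crashing programs fail
        if str(result).strip().lower() != str(expected).strip().lower():
            return False
    return True

# Programs that fail their tests can be discarded or regenerated, improving
# robustness without additional human annotation.
```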


Conclusion

These are exciting times for AI Agents. While we’ve made incredible progress in transitioning agents from language to robust visual reasoning, the path forward is clear: enabling agents to operate with more modalities beyond Language and Vision, and developing more advanced, self-improving learning mechanisms. I look forward to continuing our work and seeing how quickly this space develops in the research community.

References

  1. xLAM: A Family of Large Action Models to Empower AI Agent Systems (Oral)
    Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, and 12 more authors
    In The 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, New Mexico. Apr 2025
  2. LATTE: Learning to Think with Vision Specialists (Oral)
    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, and 2 more authors
    In Conference on Empirical Methods in Natural Language Processing (EMNLP). Suzhou, China. Nov 2025
  3. LAM Simulator: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
    Thai Quoc Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S. Ryoo, Chien-Sheng Wu, and 5 more authors
    In ACL Findings. Vienna, Austria. Jul 2025
  4. ViUniT: Visual Unit Tests for More Robust Visual Programming
    Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, and Juan Carlos Niebles
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, Tennessee. Jun 2025
  5. VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
    Jake Grigsby, Yuke Zhu, Michael S. Ryoo, and Juan Carlos Niebles
    In ICLR 2025 Workshop on Scaling Self-Improving Foundation Models without Human Supervision. Singapore. Apr 2025


