AI Agents: From Language to Multimodal Reasoning

AI Agents are rapidly evolving from language-only models to sophisticated multimodal reasoners. This post outlines our research journey in advancing AI Agents over the past year, accompanying my presentation at the ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence.

The talk follows up on my CVPR 2024 tutorial on Language-based Agents and Large Action Models; this post summarizes its content and consolidates all relevant resources.


From Language-Only to Multimodality

Our journey began with building top-performing Language-only Agents. The core of our research focuses on training Large Action Models (LAMs) to serve as the agent’s ‘brain’. We found two key ingredients for achieving this: a data pipeline that unifies expert trajectories from diverse sources, and a fine-tuning pipeline that combines supervised fine-tuning (SFT), LoRA, and DPO.
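
As a concrete illustration of the first ingredient, here is a minimal sketch of converting heterogeneous expert trajectories into one unified schema before fine-tuning. The schema, field names, and converter are illustrative assumptions, not the actual xLAM pipeline.

```python
# A minimal, hypothetical sketch of trajectory unification (not the actual
# xLAM pipeline): source-specific converters map diverse expert logs into
# one schema, which is then flattened into text for SFT.
from dataclasses import dataclass, field


@dataclass
class UnifiedStep:
    thought: str        # the agent's reasoning before acting
    action: str         # tool/function name invoked
    arguments: dict     # JSON-serializable call arguments
    observation: str    # tool or environment response


@dataclass
class UnifiedTrajectory:
    task: str                                   # natural-language instruction
    tools: list                                 # available tool schemas
    steps: list = field(default_factory=list)   # ordered UnifiedSteps

    def to_training_text(self) -> str:
        """Flatten the trajectory into a single SFT training string."""
        lines = [f"Task: {self.task}", f"Tools: {self.tools}"]
        for s in self.steps:
            lines += [f"Thought: {s.thought}",
                      f"Action: {s.action}({s.arguments})",
                      f"Observation: {s.observation}"]
        return "\n".join(lines)


def from_react_log(log: dict) -> UnifiedTrajectory:
    """One of several per-source converters; the keys here are made up."""
    traj = UnifiedTrajectory(task=log["question"], tools=log["tools"])
    for turn in log["turns"]:
        traj.steps.append(UnifiedStep(
            thought=turn.get("rationale", ""),
            action=turn["tool"],
            arguments=turn.get("args", {}),
            observation=str(turn.get("result", "")),
        ))
    return traj
```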

In our recent xLAM paper [1], we demonstrated how to achieve this effectively, and xLAM became the top-performing model on the Berkeley Function Calling Leaderboard v2.

Building on this success, we then explored how these core training principles apply to Multimodal AI Agents. Our LATTE paper [2] shows that we can train highly effective Multimodal AI Agents for Visual Reasoning by collecting high-quality agent trajectories and fine-tuning a pre-trained VLM with this agentic data.
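
For intuition, here is a hypothetical example of one such multimodal agent trajectory; all field names, tool names, and values are illustrative, not LATTE’s actual data format.

```python
# One hypothetical multimodal training record: the VLM learns to delegate
# perception to vision specialists and then reason over their outputs.
example = {
    "image": "kitchen.jpg",
    "question": "How many mugs are left of the sink?",
    "trajectory": [
        {"thought": "I need object locations before I can compare positions.",
         "action": "object_detector",          # illustrative vision specialist
         "arguments": {"classes": ["mug", "sink"]},
         "observation": [{"label": "mug",  "box": [40, 120, 90, 180]},
                         {"label": "mug",  "box": [300, 110, 350, 170]},
                         {"label": "sink", "box": [200, 100, 280, 200]}]},
        {"thought": "Exactly one mug's box lies entirely left of the sink's.",
         "action": "final_answer",
         "arguments": {"answer": "1"}},
    ],
}
```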


Exploring Self-Improvement via Simulation

These are indeed exciting results. However, a key limitation of both approaches is their reliance on carefully curated expert trajectories for training the agent’s ‘brain’. This poses two significant challenges: data collection and curation demand extensive labor, and, critically, the model’s performance is ultimately capped by the quality of the ‘expert’ input.

This motivated us to explore more advanced learning mechanisms, in particular automatic exploration in simulation, which allows the agent to discover new problem-solving strategies.

We study this in our LAM Simulator paper [3]: we place a language-based agent in an RL-style learning loop that automatically generates trajectories instead of relying on experts, and automatically grades them with rewards computed by evaluation functions.
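
A minimal sketch of this loop follows; the `simulator.sample_task`, `agent.rollout`, and evaluation-function interfaces are assumptions for illustration, not the LAM Simulator API.

```python
# Illustrative explore-and-grade loop (not the LAM Simulator implementation):
# sample rollouts from the current agent, score them with programmatic
# evaluation functions, and keep high-reward trajectories for further training.
def exploration_round(agent, simulator, eval_fns, num_rollouts=64, threshold=1.0):
    kept = []
    for _ in range(num_rollouts):
        task = simulator.sample_task()       # assumed simulator interface
        trajectory = agent.rollout(task)     # agent acts until the task ends
        # Rewards come from automatic checks (e.g., did the final tool call
        # satisfy the task spec?), not from human or expert labels.
        reward = sum(fn(task, trajectory) for fn in eval_fns)
        if reward >= threshold:
            kept.append((trajectory, reward))
    return kept  # fodder for the next SFT/DPO-style update of the LAM
```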

We then extended this idea of self-exploration and self-improvement to Multimodal AI Agents. In our ViUniT paper [4], we showed that visual programming agents can improve their robustness and performance by leveraging automatically generated visual unit tests. Similarly, in related work, we trained a VLM-based agent [5] with reinforcement learning and again saw substantial performance improvements over baseline VLM agents.
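
Below is a minimal sketch of how such a visual unit test might gate a generated visual program; the executor interface and matching rule are illustrative assumptions, not the paper’s implementation.

```python
# Illustrative visual unit testing in the spirit of ViUniT: run a generated
# visual program on image/answer pairs whose answers are known by
# construction, and accept the program only if every test passes.
def passes_unit_tests(program, unit_tests, executor):
    """unit_tests: list of (image, expected_answer) pairs."""
    for image, expected in unit_tests:
        try:
            result = executor.run(program, image)   # assumed executor API
        except Exception:
            return False                            # crashing programs fail
        if str(result).strip().lower() != str(expected).strip().lower():
            return False
    return True

# Programs that fail their tests can be discarded or regenerated, improving
# robustness without additional human annotation.
```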


Conclusion

These are exciting times for AI Agents. While we’ve made incredible progress in transitioning agents from language to robust visual reasoning, the path forward is clear: enabling agents to operate with more modalities beyond Language and Vision, and developing more advanced, self-improving learning mechanisms. I look forward to continuing our work and seeing how quickly this space develops in the research community.

References

  1. xLAM: A Family of Large Action Models to Empower AI Agent Systems (Oral)
    Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, and 12 more authors
    In The 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). Albuquerque, New Mexico. Apr 2025
  2. LATTE: Learning to Think with Vision Specialists (Oral)
    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, and 2 more authors
    In Conference on Empirical Methods in Natural Language Processing (EMNLP). Suzhou, China. Nov 2025
  3. LAM Simulator: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
    Thai Quoc Hoang, Kung-Hsiang Huang, Shirley Kokane, Jianguo Zhang, Zuxin Liu, Ming Zhu, Jake Grigsby, Tian Lan, Michael S. Ryoo, Chien-Sheng Wu, and 5 more authors
    In ACL Findings. Vienna, Austria. Jul 2025
  4. ViUniT: Visual Unit Tests for More Robust Visual Programming
    Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, and Juan Carlos Niebles
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, Tennessee. Jun 2025
  5. VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making
    Jake Grigsby, Yuke Zhu, Michael S. Ryoo, and Juan Carlos Niebles
    In ICLR 2025 Workshop on Scaling Self-Improving Foundation Models without Human Supervision. Singapore. Apr 2025


