My talks at CVPR 2026 workshops

This post accompanies my CVPR 2026 workshop talks.
For each talk, I share the slide deck, link the workshop page, and include a short summary.

1) Agentic Ambient Intelligence: Perception, Reasoning & Action

Quick summary

This talk presents a capability stack for real-world AI assistants that operate in physical environments.
The focus is on four ingredients: space-time grounding, long-horizon active evidence search, scalable long-context video understanding, and motion-guided action/control.

Covered papers:

  • Strefer [1]
  • Active Video Perception (AVP) [2]
  • Linear Scaling Video VLMs for Long Video Understanding [3]
  • Future Optical Flow Prediction (FOFPred) [4]

2) Scaling Transformers: Architectures, Longer Contexts, Better Data

Quick summary

Transformers are central to modern visual AI, but progress is increasingly constrained by three bottlenecks: expensive architecture exploration, costly long-context inference, and limited open data foundations for fair benchmarking.
This talk is organized around three levers that address them: post-training architecture editing, efficient long-context and streaming inference, and large permissively licensed datasets.

Covered papers:

  • Exploring Diffusion Transformer Designs via Grafting [5]
  • Linear Scaling Video VLMs for Long Video Understanding [3]
  • GPIC: A Giant Permissive Image Corpus for Visual Generation [6]

3) Agentic Ambient Intelligence: Efficient Understanding & Action

Quick summary

This talk focuses on building practical Virtual Intelligent Task Assistants (VITAs) that can understand user intent, process long egocentric/streaming visual input, and react in time.
The emphasis is on efficient perception loops, long-context scaling, streaming event detection, and action-oriented motion prediction.

Covered papers:

  • Active Video Perception (AVP) [2]
  • Linear Scaling Video VLMs for Long Video Understanding [3]
  • Streaming Detection of Queried Event Start (SDQES) [7]
  • Future Optical Flow Prediction (FOFPred) [4]

References

  1. Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
    Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S Ryoo, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles
    In ICCV Workshop on What is Next in Multimodal Foundation Models?. Honolulu, Hawaii. Oct 2025
  2. Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
    Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S Ryoo, and Juan Carlos Niebles
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Findings (CVPR Findings). Denver, Colorado. Jun 2026
  3. Linear Scaling Video VLMs for Long Video Understanding
    Cristobal Eyzaguirre, Jiajun Wu, and Juan Carlos Niebles
    May 2026
  4. Future Optical Flow Prediction Improves Robot Control and Video Generation
    Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S Ryoo, and Juan Carlos Niebles
    In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Findings (CVPR Findings). Denver, Colorado. Jun 2026
  5. Exploring Diffusion Transformer Designs via Grafting (Oral)
    Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, and 2 more authors
    In Advances in Neural Information Processing Systems (NeurIPS). San Diego, California. Dec 2025
  6. GPIC: A Giant Permissive Image Corpus for Visual Generation
    Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, and Li Fei-Fei
    May 2026
  7. Streaming Detection of Queried Event Start
    Cristobal Eyzaguirre, Eric Tang, Shyamal Buch, Adrien Gaidon, Jiajun Wu, and Juan Carlos Niebles
    In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track. Vancouver, Canada. Dec 2024


