Are Your Visual Programs Right for the Wrong Reasons?

With the rapid advancement of multimodal AI, we’ve all been impressed by AI’s ability to answer complex questions about images. From “What color is the ball?” to “Is there a red cube to the left of the blue sphere?”, these models seem to get it. But here’s the kicker: even when they give the right answer, they often use the wrong reasoning process. Yep, about 33% of the time, they’re right for all the wrong reasons. Think of it like a student getting the correct final answer on a math test, but their work is completely messed up!

This is a huge problem. Why? Because these models might fail spectacularly when faced with slightly different scenarios, and this kind of flawed logic can be dangerous when AI is used in critical applications like self-driving cars or medical imaging.

Program Generator AIs for visual question answering can exhibit inconsistent performance due to logical errors in generated programs.

So, how do we fix this? In software development, we use unit tests to make sure our code is rock-solid. Why not do the same for visual reasoning AI?

That’s exactly what we’ve done with ViUniT (Visual Unit Testing). We’ve created a framework that automatically generates unit tests specifically for visual reasoning programs. Think of them as mini-challenges designed to check if the AI’s logic is sound.

Here’s how it works:

First, a user asks a question about an image. In the example below, the question is: "Is there an elephant in the blue water?" A question-answering AI, or Program Generator, produces a candidate Visual Program that attempts to answer the question.
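For illustration, such a candidate Visual Program might look like the sketch below. The `ImagePatch`-style helpers are hypothetical placeholders for a visual programming API, not the exact interface our Program Generator uses, and the flawed logic is intentional.

```python
# Hypothetical candidate visual program for "Is there an elephant in the blue water?"
# ImagePatch and its methods stand in for a visual programming API; the actual
# interface produced by the Program Generator may differ.
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    # Flawed reasoning: only checks that an elephant appears somewhere in the
    # image, never verifying that it is actually standing in blue water.
    elephant_patches = image_patch.find("elephant")
    return "yes" if len(elephant_patches) > 0 else "no"
```

A program like this will often return the right answer for the wrong reasons, which is exactly the failure mode we want to catch. Here's where our Visual Unit Test Generator framework kicks in.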

  1. Custom Challenges: Given a visual question, we use a language model to generate image descriptions together with the answer the question should have for each of them. These description-answer pairs are the basis of our unit tests.
  2. Visual Test: We then use image synthesis, via a Text-to-Image Generator, to create images matching those descriptions.
  3. Logic Test: The resulting image-answer pairs form a Unit Test Suite that verifies whether the AI's program, and the reasoning behind it, correctly answers the original question.
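A minimal sketch of steps 1 and 2, assuming hypothetical `call_llm` and `text_to_image` helpers; the prompt format and the number of tests are illustrative, not the settings used in the paper:

```python
from typing import Any, List, Tuple

def generate_unit_tests(question: str, num_tests: int = 4) -> List[Tuple[Any, str]]:
    """Sketch of steps 1-2: an LLM proposes image descriptions with expected
    answers, and a text-to-image model renders each description.
    call_llm and text_to_image are hypothetical placeholder functions."""
    prompt = (
        f"For the visual question '{question}', write {num_tests} short image "
        "descriptions and the answer the question should have for each image.\n"
        "Format each line as: <description> || <expected answer>"
    )
    unit_tests = []
    for line in call_llm(prompt).strip().splitlines()[:num_tests]:
        description, expected_answer = [part.strip() for part in line.split("||", 1)]
        image = text_to_image(description)           # synthesize a matching image
        unit_tests.append((image, expected_answer))  # one (image, answer) unit test
    return unit_tests
```

For the elephant question, one generated test might be "an elephant standing on dry grassland" with expected answer "no", and another "an elephant wading in a blue lake" with expected answer "yes".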

We can then run the Logic Test to check whether the candidate program passes the unit tests. In the example below, the candidate program fails the tests. Our results show that this feedback can drive multiple strategies for helping the Program Generator improve its logic.
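Conceptually, scoring a candidate program against the suite boils down to an execution loop like the sketch below. It assumes the execution namespace provides whatever visual APIs the generated code calls, and error handling is simplified.

```python
def run_unit_tests(program_code: str, unit_tests: list) -> float:
    """Execute a candidate visual program on each (image, expected answer) pair
    and return the fraction of tests it passes. Assumes the generated code
    defines execute_command(image); a crash counts as a failed test."""
    passed = 0
    for image, expected_answer in unit_tests:
        try:
            namespace: dict = {}
            exec(program_code, namespace)             # load the candidate program
            prediction = namespace["execute_command"](image)
            if str(prediction).strip().lower() == expected_answer.strip().lower():
                passed += 1
        except Exception:
            pass                                      # failed to run or crashed
    return passed / max(len(unit_tests), 1)
```

The flawed elephant program above would fail, for example, on a synthesized image of an elephant on dry grassland whose expected answer is "no".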

In our upcoming CVPR 2025 paper [1], we dove deep into what makes a good suite of these visual unit tests. We explored different ways to generate them, how to pick the best ones, and which image generation methods to use.

Overview of our ViUniT (Visual Unit Testing) framework.

The Results? Mind-blowing!

ViUniT improves model performance by a significant 11.4 percentage points in accuracy. On top of that, we saw 7-billion-parameter open-source models outperform gpt-4o-mini by an average of 7.7 points in visual program synthesis. Plus, we reduced the number of “correct for the wrong reasons” answers by a whopping 40%.

But that’s not all! We also explored four exciting applications of ViUniT:

  • Picking the best AI-generated visual program to answer a given question (sketched below).
  • Knowing when to say “I don’t know.”
  • Improving answers through re-prompting.
  • Creating a better unsupervised reward for reinforcement learning and automated model improvement.

Four Key Applications of ViUniT, demonstrating its versatility beyond performance improvement.
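As a rough sketch, the first two applications, best-program selection and knowing when to abstain, can be built directly on the unit-test pass rate computed above; the 0.5 threshold is illustrative, not the value used in the paper. The same pass rate can also serve as the unsupervised reward signal mentioned in the last application.

```python
def select_best_program(candidate_programs: list, unit_tests: list,
                        refusal_threshold: float = 0.5):
    """Rank candidate programs by unit-test pass rate (application 1) and
    abstain when even the best score is too low (application 2).
    The threshold value is illustrative only."""
    scored = [(run_unit_tests(code, unit_tests), code) for code in candidate_programs]
    best_score, best_program = max(scored, key=lambda pair: pair[0])

    if best_score < refusal_threshold:
        return None, best_score   # say "I don't know" rather than trust a shaky program
    return best_program, best_score
```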

In short, ViUniT is a game-changer for making visual reasoning AI more reliable and trustworthy. It’s like giving these models a rigorous logic test, ensuring they’re not just getting lucky, but actually understanding what they’re seeing.

So, next time you’re amazed by an AI’s visual reasoning skills, remember: it’s not just about getting the right answer, it’s about getting it right for the right reasons. And with tools like ViUniT, we’re one step closer to making that a reality.

If you are attending CVPR in Nashville, please stop by our poster so we can discuss in person. If not, just let us know your thoughts!

References

  1. ViUniT: Visual Unit Tests for More Robust Visual Programming
     Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, and Juan Carlos Niebles
     In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, Tennessee. Jun 2025


