Are Your Visual Programs Right for the Wrong Reasons?
With the rapid advancement of multimodal AI, we’ve all been impressed by AI’s ability to answer complex questions about images. From “What color is the ball?” to “Is there a red cube to the left of the blue sphere?”, these models seem to get it. But here’s the kicker: even when they give the right answer, they often use the wrong reasoning process. Yep, about 33% of the time, they’re right for all the wrong reasons. Think of it like a student getting the correct final answer on a math test even though the work they show is completely wrong!
This is a huge problem. Why? Because these models might fail spectacularly when faced with slightly different scenarios. This kind of flawed logic can have serious consequences when AI is used in critical applications like self-driving cars or medical imaging.


So, how do we fix this? In software development, we use unit tests to make sure our code is rock-solid. Why not do the same for visual reasoning AI?
That’s exactly what we’ve done with ViUniT (Visual Unit Testing). We’ve created a framework that automatically generates unit tests specifically for visual reasoning programs. Think of them as mini-challenges designed to check if the AI’s logic is sound.
Here’s how it works:
First, a user asks a question about an image. In the example below, the question is: “Is there an elephant in the blue water?” A question-answering AI, or Program Generator, produces a candidate Visual Program that attempts to answer the question.
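To make this concrete, here is a rough sketch of what such a candidate program might look like. The `ImageToolbox`/`Region` helper interface below is an assumption for illustration, not the actual API exposed to the Program Generator.

```python
from typing import List, Protocol


class Region(Protocol):
    """Assumed interface for a detected image region (illustrative only)."""

    def verify_property(self, noun: str, attribute: str) -> bool: ...


class ImageToolbox(Protocol):
    """Assumed vision-helper API available to generated programs."""

    def find(self, object_name: str) -> List["Region"]: ...


def execute_command(image: ImageToolbox) -> str:
    """Candidate program for: 'Is there an elephant in the blue water?'"""
    elephants = image.find("elephant")
    blue_water = [w for w in image.find("water")
                  if w.verify_property("water", "blue")]
    # Subtle flaw: the program answers "yes" whenever an elephant and blue
    # water both appear somewhere in the image, without ever checking that
    # the elephant is actually *in* the water.
    if elephants and blue_water:
        return "yes"
    return "no"
```

A program like this gets many real photos right, because elephants photographed near blue water usually are in it, yet its logic is wrong. That is exactly where our Visual Unit Test Generator framework kicks in: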
- Custom Challenges: Given a visual question, we use a language model to create image descriptions and their expected answers. These are the basis of our unit tests.
- Visual Test: We then use an image synthesis model, also known as a Text-to-Image Generator, to create images matching those descriptions.
- Logic Test: These image-answer pairs form a Unit Test Suite that verifies whether the AI’s program, and the reasoning behind it, actually answers the original question correctly (a minimal code sketch of these steps follows this list).
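Here is a minimal sketch of what these generation steps could look like in code. The `generate_pairs` and `text_to_image` callables stand in for the language model and the Text-to-Image Generator; their names, the example prompts, and the number of tests are assumptions for illustration, not the exact components from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VisualUnitTest:
    description: str      # image description proposed by the language model
    expected_answer: str  # answer a correct program should return on that image
    image: object         # image synthesized from the description


def build_unit_test_suite(
    question: str,
    generate_pairs: Callable[[str], List[Tuple[str, str]]],  # LLM wrapper (assumed)
    text_to_image: Callable[[str], object],                  # diffusion-model wrapper (assumed)
    num_tests: int = 5,
) -> List[VisualUnitTest]:
    """Turn a visual question into a suite of image/answer unit tests."""
    # 1. Ask a language model for image descriptions with known answers, e.g.
    #    ("an elephant standing in clear blue water", "yes"),
    #    ("an elephant walking across a dry savanna", "no"),
    #    ("a lake of blue water with no animals in it", "no").
    pairs = generate_pairs(question)[:num_tests]
    # 2. Render each description into an actual test image.
    return [
        VisualUnitTest(description=d, expected_answer=a, image=text_to_image(d))
        for d, a in pairs
    ]
```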
We can then use our Logic Test to check whether the candidate program passes the unit tests. In the example below, the candidate program fails the tests. Our results show that, using this feedback, we can apply several strategies to help the Program Generator improve its logic.
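As a rough illustration, the scoring step can be as simple as a pass rate over the suite; the exact scoring and error handling in the paper may differ. This sketch reuses the `VisualUnitTest` structure from above.

```python
from typing import Callable, List


def unit_test_score(
    run_program: Callable[[object], str],  # executes the candidate program on one image (assumed)
    tests: List[VisualUnitTest],
) -> float:
    """Fraction of visual unit tests the candidate program passes."""
    passed = 0
    for test in tests:
        try:
            prediction = run_program(test.image)
        except Exception:
            # A program that crashes on a synthesized test image fails that test.
            continue
        if str(prediction).strip().lower() == test.expected_answer.strip().lower():
            passed += 1
    return passed / len(tests) if tests else 0.0
```

A low score is the feedback signal: for instance, the failing test descriptions can be folded back into the prompt when asking the Program Generator for a revised program.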
In our upcoming CVPR 2025 paper [1], we dove deep into what makes a good set of these visual unit tests. We explored different ways to generate them, how to pick the best ones, and even tried different image generation methods.

The Results? Mind-blowing!
ViUniT improves model performance by a significant 11.4 percentage points in accuracy. But that’s not all. We even saw 7-billion-parameter open-source models outperforming gpt-4o-mini by an average of 7.7 points in visual program synthesis. Plus, we reduced the number of “correct for the wrong reasons” answers by a whopping 40%.
Beyond these accuracy gains, we also explored four exciting applications of ViUniT:
- Picking the best AI-generated visual program to answer a given question (sketched after this list).
- Knowing when to say “I don’t know.”
- Improving answers through re-prompting.
- Creating a better unsupervised reward for reinforcement learning and automated model improvement.
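To give a flavor of the first two applications, here is a minimal sketch of using the unit-test pass rate to pick among candidate programs or abstain. The threshold value and the reuse of `unit_test_score` and `VisualUnitTest` from the sketches above are assumptions for illustration, not the paper’s exact procedure.

```python
from typing import Callable, List, Optional, Tuple


def select_or_abstain(
    candidates: List[Callable[[object], str]],  # programs from the Program Generator
    tests: List[VisualUnitTest],
    abstain_below: float = 0.6,                 # hypothetical confidence threshold
) -> Tuple[Optional[Callable[[object], str]], float]:
    """Pick the candidate with the highest unit-test pass rate, or abstain."""
    scored = [(unit_test_score(program, tests), program) for program in candidates]
    best_score, best_program = max(scored, key=lambda pair: pair[0])
    if best_score < abstain_below:
        # No candidate is trustworthy enough: better to answer "I don't know".
        return None, best_score
    return best_program, best_score
```

The same score can also guide re-prompting with the failing tests, or serve as an unsupervised reward when fine-tuning the Program Generator with reinforcement learning.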

In short, ViUniT is a game-changer for making visual reasoning AI more reliable and trustworthy. It’s like giving these models a rigorous logic test, ensuring they’re not just getting lucky, but actually understanding what they’re seeing.
So, next time you’re amazed by an AI’s visual reasoning skills, remember: it’s not just about getting the right answer, it’s about getting it right for the right reasons. And with tools like ViUniT, we’re one step closer to making that a reality.
If you are attending CVPR in Nashville, please visit our poster and we can discuss in person. If not, just let us know your thoughts!
References