Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and retrieving prior knowledge. Recent work shows promise by using a large language model (LLM) to decompose such tasks into an executable program that invokes specialized vision models. However, the generated programs are error-prone: they omit necessary steps, include spurious ones, and cannot recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs.
We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM.
Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
The figure above shows the overall framework of VPD.
VPD consists of two stages: (1) program generation and verification, and (2) distilling step-by-step.
The program generation and verification stage consists of the following steps (a minimal code sketch follows this list):
1. Program generation with LLM
2. Program execution with vision modules
3. Program filtering
4. Converting the program execution trace into a chain of thought
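Putting the four steps together, here is a minimal sketch of the data-generation loop. All helper names (`sample_programs`, `run_program`, `answers_match`, `trace_to_cot`) are hypothetical placeholders for the LLM call, the program executor backed by vision modules, the verification check, and the trace-to-CoT rewriting; they are not the actual VPD code.

```python
# Minimal sketch of program generation and verification, assuming the
# callables passed in implement the LLM, the executor, and the rewriter.
def build_cot_example(image, question, label,
                      sample_programs, run_program, answers_match, trace_to_cot,
                      num_candidates=5):
    """Return one (question, chain-of-thought, answer) sample, or None if
    no sampled program can be verified."""
    # Step 1: program generation -- sample several candidate programs from the LLM.
    candidates = sample_programs(question, num_samples=num_candidates)

    for program in candidates:
        # Step 2: program execution with specialized vision modules
        # (object detection, VQA, OCR, knowledge retrieval, ...).
        result, trace = run_program(program, image)

        # Step 3: program filtering -- keep only programs whose final output
        # agrees with the reference label.
        if result is not None and answers_match(result, label):
            # Step 4: rewrite the execution trace into a natural-language
            # chain of thought describing each reasoning step.
            rationale = trace_to_cot(question, program, trace)
            return {"image": image, "question": question,
                    "rationale": rationale, "answer": label}

    return None  # no verified program; drop this example
```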
Given the synthesized CoT data, we fine-tune VLMs to output these chains of thought, using the same approach as distilling step-by-step.
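As a rough illustration of what the resulting tuning data might look like, the sketch below formats a verified sample into an instruction-tuning example whose target contains the rationale followed by the final answer; the exact prompt and target templates used in the paper may differ.

```python
def to_instruction_example(sample):
    """Format one verified CoT sample for VLM fine-tuning (illustrative template)."""
    prompt = f"{sample['question']}\nExplain your reasoning, then answer."
    target = f"{sample['rationale']}\nAnswer: {sample['answer']}"
    return {"image": sample["image"], "prompt": prompt, "target": target}
```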
So, how does VPD differ from prior approaches to generating visual instruction tuning data? Here we show examples of how LLaVA and VPD generate data. LLaVA prompts an LLM with image captions and lets the LLM generate task inputs and outputs. However, image captions are a coarse representation of an image and do not capture fine-grained attributes and relations. Bounding boxes can complement captions, but LLMs are not good at reading bounding boxes, and only densely labeled datasets like COCO can be used. In addition, LLM generations suffer from hallucination and spurious reasoning steps; even GPT-4V is still not reliable enough to generate faithful reasoning steps for complex tasks. In contrast, VPD generates data by sampling programs from an LLM and then using existing vision tools to produce the reasoning steps. VPD works on any image, captures fine-grained visual details, and yields more factual and consistent reasoning steps.
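To make the contrast concrete, below is an illustrative candidate program for the question from the introduction, "Who invented the musical instrument on the right?". The tool names (`detect`, `vqa`, `query_knowledge`) are placeholders for specialized vision and knowledge modules, not the exact APIs used by VPD.

```python
# Illustrative candidate program an LLM might emit; tool names are placeholders.
def answer_query(image, detect, vqa, query_knowledge):
    instruments = detect(image, "musical instrument")            # object detection
    rightmost = max(instruments, key=lambda box: box.x_center)   # spatial reasoning
    name = vqa(rightmost.crop(), "What instrument is this?")     # recognition
    return query_knowledge(f"Who invented the {name}?")          # external knowledge
```

Executing such a program leaves an explicit trace (the detected boxes, the recognized instrument, the retrieved answer), and it is this grounded trace, rather than a free-form LLM generation, that gets rewritten into the chain of thought.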
In the generalist setting, our PaLI-X-VPD sets a new state of the art on all benchmarks, and PaLI-3-VPD outperforms prior 13B+ VLMs on most benchmarks. Both VPD variants also outperform their instruction-tuning baselines. VPD is likewise helpful for adapting to real-world tasks with limited data: experimenting on the Hateful Memes dataset, we set a new SOTA in both the supervised and unsupervised settings. Surprisingly, unsupervised PaLI-X-VPD even outperforms strong supervised baselines trained with 8,500 labels.
Human evaluation shows that the long-form outputs of VPD models are more accurate, consistent, and faithful than those of baselines instruction-tuned on LLM generations.
Here we show some demos of applying VPD to the content moderation task Hateful Memes. There are two settings: in the unsupervised setting, we use no human labels; in the supervised setting, we use 8,500 "yes"/"no" labels to fine-tune the model. Our models generate human-interpretable reasoning steps and successfully detect hateful memes. The unsupervised model works surprisingly well; we also show one of its failure cases.
@misc{hu2023visual,
title={Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models},
author={Yushi Hu and Otilia Stretcu and Chun-Ta Lu and Krishnamurthy Viswanathan and Kenji Hata and Enming Luo and Ranjay Krishna and Ariel Fuxman},
year={2023},
eprint={2312.03052},
archivePrefix={arXiv},
primaryClass={cs.CV}
}