The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes and therefore struggles to capture image details. A practical alternative is to use the available MLLMs themselves to generate instruction data for vision-language tasks. However, the currently accessible MLLMs are less capable than their LLM counterparts: they tend to produce inadequate responses and fabricate information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on the fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism that corrects inaccuracies in the data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods but also effectively improves benchmark performance. The models, datasets, and code will be made publicly available.
Language-only GPT-4: Current high-quality multimodal fine-tuning data is generated primarily with language-only GPT-4, as illustrated in Figure 1-(a). This approach requires costly manual pre-annotation and confines both the questions and the generated responses to the existing annotated information. Consequently, if a question falls outside that annotated information, GPT-4 cannot answer it, and even for answerable questions the method loses detailed information present in the image.
VIGC: The proposed VIGC, built on existing Visual Language Models (VLMs), guides the model to generate diverse visual-language question-answer pairs on new images after fine-tuning on initial instruction data. The ability to generate diverse data stems from the fact that both the visual encoder and the large language model have been trained on extensive datasets, providing rich image understanding and language reasoning capabilities. However, we found that data generated directly from the provided instructions suffer from severe hallucination, a common problem plaguing large multimodal models. Fortunately, the Visual Instruction Correction module significantly reduces hallucination through iterative updates.
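To make the generate-then-correct loop concrete, the sketch below shows one possible way to wire VIG and VIC around a fine-tuned VLM. The VLMStub class, the functions vig_generate and vic_correct, and the prompt wording are illustrative placeholders, not the actual VIGC implementation or API; the sketch only conveys the idea of drafting diverse QA pairs and then iteratively re-answering each question against the image to reduce hallucination.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class QAPair:
        question: str
        answer: str

    class VLMStub:
        """Hypothetical minimal interface for a fine-tuned vision-language model."""
        def generate(self, image_path: str, prompt: str) -> str:
            # Placeholder: a real model would return text conditioned on the image.
            return ""

    def vig_generate(model: VLMStub, image_path: str, instruction: str, n: int = 3) -> List[QAPair]:
        """Visual Instruction Generation: draft diverse QA pairs for one image."""
        pairs = []
        for _ in range(n):
            question = model.generate(image_path, f"{instruction}\nGenerate a question about the image.")
            answer = model.generate(image_path, f"Question: {question}\nAnswer based on the image.")
            pairs.append(QAPair(question, answer))
        return pairs

    def vic_correct(model: VLMStub, image_path: str, pair: QAPair, iters: int = 2) -> QAPair:
        """Visual Instruction Correction: iteratively re-answer the question to remove ungrounded content."""
        answer = pair.answer
        for _ in range(iters):
            answer = model.generate(
                image_path,
                f"Question: {pair.question}\nDraft answer: {answer}\n"
                "Re-examine the image and rewrite the answer, removing any content not grounded in it.",
            )
        return QAPair(pair.question, answer)

    if __name__ == "__main__":
        model = VLMStub()
        drafts = vig_generate(model, "example.jpg", "Describe objects and their relations.")
        corrected = [vic_correct(model, "example.jpg", p) for p in drafts]

Separating generation and correction into two functions mirrors the paper's two-module design, with the number of correction iterations left as a tunable parameter.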
LLaVA QA-pairs generated by VIGC
OKVQA and A-OKVQA QA-pairs generated by VIGC
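For reference, the snippet below is a minimal sketch of loading such a QA-pair file, assuming a LLaVA-style JSON layout (a list of records with "id", "image", and "conversations" entries holding "from"/"value" turns). The file name and field names are assumptions for illustration and may not match the released files exactly.

    import json

    # Hypothetical file name; the released VIGC QA-pair files may be named differently.
    # Assumes the common LLaVA-style layout: a list of records, each with "image"
    # and a "conversations" list of {"from": ..., "value": ...} turns.
    with open("vigc_llava_qa_pairs.json", "r", encoding="utf-8") as f:
        records = json.load(f)

    for rec in records[:3]:
        print("image:", rec.get("image"))
        for turn in rec.get("conversations", []):
            print(f'  {turn.get("from")}: {turn.get("value")}')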
@article{wang2023vigc,
title={VIGC: Visual Instruction Generation and Correction},
author={Wang, Bin and Wu, Fan and Han, Xiao and Peng, Jiahui and Zhong, Huaping and Zhang, Pan and Dong, Xiaoyi and Li, Weijia and Li, Wei and Wang, Jiaqi and He, Conghui},
journal={arXiv preprint arXiv:2308.12714},
year={2023}
}
By using this service, users agree to comply with the following terms: the service is a research preview intended exclusively for non-commercial use. It provides only limited safety measures and may produce offensive content. The service must not be used for any illegal, harmful, violent, racist, or sexually explicit purposes.
As a research preview, this service is intended solely for non-commercial use and is governed by the LLaMA model license. If you encounter any potential violation, please contact us immediately.