A four-person research team from Microsoft and two US universities has published LLaVA, the Large Language and Vision Assistant, a multimodal AI model built on well-known AI chat systems. It processes text and images and is freely available for research purposes. LLaVA combines a vision encoder with a large language model, the LLaMA offshoot Vicuna, and was fine-tuned on machine-generated training data produced with GPT-4 through the OpenAI API.

The researchers’ goal was to train a large language model (LLM) for zero-shot use and to test this approach multimodally. “Zero-shot” here means that the model should give meaningful answers to new tasks right away, without being shown task-specific examples in the prompt. LLaVA is multimodal in that it accepts instructions as text, as images, or as a combination of both. Also notable is the research team's statement that LLaVA reaches an accuracy of over 92 percent when fine-tuned for science question answering. If this can be confirmed independently, it would be an improvement over previously reported results.

The researchers used the language-only version of GPT-4 (without the multimodal component) to generate a multimodal set of language-image instructions. By combining a vision encoder with a large language model, LLaVA gained general visual and language skills. According to initial tests, it can immediately describe previously unseen images in text form and is said to behave similarly to the multimodal version of GPT-4: the team reports a score of 85 percent relative to multimodal GPT-4.
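How a vision encoder and a language model can be joined is easier to see in a small sketch. The following toy example is not LLaVA's code: it assumes that image features are mapped into the language model's embedding space by a projection matrix and then placed in front of the text tokens. The dimensions, the random weights, and the names project and build_input_sequence are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: toy dimensions and random weights, not LLaVA's real code.
import random

random.seed(0)

VISION_DIM = 4  # assumed size of one image-patch feature from the vision encoder
LLM_DIM = 6     # assumed size of one token embedding in the language model


def random_matrix(rows, cols):
    """Return a rows x cols matrix of random floats (stand-in for real tensors)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Map each vision feature (length VISION_DIM) into LLM embedding space."""
    return [
        [sum(f * w for f, w in zip(feature, column)) for column in zip(*weights)]
        for feature in features
    ]


def build_input_sequence(image_features, text_embeddings, weights):
    """Place the projected image tokens in front of the text token embeddings."""
    return project(image_features, weights) + text_embeddings


projection = random_matrix(VISION_DIM, LLM_DIM)  # hypothetical trainable layer
image_features = random_matrix(3, VISION_DIM)    # stand-in for 3 image patches
text_embeddings = random_matrix(5, LLM_DIM)      # stand-in for 5 prompt tokens

sequence = build_input_sequence(image_features, text_embeddings, projection)
print(len(sequence), len(sequence[0]))  # 8 tokens, each of length LLM_DIM
```

In this sketch the language model would receive a single sequence of eight embeddings, three derived from the image and five from the prompt, so it can treat both modalities as one input.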

LLaVA answers questions about images: the prompt was the question “What should I look out for when visiting this place?” along with a test image. The response provides detailed information about the scene depicted in the image and advice derived from it. (Image: LLaVA-Website)

External benchmarks are not available because the multimodal version of GPT-4 has so far only been presented and is not publicly available. Only selected partners of Microsoft and OpenAI currently have access to this version. The LLaVA team seems to belong to this circle of the chosen few, especially since one of the researchers involved is employed by Microsoft. Microsoft is OpenAI's main backer: it has invested a total of 11 billion US dollars in the start-up and has held exclusive rights to use OpenAI’s AI models since GPT-3. Since the beginning of the business relationship with Microsoft in 2019, all OpenAI models have been closed source and a black box for the rest of the world. In particular, little is known about the multimodal capabilities of GPT-4, since, unlike the text-based ChatGPT, they cannot yet be tested via a demo. Performance values reported by exclusive partners cannot yet be independently verified.

LLaVA’s publication offers a glimpse into the engine room of Microsoft and OpenAI. It is notable because the team has released the GPT-4-generated data set for visual fine-tuning along with the model and code base. More about the project can be found on the LLaVA website. The research demo can be tried out on a separate domain.

Interaction options in the research demo: The team from Microsoft and two US universities collects user data and asks for user feedback on the results generated with LLaVA. (Image: LLaVA-Website)

A fairly simple rating tool is built into the interface, with which users can rate the results as good or bad (thumbs up: upvote, thumbs down: downvote). In addition, unwanted content can be marked with a warning flag. For an existing prompt, users can request a new answer or clear the history to start over. Two test images are stored in the demo. According to the description, the model has few built-in safety mechanisms and must not be used for illegal, malicious, violent, racist or pornographic purposes, which suggests that it is capable of producing such content. User dialog data is stored “for future research purposes”.

Anyone working with it may “flag” inappropriate responses. Elsewhere, this task is often performed by underpaid clickworkers in Kenya and other countries as part of reinforcement learning from human feedback (RLHF), or by volunteers in open-source crowdsourcing projects. The flags are used to train what appears to be an automatic moderator. Anyone who participates should be aware that they are donating data to Microsoft that the group could potentially use commercially, while the model itself may not be used commercially.

Microsoft and the other project participants collect the user data “for research purposes”. You should be aware of this before you start entering prompts and, for example, uploading your own pictures. Anyone who uses the demo agrees to the terms of use. According to these, the demo is a research preview for non-commercial use only, subject to the LLaMA license terms (a bespoke non-commercial license), the OpenAI terms of use, and the privacy practices of ShareGPT, a platform for sharing and storing ChatGPT conversations. A thread in ShareGPT's GitHub repository discusses privacy issues: apparently there is currently no way to delete data shared via ShareGPT.

LLaMA and its offshoots in the legal gray area

LLaMA has not yet been released as open source by Meta AI (more on that below) and is only available to selected research partners. The restriction to non-commercial, purely scientific purposes therefore also applies to the new LLaVA, which the four AI researchers Haotian Liu and Yong Jae Lee (University of Wisconsin-Madison), Chunyuan Li (Microsoft Research) and Qingyang Wu (Columbia University) have made available on GitHub and Hugging Face, including the data set and the model weights.

Only selected research institutions have officially received the model weights, so LLaMA derivatives are currently subject to legal reservations and may only be used for research purposes, not commercially. Some LLaMA offshoots do not come from a research cooperation but from an illegal BitTorrent leak and are therefore subject to even greater reservations.

Target group: hobby researchers

The LLaVA-Instruct-150K synthetic data set is available on Hugging Face. The data dates from April 2023 and was generated through the GPT-4-0314 API. As the LLaVA team points out, the primary target group is scientists and people who are interested in computer vision, NLP, machine learning and AI as a hobby. The data set is released under the Creative Commons Attribution-NonCommercial 4.0 International license, and anyone who uses it must also comply with OpenAI's terms of use. These exclude the use of GPT-4-generated data sets to create competing products.

The research paper (“Visual Instruction Tuning”) is available on arXiv.org. The model code, including weights and an evaluation, is available on GitHub. Questions, comments and problems can also be submitted via GitHub.



(sih)

