Announced at the I/O developer conference in May, Gemini has finally been unveiled by Google: its first natively multimodal AI model, that is, one capable of understanding and operating on different types of information, including text, code, audio, images and video. A real response to ChatGPT and generative AI had long been expected from the company that invented transformers, and it has arrived.
In a blog post, the Mountain View scientists write that it is also Google's most flexible AI model, capable of running on anything from data centers to mobile devices. It comes in three versions: Gemini Ultra, the largest model, built for highly complex tasks; Gemini Pro, which will power Bard and the search engine; and Gemini Nano, the most efficient model, able to run on smartphones starting with the Pixel family. According to a benchmark table Google released, Ultra, the most powerful version, outperforms GPT-4.
The difference between Gemini Ultra and ChatGPT (and us humans)
Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding), a benchmark that combines 57 subjects such as mathematics, physics, history, law, medicine and ethics to test both world knowledge and problem-solving ability. In other words, it answers questions, summarizes text and translates better than humans. Let us remember, however, that these systems induce and deduce but are not yet capable of formulating reasonable hypotheses about an observed situation, that is, of reasoning about the best explanation of the facts. GPT-4 is not multimodal in the strict sense of the term. It is an advanced language model that can understand and generate text, but it does not directly process other types of input, such as images or sounds. However, GPT-4 can interact with other tools and models that handle multimodal input: for example, it can use DALL-E to create images from text descriptions or collaborate with sound-processing systems for specific applications. So, while GPT-4 itself is not multimodal, it can be part of a larger multimodal system. Gemini Ultra also excels in several coding benchmarks, including HumanEval, a leading industry standard for evaluating performance on coding tasks, and Natural2Code, an internal Google dataset that uses author-generated sources rather than web-based information. Gemini can also serve as the engine for more advanced coding systems.
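The "larger multimodal system" idea above can be sketched in a few lines: a text-only model stays unimodal, while the system around it routes non-text requests to a specialist tool. This is a minimal illustrative sketch; `text_model`, `image_tool` and the `draw:` convention are hypothetical stand-ins, not real Google or OpenAI APIs.

```python
# Sketch of a text-only model embedded in a larger multimodal system.
# All functions here are hypothetical placeholders.

def text_model(prompt: str) -> str:
    # Stand-in for a text-only LLM such as GPT-4's text interface.
    return f"[text answer to: {prompt}]"

def image_tool(description: str) -> str:
    # Stand-in for an image generator such as DALL-E.
    return f"[image rendered from: {description}]"

def route(request: str) -> str:
    # The language model itself stays unimodal; the surrounding system
    # decides which modality-specific tool should handle the request.
    if request.lower().startswith("draw:"):
        return image_tool(request[5:].strip())
    return text_model(request)

print(route("Explain transformers in one line."))
print(route("draw: a robot reading a newspaper"))
```

A natively multimodal model like Gemini would instead process the image or audio input inside a single network, rather than handing it off at this routing layer.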
What can Gemini do that’s new?
Gemini relies on reinforcement learning, that is, a reward-and-punishment system that teaches the model how to behave depending on the situation. The model has been trained to recognize and understand text, images, audio and more simultaneously, so it can better grasp nuanced information and answer questions about complex topics. The demos shown reveal a multimodal reasoning ability: the model can make sense of complex written and visual information, contextualize what it sees and answer questions about complicated topics. This, Google writes, makes it particularly good at explaining its reasoning in complex subjects such as mathematics and physics.
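The reward-and-punishment idea can be shown with a toy example: an agent tries two actions and gradually learns to prefer the one that is rewarded more often. This is a generic multi-armed-bandit sketch of reinforcement learning in miniature, not Gemini's actual training procedure; the action names and reward probabilities are invented for illustration.

```python
import random

# Toy reinforcement learning: reward (+1) and punishment (-1) shape
# the agent's estimate of each action's value. Not Gemini's training.
random.seed(0)
values = {"A": 0.0, "B": 0.0}        # estimated value of each action
counts = {"A": 0, "B": 0}
REWARD_PROB = {"A": 0.2, "B": 0.8}   # hidden chance each action is rewarded

for step in range(500):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(["A", "B"])
    else:
        action = max(values, key=values.get)
    reward = 1.0 if random.random() < REWARD_PROB[action] else -1.0  # punishment
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    values[action] += (reward - values[action]) / counts[action]

print(max(values, key=values.get))  # the agent ends up preferring "B"
```

In a generative-AI setting the "actions" are model outputs and the reward signal typically comes from human or model feedback, but the underlying loop — act, receive reward or punishment, update — is the same.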
How it was made
Gemini was trained on a generation of proprietary Tensor Processing Unit (TPU) v4 and v5 accelerators that Google says are more powerful, scalable and efficient. Together with Cloud TPU v5p, designed for training cutting-edge AI models, this new generation of TPUs announced today will accelerate Gemini's development and help developers and enterprise customers train large-scale generative AI models more quickly. The system complies with Google's Responsible AI principles and has undergone more comprehensive safety evaluations than any AI model made in Mountain View to date, including evaluations for bias and toxicity. Google uses benchmarks such as Real Toxicity Prompts, a set of 100,000 prompts with varying degrees of toxicity drawn from the web and developed by experts at the Allen Institute for AI. Further details on this work will be available shortly.
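The shape of a toxicity benchmark like the one described above is simple to sketch: generate a continuation for each prompt, score it with a classifier, and report an aggregate. The sketch below is purely illustrative; `generate` and `toxicity_score` are hypothetical placeholders, not the Real Toxicity Prompts tooling or any real classifier.

```python
# Hedged sketch of a toxicity-benchmark evaluation loop.
# Both functions are invented placeholders for illustration only.

def generate(prompt: str) -> str:
    # Placeholder model: appends a bland continuation.
    return prompt + " ... [model continuation]"

def toxicity_score(text: str) -> float:
    # Placeholder classifier: flags a tiny blocklist, returns 0.0-1.0.
    blocklist = {"hate", "stupid"}
    words = set(text.lower().split())
    return 1.0 if words & blocklist else 0.0

prompts = ["The weather today is", "People who disagree are stupid and"]
scores = [toxicity_score(generate(p)) for p in prompts]
print(max(scores))  # worst-case toxicity across the benchmark prompts
```

A real evaluation would run this loop over all 100,000 prompts with a learned toxicity classifier and report distributional statistics, not a toy blocklist.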