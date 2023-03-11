Google hits back with a 562-billion-parameter model: more formidable than ChatGPT, and already all over academic feeds

In the latest round of the technology race, Google is fighting back.

Over the past two days, a large model named PaLM-E has been all over feeds in the AI academic community.

With a single sentence of instruction, it can direct a robot to fetch potato chips from a kitchen drawer.

Even if it is interrupted partway through, it carries on with the task.

PaLM-E has 562 billion parameters, more than three times as many as GPT-3, and is billed as the largest vision-language model to date. The team behind it comes from Google and the Technical University of Berlin.

As a large model that handles multimodal information, it also shows strong logical reasoning.

For example, given a set of pictures, it can judge which object is able to roll.

It can also look at a picture and do the math:

One commenter remarked:

This work is one step closer to AGI than ChatGPT!

Meanwhile, Microsoft is also experimenting with using ChatGPT to direct robots.

So has Google gotten there in one step with PaLM-E?

Bigger model with more logic

PaLM-E is a powerful combination of PaLM and ViT.

The 562 billion parameter count is simply the sum of the two models' parameters (540 billion + 22 billion).
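The arithmetic behind these figures can be checked in a few lines. This is only a sanity-check sketch: the PaLM and ViT sizes come from the article, while the GPT-3 size of 175 billion parameters is the commonly cited figure and is an assumption not stated here.

```python
# Parameter counts in billions.
PALM_B = 540   # from the article
VIT_B = 22     # from the article
GPT3_B = 175   # assumption: commonly cited GPT-3 size, not stated in the article

# PaLM-E's headline size is the sum of its two components.
palm_e_b = PALM_B + VIT_B

# How many times larger than GPT-3 that makes it.
ratio_to_gpt3 = palm_e_b / GPT3_B

print(f"PaLM-E: {palm_e_b}B parameters")
print(f"Ratio to GPT-3: {ratio_to_gpt3:.2f}x")
```

Under that assumption, the ratio comes out a little above three, which matches the "more than three times GPT-3" claim.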

PaLM is a large language model that Google released in 2022. It was trained using Pathways, and with chain-of-thought prompting it achieves more accurate logical reasoning and produces fewer errors and less nonsense in its generated content.

Pathways is a sparse model architecture and has been one of Google AI's key development directions over the past two years. Its goal is to train a single general model that can perform hundreds of tasks.

ViT, the Vision Transformer, is a classic work in the field of computer vision.

After combining the two, PaLM-E can handle multimodal information, including:

- Language

- Images

- Scene representations

- Object representations

By adding an encoder, the model can encode image or sensor data into a series of vectors of the same size as language token embeddings, and feed these in as input for next-token prediction, so the whole system is trained end to end.
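The idea of placing encoded observations in the same sequence as language tokens can be illustrated with a toy example. This is a minimal sketch and not PaLM-E's actual implementation: the names `embed_token`, `encode_observation`, and `build_input_sequence`, the dimensions, and the random linear projection are all hypothetical stand-ins for a trained embedding table and a trained encoder such as ViT.

```python
import random

EMBED_DIM = 8    # hypothetical embedding width; real models use thousands
FEATURE_DIM = 4  # hypothetical size of one raw image-patch feature vector


def embed_token(token_id):
    """Toy stand-in for a language model's token-embedding lookup."""
    rng = random.Random(token_id)  # deterministic vector per token id
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def encode_observation(patches, weights):
    """Toy encoder: linearly project each patch into the token-embedding space.

    Returns one EMBED_DIM-sized vector per patch, i.e. a series of vectors.
    """
    return [
        [sum(w * x for w, x in zip(row, patch)) for row in weights]
        for patch in patches
    ]


def build_input_sequence(items, weights):
    """Interleave text tokens and encoded observations into one vector sequence."""
    sequence = []
    for kind, value in items:
        if kind == "text":
            sequence.append(embed_token(value))
        elif kind == "obs":
            sequence.extend(encode_observation(value, weights))
        else:
            raise ValueError(f"unknown item kind: {kind}")
    return sequence


# Random projection weights standing in for a trained encoder.
rng = random.Random(0)
weights = [[rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
           for _ in range(EMBED_DIM)]

# A prompt with one text token, a two-patch "image", then another text token.
items = [
    ("text", 101),
    ("obs", [[0.2, 0.5, 0.1, 0.9], [0.7, 0.3, 0.8, 0.4]]),
    ("text", 102),
]
sequence = build_input_sequence(items, weights)
print(len(sequence), "vectors, each of size", len(sequence[0]))
```

Because every element of `sequence` has the same width, a language model could in principle consume the mixed sequence exactly as it consumes text, which is what makes end-to-end training on next-token prediction possible.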

In terms of specific capabilities, PaLM-E shows fairly strong logical reasoning.

For example, give it a picture and ask it how to make a cake from what it sees.

The model first identifies what is in the image, then explains in 9 steps how to make the cake, from cracking the eggs at the start to washing the dishes at the end.

Someone even joked: why did the robot eat the cake itself before giving it to me?

It can also make judgments based on pictures: Can I ride a bike on this road?

The model makes a series of logical inferences:

1. No entry

2. Except for bicycles

3. No entry except for bicycles

4. The answer is yes

This is indeed very similar to the human thinking process.

Not only that, the model's most powerful feature is that it needs no preprocessing, that is, no advance knowledge of the environment.

It makes judgments and answers based entirely on its own “experience”.

The researchers said that this result demonstrates strong positive transfer.

When trained on tasks from multiple domains, PaLM-E performed better than single-task robot models.

They also found that the larger the language model, the more of its language understanding it ultimately retains.

For example, with the 540-billion-parameter PaLM, PaLM-E's performance on language tasks dropped by only 3.9%.

In the experiments, PaLM-E sets a new state of the art (SOTA) on the OK-VQA benchmark.

Its task completion in simulated environments is also good.

Once again verifying that brute force works miracles

The study has already sparked wide discussion.

It centers mainly on the following points:

1. To some extent, it verifies that "brute force works miracles"

2. Is it closer to AGI than ChatGPT?

On the one hand, as the largest vision-language model currently known, PaLM-E already performs impressively.

Last year, DeepMind also released a generalist large model, Gato, which was trained on 604 different tasks.

But at the time, many people felt it was not truly general, because the research could not show positive transfer between different tasks.

According to the authors of the paper, this may be because the model was not large enough.

Now PaLM-E appears to bear that argument out.

However, some voices worry: will this carry the parameter arms race from NLP over into the CV community?

On the other hand, there is the broader trend to consider.

Some people say that this work looks closer to AGI than ChatGPT does.

Indeed, ChatGPT only provides text suggestions, and much of the actual hands-on work still has to be done by the user.

PaLM-E, by contrast, brings the capabilities of large models to the embodied level, breaking down the barrier between AI and the physical world.

This trend is clearly on everyone's mind. Microsoft also released very similar work not long ago: letting ChatGPT direct robots.

In addition, many people said that this once again verified that multimodality is the future.

However, this result has so far been released only as a paper and demos, and its real capabilities have yet to be verified.

In addition, it has come to light that the team that developed the robot driven by the model was taken over by Google a few weeks ago...

So for further developments on PaLM-E, we will have to wait and see.

Paper address: https://arxiv.org/abs/2303.03378