The new technology, called Hyena, can match the accuracy of GPT-style language models on benchmarks while using, in some cases, up to 100 times less computing power.

Despite the global buzz around OpenAI’s AI chatbot ChatGPT and its latest AI language model, GPT-4, these language models are, at the end of the day, just software applications. Like all apps, they have technical limitations.

In March of this year, artificial intelligence scientists at Stanford University and the MILA institute for AI in Canada jointly published a paper proposing a new technology, Hyena. They argue that it could be far more efficient than GPT-4 or anything like it at taking in large amounts of data and turning it into the answer a user wants.

On benchmarks such as question answering, Hyena achieved accuracy comparable to GPT-style models while using only a fraction of the computing power. In some cases, Hyena can handle amounts of text far beyond what GPT-4 accepts, which is limited to roughly 25,000 words at a time.

“Our promising results at the sub-billion parameter scale suggest that attention may not be all we need,” the Hyena authors write. The remark alludes to the title of a milestone in artificial intelligence research, the 2017 paper “Attention Is All You Need,” in which Google scientist Ashish Vaswani and his colleagues introduced the Transformer model, a neural network structure. A trainable neural network can be built by stacking Transformer blocks. It is good at language understanding tasks and requires less computing power to train than earlier models. The Transformer has great potential and has become the basis of many large language models, such as ChatGPT.

However, the Transformer has a major flaw. To process its input, it relies on an “attention mechanism,” an idea loosely borrowed from human attention: rather than treating all of the input equally, the network weighs every part of it and focuses on the most relevant pieces when producing its output.

This attention mechanism has “quadratic computational complexity”: its time and memory costs grow with the square of the sequence length, so it copes poorly with long text sequences. The flaw is inherent in all large language programs built on it, including ChatGPT and GPT-4. Quadratic complexity means that the time ChatGPT takes to generate an answer grows as the square of the amount of input it is given.
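
To make the quadratic growth concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not code from the paper: the function name attention_weights is hypothetical, and for simplicity the input itself plays the role of both queries and keys, with no learned projections.

```python
import numpy as np

def attention_weights(x):
    """Return the (L, L) matrix of softmax attention weights for input x of shape (L, d).

    Simplifying assumption: x is used directly as both queries and keys.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])        # one score for every pair of positions
    scores -= scores.max(axis=1, keepdims=True)   # subtract row maxima for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
for length in (128, 256, 512):
    w = attention_weights(rng.normal(size=(length, 16)))
    # The number of pairwise scores is the square of the sequence length.
    print(length, w.size)
```

Each doubling of the input length quadruples the number of scores that have to be computed and stored, which is the cost the Hyena authors set out to avoid.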

At some point, if the prompt is too long, either the program cannot provide an answer or it must be given more and more computing power to keep up, which leads to a surge in the computing requirements of AI chatbots.

In the new paper, “Hyena Hierarchy: Towards Larger Convolutional Language Models,” lead author Michael Poli of Stanford University and his colleagues propose replacing the Transformer’s attention function with a “subquadratic” alternative, namely Hyena.

The authors do not explain the name “Hyena,” but one can imagine various reasons. Hyenas are animals that live in Africa and can hunt across miles of territory. In a sense, a very powerful language model can act like a hyena, ranging across tens of thousands of words of text in search of “answers.”

But as the title suggests, what the authors really care about is “hierarchy.” Hyena clans have a strict hierarchy: generally speaking, the hyena queen ranks highest and leads the entire group, followed by the cubs, with the male hyenas ranking lowest. In a somewhat similar fashion, as you will see, the Hyena program applies a series of very simple operations over and over again, combining them to form a hierarchy of data processing. It is presumably this combinatorial element that gives the program its name.

The paper’s contributing authors include leading figures in artificial intelligence, such as Yoshua Bengio, scientific director of MILA and a winner of the 2018 Turing Award (often described as computing’s equivalent of the Nobel Prize). Bengio is credited with developing the attention mechanism long before Vaswani and his team applied it in the Transformer. Another co-author, Christopher Ré, an associate professor of computer science at Stanford University, has in recent years helped advance the notion of artificial intelligence as “software 2.0.”

To find an alternative to the “quadratic computational complexity” of the attention mechanism, Poli and his team set out to study how the attention mechanism works.

A recent body of experimental research in artificial intelligence, known as mechanistic interpretability, is yielding insights into the internals of neural networks, including how the attention mechanism works. You can think of it like taking apart a computer, looking at its individual parts, and figuring out how it works.

Poli and his team cite a series of experiments by Nelson Elhage, a researcher at the artificial intelligence startup Anthropic. Those experiments analyzed the Transformer’s algorithmic structure as a whole to clarify what the Transformer is actually doing when it processes and generates text, and they explored in depth how the underlying attention mechanism works.

Essentially, Elhage and his team found that, at its most basic level, attention works through very simple computer operations, such as copying a word from earlier input into the output. Suppose the input is “Teacher Judy is so busy… because Teacher X…”, where X should be “Judy.” The attention mechanism looks at the last word in the context, “Teacher,” searches the earlier context for the word that followed a previous occurrence of it, and outputs that associated word as the model’s prediction.

As another example, suppose a person types into ChatGPT a sentence from “Harry Potter and the Sorcerer’s Stone,” such as “Mr. Dursley was the director of a firm called Grunnings…”. Later, typing just “Durs,” the beginning of the name, may be enough to prompt the program to complete it as “Dursley,” because it has already seen the name earlier in the text. The system can copy the characters “ley” from memory to complete the output.
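
The look-back-and-copy behavior described above can be sketched in a few lines of Python. This is a toy illustration, not how a Transformer is actually implemented: the function name copy_from_context and the way the sentence is split into tokens are assumptions made for the example.

```python
def copy_from_context(tokens):
    """Predict the next token by finding the most recent earlier occurrence
    of the last token and copying whatever followed it."""
    last = tokens[-1]
    # Scan backwards over the earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # the last token never appeared before, so there is nothing to copy

# Hypothetical tokenization of the Harry Potter example.
context = ["Mr", ".", " D", "urs", "ley", " was", " the", " director", " ...", " D", "urs"]
print(copy_from_context(context))
```

Given the partial name at the end of the context, the function finds the earlier “urs” and returns the token that followed it, completing “Dursley.”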

However, as the number of words grows, the attention operation runs into the quadratic complexity problem. More text requires more of what are known as “weights,” or parameters, to run the attention operation.

As the authors write: “Transformer blocks are a powerful tool for sequence modeling, but they are not without their limitations. One of the most notable is the computational cost, which grows rapidly as the length of the input sequence increases.”

Although OpenAI has not disclosed the technical details of ChatGPT and GPT-4, it is understood that they may have a trillion or more such parameters. Running these parameters requires more GPU chips, thereby increasing the computational cost.

To reduce this quadratic computational cost, Poli and team replace the attention operation with a so-called “convolution,” one of the oldest operations in AI programs, refined as far back as the 1980s. A convolution is essentially a filter that can pick out items in data, whether pixels in an image or words in a sentence.
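
As a rough illustration of why a convolution can be cheaper than attention, the sketch below applies a filter as long as the input in two ways: a direct double loop, which is quadratic in the length, and the fast Fourier transform, which takes on the order of L log L operations. The function names are hypothetical and this is not the authors’ implementation, only a NumPy demonstration that the two routes give the same answer.

```python
import numpy as np

def causal_conv_direct(signal, kernel):
    """Quadratic reference: each output mixes the current input and all earlier ones."""
    length = len(signal)
    out = np.zeros(length)
    for t in range(length):
        for s in range(t + 1):
            out[t] += kernel[s] * signal[t - s]
    return out

def causal_conv_fft(signal, kernel):
    """Same result in O(L log L) time using the fast Fourier transform."""
    length = len(signal)
    n = 2 * length  # zero-pad so that circular wrap-around never reaches the kept outputs
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
    return np.fft.irfft(spectrum, n)[:length]

rng = np.random.default_rng(0)
signal = rng.normal(size=256)
kernel = rng.normal(size=256)  # a filter as long as the input itself
print(np.allclose(causal_conv_direct(signal, kernel), causal_conv_fft(signal, kernel)))
```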

Poli and his team built a kind of hybrid: they combined work by Stanford University researcher Daniel Y. Fu and his team, which applies convolutional filters to sequences of words, with research by David Romero and colleagues at the Vrije Universiteit Amsterdam that lets the program change its filter size on the fly. This ability to adapt flexibly reduces the number of costly parameters, or weights, the program needs.
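
One way to picture a filter whose size can change without adding parameters is to generate the filter values from a small function of position. The sketch below is an assumption-laden toy, not the paper’s architecture: the tiny network, its sine activation, the exponential fade-out, and all names are chosen only for illustration.

```python
import numpy as np

def make_filter_params(hidden=8, seed=0):
    """A small, fixed set of weights whose size does not depend on sequence length."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(size=(1, hidden)),
        "b1": rng.normal(size=hidden),
        "w2": rng.normal(size=(hidden, 1)),
        "decay": 4.0,  # illustrative constant controlling how fast the filter fades
    }

def implicit_filter(params, length):
    """Evaluate the small network at every position to produce a filter of any length."""
    t = np.linspace(0.0, 1.0, length).reshape(-1, 1)   # normalized positions, shape (length, 1)
    hidden = np.sin(t @ params["w1"] + params["b1"])    # shape (length, hidden)
    values = (hidden @ params["w2"]).ravel()            # shape (length,)
    window = np.exp(-params["decay"] * t.ravel())       # damp the filter at distant positions
    return values * window

def count_params(params):
    """Total number of scalar parameters."""
    return sum(np.size(v) for v in params.values())

params = make_filter_params()
for length in (64, 4096):
    print(length, implicit_filter(params, length).shape, count_params(params))
```

The same handful of parameters yields a 64-position filter or a 4,096-position filter, which is the sense in which the filter can grow with the text while the parameter count stays fixed.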

As a result, the convolutional filters can be applied to an unlimited amount of text without requiring more and more parameters to keep the program running. As the authors put it, this is an “attention-free” approach.
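
The idea of repeating a few simple operations in place of attention can be sketched as alternating two steps: a long convolution, then an element-wise multiplication by a “gate” derived from the input. The code below is a simplified single-channel illustration under stated assumptions: the gates and filters are random arrays rather than learned projections, and the names long_conv and hyena_like_operator are hypothetical, not the authors’ code.

```python
import numpy as np

def long_conv(signal, kernel):
    """Causal convolution with a filter as long as the input, computed via FFT."""
    length = len(signal)
    n = 2 * length  # zero-padding prevents circular wrap-around in the kept outputs
    return np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)[:length]

def hyena_like_operator(value, gates, filters):
    """Alternate a long convolution with an element-wise gate, once per stage."""
    z = value
    for gate, kernel in zip(gates, filters):
        z = gate * long_conv(z, kernel)
    return z

rng = np.random.default_rng(0)
length, stages = 1024, 2
value = rng.normal(size=length)
gates = [rng.normal(size=length) for _ in range(stages)]
filters = [rng.normal(size=length) / length for _ in range(stages)]
print(hyena_like_operator(value, gates, filters).shape)
```

No stage ever forms a table of all pairs of positions, so the cost of each stage is governed by the FFT rather than by the square of the input length.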

“Hyena is able to significantly narrow the quality gap with attention, reaching similar perplexity with a smaller computational budget,” Poli and his team wrote.
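
Perplexity, the measure mentioned in the quote, is conventionally the exponential of the average negative log-probability a model assigns to the correct next words; lower is better. A minimal sketch, with made-up probabilities and a hypothetical function name:

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Illustrative numbers only: a model that is more confident in the right words scores lower.
print(perplexity([0.9, 0.8, 0.7]))
print(perplexity([0.3, 0.2, 0.1]))
```

A model that assigned probability 0.25 to every correct word would have a perplexity of about 4, as if it were choosing evenly among four options at each step.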

To demonstrate Hyena’s capabilities, the authors tested the program against a series of benchmarks that determine how well a language program performs on various artificial intelligence tasks.

One such test is The Pile, an 825 GiB open-source collection of text for language modeling, assembled in 2020 by the nonprofit AI research organization Eleuther.ai. The text is drawn from 22 smaller, high-quality datasets, such as PubMed, arXiv, GitHub, and the US Patent and Trademark Office (USPTO), whose content is more rigorous than typical web text.

The main challenge for the program was to produce the next word when given a batch of new sentences as input. Hyena was able to achieve accuracy comparable to OpenAI’s original GPT program from 2018 with 20 percent fewer computational operations, the researchers wrote, making it the first attention-free convolutional model to match GPT quality.

Next, the authors tested the program on a set of reasoning tasks known as SuperGLUE, introduced in 2019 by academics at New York University, Facebook AI Research, Google’s DeepMind division, and the University of Washington.

For example, given the premise “My body cast a shadow over the grass” and two possible causes, “The sun was rising” or “The grass was cut,” the program is asked to choose the more plausible one and should output “The sun was rising.”

Across multiple tasks, the Hyena model scored at or near the score of a version of GPT, even though it was trained on less than half as much data. Even more interesting is what happened when the authors increased the length of the input: the longer the input, the greater Hyena’s advantage, with the program needing less time than attention to complete the task.

Poli and team argue that they have not merely tried a different approach with Hyena but have broken the quadratic barrier, bringing about a qualitative change in how hard it is for a program to compute results.

Further down the road, they suggest, breaking the quadratic barrier is a key step toward new possibilities for deep learning, such as using entire textbooks as context, generating long-form music, or processing gigapixel-scale images.

Because Hyena’s filters stretch efficiently over many thousands of words, the authors wrote, there is practically no limit to the context of a query to a language program, which could even recall passages of text or the content of earlier conversations.

They write that Hyena is not artificially restricted, for example to nearby words, and can learn long-range dependencies between any elements of the input prompt. Furthermore, besides text, the program can be applied to other forms of data, such as images and perhaps video and sound.

It is worth noting that the Hyena program shown in the paper is small compared to GPT-4 or even GPT-3. GPT-3 has 175 billion parameters or weights, while Hyena has at most 1.3 billion parameters. Therefore, it remains to be seen how Hyena performs when compared comprehensively to GPT-3 or GPT-4.

But if the Hyena program also proves to be efficient on a larger scale, the program could become widely popular—comparable to the popularity that attention mechanisms have achieved in the past decade.

As Poli and his team conclude, simpler subquadratic designs such as Hyena, “informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.”