Proteins designed by AI could be used in industry – or in cancer medicine.

Already possible: AI models that generate protein sequences.

Illustration: Andrin Engel

ChatGPT has taken the world by storm. The chatbot writes essays, poems and even programming code. It has the remarkable ability to generate text that is not only grammatically correct but also sensitive to nuance. The chatbot is based on a specific type of artificial intelligence: so-called large language models.

Now researchers have developed language models that have not learned English, French or German – but the language of biology.

More precisely, the language of proteins. There are surprising parallels between the structure of these central building blocks of biology and the structure of our language: a protein is a chain of amino acids, much as a sentence is a chain of words.

It seems obvious that models that work well for language also have potential in biology. One of the first groups to conduct research in this area was that of Professor Burkhard Rost at the Technical University of Munich. “I was extremely skeptical at the beginning. I didn’t think it would work. An analogy alone does not mean that it will work. There are so many analogies that don’t work,” says Rost.

There are also important differences to keep in mind. Proteins contain only 20 different amino acids, very few compared with the 500,000 words that the Duden dictionary counts as the vocabulary of contemporary German. On the other hand, proteins, at several hundred amino acids, are much longer than a typical sentence, which usually has fewer than 30 words.
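The comparison can be made concrete with a minimal Python sketch. It treats a sentence as a list of word tokens and a protein as a list of one-letter amino-acid codes. The function names and the example strings are illustrative assumptions, not taken from any particular protein language model.

```python
# One-letter codes of the 20 standard amino acids: the entire "vocabulary" of a protein.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def tokenize_sentence(text):
    """Split a sentence into word tokens on whitespace (a deliberately naive tokenizer)."""
    return text.split()

def tokenize_protein(sequence):
    """Split a protein sequence into amino-acid tokens, rejecting unknown letters."""
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"not a standard amino acid: {sorted(unknown)}")
    return list(sequence)

# Illustrative inputs; the protein string is just an arbitrary run of valid letters.
sentence = "The chatbot writes essays, poems and even programming code."
protein = "MKTAYIAKQRQISFVKSHFSRQ"

print(len(tokenize_sentence(sentence)), "word tokens")
print(len(tokenize_protein(protein)), "amino-acid tokens")
print(len(AMINO_ACIDS), "possible amino-acid tokens")
```

Real proteins are usually far longer than this example string, which is exactly the difference the paragraph above points out.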

In the end, a small group of his graduate students persuaded Rost to try language models. In 2019, they became one of the first research groups to apply language models to biological data. Many more scientific teams have followed.

Language models learn from cloze texts

Language models such as the one underlying ChatGPT learn from fill-in-the-blank exercises: part of a text is hidden, and the model has to predict the missing words. There is certainly enough text on the Internet, and every piece of it can be used to train a language model. These huge amounts of data are what make language models so successful.

Biological language models have adopted this recipe for success. The sequences of many proteins are now known: the largest protein database, UniProt, lists over 250 million protein sequences. Each of them can be turned into a cloze text on which the language model is trained.
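How a protein sequence becomes a cloze text can be sketched in a few lines of Python. This is a simplified illustration rather than the procedure of any specific model. The blank symbol, the masking fraction of 15 percent (a value commonly used for this kind of training, not one given in this article) and the example sequence are all assumptions.

```python
import random

MASK = "_"  # hypothetical placeholder for a blanked-out amino acid

def make_cloze(sequence, mask_fraction=0.15, seed=0):
    """Blank out a random fraction of positions.

    Returns the masked sequence and a dict mapping each blanked
    position to the amino acid the model would have to guess.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    answers = {}
    for pos in positions:
        answers[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), answers

# Arbitrary example string of amino-acid letters, not a real database entry.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked, answers = make_cloze(sequence)
print(masked)   # the cloze text shown to the model
print(answers)  # the hidden letters used to check its guesses
```

During training, the model sees only the masked string and is corrected according to how well it guesses the hidden letters. Repeated over millions of sequences, this is what forces it to pick up regularities in how amino acids occur together.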

To fill in the blanks correctly, the model has to learn a lot about language: word meanings, grammar and the connections between words. In the end, all of this information is stored somewhere in the fully trained language model. In the same way, a language model trained on proteins has learned many fundamental things about the properties and interactions of amino acids and proteins.

Understanding the language of proteins can solve many problems

This knowledge is useful. Similar to how chatbots based on language models can generate new, meaningful sentences, functional new proteins can be designed based on biological language models. And just as you can tell chatbots to formulate sentences on a certain topic and in a certain style, you can also tell protein chatbots what properties the generated proteins should have.

To what end? Proteins are extremely versatile and can therefore be used for a wide variety of applications. For example, researchers hope to use the technology to design new antibodies that alert the immune system to cancer cells. To do this, an antibody must fit the cancer cell exactly, like a key in a lock. So far, suitable antibodies have been found for only a few subtypes, such as certain forms of breast cancer. Protein chatbots could soon suggest new antibodies for many types of cancer, which could significantly improve patients’ chances of survival.

Many applications are also conceivable outside of medicine. In industry, there is demand for proteins that can carry out specific chemical reactions, or for proteins that can serve as sensors for dangerous substances.

Designing new proteins has long been an established field of research. Until now, however, scientists have had to limit themselves to making small changes to known proteins in order to obtain new proteins with the desired properties, because not every possible sequence of amino acids results in a functional protein, just as not every sequence of words forms a sentence. Language models now open the door to creating completely new proteins.
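A short calculation shows why blind trial and error is hopeless here. With 20 amino acids to choose from at every position, a chain of a given length can be built in 20 to the power of that length different ways. The Python sketch below only counts these possibilities. It says nothing about how many of them are functional, and the chain lengths are chosen purely for illustration.

```python
NUM_AMINO_ACIDS = 20  # number of different amino acids, as stated in the article

def count_possible_sequences(length):
    """Number of distinct amino-acid chains of the given length."""
    return NUM_AMINO_ACIDS ** length

# Illustrative chain lengths; real proteins often have several hundred amino acids.
for length in (10, 100, 300):
    total = count_possible_sequences(length)
    print(f"length {length}: a number with {len(str(total))} digits")
```

Even for a modest chain of 100 amino acids, the count is a number with 131 digits, far too many to test one by one in a laboratory.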

Biological language models can be used not only to design new proteins but also to predict the properties of proteins for which only the sequence is currently known. This is particularly important in drug development: researchers would like to predict, for example, which proteins in the cell would be good targets for a new drug, or which could interact with existing drugs.

The conceivable applications vary widely, but they all benefit from one thing: the language of biology is always the same. All of these applications can therefore take the same language model as a basis and build on it in different ways. This means that not every research group has to spend time training its own language model. Instead, groups can rely on models that have already been published by the tech giants. One popular example is the large protein language model ESM-2, which was published last year by researchers at Meta.

The technology is still very new and it is difficult to predict where developments will lead. As with any new technology, it will become clear that language models are not omnipotent. However, both the potential and the limits of the new protein language models are far from being explored. The development is at its very beginning. It could be the start of a short-lived hype – or the start of a revolution.

In any case, enthusiasm among scientists is great and growing by the day. This year alone, over 100 scientific publications on protein language models have appeared, and the trend is rising.

Professor Rost firmly expects the success story to continue. “In a few years, language models will be at the forefront of all research involving protein sequences,” he says. He is certain: “It will change biology.”

