40 volunteers and a start-up teach GPT-4 proper Icelandic

by admin

Large language models such as OpenAI’s GPT-4 or Google’s LaMDA do not always get their facts right. Their language and grammar, however, are almost error-free, at least in English, German or Spanish. With less widely spoken languages such as Icelandic, the models quickly reach their limits. The Icelandic company Miðeind wants to change that and, at the same time, preserve the island nation’s linguistic culture.

“The Icelandic language actually has a secure status,” says Linda Heimisdóttir, COO of Miðeind. “It is used in everyday life and at school, it is passed on from generation to generation and it looks back on a rich literary heritage.” But in the digital world, the language runs the risk of falling behind. Voice assistants such as Apple’s Siri and Amazon’s Alexa, for example, still do not support Icelandic, and software and online tools are often not localized. The AI language models that are currently shaking up many industries also struggle with this very distinctive language: GPT-4 can understand Icelandic text input quite well, but its output is often grammatically incorrect.

The explanation is simple: the language models are trained on billions of publicly available texts. The more speakers a language has, the more texts are usually available in that language. A total of 4.7 terabytes of text went into the training of Facebook’s LLaMA model. Only 20 gigabytes of that, less than 0.5 percent, was in Icelandic. With just 370,000 Icelandic native speakers worldwide, this is hardly surprising.
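The share quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the two figures come from the article, while the variable names and the decimal convention (1 terabyte = 1000 gigabytes) are assumptions made for illustration.

```python
# Figures as quoted in the article; the unit convention is an assumption.
TOTAL_TRAINING_TEXT_GB = 4.7 * 1000  # 4.7 terabytes, assuming 1 TB = 1000 GB
ICELANDIC_TEXT_GB = 20               # Icelandic portion of the training text

# Fraction of the training text that was in Icelandic.
share = ICELANDIC_TEXT_GB / TOTAL_TRAINING_TEXT_GB

print(f"Icelandic share of training text: {share:.2%}")  # prints 0.43%
```

The result, roughly 0.43 percent, is consistent with the “less than 0.5 percent” stated above.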

“There is simply not enough data in Icelandic to create your own language model,” says Heimisdóttir. Other languages with comparatively few speakers have the same problem. To change that, Miðeind, supported by the Icelandic government, has started a cooperation with OpenAI. GPT-4 and its successors are to be made fit for Icelandic, so that Icelandic companies and citizens can benefit from the technology in their native language in the future and do not have to switch to English. Miðeind itself is developing the Icelandic voice assistant Embla, which could become more flexible with the help of GPT-4.

As a first step, Miðeind enlisted 40 volunteers to teach GPT-4 “correct Icelandic grammar and cultural knowledge”. The method is called “Reinforcement Learning from Human Feedback” (RLHF), and it is used by OpenAI and other developers of large language models to fine-tune the models in specific respects with the help of human feedback. The human trainers give the model a prompt, have it generate several different answers and choose the one they consider best. These choices then steer the model toward the preferred kind of output.
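The comparison step described above can be sketched in a few lines of Python. The article does not describe how OpenAI or Miðeind actually implement it, so this is only an illustration of one commonly described formulation, in which a reward model is trained on pairs of answers with a pairwise (Bradley-Terry style) loss. All names here (`Comparison`, `preference_loss`, the placeholder strings and the scores) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Comparison:
    """One piece of human feedback: two answers to the same prompt."""
    prompt: str
    chosen: str    # the answer the human trainer preferred
    rejected: str  # the answer the trainer ranked lower


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model already scores the
    human-preferred answer higher, and large when it prefers the
    rejected one. Written in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical example: placeholder texts and made-up reward scores.
feedback = Comparison(
    prompt="<an Icelandic prompt>",
    chosen="<grammatically correct answer>",
    rejected="<answer with grammar errors>",
)

agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"loss when the model agrees with the trainer:    {agrees:.3f}")
print(f"loss when the model disagrees with the trainer: {disagrees:.3f}")
```

In a full RLHF setup, a reward model trained this way would then be used to score new outputs while the language model itself is fine-tuned. That second stage is omitted from this sketch.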

“It’s remarkable how well GPT-4 understands Icelandic compared to its predecessor,” says Linda Heimisdóttir. However, the model still has problems generating grammatically correct output. “That’s probably because there’s low-quality, machine-translated data already in the original training data.” As a result, GPT-4 has learned patterns that subsequent human tuning can no longer completely correct. Miðeind would like OpenAI to include cleaned Icelandic data in the pre-training phase of upcoming GPT versions.

The team at the company, which was founded in 2015, believes that the knowledge gained from the cooperation with OpenAI could also benefit other languages in the long term. “We are seeing good results in so-called transfer learning, in which the models are able to extrapolate their knowledge of English and thus acquire amazing skills in other languages despite the limited data available,” says Heimisdóttir. In the future, there might be language models that are specifically optimized for this kind of transfer learning. Large language models could then work well with smaller languages, too.


(jl)
