Date: 2023-06-15

Large language models (LLMs) require large amounts of training data, and what that data contains has a great influence on the quality of the generated text. So it’s no wonder that commercial LLM providers are highly secretive about the content their models were trained on.

For example, OpenAI provides only extremely vague information about its popular GPT-4 model, which is included in the current version of ChatGPT Plus, among other products. It is rumored that the training data contains the entire Wikipedia in several languages, countless Reddit and forum posts, and many public domain works of world literature. But it could be much more, or something completely different.

Book archaeology with a linguistic test

A team of researchers from the University of California, Berkeley has now set out to shed more light on the subject, focusing primarily on books. David Bamman’s group did not hack OpenAI’s servers to get at the training data. Instead, as described in their preprint (title: “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4”), they simply queried the text generator itself through its regular interface, using cleverly designed prompts.

List of “Most Popular” books in GPT-4 and ChatGPT. (Image: Bamman et al/UC Berkeley)

The researchers borrowed a method from linguistics, the cloze test: a kind of quiz in which words are left out of a passage to check whether the system knows the original text. As it turned out, GPT-4 has apparently ingested more than just public domain works, although it knows those particularly well. The LLM also knows various well-known books that will remain under copyright for a long time to come. As expected, the most precisely known titles come from the public domain (see table).
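To make the quiz idea concrete, here is a minimal Python sketch of such a fill-in-the-blank test. It is an illustration, not the researchers’ actual code: it assumes that the omitted word is a character name, and the names `make_cloze`, `cloze_accuracy`, and `toy_guess` are hypothetical. The function `toy_guess` is a stand-in for a real query to a language model; the two sample passages come from public domain novels.

```python
import re

MASK = "[MASK]"  # placeholder shown to the model instead of the name


def make_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` in `passage` with MASK."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def cloze_accuracy(items, guess_fn) -> float:
    """Fraction of passages for which guess_fn recovers the masked name."""
    if not items:
        raise ValueError("need at least one passage")
    correct = 0
    for passage, name in items:
        guess = guess_fn(make_cloze(passage, name))
        if guess.strip().lower() == name.lower():
            correct += 1
    return correct / len(items)


# Hypothetical stand-in for a model call: it "knows" only one of the books.
def toy_guess(prompt: str) -> str:
    return "Ishmael" if prompt.startswith("Call me") else "Unknown"


ITEMS = [
    ("Call me Ishmael.", "Ishmael"),
    ("Alice was beginning to get very tired of sitting by her sister "
     "on the bank.", "Alice"),
]

if __name__ == "__main__":
    for passage, name in ITEMS:
        print(make_cloze(passage, name))
    print(f"accuracy: {cloze_accuracy(ITEMS, toy_guess):.0%}")
```

In a real experiment, `toy_guess` would be replaced by a prompt sent to the model, and the accuracy would be averaged over many passages per book. A high score suggests that the model has seen the text, because a name can rarely be guessed from a short passage alone.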

Harry Potter and “Fifty Shades of Grey”

But that was not all. GPT-4 also “recognized” the first part of J.K. Rowling’s Harry Potter saga with particular accuracy: the UC Berkeley researchers measured an accuracy value of 76 percent. It is followed, of all things, by Orwell’s dystopia “1984” (57 percent), Tolkien’s “The Fellowship of the Ring” (51 percent), the erotic novel “Fifty Shades of Grey” (49 percent), the young adult thriller “The Hunger Games” (48 percent), the parable “Lord of the Flies” (43 percent) and finally “The Hitchhiker’s Guide to the Galaxy” (43 percent). All titles lower on the list, such as “Things Fall Apart”, “Fahrenheit 451” or “A Game of Thrones”, have a lower accuracy of 30 percent or less. But that doesn’t mean GPT-4 doesn’t know them at all.

"We found that the OpenAI models learned a wide collection of copyrighted material," the California researchers said. How well a book is known apparently depends on how frequently its text can be found on the Internet. In many cases the books are probably also available there as pirated copies, which OpenAI was evidently unable to filter out of the material its language model was trained on. The UC Berkeley group calls for more open models, which would also make it easier to check whether an LLM tends to simply regurgitate content it has memorized. And what does OpenAI do? Unfortunately, it still says nothing about the training data.
