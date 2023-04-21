Data protection: How OpenAI’s thirst for data boomeranged



Following a ban in Italy and a series of investigations by other EU countries, OpenAI now has just over a week to comply with European privacy laws. Failure to do so could result in hefty fines, data deletion or even a ban in other EU countries.

Experts believe it could be nearly impossible for OpenAI to comply with the regulations. That’s because of the way the data for training the large language models (LLMs) was collected, the experts told MIT Technology Review. The data comes from the internet.

A paradigm of data hunger currently prevails in AI development: the more information fed into the model during training, the better. OpenAI’s GPT-2 model had a training dataset of 40 gigabytes of text. GPT-3, on which ChatGPT is based, was fed 570 GB. OpenAI has not yet disclosed the size of the training dataset for its latest model, GPT-4, but it is likely to be larger.

However, this hunger for data is now coming back like a boomerang under data protection law. The competent authorities are increasingly interested in how OpenAI collects and processes the data used by services like ChatGPT. Privacy advocates also believe that the company has collected people’s personal data, such as names or e-mail addresses, and is using it without their consent. In addition, there is the information generated during use, which OpenAI is assumed to be able to use for further training.

Italy sets a deadline for OpenAI

The Italian data protection authority was the first to act and blocked ChatGPT as a precaution. French, German, Irish and Canadian privacy regulators are now also investigating how OpenAI collects and uses data. The European Data Protection Board (EDPB), the body that brings together the national data protection authorities, is also setting up an EU-wide task force to coordinate the investigations, including possible sanctions against OpenAI.

Italy has now given the ChatGPT operator until April 30 to comply with local laws. This would mean, among other things, that OpenAI would have to ask people for their consent to scrape their data or prove that it has a so-called legitimate interest under the General Data Protection Regulation (GDPR). Websites and social media platforms have previously invoked this legal basis to display personalized advertising, not always successfully in court.

OpenAI must also explain to users in more detail how ChatGPT uses their data, and even give them the opportunity to correct false information that the chatbot produces about them. It must be possible to delete data, and the system must allow a person’s data to be excluded entirely if he or she so wishes.

If OpenAI cannot convince the Italian data protection authority that its data use practices are legal, the company’s offerings could be banned in individual EU countries or even across the entire European Union. OpenAI could also face heavy fines and be forced to delete entire models, or at least the data used to train them. That is the assessment of Alexis Leautier, an AI expert at the French data protection authority CNIL.

Demand for transparency

OpenAI’s violations could be so blatant that the case ends up before the Court of Justice of the European Union, the EU’s highest court, says Lilian Edwards, Professor of Internet Law at Newcastle University. Despite the deadline, the Italian data protection authority could wait a long time for answers to its questions.

And for OpenAI, the stakes couldn’t be higher. The GDPR is currently considered one of the strictest data protection regulations in the world, if not the strictest. And it is copied all over the world. Regulators from Brazil to California are watching closely what happens next. The result could fundamentally change the way AI companies collect and use data.

OpenAI not only has to make its data practices transparent. It must first prove that it has acted in accordance with the GDPR. There are two legal options here: either users gave their consent to the data collection, or the company has the aforementioned “legitimate interest” in it. OpenAI did not obtain consent for scraping large parts of the internet; millions of EU citizens would have had to agree.

That leaves “legitimate interest”. To rely on it, the company must demonstrate to regulators as convincingly as possible that the ChatGPT service is important enough to justify data collection without consent, says legal expert Edwards.

As OpenAI sees it

OpenAI has informed MIT Technology Review that it believes it complies with EU data protection laws. A blog post also states that the company is working to remove personal information from training data upon request, but only “where possible”.

The AI market leader further states that it has trained its models on publicly available and licensed content. There is also input from human workers who helped, among other things, to filter problematic content and evaluate answers (Reinforcement Learning from Human Feedback, RLHF). This is unlikely to be enough to comply with the GDPR.

“In the US there is a doctrine that says things that are publicly available are no longer private, which is not the case in European law at all,” says legal scholar Edwards. The General Data Protection Regulation gives people specific rights as “data subjects”, including the right to be informed about how their data is being collected and used. They may also request that their data be removed from the system, even if it was public from the start.

OpenAI has another problem. The Italian authority says that OpenAI is not transparent about how it collects users’ data during the post-training phase, for example in chat logs of their interactions with ChatGPT.

Fear for the data from the chats

“What’s really worrying is how the data that users reveal in chat is being used,” says French privacy expert Leautier. People tend to share intimate, private information with the chatbot, for example about their mental state, their health or their personal views. According to Leautier, this is problematic because there is a risk that this sensitive data will be passed on to others. Under European law, users must also have the option of having their chat logs deleted. The function exists, but how long the data is retained internally remains unclear.

All this becomes enormously complicated for OpenAI. It will be nearly impossible to identify individuals’ data and remove it from the models, says Margaret Mitchell, AI researcher and chief ethics scientist at AI startup Hugging Face, who previously co-led AI ethics at Google.

Even worse: OpenAI could have avoided many of the conflicts it now faces if it had kept robust records of its data from the start. Instead, according to Mitchell, it is common in the AI industry to build training datasets for large language models by indiscriminately scraping the internet. Third-party companies, especially in low-wage countries, are then hired to manually filter out duplicate or irrelevant information and content involving hatred, violence or child pornography, right down to correcting typos.

These methods and the sheer size of the training datasets mean that AI companies typically have a very limited understanding of how their models are constructed. And so it becomes almost impossible to train them in compliance with data protection regulations.

Needle in a haystack of training data

Most AI companies do not document or provide descriptions of how exactly they collect training data. They typically don’t even know exactly what’s in their dataset, says Nithya Sambasivan, a former AI researcher at Google turned entrepreneur who specializes in handling training data.

Merely finding the data of Italian users in the huge ChatGPT training dataset, for example, is like looking for the proverbial needle in a haystack. And even if OpenAI managed to delete this group’s data, it is unclear whether the deletion would be permanent. Previous studies have shown that training datasets can still be found on the internet long after they were supposedly deleted, because copies of the original usually remain online.

“The state of the art in collecting training data is very, very immature,” says Mitchell. That’s because while a tremendous amount of work has gone into developing state-of-the-art AI models, little has gone into the methods of collecting training data, many of which are ten or more years old.

In the AI community, work on model technology is overemphasized at the expense of all other areas, says Mitchell: “From a cultural point of view, there is this problem in machine learning that working with and on the data is seen as ‘silly work’, while working on the models is seen as the real work.” Sambasivan agrees: data work as a whole lacks the necessary legitimacy.

