Date: 2023-12-31

Language AI is built from vast amounts of text, used without the authors’ consent. Now the authors are suing. The future of journalism could depend on the outcome.

In its lawsuit, the New York Times insists on the value of journalistic sources.

The New York Times has sued Microsoft and Open AI over their chatbot Chat-GPT. The newspaper alleges that the tech companies used millions of its articles without permission to train the chatbot.

This is not the first lawsuit against makers of generative artificial intelligence (AI). But it is a particularly serious one. The complaint lists a hundred examples in which Chat-GPT is said to have reproduced entire articles from the New York Times word for word, without citing the source.

This is proof that the tech companies were illegally building a competing product with “New York Times” material, the lawsuit states. It demands damages and a court order to destroy all AI products made with “New York Times” material. This would de facto include Chat-GPT, Bing and all other existing AI products.

Why the AI spits out entire “New York Times” texts

“Move Fast and Break Things” is the motto of Silicon Valley. When it comes to AI, many companies have followed it particularly closely. Until recently, language AI was a niche topic for researchers. They experimented with how computers could be taught language and knowledge using large amounts of data.

Freely available data sets with names such as “Common Crawl”, “Webtext”, “Books1” and “Wikipedia” were used as sources. The texts in them are collected automatically from the Internet. They include posts from Internet forums and Wikipedia entries, but also books and newspaper articles. The fact that these were used without a license didn’t bother anyone as long as AI was a matter of science.

But now AI is a business. Tech companies sell access to businesses and private customers. The tech magazine “The Information” estimated in August that Open AI would generate $1 billion in revenue within a year. The company is valued at around $90 billion. Its main investor, Microsoft, has seen its stock market value rise.

Authors sue, Axel Springer and AP make deals

The authors of the data, however, came away empty-handed. Now they are fighting back: prominent writers have already filed a lawsuit against Open AI. Artists and picture agencies are taking action against image-generating programs that are based on their works.

The New York Times lawsuit is the first by a major media company. Open AI has concluded deals with other media to prevent lawsuits and secure cooperation. The Associated Press (AP) news agency received an unspecified sum for official access to its archives and access to Open AI technology.

The Axel Springer media group, which includes “Bild”, “Die Welt”, “Business Insider” and “Politico”, has also struck a deal with Open AI, as was announced two weeks ago. From 2024, Springer content will appear prominently in chatbot responses with a link to the source, Reuters reported. How much Open AI paid is unknown in this case as well.

The New York Times also negotiated with Open AI. Apparently it was not happy with the terms. Now it could fight for something better, in court or through an out-of-court settlement.

The New York Times insists on the value of its data

In fact, the lawyers for the New York Times have put together a substantial statement of claim. This is not just due to the many examples that show how the chatbot reproduces entire articles from the New York Times. The media company also stresses how important its data was in training Chat-GPT.

Training is the name given to the method used to create language models. Vast amounts of text are fed to the algorithm. Some words are hidden. The algorithm tries to predict them. If it succeeds, it is rewarded. Over time, the AI learns which words fit into which contexts. This works so well that, after training, it can write entire texts.
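The principle of hiding a word and guessing it from its neighbours can be illustrated with a toy example. The following is only a minimal sketch: it uses simple word counts instead of a neural network, and the sample sentences and function names are invented for illustration. It says nothing about how Chat-GPT is actually built.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count which word appears between each (left, right) pair of neighbours."""
    context_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Treat every inner word as the "hidden" word once.
        for i in range(1, len(words) - 1):
            context = (words[i - 1], words[i + 1])
            context_counts[context][words[i]] += 1
    return context_counts


def predict_masked(context_counts, left, right):
    """Return the word seen most often between left and right, or None."""
    candidates = context_counts.get((left.lower(), right.lower()))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Invented sample texts, standing in for the training data.
corpus = [
    "the court hears the case",
    "the court hears the appeal",
    "the judge hears the case",
    "the court rejects the appeal",
]

model = train(corpus)
# Which word fits into "court ___ the"?
guess = predict_masked(model, "court", "the")
print(guess)
```

The more often a word has appeared in a given context in the training texts, the more likely the sketch is to suggest it. This also hints at why texts that appear frequently in the training data can resurface almost verbatim.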

If you want the AI to be able to write about facts in the real world and not just quote from recipes and discussions in online forums, it matters that it is also fed with articles about the real world. This is what the New York Times lawsuit is based on.

In fact, the newspaper’s website is the third most used source in the Common Crawl dataset, after the US Patent Office and Wikipedia. Today, Open AI no longer discloses what data its AI is trained on. But the lawsuit cites an Open AI paper from 2020 which states that Common Crawl was the main data source for GPT-3, the direct predecessor of today’s language models.

The lawsuit now argues that this use of the data without the consent of the copyright owner “New York Times” is unlawful.

One counter-argument is that copyright and ancillary copyright should not apply to AI training. Exceptions to copyright exist where the purpose of a copy differs significantly from that of the original, for instance so that works can be quoted. Whether AI training falls under such an exception has not been settled legally.

The case raises a question that arises in many AI applications: Who has the right to make money with AI that relies on the data of other people and companies? But the case also represents the latest chapter in the ongoing conflict between tech and media companies.

Tech companies and media have been fighting over content for a long time

Due to the switch to digital, the media have lost large parts of their advertising revenue. Today, advertising is primarily placed online, and Google and Meta are the main earners. Media companies have to make up for this lost income.

Some set their sights on the profits of the tech companies: because Google and Facebook benefit from media content on their platforms, supporters argue, they should pay a fee for the small preview snippets, also known as a “link tax”. Corresponding rules have been introduced in some countries.

The complaint from the New York Times explicitly sets itself apart from this. On the one hand, it shows the link with a headline and a short preview that a search engine displays. That is legitimate, it says. It contrasts this with what the Bing chatbot produces: a detailed summary of the requested article without a prominent link to the source.

Everyone could benefit from collaboration

While media outlets benefit from the classic search engine because they are found and clicked on, the chatbot user no longer has to visit the media site at all. This means the media lose potential customers and advertising revenue. That is not sustainable, because someone has to pay for the journalistic work: the on-site research, the fact-checking, the proofreading, all of that.

Technically, media companies now have ways to block the crawlers of Google, Open AI and others, which automatically save newspaper texts and process them into new training data. However, many media outlets fear that Google and Microsoft will punish them for this and rank them lower in their search results.
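What such a block can look like is sketched below, on the assumption that it is done with a robots.txt file, the common convention by which websites tell crawlers what they may fetch. The article does not name the mechanism. The crawler name “GPTBot” is assumed here to be the one Open AI uses, and the domain and “ExampleSearchBot” are placeholders. Compliance with robots.txt is voluntary on the crawler’s side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a news site: the AI crawler is locked out,
# all other crawlers (for example search engines) remain welcome.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL, not a real article.
article_url = "https://news.example.com/2023/12/ai-lawsuit"

# A crawler that respects robots.txt would run this check before fetching.
ai_crawler_allowed = parser.can_fetch("GPTBot", article_url)
search_crawler_allowed = parser.can_fetch("ExampleSearchBot", article_url)

print("GPTBot allowed:", ai_crawler_allowed)
print("ExampleSearchBot allowed:", search_crawler_allowed)
```

In this sketch the block applies per crawler name, so a site could in principle keep its search listing while locking out a training crawler, provided the operator runs the two under separate names.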

A simple blockade would also not be ideal for the public. If AI chatbots actually become established as a tool for finding information, it would be desirable for them to be trained in the best possible way, with up-to-date, credible material.

AI makers also depend on this, as the agreements with Springer and AP show. The media economist Philipp Bachmann from the Lucerne University of Applied Sciences suspects that Open AI is well aware that it is breaking the law: “This is the typical Silicon Valley approach in which you accept lawsuits, fines and damages. The business is still worth it.”

For it to be worthwhile for the media too, acceptable terms are needed. Given its sheer size and its relevance in the data sets, the New York Times has a good chance of securing them.

