From Repubblica to Ansa, via the Italian government website and that of the Ministry of the Interior: these are just some of the Italian sources present in Google's C4, one of the datasets used to train AIs like ChatGPT.
The archive was analyzed by the Washington Post, which published an investigation identifying which sites contributed most to the dataset, based on the number of tokens (words or linguistic units) extracted from each.
The C4 dataset (a filtered version of the better-known Common Crawl) was used to train Google's T5 and Meta's LLaMA. It is a huge archive of texts extracted from websites and made available to AIs so they can learn how humans write. The system is programmed to crawl the Web, filter out content deemed inappropriate and collect the text in a huge database, which is then fed to artificial intelligence. It is just one of the many data collections that AI companies use: although partial, it offers useful insight into how language models work.
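The pipeline described above (crawl the Web, filter inappropriate pages, aggregate the surviving text, and count tokens per source site) can be sketched in a few lines of Python. This is a simplified illustration, not C4's actual code: the real pipeline uses language detection, a published bad-words list and a proper tokenizer, while here the `BLOCKLIST` set, the `keep` filter and whitespace tokenization are hypothetical stand-ins.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical blocklist standing in for C4's real bad-words filter.
BLOCKLIST = {"obscene", "violent"}

def keep(text: str) -> bool:
    """Crude content filter: drop pages containing any blocklisted word."""
    words = set(text.lower().split())
    return not BLOCKLIST.intersection(words)

def tokens_per_domain(pages):
    """Count whitespace-separated tokens per source domain.

    `pages` is an iterable of (url, text) pairs, as a crawl might yield them.
    Filtered pages contribute nothing, mirroring how C4 discards content
    before it reaches the training corpus.
    """
    counts = Counter()
    for url, text in pages:
        if keep(text):
            domain = urlparse(url).netloc
            counts[domain] += len(text.split())
    return counts

# Toy crawl: one acceptable page, one that the filter rejects.
crawl = [
    ("https://www.ansa.it/article1", "breaking news from Rome today"),
    ("https://example.com/bad", "some violent content here"),
]
print(tokens_per_domain(crawl))
```

Rankings like the one the Washington Post published are essentially the sorted output of a per-domain token count of this kind, computed over the whole filtered crawl.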
Data is one of the keys to understanding generative AI: systems like ChatGPT or Midjourney are built on their training material; they learn about the world through the data provided as input. And they tend to inherit the flaws of their only window onto the surrounding reality: they can carry prejudices, biases and copyright problems. It all starts with the data. And the data analyzed and made public by the Washington Post reveals a lot about how language models work.
Italian data, from Repubblica to the government website
Within the article, the Washington Post made available a small search engine that allows you to look up the sites included in the dataset by keyword. It contains a large part of the Italian-language Web. We found, for example, all the main news portals of our country, from Repubblica to Corriere della Sera, up to online-born newspapers such as The Post Internazionale: the most present is Ansa, with over 320,000 tokens. There are, however, also disinformation sites like Imola Oggi, in second place in NewsGuard's top 10 for spreading fake news in 2022, and portals such as TgComNews24, present in the archive with over 400,000 tokens.
There are also institutional sites, such as those of the Italian government and the Ministry of the Interior, among others, which contributed to the training of artificial intelligence with a few thousand tokens (3,800 and 1,500, respectively). There are also some oddities that are genuinely hard to explain: the portal of the Municipality of Solbiate Arno, a town of just over 4,000 inhabitants in the province of Varese, contributed over 150,000 linguistic units to the dataset. Italian public administration is represented almost in its entirety, including the regions: the most present in the archive is Piedmont, whose official website provided 74,000 tokens. The archive also includes the portals of the main Italian hospitals, universities and large companies, specialist magazines, and even the best-known sites for downloading pirated material, such as Il Corsaro Nero: with few exceptions, a large part of the Italian-language Internet can really be found there.
And which sites contributed most to the Italian language? It may come as a surprise, but the only portal from our country in the top 1,000, in 412th place with over 9 million tokens, is Usa In Detail, a sort of directory, in Italian and English, that offers detailed information on the United States. And indeed, according to our analysis, the second most present Italian portal in the archive is Italy in Detail, the same directory dedicated solely to our country, with over 1 million linguistic units in the dataset. The official website of the Vatican is also present, with almost 2 million tokens, though it is available in several languages.
From patents to Wikipedia, up to copyrighted texts
Internationally, English-language or, in any case, multilingual content dominates. In first place in the ranking compiled by Jeff Bezos' newspaper is Google Patents (720 million tokens), a portal that contains the text of patents from all over the world. It is followed by Wikipedia, with 290 million, and then Scribd, a paid digital library, with 100 million.
The list even includes copyrighted material: the symbol that identifies a text protected by copyright appears over 200 million times in the dataset. The collection also contains at least 28 sites identified as distributors of pirated material by the US government, as well as Kickstarter and Patreon, used in different ways to monetize creative projects. This confirms the fears of artists around the world, who worry that AI can appropriate the work of humans and reproduce it automatically. And it is pushing more and more portals, such as Reddit and Stack Overflow, to ask the artificial intelligence giants to pay for the content fed to their systems. Social networks, on the other hand, are absent, with the exception of Twitter: in recent days, Elon Musk has revealed his intention to sue Microsoft precisely over the data used to train AIs.
Furthermore, the dataset also contains personal data, at the very moment when, in Italy, the Privacy Guarantor has dictated its terms to OpenAI for ChatGPT's return online. In particular, the Post's analysis found, among the top 100 most used portals, two sites containing information on voters in Florida and Colorado.
Newspapers and Disinformation: From the New York Times to Breitbart
Another substantial part of the material in C4 consists of news sites, which account for 13% of the total. These include the New York Times, the Guardian, Forbes and the Washington Post itself, among others. But there are also propaganda sources, such as the Russian portals RT and Sputnik News, or Breitbart, the alt-right site formerly run by Steve Bannon.
Despite the filters applied during data collection, the system has not always managed to keep violent, obscene or pornographic content out of the dataset. In particular, the analysis reveals that the collection contains material extracted from white supremacist portals, such as Stormfront, and from anti-trans sites like Kiwi Farms. There is also 4chan, the anonymous forum often at the center of US news for episodes of radicalization and violence.
In light of these revelations, it is not surprising that systems such as ChatGPT can, despite the filters, generate disinformation or, in some cases, even reinforce and perpetuate prejudice, discrimination and racism.
A Western Worldview: The Case of Religion
Above all, C4 appears to be a heavily Western product, tied to a specific worldview. This is particularly visible in an analysis of the 20 main religion-related portals in the archive: 14 of them are of Christian origin and only one is Muslim.
Indeed, according to a scientific paper published in Nature, ChatGPT may carry a strong bias against Islam in particular. The research shows that the phrase "Two Muslims walk into..." is completed by the artificial intelligence with an account of violent actions in 66% of cases.