Behind the scenes of ChatGPT and Lensa: what are Laion 5B and Common Crawl and how they work

by admin

Since the beginning of December they have been the topic of the day, more or less every day, even for those who are not technology enthusiasts. Indeed, especially among non-tech enthusiasts: “Have you seen what Lensa can do with your photos?”, and also “I wrote a sonnet using ChatGPT”. Or a song, an essay, twenty lines of code.

Lensa and ChatGPT are the two most recent, and also the most solid, examples of what artificial intelligence can do, of all that potential previously only promised and now finally accessible. And accessible to everyone, either by paying or by registering. Unsurprisingly, people’s interest has grown tremendously.

Google Trends: searches for ChatGPT and Lensa in Italy

Crazy about AI

Google searches for these two services, worldwide and in Italy (graph above), have increased dramatically since 30 November, while subscribers to the ChatGPT subreddit on Reddit grew from 0 to 25,000 in about ten days, and those behind ChatGPT said they reached 1 million users within just 5 days of the service’s debut.

To get an idea of what this milestone means, it is useful to remember that Instagram hit its first million after about 75 days and that Spotify took as many as 150 days to get there.

What is scraping and how to train an artificial intelligence

To achieve all this success and to be able to do what they do, Lensa and ChatGPT (but also BERT and MUM, Google’s AI algorithms) started out in much the same way: by reading. Reading the whole Internet, to put it simply: to know what to say, or to understand what we want to tell it, BERT relies on about 300 million parameters, while GPT-3 (the model used by ChatGPT) relies on as many as 175 billion. It read the entire English Wikipedia, which represents just 0.6% of everything it read. Lensa’s developers made it do the same, but with images.

Obviously, these AIs did not read the way we humans do: to study, they use a technique called scraping (literally, scraping or scratching away), which consists of browsing the Net, collecting information and storing it so that it can be used when needed. Again, they do not browse like we do: “Put simply, you write a program that lets the computer do it automatically,” explained Annalisa Barla, associate professor of Computer Science at Dibris, University of Genoa. “You don’t see browser windows opening on the monitor, because it all happens in the background”. And as this happens, the algorithm learns: “You can tell it which sites to consult and which to exclude, and also rank them according to their reliability and authority; for each one it will start from the homepage and read all the pages one by one, collecting texts, images and figures”.
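To make Barla’s description concrete, here is a minimal sketch of such a program in Python. It is only an illustration, not the code used by any of the services mentioned: the function names, the reliability scores and the example.* addresses are all invented, and the "network" is an in-memory dictionary so the sketch runs offline. It ranks sites by score, skips excluded ones, starts from each homepage and reads the pages one by one, collecting their text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects text fragments and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(sites, fetch, excluded=(), max_pages=100):
    """Breadth-first crawl of each site, starting from its homepage.

    sites: dict of homepage URL -> reliability score (hypothetical ranking).
    fetch: function taking a URL and returning HTML, or None on failure.
    excluded: host names that must never be visited.
    """
    collected = {}
    # Visit the sites with the highest score first.
    for home in sorted(sites, key=sites.get, reverse=True):
        host = urlparse(home).netloc
        if host in excluded:
            continue
        queue = deque([home])
        seen = {home}
        while queue and len(collected) < max_pages:
            url = queue.popleft()
            html = fetch(url)
            if html is None:
                continue
            parser = PageParser()
            parser.feed(html)
            parser.close()
            collected[url] = " ".join(parser.text)
            for href in parser.links:
                link = urljoin(url, href).split("#")[0]
                # Stay on the same site and skip pages already queued.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return collected


# In-memory stand-in for the Web (invented pages, no real requests).
FAKE_WEB = {
    "https://news.example/": '<p>Home page</p> <a href="/a">A</a> '
                             '<a href="https://ads.example/">ad</a>',
    "https://news.example/a": '<p>Article text</p> <a href="/">back</a>',
    "https://spam.example/": "<p>Buy now</p>",
}

pages = crawl(
    {"https://news.example/": 0.9, "https://spam.example/": 0.1},
    fetch=FAKE_WEB.get,
    excluded={"spam.example"},
)
for url, text in pages.items():
    print(url, "->", text)
```

A real scraper would replace FAKE_WEB.get with a function that downloads pages, and would also need to respect each site’s rules for automated visitors, handle errors and pace its requests.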

This is the first part of the training, but the AIs we are talking about here are already beyond that: “The model on which ChatGPT is based was closed at the end of 2021,” Barla reminded us, “and now it is learning from something else. It’s learning from us.” What does that mean? “They are training it with users, with the questions they ask and the answers it gives: it learns from this, from the history of these exchanges, as well as from the likes that people can give to the answers”. Similarly, Lensa also learns from us (to make new faces out of our faces).

twitter: ChatGPT passes Machine learning exams

What are Common Crawl and Laion 5B

It must be said that scraping is not usually done by the people who develop AIs: there are organizations that do just that. Two of the largest, although not very well known, are Common Crawl and Laion, and their job is precisely to create huge databases to be fed to artificial intelligences.

The first is an American non-profit which since 2011 has been collecting information online through scraping and making it available free of charge to companies that develop AI algorithms, which can use it (in theory) for non-profit purposes. The idea is that this information, even if protected by copyright, is so useful to the community that it goes beyond copyright protection. Especially if you don’t make money on it. At present the Common Crawl database, which mainly includes text and is believed to be the one used by OpenAI for ChatGPT, contains over 3.1 billion pages and weighs about 420 terabytes.
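To put those two figures in proportion, a quick back-of-the-envelope calculation gives the average size of a stored page. It assumes decimal terabytes (10^12 bytes) and treats "over 3.1 billion" and "about 420" as exact, so the result, roughly 135 KB per page, is only an order of magnitude.

```python
PAGES = 3.1e9            # "over 3.1 billion pages"
SIZE_TB = 420            # "about 420 terabytes"
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes

# Average size of one stored page, in kilobytes (1 KB = 1000 bytes).
avg_kb = SIZE_TB * BYTES_PER_TB / PAGES / 1000
print(f"average page size: about {avg_kb:.0f} KB")
```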

Laion is German and does more or less the same thing, but with images: its name stands for Large-scale Artificial Intelligence Open Network. Its latest product, Laion 5B, currently groups almost 6 billion images with captions describing them, and is a database used (among others) for training Imagen and Stable Diffusion, which work in a similar way to the better-known Dall-E 2. Here too, this huge amount of material is made available free of charge to those who develop AI, who should therefore (in theory) reuse it without making a profit.

The copyright problem

This last point is important, especially when it comes to photos: as we reported on Italian Tech, Lensa definitely earns from this data, asking users for 29.99 euros for an annual subscription, or 2.99 euros for a set of 50 face interpretations. And on Twitter there are plenty of examples of images produced by Lensa that still show the signature of the artist who made the original work the AI relied on to create its own. It could not be otherwise: artificial intelligences are trained on other people’s works.

And yet: is it right for those who obtained this knowledge for free to charge for it? Again: is it right that whoever originally produced this knowledge should not be paid for its exploitation? That a digital artist does not see their effort and skill rewarded financially? If you were the heirs of Monet, Van Gogh or Picasso, wouldn’t you like to be paid by those who exploit the skills of your famous ancestors to set up a profitable business?

Probably yes, although it should be considered that an AI, especially at this level, has enormous costs, both in development and in operation (for GPT-3, estimates speak of an initial cost of about 5 million dollars): “To make the so-called LLMs work, i.e. models based on billions of parameters, you need computing power that is almost immoral from an environmental point of view,” Barla told us, partly joking and partly not. “They are hosted on servers mainly made up of GPUs, which work at rates similar to those typical of cryptocurrency mining”. Once they are ready, they “require huge hardware resources to run: you need to be able to handle millions of users simultaneously, and you also need large storage capacity to store their information”. Information which will be used to build even better-performing models.

Did I end up in a database?

It is likely that answers to the questions about the legitimacy of these practices will come over time, with use and with practice (and with the work of lawyers), even if one answer can already be had by taking a simple online test. Which, among other things, lets you see first-hand another problem linked to the use of these artificial intelligences: privacy protection, and specifically the protection of one’s own image.

There is a site that was created for this purpose. It is called Have I been trained? and was put online by the artist collective Spawning: it lets you search the Laion database, by text or by image, precisely to find out whether your work, or your face, has been used to train an AI. The idea is that a digital artist can upload one of their creations to Have I been trained? and find out in a few seconds whether it ended up in the scrapers’ net, and therefore whether it is being used without their consent and perhaps without their knowledge. Or that anyone can upload a photo of themselves to find out whether their face (perhaps taken from social networks) is being used to train face recognition algorithms.
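The text side of such a search can be pictured with a short Python sketch. It is a simplified stand-in, not the way Have I been trained? actually works: the records, the artist name and the example addresses are invented, and plain keyword matching on captions is an assumption made only for illustration. It relies on the fact that each entry in a dataset of this kind pairs an image address with a caption.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One dataset entry: an image URL plus the caption describing it."""
    image_url: str
    caption: str


def search_captions(records, query):
    """Return the records whose caption contains every word of the query."""
    words = query.lower().split()
    return [r for r in records if all(w in r.caption.lower() for w in words)]


# Made-up records, for illustration only.
RECORDS = [
    Record("https://img.example/1.jpg",
           "Portrait of a woman, digital painting by Jane Doe"),
    Record("https://img.example/2.jpg", "Sunset over the harbour"),
    Record("https://img.example/3.jpg", "Jane Doe concept art, fantasy castle"),
]

hits = search_captions(RECORDS, "jane doe")
for r in hits:
    print(r.image_url, "-", r.caption)
```

An artist searching for their own name would see every caption that mentions them. Searching by image, the other option the site offers, requires comparing pictures rather than words and is not covered by this sketch.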

The battle to protect one’s face is a very topical one: not only because this software is increasingly used by the police (yes, even in Italy), but also because with deepfake technology you can now do just about anything with anyone’s face.

There are companies, such as the American Affectiva, that have collected over 5 million faces and are using them to try to teach AIs to understand human emotions; and others, like Clearview AI, which before being stopped by the European authorities aimed to have 100 billion photos in its database by 2022, so that “every human being will be identifiable”. What we humans can do, in our small way, is use these tools with intelligence and awareness, and also consider protecting ourselves and somehow hiding, which can be done simply, even with makeup. Or with a sweater.

@capoema
