
Computer vision and image recognition: focus on datasets

by admin

A recent MIT study takes a critical look at the standard image datasets used to train today's computer vision models, judging them too “easy” and too “simple”: training on easy data leads to poor real-world results.

The training of computer vision systems for precise image recognition, and therefore for recognizing the objects that populate the scene under analysis, suffers from a fundamental flaw.

The question was raised by a group of researchers from the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds and Machines (CBMM) – both within the Massachusetts Institute of Technology (MIT) – authors of the study illustrated in “How hard are computer vision datasets? Calibrating dataset difficulty to viewing time” and presented at the annual conference “Neural Information Processing Systems” (NeurIPS), which was held in New Orleans from December 10 to 16, 2023.

Let’s start by saying that, in the field of artificial intelligence studies, the ability to “recognize” an image presupposes the identification of the things, people and places present within it and represents the “basis” of the tasks required of a computer vision model. More refined and complex operations then derive from this ability, including classification and segmentation of the same image, analysis of the interactions between the objects that compose it, and of their movements in the space under examination.

According to the MIT team, the underlying flaw lies in the fact that, despite the numerous works in recent years aimed at improving the precision and analysis times of the artificial intelligence models responsible for image recognition, the standard datasets with which they are trained continue to be dominated by image data that is «too simple».

Those who build them tend towards «subsampling of images considered difficult for the machine», the research group notes. This inevitably produces datasets biased towards less complex images and, as a consequence, an overestimation of laboratory performance, when it is real-world performance that matters: especially cases in which the images to be analyzed show distorted shapes, low resolution, occlusions or shifts in distribution within the represented scene.

If, for a long time, the chief concern of those who create standard datasets for training artificial vision systems was “quantity”, today it is no longer possible to ignore the difficulty and complexity of the image data to be analyzed. Inspired by the lengthening of processing times of visual stimuli in humans when faced with images considered “difficult”, the MIT researchers defined a methodology for calculating the difficulty level of training data. The tests carried out to prove the validity of the new methodology used images taken from two well-known standard datasets, ImageNet and ObjectNet, and confirmed the team’s starting hypothesis: both databases are skewed towards simple images, recognizable in a short time.


Computer vision and image recognition: it is urgent to measure the degree of difficulty of the training data

The quality of an artificial intelligence system is directly proportional to the quality of the data used to train it. This should never be forgotten, all the more so in computer vision and image recognition, whose applications range from autonomous driving to diagnostic imaging, from advanced video surveillance to predictive maintenance in industry, to name just a few.

«In general – the authors underline – the problem of standard training datasets persists, because AI developers have no indication of their difficulty level. And without this information it becomes complicated to objectively evaluate the progress of an artificial vision system and its approach to human performance across the entire range of difficulty».

For years, the biggest concern of those assembling datasets to train AI algorithms for image recognition has been their size: the slogan was “bigger is better”, “the more data we put together, the better the training will be”. The concept of “complexity”, which is instead typical of human vision, has been completely ignored.

By focusing instead on techniques and methods for measuring the difficulty of image data as it is collected, it becomes possible to calibrate the datasets and create the resources needed to develop AI systems with more balanced performance, the team points out.

The “Minimum Viewing Time” metric

Some images require more time to be processed, recognized and classified by the human visual system. This lengthening is due, for example, to poor lighting, blurry images, or a cluttered, crowded scene in which objects overlap, sit outside the foreground or are partially hidden.

Based on this principle, the authors of the study on computer vision and image recognition developed a metric called “Minimum Viewing Time” (MVT), «able to quantify the difficulty in recognizing an image based on the time a subject needs to view it before making a correct identification», they explain.
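To make the metric concrete, here is a minimal Python sketch of how a per-image Minimum Viewing Time could be derived from human trial records. The record format, the 50% accuracy threshold and the example image names (`dog_001`, `cluttered_scene_042`) are illustrative assumptions, not the study’s exact procedure.

```python
from collections import defaultdict

def minimum_viewing_time(trials, accuracy_threshold=0.5):
    """trials: iterable of (image_id, duration_ms, correct) tuples.
    Returns, per image, the shortest presentation duration at which the
    fraction of correct answers reaches the threshold (None if never)."""
    by_image = defaultdict(lambda: defaultdict(list))
    for image_id, duration_ms, correct in trials:
        by_image[image_id][duration_ms].append(bool(correct))

    mvt = {}
    for image_id, by_duration in by_image.items():
        mvt[image_id] = None
        for duration_ms in sorted(by_duration):      # scan from shortest flash upward
            answers = by_duration[duration_ms]
            if sum(answers) / len(answers) >= accuracy_threshold:
                mvt[image_id] = duration_ms          # first duration that is "enough"
                break
    return mvt

# Example: a clearly visible object is recognized at 17 ms; a cluttered scene
# needs 10 seconds (the two extremes of the durations used in the study).
example_trials = [
    ("dog_001", 17, True), ("dog_001", 17, True),
    ("cluttered_scene_042", 17, False), ("cluttered_scene_042", 10000, True),
]
print(minimum_viewing_time(example_trials))
# -> {'dog_001': 17, 'cluttered_scene_042': 10000}
```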


The new metric was tested on a sample of people using subsets of ImageNet and ObjectNet. The first is a large set of real images taken from the Web (over 14 million, all labeled), created specifically for training in the field of computer vision; the second is a similar dataset in which – unlike the previous one – the objects portrayed have completely random backgrounds, viewpoints and rotations.

ImageNet and ObjectNet, two standard datasets under investigation

During the test, participants were shown images flashed on a screen, one at a time, for durations ranging from 17 milliseconds to 10 seconds. The task was to classify the object correctly, choosing from 50 options.

Images that could be recognized from short flashes are considered “easy” to identify, while those that required seconds of viewing fall into the “difficult” category. The objective was a single one: to measure the difficulty level of images taken from ImageNet and ObjectNet, which the MIT researchers suspected of undersampling difficult images. This was the starting hypothesis.
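As a rough illustration of that easy/difficult split, the following sketch bins images by their estimated MVT; the 150-millisecond cutoff and the example values are arbitrary choices for demonstration, not figures taken from the paper.

```python
def bin_by_difficulty(mvt_by_image, easy_cutoff_ms=150):
    """Split images into 'easy' / 'hard' / 'unrecognized' by their MVT."""
    bins = {"easy": [], "hard": [], "unrecognized": []}
    for image_id, mvt_ms in mvt_by_image.items():
        if mvt_ms is None:
            bins["unrecognized"].append(image_id)   # never reliably recognized
        elif mvt_ms <= easy_cutoff_ms:
            bins["easy"].append(image_id)           # recognized from a short flash
        else:
            bins["hard"].append(image_id)           # needed seconds of viewing
    return bins

mvt_by_image = {"dog_001": 17, "cat_007": 50, "cluttered_scene_042": 10000}
bins = bin_by_difficulty(mvt_by_image)
easy_share = len(bins["easy"]) / len(mvt_by_image)
print(bins, f"share of easy images: {easy_share:.0%}")
```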

After more than 200,000 trials, both datasets proved to be skewed towards simpler images, recognizable in a shorter time, with the vast majority of measured performance coming from images that the participants found easy.

Some of the images shown to participants while the “Minimum Viewing Time” metric was put to the test: from the simplest, on the left, to the more complex, on the right. Above each image, the minimum viewing time before it was correctly recognized, from 17 milliseconds to 10 seconds (Source: “How hard are computer vision datasets? Calibrating dataset difficulty to viewing time” – Computer Science and Artificial Intelligence Laboratory (CSAIL) and Center for Brains, Minds and Machines (CBMM) at the Massachusetts Institute of Technology).

At the end of the experiment, the team made available the datasets used – the images of which were marked according to the difficulty of recognition – as well as a series of tools to automatically calculate the Minimum Viewing Time, thus allowing other working groups to add this metric to existing benchmarks and extend it to various applications.
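To picture how such a difficulty annotation could be folded into an existing benchmark, here is a small, hypothetical sketch that reports a model’s accuracy separately for easy, medium and hard images; the bin edges and the record format are assumptions made for illustration, not the authors’ released tooling.

```python
def accuracy_by_difficulty(records, bin_edges_ms=(100, 1000)):
    """records: iterable of (mvt_ms, correct) pairs for one model's predictions.
    Returns accuracy per difficulty bin instead of a single aggregate number."""
    totals = {"easy": [0, 0], "medium": [0, 0], "hard": [0, 0]}  # [correct, seen]
    for mvt_ms, correct in records:
        if mvt_ms <= bin_edges_ms[0]:
            name = "easy"
        elif mvt_ms <= bin_edges_ms[1]:
            name = "medium"
        else:
            name = "hard"
        totals[name][0] += int(correct)
        totals[name][1] += 1
    return {name: (c / n if n else None) for name, (c, n) in totals.items()}

# A model that looks strong overall may still collapse on long-MVT images.
print(accuracy_by_difficulty([(17, True), (50, True), (2500, False), (10000, False)]))
# -> {'easy': 1.0, 'medium': None, 'hard': 0.0}
```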

Computer vision and image recognition: the next steps of research

To improve machines’ ability to process and classify visual data, it is important to find as many correlations as possible between these operations and the difficulty expressed by the required “viewing time”. The goal is to generate more difficult (or easier) versions of the image datasets used during training. The keyword is “calibration”, says the study team on the subject of computer vision and image recognition:

«This will help develop more realistic benchmarks, which will lead not only to improvements in the performance of computer vision systems, but also to fairer comparisons between artificial intelligence and human visual perception»

In the future – the team continues – with modifications to the recent experiment, «an MVT difficulty metric could also be created for classifying multiple objects at the same time. Calibrating our field to what humans can do across a wide range of vision tasks, given certain datasets and conditions, remains a significant challenge, but one that we now believe can be addressed».
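Returning to the idea of generating more difficult (or easier) versions of a training dataset, the sketch below shows one possible reading of this “calibration”: resampling a dataset so that long-MVT (hard) images are drawn more or less often. Weighting directly by MVT is an illustrative choice and does not claim to reproduce the authors’ method.

```python
import random

def resample_by_difficulty(mvt_by_image, k, harder=True, seed=0):
    """Draw k image ids, weighting long-MVT images up (harder=True) or down."""
    rng = random.Random(seed)
    items = [(img, mvt) for img, mvt in mvt_by_image.items() if mvt is not None]
    weights = [mvt if harder else 1.0 / mvt for _, mvt in items]
    return rng.choices([img for img, _ in items], weights=weights, k=k)

mvt_by_image = {"dog_001": 17, "cat_007": 50, "cluttered_scene_042": 10000}
print(resample_by_difficulty(mvt_by_image, k=5, harder=True))
```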


Anticipation of future scenarios

What should we expect – thirty, forty, fifty years from now – from a machine that perceives all the visual stimuli of the real world (easy and difficult, simple and complex) better than our optical apparatus and then processes them even faster and more precisely than our brain?

Computer vision and image recognition is one of the most fascinating topics in AI, but it also causes some unease because of the “power” that – in a distant future – its concrete applications could wield.

Beyond the aforementioned autonomous driving and predictive maintenance in industry, it is the uses in medicine and in public video surveillance whose impact is hardest to estimate today.

Just think of the analysis of images (X-rays, CT scans, MRI, PET) for the early diagnosis of serious chronic, neurodegenerative and oncological diseases, in which the infinitely small detail still escapes us today. Many lives could be saved or, at the very least, the progression of some pathologies could be slowed further, thanks to an artificial vision system pushed to its maximum power.

Cameras with an onboard video analysis system capable of analyzing any type of scene in a very short time could, in 50 years, be used systematically – in the public as well as the private sector – for predictive anti-crime analysis and not only (as happens today) for simple deterrence.

These are futuristic scenarios that we can already anticipate today by mapping their impacts, so as to face the changes and revolutions they will inevitably bring with them over time.

© (Article protected by copyright)
