October 21, 2022

We use tools that rely on artificial intelligence (AI) every day – voice assistants like Alexa and Siri are among the most common. These consumer products work reasonably well – Siri understands most of what we say – but they’re by no means perfect. We accept their limitations and adapt the way we use them until they give the right answer, or we give up. After all, the consequences of Siri or Alexa misunderstanding a user request are usually minor.

However, failures in AI models that support doctors’ clinical decisions can mean life or death. It is therefore vital to understand how well these models work before using them. Published reports on this technology currently paint too optimistic a picture of its accuracy, which sometimes translates into sensational headlines in the press. The media talk about algorithms that can diagnose early Alzheimer’s disease with an accuracy of up to 74 percent or that are more accurate than doctors. The scientific articles describing these advances can become the basis for new companies, new investments and lines of research, as well as for large-scale deployments in hospital systems. In most cases, the technology is not yet ready to be deployed.

Here’s why: As researchers feed new data into AI models, the models are expected to become more accurate, or at least not get worse. However, our work and that of others have identified the opposite: the accuracy reported in published models decreases as the size of the datasets increases.

The cause of this counterintuitive pattern lies in the way scientists estimate and report the accuracy of a model. According to best practices, researchers train their AI model on part of their dataset and keep the rest locked away. They then use this set-aside data to test the accuracy of their model. For example, suppose an AI program is being developed to distinguish people with dementia from those without it by analyzing how they speak.

The model is developed using training data consisting of spoken language samples and dementia diagnosis labels, so that it can predict from a person’s speech whether they have dementia. The model is then tested on set-aside data of the same type to estimate how accurately it performs. That accuracy estimate is then reported in academic publications; the higher the accuracy on the set-aside data, the better the algorithm performs, according to the scientists.
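
The train-then-test procedure described above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the data are synthetic, the single feature stands in for a made-up speech measurement, and the threshold rule is far simpler than any real clinical model.

```python
import random

def make_samples(n, rng):
    """Synthetic (feature, label) pairs; the feature is a made-up speech measure."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = dementia, 0 = no dementia
        # Assumption: the feature is slightly higher, on average, for label 1.
        feature = rng.gauss(1.0 if label else 0.0, 1.0)
        samples.append((feature, label))
    return samples

def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of samples whose predicted label matches the true label."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

rng = random.Random(0)
data = make_samples(200, rng)
rng.shuffle(data)
train, held_out = data[:150], data[150:]  # the last 50 samples are set aside

threshold = fit_threshold(train)          # the model only ever sees `train`
held_out_accuracy = accuracy(threshold, held_out)
print(f"accuracy on set-aside data: {held_out_accuracy:.2f}")
```

The point of the design is that `held_out` plays no part in choosing `threshold`, so the printed figure is an honest, if noisy, estimate of how the rule would fare on new people.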

So why does the research find that reported accuracy decreases as the size of the dataset increases? Ideally, scientists never see the set-aside data until the model has been completed and frozen. However, scientists can peek at the data, sometimes unintentionally, and tweak the model until it yields high accuracy, a phenomenon known as data leakage. By using the set-aside data both to modify the model and then to test it, the researchers virtually guarantee that the system will correctly predict that same data, which leads to inflated estimates of the model’s true accuracy. Instead, they should test on new datasets, to see whether the model is actually learning and can arrive at the correct diagnosis when it examines data it has never seen.
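
A small simulation can illustrate how peeking inflates accuracy. This is a sketch under deliberately extreme assumptions, not a description of any published study: the labels are coin flips, so no model can truly do better than 50 percent, and each "tweak" of the model is represented by a fresh set of random guesses. The function name and the numbers of samples and tweaks are hypothetical.

```python
import random

def best_of_k_accuracy(n_test, k, rng):
    """Accuracy of the best of k random guessers, chosen using the test set itself."""
    labels = [rng.randint(0, 1) for _ in range(n_test)]
    best = 0.0
    for _ in range(k):
        # Each "tweak" is just another set of random guesses: there is no signal.
        guesses = [rng.randint(0, 1) for _ in range(n_test)]
        acc = sum(g == y for g, y in zip(guesses, labels)) / n_test
        best = max(best, acc)  # keeping the best tweak is the leak
    return best

rng = random.Random(0)
small = best_of_k_accuracy(n_test=20, k=50, rng=rng)
large = best_of_k_accuracy(n_test=2000, k=50, rng=rng)
print(f"best reported accuracy, 20 test samples:   {small:.2f}")
print(f"best reported accuracy, 2000 test samples: {large:.2f}")
```

Because the true accuracy of every guesser is 50 percent, anything above that in the output is produced purely by selecting on the test data, and the inflation is much larger for the small test set.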



While these overly optimistic accuracy estimates are published in the scientific literature, the lower-performing models are locked away in a drawer, never to be seen by other researchers; or, if they are submitted for publication, they are less likely to be accepted. The impact of data leakage and publication bias (a bias that arises because a study’s result influences the decision to publish it) is exceptionally large for models trained and evaluated on small datasets. In other words, models trained on small datasets are more likely to report inflated accuracy estimates. As a result, the published literature shows a peculiar trend: models trained on small datasets report higher accuracy than models trained on large datasets.
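
Publication bias alone is enough to produce this trend, as the following Python sketch suggests. All of its numbers are assumptions chosen for illustration: every simulated model has the same true accuracy of 60 percent, there are 1,000 hypothetical studies per dataset size, and only the top 10 percent of results are "published".

```python
import random

TRUE_ACCURACY = 0.60       # assumed: every simulated model is equally good
N_STUDIES = 1000           # hypothetical number of studies per dataset size
PUBLISHED_FRACTION = 0.10  # assumed: only the best 10% of results get published

def measured_accuracy(n_test, rng):
    """Accuracy observed on n_test samples when each is correct with TRUE_ACCURACY."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(n_test))
    return correct / n_test

def mean_published_accuracy(n_test, rng):
    """Average accuracy among the top-scoring studies, i.e. those that get published."""
    results = sorted(
        (measured_accuracy(n_test, rng) for _ in range(N_STUDIES)), reverse=True
    )
    published = results[: int(N_STUDIES * PUBLISHED_FRACTION)]
    return sum(published) / len(published)

rng = random.Random(0)
published_by_size = {n: mean_published_accuracy(n, rng) for n in (20, 200, 2000)}
for n, acc in published_by_size.items():
    print(f"test set of {n:4d} samples: mean published accuracy {acc:.2f}")
```

The models are identical in quality, yet the "published" accuracy falls toward the true value as the test set grows, because small test sets give noisier estimates and the luckiest of them are the ones that get selected.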

We could avoid these problems with greater rigor in validating the models and reporting the results in the literature. Having established that developing an AI model is ethical for a particular application, the first question an algorithm designer should ask is, “Do we have enough data to model a complex construct like human health?” If so, then scientists should spend more time evaluating models reliably and less time trying to squeeze every ounce of “accuracy” out of a model.

Reliable model validation begins with ensuring that you have representative data. The most challenging problem in AI model development is the design of the training and test data. While consumer AI companies collect data opportunistically, clinical AI models require more care because of the high stakes. Algorithm designers should regularly question the size and composition of the data used to train a model to ensure that it is representative of the range of a disease’s presentation and of users’ demographics. All datasets are imperfect in some way. Researchers must seek to understand the limitations of the data used to train and evaluate models and the implications of these limitations for model performance.

Unfortunately, there is no magic bullet for reliably validating clinical AI models. Each tool and each clinical population is different from the others. To arrive at satisfactory validation plans that take real-world conditions into account, it is necessary to involve clinicians and patients from the earliest stages of the design process, with input from entities such as the Food and Drug Administration [the US public agency responsible for, among other things, regulating drugs and therapies; translator’s note].

A broader discussion is more likely to ensure that the training datasets are representative, that the metrics for judging whether the model works are relevant, and that what the AI communicates to the clinician is appropriate. Some lessons can be drawn from the reproducibility crisis in clinical research, where strategies such as pre-registration and concepts such as patient-centricity in research have been suggested as means of increasing transparency and promoting trust.

Similarly, a sociotechnical approach to AI model design shows that building reliable and responsible AI models for clinical applications is not a strictly technical problem. It requires a thorough understanding of the underlying clinical application area, the recognition that these models exist in the context of larger systems, and an understanding of the potential harms, should the model’s performance degrade during its use.



Without such a holistic approach, the overhyping of AI will continue. This is unfortunate, because the technology could genuinely improve clinical outcomes and extend clinical reach to the most isolated communities. Taking a more holistic approach to developing and testing clinical AI models will lead to more nuanced discussions about the effectiveness of these models and their limitations. We believe this will ultimately allow the technology to reach its full potential and people to benefit from it.

The authors thank Gautam Dasarathy, Pouria Saidi and Shira Hahn for the enlightening conversations on this topic. They helped clarify some of the points discussed in the article.

Visar Berisha is an associate professor in the College of Engineering and the College of Health Solutions at Arizona State University and co-founder of Aural Analytics. He is an expert in practical and theoretical machine learning and in signal processing with applications to healthcare.

Julie Liss is associate dean and professor at the Arizona State University College of Health Solutions and co-founder of Aural Analytics. She is an expert in speech analysis in the context of health and neurological disease.

(The original of this article was published in “Scientific American” on October 19, 2022. Translation and editing by “Le Scienze”. Reproduction authorized, all rights reserved.)