Google presents AI for medical questions – experts remain skeptical

In “Nature”, researchers from Google present a new large language model that answers medical questions. At the same time, they propose a new benchmark to assess the performance of such models: MultiMedQA.

Previous benchmarks often evaluate language models only on individual medical exams. MultiMedQA therefore adds criteria for assessing the quality of answers along axes such as factuality, comprehension, potential harm and bias. The benchmark consists of seven datasets: six existing ones with questions from medical research and from patients, and HealthSearchQA, a new dataset of 3,173 medical questions that are frequently searched for online.
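To give a rough idea of the structure, the sketch below represents the benchmark composition and the rating axes as plain Python data. The dataset names are taken from the paper; the field names and the aggregation function are illustrative assumptions – in the study itself the ratings come from clinicians and laypeople, they are not computed.

```python
from dataclasses import dataclass, field
from statistics import mean

# The seven question sources that make up MultiMedQA (names from the paper).
MULTIMEDQA_DATASETS = [
    "MedQA", "MedMCQA", "PubMedQA", "MMLU clinical topics",
    "LiveQA", "MedicationQA", "HealthSearchQA",
]

@dataclass
class AnswerRating:
    """One rater's judgment of a long-form answer (illustrative fields only)."""
    scientific_consensus: bool   # does the answer match scientific consensus?
    comprehension_error: bool    # evidence the model misunderstood the question
    potential_harm: bool         # could acting on the answer cause harm?
    bias: bool                   # does the answer show bias against a group?

@dataclass
class EvaluatedAnswer:
    question: str
    answer: str
    ratings: list[AnswerRating] = field(default_factory=list)

def consensus_rate(answers: list[EvaluatedAnswer]) -> float:
    """Illustrative aggregation: average rater agreement with consensus,
    averaged over all rated answers."""
    per_answer = [
        mean(r.scientific_consensus for r in a.ratings)
        for a in answers
        if a.ratings
    ]
    return mean(per_answer) if per_answer else 0.0
```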

Med-PaLM is a transformer model adapted to medical questions; it is based on PaLM (Pathways Language Model), which has 540 billion parameters. With this publication, however, Google is lagging behind its own research: at the end of April, the company had already announced that Med-PaLM 2 would be made available to cooperation partners.

Although the research team was able to further improve the quality of Med-PaLM’s answers with a technique called “instruction prompt tuning”, the model still shows the typical weaknesses of large language models: its answers are strongly context-dependent, and it also produces hallucinated facts.
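Instruction prompt tuning combines hand-written instructions and exemplars with a small number of learned “soft prompt” vectors that are prepended to the input, while the weights of the underlying model stay frozen. The following is a minimal sketch of that idea in PyTorch – not Google’s implementation – and it assumes a base model that accepts input embeddings directly; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Wraps a frozen language model and trains only a short soft prompt:
    a few continuous embedding vectors prepended to every input."""

    def __init__(self, base_model: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.base_model = base_model
        # Freeze all weights of the large model.
        for p in self.base_model.parameters():
            p.requires_grad = False
        # The only trainable parameters: prompt_len learned embedding vectors.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings of the
        # hand-written instruction, the exemplars and the medical question.
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned soft prompt and let the frozen model do the rest.
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))
```

Because only the handful of prompt vectors is trained, adapting the model this way is far cheaper than fine-tuning all 540 billion parameters.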

Overall, however, the model did not perform badly. According to the paper, Med-PaLM’s answers to randomly selected questions from MultiMedQA were evaluated by nine doctors. The result: 92.6 percent of Med-PaLM’s long-form answers correspond to the “scientific consensus”. 5.8 percent of its responses were classified as potentially harmful – comparable to 6.5 percent of the responses from human experts. However, the language model’s responses contained incorrect or inappropriate content 18.7 percent of the time – significantly more often than the human answers, where the figure was only 1.4 percent.

Despite the sometimes impressive answers from the model, experts surveyed by the Science Media Center Germany remain skeptical. “It is questionable how well the model would deal with a realistic situation in which a patient makes unclear, incomplete and sometimes incorrect statements and decisions have to be made under practical clinical constraints,” say Roland Eils and Benjamin Wild from the Center for Digital Health at the Berlin Institute of Health at Charité (BIH). “The biggest methodological problem, as with other LLMs, is that the models can hallucinate, and it is difficult to judge when a statement is correct and when it only appears correct at first glance.”

And Andreas Holzinger from the Institute for Medical Informatics/Statistics at the Medical University of Graz emphasizes that benchmarks “often cannot assess the ability of a model to react to context-specific or individualized queries, as can occur in everyday medical practice.” Therefore, in order to effectively assess the suitability of a large language model for use in medical practice, “it would be important to rely not only on benchmarks, but also on careful testing and evaluation under real-world conditions, including consideration of possible ethical, legal, and safety-related aspects”.

The experts also criticize the fact that Google has published neither the model’s code nor its weights, i.e. the strengths of the connections between the network’s neurons. The company justifies this with the “safety implications of the uncontrolled use of such a model in the medical field” and refers to a “responsible approach to innovation”, which must be developed further together with partners, the research community and regulators.

If the EU’s AI Act is passed as planned, the company will have no other choice anyway: the use of large language models in clinical settings would then almost certainly be classified as a “high-risk application” and regulated accordingly.

(wst)
