In Nature, researchers from Google present a new large language model for answering medical questions. They also propose a new benchmark for assessing the performance of such models: MultiMedQA.
Previous benchmarks often evaluate language models only on individual medical tests. MultiMedQA therefore adds new criteria for assessing the quality of answers, such as factuality, comprehension, potential harm and bias. The benchmark consists of seven datasets: six existing ones with questions from medical research and from patients, plus HealthSearchQA, a new dataset of 3173 medical questions that are frequently searched online.
Med-PaLM is a transformer model adapted to medical questions, based on PaLM (Pathways Language Model) with 540 billion parameters. With this publication, however, Google is lagging behind its own research: the company had already announced at the end of April that Med-PaLM 2 would be available to cooperation partners.
Although the research team was able to further improve the quality of Med-PaLM's answers with a technique called instruction prompt tuning, the model still shows the typical weaknesses of large language models: its answers are strongly context-dependent, and it also hallucinates facts.
Experts remain skeptical
Overall, however, the model performed reasonably well. According to the paper, nine doctors evaluated Med-PaLM's answers to randomly selected questions from MultiMedQA. The result: 92.6 percent of Med-PaLM's detailed answers corresponded to the "scientific consensus". 5.8 percent of its responses were classified as potentially harmful, comparable to 6.5 percent of the responses from human experts. However, the language model's responses contained incorrect or inappropriate content 18.7 percent of the time, significantly more often than the human responses, for which the figure was only 1.4 percent.
Despite the model's sometimes impressive answers, experts surveyed by the Science Media Center Germany remain skeptical. "It is questionable how well the model would cope with a realistic situation in which a patient makes unclear, incomplete and sometimes incorrect statements and decisions have to be made under the constraints of clinical practice," say Roland Eils and Benjamin Wild from the Center for Digital Health at the Berlin Institute of Health at Charité (BIH). "The biggest methodological problem, as with other LLMs, is that the models can hallucinate, and it is difficult to judge when a statement is correct and when it only appears correct at first glance."
And Andreas Holzinger from the Institute for Medical Informatics/Statistics at the Medical University of Graz emphasizes that benchmarks “often cannot assess the ability of a model to react to context-specific or individualized queries, as can occur in everyday medical practice.” Therefore, in order to effectively assess the suitability of a large language model for use in medical practice, “it would be important to rely not only on benchmarks, but also on careful testing and evaluation under real-world conditions, including consideration of possible ethical, legal, and safety-related aspects”.
The experts also criticize that Google has published neither the model's code nor its weights, i.e. the learned connection strengths between the neurons of the network. Google itself justifies this with "safety implications of the uncontrolled use of such a model in the medical field" and refers to a "responsible approach to innovation" that must be developed further together with partners, the research community and regulators.
If the EU's AI Act is passed as planned, Google will have no other choice anyway: the use of large language models in clinical operations would then almost certainly be classified as a "high-risk application" and regulated accordingly.
(wst)