
ChatGPT Takes the Exam for US Physician Licensure

by admin
Source: Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models – medRxiv

Some researchers have subjected the OpenAI system to the test that American doctors must pass in order to practice their profession. The results obtained are really interesting.

In recent weeks, ChatGPT, a new AI model, has captured attention for its ability to perform a wide variety of natural-language tasks. In this blog I had illustrated my brief experience with the system on the subject of digital health (you can read it here). ChatGPT is a general-purpose Large Language Model (LLM) developed by OpenAI. Unlike most existing AI systems, which are Deep Learning (DL) models designed to learn and recognize patterns in data, LLMs are a newer class of AI algorithm trained to predict the probability of a given sequence of words based on the context of the words that precede it. If an LLM is trained on a large enough amount of textual data, it can generate new word sequences never observed during training, but which are nonetheless plausible sequences of natural human language.
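
The next-word-prediction mechanism described above can be made concrete with a small sketch. The snippet below uses the openly available GPT-2 model from the Hugging Face transformers library as a stand-in for ChatGPT, which is not openly downloadable; the context sentence is invented.

```python
# Minimal illustration of next-token prediction, the mechanism described
# above, using the small open GPT-2 model as a stand-in for an LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The patient presented with chest pain radiating to the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the preceding words.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r:>12}  p = {prob.item():.3f}")
```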

ChatGPT is powered by GPT-3.5, an LLM built on OpenAI's 175-billion-parameter foundation model and trained on a large body of text from the Internet using supervised and reinforcement learning methods.

The study

In a study posted as a preprint on medRxiv (note: the article has not yet been peer-reviewed and reports medical research that has not yet been formally evaluated), the researchers assessed the ability of ChatGPT, a non-domain-specific LLM, to perform clinical reasoning by testing it on questions from the United States Medical Licensing Examination (USMLE).

The USMLE is a three-step standardized testing program that covers all topics physicians need to know, from basic sciences to clinical reasoning, medical management to bioethics. The difficulty and complexity of the questions are highly standardized and regulated, making it an ideal input substrate for AI testing.

The Step 1 exam is usually taken by medical students who have completed two years of didactic learning; it is problem-based and focuses on the basic sciences, pharmacology, and pathophysiology, and medical students often spend around 300-400 study hours preparing for it. The Step 2CK exam is usually taken by fourth-year medical students who have also completed 1.5-2 years of training; it emphasizes clinical reasoning, medical management, and bioethics. The Step 3 exam is taken by physicians who have generally completed at least six months to one year of postgraduate medical education.


The methodology

The researchers located 376 publicly available test questions on the official USMLE website. Random spot checks were performed to ensure that none of the answers, explanations, or related content had been indexed on Google before 1 January 2022, the latest date accessible to the ChatGPT training dataset. All test questions were reviewed, and those containing visual elements such as clinical images, medical photographs, and graphs were removed. After filtering, 305 USMLE questions (Step 1: 93; Step 2CK: 99; Step 3: 113) advanced to encoding.

The encoding

The questions were processed in three variants and entered into ChatGPT in the following sequence (a sketch of the encoding appears after the list):

  • Open-ended (OE) format: Created by removing all answer options and adding a variable initial interrogative sentence. This format simulates free input and a natural query pattern on the part of the user.
  • Multiple Choice Single Answer Without Forced Justification (MC-NJ): Created by reproducing the original USMLE question verbatim.
  • Multiple Choice Single Answer with Forced Justification (MC-J): Created by adding a variable imperative or interrogative phrase that forces ChatGPT to provide a rationale for each answer choice.
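
As a rough illustration of how these three formats might be produced programmatically, here is a hypothetical sketch; the question stem, answer options, and lead-in phrases are invented, and the study only describes the encoding rather than publishing code.

```python
# Hypothetical sketch of the OE / MC-NJ / MC-J encodings described above.
# The question text, options, and lead-in phrases are invented examples,
# not the study's actual items or code.

def encode_variants(stem: str, options: dict[str, str]) -> dict[str, str]:
    option_block = "\n".join(f"{key}. {text}" for key, text in options.items())
    return {
        # OE: remove all answer options and add an interrogative lead-in.
        "OE": f"{stem}\nWhat is the most likely diagnosis?",
        # MC-NJ: the original multiple-choice item, reproduced verbatim.
        "MC-NJ": f"{stem}\n{option_block}",
        # MC-J: add an imperative phrase forcing a justification for
        # every answer choice, accepted or rejected.
        "MC-J": (
            f"{stem}\n{option_block}\n"
            "Explain why the correct option is correct and why each of "
            "the other options is incorrect."
        ),
    }

variants = encode_variants(
    stem="A 54-year-old man presents with sudden, tearing chest pain...",
    options={"A": "Aortic dissection", "B": "Pulmonary embolism",
             "C": "Myocardial infarction", "D": "Pericarditis"},
)
print(variants["MC-J"])
```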

To reduce memory retention bias, a new chat session was started in ChatGPT for each entry. Post-hoc analyses were performed to rule out systematic variation by encoder (data not shown).

The results obtained were independently evaluated for accuracy, agreement, and insight by two physician reviewers.

Accuracy: room for improvement

Exam items were first coded as open-ended questions with variable prompts. This input format simulates a natural user query pattern. With indeterminate responses censored/included, the accuracy of ChatGPT for USMLE Steps 1, 2CK, and 3 was 68.0%/42.9%, 58.3%/51.4%, and 62.4%/55.7%, respectively.
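
A note on how these censored/included pairs appear to be computed (my reading of the study's convention, not text from the paper): "censored" drops indeterminate answers from the denominator, while "included" counts them against the model. The sketch below uses invented counts chosen so that they roughly reproduce the Step 1 figures over its 93 items.

```python
# Sketch of the censored vs. included accuracy convention (my reading of
# the study's reporting). The counts are invented, chosen so that they
# roughly reproduce the Step 1 figures of 68.0% / 42.9% over 93 items.

def accuracies(correct: int, incorrect: int, indeterminate: int) -> tuple[float, float]:
    censored = correct / (correct + incorrect)                   # indeterminate excluded
    included = correct / (correct + incorrect + indeterminate)   # indeterminate counted as misses
    return censored, included

censored, included = accuracies(correct=40, incorrect=19, indeterminate=34)
print(f"censored accuracy: {censored:.1%}")   # -> 67.8%
print(f"included accuracy: {included:.1%}")   # -> 43.0%
```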

Subsequently, the exam items were coded as multiple choice single answer questions without forced justification (MC-NJ). This input corresponds to the format in which the questions are presented to examinees. With indeterminate responses censored/included, the accuracy of ChatGPT for USMLE Steps 1, 2CK, and 3 was 55.1%/36.1%, 59.1%/56.9%, and 60.9%/54.9%, respectively.


Finally, the items were coded as multiple choice single answer questions with forced justification of positive and negative selections (MC-J). This input format simulates the behavior of users searching for information. With indeterminate responses censored/included, the accuracy of ChatGPT for Steps 1, 2CK, and 3 was 62.3%/40.3%, 51.9%/48.6%, and 64.6%/59.8%, respectively.

High agreement

Agreement was independently assessed by two reviewing physicians through inspection of the explanation content. Overall, ChatGPT produced answers and explanations with 94.6% agreement across all questions. High overall agreement was maintained across all exam levels and across the OE, MC-NJ, and MC-J question input formats.

Next, the researchers analyzed the contingency between accuracy and agreement in MC-J responses, where ChatGPT was forced to justify its preferred answer choice and to defend its rejection of the alternatives. Agreement among accurate responses was nearly perfect and significantly greater than among inaccurate responses (99.1% vs. 85.1%, p < 0.001). These data indicate that ChatGPT exhibits very high answer-explanation agreement, likely reflecting the high internal consistency of its probabilistic language model.
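
The text does not say which statistical test produced the p-value above. A standard choice for comparing two proportions is a two-proportion z-test; the sketch below is purely illustrative, with counts invented so that they roughly match the reported 99.1% and 85.1%.

```python
# Illustrative two-proportion z-test for comparing concordance rates.
# The counts are invented to roughly match 99.1% vs. 85.1%; the study's
# raw counts and actual test are not given in the text above.
from statsmodels.stats.proportion import proportions_ztest

concordant = [214, 80]   # concordant explanations among accurate / inaccurate answers
totals     = [216, 94]   # number of accurate answers, number of inaccurate answers

z_stat, p_value = proportions_ztest(count=concordant, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.2e}")
```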

After establishing ChatGPT's accuracy and concordance, the researchers examined its potential to augment human learning in medical education. The AI-generated explanations were independently evaluated by two medical reviewers. The content of the explanations was scrutinized for meaningful insights, defined as instances that met the criteria of novelty, non-obviousness, and validity. The evaluators adopted the perspective of the test's target audience: a second-year medical student for Step 1, a fourth-year medical student for Step 2CK, and a first-year postgraduate resident for Step 3.

ChatGPT yielded at least one significant insight in 88.9% of all responses. The prevalence of insights was generally consistent across exam type and question entry format.

Conclusions

The results of the study can be divided into two main themes:

  1. the accuracy of ChatGPT, which approaches or exceeds the USMLE pass threshold;
  2. the potential of this AI to generate new insights that can help human students in a medical education setting.

ChatGPT achieved over 50% accuracy in all exams, exceeding 60% in most analyses. The threshold for passing the USMLE, although it varies from year to year, is approximately 60%, so ChatGPT falls within the passing range. This is the first experiment to reach this benchmark, which the authors describe as a surprising and impressive result.

Paradoxically, ChatGPT outperformed PubMedGPT (50.8% accuracy, unpublished data), a counterpart LLM with a similar neural structure, but trained exclusively on the biomedical literature.

The authors also examined ChatGPT’s ability to assist the human learning process of its target audience (for example, a second-year medical student preparing for USMLE Step 1). The ChatGPT responses were highly concordant, such that a human learner could easily follow the internal language, logic, and directionality of relationships contained in the explanation text (e.g., adrenal hypercortisolism → increased osteoclast activity → increased bone resorption of calcium → decreased bone mineral density → increased risk of fractures). High internal agreement and low self-contradiction are an indicator of sound clinical reasoning and an important measure of the quality of explanations. It is reassuring that the directionality of relationships is preserved by the language processing model, in which each verbal object is individually lemmatized.

AI-generated responses also offered significant insight, modeling a deductive reasoning process valuable to human learners. At least one significant insight was present in about 90% of the responses. ChatGPT therefore possesses the partial ability to teach medicine by bringing out new and non-obvious concepts that may not be in the students’ sphere of awareness. This qualitative result provides a foundation for future real-world studies on the effectiveness of generative AI to augment the human medical education process. For example, it is possible to study longitudinal exam performance in a quasi-controlled setting between AI-assisted and non-AI-assisted students.

Finally, the authors also highlight some limitations of their study. Anyone wishing to read the full text of the work can access it here.
