ChatGPT can score at or near the approximately 60 per cent passing threshold for the United States Medical Licensing Exam (USMLE), with responses that were internally coherent and contained frequent insights, according to a new study published in the open-access journal PLOS Digital Health.
The Study
Researcher Tiffany Kung and colleagues at AnsibleHealth, California, US, tested ChatGPT's performance on the USMLE, a highly standardized and regulated series of three exams (Steps 1, 2CK, and 3) required for medical licensure in the US, the study said.
Taken by medical students and physicians-in-training, the USMLE assesses knowledge spanning most medical disciplines, ranging from biochemistry to diagnostic reasoning to bioethics.
The Results
After screening out image-based questions, the authors tested the AI software on 350 of the 376 public questions available from the June 2022 USMLE release. They found that, after indeterminate responses were removed, ChatGPT scored between 52.4 per cent and 75 per cent across the three USMLE exams. The passing threshold each year is approximately 60 per cent.
The researchers also noted that ‘being the first to achieve this benchmark, this marks a notable milestone in AI maturation. Impressively, ChatGPT was able to achieve this result without specialized input from human trainers. It demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.’
While the relatively small input size restricted the depth and range of analyses, the authors noted that their findings provided a glimpse into ChatGPT's potential to enhance medical education, and eventually, clinical practice.
In another study, published as a research letter in the journal JAMA, ChatGPT's answers to 21 of 25 questions about fundamental concepts in preventing heart disease, including risk-factor counselling, were considered "appropriate". While the researchers noted several limitations in their analysis, they suggested that future work could use more reviewers to evaluate responses, or set up a formal grading system that relied less heavily on a clinician's subjective opinion.