Researchers at the Icahn School of Medicine at Mount Sinai have published a new study in The Lancet Digital Health examining how susceptible leading large language models (LLMs) are to medical misinformation. Analyzing more than one million prompts across nine widely used models, the team found that AI systems frequently repeat false or unsafe medical advice when it is embedded in realistic, professional-sounding clinical documentation.
To test the models, researchers took real hospital discharge summaries from the MIMIC database and inserted a single fabricated recommendation into each note. One example advised a patient with esophagitis-related bleeding to “drink cold milk,” a recommendation that is clinically unsafe. Rather than flagging the error, several models accepted and repeated the advice as valid simply because it appeared within a credible hospital note format.
The findings reveal a critical weakness in current AI safeguards: style often overrides substance. According to Dr. Eyal Klang, Chief of Generative AI at Mount Sinai, the models tend to treat confident, clinical language as inherently true, without independently verifying the medical accuracy of the content. This means AI systems are predicting what sounds right in context, not whether it is correct.
The implications for clinical use are significant. If AI tools are used to summarize records or support decision-making, they may amplify existing errors rather than detect them—especially if those errors originate from human documentation or prior AI outputs. Lead author Dr. Mahmud Omar argues that medical AI should be evaluated not just on performance benchmarks, but on how well it resists propagating false information. The researchers propose using their dataset as a standardized “stress test” to assess whether models can recognize and reject dangerous misinformation before being deployed in patient care.
Click here to read the original news story.