05 Jan 2026

Stanford’s VeriFact Uses AI To Verify LLM-Generated Clinical Records

Researchers at Stanford University have developed VeriFact, a platform designed to verify the accuracy of AI-generated clinical documentation by comparing it directly against a patient’s electronic health record (EHR). Described in a study published in NEJM AI, the system evaluates whether statements produced by large language models (LLMs) are factually supported by real patient data, addressing growing concerns about hallucinations and errors in AI-generated medical notes.


VeriFact works by extracting relevant information from a patient’s EHR and applying an “LLM-as-a-judge” approach to assess generated text. It checks whether each statement aligns with known clinical facts, identifies inconsistencies, and explains the underlying causes of errors. The researchers also introduced a clinician-annotated benchmark dataset, VeriFact–Brief Hospital Course (VeriFact-BHC), which breaks discharge summaries into individual claims and labels each as supported or unsupported by the EHR. The dataset includes 100 patients and more than 13,000 statements, each reviewed by three or more clinicians.
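The claim-by-claim checking described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the function names (`split_into_claims`, `retrieve_facts`, `toy_judge`, `verify_text`) are hypothetical, sentence splitting stands in for claim extraction, word overlap stands in for EHR retrieval, and a simple token check stands in for the LLM judge. The patient facts are invented for illustration.

```python
import re
from typing import Callable, Dict, List


def split_into_claims(text: str) -> List[str]:
    """Naive stand-in for claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def _tokens(s: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve_facts(claim: str, ehr_facts: List[str], top_k: int = 2) -> List[str]:
    """Toy retrieval: rank EHR facts by word overlap with the claim."""
    claim_tokens = _tokens(claim)
    scored = [(len(claim_tokens & _tokens(f)), f) for f in ehr_facts]
    scored = [(s, f) for s, f in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:top_k]]


def toy_judge(claim: str, evidence: List[str]) -> str:
    """Placeholder for an LLM judge (assumption): a claim counts as supported
    only if a single retrieved fact contains every token of the claim."""
    claim_tokens = _tokens(claim)
    if any(claim_tokens <= _tokens(fact) for fact in evidence):
        return "supported"
    return "unsupported"


def verify_text(
    text: str,
    ehr_facts: List[str],
    judge: Callable[[str, List[str]], str] = toy_judge,
) -> List[Dict[str, object]]:
    """Label every claim in the text against the retrieved EHR facts."""
    results = []
    for claim in split_into_claims(text):
        evidence = retrieve_facts(claim, ehr_facts)
        results.append(
            {"claim": claim, "label": judge(claim, evidence), "evidence": evidence}
        )
    return results


# Invented example data, not taken from the study.
ehr = [
    "Patient was admitted with pneumonia.",
    "Patient received ceftriaxone on day 1.",
]
draft = "Patient was admitted with pneumonia. Patient received vancomycin on day 1."

for row in verify_text(draft, ehr):
    print(row["label"], "|", row["claim"])
```

In a real system the `judge` argument would be replaced by a call to a language model that reads the claim alongside the retrieved evidence; the sketch keeps it as a plain function so the control flow is easy to follow.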


In testing, VeriFact achieved 93.2% agreement with clinician judgments, exceeding the highest interrater agreement among the clinicians themselves (88.5%). The researchers said this demonstrates that AI-based fact verification can be more consistent than manual review. They noted that VeriFact could help clinicians verify AI-drafted notes before adding them to the EHR and automate documentation tasks that typically require time-consuming chart review.
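For readers unfamiliar with agreement figures like these, the simplest form is raw percent agreement: the share of statements on which two raters assign the same label. The sketch below assumes that definition; the study may define or aggregate agreement differently, and the labels shown are invented, not study data.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Percentage of items on which two raters gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same number of items.")
    if not labels_a:
        raise ValueError("At least one labeled item is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Invented toy labels for four statements (not from the study).
system_labels = ["supported", "unsupported", "supported", "supported"]
clinician_labels = ["supported", "unsupported", "unsupported", "supported"]

print(percent_agreement(system_labels, clinician_labels))  # 75.0
```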


The study also highlighted limitations. VeriFact relies on the EHR as the source of truth, which may be incomplete or contain errors, especially for new patients. It can flag only inaccuracies in generated text, not omissions, and its performance declined when applied to human-written notes rather than AI-generated ones. The benchmark dataset is also limited to discharge summaries from a single dataset and documentation pipeline, which may affect generalizability.

