cs.CL

Testing synthetic clinical notes at scale: what gets lost in translation

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

May 18, 2026

This study evaluates synthetic clinical notes rephrased by LLMs using intrinsic quality metrics, downstream task performance, and factuality checks across one million MIMIC records. Results show that synthetic notes retain predictive value for broad clinical tasks such as mortality prediction but fail on granular tasks such as ICD coding. Chunk-based rephrasing improves detail preservation but introduces factual errors when the model sees incomplete context. Common synthesis failures include misinterpreted clinical context, temporal confusion, and measurement errors. Despite these limitations, synthetic notes can augment training data for rare diagnostic codes. The work is primarily empirical, benchmarking LLM-generated clinical text at hospital-scale production volumes.
Published as "Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale", arXiv:2605.17775