ML @ CMU June 19, 2026 · Papers

Healthcare Benchmarks Are Only as Good as Their Assumptions

In healthcare settings where patients use LLMs as medical assistants, LLM performance differs between evaluation and deployment. (a) Bean et al. (2025) find a 61-percentage-point difference between evaluation and deployment. (b) We argue this gap arises not from poorly designed benchmarks, but from implicit assumption
