Healthcare automation benchmark exposes limits of current AI agents

Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

CHI-Bench tests whether AI agents can automate end-to-end healthcare operations across three domains: prior authorization, utilization management, and care coordination. Each task requires an agent to navigate a simulator with 87 tools representing 20 healthcare apps, follow an operations handbook of more than 1,290 documents, and handle multi-turn interactions across multiple roles. The benchmark finds that the strongest agent-model combinations resolve only 28% of tasks, and that performance drops to 3.8% when tasks must be completed in a single session. The results suggest that similar capability gaps exist in other policy-dense, role-composed enterprise domains.