Published Article

Artificial Intelligence-Driven Large Language Models for Health Consultation of HIV/AIDS Patients

 

Authors Zhao CY, Song C, Yang T, Huang AC, Qiang HB, Gong CM, Chen JS, Zhu QD

Received 29 April 2025

Accepted for publication 6 August 2025

Published 25 August 2025 Volume 2025:18 Pages 5187–5198

DOI https://doi.org/10.2147/JMDH.S533621

Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 3

Editor who approved publication: Dr Scott Fraser

Chun-Yan Zhao,1,2,* Chang Song,1,2,* Tong Yang,3,* Ai-Chun Huang,1 Hang-Biao Qiang,1 Chun-Ming Gong,1 Jing-Song Chen,4 Qing-Dong Zhu1 

1Department of Tuberculosis, The Fourth People’s Hospital of Nanning, Nanning, Guangxi, People’s Republic of China; 2Clinical Medical School, Guangxi Medical University, Nanning, Guangxi, People’s Republic of China; 3Department of Rehabilitation, Hepu County People’s Hospital, Beihai, Guangxi, People’s Republic of China; 4Department of Gastroenterology, Hepu County People’s Hospital, Beihai, Guangxi, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Qing-Dong Zhu, Department of Tuberculosis, The Fourth People’s Hospital of Nanning, No. 1 Changgang Two-Li, Xingning District, Nanning, Guangxi, 530023, People’s Republic of China, Tel +86 0771-5636973, Email zhuqingdong2003@163.com; Jing-Song Chen, Department of Gastroenterology, Hepu County People’s Hospital, No. 95, Dinghai North Road, Hepu County, Beihai, Guangxi, 536100, People’s Republic of China, Tel +86 0779-7106010, Email 410155791@qq.com

Purpose: This study comprehensively assesses the performance of large language models (LLMs) in health consultation for people living with HIV, examines their applicability across multiple dimensions, and provides evidence-based support for clinical deployment.
Patients and Methods: A 23-question, multi-dimensional, HIV-specific question bank was developed, covering fundamental knowledge, diagnosis, treatment, prognosis, and case analysis. Four advanced LLMs (ChatGPT-4o, Copilot, Gemini, and Claude) were tested using a multi-dimensional evaluation system that assessed medical accuracy, comprehensiveness, understandability, reliability, and humanistic care (attention to individual needs, emotional support, and ethical considerations). Responses were rated on a five-point Likert scale by three experts scoring independently. Descriptive statistics (mean, standard deviation, standard error) were calculated, followed by consistency analysis, difference analysis, and post-hoc testing.
Results: Claude scored highest on information comprehensiveness (mean score 4.333), understandability (mean score 3.797), and humanistic care (mean score 2.855); Copilot performed well on diagnostic questions (mean score 3.880); and Gemini performed exceptionally well in case analysis (mean score 4.111). In the post-hoc analysis, Claude outperformed the other models in thoroughness and humanistic care (P < 0.05). Copilot outperformed ChatGPT in understandability (P = 0.045), and Gemini performed significantly better than ChatGPT in case analysis (P < 0.001). Performance varied across tasks, and humanistic care remained a consistent weak point for all models.
Conclusion: The fact that different models excelled at different tasks suggests that LLMs have broad application potential in the management of patients with HIV. Nevertheless, their performance in humanistic care still needs improvement.
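For readers who want to see how the descriptive statistics named in Patients and Methods (mean, standard deviation, standard error) could be computed from Likert ratings, the following is a minimal sketch and not the authors' analysis code. The function name `summarize` and the example scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the abstract does not state which form was used.

```python
import math
import statistics


def summarize(scores):
    """Return n, mean, sample SD, and standard error for a list of Likert scores.

    Assumption: the standard deviation uses the n - 1 denominator
    (sample SD); the abstract does not specify this.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed for a sample SD")
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)   # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    return {"n": n, "mean": mean, "sd": sd, "se": se}


if __name__ == "__main__":
    # Hypothetical 1-5 Likert ratings pooled from three raters for one
    # model on one dimension; these are NOT data from the study.
    example_scores = [4, 5, 4, 3, 4, 5]
    result = summarize(example_scores)
    print(f"n={result['n']} mean={result['mean']:.3f} "
          f"SD={result['sd']:.3f} SE={result['se']:.3f}")
```

In practice the same summary would be computed per model and per evaluation dimension before any consistency or difference analysis is run.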

Keywords: artificial intelligence, large language model, HIV, health consultation, performance analysis