
Exploring and Comparing the Application of Large Language Models in Supporting Osteoporosis Health Consultation

 

Authors Li X, Li G, Zhao Y, Liang Y, Dong Y, Zhang J

Received 30 July 2025

Accepted for publication 13 November 2025

Published 21 November 2025 Volume 2025:20 Pages 2133–2143

DOI https://doi.org/10.2147/CIA.S551572

Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 3

Editor who approved publication: Prof. Dr. Nandu Goswami

Xin Li,1,* Gen Li,2,* Yue Zhao,3 Yixin Liang,4 Yuefu Dong,1 Jian Zhang1 

1Department of Orthopedics, The First People’s Hospital of Lianyungang, Lianyungang, Jiangsu, People’s Republic of China; 2Department of Orthopedics, The Second Affiliated Hospital of Xuzhou Medical University, Xuzhou, Jiangsu, People’s Republic of China; 3Department of Nursing, Lianyungang Maternity and Child Health Hospital, Lianyungang, Jiangsu, People’s Republic of China; 4Department of Osteoporosis, The First People’s Hospital of Lianyungang, Lianyungang, Jiangsu, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Yuefu Dong, Department of Orthopedics, The First People’s Hospital of Lianyungang, Lianyungang, Jiangsu, People’s Republic of China, Email dongyuefu@163.com; Jian Zhang, Department of Orthopedics, The First People’s Hospital of Lianyungang, Lianyungang, Jiangsu, People’s Republic of China, Email lygyyzj@163.com

Purpose: To compare the medical accuracy and content comprehensiveness of three large language models (LLMs) in generating responses to frequently asked osteoporosis-related questions and to determine their potential role in clinical support.
Methods: Twenty-five questions covering six clinical domains were submitted to each of the three models (ChatGPT-4o, Gemini-2.5 Pro, and DeepSeek-R1) in separate, isolated sessions. Five senior orthopedic physicians, each with over 25 years of clinical experience, independently rated the medical accuracy of each response on a 5-point Likert scale. Responses rated “acceptable” or above were further evaluated for content comprehensiveness. Statistical analysis included the Kruskal–Wallis test and Dunn’s post hoc test with Bonferroni correction.
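
The statistical workflow described above can be reproduced in a few lines of Python. The sketch below is illustrative only: the rating values are invented placeholders rather than the study’s data, and the third-party scikit-posthocs package is assumed as one common implementation of Dunn’s test, not necessarily the tool the authors used.

    # Illustrative sketch: Kruskal-Wallis omnibus test followed by
    # Dunn's post hoc test with Bonferroni correction. Scores below
    # are placeholders, not the study's ratings.
    import pandas as pd
    import scikit_posthocs as sp
    from scipy.stats import kruskal

    # Hypothetical per-response accuracy ratings for each model
    chatgpt = [4.8, 4.6, 4.4, 4.6, 4.8]
    gemini = [4.4, 4.2, 4.6, 4.4, 4.0]
    deepseek = [4.0, 3.8, 4.2, 4.0, 3.6]

    # Omnibus test: do the three rating distributions differ?
    h_stat, p_value = kruskal(chatgpt, gemini, deepseek)
    print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.4f}")

    # Pairwise Dunn's test with Bonferroni-adjusted p-values
    df = pd.DataFrame({
        "score": chatgpt + gemini + deepseek,
        "model": (["ChatGPT-4o"] * 5 + ["Gemini-2.5 Pro"] * 5
                  + ["DeepSeek-R1"] * 5),
    })
    print(sp.posthoc_dunn(df, val_col="score", group_col="model",
                          p_adjust="bonferroni"))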
Results: A total of 75 unique responses (25 questions × 3 models) were evaluated by five orthopedic experts, yielding 375 ratings. ChatGPT-4o achieved the highest accuracy score (median: 4.6; IQR: 4.4–4.8), significantly outperforming Gemini-2.5 Pro (p=0.039) and DeepSeek-R1 (p<0.001). For content comprehensiveness, both ChatGPT-4o and Gemini-2.5 Pro had a median score of 4.4, higher than DeepSeek-R1 (median: 4.2), though the differences did not reach statistical significance (p=0.0536). Gemini-2.5 Pro was noted for its fluent and user-friendly language but lacked clinical depth in some responses. DeepSeek-R1, despite offering source citations, demonstrated greater inconsistency.
Conclusion: LLMs have clear potential as tools for patient education in osteoporosis. ChatGPT-4o demonstrated the most balanced and clinically reliable performance. Nonetheless, expert medical oversight remains essential to ensure safe and context-appropriate use in healthcare settings.

Keywords: large language models, osteoporosis, patient education, AI in healthcare, clinical consultation support