Published article
Journal indexed with an Impact Factor
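Evaluation of the Response Performance of Three Large Language Models to Inquiries Regarding Post-Abortion Care in the Chinese-Language Context: A Comparative Analysis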
Received 1 April 2025
Accepted for publication 9 August 2025
Published 18 August 2025 Volume 2025:18 Pages 2731–2741
DOI https://doi.org/10.2147/RMHP.S531777
Checked for plagiarism Yes
Review by Single anonymous peer review
Peer reviewer comments 2
Danyue Xue,1,2 Sha Liao1,2
1Department of Operating Room Nursing, West China Second University Hospital, Sichuan University, Chengdu, Sichuan, People’s Republic of China; 2Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, Sichuan, People’s Republic of China
Correspondence: Sha Liao, Email 1205843055@qq.com
Background: This study aimed to evaluate the response performance of three large language models (LLMs) (ChatGPT, Kimi, and Ernie Bot) to inquiries regarding post-abortion care (PAC) in a Chinese-language context.
Methods: Data were collected in October 2024. The study used twenty questions covering the necessity of contraception after induced abortion, the best time for contraception, the choice of a contraceptive method, contraceptive effectiveness, and the potential impact of contraception on fertility. Each question was posed to each LLM three times in Chinese. Three PAC consultants evaluated the responses, scoring each on a Likert scale for accuracy, relevance, completeness, clarity, and reliability.
Results: In the overall evaluation, the numbers of responses rated “good” (mean score > 4), “average” (3 < mean score ≤ 4), and “poor” (mean score ≤ 3) were 159 (88.30%), 19 (10.57%), and 2 (1.10%), respectively. No statistically significant difference in the overall evaluation was found among the three LLMs (P = 0.352). The numbers of responses rated good for accuracy, relevance, completeness, clarity, and reliability were 87 (48.33%), 154 (85.53%), 136 (75.57%), 133 (73.87%), and 128 (71.10%), respectively. No statistically significant differences were found among the three LLMs in accuracy, relevance, completeness, or clarity, whereas reliability differed significantly (P < 0.001).
Conclusion: The three LLMs performed well overall and showed great potential for application in PAC consultations. The accuracy of the LLMs’ responses should be improved through continuous training and evaluation.
Keywords: artificial intelligence, abortion, induced, referral and consultation, delivery of health care, comparative study
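The rating bands reported in the Results can be illustrated with a minimal Python sketch. It assumes a 5-point Likert scale and that a response's mean is taken over the three evaluators' scores; the abstract states neither detail, and the function name `rate_response` and the sample scores are hypothetical. Only the thresholds (> 4, 3 to 4, ≤ 3) and the study design (20 questions, 3 repetitions, 3 LLMs) come from the abstract.

```python
from statistics import mean

def rate_response(scores):
    """Map the mean of a response's Likert scores to a rating band.

    Thresholds follow the abstract: good (> 4), average (3 < mean <= 4),
    poor (<= 3). How the mean is formed is an assumption of this sketch.
    """
    m = mean(scores)
    if m > 4:
        return "good"
    if m > 3:
        return "average"
    return "poor"

# Responses implied by the design: 20 questions x 3 repetitions x 3 LLMs.
total_responses = 20 * 3 * 3

# Hypothetical scores from three evaluators for a single response.
print(rate_response([5, 4, 5]))  # mean above 4
print(rate_response([4, 4, 4]))  # mean exactly 4 falls in the middle band
print(rate_response([3, 3, 2]))  # mean of 3 or below
print(total_responses)
```

Note that a mean of exactly 4 is classified as "average" rather than "good", because the "good" band uses a strict inequality.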