Published Paper

Evaluation of the Response Performance of Three Large Language Models to Post-Abortion Care Consultation Inquiries in the Chinese Context: A Comparative Analysis


Authors Xue D, Liao S

Received 1 April 2025

Accepted for publication 9 August 2025

Published 18 August 2025 Volume 2025:18 Pages 2731–2741

DOI https://doi.org/10.2147/RMHP.S531777

Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 2

Danyue Xue,1,2 Sha Liao1,2 

1Department of Operating Room Nursing, West China Second University Hospital, Sichuan University, Chengdu, Sichuan, People’s Republic of China; 2Key Laboratory of Birth Defects and Related Diseases of Women and Children (Sichuan University), Ministry of Education, Chengdu, Sichuan, People’s Republic of China

Correspondence: Sha Liao, Email 1205843055@qq.com

Background: This study aimed to evaluate the response performance of three large language models (LLMs), ChatGPT, Kimi, and Ernie Bot, to inquiries regarding post-abortion care (PAC) in the Chinese-language context.
Methods: Data were collected in October 2024. Twenty questions concerning the necessity of contraception after induced abortion, the best time to begin contraception, the choice of a contraceptive method, contraceptive effectiveness, and the potential impact of contraception on fertility were used in this study. Each question was asked three times in Chinese of each LLM. Three PAC consultants conducted the evaluations, using a Likert scale to score the responses on accuracy, relevance, completeness, clarity, and reliability.
Results: The numbers of responses rated “good” (mean score > 4), “average” (3 < mean score ≤ 4), and “poor” (mean score ≤ 3) in the overall evaluation were 159 (88.33%), 19 (10.56%), and 2 (1.11%), respectively. No statistically significant differences were identified in the overall evaluation among the three LLMs (P = 0.352). The numbers of responses rated good for accuracy, relevance, completeness, clarity, and reliability were 87 (48.33%), 154 (85.56%), 136 (75.56%), 133 (73.89%), and 128 (71.11%), respectively. No statistically significant differences were identified in accuracy, relevance, completeness, or clarity among the three LLMs; a statistically significant difference was identified in reliability (P < 0.001).
Conclusion: The three LLMs performed well overall and showed considerable potential for application in PAC consultations. The accuracy of the LLMs’ responses should be improved through continued training and evaluation.

Keywords: artificial intelligence, abortion, induced, referral and consultation, delivery of health care, comparative study