A Comparative Study of the Responses of Five Large Language Models Regarding Comprehensive Treatment of Liver Cancer
Authors: Zhong D, Liang Y, Yan HT, Chen X, Yang Q, Ma S, Su Y, Chen Y, Huang X, Wang M
Received 15 April 2025
Accepted for publication 8 August 2025
Published 20 August 2025 Volume 2025:12 Pages 1861–1871
DOI https://doi.org/10.2147/JHC.S531642
Checked for plagiarism Yes
Review by Single anonymous peer review
Peer reviewer comments 2
Editor who approved publication: Dr David Gerber
Deyuan Zhong, Yuxin Liang, Hong-Tao Yan, Xinpei Chen, Qinyan Yang, Shuoshuo Ma, Yuhao Su, Yahui Chen, Xiaolun Huang, Ming Wang
Department of Liver Transplantation Center and HBP Surgery, Sichuan Clinical Research Center for Cancer, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, People’s Republic of China
Correspondence: Xiaolun Huang, Email huangxiaolun@med.uestc.edu.cn; Ming Wang, Email wangming0610@163.com
Introduction: Large language models (LLMs) are increasingly used in healthcare, yet their reliability in specialized clinical fields remains uncertain. Liver cancer, a complex and high-burden disease, poses particular challenges for AI-based tools. This study aimed to evaluate the comprehensibility and clinical applicability of five mainstream LLMs in addressing liver cancer–related clinical questions.
Methods: We developed 90 standardized questions covering multiple aspects of liver cancer management. Five LLMs (GPT-4, Gemini, Copilot, Kimi, and Ernie Bot) were evaluated in a blinded fashion by three independent hepatobiliary experts. Responses were scored against predefined criteria for comprehensibility and clinical applicability. Overall group comparisons used the Fisher–Freeman–Halton test (categorical data) and the Kruskal–Wallis test (ordinal scores), followed by Dunn's post-hoc test or Fisher's exact test with Bonferroni correction. Inter-rater reliability was assessed with Fleiss' kappa.
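For readers wishing to replicate this kind of analysis, the statistical workflow described above maps onto standard Python scientific libraries. The following is a minimal, hypothetical sketch, not the authors' code: the scores are simulated stand-ins for the expert ratings, and it assumes scipy, scikit-posthocs, and statsmodels are installed.

```python
# Illustrative sketch only; simulated data stand in for the real ratings.
import numpy as np
from scipy import stats
import scikit_posthocs as sp
from statsmodels.stats import inter_rater as irr

rng = np.random.default_rng(0)
models = ["GPT-4", "Gemini", "Copilot", "Kimi", "Ernie Bot"]
# Hypothetical ordinal applicability scores (1-3) for 90 questions per model.
scores = {m: rng.integers(1, 4, size=90) for m in models}

# Overall comparison of ordinal scores: Kruskal-Wallis test.
h_stat, p_val = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_val:.4f}")

# Pairwise follow-up: Dunn's post-hoc test with Bonferroni correction.
dunn_p = sp.posthoc_dunn(list(scores.values()), p_adjust="bonferroni")
print(dunn_p)

# Inter-rater reliability: Fleiss' kappa over three raters' ordinal labels.
ratings = rng.integers(1, 4, size=(90, 3))        # items x raters
table, _ = irr.aggregate_raters(ratings)          # items x categories
print(f"Fleiss' kappa = {irr.fleiss_kappa(table, method='fleiss'):.3f}")

# Note: the Fisher-Freeman-Halton exact test for r x c categorical tables
# is not available in SciPy; it is typically run in R (fisher.test) or SPSS.
```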
Results: Kimi and GPT-4 achieved the highest proportions of fully applicable responses (68% and 62%, respectively), while Ernie Bot and Copilot showed the lowest. Comprehensibility was generally high, with Kimi and Ernie Bot scoring over 98%. However, none of the LLMs consistently provided guideline-concordant answers to all questions. Performance on professional-level questions was significantly lower than on common-sense ones, highlighting deficiencies in complex clinical reasoning.
Conclusion: LLMs demonstrate varied performance on liver cancer–related queries. While GPT-4 and Kimi show promise in clinical applicability, limitations in accuracy and consistency, particularly for complex medical decisions, underscore the need for domain-specific optimization before clinical integration.
Trial Registration: Not applicable.
Keywords: large language models, liver cancer, clinical applicability, ChatGPT, medical chatbot