Published Paper

Evaluating the Accuracy and Completeness of Large Language Models in Vitiligo Patient Education: A Comparative Analysis


Authors: Su J, Yang X, Li X, Chen J, Jiang C, Wang Y, Zhuang L, Li H

Received 10 July 2025

Accepted for publication 30 September 2025

Published 23 October 2025 Volume 2025:18 Pages 2757–2767

DOI https://doi.org/10.2147/CCID.S552979

Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 3

Editor who approved publication: Dr Monica K. Li

Jieyan Su,1 Xi Yang,1 Xiangying Li,1 Jiaxuan Chen,2 Caixin Jiang,2 Yi Wang,1 Le Zhuang,1,* Hang Li3–5,*

1Department of Dermatology, Central Hospital Affiliated to Shandong First Medical University, Jinan, People’s Republic of China; 2School of Clinical Medicine, Shandong Second Medical University, Weifang, People’s Republic of China; 3Department of Dermatology, Peking University First Hospital, Beijing, People’s Republic of China; 4National Clinical Research Center for Skin and Immune Diseases, Beijing, People’s Republic of China; 5NMPA Key Laboratory for Quality Control and Evaluation of Cosmetics, Beijing, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Hang Li, Peking University First Hospital, No. 8, Xishiku Street, Xicheng District, Beijing, 100034, People’s Republic of China, Tel +8613693058190, Fax +8601083572350, Email drlihang@126.com; Le Zhuang, Central Hospital Affiliated to Shandong First Medical University, No. 105, Jiefang Road, Jinan, 250013, People’s Republic of China, Tel +8615966301378, Fax +86053155739999, Email zhuangle@sdu.edu.cn

Background: Vitiligo causes significant psychological stress, creating a strong demand for accessible educational resources beyond clinical settings. This demand remains largely unmet. Large language models (LLMs) have the potential to bridge this gap by enhancing patient education. However, uncertainty remains about whether LLMs can accurately address individualized patient inquiries and whether these capabilities vary across models.
Purpose: This study aims to evaluate the applicability, accuracy, and potential limitations of OpenAI o1, DeepSeek-R1, and Grok 3 for vitiligo patient education.
Methods: Three dermatology experts first developed sixteen vitiligo-related questions based on common patient concerns, which were categorized as descriptive or recommendatory with basic and advanced levels. The responses from the three LLMs were then evaluated by three vitiligo-specialized dermatologists for accuracy, comprehensibility, and relevance using a Likert scale. Additionally, three patients rated the comprehensibility of the responses, and a readability analysis was performed.
Results: All three LLMs demonstrated satisfactory accuracy, comprehensibility, and completeness, although their performance varied. They achieved 100% accuracy in responding to basic descriptive questions but exhibited inconsistency when addressing complex recommendatory queries, particularly regarding treatment recommendations for specific populations. Pairwise comparisons indicated that DeepSeek-R1 outperformed OpenAI o1 in accuracy scores (p = 0.042), while no significant difference was observed compared to Grok 3 (p = 0.157). Readability assessments revealed elevated reading difficulty across all models, with DeepSeek-R1 exhibiting the lowest readability (mean Flesch Reading Ease score of 19.7; pairwise comparisons showed DeepSeek-R1 scores were significantly lower than those of OpenAI o1 and Grok 3, both p < 0.01), potentially reducing accessibility for diverse patient populations.
Conclusion: Reasoning LLMs demonstrate high accuracy in responding to simple vitiligo-related questions, but the quality of treatment recommendations declines as question complexity increases. Current models exhibit errors when providing vitiligo treatment advice, necessitating enhanced filtering mechanisms by developers and mandatory human oversight for medical decision-making.
Plain Language Summary: This study looked at how well three advanced chatbots (OpenAI o1, DeepSeek R1, and Grok 3) answer questions about vitiligo, a skin condition that causes patches of skin to lose color. Vitiligo can be stressful, and patients often need clear, accurate information at home. We tested these chatbots to see if they could provide reliable and easy-to-understand answers. Three skin experts created 16 questions about vitiligo, covering basic facts and treatment advice. The chatbots’ answers were rated by experts and patients for accuracy, clarity, and relevance. All three chatbots did well overall, scoring high on accuracy and completeness, especially for simple questions. DeepSeek R1 was the most accurate, while OpenAI o1 and Grok 3 were easier to read. However, the chatbots sometimes gave wrong advice, especially about treatments for specific groups such as children or pregnant women, and certain answers may be hard for some users to read. The study shows that these chatbots can help educate people about vitiligo, especially in areas with limited access to doctors. But they are not perfect and cannot replace expert medical advice. Improvements are needed to make their answers more accurate and easier to understand. In the future, chatbots could support doctors by providing patients with reliable information, but they should not be the main source of medical guidance.

Keywords: large language models, ChatGPT, DeepSeek, Grok, vitiligo, patient education