Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?


Asker O. F., Recai M. S., Genç Y. E., Doğan K. A., Şener T. E., Şahin B.

BJU International, vol. 136, no. 5, pp. 937-945, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 136 Issue: 5
  • Publication Date: 2025
  • DOI: 10.1111/bju.16873
  • Journal Name: BJU International
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, BIOSIS, CAB Abstracts, EMBASE, Gender Studies Database, MEDLINE, Public Affairs Index
  • Page Numbers: pp. 937-945
  • Keywords: artificial intelligence, ChatGPT, DeepSeek, exam, large language model, urology
  • Marmara University Affiliated: Yes

Abstract

Objective: As the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations, this study evaluated the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations.

Materials and Methods: A total of 35 European Board of Urology questions were posed to five LLMs, ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5, using a standardised prompt that was systematically designed and applied across all models. Accuracy was assessed with Cohen's kappa for each model. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined from residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), while DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating of all models.

Conclusion: The chatbots demonstrated varying strengths across the different tasks. DeepSeek-R1, despite having only recently been released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
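As an illustration of the metrics named above, the sketch below shows how agreement with an answer key (Cohen's kappa), raw accuracy, a simple calibration-error figure, and the listed readability indices could be computed in Python. This is a minimal sketch, not the study's actual pipeline: the answer key, model answers, confidence values, and example explanation text are hypothetical placeholders, scikit-learn and textstat are assumed as the metric libraries, and the calibration error shown is one common formulation (mean gap between stated confidence and correctness) that may differ from the paper's definition.

```python
# Minimal sketch of the kinds of metrics reported in the abstract.
# All data below are hypothetical placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical multiple-choice answer key and one model's answers (A-E).
answer_key    = ["A", "C", "B", "D", "E", "A", "C", "B"]
model_answers = ["A", "C", "B", "D", "A", "A", "C", "D"]

# Agreement with the answer key and raw accuracy.
kappa = cohen_kappa_score(answer_key, model_answers)
correct = [k == m for k, m in zip(answer_key, model_answers)]
accuracy = sum(correct) / len(correct)
print(f"Cohen's kappa: {kappa:.3f}, raw accuracy: {accuracy:.1%}")

# One common calibration-error formulation (assumed here): mean absolute
# gap between the model's stated confidence and whether it was correct.
confidences = [0.9, 0.8, 0.95, 0.7, 0.6, 0.85, 0.9, 0.5]  # hypothetical
calibration_error = sum(abs(c - int(ok)) for c, ok in zip(confidences, correct)) / len(correct)
print(f"Calibration error: {calibration_error:.1%}")

# Readability of a model's free-text explanation, using the indices
# named in the abstract (example text is invented for illustration).
explanation = (
    "Alpha-blockers relax smooth muscle in the prostate and bladder neck, "
    "improving urinary flow in patients with benign prostatic hyperplasia."
)
print("Flesch Reading Ease:", textstat.flesch_reading_ease(explanation))
print("Gunning Fog:", textstat.gunning_fog(explanation))
print("Coleman-Liau:", textstat.coleman_liau_index(explanation))
print("SMOG:", textstat.smog_index(explanation))
print("ARI:", textstat.automated_readability_index(explanation))
```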