Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?


Asker O. F., Recai M. S., Genç Y. E., Doğan K. A., Şener T. E., Şahin B.

BJU International, vol. 136, no. 5, pp. 937-945, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 136 Issue: 5
  • Publication Date: 2025
  • DOI: 10.1111/bju.16873
  • Journal Name: BJU International
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, BIOSIS, CAB Abstracts, EMBASE, Gender Studies Database, MEDLINE, Public Affairs Index
  • Page Numbers: pp. 937-945
  • Keywords: artificial intelligence, ChatGPT, DeepSeek, exam, large language model, urology
  • Marmara University Affiliated: Yes

Abstract

Objective: As the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations, this study evaluated the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations.

Materials and Methods: A total of 35 European Board of Urology questions were posed to five LLMs, ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5, using a standardised prompt that was systematically designed and applied across all models. Accuracy was assessed with Cohen's kappa for each model. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index, while understandability was determined from residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), while DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating of all models.

Conclusion: The chatbots demonstrated varying strengths across the different tasks. DeepSeek-R1, despite having only recently been released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
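As an illustration of the metrics named above, the sketch below shows how agreement with an answer key (Cohen's kappa), raw accuracy, a simple calibration-error figure, and the listed readability indices could be computed in Python. This is a minimal sketch, not the study's actual pipeline: the answer key, model answers, confidence values, and example explanation text are hypothetical placeholders, scikit-learn and textstat are assumed as the metric libraries, and the calibration error shown is one common formulation (mean gap between stated confidence and correctness) that may differ from the paper's definition.

```python
# Minimal sketch of the kinds of metrics reported in the abstract.
# All data below are hypothetical placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical multiple-choice answer key and one model's answers (A-E).
answer_key    = ["A", "C", "B", "D", "E", "A", "C", "B"]
model_answers = ["A", "C", "B", "D", "A", "A", "C", "D"]

# Agreement with the answer key and raw accuracy.
kappa = cohen_kappa_score(answer_key, model_answers)
correct = [k == m for k, m in zip(answer_key, model_answers)]
accuracy = sum(correct) / len(correct)
print(f"Cohen's kappa: {kappa:.3f}, raw accuracy: {accuracy:.1%}")

# One common calibration-error formulation (assumed here): mean absolute
# gap between the model's stated confidence and whether it was correct.
confidences = [0.9, 0.8, 0.95, 0.7, 0.6, 0.85, 0.9, 0.5]  # hypothetical
calibration_error = sum(abs(c - int(ok)) for c, ok in zip(confidences, correct)) / len(correct)
print(f"Calibration error: {calibration_error:.1%}")

# Readability of a model's free-text explanation, using the indices
# named in the abstract (example text is invented for illustration).
explanation = (
    "Alpha-blockers relax smooth muscle in the prostate and bladder neck, "
    "improving urinary flow in patients with benign prostatic hyperplasia."
)
print("Flesch Reading Ease:", textstat.flesch_reading_ease(explanation))
print("Gunning Fog:", textstat.gunning_fog(explanation))
print("Coleman-Liau:", textstat.coleman_liau_index(explanation))
print("SMOG:", textstat.smog_index(explanation))
print("ARI:", textstat.automated_readability_index(explanation))
```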