BJU International, vol. 136, no. 5, pp. 937-945, 2025 (SCI-Expanded)
Objective: To evaluate the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations, as the integration of large language models (LLMs) into healthcare has gained increasing attention, raising questions about their applications and limitations.

Materials and Methods: A total of 35 European Board of Urology questions were posed to five LLMs (ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5) using a standardised prompt that was systematically designed and applied across all models. Accuracy was calculated with Cohen's kappa for each model. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman–Liau, Simple Measure of Gobbledygook, and Automated Readability Index scores, while understandability was determined from residents' ratings on a Likert scale.

Results: The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), while DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating.

Conclusion: The chatbots demonstrated varying strengths across tasks. DeepSeek-R1, despite its recent release, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.
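
For readers who want to reproduce the style of analysis described in the Methods, the minimal sketch below shows how per-model accuracy, chance-corrected agreement with the answer key (Cohen's kappa), and the named readability indices could be computed. This is not the study's actual pipeline: the answer data are hypothetical placeholders, and the scikit-learn and textstat packages are assumptions rather than tools reported by the authors.

```python
# Illustrative sketch only; answers below are hypothetical placeholders, not study data.
# Assumes responses are encoded as option letters and that scikit-learn and textstat
# (both assumptions, not reported by the authors) are installed.
from sklearn.metrics import cohen_kappa_score
import textstat

answer_key = ["A", "C", "B", "D", "E"]          # hypothetical key for 5 of the 35 EBU questions
model_answers = {
    "ChatGPT-4o":  ["A", "C", "B", "D", "A"],   # hypothetical model responses
    "DeepSeek-R1": ["A", "C", "D", "D", "E"],
}

for model, answers in model_answers.items():
    # Raw accuracy: proportion of items matching the answer key
    accuracy = sum(a == k for a, k in zip(answers, answer_key)) / len(answer_key)
    # Cohen's kappa: agreement with the key corrected for chance
    kappa = cohen_kappa_score(answer_key, answers)
    print(f"{model}: accuracy={accuracy:.1%}, Cohen's kappa={kappa:.3f}")

# Readability indices (a subset of those used in the study) applied to a
# model-generated explanation; the text here is a hypothetical example.
explanation = "Radical cystectomy with pelvic lymph node dissection remains the standard of care."
print("Flesch Reading Ease:", textstat.flesch_reading_ease(explanation))
print("Gunning Fog:", textstat.gunning_fog(explanation))
print("SMOG:", textstat.smog_index(explanation))
print("Coleman-Liau:", textstat.coleman_liau_index(explanation))
print("Automated Readability Index:", textstat.automated_readability_index(explanation))
```

In practice, the same loop would be run over all 35 questions and all five models, and residents' Likert ratings of understandability would be analysed separately.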