Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes


Bıçakçı Y. S.

Sakarya University Journal of Computer and Information Sciences (Online), vol. 9, no. 1, pp. 119-133, 2026 (Scopus, TRDizin)

Abstract

This study presents the first comprehensive benchmark of seven open-source multimodal vision-language models with Turkish language support (Aya Vision 32B, Gemma 3 27B, InternVL3 38B, Qwen2-VL 72B-AWQ, Qwen2.5-VL 72B-AWQ, Cosmos-LLaVA, and Phi-4 Multimodal) on two image datasets of Turkish cuisine, TurkishFoods-15 and TurkishFoods-25. All models were evaluated zero-shot, without additional training or fine-tuning, using fully standardized Turkish system and user prompts. We report accuracy together with macro- and weighted-averaged precision, recall, and F1-score, along with end-to-end inference time. Aya Vision 32B achieved the best weighted F1-score (85.9%) on TurkishFoods-15, whereas Gemma 3 27B led on TurkishFoods-25 (76.7%). Across metrics and datasets, Aya Vision 32B, Gemma 3 27B, Qwen2-VL 72B-AWQ, and InternVL3 38B emerged as the most reliable models. These results establish a solid reference for future work on culturally aware multimodal AI and demonstrate, for the first time, that vision-language models can classify Turkish dishes without task-specific training.
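The evaluation protocol can be summarized as a simple loop: each image is sent to the model under test with the standardized Turkish prompts, and the predicted labels are scored against the gold labels. Below is a minimal sketch of that pipeline, assuming scikit-learn for the metrics; the prompt strings, the classify_dish() wrapper, and the evaluate() helper are illustrative placeholders, not the paper's actual code.

```python
# Sketch of a zero-shot evaluation pipeline for dish classification.
# Assumptions (not from the paper): classify_dish() stands in for the
# model-specific inference call, and the Turkish prompts are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

SYSTEM_PROMPT = "Sen Türk mutfağı konusunda uzman bir asistansın."          # illustrative
USER_PROMPT = "Bu fotoğraftaki Türk yemeğinin adı nedir? Tek bir sınıf adı ver."  # illustrative


def classify_dish(image_path: str) -> str:
    """Placeholder: send the image plus the standardized Turkish system/user
    prompts to the vision-language model under test and return its predicted
    class label (zero-shot, no fine-tuning)."""
    raise NotImplementedError("wire up the model-specific inference call here")


def evaluate(samples: list[tuple[str, str]]) -> dict[str, float]:
    """samples: (image_path, gold_label) pairs, e.g. from TurkishFoods-15."""
    y_true = [label for _, label in samples]
    y_pred = [classify_dish(path) for path, _ in samples]
    results = {"accuracy": accuracy_score(y_true, y_pred)}
    # Report both averaging schemes, as in the paper.
    for avg in ("macro", "weighted"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0
        )
        results.update({f"{avg}_precision": p, f"{avg}_recall": r, f"{avg}_f1": f1})
    return results
```

Keeping the prompts and the metric computation fixed across all seven models is what makes the per-model scores directly comparable; only the inference call inside classify_dish() would change between systems.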