Performance Comparison of Multimodal Vision-Language Models in Classifying Turkish Dishes


Bıçakçı Y. S.

Sakarya University Journal of Computer and Information Sciences (Online), vol. 9, no. 1, pp. 119-133, 2026 (Scopus, TRDizin)

Abstract

This study presents the first comprehensive benchmark of seven open-source multimodal vision-language models with Turkish language support (Aya Vision 32B, Gemma 3 27B, InternVL3 38B, Qwen2-VL 72B-AWQ, Qwen2.5-VL 72B-AWQ, Cosmos-LLaVA, and Phi-4 Multimodal) on two image datasets of Turkish cuisine, TurkishFoods-15 and TurkishFoods-25. All models were evaluated zero-shot, without additional training or fine-tuning, using fully standardized Turkish system and user prompts. We report accuracy together with macro- and weighted-averaged precision, recall, and F1-score, along with end-to-end inference time. Aya Vision 32B achieved the best weighted F1-score (85.9%) on TurkishFoods-15, whereas Gemma 3 27B led on TurkishFoods-25 (76.7%). Across metrics and datasets, Aya Vision 32B, Gemma 3 27B, Qwen2-VL 72B-AWQ, and InternVL3 38B emerged as the most reliable models. These results establish a solid reference for future work on culturally aware multimodal AI and demonstrate, for the first time, that vision-language models can classify Turkish dishes without task-specific training.
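The evaluation protocol can be summarized as a simple loop: each image is sent to the model under test with the standardized Turkish prompts, and the predicted labels are scored against the gold labels. Below is a minimal sketch of that pipeline, assuming scikit-learn for the metrics; the prompt strings, the classify_dish() wrapper, and the evaluate() helper are illustrative placeholders, not the paper's actual code.

```python
# Sketch of a zero-shot evaluation pipeline for dish classification.
# Assumptions (not from the paper): classify_dish() stands in for the
# model-specific inference call, and the Turkish prompts are illustrative.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

SYSTEM_PROMPT = "Sen Türk mutfağı konusunda uzman bir asistansın."          # illustrative
USER_PROMPT = "Bu fotoğraftaki Türk yemeğinin adı nedir? Tek bir sınıf adı ver."  # illustrative


def classify_dish(image_path: str) -> str:
    """Placeholder: send the image plus the standardized Turkish system/user
    prompts to the vision-language model under test and return its predicted
    class label (zero-shot, no fine-tuning)."""
    raise NotImplementedError("wire up the model-specific inference call here")


def evaluate(samples: list[tuple[str, str]]) -> dict[str, float]:
    """samples: (image_path, gold_label) pairs, e.g. from TurkishFoods-15."""
    y_true = [label for _, label in samples]
    y_pred = [classify_dish(path) for path, _ in samples]
    results = {"accuracy": accuracy_score(y_true, y_pred)}
    # Report both averaging schemes, as in the paper.
    for avg in ("macro", "weighted"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0
        )
        results.update({f"{avg}_precision": p, f"{avg}_recall": r, f"{avg}_f1": f1})
    return results
```

Keeping the prompts and the metric computation fixed across all seven models is what makes the per-model scores directly comparable; only the inference call inside classify_dish() would change between systems.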