Evaluation of Classification Models for Language Processing

Kilimci Z. H., Ganiz M. C.

International Symposium on Innovations in Intelligent SysTems and Applications (INISTA 2015), Madrid, Spain, 2-4 September 2015, pp. 454-461


Naive Bayes is a commonly used algorithm in text categorization because of its easy implementation and low complexity. It has two main event models for text categorization: the multivariate Bernoulli model and the multinomial model. Many studies choose the multinomial model with Laplace smoothing simply on the assumption that it outperforms the multivariate Bernoulli model under almost all conditions. This study aims to shed some light on this widely adopted assumption by analyzing Naive Bayes event models and smoothing methods from a different perspective. To clarify the difference between the event models of Naive Bayes, their classification performance is compared on datasets in two languages, English and Turkish. The results of our extensive experiments demonstrate that the superior performance of the multinomial model is not observed in every case. Conversely, the multivariate Bernoulli model can perform well across different training set sizes when combined with an appropriate smoothing method.
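To make the distinction between the two event models concrete, the following is a minimal sketch and not the implementation used in the study. It assumes add-alpha smoothing (Laplace smoothing when alpha = 1); the function names, the toy documents, and the class labels are all hypothetical and chosen only for illustration. The multinomial model scores each token occurrence in a document, whereas the multivariate Bernoulli model scores the presence or absence of every vocabulary word.

```python
import math
from collections import Counter


def train(docs, labels, alpha=1.0):
    """Collect the per-class statistics needed by both event models.

    docs is a list of token lists; labels is a parallel list of class names.
    alpha is the add-alpha smoothing constant (alpha=1 is Laplace smoothing).
    """
    vocab = sorted({w for doc in docs for w in doc})
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    term_counts = {c: Counter() for c in classes}  # token occurrences per class
    doc_counts = {c: Counter() for c in classes}   # documents containing a word
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)
        doc_counts[c].update(set(doc))
    return {
        "vocab": vocab, "classes": classes, "n_docs": n_docs,
        "term_counts": term_counts, "doc_counts": doc_counts,
        "alpha": alpha, "total_docs": len(docs),
    }


def score_multinomial(model, doc, c):
    """Log-score of class c: each token occurrence contributes one factor."""
    alpha, vocab = model["alpha"], model["vocab"]
    total = sum(model["term_counts"][c].values())
    score = math.log(model["n_docs"][c] / model["total_docs"])
    for w in doc:
        if w in model["term_counts"][c] or w in vocab:
            p = (model["term_counts"][c][w] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
    return score


def score_bernoulli(model, doc, c):
    """Log-score of class c: every vocabulary word contributes, present or not."""
    alpha = model["alpha"]
    present = set(doc)
    score = math.log(model["n_docs"][c] / model["total_docs"])
    for w in model["vocab"]:
        # Smoothed probability that a document of class c contains w.
        p = (model["doc_counts"][c][w] + alpha) / (model["n_docs"][c] + 2 * alpha)
        score += math.log(p if w in present else 1.0 - p)
    return score


def predict(model, doc, scorer):
    """Return the class with the highest log-score under the given event model."""
    return max(model["classes"], key=lambda c: scorer(model, doc, c))


# Hypothetical toy corpus, for illustration only.
docs = [
    ["goal", "match", "team"],
    ["team", "win", "goal", "goal"],
    ["election", "vote", "party"],
    ["party", "vote", "minister"],
]
labels = ["sports", "sports", "politics", "politics"]
model = train(docs, labels, alpha=1.0)
```

With this sketch, `predict(model, ["goal", "team"], score_multinomial)` and `predict(model, ["goal", "team"], score_bernoulli)` can be compared directly, and varying `alpha` or the number of training documents gives a rough feel for how the smoothing choice and training set size interact with each event model.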