ISSN 2071-8594

Russian Academy of Sciences


Gennady Osipov

F.N. Solovyev, A.M. Chepovskiy. An extension of the short text language identification model


This work addresses the problem of natural language identification in short texts. A Bayesian classifier is employed. We extend the language identification model by incorporating additional Cyrillic-script languages of the small nations of Russia.
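To illustrate the general idea of a Bayesian classifier for short-text language identification, the following is a minimal sketch and not the authors' model. It assumes character trigrams as features, a uniform prior over languages, and add-one smoothing. The class name NaiveBayesLangId, the helper char_ngrams, and the tiny Russian/English training set are hypothetical choices made only for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangId:
    """Toy naive Bayes language identifier over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def fit(self, samples):
        """Train on an iterable of (language, text) pairs."""
        for lang, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}

    def log_score(self, lang, text):
        """Log-likelihood of the text under one language, with add-one smoothing.

        The prior is assumed uniform, so it is omitted from the score.
        """
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        counts, total = self.counts[lang], self.totals[lang]
        return sum(
            math.log((counts[g] + 1) / (total + v))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        """Return the language with the highest smoothed log-likelihood."""
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


# Hypothetical toy training data; a real model needs a large corpus per language.
model = NaiveBayesLangId(n=3)
model.fit([
    ("ru", "привет, как дела сегодня"),
    ("ru", "это короткий текст на русском языке"),
    ("en", "hello, how are you today"),
    ("en", "this is a short text in english"),
])

print(model.predict("короткий привет"))
print(model.predict("a short hello"))
```

Under these assumptions, adding a further language amounts to calling fit with labelled samples for it. Telling apart closely related languages that share a script would need far more training text than this toy set provides.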


Statistical language model, natural language identification, languages of Russian small nations.

Pp. 21-26

