ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

F.N. Solovyev, A.M. Chepovskiy. An extension of the short text language identification model

Abstract:

This work addresses the problem of natural language identification in short texts. A Bayesian classifier is employed. We propose an extension of the language identification model that incorporates additional Cyrillic-script languages of the small nations of Russia.

Keywords:

statistical language model, natural language identification, languages of Russian small nations.

PP. 21-26
