ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

S.N. Karpovich, A.V. Smirnov, N.N. Teslya. Text documents classification based on probabilistic topic model

Abstract.

The paper proposes an approach to classifying text documents with a probabilistic topic model when the training set consists of instances of a single class. The approach selects positive instances, that is, documents similar to the given class, from text collections and document streams. The paper reviews models that are trained on instances of one class and applied to text classification, and identifies their key features. It presents the classification model Positive Example Based Learning-TM and a software prototype that classifies text documents with this model. The proposed model and existing models were evaluated on the SCTM-ru text corpus. The developed model demonstrates high classification accuracy that exceeds that of the alternative approaches. The experiments show that Positive Example Based Learning-TM is superior in classification accuracy when the training set is small.

Keywords:

classification, binary classification, topic model, natural language processing.

Pp. 69-77.

DOI 10.14357/20718594180317
