ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

V. V. Zhebel, S.-N. A. Zharikova, I. V. Sochenkov

Feature Selection for Text Classification of a News Flows based on Topical Importance Characteristic

Abstract.

The paper presents an approach to ranking the most valuable features for the text classification task. The proposed Topical Importance Characteristic drives a feature selection method that uses information about how words or phrases are distributed across topics. We compare this method with the well-known TF-IDF approach and use the proposed word-ranking scheme in two classifiers: Random Forest and Multinomial Naïve Bayes. Classification accuracy was evaluated on the “20-Newsgroups” dataset. The developed approach outperforms TF-IDF-based methods and matches the accuracy achieved on the same dataset by more powerful state-of-the-art approaches such as SVC.
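
The idea of ranking words by their distribution across topics can be illustrated with a small sketch. The abstract does not give the formula of the Topical Importance Characteristic, so the score below is only a stand-in chosen for illustration: one minus the normalized entropy of a word's topic distribution. The function name `topic_concentration_ranking` and the toy corpus are likewise assumptions, not part of the paper.

```python
import math
from collections import Counter, defaultdict


def topic_concentration_ranking(docs, labels):
    """Rank words by how unevenly they are spread across topics.

    Stand-in score: 1 - H(topic | word) / log(number of topics),
    where H is the Shannon entropy. This is NOT the paper's
    Topical Importance Characteristic; it only illustrates the idea
    of scoring a word by its distribution over topics.
    """
    topics = sorted(set(labels))
    if len(topics) < 2:
        raise ValueError("at least two topics are required")

    # word -> Counter(topic -> number of occurrences)
    counts = defaultdict(Counter)
    for text, topic in zip(docs, labels):
        for word in text.lower().split():
            counts[word][topic] += 1

    scores = {}
    for word, per_topic in counts.items():
        total = sum(per_topic.values())
        entropy = -sum(
            (c / total) * math.log(c / total) for c in per_topic.values()
        )
        scores[word] = 1.0 - entropy / math.log(len(topics))

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical data, two topics).
docs = [
    "the rocket reached orbit",
    "the team won the game",
    "the orbit of the moon",
    "the game went to overtime",
]
labels = ["space", "sport", "space", "sport"]

ranking = topic_concentration_ranking(docs, labels)
for word, score in ranking:
    print(f"{word:10s} {score:.3f}")
```

In this toy corpus, "the" occurs equally often in both topics and falls to the bottom of the ranking, while words confined to a single topic rise to the top. A score this simple also rewards words that occur only once, so a practical scheme would need to account for frequency as well.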

Keywords:

topical text classification, machine learning, topical importance characteristic, 20-Newsgroups.

PP. 52-59.

DOI 10.14357/20718594190306
