ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

A.C. Zlatov, A.A. Kuzmin. Thematic model of major conference proceedings

Abstract.

The aim of this paper is to construct a hierarchical thematic model for the abstracts of a major conference. We use a discriminative probabilistic model to cluster abstracts at each level of the hierarchical structure. We propose modifying the discriminative probabilistic model to suit the balanced structure of the conference; the modified models reduce the influence of cluster size. Document clustering is semi-supervised. We construct a thematic model at each level of the conference structure and propose a hierarchical divisive clustering algorithm that builds the hierarchical thematic model from these per-level models. The algorithms are applied to a collection of EURO conference abstracts, and the constructed model is compared with the EURO expert model.

Keywords:

thematic model, hierarchical model, probabilistic thematic model, document clustering, semi-supervised learning.

PP. 77-86.
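
As an illustration only, the idea of hierarchical divisive clustering mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: a plain spherical 2-means split on bag-of-words vectors stands in for the paper's discriminative probabilistic model, no semi-supervision or cluster-size balancing is included, and the toy abstracts and all function names (`vectorize`, `split_in_two`, `divisive_tree`) are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a unit-length bag-of-words vector (word -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}


def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def centroid(vectors):
    """Normalized mean direction of a group of unit vectors."""
    total = Counter()
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {w: x / norm for w, x in total.items()}


def assign(ids, vecs, centres):
    """Assign each document to the more similar of two centres."""
    groups = [[], []]
    for i in ids:
        s0, s1 = (cosine(vecs[i], c) for c in centres)
        groups[0 if s0 >= s1 else 1].append(i)
    return groups


def split_in_two(ids, vecs, iters=10):
    """Deterministic spherical 2-means (a stand-in for the paper's model)."""
    first = ids[0]
    # Seed the second centre with the document least similar to the first.
    second = min(ids[1:], key=lambda i: cosine(vecs[first], vecs[i]))
    groups = assign(ids, vecs, [vecs[first], vecs[second]])
    for _ in range(iters):
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller keeps the node as a leaf
        centres = [centroid([vecs[i] for i in g]) for g in groups]
        new_groups = assign(ids, vecs, centres)
        if new_groups == groups:
            break  # assignments are stable
        groups = new_groups
    return groups


def divisive_tree(ids, vecs, min_size=2, depth=0, max_depth=3):
    """Recursively split a cluster top-down; return the hierarchy as nested dicts."""
    node = {"docs": sorted(ids), "children": []}
    if len(ids) <= max(min_size, 1) or depth >= max_depth:
        return node
    left, right = split_in_two(ids, vecs)
    if left and right:
        node["children"] = [
            divisive_tree(g, vecs, min_size, depth + 1, max_depth)
            for g in (left, right)
        ]
    return node


# Toy abstracts (invented for illustration, not EURO data).
abstracts = [
    "integer programming branch and bound",
    "linear programming simplex method",
    "topic model document clustering",
    "probabilistic topic model text clustering",
]
vecs = {i: vectorize(t) for i, t in enumerate(abstracts)}
tree = divisive_tree(list(vecs), vecs)
```

On these four toy abstracts the root is split once, into the two optimization texts and the two topic-modelling texts, and each child is small enough to remain a leaf. A real hierarchy would replace the 2-means split with the per-level model and stop criteria of choice.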
