ISSN 2071-8594

Russian Academy of Sciences


Gennady Osipov

R.E. Suvorov, A.O. Shelmanov, M.A. Kamenskaya, I.V. Smirnov. Information Extraction from Scientific Texts Using Active Machine Learning


The paper addresses the task of extracting information from natural language texts using machine learning methods. Building an information extraction system based on machine learning usually requires large annotated text corpora. A second problem that arises when developing such systems is feature engineering. To address the first problem, we propose information extraction methods based on active machine learning. To address the second, we investigate methods for generating the feature space from the results of deep linguistic analysis. Experiments with the proposed methods showed that active learning significantly reduces the labor required to create an information extraction system while maintaining the quality of the trained models, and that generating the feature space from the results of deep linguistic analysis improves the quality of information extraction models.
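As a rough illustration of the general idea of active learning mentioned in the abstract, the sketch below shows a generic pool-based loop with least-confidence sampling. It is not the authors' method: the nearest-centroid classifier, the least-confidence query strategy, and all names (`train_centroids`, `class_probabilities`, `least_confident`, `active_learning`, `oracle`, `budget`) are assumptions chosen only to keep the example small and self-contained.

```python
import math

def train_centroids(labeled):
    """Compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def class_probabilities(centroids, x):
    """Softmax over negative Euclidean distances from x to each centroid."""
    scores = {y: -math.dist(x, c) for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def least_confident(centroids, pool):
    """Index of the pool item whose most probable class has the lowest probability."""
    return min(
        range(len(pool)),
        key=lambda i: max(class_probabilities(centroids, pool[i]).values()),
    )

def active_learning(seed, pool, oracle, budget):
    """Query the annotator (oracle) only for the examples the model is least sure about."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(budget, len(pool))):
        centroids = train_centroids(labeled)  # retrain on everything labeled so far
        x = pool.pop(least_confident(centroids, pool))
        labeled.append((x, oracle(x)))        # one manual annotation is spent here
    return train_centroids(labeled), labeled
```

In this sketch `oracle` stands in for the human annotator and `budget` caps the number of labels requested, which is where the labor saving of active learning would come from: only the examples the current model is least certain about are sent for annotation.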


Keywords: information extraction, deep linguistic analysis, active machine learning, multipurpose feature engineering, scientific text processing.

PP. 40-52.

