ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

A.O. Shelmanov, V.A. Isakov, M.A. Stankevich, I.V. Smirnov. Open Information Extraction. Part I. The Task and the Review of the State of the Art

Abstract.

The paper discusses the task of open information extraction from natural language texts. Open information extraction is a relatively new approach to information extraction that does not specify in advance the structure and semantics of the information to be extracted. The approach is domain-independent and does not require large annotated corpora. We formulate the problem and review the state of the art in extracting entities and semantic relations from texts, including information extraction methods based on semi-supervised and unsupervised learning. We also outline future research directions for relation extraction methods based on unsupervised learning.

Keywords:

open information extraction, semantic relations, term extraction, unsupervised learning, semi-supervised learning.

PP. 47-61.
