ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Academician S. V. Emelyanov

A. O. Shelmanov, V. A. Isakov, M. A. Stankevich, I. V. Smirnov. "Open Information Extraction from Texts. Part I: Problem Statement and Survey of Methods"

Abstract.

The paper presents the problem statement of open information extraction. It gives an analytical survey of work in this area, together with a survey of related work on extracting entities and semantic relations from texts using semi-supervised and unsupervised machine learning. It also considers prospective research directions in open information extraction from texts that are based on unsupervised machine learning.

Keywords:

open information extraction, semantic relations, term extraction, unsupervised machine learning, semi-supervised machine learning.

pp. 47-61.
