ISSN 2071-8594

Russian Academy of Sciences

Editor-in-Chief

Gennady Osipov

V.N. Zakharov, A.A. Khoroshilov, A.A. Khoroshilov. A method for detecting implicit plagiarism in scientific and technical texts

Abstract.

The paper addresses automatic detection of implicit plagiarism in documents. Detection is based on comparing formalized representations of the documents and computing two measures: the local semantic similarity of concepts and the global semantic similarity of text fragments. To solve this problem, we developed a model of the semantic structure of texts, methods for formalizing the compared texts and measuring their semantic proximity, and methods for identifying text fragments with similar semantic structure. The main advantage of the approach is that it detects different kinds of plagiarism, including the most complex cases of implicit plagiarism. The results were compared with those obtained by the "shingles" method, and the proposed method proved highly effective.
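For orientation, the "shingles" baseline mentioned above can be sketched as follows. This is a minimal illustration of the baseline only, not the authors' semantic method. It assumes one common formulation: word-level shingles of length w = 3, compared by Jaccard resemblance. The function names, the shingle length, and the sample sentences are illustrative assumptions, not taken from the paper.

```python
import re


def shingles(text, w=3):
    """Return the set of overlapping w-word shingles of a text.

    The shingle length w=3 is an assumed, illustrative value.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # A text shorter than one shingle is treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance of two texts over their w-shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample sentences, invented for this sketch.
source = "the method detects implicit plagiarism in scientific texts"
copy = "the method detects implicit plagiarism in scientific texts"
paraphrase = "hidden borrowing in research papers is found by this approach"

print(resemblance(source, copy))
print(resemblance(source, paraphrase))
```

A verbatim copy shares every shingle with its source, while a paraphrase that keeps the meaning but changes the wording shares none. That gap is the motivation for comparing semantic structure rather than surface word sequences.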

Keywords:

plagiarism detection, automated text processing, formal description of text, semantic structure, linguistic software, declarative means.

PP. 10-20
