R. V. Kuznetsova, O. Yu. Bakhteev, Yu. V. Chekhovich, “Methods of cross-lingual text reuse detection in large textual collections”, Inform. Primen., 15:1 (2021), 30

Informatika i Ee Primeneniya [Informatics and its Applications]

Informatika i Ee Primeneniya [Informatics and its Applications], 2021, Volume 15, Issue 1, Pages 30–41
DOI: https://doi.org/10.14357/19922264210105 (Mi ia709)

This article is cited in 3 scientific papers (total in 3 papers)

Methods of cross-lingual text reuse detection in large textual collections

R. V. Kuznetsova^a, O. Yu. Bakhteev^b,a, Yu. V. Chekhovich^c

^a Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation
^b Antiplagiat Co., 42-1 Bolshoy Blvd., Moscow 121205, Russian Federation
^c A. A. Dorodnicyn Computing Center, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The paper investigates the problem of cross-lingual text reuse detection. It proposes reducing the task to a monolingual one: the suspicious document is translated into the language of the collection and then analyzed monolingually. A key requirement for the proposed method is robustness to machine translation ambiguity. The subsequent document analysis consists of two steps. In the first step, the authors retrieve candidate documents that are likely to be the source of the reused text. For robustness, the authors propose to retrieve these documents using word clusters constructed with distributional semantics. In the second step, the authors compare the suspicious document with the candidates using sentence embeddings produced by deep neural networks. The experiments were conducted for the English–Russian language pair, both on synthetic data and on articles included in the Russian Science Citation Index.
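
The two-step analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word-to-cluster map `WORD_CLUSTER` is hard-coded rather than learned from distributional semantics, the `embed` function is a toy cluster-count vector standing in for a neural sentence embedding, and all names, the shingle size, and the thresholds are assumptions chosen for illustration. The machine translation step is assumed to have already happened.

```python
import math
from collections import Counter

# Hypothetical word-to-cluster map. The abstract says clusters are built with
# distributional semantics; here they are hard-coded purely for illustration.
WORD_CLUSTER = {
    "car": 0, "automobile": 0, "vehicle": 0,
    "fast": 1, "quick": 1, "rapid": 1,
    "drives": 2, "moves": 2, "goes": 2,
    "cat": 3, "sleeps": 4,
}
NUM_CLUSTERS = max(WORD_CLUSTER.values()) + 1


def to_clusters(text):
    """Map each known word to its cluster id; unknown words are dropped."""
    return [WORD_CLUSTER[w] for w in text.lower().split() if w in WORD_CLUSTER]


def shingles(ids, n=2):
    """Return the set of n-grams over a sequence of cluster ids."""
    return {tuple(ids[i:i + n]) for i in range(len(ids) - n + 1)}


def retrieve_candidates(suspicious, collection, n=2, min_overlap=1):
    """Step 1 (assumed form): keep documents sharing enough cluster shingles."""
    query = shingles(to_clusters(suspicious), n)
    return [
        doc_id
        for doc_id, text in collection.items()
        if len(query & shingles(to_clusters(text), n)) >= min_overlap
    ]


def embed(sentence):
    """Toy stand-in for a sentence embedding: a vector of cluster counts."""
    counts = Counter(to_clusters(sentence))
    return [counts.get(c, 0) for c in range(NUM_CLUSTERS)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compare(suspicious, candidate, threshold=0.8):
    """Step 2 (assumed form): flag a pair whose embeddings are close enough."""
    return cosine(embed(suspicious), embed(candidate)) >= threshold


collection = {
    "doc1": "the automobile goes quick",
    "doc2": "the cat sleeps",
}
translated = "the car moves fast"  # pretend output of machine translation

candidates = retrieve_candidates(translated, collection)
flagged = [d for d in candidates if compare(translated, collection[d])]
```

Because retrieval works on cluster ids rather than surface words, the translated sentence still matches `doc1` even though the two share no content words, which is the kind of robustness to translation ambiguity the abstract calls for.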

Keywords: natural language processing, machine translation, deep learning, cross-lingual text reuse detection, distributional semantics.

Funding agency	Grant number
Russian Foundation for Basic Research	18-07-01441_а
Foundation for Assistance to Small Innovative Enterprises in Science and Technology	44116
This research was supported by the RFBR (project 18-07-01441) and the Foundation for Assistance to Small Innovative Enterprises in Science and Technology (project 44116).

Received: 19.03.2020

Document Type: Article

Language: Russian

Citation: R. V. Kuznetsova, O. Yu. Bakhteev, Yu. V. Chekhovich, “Methods of cross-lingual text reuse detection in large textual collections”, Inform. Primen., 15:1 (2021), 30–41

Citation in format AMSBIB

\Bibitem{KuzBakChe21}
\by R.~V.~Kuznetsova, O.~Yu.~Bakhteev, Yu.~V.~Chekhovich
\paper Methods of cross-lingual text reuse detection in~large textual collections
\jour Inform. Primen.
\yr 2021
\vol 15
\issue 1
\pages 30--41
\mathnet{http://mi.mathnet.ru/ia709}
\crossref{https://doi.org/10.14357/19922264210105}

Linking options:

https://www.mathnet.ru/eng/ia709

https://www.mathnet.ru/eng/ia/v15/i1/p30
