This article is cited in 3 scientific papers
Cross-lingual similar document retrieval methods
D. V. Zubarev, I. V. Sochenkov
Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
Abstract:
In this paper, we compare methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step that maps the top keywords from one language to the other using cross-lingual word embeddings. We evaluate all approaches on Russian-English aligned Wikipedia articles. Our experiments show that the inverted-index approach achieves better recall and mean average precision (MAP) than the other methods.
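The inverted-index approach mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-dimensional vectors, the toy vocabulary, the frequency-based keyword selection, and all function names (`top_keywords`, `map_keyword`, `build_index`, `retrieve`) are assumptions made purely for illustration. The sketch picks the top keywords of a Russian query document, maps each to its nearest English word in a shared embedding space, and looks the mapped words up in an inverted index over English documents.

```python
import math
from collections import Counter, defaultdict

# Toy cross-lingual word embeddings in a shared space (hypothetical values).
RU_VECS = {
    "кошка": [0.9, 0.1, 0.0],
    "собака": [0.1, 0.9, 0.0],
    "поиск": [0.0, 0.1, 0.9],
}
EN_VECS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "search": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_keywords(tokens, k):
    """Pick the k most frequent tokens (a stand-in for a real keyword weighting)."""
    return [token for token, _ in Counter(tokens).most_common(k)]


def map_keyword(ru_word, n=1):
    """Return the n English words closest to ru_word in the shared space."""
    if ru_word not in RU_VECS:
        return []
    ranked = sorted(
        EN_VECS,
        key=lambda en_word: cosine(RU_VECS[ru_word], EN_VECS[en_word]),
        reverse=True,
    )
    return ranked[:n]


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def retrieve(ru_tokens, index, k=2):
    """Rank English documents by how many mapped query keywords they contain."""
    scores = Counter()
    for keyword in top_keywords(ru_tokens, k):
        for en_word in map_keyword(keyword):
            for doc_id in index.get(en_word, ()):
                scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common()]


# A tiny English collection (hypothetical data).
EN_DOCS = {
    "en1": ["cat", "dog", "cat"],
    "en2": ["search", "engine"],
    "en3": ["dog", "park"],
}
INDEX = build_index(EN_DOCS)

ranking = retrieve(["кошка", "кошка", "собака", "собака", "поиск"], INDEX, k=2)
```

In this toy setup, `ranking` places `en1` ahead of `en3`, because `en1` contains both mapped keywords. A realistic system would replace the frequency-based keyword selection and the match-count scoring with proper term weighting, and would use pretrained cross-lingual embeddings rather than hand-written vectors.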
Keywords:
cross-lingual document retrieval, cross-lingual plagiarism detection, cross-lingual word embeddings.
Citation:
D. V. Zubarev, I. V. Sochenkov, “Cross-lingual similar document retrieval methods”, Proceedings of ISP RAS, 31:5 (2019), 127–136
Linking options:
https://www.mathnet.ru/eng/tisp458
https://www.mathnet.ru/eng/tisp/v31/i5/p127