This article is cited in 1 scientific paper (total in 1 paper)
Multicriteria method for detecting near-duplicates in a stream of text messages
A. Andreev, D. Berezkin, I. Kozlov, K. Simakov
Bauman Moscow State Technical University, 5 Baumanskaya 2nd Str., Moscow 105005, Russian Federation
Abstract:
The problem of near-duplicate detection in a stream of text messages is considered. A model of a text document and a multicriteria duplicate identification method are proposed. The model can be flexibly adjusted to different domains. The method is based on binary classification using a support vector machine. The paper also provides a candidate prefiltering method to keep the approach efficient. Several experiments were carried out on data obtained from a stream of news articles. The results demonstrate the feasibility of the suggested approach.
Keywords:
near-duplicate detection; similarity measure; binary classification.
Received: 30.12.2014
Citation:
A. Andreev, D. Berezkin, I. Kozlov, K. Simakov, “Multicriteria method for detecting near-duplicates in a stream of text messages”, Sistemy i Sredstva Inform., 25:1 (2015), 34–53
Linking options:
https://www.mathnet.ru/eng/ssi392
https://www.mathnet.ru/eng/ssi/v25/i1/p34