Improving classification quality for the task of finding intrinsic plagiarism
I. O. Molybog (a, b), A. P. Motrenko (a), V. V. Strijov (c)
a Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation
b Center for Energy Systems, Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, 3 Nobel Str., Moscow 143026, Russian Federation
c A. A. Dorodnicyn Computing Center, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation
Abstract:
The paper addresses the classification problem in multidimensional spaces. The authors propose a supervised modification of the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Unlike the original algorithm, the proposed modification does not require retraining when new data are added to the training set, and it can be easily parallelized. The method was applied to detect intrinsic plagiarism in a collection of documents. The authors also tested the algorithm on synthetic data and showed that classification quality is higher when it is used for dimension reduction than when no dimension reduction is applied or when other dimension reduction algorithms are used.
Keywords:
data analysis; dimension reduction; nonlinear dimension reduction; manifold learning; intrinsic plagiarism detection.
Received: 20.02.2017
Citation:
I. O. Molybog, A. P. Motrenko, V. V. Strijov, “Improving classification quality for the task of finding intrinsic plagiarism”, Inform. Primen., 11:3 (2017), 60–72
Linking options:
https://www.mathnet.ru/eng/ia486
https://www.mathnet.ru/eng/ia/v11/i3/p60