Informatics and Automation
Informatics and Automation, 2021, Volume 20, Issue 4, Pages 869–904
DOI: https://doi.org/10.15622/ia.20.4.5
(Mi trspy1169)
 

This article is cited in 1 scientific paper (total in 1 paper)

Information Security

Optimization approach to selecting methods of detecting anomalies in homogeneous text collections

F. Krasnov, I. Smaznevich, E. Baskakova

NAUMEN R&D
Abstract: The paper considers the problem of detecting anomalous documents in text collections. Existing anomaly detection methods are not universal and do not produce stable results across different data sets. The accuracy of the results depends on the choice of parameters at each step of the algorithm, and different collections have different optimal parameter sets. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity.
The anomaly detection problem is stated as follows: a new document uploaded to an applied intelligent information system must be checked for congruence with a homogeneous collection of documents stored in it. In systems that process legal documents, the anomaly detection methods must satisfy the following requirements: high accuracy, computational efficiency, reproducibility of results, and explainability of the solution. Methods satisfying these conditions are investigated.
The paper examines the possibility of rating text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document relative to the collection is proposed, based on a justified selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods, and parameters of the novelty detection algorithms.
The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the fields of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the distances from documents to the center of the collection and to the foreign document; and optimization of the novelty detection algorithms depending on the vectorization and dimensionality reduction methods. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor, and One-Class SVM (based on the Support Vector Machine).
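As an illustration of the anomaly index mentioned above, the following minimal sketch computes the Hellinger distance between two empirical distributions of distances. It is not the authors' implementation: the function names, the equal-width histogram binning, and the default of 10 bins are assumptions made for this example, and the distance values themselves (from each document to the collection center and to the foreign document) are taken as given inputs.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized equal-width histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp the top edge into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2)


def anomaly_index(dist_to_center, dist_to_foreign, bins=10):
    """Hypothetical anomaly index: Hellinger distance between the histogram
    of document distances to the collection center and the histogram of
    document distances to the foreign document (binning is an assumption)."""
    lo = min(min(dist_to_center), min(dist_to_foreign))
    hi = max(max(dist_to_center), max(dist_to_foreign))
    if hi == lo:
        return 0.0  # all distances identical: the distributions coincide
    p = histogram(dist_to_center, bins, lo, hi)
    q = histogram(dist_to_foreign, bins, lo, hi)
    return hellinger(p, q)
```

Under these assumptions, an index near 1 means the two distance distributions barely overlap, while an index near 0 means the foreign document sits about as far from the documents as the collection center does.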
The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate anomaly detection method for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When documents are vectorized using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.
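The comparison of novelty detectors over a TF-IDF space can be sketched with scikit-learn as follows. This is only an illustrative setup, not the pipeline used in the paper: the function name `novelty_scores`, the dictionary parameters `min_df` and `max_df`, the RBF kernel, and all hyperparameter values are assumptions, and the ARTM topic-model vectorization is not reproduced here.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def novelty_scores(collection, new_doc, min_df=1, max_df=1.0):
    """Fit three novelty detectors on a TF-IDF representation of the
    collection and score one new document with each of them.

    min_df / max_df stand in for the "dictionary parameters": they control
    which terms enter the vocabulary. All values here are illustrative.
    """
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df)
    X = vectorizer.fit_transform(collection)   # sparse document-term matrix
    x_new = vectorizer.transform([new_doc])    # same vocabulary, one row

    detectors = {
        "isolation_forest": IsolationForest(n_estimators=100, random_state=0),
        # novelty=True enables predict/decision_function on unseen data
        "local_outlier_factor": LocalOutlierFactor(n_neighbors=2, novelty=True),
        "one_class_svm": OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
    }

    results = {}
    for name, detector in detectors.items():
        detector.fit(X)
        results[name] = {
            # +1 means "looks like the collection", -1 means "novel"
            "label": int(detector.predict(x_new)[0]),
            # lower scores indicate a more anomalous document
            "score": float(detector.decision_function(x_new)[0]),
        }
    return results


if __name__ == "__main__":
    # Hypothetical toy collection; real collections are far larger.
    docs = [
        "railway track gauge standard for rolling stock",
        "standard for railway signalling and track circuits",
        "rolling stock braking requirements railway standard",
        "railway station platform height standard",
        "track maintenance standard for railway lines",
    ]
    for name, result in novelty_scores(docs, "recipe for apple pie").items():
        print(name, result)
```

In an optimization strategy of the kind the abstract describes, a function like this would be called repeatedly over a grid of vectorizer and detector parameters, keeping the configuration that best separates deliberately planted foreign documents from the collection.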
Keywords: anomaly detection, novelty detection, outlier detection, homogeneous text collections, sparse space dimension reduction, topic modeling.
Document Type: Article
UDC: 004.896
Language: Russian
Citation: F. Krasnov, I. Smaznevich, E. Baskakova, “Optimization approach to selecting methods of detecting anomalies in homogeneous text collections”, Informatics and Automation, 20:4 (2021), 869–904
Citation in format AMSBIB
\Bibitem{KraSmaBas21}
\by F.~Krasnov, I.~Smaznevich, E.~Baskakova
\paper Optimization approach to selecting methods of detecting anomalies in homogeneous text collections
\jour Informatics and Automation
\yr 2021
\vol 20
\issue 4
\pages 869--904
\mathnet{http://mi.mathnet.ru/trspy1169}
\crossref{https://doi.org/10.15622/ia.20.4.5}
Linking options:
  • https://www.mathnet.ru/eng/trspy1169
  • https://www.mathnet.ru/eng/trspy/v20/i4/p869