Computer Research and Modeling
Computer Research and Modeling, 2022, Volume 14, Issue 5, Pages 1185–1197
DOI: https://doi.org/10.20537/2076-7633-2022-14-5-1185-1197
(Mi crm1025)
 

This article is cited in 1 scientific paper (total in 1 paper)

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Semantic structuring of text documents based on patterns of natural language entities

N. A. Ignat'ev, U. Yu. Tuliyev

National University of Uzbekistan, Tashkent, Uzbekistan
Abstract: The paper considers a technique for building patterns from natural language words (concepts) using text data in the bag-of-words model. Patterns are used to reduce the dimension of the original space in which documents are described and to search for semantically related words by topic. Dimensionality reduction is implemented by forming patterns of latent features. The variety of structures of relations between documents is investigated in order to divide the documents into topics in the latent space.
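For readers unfamiliar with the bag-of-words model mentioned above, the following is a minimal sketch of how a document collection can be turned into a word-count matrix. The function name `bag_of_words`, the whitespace tokenization, and the sample documents are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of text documents."""
    # Assumption: lowercase whitespace tokenization; the paper's
    # preprocessing is not described in the abstract.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["graph model of text", "text model and topic model"]
vocab, X = bag_of_words(docs)
print(vocab)
print(X)
```

Each row of `X` describes one document in the original space, whose dimension equals the vocabulary size; this is the space the patterns are meant to reduce.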
It is assumed that a given set of documents (objects) is divided into two non-overlapping classes whose analysis requires a common dictionary. Which words belong to the common dictionary is initially unknown. The objects of the two classes are treated as being in opposition to each other. Quantitative measures of this opposition are determined through the stability value of each feature and through generalized assessments of objects over non-overlapping sets of features.
To calculate stability, the feature values are divided into non-intersecting intervals whose optimal boundaries are determined by a special criterion. Maximum stability is achieved when each interval contains values from only one of the two classes.
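The abstract does not state the criterion used to choose the interval boundaries, so the sketch below substitutes a simple stand-in: it tries every boundary that splits a feature into two intervals and scores the split by the share of objects that fall into an interval where their own class is in the majority. The function `feature_stability`, the 0/1 class labels, and the scoring rule are assumptions for illustration, not the authors' criterion. The score reaches 1.0 exactly when each interval contains values of only one class, which matches the condition for maximum stability described above.

```python
def feature_stability(values, labels):
    """Best two-interval split of one feature; returns (stability, boundary).

    Assumed stand-in for the paper's criterion: stability is the share of
    objects lying in an interval where their own class (0 or 1) dominates.
    Returns (0.0, None) if all feature values are equal.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_boundary = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a boundary cannot separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (max(left.count(0), left.count(1))
                 + max(right.count(0), right.count(1))) / n
        if score > best_score:
            best_score = score
            best_boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_score, best_boundary

# A word whose counts separate the two classes perfectly.
separable = feature_stability([0, 1, 1, 4, 5, 6], [0, 0, 0, 1, 1, 1])
# A word whose counts are mixed between the classes.
mixed = feature_stability([1, 2, 3, 4], [0, 1, 0, 1])
print(separable, mixed)
```

Ordering features by such a score gives the kind of stability-ordered sequence from which word patterns can then be formed.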
The composition of features in sets (patterns of words) is formed from a sequence of features ordered by stability value. Patterns, and the latent features based on them, are formed according to the rules of hierarchical agglomerative grouping.
The set of latent features is used for cluster analysis of documents with metric grouping algorithms. The analysis uses a coefficient of content authenticity based on data about the class membership of documents. The coefficient is a numerical characteristic of how strongly representatives of one class dominate within groups.
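The abstract gives no formula for the coefficient of content authenticity. As a rough illustration of "dominance of class representatives in groups", the sketch below computes a purity-style value: the share of documents that belong to the majority class of their group. The function `class_dominance` and this definition are assumptions and may differ from the coefficient defined in the paper.

```python
from collections import Counter

def class_dominance(group_ids, class_labels):
    """Share of documents that belong to the majority class of their group.

    Assumed purity-style stand-in, not the paper's coefficient.
    """
    groups = {}
    for group, label in zip(group_ids, class_labels):
        groups.setdefault(group, Counter())[label] += 1
    dominant = sum(max(counts.values()) for counts in groups.values())
    return dominant / len(group_ids)

# Six documents in two groups; group 0 mixes the classes, group 1 is pure.
score = class_dominance([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
print(score)
```

A value of 1.0 would mean that every group consists of documents from a single class.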
To divide documents into topics, it is proposed to merge groups with respect to their centers. The pattern for each topic is a sequence of words from the common dictionary ordered by frequency of occurrence.
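The last step, ordering dictionary words by frequency within a topic, can be sketched as follows. The function `topic_word_sequence`, the `top_n` parameter, and the sample data are hypothetical; the sketch assumes that the documents of a topic and the common dictionary are already known.

```python
from collections import Counter

def topic_word_sequence(topic_documents, dictionary, top_n=5):
    """Dictionary words of one topic, ordered by frequency of occurrence."""
    counts = Counter(
        word
        for doc in topic_documents
        for word in doc.lower().split()
        if word in dictionary  # ignore words outside the common dictionary
    )
    return [word for word, _ in counts.most_common(top_n)]

topic_docs = ["model of text", "text model model"]
sequence = topic_word_sequence(topic_docs, {"model", "text"})
print(sequence)
```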
The results of a computational experiment on collections of abstracts of scientific dissertations are presented. Word sequences from the common dictionary are formed for 4 topics.
Keywords: topic modeling, hierarchical agglomerative grouping, ontology, general vocabulary, content authenticity.
Received: 30.03.2022
Revised: 07.06.2022
Accepted: 08.06.2022
Document Type: Article
UDC: 519.8
Language: Russian
Citation: N. A. Ignat'ev, U. Yu. Tuliyev, “Semantic structuring of text documents based on patterns of natural language entities”, Computer Research and Modeling, 14:5 (2022), 1185–1197
Citation in AMSBIB format
\Bibitem{IgnTul22}
\by N.~A.~Ignat'ev, U.~Yu.~Tuliyev
\paper Semantic structuring of text documents based on patterns of natural language entities
\jour Computer Research and Modeling
\yr 2022
\vol 14
\issue 5
\pages 1185--1197
\mathnet{http://mi.mathnet.ru/crm1025}
\crossref{https://doi.org/10.20537/2076-7633-2022-14-5-1185-1197}
Linking options:
  • https://www.mathnet.ru/eng/crm1025
  • https://www.mathnet.ru/eng/crm/v14/i5/p1185