N. A. Ignat'ev, U. Yu. Tuliyev, “Semantic structuring of text documents based on patterns of natural language entities”, Computer Research and Modeling, 14:5 (2022), 1185

Computer Research and Modeling

Computer Research and Modeling, 2022, Volume 14, Issue 5, Pages 1185–1197
DOI: https://doi.org/10.20537/2076-7633-2022-14-5-1185-1197 (Mi crm1025)

This article is cited in 1 scientific paper (total in 1 paper)

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Semantic structuring of text documents based on patterns of natural language entities

N. A. Ignat'ev, U. Yu. Tuliyev

National University of Uzbekistan, Tashkent, Uzbekistan

Abstract: This paper considers a technique for building patterns of natural-language words (concepts) from text data represented in the bag-of-words model. The patterns are used to reduce the dimensionality of the original document-description space and to search for semantically related words by topic. Dimensionality reduction is implemented by forming patterns of latent features. The variety of structures of relations between documents is investigated in order to divide the documents into topics in the latent space.
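As an illustration of the bag-of-words model mentioned above, the following minimal sketch builds a vocabulary and per-document count vectors. The function name `bag_of_words`, the whitespace tokenization, and the sample documents are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def bag_of_words(documents):
    """Build a sorted vocabulary and one word-count vector per document.

    Tokenization by lowercasing and splitting on whitespace is an
    assumption for illustration only.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in the document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents.
docs = ["graph theory and graph algorithms", "market theory and prices"]
vocabulary, vectors = bag_of_words(docs)
```

Each document becomes a vector whose length equals the vocabulary size, which is why a dimensionality reduction step becomes necessary for realistic collections.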
It is assumed that a given set of documents (objects) is divided into two disjoint classes, whose analysis requires a common dictionary. Which words belong to the common dictionary is initially unknown. The objects of the two classes are treated as being in opposition to each other. Quantitative parameters of this opposition are determined from the stability value of each feature and from generalized estimates of objects over disjoint sets of features.
To compute stability, the feature values are partitioned into disjoint intervals whose optimal boundaries are determined by a special criterion. Maximum stability is achieved when each interval contains values from only one of the two classes.
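The abstract does not state the criterion itself, so the sketch below uses a simplified stand-in: the feature values are split into two intervals at a single threshold, and the score is the share of objects that fall into the interval dominated by their own class. The function name `two_interval_stability` and the scoring rule are assumptions, not the authors' criterion; the sketch only reproduces the property that the maximum is reached when each interval holds values of one class.

```python
def two_interval_stability(values, labels):
    """Best share of objects lying in an interval dominated by their class.

    Simplified stand-in (an assumption, not the paper's criterion): one
    threshold splits the sorted values into two intervals, and every
    admissible threshold is tried.
    """
    labels = list(labels)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    a, b = classes
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # Baseline: a single interval, scored by the majority class share.
    best = max(labels.count(a), labels.count(b)) / n
    for cut in range(1, n):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # a boundary cannot separate equal values
        left = [label for _, label in pairs[:cut]]
        right = [label for _, label in pairs[cut:]]
        score = max(
            left.count(a) + right.count(b),
            left.count(b) + right.count(a),
        ) / n
        best = max(best, score)
    return best
```

With this stand-in, a feature whose values separate the classes perfectly scores 1.0, while a feature that carries no class information stays near 0.5.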
The composition of the feature sets (word patterns) is formed from a sequence of features ordered by stability value. The patterns, and the latent features based on them, are formed according to the rules of hierarchical agglomerative grouping.
The set of latent features is used for cluster analysis of documents with metric grouping algorithms. The analysis uses a coefficient of content authenticity based on the known class membership of the documents. The coefficient is a numerical characteristic of the dominance of class representatives in groups.
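The formula of the content authenticity coefficient is not given in the abstract. As a rough illustration of "dominance of class representatives in groups", the sketch below computes a purity-style measure: the share of documents that belong to the majority class of their group. The function name `class_dominance` and the measure itself are assumptions and should not be read as the authors' coefficient.

```python
from collections import Counter


def class_dominance(group_ids, class_labels):
    """Share of documents that belong to the majority class of their group.

    A purity-style stand-in (an assumption) for a coefficient describing
    how strongly one class dominates each group.
    """
    groups = {}
    for group, label in zip(group_ids, class_labels):
        groups.setdefault(group, Counter())[label] += 1
    dominant = sum(max(counts.values()) for counts in groups.values())
    return dominant / len(class_labels)
```

A value of 1.0 means every group consists of documents from a single class; values closer to 0.5 indicate groups in which the two classes are mixed.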
To divide documents into topics, it is proposed to merge groups according to their centers. For each topic, the pattern is taken to be a sequence of words from the common dictionary ordered by frequency of occurrence.
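The last step, ordering common-dictionary words by frequency within a topic, can be sketched as follows. The function name `topic_word_sequence`, the `top_n` parameter, the whitespace tokenization, and the alphabetical tie-breaking are assumptions introduced for this example.

```python
from collections import Counter


def topic_word_sequence(documents, common_dictionary, top_n=5):
    """Return common-dictionary words of a topic ordered by frequency.

    `documents` holds the texts assigned to one topic; `common_dictionary`
    is a set of admissible words. Both are hypothetical inputs.
    """
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word in common_dictionary
    )
    # Sort by decreasing frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]
```

For example, calling it with the documents `["graph theory graph", "graph model theory"]` and the dictionary `{"graph", "theory", "model"}` orders the words by their counts of 3, 2, and 1.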
The results of a computational experiment on collections of abstracts of scientific dissertations are presented. Word sequences from the common dictionary are formed for four topics.

Keywords: topic modeling, hierarchical agglomerative grouping, ontology, general vocabulary, content authenticity.

Received: 30.03.2022
Revised: 07.06.2022
Accepted: 08.06.2022

Document Type: Article

UDC: 519.8

Language: Russian

Citation: N. A. Ignat'ev, U. Yu. Tuliyev, “Semantic structuring of text documents based on patterns of natural language entities”, Computer Research and Modeling, 14:5 (2022), 1185–1197

Citation in format AMSBIB

\Bibitem{IgnTul22}
\by N.~A.~Ignat'ev, U.~Yu.~Tuliyev
\paper Semantic structuring of text documents based on patterns of natural language entities
\jour Computer Research and Modeling
\yr 2022
\vol 14
\issue 5
\pages 1185--1197
\mathnet{http://mi.mathnet.ru/crm1025}
\crossref{https://doi.org/10.20537/2076-7633-2022-14-5-1185-1197}

Linking options:

https://www.mathnet.ru/eng/crm1025

https://www.mathnet.ru/eng/crm/v14/i5/p1185
