D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov, “An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets”, Computer Optics, 41:3 (2017), 461

Computer Optics

Computer Optics, 2017, Volume 41, Issue 3, Pages 461–471
DOI: https://doi.org/10.18287/2412-6179-2017-41-3-461-471 (Mi co406)

This article is cited in 8 scientific papers.

NUMERICAL METHODS AND DATA ANALYSIS

An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets

D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov

Yaroslav-the-Wise Novgorod State University, Velikii Novgorod, Russia

Abstract: This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus), and assessing how completely the initial phrases reflect the actual knowledge thus revealed. The main practical goal is to find the most rational way of expressing a knowledge fragment in a given natural language, for subsequent inclusion in the thesaurus and ontology of a subject area. These problems matter when building systems for processing, analyzing, evaluating and understanding information. The relevance of a text to the initial phrase, in terms of the fragment of actual knowledge described (including the forms of its expression in a given natural language), is measured by estimating the coupling strength of words from the initial phrase that co-occur in phrases of the analyzed text, together with classifying these words by their TF-IDF values relative to the text corpus. The paper extends links between words from traditional bigrams to groups of three or more elements, in order to reveal the constituents of an image of the initial phrase as combinations of related words. Variants of link detection with and without a database of known syntactic relations are considered. To describe more fully the fragment of expert knowledge revealed in the corpus texts, sets of initial phrases that are mutually equivalent or complementary in meaning and relate to the same image are taken into account. Compared with searching for components of the analyzed image in a syntactically annotated text corpus, the proposed text selection method reduces, on average, by a factor of 17 the output of phrases that are irrelevant to the initial ones in terms of either the knowledge fragment described or the forms of its expression in a given natural language.

Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval.

Funding agency	Grant number
Ministry of Education and Science of the Russian Federation
Russian Foundation for Basic Research	16-01-00004 а
The work was partially funded by the Russian Federation Ministry of Education and Science (the basic part of the state task) and the Russian Foundation of Basic Research, grant No. 16-01-00004.

Received: 10.04.2017
Accepted: 01.06.2017

Document Type: Article

Language: Russian

Citation: D. V. Mikhaylov, A. P. Kozlov, G. M. Emelyanov, “An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets”, Computer Optics, 41:3 (2017), 461–471

Citation in AMSBIB format

\Bibitem{MikKozEme17}
\by D.~V.~Mikhaylov, A.~P.~Kozlov, G.~M.~Emelyanov
\paper An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets
\jour Computer Optics
\yr 2017
\vol 41
\issue 3
\pages 461--471
\mathnet{http://mi.mathnet.ru/co406}
\crossref{https://doi.org/10.18287/2412-6179-2017-41-3-461-471}

Linking options:

https://www.mathnet.ru/eng/co406

https://www.mathnet.ru/eng/co/v41/i3/p461
