M. Yu. Voronina, Yu. N. Orlov, “Identification of the author of the text by segmentation method”, Computer Research and Modeling, 14:5 (2022), 1199

Computer Research and Modeling, 2022, Volume 14, Issue 5, Pages 1199–1210
DOI: https://doi.org/10.20537/2076-7633-2022-14-5-1199-1210 (Mi crm1026)

This article is cited in 3 scientific papers (total in 3 papers)

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Identification of the author of the text by segmentation method

M. Yu. Voronina, Yu. N. Orlov

Keldysh Institute of Applied Mathematics Russian Academy of Sciences, 4 Miusskaya sq., Moscow, 125047, Russia

Abstract: The paper describes a method for identifying the author of a literary text from how close the fragments into which the text is divided lie to an author's standard. A standard is the empirical frequency distribution of letter combinations built from a training sample of expert-selected works reliably attributed to that author. The standards of different authors form a library, within which the author of an unknown text is sought. Proximity between texts is measured in the L1 norm on the frequency vectors of letter combinations, which are constructed for each fragment and for the text as a whole. An unknown text is attributed to the author whose standard is most often the nearest one across the fragments of the text. The fragment length is chosen to maximize the difference between the distances from fragments to standards in the "friend-or-foe" recognition problem. The method was tested on a corpus of domestic and foreign (translated) authors: 1783 texts by 100 authors, totaling about 700 million characters. To avoid bias in the selection of authors, authors whose surnames begin with the same letter were considered; for the letter L, for example, the identification error was 12%. Besides its fairly high accuracy, the method has another important property: it makes it possible to estimate the probability that the standard of the text's author is missing from the library. This probability can be estimated from the statistics of the nearest standards for small fragments of the text. The paper also examines statistical digital portraits of writers: joint empirical distributions of the probability that a given proportion of a text is identified at a given confidence level. These statistics matter in practice because the supports of the distributions obtained for an author's own standard and for other authors' standards hardly overlap, which makes it possible to recognize the reference distribution of letter combinations with high confidence.
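The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the restriction to bigrams of letters (with spaces and punctuation dropped), and the equal-length segmentation are assumptions made for illustration, and the paper's fragment-length optimization is not reproduced.

```python
from collections import Counter


def bigram_frequencies(text):
    """Empirical frequency distribution of adjacent letter pairs (bigrams).

    Assumption: non-letter characters are dropped before pairing.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 norm of the difference between two sparse frequency vectors."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def split_into_fragments(text, fragment_length):
    """Cut text into consecutive equal-length fragments, dropping a short tail."""
    last_start = len(text) - fragment_length
    return [text[i:i + fragment_length]
            for i in range(0, last_start + 1, fragment_length)]


def identify_author(text, library, fragment_length):
    """Return (author, vote_share) for the standard that is nearest most often.

    `library` maps an author name to that author's standard (a bigram
    frequency dict built from the training sample).
    """
    votes = Counter()
    for fragment in split_into_fragments(text, fragment_length):
        freqs = bigram_frequencies(fragment)
        nearest = min(library, key=lambda a: l1_distance(freqs, library[a]))
        votes[nearest] += 1
    if not votes:
        raise ValueError("text is shorter than one fragment")
    author, count = votes.most_common(1)[0]
    return author, count / sum(votes.values())


# Toy library with two hypothetical authors and deliberately distinct texts.
library = {
    "A": bigram_frequencies("ab" * 80),
    "B": bigram_frequencies("xy" * 80),
}
author, share = identify_author("ab" * 40, library, fragment_length=16)
```

In this toy run every fragment lies nearest to the standard of the hypothetical author "A". The returned vote share is only a crude indicator: a low share might hint that no standard in the library fits the text, but the abstract does not specify the estimator the paper uses for that probability.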

Keywords: empirical frequency distribution, bigrams, author identification, literary text, nearest pattern.

Funding agency	Grant number
Ministry of Science and Higher Education of the Russian Federation	075-15-2020-808
This study was supported by the Ministry of Science and Higher Education of the Russian Federation, Contract No. 075-15-2020-808.

Received: 27.06.2022
Revised: 09.08.2022
Accepted: 12.08.2022

Document Type: Article

UDC: 519.243

Language: Russian

Citation: M. Yu. Voronina, Yu. N. Orlov, “Identification of the author of the text by segmentation method”, Computer Research and Modeling, 14:5 (2022), 1199–1210

Citation in format AMSBIB

\Bibitem{VorOrl22}
\by M.~Yu.~Voronina, Yu.~N.~Orlov
\paper Identification of the author of the text by segmentation method
\jour Computer Research and Modeling
\yr 2022
\vol 14
\issue 5
\pages 1199--1210
\mathnet{http://mi.mathnet.ru/crm1026}
\crossref{https://doi.org/10.20537/2076-7633-2022-14-5-1199-1210}

Linking options:

https://www.mathnet.ru/eng/crm1026

https://www.mathnet.ru/eng/crm/v14/i5/p1199
