Computer Research and Modeling
Computer Research and Modeling, 2022, Volume 14, Issue 5, Pages 1199–1210
DOI: https://doi.org/10.20537/2076-7633-2022-14-5-1199-1210
(Mi crm1026)
 

This article is cited in 3 scientific papers.

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Identification of the author of the text by segmentation method

M. Yu. Voronina, Yu. N. Orlov

Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, 4 Miusskaya sq., Moscow, 125047, Russia
Abstract: The paper describes a method for recognizing the authors of literary texts. A text is divided into fragments, and each fragment is compared with the standard of every candidate author. A standard is the empirical frequency distribution of letter combinations built on a training sample of expertly selected works reliably attributed to that author. The standards of different authors form a library, within which the problem of identifying the author of an unknown text is solved. Proximity between texts is measured by the L1 norm of the difference between frequency vectors of letter combinations, which are constructed for each fragment and for the text as a whole. The unknown text is attributed to the author whose standard is most often the nearest one across the fragments into which the text is divided. The fragment length is optimized by maximizing the difference in distances from fragments to standards in the "friend or foe" recognition problem. The method was tested on a corpus of domestic and foreign (translated) authors: 1783 texts by 100 authors, with a total volume of about 700 million characters. To exclude bias in the selection of authors, the authors considered were those whose surnames begin with the same letter. In particular, for the letter L the identification error was 12%. Along with fairly high accuracy, the method has another important property: it makes it possible to estimate the probability that the standard of the true author of the text is missing from the library. This probability can be estimated from the statistics of the nearest standards for small fragments of the text. The paper also examines statistical digital portraits of writers: joint empirical distributions of the probability that a certain proportion of the text is identified at a given confidence level.
The practical importance of these statistics is that the supports of the corresponding distributions hardly overlap for an author's own standard and for the standards of other authors, which makes it possible to recognize the reference distribution of letter combinations at a high level of confidence.
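The procedure described in the abstract (bigram frequency standards, L1 distance, and a vote over fragments) can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, and several details are assumptions the abstract does not specify. In particular, the sketch keeps only letters and pairs adjacent ones, cuts the text into non-overlapping fragments of a fixed length given by the caller (the paper optimizes this length instead), and breaks ties in whatever order Python's `min` and `Counter.most_common` happen to give.

```python
from collections import Counter


def bigram_frequencies(text):
    """Empirical frequency distribution of adjacent letter pairs.

    Assumption: non-letter characters are dropped and case is ignored.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def l1_distance(p, q):
    """L1 norm of the difference between two frequency vectors."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def split_fragments(text, length):
    """Non-overlapping fragments of equal length; a short tail is dropped.

    Assumption: a text shorter than `length` is treated as one fragment.
    """
    fragments = [text[i:i + length]
                 for i in range(0, len(text) - length + 1, length)]
    return fragments or [text]


def identify_author(text, standards, fragment_length):
    """Return the author whose standard is nearest for most fragments.

    `standards` maps an author name to that author's bigram frequencies.
    The vote counts are returned as well.
    """
    votes = Counter()
    for fragment in split_fragments(text, fragment_length):
        freqs = bigram_frequencies(fragment)
        nearest = min(standards,
                      key=lambda author: l1_distance(freqs, standards[author]))
        votes[nearest] += 1
    winner = votes.most_common(1)[0][0]
    return winner, votes
```

A library of standards would be built by calling `bigram_frequencies` on each author's concatenated training texts. The returned vote counts are the kind of per-fragment statistic from which, according to the abstract, the probability of a missing standard can be estimated; the sketch does not attempt that estimate.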
Keywords: empirical frequency distribution, bigrams, author identification, literary text, nearest pattern.
Funding: This study was supported by the Ministry of Science and Higher Education of the Russian Federation, Contract No. 075-15-2020-808.
Received: 27.06.2022
Revised: 09.08.2022
Accepted: 12.08.2022
Document Type: Article
UDC: 519.243
Language: Russian
Citation: M. Yu. Voronina, Yu. N. Orlov, “Identification of the author of the text by segmentation method”, Computer Research and Modeling, 14:5 (2022), 1199–1210
Citation in AMSBIB format:
\Bibitem{VorOrl22}
\by M.~Yu.~Voronina, Yu.~N.~Orlov
\paper Identification of the author of the text by segmentation method
\jour Computer Research and Modeling
\yr 2022
\vol 14
\issue 5
\pages 1199--1210
\mathnet{http://mi.mathnet.ru/crm1026}
\crossref{https://doi.org/10.20537/2076-7633-2022-14-5-1199-1210}
Linking options:
  • https://www.mathnet.ru/eng/crm1026
  • https://www.mathnet.ru/eng/crm/v14/i5/p1199