Modelirovanie i Analiz Informatsionnykh Sistem
Modelirovanie i Analiz Informatsionnykh Sistem, 2021, Volume 28, Number 3, Pages 280–291
DOI: https://doi.org/10.18255/1818-1015-2021-3-280-291
(Mi mais750)
 

This article is cited in 2 scientific papers.

Theory of data

Text classification by genre based on rhythm features

K. V. Lagutina (a), N. S. Lagutina (a), E. I. Boychuk (b)

(a) P. G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia
(b) Yaroslavl State Pedagogical University named after K. D. Ushinsky, 108/1 Respublikanskaya str., Yaroslavl 150000, Russia
Abstract: The article analyzes the rhythm of texts of different genres: fiction novels, advertisements, scientific articles, reviews, tweets, and political articles. The authors identified lexico-grammatical figures in the texts (anaphora, epiphora, diacope, aposiopesis, etc.) that serve as markers of text rhythm. From these figures, they calculated statistical features that describe the rhythm quantitatively and structurally.
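As a rough illustration of how such figures can be turned into numeric features, the sketch below counts two of them, anaphora and epiphora, and reports their density per sentence. It is a simplified assumption, not the algorithm of ProseRhythmDetector: sentences are split naively on end punctuation, anaphora is taken to be a shared first word in two adjacent sentences, epiphora a shared last word, and the names `split_sentences`, `words`, and `rhythm_densities` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    """Lowercased word tokens of one sentence."""
    return re.findall(r"\w+", sentence.lower())


def rhythm_densities(text):
    """Density of simplified anaphora and epiphora per sentence.

    Assumption: anaphora = adjacent sentences starting with the same word,
    epiphora = adjacent sentences ending with the same word.
    """
    sentences = [words(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # drop sentences with no words
    if len(sentences) < 2:
        return {"anaphora": 0.0, "epiphora": 0.0}
    pairs = list(zip(sentences, sentences[1:]))
    anaphora = sum(1 for a, b in pairs if a[0] == b[0])
    epiphora = sum(1 for a, b in pairs if a[-1] == b[-1])
    n = len(sentences)
    return {"anaphora": anaphora / n, "epiphora": epiphora / n}


sample = (
    "We shall fight on the beaches. We shall fight on the fields. "
    "They came home. Nobody left home."
)
print(rhythm_densities(sample))
```

A real detector would also need to handle multi-word repetitions, clause boundaries, and the other figures named in the abstract (diacope, aposiopesis), which this sketch leaves out.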
The resulting text model was visualized for statistical analysis using boxplots and heat maps, which showed differences in the rhythm of texts of different genres. The boxplots showed that almost all genres differ from each other in the overall density of rhythm features. The heat maps showed different rhythm patterns across genres. The rhythm features were then successfully used to classify texts into six genres. The classification was carried out in two ways: a binary classification for each genre, separating that genre from the remaining ones, and a multi-class classification of the text corpus into all six genres at once. Two text corpora, in English and in Russian, were used for the experiments. Each corpus contains 100 each of fiction novels, scientific articles, advertisements, and tweets, and 50 each of reviews and political articles, i.e. a total of 500 texts. The high quality of classification with neural networks showed that rhythm features are a good marker for most genres, especially fiction. The experiments were carried out using the ProseRhythmDetector software tool for the Russian and English languages. The text corpora contain 300 texts for each language.
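The two classification setups differ only in how the target labels are built. The sketch below shows that difference under stated assumptions: the per-genre counts follow the figures given in the abstract (four genres with 100 texts, two with 50), while the genre label strings and the helper names `multiclass_labels` and `one_vs_rest_labels` are hypothetical. The neural network classifier itself is not shown.

```python
# Per-genre text counts for one corpus, as listed in the abstract.
# The label strings are illustrative, not taken from the authors' data.
GENRE_COUNTS = {
    "fiction novel": 100,
    "scientific article": 100,
    "advertisement": 100,
    "tweet": 100,
    "review": 50,
    "political article": 50,
}


def multiclass_labels(counts):
    """One integer class per text, for six-way classification."""
    genres = sorted(counts)
    index = {genre: i for i, genre in enumerate(genres)}
    labels = [index[g] for g in genres for _ in range(counts[g])]
    return labels, genres


def one_vs_rest_labels(labels, genres, target):
    """Binary targets: 1 for the chosen genre, 0 for all remaining genres."""
    target_index = genres.index(target)
    return [1 if y == target_index else 0 for y in labels]


labels, genres = multiclass_labels(GENRE_COUNTS)
fiction_vs_rest = one_vs_rest_labels(labels, genres, "fiction novel")
print(len(labels), len(genres), sum(fiction_vs_rest))
```

In the binary setup one such target vector is built per genre and a separate classifier is trained on each; in the multi-class setup a single classifier is trained on `labels` directly.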
Keywords: stylometry, natural language processing, rhythm features, genres, text classification.
Funding: The reported study was funded by the Russian Foundation for Basic Research (RFBR), project number 19-07-00243.
Received: 20.08.2021
Revised: 30.08.2021
Accepted: 01.09.2021
Document Type: Article
UDC: 004.912
MSC: 68T50
Language: Russian
Citation: K. V. Lagutina, N. S. Lagutina, E. I. Boychuk, “Text classification by genre based on rhythm features”, Model. Anal. Inform. Sist., 28:3 (2021), 280–291
Citation in AMSBIB format:
\Bibitem{LagLagBoy21}
\by K.~V.~Lagutina, N.~S.~Lagutina, E.~I.~Boychuk
\paper Text classification by genre based on rhythm features
\jour Model. Anal. Inform. Sist.
\yr 2021
\vol 28
\issue 3
\pages 280--291
\mathnet{http://mi.mathnet.ru/mais750}
\crossref{https://doi.org/10.18255/1818-1015-2021-3-280-291}
\elib{https://elibrary.ru/item.asp?id=46677108}
Linking options:
  • https://www.mathnet.ru/eng/mais750
  • https://www.mathnet.ru/eng/mais/v28/i3/p280