Modelirovanie i Analiz Informatsionnykh Sistem
Modelirovanie i Analiz Informatsionnykh Sistem, 2022, Volume 29, Number 4, Pages 334–347
DOI: https://doi.org/10.18255/1818-1015-2022-4-334-347
(Mi mais783)
 

This article is cited in 1 scientific paper (total in 1 paper)

Theory of data

Classification of Russian texts by genres based on modern embeddings and rhythm

K. V. Lagutina

P. G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia
Abstract: The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features derived from lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network VKontakte, and news from OpenCorpora.
Visualization and statistical analysis of the rhythm features identified the genres that are most diverse in terms of rhythm (novels and reviews) and the least diverse one (scientific articles). These genres were subsequently classified best using rhythm features and an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings separated one genre from another with a small number of errors. The multiclass F-score reached 99%. The study confirms the efficiency of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the rhythm feature set as applied to genre classification.
Keywords: stylometry, natural language processing, rhythm features, genres, text classification, BERT, ELMo.
Funding agency: Ministry of Science and Higher Education of the Russian Federation
Grant number: СП-2109.2021.5
The work is supported by the Scholarship of the President of the Russian Federation for young scientists and postgraduates, No. SP-2109.2021.5.
Received: 17.08.2022
Revised: 04.11.2022
Accepted: 09.11.2022
Document Type: Article
UDC: 004.912
MSC: 68T50
Language: Russian
Citation: K. V. Lagutina, “Classification of Russian texts by genres based on modern embeddings and rhythm”, Model. Anal. Inform. Sist., 29:4 (2022), 334–347
Citation in format AMSBIB
\Bibitem{Lag22}
\by K.~V.~Lagutina
\paper Classification of Russian texts by genres based on modern embeddings and rhythm
\jour Model. Anal. Inform. Sist.
\yr 2022
\vol 29
\issue 4
\pages 334--347
\mathnet{http://mi.mathnet.ru/mais783}
\crossref{https://doi.org/10.18255/1818-1015-2022-4-334-347}
Linking options:
  • https://www.mathnet.ru/eng/mais783
  • https://www.mathnet.ru/eng/mais/v29/i4/p334