Modelirovanie i Analiz Informatsionnykh Sistem
Modelirovanie i Analiz Informatsionnykh Sistem, 2023, Volume 30, Number 3, Pages 202–213
DOI: https://doi.org/10.18255/1818-1015-2023-3-202-213
(Mi mais799)
 

This article is cited in 2 scientific papers (total in 2 papers)

Theory of data

Text classification by CEFR levels using machine learning methods and BERT language model

N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina

P.G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia
Abstract: This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing students' knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, the authors considered vector text models based on stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified with standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for the six CEFR levels and sublevels from A1 to C2 were obtained by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the BERT language model (six different variants). The best model, bert-base-cased, reached an F-score of 69%. Analysis of the classification errors showed that most of them occur between neighboring levels, which is understandable from the point of view of the domain. In addition, classification quality depended strongly on the text corpus: the same text models produced significantly different F-scores on different corpora. Overall, the results show that automatic text level detection is effective and can be applied in practice.
Keywords: natural language processing, text classification, CEFR, BERT.
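The pipeline described in the abstract (stylometric features, then a standard classifier, then precision, recall, and F-score, plus an analysis of errors between neighboring levels) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the four features, the function names, and the choice of macro averaging are all assumptions, since the abstract does not list the actual feature set or the averaging scheme. The classifier step is omitted; the feature vectors would be passed to a classifier such as a support vector machine.

```python
import re
from collections import Counter

# Ordered CEFR levels, used to decide which levels are "neighboring".
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def stylometric_features(text):
    """Return a small illustrative feature vector (assumed features, not the paper's)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return [
        sum(c in ",;:" for c in text) / n_chars,     # character level: punctuation rate
        sum(len(w) for w in words) / n_words,        # word level: mean word length
        len({w.lower() for w in words}) / n_words,   # word level: type-token ratio
        len(words) / n_sents,                        # sentence level: mean words per sentence
    ]


def macro_scores(y_true, y_pred, labels=LEVELS):
    """Macro-averaged precision, recall, and F-score (the averaging scheme is an assumption)."""
    pairs = Counter(zip(y_true, y_pred))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


def adjacent_error_share(y_true, y_pred, levels=LEVELS):
    """Share of misclassifications that land on a neighboring CEFR level."""
    index = {level: i for i, level in enumerate(levels)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(index[t] - index[p]) == 1 for t, p in errors)
    return adjacent / len(errors)
```

A high value of `adjacent_error_share` would correspond to the abstract's observation that most errors occur between neighboring levels.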
Funding agency: Ministry of Science and Higher Education of the Russian Federation
Grant number: GM-2023-123061600058-4
This study was supported by YarSU Development Program until 2030, project No. GM-2023-123061600058-4 “Development of an automated system for the development of mediative competence in language education”.
Received: 14.08.2023
Revised: 25.08.2023
Accepted: 30.08.2023
Document Type: Article
UDC: 004.912
MSC: 93A30, 68Q60
Language: Russian
Citation: N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina, “Text classification by CEFR levels using machine learning methods and BERT language model”, Model. Anal. Inform. Sist., 30:3 (2023), 202–213
Citation in format AMSBIB
\Bibitem{LagLagBre23}
\by N.~S.~Lagutina, K.~V.~Lagutina, A.~M.~Brederman, N.~N.~Kasatkina
\paper Text classification by CEFR levels using machine learning methods and BERT language model
\jour Model. Anal. Inform. Sist.
\yr 2023
\vol 30
\issue 3
\pages 202--213
\mathnet{http://mi.mathnet.ru/mais799}
\crossref{https://doi.org/10.18255/1818-1015-2023-3-202-213}
Linking options:
  • https://www.mathnet.ru/eng/mais799
  • https://www.mathnet.ru/eng/mais/v30/i3/p202