N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina, “Text classification by CEFR levels using machine learning methods and BERT language model”, Model. Anal. Inform. Sist., 30:3 (2023), 202

Modelirovanie i Analiz Informatsionnykh Sistem


Modelirovanie i Analiz Informatsionnykh Sistem, 2023, Volume 30, Number 3, Pages 202–213
DOI: https://doi.org/10.18255/1818-1015-2023-3-202-213 (Mi mais799)

This article is cited in 2 scientific papers (total in 2 papers)

Theory of data

Text classification by CEFR levels using machine learning methods and BERT language model

N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina

P.G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia



Abstract: This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing students' knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, vector text models were built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified with standard machine learning classifiers; the article reports results for the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. For the six CEFR levels and sublevels from A1 to C2, the best result was achieved by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the BERT language model in six different variants. The best of them, bert-base-cased, reached an F-score of 69%. Analysis of the classification errors showed that most of them occur between neighboring levels, which is understandable from a domain point of view. In addition, classification quality depended strongly on the text corpus: the same text models produced significantly different F-scores on different corpora. Overall, the results show that automatic text level detection is effective and can be applied in practice.
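The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four features below are illustrative stand-ins for the paper's stylometric feature set, macro-averaging over levels is an assumption (the abstract does not say how the F-score was averaged), and the function names `stylometric_features`, `macro_scores`, and `adjacent_error_share` are hypothetical. The sketch shows how a text might be turned into a character-, word-, and sentence-level feature vector, how precision, recall, and F-score can be computed, and how the share of errors between neighboring CEFR levels can be measured.

```python
import re

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def stylometric_features(text):
    """Build a small numeric vector from character-, word-, and sentence-level
    statistics. The chosen features are illustrative assumptions only."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return [
        sum(c in ",;:" for c in text) / n_chars,    # character level: punctuation rate
        sum(len(w) for w in words) / n_words,       # word level: mean word length
        len({w.lower() for w in words}) / n_words,  # word level: type-token ratio
        len(words) / n_sents,                       # sentence level: words per sentence
    ]


def macro_scores(y_true, y_pred, labels=CEFR_LEVELS):
    """Return macro-averaged (precision, recall, F-score) over the given labels.
    Macro-averaging is an assumption, not something the abstract specifies."""
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


def adjacent_error_share(y_true, y_pred):
    """Fraction of misclassifications that land on a neighboring CEFR level."""
    index = {level: i for i, level in enumerate(CEFR_LEVELS)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(index[t] - index[p]) == 1 for t, p in errors)
    return adjacent / len(errors)
```

In a fuller experiment, the vectors returned by `stylometric_features` would be passed to a classifier such as a support vector classifier, and the predictions scored with `macro_scores`; `adjacent_error_share` gives a quick check of how many mistakes fall between neighboring levels.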

Keywords: natural language processing, text classification, CEFR, BERT.

Funding agency	Grant number
Ministry of Science and Higher Education of the Russian Federation	GM-2023-123061600058-4
This study was supported by YarSU Development Program until 2030, project No. GM-2023-123061600058-4 “Development of an automated system for the development of mediative competence in language education”.

Received: 14.08.2023
Revised: 25.08.2023
Accepted: 30.08.2023

Document Type: Article

UDC: 004.912

MSC: 93A30, 68Q60

Language: Russian

Citation: N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina, “Text classification by CEFR levels using machine learning methods and BERT language model”, Model. Anal. Inform. Sist., 30:3 (2023), 202–213

Citation in format AMSBIB

\Bibitem{LagLagBre23}
\by N.~S.~Lagutina, K.~V.~Lagutina, A.~M.~Brederman, N.~N.~Kasatkina
\paper Text classification by CEFR levels using machine learning methods and BERT language model
\jour Model. Anal. Inform. Sist.
\yr 2023
\vol 30
\issue 3
\pages 202--213
\mathnet{http://mi.mathnet.ru/mais799}
\crossref{https://doi.org/10.18255/1818-1015-2023-3-202-213}

Linking options:

https://www.mathnet.ru/eng/mais799

https://www.mathnet.ru/eng/mais/v30/i3/p202
