Zapiski Nauchnykh Seminarov POMI, 2023, Volume 529, Pages 54–71
(Mi znsl7419)
Monolingual and cross-lingual knowledge transfer for topic classification
D. Karpov^a, M. Burtsev^b
^a Moscow Institute of Physics and Technology, Dolgoprudny, Russia
^b London Institute for Mathematical Sciences, London, United Kingdom
Abstract:
In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the “Yandex Que” raw data. By evaluating the models trained on RuQTopics on the six matching classes from the Russian MASSIVE subset, we show that the RuQTopics dataset is suitable for real-world conversational tasks, as Russian-only models trained on this dataset consistently yield an accuracy around 85% on this subset. We have also found that for the multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy closely correlates (Spearman correlation 0.773 with p-value 2.997e-11) with the approximate size of BERT pretraining data for the corresponding language. At the same time, the correlation of language-wise accuracy with the linguistic distance from the Russian language is not statistically significant.
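The correlation analysis described above can be illustrated with a minimal sketch. The snippet below computes a Spearman rank correlation between per-language accuracy and approximate pretraining data size using scipy.stats.spearmanr; all language codes and numeric values are placeholders for illustration, not the figures reported in the paper.

    # Hypothetical sketch of the language-wise correlation analysis.
    # Values below are placeholders, not the paper's actual measurements.
    from scipy.stats import spearmanr

    # Per-language accuracy of the RuQTopics-trained multilingual BERT
    # on the six matching MASSIVE classes (placeholder values).
    accuracy_by_lang = {"ru": 0.85, "en": 0.80, "de": 0.78, "sw": 0.55}

    # Approximate size of the BERT pretraining corpus per language,
    # e.g. in millions of Wikipedia articles (placeholder values).
    pretrain_size_by_lang = {"ru": 1.9, "en": 6.5, "de": 2.8, "sw": 0.08}

    langs = sorted(accuracy_by_lang)
    rho, p_value = spearmanr(
        [accuracy_by_lang[l] for l in langs],
        [pretrain_size_by_lang[l] for l in langs],
    )
    print(f"Spearman rho = {rho:.3f}, p = {p_value:.3e}")

The same call, applied to linguistic distance from Russian instead of pretraining data size, would yield the second (statistically insignificant) correlation mentioned in the abstract.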
Key words and phrases:
dataset, topic classification, knowledge transfer, cross-lingual knowledge transfer.
Received: 06.09.2023
Citation:
D. Karpov, M. Burtsev, “Monolingual and cross-lingual knowledge transfer for topic classification”, Investigations on applied mathematics and informatics. Part II–1, Zap. Nauchn. Sem. POMI, 529, POMI, St. Petersburg, 2023, 54–71
Linking options:
https://www.mathnet.ru/eng/znsl7419
https://www.mathnet.ru/eng/znsl/v529/p54