V. I. Yuferev, N. A. Razin, “Word-embedding based text vectorization using clustering”, Model. Anal. Inform. Sist., 28:3 (2021), 292

Modelirovanie i Analiz Informatsionnykh Sistem

Modelirovanie i Analiz Informatsionnykh Sistem, 2021, Volume 28, Number 3, Pages 292–311
DOI: https://doi.org/10.18255/1818-1015-2021-3-292-311 (Mi mais751)

This article is cited in 1 scientific paper (total in 1 paper)

Theory of data

Word-embedding based text vectorization using clustering

V. I. Yuferev^a, N. A. Razin^b

^a Department of Information Technologies of the Central Bank of the Russian Federation, Laboratory of Innovations "Novosibirsk", 12 Neglinnaya str., Moscow 107016, Russia
^b Department of Counteraction to Unfair Practices, the Central Bank of the Russian Federation, 12 Neglinnaya str., Moscow 107016, Russia

Abstract: In natural language processing, representing texts as fixed-length vectors with word-embedding models is known to work well only when the texts are short.
The longer the texts being compared, the worse the approach performs. The reason is that information is lost when the vectors of the individual words are combined into a vector for the whole text, which usually has the same dimension as the vector of a single word.
This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The method merges semantically similar elements of the dictionary of an existing text corpus by clustering their embeddings. The result is a new dictionary, smaller than the original, in which each element corresponds to one cluster. The original corpus is then rewritten in terms of this new dictionary, and the rewritten texts are vectorized with one of the dictionary-based approaches (TF-IDF was used in this work). The resulting text vector can additionally be enriched with vectors of the original dictionary words, obtained by reducing the dimension of their embeddings within each cluster.
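The pipeline described in this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-d vectors stand in for pre-trained embeddings, the plain k-means with farthest-point initialisation is an assumed choice of clustering algorithm, the `log(N/df) + 1` IDF variant is an assumption, and the names `EMBEDDINGS`, `kmeans`, `cluster_vocabulary` and `tfidf` are hypothetical. The optional enrichment step with reduced-dimension embeddings is omitted.

```python
import math
from collections import Counter

# Toy 2-d vectors standing in for pre-trained word embeddings (assumption).
EMBEDDINGS = {
    "cat": (1.0, 0.1), "dog": (0.9, 0.2), "kitten": (1.0, 0.0),
    "car": (0.1, 1.0), "truck": (0.0, 0.9), "bus": (0.2, 1.0),
}

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def cluster_vocabulary(embeddings, k):
    """Map every dictionary word to the id of its embedding cluster."""
    words = sorted(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))

def tfidf(docs, word_to_cluster, k):
    """Rewrite documents as cluster ids, then compute TF-IDF over the k ids."""
    recoded = [[word_to_cluster[w] for w in doc if w in word_to_cluster] for doc in docs]
    n = len(recoded)
    df = Counter(c for doc in recoded for c in set(doc))
    vectors = []
    for doc in recoded:
        tf = Counter(doc)
        vectors.append([
            (tf[c] / len(doc)) * (math.log(n / df[c]) + 1.0) if doc and df[c] else 0.0
            for c in range(k)
        ])
    return vectors

docs = [["cat", "car"], ["dog", "truck"], ["kitten", "kitten"]]
word_to_cluster = cluster_vocabulary(EMBEDDINGS, k=2)
vectors = tfidf(docs, word_to_cluster, k=2)
print(word_to_cluster)
print(vectors)
```

In this toy corpus the first two documents use different words from the same two clusters, so they receive identical vectors; this is the kind of merging of semantically similar dictionary elements that the paragraph describes.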
The paper describes a series of experiments to determine the optimal parameters of the method. On a text ranking problem, the proposed approach is compared with other text vectorization methods: averaging word embeddings with and without TF-IDF weighting, and vectorization based on TF-IDF coefficients.
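For orientation, the averaging baselines named above might look roughly like the following sketch. It is an illustration under assumptions, not the setup used in the paper: the toy vectors, the `log(N/df) + 1` IDF variant, cosine similarity as the ranking score, and the names `idf_weights`, `average_embedding` and `cosine` are all hypothetical. Summing the IDF weight once per token makes the term-frequency factor implicit.

```python
import math
from collections import Counter

# Toy 2-d vectors standing in for pre-trained word embeddings (assumption).
EMBEDDINGS = {"cat": (1.0, 0.0), "dog": (0.8, 0.2), "car": (0.0, 1.0)}

def idf_weights(docs):
    """IDF per word, using the assumed variant log(N / df) + 1."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def average_embedding(doc, embeddings, weights=None):
    """Average the vectors of known words; optionally weight each token by IDF."""
    known = [w for w in doc if w in embeddings]
    if not known:
        return None  # no word of the document has an embedding
    dim = len(next(iter(embeddings.values())))
    ws = [1.0 if weights is None else weights.get(w, 1.0) for w in known]
    total = sum(ws)
    return tuple(
        sum(wt * embeddings[w][i] for w, wt in zip(known, ws)) / total
        for i in range(dim)
    )

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = [["cat", "dog"], ["cat", "car"], ["cat", "cat", "car"]]
weights = idf_weights(docs)

plain = average_embedding(docs[0], EMBEDDINGS)              # unweighted average
weighted = average_embedding(docs[0], EMBEDDINGS, weights)  # TF-IDF weighted

# Rank the documents against a query by cosine similarity of averaged vectors.
query = average_embedding(["dog"], EMBEDDINGS)
doc_vectors = [average_embedding(d, EMBEDDINGS) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: -cosine(query, doc_vectors[i]))
print(plain, weighted, ranking)
```

Whatever the weighting, the averaged vector keeps the dimension of a single word vector, which is the source of the information loss that the abstract attributes to this family of methods on longer texts.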

Keywords: word embedding, fastText, TF-IDF, averaging, clustering, text similarity, distance, text ranking.

Received: 23.06.2021
Revised: 16.08.2021
Accepted: 25.08.2021

Document Type: Article

UDC: 004.8

MSC: 97R40, 68T50

Language: Russian

Citation: V. I. Yuferev, N. A. Razin, “Word-embedding based text vectorization using clustering”, Model. Anal. Inform. Sist., 28:3 (2021), 292–311

Citation in format AMSBIB

\Bibitem{YufRaz21}
\by V.~I.~Yuferev, N.~A.~Razin
\paper Word-embedding based text vectorization using clustering
\jour Model. Anal. Inform. Sist.
\yr 2021
\vol 28
\issue 3
\pages 292--311
\mathnet{http://mi.mathnet.ru/mais751}
\crossref{https://doi.org/10.18255/1818-1015-2021-3-292-311}

Linking options:

https://www.mathnet.ru/eng/mais751

https://www.mathnet.ru/eng/mais/v28/i3/p292
