Ts. Ghukasyan, “Character n-gram-based word embeddings for morphological analysis of texts”, Proceedings of ISP RAS, 32:2 (2020), 7

Proceedings of the Institute for System Programming of the RAS

Proceedings of the Institute for System Programming of the RAS, 2020, Volume 32, Issue 2, Pages 7–14
DOI: https://doi.org/10.15514/ISPRAS-2020-32(2)-1 (Mi tisp494)

Character n-gram-based word embeddings for morphological analysis of texts

Ts. Ghukasyan

Russian-Armenian University

Abstract: The paper presents modifications of the fastText word embedding model that rely solely on character n-grams, intended for morphological analysis of texts. fastText is a library for text classification and for learning vector representations. It computes the representation of each word as the sum of the word's own vector and the vectors of its character n-grams. Because fastText stores and uses a separate vector for the whole word, out-of-vocabulary words have no such vector, which degrades the quality of the resulting word vector. Storing vectors for whole words also means that fastText models usually require a lot of memory for storage and processing. This becomes especially problematic for morphologically rich languages, given their large number of word forms. Unlike the original fastText model, the proposed modifications pretrain and use vectors only for the character n-grams of a word, which eliminates the reliance on word-level vectors and significantly reduces the number of parameters in the model. Two approaches are used to extract information from a word: internal character n-grams and suffixes. The proposed models are evaluated on morphological analysis and lemmatization of Russian, using the SynTagRus corpus, and demonstrate results comparable to those of the original fastText.
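
The n-gram-only composition described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the boundary markers `<` and `>`, the n-gram range of 3 to 6, the CRC32 hashing into a fixed-size table, the maximum suffix length, and all names (`char_ngrams`, `suffixes`, `NgramOnlyEmbedding`) are assumptions made for illustration, loosely following common fastText conventions. The vectors here are random rather than pretrained.

```python
import zlib

import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Internal character n-grams of the word wrapped in boundary markers (assumed scheme)."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


def suffixes(word, max_len=4):
    """Word-final substrings of length 1..max_len, always shorter than the word itself."""
    return [word[-k:] for k in range(1, min(max_len, len(word) - 1) + 1)]


class NgramOnlyEmbedding:
    """Word vectors built only from subword units; no per-word vector is stored."""

    def __init__(self, dim=8, buckets=1000, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical stand-in for pretrained n-gram vectors.
        self.table = rng.normal(scale=0.1, size=(buckets, dim))
        self.buckets = buckets

    def _row(self, unit):
        # Deterministic hash of the subword unit into a table row (assumed scheme).
        return zlib.crc32(unit.encode("utf-8")) % self.buckets

    def vector(self, word, extractor=char_ngrams):
        units = extractor(word)
        if not units:
            return np.zeros(self.table.shape[1])
        rows = [self._row(u) for u in units]
        # Sum of the subword vectors; no whole-word term is added.
        return self.table[rows].sum(axis=0)


emb = NgramOnlyEmbedding()
v_ngrams = emb.vector("кошками")                       # internal n-grams
v_suffix = emb.vector("кошками", extractor=suffixes)   # suffixes only
```

In this sketch the parameter count is `buckets * dim` regardless of how many word forms the language has, and any word, seen or unseen, gets a vector from the same table. That is the property the abstract attributes to dropping word-level vectors.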

Keywords: word embeddings, morphological analysis, lemmatization.

Document Type: Article

Language: Russian

Citation: Ts. Ghukasyan, “Character n-gram-based word embeddings for morphological analysis of texts”, Proceedings of ISP RAS, 32:2 (2020), 7–14

Citation in format AMSBIB

\Bibitem{Ghu20}
\by Ts.~Ghukasyan
\paper Character n-gram-based word embeddings for morphological analysis of texts
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 2
\pages 7--14
\mathnet{http://mi.mathnet.ru/tisp494}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(2)-1}

Linking options:

https://www.mathnet.ru/eng/tisp494

https://www.mathnet.ru/eng/tisp/v32/i2/p7
