Proceedings of the Institute for System Programming of the RAS
Proceedings of the Institute for System Programming of the RAS, 2020, Volume 32, Issue 2, Pages 7–14
DOI: https://doi.org/10.15514/ISPRAS-2020-32(2)-1
(Mi tisp494)
 

Character n-gram-based word embeddings for morphological analysis of texts

Ts. Ghukasyan

Russian-Armenian University
Abstract: The paper presents modifications of the fastText word embedding model that rely solely on character n-grams, intended for morphological analysis of texts. fastText is a library for text classification and for learning vector representations of words. In fastText, the representation of each word is computed as the sum of its own word-level vector and the vectors of its character n-grams. For out-of-vocabulary words no word-level vector exists, which degrades the quality of the resulting word vector. In addition, because vectors are stored for whole words, fastText models usually require a lot of memory for storage and processing. This becomes especially problematic for morphologically rich languages, given their large number of word forms. Unlike the original fastText model, the proposed modifications pretrain and use vectors only for the character n-grams of a word, which eliminates the reliance on word-level vectors and significantly reduces the number of parameters in the model. Two approaches are used to extract information from a word: internal character n-grams and suffixes. The proposed models are evaluated on morphological analysis and lemmatization of Russian, using the SynTagRus corpus, and demonstrate results comparable to the original fastText.
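The idea in the abstract, building a word vector from subword units alone with no word-level vector, can be illustrated with a minimal Python sketch. It is not the author's implementation. The names `char_ngrams`, `suffixes` and `SubwordEmbedding` are hypothetical, and so are the n-gram range, the maximum suffix length, the bucket count and the CRC32 hashing. The `<` and `>` boundary markers follow the original fastText convention; the abstract does not say whether the proposed variants keep them. The table is randomly initialized here, whereas the paper pretrains the n-gram vectors.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


def suffixes(word, max_len=4):
    """Word-final substrings of length 1..max_len (max_len is an assumed setting)."""
    return [word[-k:] for k in range(1, min(max_len, len(word)) + 1)]


class SubwordEmbedding:
    """Embedding table indexed by hashed subword units; no whole-word vectors."""

    def __init__(self, dim=8, buckets=1000, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.buckets = buckets
        # Randomly initialized stand-in for pretrained n-gram vectors.
        self.table = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                      for _ in range(buckets)]

    def _row(self, unit):
        # CRC32 is an assumed, deterministic stand-in for a hashing scheme.
        return self.table[zlib.crc32(unit.encode("utf-8")) % self.buckets]

    def vector(self, units):
        """Sum the vectors of the given subword units."""
        total = [0.0] * self.dim
        for unit in units:
            for j, value in enumerate(self._row(unit)):
                total[j] += value
        return total


emb = SubwordEmbedding()
word = "столами"
ngram_vector = emb.vector(char_ngrams(word))   # n-gram-based variant
suffix_vector = emb.vector(suffixes(word))     # suffix-based variant
```

Because every word is decomposed into units that hash into a fixed-size table, an unseen word form still receives a vector, and the parameter count depends on the number of buckets rather than on the vocabulary size.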
Keywords: word embeddings, morphological analysis, lemmatization.
Document Type: Article
Language: Russian
Citation: Ts. Ghukasyan, “Character n-gram-based word embeddings for morphological analysis of texts”, Proceedings of ISP RAS, 32:2 (2020), 7–14
Citation in format AMSBIB
\Bibitem{Ghu20}
\by Ts.~Ghukasyan
\paper Character n-gram-based word embeddings for morphological analysis of texts
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 2
\pages 7--14
\mathnet{http://mi.mathnet.ru/tisp494}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(2)-1}
Linking options:
  • https://www.mathnet.ru/eng/tisp494
  • https://www.mathnet.ru/eng/tisp/v32/i2/p7