Computer Research and Modeling
Computer Research and Modeling, 2020, Volume 12, Issue 6, Pages 1515–1528
DOI: https://doi.org/10.20537/2076-7633-2020-12-6-1515-1528
(Mi crm863)
 

This article is cited in 3 scientific papers (total in 3 papers)

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Additive regularization of topic models with fast text vectorization

I. A. Irkhin, V. G. Bulatov, K. V. Vorontsov

Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow oblast, 141701, Russia
Abstract: A probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words, also called a “bag of words”, on the assumption that word order is not important for revealing the latent topics of the document. Under this assumption, the problem reduces to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed and has an infinite set of solutions. To regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix seems impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for solving many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding is calculated for a document “on the fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to calculate a topical embedding quickly, in one pass over the words of the document. For this, an additional constraint is introduced into the model in the form of an equation that calculates the first matrix from the second one in linear time. Although formally this constraint is not an optimization criterion, in fact it plays the role of a regularizer and can be used in combination with other regularizers within the additive regularization framework ARTM. Experiments on three text collections have shown that the proposed method improves the model in terms of the sparseness, difference, logLift and coherence measures of topic quality. The open-source libraries BigARTM and TopicNet were used for the experiments.
Keywords: natural language processing, unsupervised learning, topic modeling, additive regularization of topic models, EM algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet.
Funding agency: Grant number
Russian Foundation for Basic Research: 20-07-00936
Ministry of Science and Higher Education of the Russian Federation: No. 7/1251/2019
The work was carried out within the project “Intelligent data analysis tools for large text collections” under the NTI Program “Center for big data storage and analysis”, supported by the Ministry of Science and Higher Education of the Russian Federation under agreement No. 7/1251/2019 of 15.08.2019. The work was also partially supported by RFBR, project 20-07-00936.
Received: 21.09.2020
Revised: 01.10.2020
Accepted: 05.10.2020
Document Type: Article
UDC: 004.852, 519.853
Language: Russian
Citation: I. A. Irkhin, V. G. Bulatov, K. V. Vorontsov, “Additive regularization of topic models with fast text vectorization”, Computer Research and Modeling, 12:6 (2020), 1515–1528
Citation in format AMSBIB
\Bibitem{IrkBulVor20}
\by I.~A.~Irkhin, V.~G.~Bulatov, K.~V.~Vorontsov
\paper Additive regularization of topic models with fast text vectorization
\jour Computer Research and Modeling
\yr 2020
\vol 12
\issue 6
\pages 1515--1528
\mathnet{http://mi.mathnet.ru/crm863}
\crossref{https://doi.org/10.20537/2076-7633-2020-12-6-1515-1528}
Linking options:
  • https://www.mathnet.ru/eng/crm863
  • https://www.mathnet.ru/eng/crm/v12/i6/p1515