Abstract:
The probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words, also called a “bag of words”, under the assumption that word order is irrelevant for revealing the latent topics of the document. Under this assumption, the problem reduces to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed and has an infinite set of solutions. To regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix is impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding is computed for a document “on the fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to compute a topical embedding quickly, in a single pass over the words of the document. To this end, an additional constraint is introduced into the model in the form of an equation that computes the first matrix from the second one in linear time. Although formally this constraint is not an optimization criterion, it in fact plays the role of a regularizer and can be combined with other regularizers within the additive regularization framework ARTM. Experiments on three text collections show that the proposed method improves the model in terms of the sparsity, difference, logLift, and coherence measures of topic quality. The open-source libraries BigARTM and TopicNet were used for the experiments.
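As a sketch of the setup the abstract describes, the first two formulas below are the standard ARTM factorization and regularized likelihood; the third is an illustrative, assumed form of a constraint that computes the document-topic matrix from the word-topic matrix in one pass over document words (the paper's exact equation may differ):

\[
  p(w \mid d) \;=\; \sum_{t \in T} \varphi_{wt}\,\theta_{td},
  \qquad \Phi = (\varphi_{wt}), \quad \Theta = (\theta_{td});
\]
\[
  \sum_{d \in D} \sum_{w \in d} n_{dw}
    \ln \sum_{t \in T} \varphi_{wt}\,\theta_{td}
  \;+\; R(\Phi, \Theta) \;\to\; \max_{\Phi,\,\Theta};
\]
\[
  \theta_{td} \;=\; \sum_{w \in d} \frac{n_{dw}}{n_d}\, p(t \mid w),
  \qquad
  p(t \mid w) \;=\; \operatorname{norm}_{t}\bigl(\varphi_{wt}\, p(t)\bigr).
\]

Since $p(t \mid w)$ depends only on $\Phi$, an embedding of this form costs one pass over the document's word counts $n_{dw}$ instead of dozens of EM-style iterations.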
Keywords:
natural language processing, unsupervised learning, topic modeling, additive regularization of topic model, EM-algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet.
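For orientation, here is a minimal sketch of the standard BigARTM Python workflow the abstract builds on: additive regularizers are added to the likelihood, the Phi matrix is fitted, and transform() produces the topical embeddings Theta. The 'kos' UCI collection and the regularizer weights are assumed example values, not the paper's settings, and the one-pass vectorization itself is the paper's contribution, not shown here.

import artm

# Build batches from a bag-of-words collection.
# The UCI-format 'kos' collection is an assumed example dataset.
batch_vectorizer = artm.BatchVectorizer(
    data_path='.', data_format='bow_uci',
    collection_name='kos', target_folder='kos_batches')
dictionary = batch_vectorizer.dictionary

# A topic model with a perplexity score for monitoring convergence.
model = artm.ARTM(
    num_topics=20, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='perplexity', dictionary=dictionary)])

# Additive regularization: each regularizer adds a weighted criterion
# to the log-likelihood, as described in the abstract.
model.regularizers.add(
    artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1))
model.regularizers.add(
    artm.DecorrelatorPhiRegularizer(name='decorrelate', tau=1e5))

# Fit the Phi matrix over several passes through the collection.
model.fit_offline(batch_vectorizer=batch_vectorizer,
                  num_collection_passes=10)

# Topical embeddings (the Theta matrix) for documents; transform()
# re-infers theta iteratively, which is exactly the cost the paper's
# one-pass constraint is designed to avoid.
theta = model.transform(batch_vectorizer=batch_vectorizer)
print(theta.shape)  # (num_topics, num_documents)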
The work was carried out within the project “Intelligent data analysis tools for large text collections” under the NTI Program “Center for big data storage and analysis”, supported by the Ministry of Science and Higher Education of the Russian Federation under agreement No. 7/1251/2019 of 15.08.2019. The work was also partially supported by RFBR, project 20-07-00936.
Citation:
I. A. Irkhin, V. G. Bulatov, K. V. Vorontsov, “Additive regularization of topic models with fast text vectorization”, Computer Research and Modeling, 12:6 (2020), 1515–1528
\Bibitem{IrkBulVor20}
\by I.~A.~Irkhin, V.~G.~Bulatov, K.~V.~Vorontsov
\paper Additive regularization of topic models with fast text vectorization
\jour Computer Research and Modeling
\yr 2020
\vol 12
\issue 6
\pages 1515--1528
\mathnet{http://mi.mathnet.ru/crm863}
\crossref{https://doi.org/10.20537/2076-7633-2020-12-6-1515-1528}
Linking options:
https://www.mathnet.ru/eng/crm863
https://www.mathnet.ru/eng/crm/v12/i6/p1515
This publication is cited in the following 4 articles:
Alex Gorbulev, Vasiliy Alekseev, Konstantin Vorontsov, Lecture Notes in Computer Science, 15419, Analysis of Images, Social Networks and Texts, 2025, 87
Victor Bulatov, Vasiliy Alekseev, Konstantin Vorontsov, Communications in Computer and Information Science, 1905, Recent Trends in Analysis of Images, Social Networks and Texts, 2024, 3
Konstantin Vorontsov, Springer Optimization and Its Applications, 202, Data Analysis and Optimization, 2023, 397
Vladimir Klyachin, Ekaterina Khizhnyakova, Lecture Notes in Networks and Systems, 724, Artificial Intelligence Application in Networks and Systems, 2023, 463