Abstract:
The probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words, also called a “bag of words”, under the assumption that word order is irrelevant for revealing the latent topics of the document. Under this assumption, the problem reduces to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed and has an infinite set of solutions. To regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix is impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding is computed for a document “on the fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to compute a topical embedding quickly, in a single pass over the words of the document. To this end, an additional constraint is introduced into the model in the form of an equation that computes the first matrix from the second one in linear time. Although formally this constraint is not an optimization criterion, it in fact plays the role of a regularizer and can be combined with other regularizers within the additive regularization framework ARTM. Experiments on three text collections show that the proposed method improves the model in terms of the sparsity, difference, logLift, and coherence measures of topic quality. The open-source libraries BigARTM and TopicNet were used for the experiments.
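As a sketch of the setup the abstract describes, the first two formulas below are the standard ARTM factorization and regularized likelihood; the third is an illustrative, assumed form of a constraint that computes the document-topic matrix from the word-topic matrix in one pass over document words (the paper's exact equation may differ):

\[
  p(w \mid d) \;=\; \sum_{t \in T} \varphi_{wt}\,\theta_{td},
  \qquad \Phi = (\varphi_{wt}), \quad \Theta = (\theta_{td});
\]
\[
  \sum_{d \in D} \sum_{w \in d} n_{dw}
    \ln \sum_{t \in T} \varphi_{wt}\,\theta_{td}
  \;+\; R(\Phi, \Theta) \;\to\; \max_{\Phi,\,\Theta};
\]
\[
  \theta_{td} \;=\; \sum_{w \in d} \frac{n_{dw}}{n_d}\, p(t \mid w),
  \qquad
  p(t \mid w) \;=\; \operatorname{norm}_{t}\bigl(\varphi_{wt}\, p(t)\bigr).
\]

Since $p(t \mid w)$ depends only on $\Phi$, an embedding of this form costs one pass over the document's word counts $n_{dw}$ instead of dozens of EM-style iterations.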
Keywords:
natural language processing, unsupervised learning, topic modeling, additive regularization of topic model, EM-algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet.
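For orientation, here is a minimal sketch of the standard BigARTM Python workflow the abstract builds on: additive regularizers are added to the likelihood, the Phi matrix is fitted, and transform() produces the topical embeddings Theta. The 'kos' UCI collection and the regularizer weights are assumed example values, not the paper's settings, and the one-pass vectorization itself is the paper's contribution, not shown here.

import artm

# Build batches from a bag-of-words collection.
# The UCI-format 'kos' collection is an assumed example dataset.
batch_vectorizer = artm.BatchVectorizer(
    data_path='.', data_format='bow_uci',
    collection_name='kos', target_folder='kos_batches')
dictionary = batch_vectorizer.dictionary

# A topic model with a perplexity score for monitoring convergence.
model = artm.ARTM(
    num_topics=20, dictionary=dictionary,
    scores=[artm.PerplexityScore(name='perplexity', dictionary=dictionary)])

# Additive regularization: each regularizer adds a weighted criterion
# to the log-likelihood, as described in the abstract.
model.regularizers.add(
    artm.SmoothSparsePhiRegularizer(name='sparse_phi', tau=-0.1))
model.regularizers.add(
    artm.DecorrelatorPhiRegularizer(name='decorrelate', tau=1e5))

# Fit the Phi matrix over several passes through the collection.
model.fit_offline(batch_vectorizer=batch_vectorizer,
                  num_collection_passes=10)

# Topical embeddings (the Theta matrix) for documents; transform()
# re-infers theta iteratively, which is exactly the cost the paper's
# one-pass constraint is designed to avoid.
theta = model.transform(batch_vectorizer=batch_vectorizer)
print(theta.shape)  # (num_topics, num_documents)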
The work was carried out within the project “Intelligent data analysis tools for large text collections” under the NTI Program “Center for big data storage and analysis”, supported by the Ministry of Science and Higher Education of the Russian Federation under agreement No. 7/1251/2019 of 15.08.2019. The work was also partially supported by RFBR, project 20-07-00936.
Citation:
I. A. Irkhin, V. G. Bulatov, K. V. Vorontsov, “Additive regularization of topic models with fast text vectorization”, Computer Research and Modeling, 12:6 (2020), 1515–1528
\Bibitem{IrkBulVor20}
\by I.~A.~Irkhin, V.~G.~Bulatov, K.~V.~Vorontsov
\paper Additive regularization of topic models with fast text vectorization
\jour Computer Research and Modeling
\yr 2020
\vol 12
\issue 6
\pages 1515--1528
\mathnet{http://mi.mathnet.ru/crm863}
\crossref{https://doi.org/10.20537/2076-7633-2020-12-6-1515-1528}
Linking options:
https://www.mathnet.ru/eng/crm863
https://www.mathnet.ru/eng/crm/v12/i6/p1515
This publication is cited in the following 4 articles:
Alex Gorbulev, Vasiliy Alekseev, Konstantin Vorontsov, Lecture Notes in Computer Science, 15419, Analysis of Images, Social Networks and Texts, 2025, 87
Victor Bulatov, Vasiliy Alekseev, Konstantin Vorontsov, Communications in Computer and Information Science, 1905, Recent Trends in Analysis of Images, Social Networks and Texts, 2024, 3
Konstantin Vorontsov, Springer Optimization and Its Applications, 202, Data Analysis and Optimization, 2023, 397
Vladimir Klyachin, Ekaterina Khizhnyakova, Lecture Notes in Networks and Systems, 724, Artificial Intelligence Application in Networks and Systems, 2023, 463