S. N. Karpovich, A. V. Smirnov, N. N. Teslya, “Penalty for unknown words in topic model”, Informatsionnye Tekhnologii i Vychslitel'nye Sistemy, 2020, no. 4, 111

Informatsionnye Tekhnologii i Vychslitel'nye Sistemy, 2020, Issue 4, Pages 111–124
DOI: https://doi.org/10.14357/20718632200410 (Mi itvs433)

TEXT MINING

Penalty for unknown words in topic model

S. N. Karpovich^a, A. V. Smirnov^b, N. N. Teslya^b

^a JSC "Olympus", Moscow, Russia
^b St. Petersburg Federal Research Center of the Russian Academy of Sciences, Russia

Abstract: The paper considers approaches to accounting for unknown words in language models used in natural language processing algorithms. It proposes a method for accounting for unknown words in probabilistic topic modeling that makes it possible to determine the probability that a document is novel with respect to existing topics. Topic models compute a probabilistic estimate of a word belonging to a topic, and the word-topic probability matrix in such a model is filled with posterior values of word probabilities. To compute a probabilistic estimate of a document's novelty, the paper introduces into the model a penalty for unknown words, that is, an a priori probability estimate for words absent from the model. A software prototype has been developed that calculates the probability of a document's novelty for various penalty values. Experiments on the SCTM-ru text corpus demonstrate the method's capabilities for classifying collections and streams of text documents that contain unknown words and show how such words influence document topics. The experiments also compare classification results obtained with a topic model against those of a classifier based on logistic regression.
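
The idea of a penalty for unknown words can be illustrated with a minimal sketch. It does not reproduce the paper's novelty formula, which the abstract does not state. It only shows one plausible reading: each out-of-vocabulary word contributes a fixed a priori probability (the penalty) to every topic's log-likelihood. The function name `score_document`, the toy word-topic matrix `phi`, the `floor` parameter, and the share of unknown tokens as a rough novelty indicator are all assumptions made for illustration.

```python
import math

def score_document(tokens, phi, n_topics, penalty, floor=1e-12):
    """Score a document against each topic (illustrative, not the paper's method).

    phi maps a known word to a list of p(word | topic) values.
    Words missing from phi are treated as unknown and contribute
    log(penalty) to every topic. Returns (log_scores, unknown_share).
    """
    log_scores = [0.0] * n_topics
    unknown = 0
    for word in tokens:
        probs = phi.get(word)
        if probs is None:
            # Unknown word: apply the same a priori penalty to every topic.
            unknown += 1
            for t in range(n_topics):
                log_scores[t] += math.log(penalty)
        else:
            # Known word: use the model's probability, floored to avoid log(0).
            for t in range(n_topics):
                log_scores[t] += math.log(max(probs[t], floor))
    unknown_share = unknown / len(tokens) if tokens else 0.0
    return log_scores, unknown_share

# Hypothetical two-topic model (topic 0: sport, topic 1: politics).
phi = {
    "goal": [0.5, 0.1],
    "match": [0.4, 0.1],
    "vote": [0.1, 0.8],
}
doc = ["goal", "match", "blockchain"]  # "blockchain" is not in the model

for penalty in (1e-2, 1e-6):
    scores, share = score_document(doc, phi, 2, penalty)
    print(penalty, [round(s, 2) for s in scores], round(share, 2))
```

Because the same penalty is applied to every topic, it does not change which topic ranks highest. It lowers the absolute likelihood of documents with many unknown words, which is the quantity one might threshold to flag a document as novel under this reading.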

Keywords: topic modeling, natural language processing, penalty for unknown words.

Funding agency	Grant number
Russian Foundation for Basic Research	20-07-00904
Ministry of Education and Science of the Russian Federation	0073-2019-0005

Document Type: Article

Language: Russian

Citation: S. N. Karpovich, A. V. Smirnov, N. N. Teslya, “Penalty for unknown words in topic model”, Informatsionnye Tekhnologii i Vychslitel'nye Sistemy, 2020, no. 4, 111–124

Citation in format AMSBIB

\Bibitem{KarSmiTes20}
\by S.~N.~Karpovich, A.~V.~Smirnov, N.~N.~Teslya
\paper Penalty for unknown words in topic model
\jour Informatsionnye Tekhnologii i Vychslitel'nye Sistemy
\yr 2020
\issue 4
\pages 111--124
\mathnet{http://mi.mathnet.ru/itvs433}
\crossref{https://doi.org/10.14357/20718632200410}
\elib{https://elibrary.ru/item.asp?id=44396804}

Linking options:

https://www.mathnet.ru/eng/itvs433

https://www.mathnet.ru/eng/itvs/y2020/i4/p111
