Modelirovanie i Analiz Informatsionnykh Sistem
Modelirovanie i Analiz Informatsionnykh Sistem, 2022, Volume 29, Number 3, Pages 266–279
DOI: https://doi.org/10.18255/1818-1015-2022-3-266-279
(Mi mais780)
 

Theory of data

Classification of articles from mass media by categories and relevance of the subject area

V. D. Larionov, I. V. Paramonov

P. G. Demidov Yaroslavl State University, 14 Sovetskaya str., Yaroslavl 150003, Russia
Abstract: The research is devoted to the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into four categories: “society”, “education”, “science and technologies”, and “not relevant”.
The proposed approaches use the BERT neural network and machine learning methods (SVM, Logistic Regression, K-Neighbors, Random Forest) in combination with different embedding types: Word2Vec, FastText, TF-IDF, GPT-3. Text preprocessing approaches are also considered as a way to improve classification quality. The experiments showed that the SVM classifier with TF-IDF embedding, trained on full article texts with titles, achieved the best result: its micro-F-measure and macro-F-measure are 0.8214 and 0.8308, respectively. The BERT neural network trained on fragments of paragraphs that mention YarSU, from which the first 128 words and the last 384 words were taken, showed comparable results: a micro-F-measure of 0.8304 and a macro-F-measure of 0.8181. Thus, paragraphs that mention the target organisation are sufficient to classify texts by category efficiently.
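Two ideas from the abstract can be illustrated with a minimal Python sketch: keeping the first 128 and the last 384 words of a fragment, and computing micro- and macro-averaged F-measures. The function names `head_tail_fragment` and `micro_macro_f1` are hypothetical and are not taken from the paper. The sketch assumes that a fragment is already split into a list of words, that each article has exactly one category, and that the macro-F-measure is the unweighted mean of per-class F1 scores; the authors' exact tokenisation and averaging may differ.

```python
from collections import Counter


def head_tail_fragment(words, head=128, tail=384):
    """Keep the first `head` and the last `tail` words of a word list.

    Inputs of at most head + tail words are returned unchanged.
    The defaults mirror the 128/384 split named in the abstract.
    """
    if len(words) <= head + tail:
        return list(words)
    return list(words[:head]) + list(words[-tail:])


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label multi-class predictions.

    Assumption: macro-F1 is the plain average of per-class F1 values.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal class weights.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    # Illustrative labels only; these are not data from the paper.
    truth = ["society", "society", "education", "not relevant"]
    preds = ["society", "education", "education", "not relevant"]
    print(micro_macro_f1(truth, preds))
    print(len(head_tail_fragment([str(i) for i in range(1000)])))
```

Because the macro average weights every category equally while the micro average weights every article equally, the two scores can rank classifiers differently on imbalanced data, which is consistent with the SVM and BERT results reported above.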
Keywords: classification by categories, automatic text processing, subject area, Russian language, news articles.
Funding agency
This work was supported by P. G. Demidov Yaroslavl State University Project No. VIP-016.
Received: 05.06.2022
Revised: 23.08.2022
Accepted: 26.08.2022
Document Type: Article
UDC: 004.912
Language: Russian
Citation: V. D. Larionov, I. V. Paramonov, “Classification of articles from mass media by categories and relevance of the subject area”, Model. Anal. Inform. Sist., 29:3 (2022), 266–279
Citation in format AMSBIB
\Bibitem{LarPar22}
\by V.~D.~Larionov, I.~V.~Paramonov
\paper Classification of articles from mass media by categories and relevance of the subject area
\jour Model. Anal. Inform. Sist.
\yr 2022
\vol 29
\issue 3
\pages 266--279
\mathnet{http://mi.mathnet.ru/mais780}
\crossref{https://doi.org/10.18255/1818-1015-2022-3-266-279}
Linking options:
  • https://www.mathnet.ru/eng/mais780
  • https://www.mathnet.ru/eng/mais/v29/i3/p266