Zapiski Nauchnykh Seminarov POMI
Zapiski Nauchnykh Seminarov POMI, 2023, Volume 529, Pages 123–139 (Mi znsl7423)  

Pre-training LongT5 for Vietnamese mass-media multi-document summarization

N. Rusnachenko (a), The Anh Le (b), Ngoc Diep Nguyen (c)

(a) Bauman Moscow State Technical University
(b) FPT University, Can Tho, Viet Nam
(c) CyberIntellect, Moscow, Russia
Abstract: Multi-document summarization is the task of extracting the most salient information from a set of input documents. One of the main challenges in this task is the long-term dependency problem. For texts written in Vietnamese, the task is further complicated by the language's syllable-based text representation and the lack of labeled datasets. Recent advances in machine translation have led to significant growth in the use of a related architecture known as the Transformer. Pretrained on large amounts of raw text, Transformers capture deep knowledge of the texts they model. In this paper, we survey applications of language models to text summarization problems, including important Vietnamese text summarization models. Based on this survey, we select LongT5, pretrain it from scratch, and then fine-tune it for the Vietnamese multi-document text summarization problem. We analyze the resulting model and experiment with multi-document Vietnamese datasets, including ViMs, VMDS, and VLSP2022. We conclude that a Transformer-based model pretrained on a large amount of unlabeled Vietnamese texts achieves promising results, which are further enhanced by fine-tuning on a small amount of manually summarized texts. The pretrained model used in the experimental section is available online at https://github.com/nicolay-r/ViLongT5.
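As an illustration of how such a released checkpoint might be applied, the sketch below loads a LongT5 sequence-to-sequence model through the Hugging Face transformers API and summarizes a concatenated cluster of documents. The checkpoint path and the input texts are hypothetical placeholders; the actual weights and usage instructions are provided in the repository above.

    # Minimal sketch: multi-document summarization with a pretrained LongT5
    # checkpoint (path below is hypothetical; see the ViLongT5 repository).
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    checkpoint = "path/to/vilongt5-checkpoint"  # hypothetical placeholder
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Multi-document input: the documents of one cluster are concatenated
    # into a single long sequence, which LongT5's sparse attention handles.
    documents = [
        "Van ban thu nhat ...",   # placeholder Vietnamese document 1
        "Van ban thu hai ...",    # placeholder Vietnamese document 2
    ]
    inputs = tokenizer(" ".join(documents), return_tensors="pt",
                       truncation=True, max_length=4096)

    summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))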
Key words and phrases: Vietnamese multi-document summarization, text summarization, Transformers, language models.
Received: 06.09.2023
Document Type: Article
UDC: 81.322.2
Language: English
Citation: N. Rusnachenko, The Anh Le, Ngoc Diep Nguyen, “Pre-training LongT5 for Vietnamese mass-media multi-document summarization”, Investigations on applied mathematics and informatics. Part II–1, Zap. Nauchn. Sem. POMI, 529, POMI, St. Petersburg, 2023, 123–139
Citation in format AMSBIB
\Bibitem{RusLeNgu23}
\by N.~Rusnachenko, The~Anh~Le, Ngoc~Diep~Nguyen
\paper Pre-training LongT5 for Vietnamese mass-media multi-document summarization
\inbook Investigations on applied mathematics and informatics. Part~II--1
\serial Zap. Nauchn. Sem. POMI
\yr 2023
\vol 529
\pages 123--139
\publ POMI
\publaddr St.~Petersburg
\mathnet{http://mi.mathnet.ru/znsl7423}
Linking options:
  • https://www.mathnet.ru/eng/znsl7423
  • https://www.mathnet.ru/eng/znsl/v529/p123