Proceedings of the Institute for System Programming of the RAS
Proceedings of the Institute for System Programming of the RAS, 2020, Volume 32, Issue 4, Pages 189–202
DOI: https://doi.org/10.15514/ISPRAS-2020-32(4)-14
(Mi tisp534)
 

This article is cited in 2 scientific papers (total in 2 papers)

Synthetic data usage for document segmentation models fine-tuning

O. V. Belyaeva (a), A. I. Perminov (b), I. S. Kozlov (a)

(a) Ivannikov Institute for System Programming of the RAS
(b) Lomonosov Moscow State University
Abstract: In this paper, we propose an approach to document image segmentation when only a limited set of real data is available for training. The main idea of our approach is to use artificially created data for training and to apply post-processing. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications. As part of the task of automatic document analysis, we solve the problem of document segmentation, known as Document Layout Analysis (DLA). We train the well-known Faster R-CNN model [ren2015faster] to segment text blocks, tables, stamps and captions in images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large training dataset, and preparing such a dataset is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of the model trained on different datasets (with / without synthetic data, small / large set of real data, with / without the post-processing stage). As a result, we show that generating synthetic data and applying post-processing increase the quality of the model when only a small amount of real training data is available.
Keywords: Document Layout Analysis, Document Segmentation, Physical Document Structure, Image Object Detection, Model fine-tuning, Active Learning.
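The abstract does not give the details of the synthetic data generator or of the post-processing stage, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It samples non-overlapping labelled boxes for one synthetic page, using the four classes named in the abstract, and shows one common form of artifact removal: suppressing heavily overlapping detections of the same class. The helper names `synth_layout` and `suppress_duplicates`, the page size, the block sizes and the IoU threshold are all hypothetical.

```python
import random

# Class names follow the abstract; everything else here is an illustrative assumption.
CLASSES = ["text", "table", "stamp", "caption"]


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def synth_layout(width=1240, height=1754, n_blocks=6, seed=None, max_tries=200):
    """Sample non-overlapping labelled boxes for one synthetic page.

    The default page size corresponds to A4 at 150 dpi (an assumed choice).
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(max_tries):
        if len(boxes) == n_blocks:
            break
        w = rng.randint(width // 8, width // 2)
        h = rng.randint(height // 20, height // 4)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        box = (x0, y0, x0 + w, y0 + h)
        # Reject candidates that overlap an already placed block.
        if all(iou(box, placed["bbox"]) == 0.0 for placed in boxes):
            boxes.append({"bbox": box, "label": rng.choice(CLASSES)})
    return boxes


def suppress_duplicates(detections, iou_threshold=0.5):
    """Keep the highest-scoring detection among same-label boxes that overlap heavily."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(det["label"] != k["label"]
               or iou(det["bbox"], k["bbox"]) <= iou_threshold
               for k in kept):
            kept.append(det)
    return kept
```

In a fuller pipeline, each sampled box would be filled with rendered text, a table grid or a stamp image, and the resulting page image with its boxes would serve as one training sample for the detector. Rejecting overlapping candidates keeps the synthetic labels unambiguous, at the cost of sometimes placing fewer than `n_blocks` blocks on a page.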
Document Type: Article
Language: Russian
Citation: O. V. Belyaeva, A. I. Perminov, I. S. Kozlov, “Synthetic data usage for document segmentation models fine-tuning”, Proceedings of ISP RAS, 32:4 (2020), 189–202
Citation in format AMSBIB
\Bibitem{BelPerKoz20}
\by O.~V.~Belyaeva, A.~I.~Perminov, I.~S.~Kozlov
\paper Synthetic data usage for document segmentation models fine-tuning
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 4
\pages 189--202
\mathnet{http://mi.mathnet.ru/tisp534}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(4)-14}
Linking options:
  • https://www.mathnet.ru/eng/tisp534
  • https://www.mathnet.ru/eng/tisp/v32/i4/p189