O. V. Belyaeva, A. I. Perminov, I. S. Kozlov, “Synthetic data usage for document segmentation models fine-tuning”, Proceedings of ISP RAS, 32:4 (2020), 189

Proceedings of the Institute for System Programming of the RAS

Proceedings of the Institute for System Programming of the RAS, 2020, Volume 32, Issue 4, Pages 189–202
DOI: https://doi.org/10.15514/ISPRAS-2020-32(4)-14 (Mi tisp534)

This article is cited in 2 scientific papers (total in 2 papers)

Synthetic data usage for document segmentation models fine-tuning

O. V. Belyaeva^a, A. I. Perminov^b, I. S. Kozlov^a

^a Ivannikov Institute for System Programming of the RAS
^b Lomonosov Moscow State University

Abstract: In this paper, we propose an approach to document image segmentation for the case where only a limited set of real training data is available. The main idea of our approach is to use artificially created data for training, combined with a post-processing stage. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals, and technical specifications. As part of the task of automatic document analysis, we solve the problem of document segmentation, known as Document Layout Analysis (DLA). We train the well-known FasterRCNN [ren2015faster] model to segment text blocks, tables, stamps, and captions in images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large training dataset, and preparing such a dataset manually is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of models trained on different datasets (with and without synthetic data, with a small or a large set of real data, with and without the post-processing stage). The results show that generating synthetic data and applying post-processing improve the quality of a model trained on a small amount of real data.
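The abstract does not say how the post-processing stage works, so the following is only a minimal sketch of one common way to clean up detector output, not the authors' method. It assumes the detector returns labeled, scored bounding boxes, drops low-confidence detections, and suppresses near-duplicate boxes of the same class (a simple non-maximum suppression). The names `Detection`, `iou`, `postprocess` and the thresholds `min_score` and `max_iou` are hypothetical and chosen for illustration.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


class Detection(NamedTuple):
    label: str    # e.g. "text", "table", "stamp", "caption"
    score: float  # detector confidence in [0, 1]
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(dets: List[Detection],
                min_score: float = 0.5,
                max_iou: float = 0.5) -> List[Detection]:
    """Assumed cleanup: drop weak detections and same-class near-duplicates."""
    kept: List[Detection] = []
    # Visit detections from most to least confident.
    for det in sorted(dets, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue  # too uncertain to keep
        duplicate = any(k.label == det.label and iou(k.box, det.box) > max_iou
                        for k in kept)
        if not duplicate:
            kept.append(det)
    return kept


# Illustrative input only; these boxes are made up, not data from the paper.
raw = [
    Detection("table", 0.95, (100, 100, 500, 400)),
    Detection("table", 0.60, (110, 105, 505, 410)),  # near-duplicate of the first
    Detection("stamp", 0.30, (600, 700, 700, 800)),  # below the score threshold
    Detection("text", 0.90, (100, 450, 500, 600)),
]
cleaned = postprocess(raw)
print([d.label for d in cleaned])
```

In this sketch the second table box overlaps the first almost completely and is suppressed, and the low-confidence stamp is filtered out. Both thresholds would need tuning on real documents.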

Keywords: Document Layout Analysis, Document Segmentation, Physical Document Structure, Image Object Detection, Model fine-tuning, Active Learning.

Document Type: Article

Language: Russian

Citation: O. V. Belyaeva, A. I. Perminov, I. S. Kozlov, “Synthetic data usage for document segmentation models fine-tuning”, Proceedings of ISP RAS, 32:4 (2020), 189–202

Citation in format AMSBIB

\Bibitem{BelPerKoz20}
\by O.~V.~Belyaeva, A.~I.~Perminov, I.~S.~Kozlov
\paper Synthetic data usage for document segmentation models fine-tuning
\jour Proceedings of ISP RAS
\yr 2020
\vol 32
\issue 4
\pages 189--202
\mathnet{http://mi.mathnet.ru/tisp534}
\crossref{https://doi.org/10.15514/ISPRAS-2020-32(4)-14}

Linking options:

https://www.mathnet.ru/eng/tisp534

https://www.mathnet.ru/eng/tisp/v32/i4/p189

