Abstract:
In this paper, we propose an approach to document image segmentation for the case where only a limited set of real data is available for training. The main idea of our approach is to use artificially created data for training, together with a post-processing stage. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications. As part of the task of automatic document analysis, we solve the problem of document segmentation, known as Document Layout Analysis (DLA). We train the well-known FasterRCNN [ren2015faster] model to segment text blocks, tables, stamps and captions on images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large dataset for training, and preparing such a dataset is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of models trained on different datasets (with and without synthetic data, with a small and a large set of real data, with and without the post-processing stage). As a result, we show that generating synthetic data and applying post-processing improve the quality of the model when only a small amount of real training data is available.
Keywords:
Document Layout Analysis, Document Segmentation, Physical Document Structure, Image Object Detection, Model Fine-Tuning, Active Learning.
Document Type:
Article
Language: Russian
Citation:
O. V. Belyaeva, A. I. Perminov, I. S. Kozlov, “Synthetic data usage for document segmentation models fine-tuning”, Proceedings of ISP RAS, 32:4 (2020), 189–202