L. Bureš, I. Gruber, P. Neduchal, M. Hlaváč, M. Hrúz, “Semantic text segmentation from synthetic images of full-text documents”, Tr. SPIIRAN, 18:6 (2019), 1381

Trudy SPIIRAN, 2019, Volume 18, Issue 6, Pages 1381–1406
DOI: https://doi.org/10.15622/sp.2019.18.6.1381-1406 (Mi trspy1085)

This article is cited in 4 scientific papers (total in 4 papers)

Digital Information Telecommunication Technologies

Semantic text segmentation from synthetic images of full-text documents

L. Bureš, I. Gruber, P. Neduchal, M. Hlaváč, M. Hrúz

University of West Bohemia

Abstract: A modular algorithm for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR).
Because the algorithm is modular, individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) was used to train a generative model. The trained model enables on-the-fly generation of background images similar to the training ones.
The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (natural-looking aged documents).
A few types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare real-world images with the generated ones. The recognition rate is very similar, indicating that the synthetic images have a realistic appearance. Moreover, the errors made by the OCR system in both cases are very similar. Using the generated images, a fully convolutional encoder-decoder neural network architecture for semantic segmentation of individual characters was trained. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.
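
The abstract does not state which metric underlies the reported recognition rate. As a minimal sketch, assuming a character-level accuracy derived from edit distance (a common choice, not necessarily the authors'), OCR output on real and synthetic pages could be compared as follows. The function names and the sample strings are hypothetical and serve only as illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Assumed metric: 1 - edit_distance / len(reference), clipped at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = levenshtein(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Made-up placeholder strings; in practice these would be the ground-truth
    # text and the OCR output for a real page and for a synthetic page.
    truth = "semantic text segmentation"
    ocr_real = "semantic text segmentatlon"
    ocr_synthetic = "semantic tekt segmentation"
    print(character_accuracy(truth, ocr_real))
    print(character_accuracy(truth, ocr_synthetic))
```

Similar accuracy values on both sets, together with similar error patterns, would correspond to the kind of agreement the abstract describes between real-world and generated images.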

Keywords: generation of synthetic images, semantic text segmentation, variational autoencoder, VAE, optical character recognition, OCR, aged-looking text generation.

Funding agency	Grant number
Ministry of Education, Youth and Sports of the Czech Republic	LTARF18017 LO1506
National Grid Infrastructure MetaCentrum	CESNET LM2015042
University of West Bohemia	SGS-2019-027
This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017 and Ministry of Education, Youth and Sports of the Czech Republic project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated. The work has been supported by the grant of the University of West Bohemia, project No. SGS-2019-027.

Received: 24.09.2019

Document Type: Article

UDC: 004.9

Language: English

Citation: L. Bureš, I. Gruber, P. Neduchal, M. Hlaváč, M. Hrúz, “Semantic text segmentation from synthetic images of full-text documents”, Tr. SPIIRAN, 18:6 (2019), 1381–1406

Citation in format AMSBIB

\Bibitem{BurGruNed19}
\by L.~Bure{\v s}, I.~Gruber, P.~Neduchal, M.~Hlav\'a{\v{c}}, M.~Hr\'uz
\paper Semantic text segmentation from synthetic images of full-text documents
\jour Tr. SPIIRAN
\yr 2019
\vol 18
\issue 6
\pages 1381--1406
\mathnet{http://mi.mathnet.ru/trspy1085}
\crossref{https://doi.org/10.15622/sp.2019.18.6.1381-1406}

Linking options:

https://www.mathnet.ru/eng/trspy1085

https://www.mathnet.ru/eng/trspy/v18/i6/p1381
