Trudy SPIIRAN
Trudy SPIIRAN, 2019, Volume 18, Issue 6, Pages 1381–1406
DOI: https://doi.org/10.15622/sp.2019.18.6.1381-1406
(Mi trspy1085)
 

This article is cited in 4 scientific papers

Digital Information Telecommunication Technologies

Semantic text segmentation from synthetic images of full-text documents

L. Bureš, I. Gruber, P. Neduchal, M. Hlaváč, M. Hrúz

University of West Bohemia
Abstract: An algorithm (divided into multiple modules) for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR).
The algorithm is modular: individual parts can be changed and tuned to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this purpose, a novel approach that trains a generative model based on a Variational AutoEncoder (VAE) is used. The trained model can generate, on the fly, background images similar to those in the training set.
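For orientation, the textbook VAE training objective (the evidence lower bound) is given below. The abstract does not state the exact loss or prior used for the paper backgrounds, so the symbols here are the standard ones and are an assumption: x is a background image patch, z a latent code, q_φ the encoder, p_θ the decoder, and p(z) a standard normal prior.

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr),
\qquad p(z) = \mathcal{N}(0, I)
```

Under this formulation, a new background would be obtained after training by sampling z from p(z) and passing it through the decoder, which is what makes on-the-fly generation possible.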
The text-printing module uses large text corpora, a font, and suitable per-character positional and brightness noise to produce believable, natural-looking aged documents.
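The idea of per-character positional and brightness noise can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `place_line`, the `Glyph` record, the Gaussian noise model, and all parameter values are hypothetical choices made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float           # left edge of the glyph, in pixels
    y: float           # baseline position, in pixels
    brightness: float  # 0.0 = black ink, 1.0 = fully faded


def place_line(text, baseline_y=100.0, advance=12.0,
               pos_sigma=0.6, ink_mean=0.15, ink_sigma=0.05, seed=0):
    """Lay out one text line with per-character noise (assumed Gaussian)."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    glyphs = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue  # spaces advance the pen but produce no glyph
        # Positional noise: jitter each glyph around its ideal grid position.
        x = i * advance + rng.gauss(0.0, pos_sigma)
        y = baseline_y + rng.gauss(0.0, pos_sigma)
        # Brightness noise: vary ink intensity, clamped to [0, 1].
        b = min(1.0, max(0.0, rng.gauss(ink_mean, ink_sigma)))
        glyphs.append(Glyph(ch, x, y, b))
    return glyphs


# Each Glyph doubles as a simple per-character annotation record.
for g in place_line("aged text")[:3]:
    print(g)
```

A real renderer would rasterize each glyph from the chosen font at the jittered position; the list of records above only shows how the noise and the per-character annotation could be produced together.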
Several page layouts are supported. The system generates a detailed, structured annotation of each synthesized image. Tesseract OCR is used to compare real-world images with generated images. The recognition rates are very similar, indicating that the synthetic images have a realistic appearance. Moreover, the errors made by the OCR system in both cases are very similar. Using the generated images, a fully convolutional encoder-decoder neural network for semantic segmentation of individual characters was trained. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.
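One common way to quantify an OCR recognition rate is character-level accuracy derived from the edit distance between the ground-truth text and the OCR output. The abstract does not say which metric the authors used, so the sketch below is an assumption; the function names and the toy strings are hypothetical and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """Character accuracy: 1 - edit_distance / len(ref), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


# Toy strings only: the same ground truth "read" from a real and a synthetic page.
truth = "semantic text"
print(char_accuracy(truth, "semantlc text"))  # hypothetical OCR of a real scan
print(char_accuracy(truth, "semantic tcxt"))  # hypothetical OCR of a synthetic page
```

Computing the same score on real scans and on synthetic pages with matching ground truth would give two directly comparable recognition rates.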
Keywords: generation of synthetic images, semantic text segmentation, variational autoencoder, VAE, optical character recognition, OCR, aged-looking text generation.
Funding agency: Grant number
Ministry of Education, Youth and Sports of the Czech Republic: LTARF18017, LO1506
National Grid Infrastructure MetaCentrum: CESNET LM2015042
University of West Bohemia: SGS-2019-027
This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated. The work has also been supported by a grant of the University of West Bohemia, project No. SGS-2019-027.
Received: 24.09.2019
Document Type: Article
UDC: 004.9
Language: English
Citation: L. Bureš, I. Gruber, P. Neduchal, M. Hlaváč, M. Hrúz, “Semantic text segmentation from synthetic images of full-text documents”, Tr. SPIIRAN, 18:6 (2019), 1381–1406
Citation in format AMSBIB
\Bibitem{BurGruNed19}
\by L.~Bure{\v s}, I.~Gruber, P.~Neduchal, M.~Hlav\'a{\v{c}}, M.~Hr\'uz
\paper Semantic text segmentation from synthetic images of full-text documents
\jour Tr. SPIIRAN
\yr 2019
\vol 18
\issue 6
\pages 1381--1406
\mathnet{http://mi.mathnet.ru/trspy1085}
\crossref{https://doi.org/10.15622/sp.2019.18.6.1381-1406}
Linking options:
  • https://www.mathnet.ru/eng/trspy1085
  • https://www.mathnet.ru/eng/trspy/v18/i6/p1381