Abstract:
The authors investigate metadata extraction from scientific PDF documents in Russian. PDF documents
have a rich layout, which makes their metadata difficult to extract. The authors propose a method that
treats the blocks produced by a PDF parser as objects in a machine learning task. The feature space
combines text statistics with the formatting and positioning features of each block.
The authors performed computational experiments and compared their approach with a baseline.
Keywords:
metadata extraction; natural language processing; layout features; information retrieval; metadescriptions.
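
The block-level feature space described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Block` record, its fields, and every feature name are hypothetical, and the abstract does not specify which parser or which features were actually used.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block as a PDF parser might report it (assumed fields)."""
    text: str
    font_size: float
    is_bold: bool
    x0: float  # left edge, in page units
    y0: float  # top edge, in page units
    x1: float  # right edge
    y1: float  # bottom edge
    page_width: float
    page_height: float


def block_features(block: Block) -> dict:
    """Map one block to a flat feature dictionary (illustrative feature set)."""
    text = block.text
    letters = [ch for ch in text if ch.isalpha()]
    n_chars = len(text)
    return {
        # Text statistics
        "n_chars": n_chars,
        "n_words": len(text.split()),
        "upper_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        "digit_ratio": (
            sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0
        ),
        # Formatting features
        "font_size": block.font_size,
        "is_bold": float(block.is_bold),
        # Positioning features, normalized by page size
        "rel_left": block.x0 / block.page_width,
        "rel_top": block.y0 / block.page_height,
        "rel_width": (block.x1 - block.x0) / block.page_width,
        "rel_height": (block.y1 - block.y0) / block.page_height,
    }


# Example: a bold, large block near the top of the page (a plausible title).
title = Block(
    text="Metadata Extraction",
    font_size=18.0,
    is_bold=True,
    x0=60.0, y0=80.0, x1=540.0, y1=120.0,
    page_width=600.0, page_height=800.0,
)
features = block_features(title)
```

Each dictionary would then become one row of a feature matrix, with the metadata field of the block (title, author, abstract, other) as the label for any standard classifier. Normalizing positions by page size is a design choice made here so that the features stay comparable across page formats.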