This article is cited in 2 scientific papers (total in 2 papers)
Automatic metadata extraction from scientific PDF documents
A. V. Ogaltsov (a, b), O. Y. Bakhteev (c, b)
a National Research University Higher School of Economics, 20 Myasnitskaya Str., Moscow 101000, Russian Federation
b Antiplagiat JSC, 33 Varshavskoe Shosse, Moscow 117105, Russian Federation
c Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation
Abstract:
The authors investigate metadata extraction from scientific PDF documents in Russian. PDF documents have
a rich layout, which makes their metadata difficult to extract. The authors propose a method that treats
the blocks returned by a PDF parser as objects in a machine learning task. The feature space is built not only
from text statistics but also from the formatting and positioning features of each block.
The authors performed computational experiments and compared their approach with a baseline.
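
As an illustration only, the idea of describing each parser block by text, formatting, and positioning features might be sketched as follows. The `Block` structure, its field names, and the specific features are hypothetical and are not taken from the paper; the authors' actual feature set and classifier are not given in this abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One text block as a PDF parser might return it (hypothetical structure)."""
    text: str
    font_size: float    # in points
    is_bold: bool
    x0: float           # left edge of the block, in page units
    y0: float           # top edge of the block, measured from the top of the page
    page_width: float
    page_height: float


def block_features(block: Block) -> List[float]:
    """Map a block to a numeric vector of text, formatting, and position features."""
    text = block.text
    n_chars = len(text)

    # Text statistics (assumed examples): word count, share of uppercase and digit characters.
    n_words = float(len(text.split()))
    upper_ratio = sum(c.isupper() for c in text) / n_chars if n_chars else 0.0
    digit_ratio = sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0

    # Formatting features (assumed examples): font size and a bold flag.
    font_size = block.font_size
    bold = 1.0 if block.is_bold else 0.0

    # Positioning features: coordinates normalized by the page dimensions.
    rel_x = block.x0 / block.page_width
    rel_y = block.y0 / block.page_height

    return [n_words, upper_ratio, digit_ratio, font_size, bold, rel_x, rel_y]


# A block that looks like a title: large bold text near the top of the page.
title = Block("Automatic metadata extraction", 18.0, True, 100.0, 50.0, 600.0, 800.0)
features = block_features(title)
```

Vectors of this kind, one per block, could then be passed to any standard classifier that assigns each block a metadata label such as title, authors, or abstract; which model the authors used is not stated here.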
Keywords:
metadata extraction; natural language processing; layout features; information retrieval; metadescriptions.
Received: 20.12.2017
Citation:
A. V. Ogaltsov, O. Y. Bakhteev, “Automatic metadata extraction from scientific PDF documents”, Inform. Primen., 12:2 (2018), 75–82
Linking options:
https://www.mathnet.ru/eng/ia535
https://www.mathnet.ru/eng/ia/v12/i2/p75