NUMERICAL METHODS AND THE BASIS FOR THEIR APPLICATION
Automated citation graph building from a corpora of scientific documents
V. A. Polezhaev
RUKONT-PhysTech Laboratory, CMAM department, MIPT, 9 Institutskii per., Dolgoprudny, Moscow Region, 141700, Russia
Abstract:
This paper treats the automated construction of a citation graph from a collection of scientific documents as a sequence of machine learning tasks. It describes the overall data processing pipeline, which consists of six stages: preprocessing, metainformation extraction, extraction of bibliography lists, splitting of bibliography lists into separate bibliography records, standardization of each bibliography record, and record linkage. The goal is to survey the approaches and algorithms suitable for each stage, motivate the choice of the best combination of algorithms, and adapt some of them to the processing of multilingual bibliographies. For some of the tasks, new algorithms and heuristics are proposed and evaluated on mixed English and Russian document corpora.
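To make the six stages named in the abstract concrete, the following is a minimal Python sketch of how they might be chained. It is an illustration under loud assumptions, not the paper's method: every function name is hypothetical, and each stage is a toy regular-expression heuristic standing in for the machine learning models the paper surveys (for example, conditional random fields for record labeling). It assumes plain-text input, a bibliography that starts at a line reading "References" or "Литература", and records numbered as `[n]`.

```python
import re
from typing import Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation to spaces (Unicode-aware, so Cyrillic survives)."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def preprocess(text: str) -> str:
    """Stage 1 (toy): unify line endings and strip trailing whitespace."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)


def extract_metainformation(text: str) -> Dict[str, str]:
    """Stage 2 (toy assumption): the first non-empty line is the title."""
    for line in text.split("\n"):
        if line.strip():
            return {"title": line.strip()}
    return {"title": ""}


def extract_bibliography(text: str) -> str:
    """Stage 3 (toy assumption): the list follows a 'References' / 'Литература' line."""
    match = re.search(r"^(References|Литература)\s*$", text,
                      flags=re.MULTILINE | re.IGNORECASE)
    return text[match.end():].strip() if match else ""


def split_records(bibliography: str) -> List[str]:
    """Stage 4 (toy assumption): each record starts with a '[n]' marker."""
    parts = re.split(r"^\s*\[\d+\]\s*", bibliography, flags=re.MULTILINE)
    return [" ".join(part.split()) for part in parts if part.strip()]


def standardize(record: str) -> Dict[str, str]:
    """Stage 5 (toy): keep the raw text, a normalized key, and a four-digit year if any."""
    year = re.search(r"\b(19|20)\d{2}\b", record)
    return {"raw": record,
            "key": normalize(record),
            "year": year.group(0) if year else ""}


def link_records(doc_id: str, records: List[Dict[str, str]],
                 title_index: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Stage 6 (toy): link a record to a document whose normalized title it contains."""
    edges = set()
    for record in records:
        for title_key, target in title_index.items():
            if title_key and target != doc_id and title_key in record["key"]:
                edges.add((doc_id, target))
    return edges


def build_citation_graph(docs: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Run all six stages and return directed edges (citing document, cited document)."""
    cleaned = {doc_id: preprocess(text) for doc_id, text in docs.items()}
    title_index = {normalize(extract_metainformation(text)["title"]): doc_id
                   for doc_id, text in cleaned.items()}
    edges: Set[Tuple[str, str]] = set()
    for doc_id, text in cleaned.items():
        records = [standardize(r) for r in split_records(extract_bibliography(text))]
        edges |= link_records(doc_id, records, title_index)
    return edges
```

Calling `build_citation_graph` on a dictionary that maps document identifiers to plain text returns a set of directed edges. Substring matching on titles is far cruder than real record linkage, which has to tolerate abbreviations, transliteration, and typographical variation. It is used here only to keep the sketch self-contained.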
Keywords:
text mining, machine learning, information extraction, citation graph, bibliography, matching, record linkage, labeling, segmentation, conditional random fields.
Received: 06.09.2012
Citation:
V. A. Polezhaev, “Automated citation graph building from a corpora of scientific documents”, Computer Research and Modeling, 4:4 (2012), 707–719
Linking options:
https://www.mathnet.ru/eng/crm523 https://www.mathnet.ru/eng/crm/v4/i4/p707