High performance distributed web-scraper
D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov, Peter the Great St. Petersburg Polytechnic University
Abstract:
Over the past decade, the Internet has become a gigantic and rich source of data. This data is used for knowledge extraction by means of machine learning analysis. In order to perform data mining on web information, the data must be extracted from its source and placed in analytical storage. This is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. This article is devoted to an approach for high-performance data extraction from sources that do not provide an API for accessing their data. Distinctive features of the proposed approach are: load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.
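The load-balancing feature mentioned above can be illustrated with a minimal sketch: deterministically assigning each URL to one of several scraper workers via hashing, so the same URL always lands on the same worker. The function name `assign_worker` and the worker count are illustrative assumptions, not part of the paper's implementation.

```python
import hashlib

def assign_worker(url: str, num_workers: int) -> int:
    """Deterministically map a URL to one of num_workers scraper nodes.

    Hypothetical sketch: uses a SHA-256 digest of the URL modulo the
    number of workers, so assignment is stable across restarts.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

# Example: distribute a small batch of URLs across 4 workers
urls = ["https://example.com/a", "https://example.org/b"]
assignments = {u: assign_worker(u, 4) for u in urls}
```

In a real deployment the paper's solution would distribute such assignments across Kubernetes pods, with Redis Cluster coordinating the queue; this sketch only shows the stable-hashing idea.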
Keywords:
web-scraping, web-crawling, distributed data collection, distributed data analysis.
Citation:
D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov, “High performance distributed web-scraper”, Proceedings of ISP RAS, 33:3 (2021), 87–100
Linking options:
https://www.mathnet.ru/eng/tisp601
https://www.mathnet.ru/eng/tisp/v33/i3/p87
Statistics & downloads:
Abstract page: 168; Full-text PDF: 332; References: 40