High performance distributed web-scraper
D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov, Peter the Great St. Petersburg Polytechnic University
Abstract:
Over the past decade, the Internet has become a gigantic and rich source of data. This data is used for knowledge extraction by means of machine learning analysis. In order to perform data mining on web information, the data must be extracted from its source and placed in analytical storage. This is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. This article is devoted to an approach for high-performance data extraction from sources that do not provide an API for accessing their data. Distinctive features of the proposed approach are: load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built with the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are described in this article as well.
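The load-balancing feature mentioned above can be illustrated with a minimal sketch: deterministically assigning each URL to one of several scraper workers via hashing, so the same URL always lands on the same worker. The function name `assign_worker` and the worker count are illustrative assumptions, not part of the paper's implementation.

```python
import hashlib

def assign_worker(url: str, num_workers: int) -> int:
    """Deterministically map a URL to one of num_workers scraper nodes.

    Hypothetical sketch: uses a SHA-256 digest of the URL modulo the
    number of workers, so assignment is stable across restarts.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

# Example: distribute a small batch of URLs across 4 workers
urls = ["https://example.com/a", "https://example.org/b"]
assignments = {u: assign_worker(u, 4) for u in urls}
```

In a real deployment the paper's solution would distribute such assignments across Kubernetes pods, with Redis Cluster coordinating the queue; this sketch only shows the stable-hashing idea.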
Keywords:
web-scraping, web-crawling, distributed data collection, distributed data analysis.
Citation:
D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov, “High performance distributed web-scraper”, Proceedings of ISP RAS, 33:3 (2021), 87–100
Linking options:
https://www.mathnet.ru/eng/tisp601
https://www.mathnet.ru/eng/tisp/v33/i3/p87
Statistics & downloads:
Abstract page: 168; Full-text PDF: 332; References: 40