Proceedings of the Institute for System Programming of the RAS
Proceedings of the Institute for System Programming of the RAS, 2021, Volume 33, Issue 3, Pages 87–100
DOI: https://doi.org/10.15514/ISPRAS-2021-33(3)-7
(Mi tisp601)
 

High performance distributed web-scraper

D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov

Peter the Great St.Petersburg Polytechnic University
Abstract: Over the past decade, the Internet has become a gigantic and exceptionally rich source of data. This data is used to extract knowledge through machine learning analysis. To mine web information, the data must first be extracted from its source and placed in analytical storage; this is the ETL process. Different web sources provide different ways to access their data: either an API over the HTTP protocol or parsing of the HTML source code. This article is devoted to an approach for high-performance data extraction from sources that do not provide an API. The distinctive features of the proposed approach are load balancing, two levels of data storage, and separation of the file-downloading process from the scraping process. The approach is implemented in a solution built on the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are also described.
Keywords: web-scraping, web-crawling, distributed data collection, distributed data analysis.
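Illustration (not from the article): the three features named in the abstract, namely load balancing, two levels of storage, and downloading kept separate from scraping, can be shown in a minimal single-process sketch. Everything here is an assumption made for illustration: a shared `queue.Queue` stands in for distributed work balancing, two in-memory dicts stand in for the raw-file store and the parsed-record store, `fetch` is a hypothetical stand-in for an HTTP request, and the names `downloader`, `scraper`, and `run_pipeline` are invented. It does not reproduce the authors' Scrapy/Kubernetes implementation.

```python
import queue
import threading
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def downloader(urls, fetch, raw_store, parse_queue):
    """Stage 1: fetch pages and keep the raw HTML (first storage level)."""
    while True:
        try:
            url = urls.get_nowait()
        except queue.Empty:
            return
        raw_store[url] = fetch(url)
        parse_queue.put(url)  # hand the page over to the scraping stage


def scraper(parse_queue, raw_store, record_store):
    """Stage 2: parse stored HTML into records (second storage level)."""
    while True:
        url = parse_queue.get()
        if url is None:  # sentinel: no more work for this worker
            return
        parser = TitleParser()
        parser.feed(raw_store[url])
        record_store[url] = {"title": parser.title.strip()}


def run_pipeline(pages, n_downloaders=2, n_scrapers=2):
    """Run both stages over `pages`, a dict mapping URL -> HTML text."""
    urls = queue.Queue()
    for url in pages:
        urls.put(url)
    parse_queue = queue.Queue()
    raw_store, record_store = {}, {}
    fetch = pages.__getitem__  # hypothetical stand-in for an HTTP GET

    downloaders = [
        threading.Thread(target=downloader,
                         args=(urls, fetch, raw_store, parse_queue))
        for _ in range(n_downloaders)
    ]
    scrapers = [
        threading.Thread(target=scraper,
                         args=(parse_queue, raw_store, record_store))
        for _ in range(n_scrapers)
    ]
    for worker in downloaders + scrapers:
        worker.start()
    for worker in downloaders:
        worker.join()
    for _ in scrapers:  # one sentinel per scraper, after all downloads
        parse_queue.put(None)
    for worker in scrapers:
        worker.join()
    return raw_store, record_store


if __name__ == "__main__":
    demo_pages = {
        "http://example.test/a": "<html><head><title>Page A</title></head></html>",
        "http://example.test/b": "<html><head><title>Page B</title></head></html>",
    }
    raw, records = run_pipeline(demo_pages)
    print(records)
```

Because workers pull from shared queues, an idle worker takes the next item, which is the simplest form of load balancing. Keeping the raw HTML in its own store means the scraping stage can be rerun with a changed parser without downloading the pages again.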
Document Type: Article
Language: English
Citation: D. S. Eyzenakh, A. S. Rameykov, I. V. Nikiforov, “High performance distributed web-scraper”, Proceedings of ISP RAS, 33:3 (2021), 87–100
Citation in format AMSBIB
\Bibitem{EyzRamNik21}
\by D.~S.~Eyzenakh, A.~S.~Rameykov, I.~V.~Nikiforov
\paper High performance distributed web-scraper
\jour Proceedings of ISP RAS
\yr 2021
\vol 33
\issue 3
\pages 87--100
\mathnet{http://mi.mathnet.ru/tisp601}
\crossref{https://doi.org/10.15514/ISPRAS-2021-33(3)-7}
Linking options:
  • https://www.mathnet.ru/eng/tisp601
  • https://www.mathnet.ru/eng/tisp/v33/i3/p87