Unsupervised approach to web wrapper maintenance
A. M. Andreev, D. V. Berezkin, I. A. Kozlov, K. V. Simakov
Bauman Moscow State Technical University
Abstract:
HTML wrapper applications rely on formatting regularities of the targeted websites, so maintaining such applications requires detecting markup changes in web pages. This article describes an unsupervised approach to this problem. The proposed detection method consists of two parts: a real-time part based on clustering, which treats an HTML document as a vector of features, and a time-lagged part based on comparing the distributions of these features over the learning and testing sets of HTML documents. Several experiments were carried out on data obtained from a real wrapper. The results demonstrate the feasibility of the suggested approach.
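The two-part idea can be illustrated with a minimal Python sketch. The abstract does not name the features, the clustering algorithm, or the distribution test, so everything concrete here is an assumption made for illustration: the features are counts of a few structural tags (`TAGS`), the real-time check is a simple distance-to-centroid rule standing in for clustering (`is_outlier`), and the time-lagged check is a two-sample Kolmogorov-Smirnov statistic (`ks_statistic`). It is not the authors' implementation.

```python
from collections import Counter
from html.parser import HTMLParser
from math import dist

# Hypothetical feature set: counts of a few structural tags.
TAGS = ("div", "table", "tr", "td")


class TagCounter(HTMLParser):
    """Counts the opening tags seen while parsing a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_features(html, tags=TAGS):
    """Represent an HTML document as a vector of tag counts."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[t] for t in tags]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def is_outlier(vector, training, factor=2.0):
    """Real-time check (assumed rule): flag a page that lies much farther
    from the training centroid than any training page does."""
    c = centroid(training)
    radius = max(dist(v, c) for v in training)
    return dist(vector, c) > factor * radius


def ks_statistic(a, b):
    """Time-lagged check (assumed test): largest gap between the
    empirical CDFs of one feature in two samples."""
    def cdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)


# Toy data: pages before and after a hypothetical redesign.
old_pages = [
    "<table><tr><td>a</td><td>b</td></tr></table>",
    "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>",
]
new_pages = [
    "<div><div>a</div><div>b</div></div>",
    "<div><div>a</div></div>",
]

train = [tag_features(p) for p in old_pages]
test = [tag_features(p) for p in new_pages]

# Real-time part: judge each incoming page on its own.
print([is_outlier(v, train) for v in test])

# Time-lagged part: compare the distribution of one feature ("div" count)
# between the learning set and the accumulated testing set.
print(ks_statistic([v[0] for v in train], [v[0] for v in test]))
```

In this sketch the first check can react to a single changed page, while the second needs a batch of pages but can pick up shifts too small to flag any individual document. The threshold `factor` and the choice of statistic are placeholders that would have to be tuned on real wrapper data.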
Keywords:
wrapper maintenance; web-site parsing; clustering; HTML-markup statistical processing.
Citation:
A. M. Andreev, D. V. Berezkin, I. A. Kozlov, K. V. Simakov, “Unsupervised approach to web wrapper maintenance”, Inform. Primen., 7:3 (2013), 2–13
Linking options:
https://www.mathnet.ru/eng/ia267 https://www.mathnet.ru/eng/ia/v7/i3/p2