WebCQ-detecting and delivering information changes on the web
Proceedings of the ninth international conference on Information and knowledge management
Parallel crawler architecture and web page change detection
WSEAS Transactions on Computers
Topical web crawling using weighted anchor text and web page change detection techniques
WSEAS Transactions on Information Science and Applications
Hi-index | 0.00 |
The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information. In this paper we describe an approach for building a system for efficiently monitoring changes to Web documents. This paper has three main contributions. First, we present a coherent framework that captures different characteristics of Web documents. The system uses the Page Digest encoding to provide a comprehensive monitoring system for content, structure, and other interesting properties of Web documents. Second, the Page Digest encoding enables improved performance for individual page monitors through mechanisms such as short-circuit evaluation, linear time algorithms for document and structure similarity, and data size reduction. Finally, we develop a collection of sentinel grouping techniques based on the Page Digest encoding to reduce redundant processing in large-scale monitoring systems by grouping similar monitoring requests together. We examine how effective these techniques are over a wide range of parameters and have seen an order of magnitude speed up over existing Web-based information monitoring systems.