A constrained crawling approach and its application to a specialised search engine

  • Authors: Mehdi Adda

  • Affiliations: Department of Computer Science, Engineering and Mathematics, University of Quebec at Rimouski, 300, allee des Ursulines, C.P. 3300, succ. A, Rimouski, Quebec, G5L 3A1, Canada

  • Venue: International Journal of Information and Communication Technology
  • Year: 2011

Abstract

In this paper, we present an approach to crawling and parsing websites based on their logical structure rather than on random exploration. In this approach, we use a set of constraints to identify web pages and their components. To enforce these constraints, we present a set of primitives that rely on predicate verification. Our model is flexible enough to reflect the tree-like logical structures of websites, and thus avoids the need for complex information analysis and content classification techniques. Furthermore, because the model is implemented as a domain-specific language (DSL), describing crawling tasks is straightforward. Using this DSL, we developed and deployed a prototype dynamic web application with full-text search capabilities that periodically crawls, parses, and analyses the content of selected online newspapers. A set of experiments and comparisons highlights the effectiveness of the proposed crawling approach.
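
The idea of crawling along a tree-like logical structure, with a predicate deciding which pages belong to each level, can be illustrated with a minimal Python sketch. This is not the paper's DSL or its primitives: the names Node, constrained_crawl, and SITE, as well as the in-memory site and its URLs, are hypothetical and chosen only for illustration. The sketch assumes that a link is followed only when its target satisfies the predicate of a child node in the structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical in-memory "website": URL -> outgoing links (no network access).
SITE: Dict[str, List[str]] = {
    "/": ["/news", "/about", "/ads/banner"],
    "/news": ["/news/2011/a1", "/news/2011/a2", "/login"],
    "/news/2011/a1": ["/", "/news"],
    "/news/2011/a2": ["/ads/banner"],
    "/about": [],
    "/ads/banner": [],
    "/login": [],
}

Predicate = Callable[[str], bool]


@dataclass
class Node:
    """One level of the logical structure, guarded by a URL predicate."""
    name: str
    accepts: Predicate
    children: List["Node"] = field(default_factory=list)


def constrained_crawl(
    url: str,
    node: Node,
    site: Dict[str, List[str]],
    visited: Optional[Set[str]] = None,
) -> List[Tuple[str, str]]:
    """Visit url only if it satisfies node's predicate, then follow only
    those links that satisfy the predicate of one of node's children."""
    if visited is None:
        visited = set()
    if url in visited or not node.accepts(url):
        return []
    visited.add(url)
    found = [(node.name, url)]
    for link in site.get(url, []):
        for child in node.children:
            found.extend(constrained_crawl(link, child, site, visited))
    return found


# Hypothetical three-level structure: home -> section -> article.
article = Node("article", lambda u: u.startswith("/news/"))
section = Node("section", lambda u: u == "/news", [article])
home = Node("home", lambda u: u == "/", [section])

result = constrained_crawl("/", home, SITE)
for level, page in result:
    print(level, page)
```

In this sketch, pages such as /about, /login, and /ads/banner are never visited because no node in the structure accepts them, which is the contrast with an unconstrained crawl that would follow every link.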