Not so creepy crawler: easy crawler generation with standard XML queries

Authors:
Franziska von dem Bussche;Klara Weiand;Benedikt Linse;Tim Furche;François Bry
Affiliations:
University of Munich, Munich, Germany;University of Munich, Munich, Germany;University of Munich, Munich, Germany;University of Munich, Munich, Germany;University of Munich, Munich, Germany
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like last.fm. In these cases, pages are far more uniformly structured than in the general Web, so crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis. In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2). What sets nc2 apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine, for example one for XQuery or Xcerpt. Customizing a crawler just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers. We demonstrate nc2 with two applications. The first extracts data about cities from Wikipedia with a customizable set of attributes for selecting and reporting these cities. It illustrates the power of nc2 where data extraction from Wiki-style, fairly homogeneous knowledge sites is required. In contrast, the second use case demonstrates how easy nc2 makes even complex analysis tasks on social networking sites, here exemplified by last.fm.
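
To make the idea of delegating crawl decisions to queries more concrete, the following is a minimal sketch and not nc2's actual interface. It assumes a tiny in-memory site (`PAGES`), a hypothetical `crawl` function, and two illustrative query roles: one that selects links to follow and one that selects data to report. The abstract does not name the four query types, so these two roles are assumptions made for illustration only. Python's `xml.etree.ElementTree` with its limited XPath subset stands in for a full XQuery or Xcerpt engine, and access to crawl metadata is not modelled here.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory "site": URL -> well-formed XML document.
# A real crawler would fetch pages over HTTP and normalize them to XML first.
PAGES = {
    "/cities": (
        '<page>'
        '<a class="city-link" href="/munich">Munich</a>'
        '<a href="/about">About</a>'
        '</page>'
    ),
    "/munich": "<page><city><name>Munich</name><country>Germany</country></city></page>",
    "/about": "<page><p>About this site</p></page>",
}


def crawl(start, follow_query, extract_query, pages=PAGES):
    """Breadth-first crawl in which the two queries make all decisions.

    follow_query  selects the link elements whose targets are crawled next.
    extract_query selects the elements that are reported as data records.
    Both names are illustrative assumptions, not nc2 terminology.
    """
    seen = {start}
    queue = deque([start])
    results = []
    while queue:
        url = queue.popleft()
        doc = ET.fromstring(pages[url])
        # Extraction decision: delegated to the query.
        for hit in doc.findall(extract_query):
            results.append({child.tag: child.text for child in hit})
        # Navigation decision: delegated to the query.
        for link in doc.findall(follow_query):
            target = link.get("href")
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return results


# Customizing the crawler only means changing the two query strings.
cities = crawl(
    "/cities",
    follow_query=".//a[@class='city-link']",
    extract_query=".//city",
)
print(cities)
```

The crawl loop itself contains no site-specific logic. Swapping the query strings changes which pages are visited and which records are reported, which is the design point the abstract makes about customizing crawlers through declarative queries.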