Crawling the web for structured documents

Authors:
Julián Urbano;Juan Loréns;Yorgos Andreadakis;Mónica Marrero
Affiliations:
University Carlos III of Madrid, Leganés, Spain;University Carlos III of Madrid, Leganés, Spain;University Carlos III of Madrid, Leganés, Spain;University Carlos III of Madrid, Leganés, Spain
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

Structured Information Retrieval has gained considerable interest in recent years, as structured information is becoming an invaluable asset for professional communities such as Software Engineering. Most of the research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques focused on structured information. Although XML documents are the immediate choice, the Web is filled with several other types of structured information, which account for millions of other documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific search patterns in online repositories like SourceForge. This demo describes a distributed and focused web crawler for any kind of structured document, and with it we show how to exploit general-purpose resources to gather large amounts of real-world structured documents from the Web. This kind of tool could help build large test collections of other types of documents, such as Java source code for software-oriented search engines or RDF for semantic searching.
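
The abstract does not describe the crawler's internals, so the following is only a minimal single-process sketch of the general idea of a focused crawl for structured documents, not the authors' implementation. It assumes that seed pages (for example, search-result pages) link to candidate documents, and that a caller-supplied predicate decides whether a fetched body is a structured document of the wanted type. The names `focused_crawl`, `is_well_formed_xml`, `fetch` and `accept` are hypothetical, and the distributed aspect is left out entirely.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Set
import re
import xml.etree.ElementTree as ET

# Naive link extractor, good enough for a sketch (assumes double-quoted hrefs).
HREF = re.compile(r'href="([^"]+)"')


def is_well_formed_xml(body: str) -> bool:
    """Hypothetical acceptance test: keep bodies that parse as XML."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    accept: Callable[[str], bool],
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first crawl that collects URLs whose body passes `accept`.

    `fetch` is injected so the sketch stays independent of any HTTP
    library; it returns the body as text, or None on failure.
    Links are treated as opaque keys (no relative-URL resolution).
    """
    frontier = deque(seeds)
    seen: Set[str] = set(frontier)
    collected: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        body = fetch(url)
        if body is None:
            continue
        if accept(body):
            # Accepted documents are treated as leaves: store, do not expand.
            collected.append(url)
            continue
        # Otherwise treat the page as a hub and enqueue its unseen links.
        for link in HREF.findall(body):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected
```

Swapping `accept` for another predicate (say, one that checks whether a body parses as RDF or as Java source) would retarget the same loop at a different document type, which is the sense in which such a crawler is "focused".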