Efficient, automatic web resource harvesting

Authors:
Michael L. Nelson; Joan A. Smith; Ignacio Garcia del Campo
Affiliations:
Old Dominion University, Norfolk VA; Old Dominion University, Norfolk VA; Old Dominion University, Norfolk VA
Venue:
WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Year:
2006

Abstract

There are two problems associated with conventional web crawling techniques: a crawler cannot know if all resources at a non-trivial web site have been discovered and crawled ("the counting problem"), and the human-readable format of the resources is not always suitable for machine processing ("the representation problem"). We introduce an approach that solves these two problems by implementing support for both the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and MPEG-21 Digital Item Declaration Language (DIDL) in the web server itself. We present the Apache module "mod_oai", which can be used to address the counting problem by listing all valid URIs at a web server and efficiently discovering updates and additions on subsequent crawls. Our experiments indicated comparable performance for initial crawls, and dramatic increases in update speed. mod_oai can also be used to address the representation problem by providing "preservation ready" versions of web resources aggregated with their respective forensic metadata in MPEG-21 DIDL format.
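
As a rough illustration of how an OAI-PMH harvester could approach the counting problem, the sketch below builds ListIdentifiers requests and parses one response. The verb, the `metadataPrefix`, `from`, and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 specification. The base URL `http://www.example.org/modoai`, the sample identifiers, the token value `page-2`, and the helper names are assumptions made for this example and are not taken from the paper or from mod_oai itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc",
                         from_date=None, resumption_token=None):
    """Build a ListIdentifiers request URL (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: only the verb accompanies it.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers",
                  "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # Ask only for resources added or changed since the last crawl.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return ([(identifier, datestamp), ...], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    headers = [
        (h.findtext("oai:identifier", namespaces=OAI_NS),
         h.findtext("oai:datestamp", namespaces=OAI_NS))
        for h in root.iterfind(".//oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="",
                          namespaces=OAI_NS).strip()
    # An empty or absent token means the list is complete.
    return headers, (token or None)


# Hand-written sample response; the identifiers and token are made up.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.example.org/index.html</identifier>
      <datestamp>2006-05-01T12:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.example.org/papers/a.pdf</identifier>
      <datestamp>2006-05-03T08:30:00Z</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "http://www.example.org/modoai"  # hypothetical base URL
    print(list_identifiers_url(base, from_date="2006-05-01"))
    found, next_token = parse_identifiers(SAMPLE_RESPONSE)
    for identifier, datestamp in found:
        print(identifier, datestamp)
    if next_token is not None:
        print(list_identifiers_url(base, resumption_token=next_token))
```

A harvester would repeat the request with each returned token until none remains, at which point it has a complete list. On later crawls, passing the previous harvest date as `from` limits the response to resources that changed since then.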