What's there and what's not?: focused crawling for missing documents in digital libraries

Authors:
Ziming Zhuang;Rohit Wagle;C. Lee Giles
Affiliations:
Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA;Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Year:
2005

Citing 15
Cited 12

Dynamic reference sifting: a case study in the homepage domain

Selected papers from the sixth international conference on World Wide Web
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
WTMS: a system for collecting and analyzing topic-specific Web information

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Intelligent crawling on the World Wide Web with arbitrary predicates

Proceedings of the 10th international conference on World Wide Web
Breadth-first crawling yields high-quality pages

Proceedings of the 10th international conference on World Wide Web
Evaluating topic-driven web crawlers

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Finding scientific papers with homepagesearch and MOPS

SIGDOC '01 Proceedings of the 19th annual international conference on Computer documentation
Topic-sensitive PageRank

Proceedings of the 11th international conference on World Wide Web
Focused Crawling Using Context Graphs

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Machine Learning Approach for Homepage Finding Task

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Panorama: extending digital libraries with topical crawlers

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Topical web crawlers: Evaluating adaptive algorithms

ACM Transactions on Internet Technology (TOIT)
Probabilistic inference from arbitrary uncertainty using mixtures of factorized generalized gaussians

Journal of Artificial Intelligence Research
PaSE: locating online copy of scientific documents effectively

ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization

Agreeing to disagree: search engines and their public interfaces

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
OnCU system: ontology-based category utility approach for author name disambiguation

Proceedings of the 2nd international conference on Ubiquitous information management and communication
Enhancing digital libraries using missing content analysis

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Finding what is missing from a digital library: A case study in the Computer Science field

Information Processing and Management: an International Journal
Profile-based focused crawling for social media-sharing websites

Journal on Image and Video Processing
State of the Art in Semantic Focused Crawlers

ICCSA '09 Proceedings of the International Conference on Computational Science and Its Applications: Part II
Exploiting Tags and Social Profiles to Improve Focused Crawling

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
PaMS: A component-based service for finding the missing full text of articles cataloged in a digital library

Information Systems
Addressing the limited scope problem of focused crawling using a result merging approach

Proceedings of the 2010 ACM Symposium on Applied Computing
Evaluating methods to rediscover missing web pages from the web infrastructure

Proceedings of the 10th annual joint conference on Digital libraries
Beyond digital incunabula: modeling the next generation of digital libraries

ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries
Focused crawling of tagged web resources using ontology

Computers and Electrical Engineering

Abstract

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and by accepting authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose using alternative online resources, and techniques that exploit them as fully as possible, to build the complete document collection of any given publication venue.

We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation based on the harvest rate shows definite advantages over other crawling approaches, and the crawler consistently outperforms a defined baseline crawler on a number of measures.
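The abstract reports two evaluation measures, recall and harvest rate, without defining them. The sketch below illustrates both under definitions commonly used in the focused-crawling literature; these are assumptions, not taken from this record: recall is the fraction of target papers that were retrieved, and harvest rate is the fraction of crawled pages that turned out to be relevant. The function names and sample data are hypothetical.

```python
def recall(found: set[str], targets: set[str]) -> float:
    """Fraction of target papers that were retrieved (assumed definition)."""
    if not targets:
        return 0.0
    return len(found & targets) / len(targets)


def harvest_rate(relevant_flags: list[bool]) -> float:
    """Fraction of crawled pages judged relevant (assumed definition)."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


# Hypothetical data: four target papers, three of which were retrieved,
# plus one retrieved document that was not a target.
targets = {"p1", "p2", "p3", "p4"}
found = {"p1", "p2", "p3", "x"}
print(recall(found, targets))

# Hypothetical crawl of five pages, three of them relevant.
print(harvest_rate([True, False, True, True, False]))
```

Under these definitions, a recall of 0.75 for missing documents would mean that three out of every four papers absent from the library were recovered.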