Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from a specific publishing venue. We propose to use alternative online resources, together with techniques that maximally exploit them, to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that locates authors' homepages and then uses focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluated by harvest rate, the crawler shows definite advantages over other crawling approaches, and it consistently outperforms a defined baseline crawler on a number of measures.
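The two evaluation measures named above can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions based on their common use in the focused-crawling literature: recall as the fraction of wanted papers that were retrieved, and harvest rate as the fraction of fetched pages that turned out to be relevant. The function names and the sample identifiers are hypothetical and do not come from the paper.

```python
def recall(found, wanted):
    """Fraction of the wanted papers that the crawler retrieved.

    Assumed definition: |wanted AND found| / |wanted|.
    """
    wanted = set(wanted)
    if not wanted:
        return 0.0
    return len(wanted & set(found)) / len(wanted)


def harvest_rate(pages_fetched, relevant_fetched):
    """Fraction of fetched pages that were relevant.

    Assumed definition: relevant pages / all pages fetched.
    """
    if pages_fetched == 0:
        return 0.0
    return relevant_fetched / pages_fetched


# Hypothetical example: four papers are missing from the library,
# and the crawler retrieves three of them plus one unrelated document.
missing_papers = {"p1", "p2", "p3", "p4"}
retrieved = {"p1", "p2", "p3", "unrelated"}

missing_recall = recall(retrieved, missing_papers)  # 3 of 4 wanted papers found
rate = harvest_rate(pages_fetched=20, relevant_fetched=5)  # 5 of 20 pages relevant
```

Under these assumed definitions, a higher harvest rate means the crawler spends fewer downloads on pages that do not yield a wanted paper, which is the sense in which one crawler can be compared against a baseline.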