Finding what is missing from a digital library: A case study in the Computer Science field

Authors:
Allan J. C. Silva; Marcos André Gonçalves; Alberto H. F. Laender; Marco A. B. Modesto; Marco Cristo; Nivio Ziviani
Affiliations:
Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil; Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil; Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil; Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil; FUCAPI - Technological and Research Foundation, Manaus, Brazil; Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
Venue:
Information Processing and Management: an International Journal
Year:
2009

Abstract

This article proposes a process for retrieving the URL of a document whose metadata record exists in a digital library catalog but for which no pointer to the full text is available. The process uses the results of queries submitted to Web search engines to find the URL of the corresponding full text or of any related material. We present a comprehensive study of this process by investigating different query strategies applied to three general-purpose search engines (Google, Yahoo!, MSN) and two specialized ones (Scholar and CiteSeer), considering five user scenarios. Specifically, we conducted experiments with metadata records taken from the Brazilian Digital Library of Computing (BDBComp) and the DBLP Computer Science Bibliography (DBLP). We found that Scholar was the most effective search engine for this task in all considered scenarios and that simple strategies for combining and re-ranking results from Scholar and Google significantly improve retrieval quality. Moreover, we study how the number of query results influences the effectiveness of finding missing information, as well as the coverage of the proposed scenarios.
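
The general shape of such a process can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the query strategies or the combining method, so the two query forms in `build_queries` (quoted title; quoted title plus first-author surname) are assumptions, and `fuse_rankings` uses reciprocal rank fusion as a stand-in for the article's combining and re-ranking strategies. The function names, the record layout, and the constant `k=60` are all hypothetical, and no real search engine API is called; each engine's ranked result list is assumed to have been fetched already.

```python
from collections import defaultdict


def build_queries(record):
    """Build candidate query strings from one metadata record.

    Assumed strategies (not taken from the article): the quoted title,
    and the quoted title followed by the first author's surname.
    """
    title = record["title"].strip()
    queries = [f'"{title}"']
    authors = record.get("authors", [])
    if authors:
        surname = authors[0].split()[-1]
        queries.append(f'"{title}" {surname}')
    return queries


def fuse_rankings(rankings, k=60):
    """Merge several ranked URL lists into a single ranking.

    Uses reciprocal rank fusion: each URL scores 1 / (k + rank) in every
    list where it appears, and the scores are summed. This is an assumed
    stand-in, not the combining strategy evaluated in the article.
    """
    scores = defaultdict(float)
    for urls in rankings:
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

Calling `build_queries` on a record such as `{"title": "...", "authors": ["..."]}` yields the strings to submit to each engine, and passing the resulting per-engine URL lists to `fuse_rankings` favors URLs that several engines rank highly.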