PaSE: locating online copy of scientific documents effectively

Authors:
Byung-Won On;Dongwon Lee
Affiliations:
Department of Computer Science and Engineering, The Pennsylvania State University, PA;School of Information Sciences and Technology, The Pennsylvania State University, PA
Venue:
ICADL'04 Proceedings of the 7th international Conference on Digital Libraries: international collaboration and cross-fertilization
Year:
2004

Citing 7
Cited 4

Information retrieval in the World-Wide Web: making client-based searching feasible

Selected papers of the first conference on World-Wide Web
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
The shark-search algorithm. An application: tailored Web site mapping

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Autonomous citation matching

Proceedings of the third annual conference on Autonomous Agents
Finding scientific papers with homepagesearch and MOPS

SIGDOC '01 Proceedings of the 19th annual international conference on Computer documentation
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
BibFinder/StatMiner: effectively mining and using coverage and overlap statistics in data integration

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

What's there and what's not?: focused crawling for missing documents in digital libraries

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
SlideSeer: a digital library of aligned document and presentation pairs

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Finding what is missing from a digital library: A case study in the Computer Science field

Information Processing and Management: an International Journal
PaMS: A component-based service for finding the missing full text of articles cataloged in a digital library

Information Systems

Abstract

The need for fast and wide dissemination of research results has led to a new trend in which more authors post their documents to personal or group Web spaces so that others can easily access and download them. Similarly, more and more researchers search online for documents of interest on the Web instead of visiting libraries. Currently, to locate and download an online copy of a particular document D, one typically (1) queries search engines with the citation information and browses through the returned web pages (e.g., the author's homepage) to see if any contains D, or (2) uses the search facilities of an individual digital library (e.g., CiteSeer, e-Print) to look for D and, if it is not found, repeats the search in another digital library. However, scheme (1) requires human browsing to reach the final online copy, while scheme (2) suffers from incomplete coverage. To remedy these shortcomings, in this paper we present a system named PaSE, which can effectively locate online copies (e.g., PDF or PS) of scientific documents using citation information. We consider a myriad of alternatives in crawling and parsing the Web to arrive at the right document quickly, and present a preliminary experimental study. Using some of the best alternatives that we have identified, we show that PaSE can locate online copies of documents more accurately and conveniently than human users would, at the cost of longer search time.
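
The abstract does not specify how PaSE parses pages, so the following is only an illustrative sketch of one step such a system might perform: scanning a returned web page for links to PDF or PS files and ranking them against the citation title. The names `LinkCollector`, `tokens`, and `candidate_copies`, the token-overlap score, and the `.ps.gz` extension are all assumptions made for illustration, not the authors' method.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# PDF and PS are the formats named in the abstract; ".ps.gz" is an added assumption.
DOC_EXTENSIONS = (".pdf", ".ps", ".ps.gz")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def tokens(text):
    """Lowercased alphanumeric word set, used for a crude similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_copies(html, base_url, title):
    """Return (score, absolute_url) pairs for document links, best match first.

    The score is the fraction of title tokens found in the link's anchor
    text or URL (a hypothetical heuristic chosen for this sketch).
    """
    parser = LinkCollector()
    parser.feed(html)
    title_tokens = tokens(title)
    ranked = []
    for href, text in parser.links:
        if not href.lower().endswith(DOC_EXTENSIONS):
            continue
        overlap = len(title_tokens & tokens(text + " " + href))
        score = overlap / len(title_tokens) if title_tokens else 0.0
        ranked.append((score, urljoin(base_url, href)))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Given the HTML of an author's homepage and the title from a citation, `candidate_copies` returns the document links it found, with the closest title match first. A fuller system would also need to fetch pages, follow links to some depth, and verify the downloaded file, none of which is attempted here.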