OverCite: a distributed, cooperative citeseer

Authors:
Jeremy Stribling;Jinyang Li;Isaac G. Councill;M. Frans Kaashoek;Robert Morris
Affiliations:
MIT Computer Science and Artificial Intelligence Laboratory;New York University and MIT Computer Science and Artificial Intelligence Laboratory via University of California, Berkeley;Pennsylvania State University;MIT Computer Science and Artificial Intelligence Laboratory;MIT Computer Science and Artificial Intelligence Laboratory
Venue:
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Year:
2006

Citing 0
Cited 7

A framework for P2P application development
Computer Communications

Events can make sense
ATC'07 Proceedings of the 2007 USENIX Annual Technical Conference

A Novel Content Distribution Mechanism in DHT Networks
NETWORKING '09 Proceedings of the 8th International IFIP-TC 6 Networking Conference

Peer-to-peer systems
Communications of the ACM

An analysis of Linux scalability to many cores
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation

STAIRS: Towards efficient full-text filtering and dissemination in DHT environments
The VLDB Journal — The International Journal on Very Large Data Bases

Scalable address spaces using RCU balanced trees
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems

Abstract

CiteSeer is a popular online resource for the computer science research community, allowing users to search and browse a large archive of research papers. CiteSeer is expensive: it generates 35 GB of network traffic per day, requires nearly one terabyte of disk storage, and needs significant human maintenance. OverCite is a new digital research library system that aggregates donated resources at multiple sites to provide CiteSeer-like document search and retrieval. OverCite enables members of the community to share the costs of running CiteSeer. The challenge facing OverCite is how to provide scalable and load-balanced storage and query processing with automatic data management. OverCite uses a three-tier design: presentation servers provide an identical user interface to CiteSeer's; application servers partition and replicate a search index to spread the work of answering each query among several nodes; and a distributed hash table stores documents and metadata, and coordinates the activities of the servers. Evaluation of a prototype shows that OverCite increases its query throughput by a factor of seven with a nine-fold increase in the number of servers. OverCite requires more total storage and network bandwidth than centralized CiteSeer, but spreads these costs over all the sites. OverCite can exploit the resources of these sites to support new features such as document alerts and to scale to larger data sets.
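
The application-tier idea in the abstract, partitioning a search index and replicating each partition so that several nodes share the work of every query, can be illustrated with a small in-memory sketch. This is not OverCite's implementation: the class PartitionedIndex, the function partition_of, the SHA-1 placement rule, and the round-robin replica choice are all assumptions made for illustration, and plain Python dictionaries stand in for the index servers.

```python
import hashlib
from collections import defaultdict


def partition_of(doc_id: str, num_partitions: int) -> int:
    """Map a document ID to a partition using a stable hash (assumed placement rule)."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedIndex:
    """Hypothetical in-memory stand-in for a partitioned, replicated search index."""

    def __init__(self, num_partitions: int, replicas: int):
        self.num_partitions = num_partitions
        self.replicas = replicas
        # nodes[p][r] is the inverted index (word -> doc IDs) held by
        # replica r of partition p; each entry models one index server.
        self.nodes = [
            [defaultdict(set) for _ in range(replicas)]
            for _ in range(num_partitions)
        ]
        # Round-robin cursor per partition, so successive queries
        # land on different replicas.
        self._next_replica = [0] * num_partitions

    def add_document(self, doc_id: str, text: str) -> None:
        """Index a document on every replica of the one partition that owns it."""
        p = partition_of(doc_id, self.num_partitions)
        for index in self.nodes[p]:
            for word in text.lower().split():
                index[word].add(doc_id)

    def search(self, query: str):
        """Ask one replica of every partition and merge the partial results."""
        words = query.lower().split()
        hits = set()
        contacted = []
        for p in range(self.num_partitions):
            r = self._next_replica[p]
            self._next_replica[p] = (r + 1) % self.replicas
            contacted.append((p, r))
            index = self.nodes[p][r]
            if words:
                # A document matches only if it contains every query word.
                hits |= set.intersection(*(index.get(w, set()) for w in words))
        return sorted(hits), contacted


index = PartitionedIndex(num_partitions=3, replicas=2)
index.add_document("doc1", "distributed hash table")
index.add_document("doc2", "distributed search index")
hits, contacted = index.search("distributed")
```

In this sketch each query touches one node per partition, so every node searches only its share of the documents, while the replicas of a partition take turns serving successive queries. In the design the abstract describes, the documents and metadata themselves would live in the distributed hash table rather than in these per-node dictionaries.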