Stanford WebBase components and applications

Authors:
Junghoo Cho;Hector Garcia-Molina;Taher Haveliwala;Wang Lam;Andreas Paepcke;Sriram Raghavan;Gary Wesley
Affiliations:
Stanford University, Los Angeles, CA;Stanford University, Stanford, CA;Stanford University, Mountain View, CA;Stanford University, Mountain View, CA;Stanford University, Stanford, CA;Stanford University, San Jose, CA;Stanford University, Stanford, CA
Venue:
ACM Transactions on Internet Technology (TOIT)
Year:
2006

Abstract

We describe the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distribution module lets researchers worldwide retrieve pages from WebBase and stream them across the Internet at high speed, so that they need not each crawl the Web before beginning their research. WebBase has been used by scores of research and teaching organizations worldwide, mostly for investigations into Web topology and linguistic content analysis. After describing the system's architecture, we explain our engineering decisions for each WebBase component and present performance measurements for each.