The Link Database: Fast Access to Graphs of the Web

Authors:
Keith H. Randall;Raymie Stata;Janet L. Wiener;Rajiv G. Wickremesinghe
Venue:
DCC '02 Proceedings of the Data Compression Conference
Year:
2002

Abstract

The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URLs) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms, we have found it important to fit the Link Database into RAM. In the first version of the Link Database, we achieved this fit by using machines with large memories (8 GB) and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the Web graph. The first technique exploits the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second exploits the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6-billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database and provides performance numbers for compression ratios and decompression speed.
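
To make the first technique concrete, the sketch below shows one way same-host locality can translate into fewer bits per link. It is an illustration under stated assumptions, not the encoding used in the paper: it assumes pages carry integer IDs assigned so that pages on the same host get nearby IDs (for example, by numbering URLs in sorted order), stores each sorted adjacency list as gaps between consecutive targets, and writes each gap as a byte-aligned variable-length integer. The function names are hypothetical, and the second technique (sharing links between similar pages) is not shown.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # high bit set: more groups follow
        else:
            out.append(group)
            return bytes(out)


def zigzag(n: int) -> int:
    """Map a signed integer to a non-negative one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)


def compress_links(source: int, targets: List[int]) -> bytes:
    """Gap-encode the out-links of page `source` (assumed ID scheme, see above)."""
    out = bytearray()
    previous = source
    for i, target in enumerate(sorted(targets)):
        gap = target - previous
        # The first gap is relative to the source page and may be negative;
        # later gaps are between sorted targets and are never negative.
        out += encode_varint(zigzag(gap) if i == 0 else gap)
        previous = target
    return bytes(out)


def decompress_links(source: int, data: bytes) -> List[int]:
    """Rebuild the sorted target list from the bytes made by compress_links."""
    targets: List[int] = []
    previous = source
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # continuation bit: keep accumulating this integer
            shift += 7
            continue
        gap = unzigzag(value) if not targets else value
        previous += gap
        targets.append(previous)
        value = shift = 0
    return targets


if __name__ == "__main__":
    links = [1003, 998, 1001, 1250, 5_000_000]  # made-up page IDs
    packed = compress_links(1000, links)
    print(len(packed), "bytes instead of", 4 * len(links))
    print(decompress_links(1000, packed))
```

In this sketch a link to a nearby page ID costs a single byte, while a link to a distant ID costs several, so the more links stay on the same host, the smaller the average. A bit-aligned or nybble-aligned code would be needed to approach figures like the "under 6 bits per link" reported in the abstract; byte alignment is used here only to keep the example short.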