The Connectivity Server is a special-purpose database whose schema models the Web as a graph: a set of nodes (URLs) connected by directed edges (hyperlinks). The Link Database provides fast access to the hyperlinks. To support easy implementation of a wide range of graph algorithms, we have found it important to fit the Link Database into RAM. The first version of the Link Database achieved this fit by using machines with large memories (8 GB) and storing each hyperlink in 32 bits, an approach that was limited to roughly 100 million Web pages. This paper presents techniques that compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that exploit properties of the Web graph. The first technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second takes advantage of the fact that many pages on the same host share hyperlinks; that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. Although (de)compression adds latency to hyperlink access, we can still compute the strongly connected components of a 6 billion-edge graph in 22 minutes and run applications such as Kleinberg's HITS in real time. We describe the compression techniques and report performance numbers for compression ratios and decompression speed.
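
To illustrate why same-host locality helps, the following is a minimal sketch and not the Link Database's actual format. It assumes that page IDs are assigned so that pages on the same host receive nearby integers (for example, by numbering URLs in sorted order), an assumption the abstract does not state. Under that assumption, the gaps between a page's ID and its link targets are mostly small, so a variable-length code stores them in far fewer than 32 bits. The function names (`encode_links`, `decode_links`, and the helpers) are hypothetical, and the byte-aligned code used here is a simplification: reaching under 6 bits per link would need a bit-level code and the link-sharing technique, neither of which is shown.

```python
from typing import Iterable, List, Tuple


def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def write_varint(value: int, out: bytearray) -> None:
    """Append a non-negative integer using 7 data bits per byte."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_links(source: int, targets: Iterable[int]) -> bytes:
    """Encode a page's outgoing links as gaps (hypothetical layout).

    The first target is stored relative to the source page (it may be
    smaller, hence zigzag); each later target is stored as the gap to
    the previous one, minus 1 because sorted targets are distinct.
    """
    ordered = sorted(set(targets))
    out = bytearray()
    write_varint(len(ordered), out)
    prev = source
    for i, target in enumerate(ordered):
        if i == 0:
            write_varint(zigzag(target - source), out)
        else:
            write_varint(target - prev - 1, out)
        prev = target
    return bytes(out)


def decode_links(source: int, data: bytes) -> List[int]:
    """Recover the sorted target list written by encode_links."""
    count, pos = read_varint(data, 0)
    targets: List[int] = []
    prev = source
    for i in range(count):
        value, pos = read_varint(data, pos)
        target = source + unzigzag(value) if i == 0 else prev + value + 1
        targets.append(target)
        prev = target
    return targets
```

For a page with ID 1000 linking to pages 998, 1001, 1003, 1010, and 500000, `encode_links` produces 8 bytes, compared with 20 bytes at 32 bits per link. The four nearby targets cost one byte each, while the single distant target costs three.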