Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus

Authors:
Deepavali Bhagwat;Kave Eshghi;Pankaj Mehra
Affiliations:
University of California;Hewlett Packard Labs;Hewlett Packard Labs
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case where similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, a similarity-based search queries only a small number of partitions. Our approach is stateless and incremental: the decision as to which partitions the features of a document should be routed to (for storage at ingestion time, and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
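The routing idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how features are extracted or how partitions are selected, so everything concrete here is an assumption made for illustration: features are hashed 4-word shingles, the routing rule picks the few smallest feature hashes and maps each to a partition by modulo, and the names `features`, `route`, `PartitionedIndex`, `NUM_PARTITIONS`, and `ROUTE_FANOUT` are hypothetical. The sketch only shows the stateless property: the same routing function, applied to the document's own features, serves both ingestion and querying.

```python
import hashlib

NUM_PARTITIONS = 128  # hypothetical number of index partitions
ROUTE_FANOUT = 3      # hypothetical number of partitions chosen per document


def features(text, width=4):
    """Assumed feature extractor: 64-bit hashes of overlapping word shingles."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < width:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + width])
                    for i in range(len(words) - width + 1)]
    return {int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
            for s in shingles}


def route(feats, fanout=ROUTE_FANOUT, num_partitions=NUM_PARTITIONS):
    """Assumed routing rule: the smallest feature hashes select the partitions.

    The decision depends only on the document's features, so no routing
    table or other shared state is needed.
    """
    return {f % num_partitions for f in sorted(feats)[:fanout]}


class PartitionedIndex:
    """In-memory stand-in for partitions hosted on separate servers."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.num_partitions = num_partitions
        # One feature -> {doc ids} map per partition.
        self.partitions = [dict() for _ in range(num_partitions)]

    def ingest(self, doc_id, text):
        feats = features(text)
        targets = route(feats, num_partitions=self.num_partitions)
        for p in targets:
            for f in feats:
                self.partitions[p].setdefault(f, set()).add(doc_id)
        return targets

    def search(self, text):
        """Query only the routed partitions; rank by shared feature count."""
        feats = features(text)
        scores = {}
        for p in route(feats, num_partitions=self.num_partitions):
            counts = {}
            for f in feats:
                for doc_id in self.partitions[p].get(f, ()):
                    counts[doc_id] = counts.get(doc_id, 0) + 1
            for doc_id, n in counts.items():
                scores[doc_id] = max(scores.get(doc_id, 0), n)
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With these assumed parameters, each ingest or search touches at most 3 of 128 partitions. Because similar documents share most of their features, they are likely to share some of their smallest feature hashes and therefore land in at least one common partition, which is the intuition behind querying only a small subset.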