Searching by corpus with fingerprints

Authors:
Charu C. Aggarwal; Wangqun Lin; Philip S. Yu
Affiliations:
IBM T. J. Watson Research Center, Hawthorne, NY; National University of Defense Technology, Changsha, Hunan, China; University of Illinois at Chicago, Chicago, IL
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Abstract

The growing size of text repositories on the World Wide Web has created a need for efficient indexing and retrieval methods for text collections. Almost all text retrieval and indexing methods have been designed for simple keyword search, in which a few keywords are specified and text is retrieved on the basis of matches to these keywords. However, many applications need greater specificity during the search, such as the use of phrases, sentences, text fragments, or even whole documents as queries. An even more general case is one in which a collection of documents serves as the query. In such cases, it is desirable to return the sets of all pairwise similar documents. Such queries are referred to as corpus-to-corpus queries, and they are computationally intensive because of the very large number of document pairs that need to be compared. The available indexing and searching methods cannot process such queries efficiently, since most of them can index text based on only a small number of keywords or representative phrases. In this paper, we design a compressed fingerprint index that supports the following more general queries: (a) document-to-corpus search, which the method processes very efficiently because the search uses efficient bitwise operations; and (b) corpus-to-corpus queries, in which it is desirable to determine the most similar pairs of documents in two collections. We design an efficient search technique that reduces the search time for large collections. The key technique that enables this is an efficient fingerprint representation, which can be used effectively for the search process. To the best of our knowledge, this is the first work on corpus-based search in massive document collections.
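The abstract does not describe how the fingerprints are built, so the following is only a minimal sketch of the general idea of comparing documents through bitwise operations. It assumes a simple superimposed-hash fingerprint (one bit per distinct word) and a Jaccard-style overlap score; the names `fingerprint`, `similarity`, `document_to_corpus`, `corpus_to_corpus`, and the width `NUM_BITS` are hypothetical and are not taken from the paper. It also does no compression and compares all pairs by brute force, which the paper's index is designed to avoid.

```python
import hashlib

NUM_BITS = 256  # assumed fingerprint width, chosen only for illustration


def fingerprint(text, num_bits=NUM_BITS):
    """Set one hashed bit per distinct word and return the result as an int bitmask."""
    fp = 0
    for word in set(text.lower().split()):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        fp |= 1 << (int.from_bytes(digest[:4], "big") % num_bits)
    return fp


def similarity(fp_a, fp_b):
    """Ratio of shared set bits to all set bits, using bitwise AND and OR."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union


def document_to_corpus(query, corpus, top_k=1):
    """Score one query document against every document in the corpus."""
    q = fingerprint(query)
    scored = [(similarity(q, fingerprint(doc)), i) for i, doc in enumerate(corpus)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, then lowest index
    return scored[:top_k]


def corpus_to_corpus(corpus_a, corpus_b, top_k=1):
    """Brute-force scoring of every cross-collection pair (no pruning)."""
    fps_a = [fingerprint(doc) for doc in corpus_a]
    fps_b = [fingerprint(doc) for doc in corpus_b]
    pairs = [
        (similarity(fa, fb), i, j)
        for i, fa in enumerate(fps_a)
        for j, fb in enumerate(fps_b)
    ]
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return pairs[:top_k]
```

Each returned tuple starts with the score, followed by the document index (or the pair of indices for `corpus_to_corpus`). Because every comparison reduces to an AND, an OR, and two bit counts on fixed-width integers, the per-pair cost does not depend on document length, which is the property the abstract attributes to bitwise search.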