The growing size of text repositories on the World Wide Web has created a need for efficient indexing and retrieval methods for text collections. Almost all text retrieval and indexing methods have been designed for simple keyword search, in which a few keywords are specified and text is retrieved on the basis of matches to these keywords. However, many applications need greater specificity during the search, such as the use of phrases, sentences, text fragments, or even whole documents as queries. An even more general case is one in which a collection of documents serves as the query. In such cases, it is desirable to return all pairs of similar documents. Such queries are referred to as corpus-to-corpus queries, and they are computationally intensive because of the very large number of document pairs that need to be compared. The available indexing and searching methods cannot process such queries efficiently: most current techniques can index text based on only a small number of keywords or representative phrases. In this paper, we design a compressed fingerprint index that supports the following more general queries: (a) Document-to-corpus search, which the index processes very efficiently because the search relies on bitwise operations. (b) Corpus-to-corpus queries, in which it is desirable to determine the most similar pairs of documents in two collections; for these, we design a search technique that reduces the search time for large collections. The key technique that enables this is an efficient fingerprint representation, which can be used effectively for the search process. To the best of our knowledge, this is the first work on corpus-based search in massive document collections.
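The abstract does not specify how the fingerprints are built or compressed, so the following is only a minimal sketch of the general idea of fingerprint search with bitwise operations, not the paper's method. It assumes a signature-style fingerprint (hashed word bigrams superimposed into a fixed-width bit vector) and a bitwise Jaccard estimate as the similarity score; the names `fingerprint`, `similarity`, `doc_to_corpus`, and `corpus_to_corpus`, the 256-bit width, and the threshold are all hypothetical choices made for illustration.

```python
import hashlib
from itertools import product

BITS = 256  # assumed fingerprint width; the abstract gives no value


def fingerprint(text, k=2, bits=BITS):
    """Superimpose hashed word k-grams into one int used as a bit vector."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    if not grams and words:
        grams = [" ".join(words)]  # text shorter than k words
    fp = 0
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        fp |= 1 << (int.from_bytes(digest, "big") % bits)
    return fp


def similarity(a, b):
    """Bitwise Jaccard estimate: popcount(a AND b) / popcount(a OR b)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def doc_to_corpus(query, corpus_fps, top=1):
    """Score one query document against every fingerprint in a corpus."""
    q = fingerprint(query)
    scored = [(similarity(q, fp), idx) for idx, fp in enumerate(corpus_fps)]
    return sorted(scored, reverse=True)[:top]


def corpus_to_corpus(fps_a, fps_b, threshold=0.5):
    """Brute-force pairwise comparison; returns index pairs above threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(fps_a), enumerate(fps_b))
        if similarity(a, b) >= threshold
    ]


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "inverted files for fast text retrieval",
    "signature files as an access method for documents",
]
corpus_fps = [fingerprint(doc) for doc in corpus]
best = doc_to_corpus(corpus[1], corpus_fps)
pairs = corpus_to_corpus(corpus_fps, [fingerprint(corpus[1])], threshold=1.0)
```

The corpus-to-corpus function here compares every pair, which is exactly the quadratic cost the abstract describes as the bottleneck; the sketch only shows that each individual comparison reduces to a few bitwise operations, and it makes no attempt to reproduce the paper's technique for reducing the overall search time.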