Searching large text collections

Authors:
Ricardo Baeza-Yates;Alistair Moffat;Gonzalo Navarro
Affiliations:
Dept. of Computer Science, Universidad de Chile, Santiago, Chile;Dept. of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia;Dept. of Computer Science, Universidad de Chile, Santiago, Chile
Venue:
Handbook of massive data sets
Year:
2002

Abstract

In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most widely used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based on inverted files can be used to index and search the Web.