Space-efficient data structures for Top-k completion

Authors:
Bo-June (Paul) Hsu; Giuseppe Ottaviano
Affiliations:
Microsoft Research, Redmond, WA, USA; Università di Pisa, Pisa, Italy
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013

Abstract

Virtually every modern search application, whether desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists of retrieving from a string set a small number of completions, i.e., strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where hundreds of millions of distinct queries must be indexed to guarantee reasonable coverage, and for mobile devices, where memory is limited. We present three different trie-based data structures to address this problem, each with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to sizes competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
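
To make the problem statement concrete, the following is a minimal, uncompressed sketch of top-k completion over a scored string set. It is not one of the paper's three data structures: it is a plain pointer-based trie in which every node caches the highest score in its subtree, searched best-first with a heap. The names `Node`, `build`, and `top_k` are hypothetical, and the sketch assumes the strings are distinct and that higher scores rank first.

```python
import heapq
from itertools import count


class Node:
    """Trie node (illustrative only; no compression is attempted)."""
    __slots__ = ("children", "score", "best")

    def __init__(self):
        self.children = {}          # char -> Node
        self.score = None           # score if a string ends at this node
        self.best = float("-inf")   # highest score anywhere in this subtree


def build(scored_strings):
    """Build a trie from (string, score) pairs; strings assumed distinct."""
    root = Node()
    for s, score in scored_strings:
        node = root
        node.best = max(node.best, score)
        for ch in s:
            node = node.children.setdefault(ch, Node())
            node.best = max(node.best, score)
        node.score = score
    return root


def top_k(root, prefix, k):
    """Return up to k (string, score) completions of prefix, best first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []               # no string starts with this prefix

    tie = count()                   # tiebreaker so nodes are never compared
    # Heap entries: (-priority, tiebreak, string so far, node, is_result).
    # A subtree's priority is its cached best score, an upper bound on
    # every completion below it, so results pop in descending score order.
    heap = [(-node.best, next(tie), prefix, node, False)]
    results = []
    while heap and len(results) < k:
        neg, _, s, n, is_result = heapq.heappop(heap)
        if is_result:
            results.append((s, -neg))
            continue
        if n.score is not None:
            heapq.heappush(heap, (-n.score, next(tie), s, n, True))
        for ch, child in n.children.items():
            heapq.heappush(heap, (-child.best, next(tie), s + ch, child, False))
    return results
```

For example, after `root = build([("search", 10), ("seattle", 30), ("sea", 20), ("sell", 5)])`, the call `top_k(root, "se", 2)` returns the two highest-scoring strings that begin with `se`. A structure like this spends several machine words per character, which is the memory overhead that motivates the compressed representations the abstract describes.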