Indexing Google 1T for low-turnaround wildcarded frequency queries

Authors:
Steinar Vittersø Kaldager
Affiliations:
University of Oslo
Venue:
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
Year:
2012

Abstract

We propose a technique to prepare the Google 1T n-gram data set for wildcarded frequency queries with a very low turnaround time, making unbatched applications possible. Our method supports token-level wildcarding and, given a cache of 3.3 GB of RAM, requires only a single read of less than 4 KB from the disk to answer a query. We present an indexing structure, a way to generate it, and suggestions for how it can be tuned to particular applications.
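
The abstract does not spell out the indexing structure, so the following is only a minimal sketch of one generic way to obtain the "in-memory cache plus a single small disk read" behaviour: a sparse index over sorted, fixed-size blocks. It is not taken from the paper. The names `build_index` and `lookup`, the tab-separated record layout, and the 4096-byte block size are all assumptions made for illustration, and wildcard handling is not shown.

```python
import bisect
import io

BLOCK_SIZE = 4096  # assumed block size; the abstract only says "less than 4 KB"


def build_index(records, block_size=BLOCK_SIZE):
    """Pack sorted (ngram, count) pairs into fixed-size, zero-padded blocks.

    Returns the block file contents and the in-memory list holding the
    first n-gram of every block (this list plays the role of the cache).
    Hypothetical layout, not the paper's actual structure.
    """
    blocks, first_keys = [], []
    current = b""
    for ngram, count in sorted(records):
        line = f"{ngram}\t{count}\n".encode("utf-8")
        if len(line) > block_size:
            raise ValueError("record larger than a block")
        if not current or len(current) + len(line) > block_size:
            if current:
                blocks.append(current.ljust(block_size, b"\0"))
            current = b""
            first_keys.append(ngram)  # remember where this block starts
        current += line
    if current:
        blocks.append(current.ljust(block_size, b"\0"))
    return b"".join(blocks), first_keys


def lookup(block_file, first_keys, ngram, block_size=BLOCK_SIZE):
    """Return the stored count for ngram using one block read, or 0 if absent."""
    # Binary search in memory: the last block whose first key is <= ngram.
    i = bisect.bisect_right(first_keys, ngram) - 1
    if i < 0:
        return 0  # ngram sorts before everything stored
    block_file.seek(i * block_size)
    block = block_file.read(block_size)  # the single read from "disk"
    for line in block.rstrip(b"\0").decode("utf-8").split("\n"):
        if not line:
            continue
        key, _, count = line.partition("\t")
        if key == ngram:
            return int(count)
    return 0


# Toy data with made-up counts; a tiny block size forces more than one block.
data, first_keys = build_index(
    [("the cat sat", 120), ("a dog ran", 45), ("the dog sat", 7)],
    block_size=32,
)
block_file = io.BytesIO(data)  # stands in for a file opened in binary mode
```

In this sketch only `first_keys` has to stay in RAM, and each query touches exactly one block of the file, so memory use and read size can be traded against each other by changing the block size.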