Indexing Google 1T for low-turnaround wildcarded frequency queries

  • Author: Steinar Vittersø Kaldager
  • Affiliation: University of Oslo
  • Venue: NAACL HLT '12: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
  • Year: 2012

Abstract

We propose a technique for preparing the Google 1T n-gram data set for wildcarded frequency queries with very low turnaround time, making unbatched applications possible. Our method supports token-level wildcarding and, given a cache of 3.3 GB of RAM, requires only a single read of less than 4 KB from disk to answer a query. We present an indexing structure, a way to generate it, and suggestions for how it can be tuned to particular applications.
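
The abstract does not describe the indexing structure itself, so the following is only a generic illustration of how a RAM-resident index can bound each query to one small disk read. It is a minimal sketch of a sparse in-memory index over fixed-size disk blocks, not the paper's method. The names `build_index`, `lookup`, and `BLOCK_SIZE`, the tab-separated record layout, and the 4096-byte block size are all assumptions made for illustration, and token-level wildcarding is left out.

```python
import bisect

# Assumed block size, chosen only to echo the abstract's "less than 4 KB" figure.
BLOCK_SIZE = 4096


def build_index(records, path):
    """Write sorted (ngram, count) records to `path` in zero-padded blocks.

    Returns the first key of every block. That list is the in-memory
    index (hypothetical layout, not the structure from the paper).
    """
    first_keys = []
    block = b""
    with open(path, "wb") as f:
        for ngram, count in sorted(records):
            line = f"{ngram}\t{count}\n".encode("utf-8")
            if len(line) > BLOCK_SIZE:
                raise ValueError("record does not fit in one block")
            # Flush the current block when the next record would overflow it.
            if block and len(block) + len(line) > BLOCK_SIZE:
                f.write(block.ljust(BLOCK_SIZE, b"\0"))
                block = b""
            if not block:
                first_keys.append(ngram)
            block += line
        if block:
            f.write(block.ljust(BLOCK_SIZE, b"\0"))
    return first_keys


def lookup(first_keys, path, ngram):
    """Return the stored count for `ngram`, or 0 if it is absent.

    The block is located by binary search in memory, so at most one
    read of BLOCK_SIZE bytes is issued per query.
    """
    i = bisect.bisect_right(first_keys, ngram) - 1
    if i < 0:
        return 0  # ngram sorts before every stored key
    with open(path, "rb") as f:
        f.seek(i * BLOCK_SIZE)
        block = f.read(BLOCK_SIZE)
    # Strip the zero padding, then scan the few records in this block.
    for line in block.rstrip(b"\0").decode("utf-8").split("\n"):
        key, _, count = line.partition("\t")
        if key == ngram:
            return int(count)
    return 0
```

In this sketch the memory cost is one key per block, so a larger block size shrinks the in-memory index at the price of a larger read per query. That is the kind of trade-off a tunable index would expose, though the paper's actual tuning parameters are not given in the abstract.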