An efficient indexer for large N-gram corpora

Authors:
Hakan Ceylan;Rada Mihalcea
Affiliations:
University of North Texas, Denton, TX;University of North Texas, Denton, TX
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations
Year:
2011

References:
File structures: an analytic approach
Smoothing a tera-word language model. HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
SemEval-2007 task 10: English lexical substitution task. SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations
FBK-irst: lexical substitution task exploiting domain and syntagmatic coherence. SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations
KU: word sense disambiguation by substitution. SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations
Google web 1T 5-grams made easy (but not for the computer). WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

Cited by:
Computing n-gram statistics in MapReduce. Proceedings of the 16th International Conference on Extending Database Technology

Abstract

We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and retrieves any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wildcard queries, provided that the wildcards in the query are contiguous. We also implement some of the smoothing algorithms that are designed specifically for large datasets and have been shown to yield better language models than the traditional ones on the Web1T 5-gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substitution task with a simple implementation that gives a considerable improvement over a basic language model.
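
The abstract does not describe the index layout, so the following is only an illustrative sketch of one common way to obtain single-access lookups over sorted N-gram data: a sparse in-memory index that stores the first key of each fixed-size block, plus one block read per query. The names `build_sparse_index`, `lookup`, and `BLOCK_SIZE` are hypothetical, and an in-memory byte buffer stands in for the on-disk data file; this is not the tool's actual implementation.

```python
import bisect
import io

BLOCK_SIZE = 2  # records per block; hypothetical and tiny, for illustration only


def build_sparse_index(sorted_records, block_size=BLOCK_SIZE):
    """Serialize (ngram, count) pairs, already sorted by ngram, into blocks.

    Returns the data buffer, the first key of every block, and each block's
    (offset, length). Only the last two would need to stay in memory.
    """
    data = io.BytesIO()  # stands in for the on-disk data file
    first_keys, spans = [], []
    for start in range(0, len(sorted_records), block_size):
        block = sorted_records[start:start + block_size]
        payload = "".join(f"{ngram}\t{count}\n" for ngram, count in block)
        encoded = payload.encode("utf-8")
        first_keys.append(block[0][0])
        spans.append((data.tell(), len(encoded)))
        data.write(encoded)
    return data, first_keys, spans


def lookup(data, first_keys, spans, ngram):
    """Return the count for ngram, or None, using one block read."""
    # Binary search in memory for the last block whose first key <= ngram.
    i = bisect.bisect_right(first_keys, ngram) - 1
    if i < 0:
        return None  # ngram sorts before every stored key
    offset, length = spans[i]
    data.seek(offset)
    block = data.read(length).decode("utf-8")  # the single "disk" access
    for line in block.splitlines():
        key, count = line.split("\t")
        if key == ngram:
            return int(count)
    return None
```

In this sketch the memory cost is one key and one offset pair per block, so the index size is controlled by the block size rather than by the number of N-grams.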