Recursive hashing functions for n-grams

Authors: Jonathan D. Cohen
Affiliations: National Security Agency, Fort Meade, MD
Venue: ACM Transactions on Information Systems (TOIS)
Year: 1997

Abstract

Many indexing, retrieval, and comparison methods are based on counting or cataloguing n-grams in streams of symbols. The fastest method of implementing such operations is through the use of hash tables. Rapid hashing of consecutive n-grams is best done using a recursive hash function, in which the hash value of the current n-gram is derived from the hash value of its predecessor. This article generalizes recursive hash functions found in the literature and proposes new methods offering superior performance. Experimental results demonstrate substantial speed improvement over conventional approaches, while retaining near-ideal hash value distribution.
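
To make the idea of a recursive hash function concrete, the following is a minimal sketch of one well-known scheme from the literature, a Karp-Rabin style polynomial hash over a sliding window. It is not the method proposed in the article, whose details are not given in the abstract. The constants `BASE` and `MOD` and the function names `direct_hash`, `recursive_hashes`, and `count_ngrams` are illustrative assumptions.

```python
from collections import Counter

BASE = 257            # assumed multiplier, chosen only for illustration
MOD = (1 << 31) - 1   # assumed modulus (a Mersenne prime), also illustrative


def direct_hash(symbols):
    """Hash one n-gram from scratch as a polynomial in BASE, modulo MOD."""
    h = 0
    for s in symbols:
        h = (h * BASE + s) % MOD
    return h


def recursive_hashes(data, n):
    """Yield (start_index, hash) for every n-gram of a bytes object.

    Only the first n-gram is hashed from scratch. Each later hash is
    derived from its predecessor in constant time.
    """
    if n <= 0 or len(data) < n:
        return
    top = pow(BASE, n - 1, MOD)  # weight of the symbol that leaves the window
    h = direct_hash(data[:n])
    yield 0, h
    for i in range(n, len(data)):
        h = (h - data[i - n] * top) % MOD  # drop the oldest symbol
        h = (h * BASE + data[i]) % MOD     # shift and append the newest symbol
        yield i - n + 1, h


def count_ngrams(data, n):
    """Count n-grams by hash value (distinct n-grams may collide)."""
    return Counter(h for _, h in recursive_hashes(data, n))


if __name__ == "__main__":
    text = b"abracadabra"
    for start, h in recursive_hashes(text, 3):
        print(start, text[start:start + 3], h)
```

Each update costs a fixed number of arithmetic operations regardless of n, whereas rehashing every window from scratch costs work proportional to n per n-gram. Counting by hash value alone conflates colliding n-grams, so a real table would also need to store or verify the keys.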