Efficient indexing of repeated n-grams

Authors:
Samuel Huston;Alistair Moffat;W. Bruce Croft
Affiliations:
University of Massachusetts Amherst, Amherst, MA, USA;University of Melbourne, Melbourne, Australia;University of Massachusetts Amherst, Amherst, MA, USA
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Citing 12
Cited 5

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Accurate discovery of co-derivative documents via duplicate text detection

Information Systems
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Finding similar files in a large file system

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Search Engines: Information Retrieval in Practice

Search Engines: Information Retrieval in Practice
Pairwise document similarity in large collections with MapReduce

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Efficiency optimizations for interpolating subqueries

Proceedings of the 20th ACM international conference on Information and knowledge management
Sketch-based indexing of n-words

Proceedings of the 21st ACM international conference on Information and knowledge management
Computing n-gram statistics in MapReduce

Proceedings of the 16th International Conference on Extending Database Technology
Mind the gap: large-scale frequent sequence mining

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.