Computing n-gram statistics in MapReduce

Authors:
Klaus Berberich; Srikanta Bedathur
Affiliations:
Max Planck Institute for Informatics, Saarbrücken, Germany; Indraprastha Institute of Information Technology, New Delhi, India
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Abstract

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and a minimum collection frequency, can be computed efficiently by harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method, Suffix-σ, that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
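
To make the problem statement concrete, the following is a minimal single-machine sketch in Python, not the Hadoop implementation the abstract refers to. It simulates the map, shuffle, and reduce phases in memory. The function names and the parameters `max_len` (maximum n-gram length) and `min_freq` (minimum collection frequency) are hypothetical and chosen only for illustration. `count_naive` follows the "extension of word counting" idea. `count_via_suffixes` is a simplified illustration of counting by aggregating over suffixes; it is an assumption about the general idea and should not be read as the paper's Suffix-σ algorithm.

```python
from collections import Counter
from itertools import groupby


def map_ngrams(tokens, max_len):
    """Map step: emit (n-gram, 1) for every n-gram of up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield tuple(tokens[start:end]), 1


def count_naive(docs, max_len, min_freq):
    """Word-count-style job: sort pairs by key (shuffle), then sum per key."""
    pairs = sorted(pair for doc in docs for pair in map_ngrams(doc, max_len))
    counts = {}
    for ngram, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        if total >= min_freq:  # keep only sufficiently frequent n-grams
            counts[ngram] = total
    return counts


def count_via_suffixes(docs, max_len, min_freq):
    """Simplified illustration (not the paper's Suffix-sigma algorithm).

    Emits one suffix per token position, truncated to max_len tokens,
    then credits every prefix of each suffix with that suffix's count.
    """
    suffixes = Counter(
        tuple(doc[i:i + max_len]) for doc in docs for i in range(len(doc))
    )
    counts = Counter()
    for suffix, freq in suffixes.items():
        for n in range(1, len(suffix) + 1):
            counts[suffix[:n]] += freq
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}
```

Both functions take a list of tokenized documents, for example `[["a", "b", "a", "b"], ["b", "a"]]`, and return a dictionary from n-gram tuples to collection frequencies. The suffix-based variant emits one record per token position instead of up to `max_len` records, which is the kind of trade-off between the methods that the abstract alludes to.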