Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and a minimum collection frequency, can be computed efficiently by harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method, Suffix-σ, that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights into an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
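To make the problem statement concrete, the following is a minimal single-process sketch in Python of two of the ideas named in the abstract: the extension of word counting, and counting based on suffixes. It is an illustration under stated assumptions, not the authors' Hadoop implementation. The names `sigma` (maximum n-gram length), `tau` (minimum collection frequency), and all function names are hypothetical. The map and reduce phases are simulated in memory, and the suffix variant simply counts every prefix of each truncated suffix rather than using the sorting and aggregation scheme of Suffix-σ.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

NGram = Tuple[str, ...]


def map_ngrams(tokens: List[str], sigma: int) -> Iterator[Tuple[NGram, int]]:
    """Simulated map phase: emit (n-gram, 1) for every n-gram of length <= sigma."""
    for i in range(len(tokens)):
        for n in range(1, sigma + 1):
            if i + n > len(tokens):
                break
            yield tuple(tokens[i:i + n]), 1


def reduce_counts(pairs: Iterable[Tuple[NGram, int]], tau: int) -> Dict[NGram, int]:
    """Simulated reduce phase: sum counts per n-gram, keep those with count >= tau."""
    counts: Counter = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return {g: c for g, c in counts.items() if c >= tau}


def naive_ngram_statistics(docs: Iterable[str], sigma: int, tau: int) -> Dict[NGram, int]:
    """Word-count style approach: one emitted pair per n-gram occurrence."""
    pairs = (p for doc in docs for p in map_ngrams(doc.split(), sigma))
    return reduce_counts(pairs, tau)


def map_suffixes(tokens: List[str], sigma: int) -> Iterator[NGram]:
    """Emit one suffix per position, truncated to at most sigma tokens."""
    for i in range(len(tokens)):
        yield tuple(tokens[i:i + sigma])


def suffix_ngram_statistics(docs: Iterable[str], sigma: int, tau: int) -> Dict[NGram, int]:
    """Simplified suffix-based counting (assumption: not the paper's exact method).

    Each n-gram occurrence is the prefix of exactly one truncated suffix,
    so counting all prefixes of all suffixes yields the same frequencies
    while emitting only one record per token position.
    """
    counts: Counter = Counter()
    for doc in docs:
        for suffix in map_suffixes(doc.split(), sigma):
            for n in range(1, len(suffix) + 1):
                counts[suffix[:n]] += 1
    return {g: c for g, c in counts.items() if c >= tau}


if __name__ == "__main__":
    example_docs = ["a b a b", "a b c"]
    print(naive_ngram_statistics(example_docs, sigma=2, tau=2))
    print(suffix_ngram_statistics(example_docs, sigma=2, tau=2))
```

In this sketch the naive variant emits up to `sigma` records per token position, whereas the suffix variant emits a single record per position. That difference in emitted records is the kind of trade-off the abstract alludes to, although the actual costs on a cluster depend on details not modeled here.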