Robust and distributed top-n frequent-pattern mining with SAP BW accelerator

Authors:
Thomas Legler;Wolfgang Lehner;Jan Schaffner;Jens Krüger
Affiliations:
SAP AG, Walldorf, Germany;Technische Universität Dresden, Dresden, Germany;Hasso-Plattner-Institut, Potsdam, Germany;Hasso-Plattner-Institut, Potsdam, Germany
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Mining for association rules and frequent patterns is a central activity in data mining. However, most existing algorithms are only moderately suitable for real-world scenarios. Most strategies rely on parameters such as minimum support, for which it can be very difficult to choose a suitable value on unknown datasets. Since most untrained users are unable or unwilling to set such technical parameters, we address the problem of replacing the minimum-support parameter with top-n strategies. In our paper, we start by extending a top-n implementation of the ECLAT algorithm, improving its performance through heuristic optimizations of the search strategy. In addition, real-world datasets are often distributed, and modern database architectures are switching from expensive SMPs to cheaper shared-nothing blade servers. Thus, most mining queries require distribution handling. Since partitioning can be forced by user-defined semantics, transforming the data is often not permitted. Therefore, we developed an adaptive top-n frequent-pattern mining algorithm that simplifies the mining process on real distributions by relaxing some requirements on the results. We first combine the PARTITION and the TPUT algorithms to handle distributed top-n frequent-pattern mining. Then, we extend this new algorithm for distributions with real-world data characteristics. For frequent-pattern mining algorithms, an equal distribution of the data is an important condition, and tiny partitions can cause performance bottlenecks. Hence, we implemented an approach called MAST that defines a minimum absolute-support threshold. MAST prunes patterns that have a low chance of reaching the global top-n result set and high computing costs. In total, our approach simplifies the process of frequent-pattern mining for real customer scenarios and datasets. This may make frequent-pattern mining accessible to entirely new user groups.
Finally, we present results of our algorithms when run on the SAP NetWeaver BW Accelerator with standard and real business datasets.
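
To make the idea of replacing the minimum-support parameter with a top-n request more concrete, the following is a minimal single-node sketch in Python. It is not the authors' implementation and does not cover the distributed PARTITION/TPUT combination or MAST. It assumes an ECLAT-style vertical layout (item to transaction-id set) and a support threshold that rises as soon as n patterns are known. The function name `top_n_frequent_patterns` and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_frequent_patterns(transactions, n):
    """Return up to n (support, itemset) pairs with the highest support.

    Illustrative sketch only: ECLAT-style tidset intersection with a
    threshold taken from the current n-th best pattern.
    """
    if n <= 0:
        return []

    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)

    heap = []  # min-heap of the best (support, itemset) pairs found so far

    def can_enter(support):
        # A pattern qualifies while the heap is not full, or if it beats
        # the weakest pattern currently kept (ties do not replace).
        return support > 0 and (len(heap) < n or support > heap[0][0])

    def offer(support, itemset):
        entry = (support, tuple(sorted(itemset)))
        if len(heap) < n:
            heapq.heappush(heap, entry)
        else:
            heapq.heapreplace(heap, entry)

    def extend(prefix, candidates):
        # Visit frequent extensions first so the threshold rises quickly.
        candidates.sort(key=lambda c: (-len(c[1]), c[0]))
        for i, (item, tids) in enumerate(candidates):
            if not can_enter(len(tids)):
                break  # sorted by support, so the rest cannot qualify
            itemset = prefix + (item,)
            offer(len(tids), itemset)
            # Supersets never have higher support, so prune early.
            deeper = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if can_enter(len(common)):
                    deeper.append((other, common))
            if deeper:
                extend(itemset, deeper)

    extend((), list(tidsets.items()))
    return sorted(heap, key=lambda p: (-p[0], p[1]))
```

For example, with the five hypothetical transactions `[["bread", "milk"], ["bread", "butter", "milk"], ["bread", "butter"], ["milk"], ["bread", "milk", "butter"]]` and `n = 2`, the function returns the two single items `bread` and `milk`, each with support 4. The caller states only how many patterns are wanted; the threshold is derived during the search instead of being supplied up front.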