Efficient Search Methods for Statistical Dependency Rules

  • Authors:
  • Wilhelmiina Hämäläinen

  • Affiliations:
  • (Correspd.) (Emil Aaltonen Fund (Emil Alltosen säätiö) has supported this research with a personal grant) Department of Biosciences, University of Eastern Finland, Kuopio, Finland. ...

  • Venue:
  • Fundamenta Informaticae - Machine Learning in Bioinformatics
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Dependency analysis is one of the central problems in bioinformatics and all empirical science. In genetics, for example, an important problem is to find which gene alleles are mutually dependent or which alleles and diseases are dependent. In ecology, a similar problem is to find dependencies between different species or groups of species. In both cases a classical solution is to consider all pairwise dependencies between single attributes and evaluate the relationships with some statistical measure like the χ 2-measure. It is known that the actual dependency structures can involve more attributes, but the existing computational methods are too inefficient for such an exhaustive search. In this paper, we introduce efficient search methods for positive dependencies of the form X → A with typical statistical measures. The efficiency is achieved by a special kind of a branch-and-bound search which also prunes out redundant rules. Redundant attributes are especially harmful in dependency analysis, because they can blur the actual dependencies and even lead to erroneous conclusions. We consider two alternative definitions of redundancy: the classical one and a stricter one. We improve our previous algorithm for searching for the best strictly non-redundant dependency rules and introduce a totally new algorithm for searching for the best classically non-redundant rules. According to our experiments, both algorithms can prune the search space very efficiently, and in practice no minimum frequency thresholds are needed. This is an important benefit, because biological data sets are typically dense, and the alternative search methods would require too large minimum frequency thresholds for any practical purpose.