Fast, effective molecular feature mining by local optimization

Authors:
Albrecht Zimmermann;Björn Bringmann;Ulrich Rückert
Affiliations:
Katholieke Universiteit Leuven, Leuven, Belgium;Katholieke Universiteit Leuven, Leuven, Belgium;UC Berkeley, EECS Department, Berkeley, CA
Venue:
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part III
Year:
2010

Abstract

In structure-activity relationship (SAR) modeling, one aims to find classifiers that predict the biological or chemical activity of a compound from its molecular graph. Many approaches to SAR use sets of binary substructure features, which test for the occurrence of certain substructures in the molecular graph. As an alternative to enumerating very large sets of frequent patterns, numerous pattern set mining and pattern set selection techniques have been proposed. Existing approaches can be broadly classified into those that focus on minimizing correspondences, that is, the number of pairs of training instances from different classes with identical encodings, and those that focus on maximizing the number of equivalence classes, that is, the number of unique encodings in the training data. In this paper we evaluate a number of techniques to investigate which criterion is a better indicator of predictive accuracy. We find that minimizing correspondences is a necessary but not sufficient condition for good predictive accuracy, that the number of equivalence classes is a better indicator of success, and that it is important to have a good match between training set size and pattern set size. Based on these results we propose a new, improved algorithm that performs local minimization of correspondences, yet evaluates the effect of patterns on equivalence classes globally. Empirical experiments demonstrate its efficacy and its superior run-time behavior.
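
The two criteria defined in the abstract can be made concrete with a small sketch. It is an illustration of the definitions only, not the authors' algorithm. It assumes each training instance is already encoded as a tuple of binary substructure features and has a class label; the function names `count_correspondences` and `count_equivalence_classes` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_equivalence_classes(encodings):
    """Number of unique feature encodings in the training data."""
    return len({tuple(enc) for enc in encodings})


def count_correspondences(encodings, labels):
    """Number of pairs of instances from different classes that
    share an identical encoding."""
    # For every distinct encoding, count how many instances of each class map to it.
    per_encoding = defaultdict(Counter)
    for enc, label in zip(encodings, labels):
        per_encoding[tuple(enc)][label] += 1

    total = 0
    for class_counts in per_encoding.values():
        # Each pair of distinct classes contributes (count_a * count_b) pairs.
        counts = list(class_counts.values())
        total += sum(a * b for a, b in combinations(counts, 2))
    return total


# Toy data (made up for illustration): two binary substructure features.
encodings = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 0)]
labels = ["active", "inactive", "active", "inactive", "inactive"]

print(count_equivalence_classes(encodings))
print(count_correspondences(encodings, labels))
```

In this toy data the encoding `(1, 0)` is shared by one active and two inactive compounds, so the pattern set cannot separate them. A pattern set that lowers the correspondence count or raises the number of equivalence classes distinguishes more of the training instances.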