Partially ordered feature sets appear naturally in many classification settings with structured input instances, for example, when the data instances are graphs and a feature tests whether a specific substructure occurs in an instance. Since such features are partially ordered according to an "is substructure of" relation, the information in these datasets is stored in an intrinsically redundant form. We investigate how this redundancy affects the capacity-control behavior of linear classification methods. From a theoretical perspective, it can be shown that the capacity of this hypothesis class does not decrease for worst-case distributions. However, if the data-generating distribution assigns lower probabilities to instances in the lower levels of the hierarchy induced by the partial order, the capacity of the hypothesis class can be bounded by a smaller term. For itemset, subsequence, and subtree features in particular, the capacity is finite even when an infinite number of features is present. We validate these results empirically on three graph datasets and show that the limited capacity of linear classifiers on such data makes underfitting, rather than overfitting, the more prominent capacity-control problem. To avoid underfitting, we propose using more general substructure classes with "elastic edges", and we demonstrate how such broad feature classes can be used with large datasets.
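To make the redundancy concrete, here is a minimal sketch (hypothetical illustration, not the paper's code) using substring features on sequence data instead of subgraph features on graphs. Each feature tests whether a given pattern occurs in an instance; because the patterns are partially ordered by "is substructure of", whenever a pattern fires, every sub-pattern of it fires as well, so the resulting binary vectors are intrinsically redundant:

```python
def substring_features(patterns, instance):
    """Binary feature vector: 1 if the pattern occurs as a substring of the instance."""
    return [1 if p in instance else 0 for p in patterns]

# A chain in the partial order: "a" is a substructure of "ab",
# which is a substructure of "abc".
patterns = ["a", "ab", "abc"]

x = substring_features(patterns, "xabcy")
# If a pattern occurs, all of its sub-patterns occur too, so along this
# chain the feature values can only decrease.
assert all(x[i] >= x[i + 1] for i in range(len(x) - 1))
print(x)  # [1, 1, 1]

y = substring_features(patterns, "xaby")
print(y)  # [1, 1, 0]
```

The same monotonicity holds for subgraph features: an instance containing a pattern graph also contains all of its subgraphs, which is the redundancy whose effect on capacity the abstract analyzes.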