Feature Selection in Taxonomies with Applications to Paleontology

Authors:
Gemma C. Garriga; Antti Ukkonen; Heikki Mannila
Affiliations:
HIIT, Helsinki University of Technology and University of Helsinki, Finland (all authors)
Venue:
DS '08 Proceedings of the 11th International Conference on Discovery Science
Year:
2008

Abstract

Taxonomies over a set of features occur in many real-world domains. Paleontology provides an example: the task is to determine the age of a fossil site from the taxa found in it. Because the fossil record is very noisy and contains many gaps, the challenge is to consider taxa at a suitable level of aggregation: species, genus, family, and so on. For example, some species can be well suited as features for the age prediction task, while for other parts of the taxonomy it would be better to use the genus level or even higher levels of the hierarchy. A default choice is to select a fixed level (typically species or genus); this misses the potential gain of choosing the proper level separately for each set of species. Motivated by this application, we study the problem of selecting an antichain from a taxonomy (a set of nodes, none of which is an ancestor of another) that covers all leaves and helps to better predict a specified target variable. Our experiments on paleontological data show that choosing antichains leads to better predictions than fixing specific levels of the taxonomy beforehand.
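
To make the notion of an antichain that covers all leaves concrete, the following minimal sketch enumerates every such antichain in a toy tree-shaped taxonomy and shows how site-level features could be aggregated along one of them. The taxonomy, the names `TAXONOMY`, `leaves_under`, `covering_antichains`, and `aggregate`, and the "present if any descendant species is present" aggregation rule are all illustrative assumptions. This is not the selection algorithm studied in the paper, and brute-force enumeration is only feasible for very small taxonomies.

```python
from itertools import product

# Hypothetical toy taxonomy (parent -> children); nodes without an entry are leaves.
TAXONOMY = {
    "FamilyA": ["GenusA1", "GenusA2"],
    "GenusA1": ["sp1", "sp2"],
    "GenusA2": ["sp3", "sp4"],
}


def leaves_under(node):
    """Return the leaf taxa (species) in the subtree rooted at `node`."""
    children = TAXONOMY.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def covering_antichains(node):
    """Return all antichains in the subtree of `node` that cover every leaf.

    Either the node itself is chosen (covering its whole subtree), or one
    covering antichain is chosen independently below each child.
    """
    result = [[node]]
    children = TAXONOMY.get(node, [])
    if children:
        for combo in product(*(covering_antichains(c) for c in children)):
            result.append([n for part in combo for n in part])
    return result


def aggregate(site_species, antichain):
    """Assumed aggregation rule: a node is present at a site if any
    species below it was found there."""
    return {
        node: any(s in site_species for s in leaves_under(node))
        for node in antichain
    }


if __name__ == "__main__":
    for antichain in covering_antichains("FamilyA"):
        print(antichain)
    # Mixed-level features: genus level for one branch, species for the other.
    print(aggregate({"sp2"}, ["GenusA1", "sp3", "sp4"]))
```

Fixing a single level, such as all species or all genera, corresponds to just one of the enumerated antichains. The mixed-level choices are the additional candidates that a fixed-level approach never considers.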