Dirichlet component analysis: feature extraction for compositional data

Authors:
Hua-Yan Wang;Qiang Yang;Hong Qin;Hongbin Zha
Affiliations:
Peking University;Hong Kong University of Science and Technology;State University of New York at Stony Brook;Peking University
Venue:
Proceedings of the 25th International Conference on Machine Learning
Year:
2008



Abstract

We consider feature extraction (dimensionality reduction) for compositional data, where the data vectors are constrained to be positive and constant-sum. In real-world problems, the data components (variables) usually have complicated "correlations" while their total number is huge. Such a scenario demands feature extraction: the components must be de-correlated and their dimensionality reduced. Traditional techniques such as Principal Component Analysis (PCA) are not suitable for these problems, because compositional data have unique statistical properties and the reduced data must still satisfy the constraints. This paper presents a novel approach to feature extraction for compositional data. Our method first identifies a family of dimensionality reduction projections that preserve all relevant constraints, and then finds the optimal projection that maximizes the estimated Dirichlet precision on the projected data. It reduces the compositional data to a given lower dimensionality while the components in the lower-dimensional space are de-correlated as much as possible. We develop the theoretical foundation of our approach and validate its effectiveness on synthetic and real-world datasets.
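
The two-step idea in the abstract (restrict to projections that keep the data positive and constant-sum, then pick the one with the largest estimated Dirichlet precision) can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not taken from the paper: the projection family is taken to be nonnegative matrices whose columns sum to one, the precision is estimated by a simple method of moments, and the optimizer is a plain random search. The function names are hypothetical.

```python
import numpy as np

def random_simplex_projection(n_in, n_out, rng):
    """Hypothetical helper: a nonnegative (n_out x n_in) matrix whose columns sum to 1.

    Such a matrix maps a positive, unit-sum vector to a positive, unit-sum
    vector, so the compositional constraints survive the projection.
    """
    # Each row drawn here lies on the simplex; transposing makes them columns.
    return rng.dirichlet(np.ones(n_out), size=n_in).T

def moment_precision(X):
    """Method-of-moments estimate of the Dirichlet precision s = sum(alpha).

    For a Dirichlet, var_k = m_k * (1 - m_k) / (s + 1), so each component
    gives one estimate of s; the median combines them (an assumed choice).
    """
    m = X.mean(axis=0)
    v = X.var(axis=0, ddof=1)
    s_k = m * (1.0 - m) / v - 1.0
    return float(np.median(s_k))

def search_projection(X, n_out, n_trials=200, seed=0):
    """Random search for the projection with the largest estimated precision.

    Assumption: this naive search applies no restriction beyond the column-sum
    constraint. A projection with identical columns would collapse all data to
    one point and score arbitrarily high, so a practical method needs further
    restrictions on the family.
    """
    rng = np.random.default_rng(seed)
    best_P, best_s = None, -np.inf
    for _ in range(n_trials):
        P = random_simplex_projection(X.shape[1], n_out, rng)
        s = moment_precision(X @ P.T)  # rows of X are compositions
        if s > best_s:
            best_P, best_s = P, s
    return best_P, best_s
```

Given a data matrix `X` whose rows are compositions, `P, s = search_projection(X, n_out=3)` returns one candidate projection and its estimated precision, and `X @ P.T` gives the reduced compositions, whose rows remain positive and sum to one.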