A rigorous analysis of population stratification with limited data

  • Authors:
  • Kamalika Chaudhuri, Eran Halperin, Satish Rao, Shuheng Zhou

  • Affiliations:
  • University of California, Berkeley, CA; International Computer Science Institute, Berkeley, CA; University of California, Berkeley, CA; Carnegie Mellon University, Pittsburgh, PA

  • Venue:
  • SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms
  • Year:
  • 2007

Abstract

Finding the genetic factors of complex diseases such as cancer, currently a major effort of the international community, will potentially lead to better treatment of these diseases. One of the major difficulties in these studies is that the genetic makeup of an individual depends not only on the disease but also on the individual's ethnicity. It is therefore crucial to find methods that reduce the effects of population structure on these studies. This can be formalized as a clustering problem in which individuals are clustered according to their genetic information. Mathematically, we consider the problem of clustering bit "feature" vectors, where each vector represents the genetic information of an individual. Our model assumes that each bit vector is generated according to a prior probability distribution specified by the individual's membership in a population. We present methods that cluster the vectors while attempting to optimize the number of features required. The focus of the paper is not on the algorithms but on showing that optimizing certain objective functions on the data yields the right clustering under the random generative model. In particular, we prove that some of the previous formulations for clustering are effective. We consider two different clustering approaches. The first forms a graph and then clusters the data using a connected-components algorithm or a max-cut algorithm. The second tries to estimate simultaneously the feature frequencies in each population and the classification of vectors into populations. We show that using the first approach, Θ(log N/γ²) data (i.e., total number of features times number of vectors) is sufficient to find the correct classification, where N is the number of vectors in each population and γ is the average squared ℓ₂ distance between the feature probability vectors of the two populations. Using the second approach, we show that O(log N/α⁴) data is enough, where α is the average ℓ₁ distance between the populations. We also present polynomial-time algorithms for the resulting max-margin formulation which, for now, require only slightly more data than stated above. Our methods can also be used to give a simple combinatorial algorithm for finding a bisection in a random graph that matches Boppana's convex-programming approach (and McSherry's spectral results).
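
To make the generative model and the first, graph-based approach concrete, here is a minimal sketch under toy assumptions (population sizes, feature count, frequency vectors, and the 25th-percentile threshold are all illustrative choices, not the paper's): bit vectors are sampled from two frequency vectors p and q, pairs closer than a threshold in Hamming distance are joined by an edge, and the populations are read off as connected components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: two populations of N individuals, m binary features,
# generated from per-population frequency vectors p and q.  All constants
# here are illustrative, not the paper's.
N, m = 100, 2000
p = rng.uniform(0.2, 0.8, size=m)
q = np.clip(p + rng.choice([-0.25, 0.25], size=m), 0.05, 0.95)

X = np.vstack([(rng.random((N, m)) < p),               # population 0
               (rng.random((N, m)) < q)]).astype(int)  # population 1
true = np.array([0] * N + [1] * N)

# Pairwise Hamming distances (two entries disagree iff exactly one is 1).
D = X @ (1 - X).T + (1 - X) @ X.T

# Same-population pairs are closer in expectation; the gap between the two
# modes is sum_i (p_i - q_i)^2, i.e. m times the average squared l2 distance
# gamma.  A low quantile of all pairwise distances lands inside the "within"
# mode, so it separates the two kinds of pairs on well-separated data.
tau = np.quantile(D[np.triu_indices(2 * N, k=1)], 0.25)

# Join pairs closer than tau and read off connected components (union-find).
parent = np.arange(2 * N)

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

for i in range(2 * N):
    for j in range(i + 1, 2 * N):
        if D[i, j] < tau:
            parent[find(i)] = find(j)

comp = np.array([find(i) for i in range(2 * N)])
for c in np.unique(comp):
    print(f"component of size {(comp == c).sum()}, "
          f"true labels {np.bincount(true[comp == c], minlength=2)}")
```

The paper's contribution is to quantify how much data makes this distance gap statistically detectable; the fixed quantile used above is merely a stand-in for a carefully chosen threshold.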
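
The second approach, which estimates frequencies and assignments simultaneously, can be caricatured as a hard-EM loop: alternately fit per-population Bernoulli frequencies and reassign each vector to its likelier population. The sketch below is my own simplification, reusing X and true from the previous sketch; it illustrates the idea, not the paper's algorithm or its O(log N/α⁴) guarantee.

```python
import numpy as np

def cluster_by_frequencies(X, iters=50, eps=1e-3, seed=1):
    """Hard-EM style loop: alternately estimate per-population feature
    frequencies and reassign each vector to its likelier population."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=X.shape[0])  # random initial labels
    for _ in range(iters):
        # Frequency estimates for each tentative population, clipped away
        # from 0 and 1 so the log-likelihoods below stay finite.
        f = np.vstack([X[z == k].mean(axis=0) if (z == k).any()
                       else np.full(X.shape[1], 0.5) for k in (0, 1)])
        f = np.clip(f, eps, 1 - eps)
        # Reassign each row by its Bernoulli log-likelihood under each
        # frequency estimate.
        ll = X @ np.log(f).T + (1 - X) @ np.log(1 - f).T
        z_new = ll.argmax(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z

z = cluster_by_frequencies(X)  # X, true from the sketch above
acc = (z == true).mean()
print("agreement with planted labels:", max(acc, 1 - acc))  # labels may swap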