Maximum entropy density estimation and modeling geographic distributions of species

  • Authors: Miroslav Dudik
  • Affiliations: Princeton University
  • Venue: Maximum entropy density estimation and modeling geographic distributions of species
  • Year: 2007

Abstract

The maximum entropy (maxent) approach, formally equivalent to maximum likelihood, is a widely used density-estimation method. When input datasets are small, maxent is likely to overfit. Overfitting can be eliminated by various smoothing techniques, such as regularization and constraint relaxation, but theory explaining their properties is often missing or needs to be derived for each case separately. In this dissertation, we propose a unified treatment for a large and general class of smoothing techniques. We provide fully general guarantees on their statistical performance and propose optimization algorithms with complete convergence proofs. As special cases, we can easily derive performance guarantees for many known regularization types, including L1 and L2-squared regularization. Furthermore, our general approach enables us to derive entirely new regularization functions with superior statistical guarantees. The new regularization functions use information about the structure of the feature space, incorporate information about sample selection bias, and combine information across several related density-estimation tasks. We propose algorithms solving a large and general subclass of generalized maxent problems, including all those discussed in the dissertation, and prove their convergence. Our convergence proofs generalize techniques based on information geometry and Bregman divergences as well as those based more directly on compactness.
As an application of maxent, we discuss an important problem in ecology and conservation: the problem of modeling geographic distributions of species. Here, small sample sizes hinder accurate modeling of rare and endangered species. Generalized maxent offers several advantages over previous techniques. In particular, generalized maxent addresses the problem in a statistically sound manner and allows principled extensions to situations where data collection is biased or where we have access to data on many related species.
The utility of our unified approach is demonstrated in comprehensive experiments on large real-world datasets. We find that generalized maxent is among the best-performing species-distribution modeling techniques. Our experiments also show that the contributions of this dissertation, i.e., regularization strategies, bias-removal approaches, and multiple-estimation techniques, all significantly improve the predictive performance of maxent.
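To make the L1-regularized case mentioned in the abstract concrete, here is a minimal sketch on a small finite domain. It assumes the standard dual form of L1-regularized maxent: fit a Gibbs distribution q(x) proportional to exp(lam . f(x)) by minimizing the empirical log loss plus beta * sum_j |lam_j|. The optimizer is plain proximal gradient descent with soft thresholding, chosen for brevity; it is not the algorithm developed in the dissertation. All names (`gibbs`, `soft_threshold`, `fit_l1_maxent`, `beta`, `step`) are hypothetical, and features are assumed to lie in [0, 1].

```python
import math

def gibbs(lam, features):
    """Gibbs distribution q(x) proportional to exp(lam . f(x)) on a finite domain."""
    scores = [sum(l * f for l, f in zip(lam, fx)) for fx in features]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_maxent(features, sample, beta, step=1.0, iters=2000):
    """Illustrative L1-regularized maxent fit (assumed setup, not the author's method).

    features: list of feature vectors, one per domain point, values in [0, 1]
    sample:   list of domain indices observed in the data
    beta:     regularization strength, shared by all features
    """
    n = len(features[0])
    # Empirical feature averages over the sample.
    emp = [sum(features[i][j] for i in sample) / len(sample) for j in range(n)]
    lam = [0.0] * n
    for _ in range(iters):
        q = gibbs(lam, features)
        # Feature expectations under the current model.
        expect = [sum(qx * fx[j] for qx, fx in zip(q, features)) for j in range(n)]
        # Gradient step on the log loss, then soft thresholding for the L1 term.
        lam = [soft_threshold(l - step * (e - m), step * beta)
               for l, e, m in zip(lam, expect, emp)]
    return lam, gibbs(lam, features)

# Toy domain: four points described by two binary features.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sample = [1, 1, 3, 1, 0, 1, 3, 1]  # observed domain indices
lam, q = fit_l1_maxent(features, sample, beta=0.1)
```

In this sketch, the fitted model's feature expectations end up within `beta` of the empirical averages, which is the constraint-relaxation view of the same problem. A sufficiently large `beta` keeps every weight at zero and returns the uniform distribution.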