Predicting the potential habitat of oaks with data mining models and the R system

  • Authors:
  • Rafael Pino-Mejías;María Dolores Cubiles-de-la-Vega;María Anaya-Romero;Antonio Pascual-Acosta;Antonio Jordán-López;Nicolás Bellinfante-Crocci

  • Affiliations:
  • Department of Statistics and Operational Research, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain;Department of Statistics and Operational Research, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain;Department of Crystallography, Mineralogy and Agricultural Chemistry, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain;Andalusian Prospective Center, Avda. Reina Mercedes s/n, Seville, Spain;Department of Crystallography, Mineralogy and Agricultural Chemistry, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain;Department of Crystallography, Mineralogy and Agricultural Chemistry, University of Seville, Avda. Reina Mercedes s/n, Seville, Spain

  • Venue:
  • Environmental Modelling & Software
  • Year:
  • 2010

Abstract

Oak forests are essential to the ecosystems of many countries, particularly when they are used in vegetation restoration. Models that predict the potential habitat of oaks can therefore be a valuable tool for environmental work. With this objective, data mining models are built and compared for predicting the potential habitat of the oak forest type in Mediterranean areas (southern Spain), with conclusions applicable to other regions. Thirty-one environmental input variables were measured, and six base models for supervised classification were selected: linear and quadratic discriminant analysis, logistic regression, classification trees, neural networks and support vector machines. Three ensemble methods, each combining classification trees fitted to samples and sets of variables generated from the original data set, were also evaluated: bagging, random forests and boosting. The available data set was randomly split into three parts: a training set (50%), a validation set (25%) and a test set (25%). Analysis of the accuracy, sensitivity, specificity and area under the ROC curve on the test set reveals that the best models for our oak data set are bagging and random forests. All of these models can be fitted with free R programs that use the libraries and functions described in this paper. Furthermore, the methodology used in this study will allow researchers to determine the potential distribution of oaks in other kinds of areas.
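
The evaluation protocol summarized in the abstract (a random 50%/25%/25% split, then accuracy, sensitivity, specificity and area under the ROC curve on the test set) can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the authors' code; the function names `split_indices`, `confusion_metrics` and `roc_auc` are hypothetical, the class labels are assumed to be coded 1 (oak habitat present) and 0 (absent), and both classes are assumed to occur in the evaluated set.

```python
import random


def split_indices(n, seed=0):
    """Shuffle the indices 0..n-1 and split them 50% / 25% / 25%.

    Hypothetical helper: the rounding rule (integer division, with any
    remainder going to the test set) is an assumption of this sketch.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n // 2
    n_valid = n // 4
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case; ties count half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative usage with made-up labels and scores (not data from the study).
train, valid, test = split_indices(100, seed=1)
metrics = confusion_metrics([1, 1, 0, 0], [1, 0, 0, 0])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a workflow of this kind, the training set would be used to fit each model, the validation set to tune it, and the test set only for the final comparison of the four measures.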