PresenceAbsence : An R Package for Presence Absence Analysis

The PresenceAbsence package for R provides a set of functions useful when evaluating the results of presence-absence analysis, for example, models of species distribution or the analysis of diagnostic tests. The package provides a toolkit for selecting the optimal threshold for translating a probability surface into presence-absence maps speciﬁcally tailored to their intended use. The package includes functions for calculating threshold dependent measures such as confusion matrices, percent correctly classiﬁed (PCC), sensitivity, speciﬁcity, and Kappa, and produces plots of each measure as the threshold is varied. It also includes functions to plot the Receiver Operator Characteristic (ROC) curve and calculates the associated area under the curve (AUC), a threshold independent measure of model quality. Finally, the package computes optimal thresholds by multiple criteria, and plots these optimized thresholds on the graphs.


Introduction
Binary classification is a technique crucial to multiple areas of study.In the medical field these techniques are used to evaluate diagnostic tests (Greiner et al. 2000;Cantor et al. 1999) and in radiology (Hanley and McNeil 1982).Ecological applications include predicting and mapping species distributions (Guisan and Thuiller 2005;Moisen et al. 2006;Anderson et al. 2003), conservation planning (Wilson et al. 2004;Fielding and Bell 1997), remote sensing (Congalton 1991), habitat modeling (Hirzel et al. 2006;Pearce and Ferrier 2000;Reineking and Schröder 2003), and wildlife biology (Manel et al. 1999;Guisan and Hofer 2003).
Many modeling techniques for binary classification problems result in predictions that are analogous to a probability of presence.To translate this to a simple 0/1 classification it is necessary to specify a choice of threshold, or cut-off probability beyond which something is classified as present.The selection of this threshold value can have dramatic effects on model accuracy and predicted prevalence (the overall proportion of locations where the variable is predicted to be present).The traditional default is to simply use a threshold of 0.5 as the cutoff, but this does not necessarily preserve the prevalence or result in the highest prediction accuracy, especially for data sets with very high or very low prevalence.
The PresenceAbsence package (Freeman 2007) in R (R Development Core Team 2007) is a collection of tools (both analytical and graphical) for evaluating the performance of binary classification models and determining the optimum threshold for translating a probability surface to a presence-absence map.The PresenceAbsence package includes functions for calculating threshold dependent accuracy measures such as confusion matrices, percent correctly classified (PCC), sensitivity, specificity, and Kappa, and produces several types of graphical plots illustrating the effect on each measure as the threshold is varied.The package calculates optimal thresholds by multiple criteria, and plots these optimized thresholds on the graphs.It also includes functions to plot the Receiver Operator Characteristic (ROC) curve and calculates the associated area under the curve (AUC), a threshold independent measure of model quality (Hanley and McNeil 1982;Swets 1988).The package produces calibration plots to assess the accuracy of the predicted probabilities.Finally, it includes a function that uses beta distributions to produce simulated presence-absence data.
There are stand alone software applications for evaluating binary classification models.Examples range from full featured ROC analysis tools such as ROC AUC (Schröder 2006), to general purpose model evaluation tools such as PERF (Caruana 2004), to the Web-based ROC calculators ROCOn (Farrand and Langford 2002) and jrocfit.org(Eng 2006).In R, it is possible to calculate the AUC from the Wilcox test provided in the stats package, part of a standard installation of R.There are also several R packages available that will produce ROC plots and calculate performance measures such as the AUC.These include the packages ROCR (Sing et al. 2005) and the ROC (Carey and Redestig 2007), as well as the roc.area and roc.plot functions from verification package (NCAR -Research Application Program 2007), the colAUC function from the caTools package (Tuszynski 2007), the ROC function from the Epi package (Carstensen et al. 2007), and the auROC function from the limma package (Smyth 2005).Some of these R packages will optimize the threshold choice by one or more techniques and plot this threshold on the ROC plot, or produce plots of sensitivity and specificity as a function of threshold.In S-PLUS (Insightful Corp. 2003), the roc package (Atkinson and Mahoney 2004) goes beyond the PresenceAbsence package by providing significance testing of correlated ROC curves.
The PresenceAbsence package, however, provides multiple graph types, model evaluation statistics, and threshold optimization criteria in a single integrated unit.It allows researchers to carry out a complete analysis of their model results, and immediately see the effects of their threshold selections on all aspects of the prediction accuracy.Thresholds can be optimized by 12 different techniques, and these optimized thresholds can be plotted automatically on the graphs.It also provides tools, both analytical and graphical, to examine the relationships between prevalence, model accuracy, and threshold selection.
If the model building is being carried out in R, then using the PresenceAbsence package (rather than stand alone ROC software) streamlines the model evaluation process.And if the final presentation of the results is created as a Sweave document, the entire analysis and document creation can be carried out in a single step, allowing for automatic updates of the figures for new data or slight changes in analysis.
In this paper, we give a demonstration of a basic exploratory analysis of a presence-absence dataset using this package.The PresenceAbsence package is available from the Comprehensive R Archive Network (http://CRAN.R-project.org/).

Dataset
This demonstration dataset is included in the PresenceAbsence package.It is the validation dataset from Moisen et al. (2006), and more detailed information can be found in this paper.
In brief, presence-absence data for 13 tree species (Table 1) were collected by the USDA Forest Service, Forest Inventory and Analysis Program in a six million ha study area predominately in the mountains of northern Utah.These species presence data were modeled as functions of satellite imagery and bioclimatic information using three different types of models.These models included generalized additive models implemented in R, a variant on classification and regression trees implemented in Rulequest's See5 package, and stochastic gradient boosting implemented in R, using the gbm package (Ridgeway 2006) (hereafter GAM, See5, and SGB, respectively).The demonstration data used here contains presence-absence predictions for the 13 tree species at 386 independent model validation sites.Specifically, the data set consists of species, observed presence-absence values (1 or 0), and the predicted probability of presence from each of the three modeling techniques.
After installing the PresenceAbsence package, load the dataset for this demo: R> library("PresenceAbsence") R> data("SPDATA") Begin by examining the data structure: R> head(SPDATA) Note that the functions rely on column order to identify the observed and predicted values, not on the column names.Therefore, as long as the columns are in the correct order, column names can be anything you choose.The only time the functions use the column names is for the default labels on graphs.
We will define a few variables for later use:

Initial exploration
The presence.absence.histfunction produces a bar plot of observed values as a function predicted probability.These plots make prevalence dramatically obvious.Many of the traditional accuracy measures (such as PCC, sensitivity and specificity) are highly dependent on species prevalence.Therefore, it is important to check prevalence early in an investigation.
Consider three models for a single species (Figure 1 It is immediately obvious that this is a low prevalence species.This is important to note as extra care must be used when analyzing species with very low or very high prevalence.Many of the traditional measures of accuracy, such as percent correctly classified (PCC), sensitivity, and specificity are highly dependent on prevalence.Even some of the threshold optimization criteria can be affected by species prevalence (Manel et al. 2001).
The immediate effect of low prevalence on this particular plot is that the zero bar is so much taller than the other bars that it is difficult to see any other details of the species distributions.Setting the argument truncate.tallest= TRUE will truncate the tallest bar to fit better in the plot region and allow examination of finer structures in the histogram.If truncate.tallest= TRUE and the tallest bar is more than twice the height of the other bars, it is truncated so that it is 20 percent taller than the next bar, the truncated bar is cross hatched, and the true count is printed at the top of the bar.Note though, that the truncate.tallestoption can visually obscure the true prevalence (Figure 2).Next, we will compare the graphs of 3 species with differing prevalence (in this case the truncate.tallestargument is set to FALSE to avoid obscuring the true prevalence, see Figure 3): R> par(oma = c(0, 0, 3, 0), mar = c(4, 4, 4, 1), mfrow = c(1, 3), cex = 1) R> mod <-2 R> for (sp in c("ACGR3", "PIEN", "POTR5")) { These plots are also a good visual indicator of model quality.A more accurate model (such as the See5 model for PICO) will have a distinctly double humped histogram.Choosing a threshold in the valley will divide the data cleanly into observed present and observed absent groups.A less accurate model (such as the See5 model for JUSC2) will have considerable overlap between observed present and observed absent, and no threshold exists that will cleanly divide the data into observe present and observed absent.The challenge here is to find a threshold that offers the best compromise.There are several methods available to optimize the choice of threshold, but this will be dealt with in a later section.

Predicted prevalence
If it is important that the model predicts the prevalence of a species correctly, you can choose a threshold where the predicted prevalence equals the prevalence threshold.The function predicted.prevalencecalculates a table showing the effect of threshold choice on the predicted prevalence of the models.
Begin with the standard cutoff of threshold = 0.5, for declaring a species present or absent: The standard deviations can be suppressed by setting st.dev=FALSE.Note that the standard deviations calculated by presence.absence.accuracyare only appropriate for examining each model individually.To do significance testing for differences between models it is necessary to take into account correlation.DeLong et al. (1988) examines this issue and describes a method for comparing the AUC from two or more correlated models.This method is not included in the PresenceAbsence package, but can be found in the S-PLUS roc library from the Mayo clinic (Atkinson and Mahoney 2004).See help(auc) for further details.
While the default for presence.absence.accuracy is to calculate accuracy measures for all models at threshold = 0.5, the threshold argument can be used to set other thresholds.When the table is being produced for multiple models, threshold must be either a single threshold used for all models, or it must be a vector of thresholds of the same length as the number of models in the table.In the later case, the first threshold will be used for the first model, the second threshold for the second row, and so on.On the other hand, when examining a single model (the which.modelargument can be used to specify the model), multiple thresholds can be calculated.This allows us to discover how the accuracy measures change as a function of threshold.Thus when producing a table for a single model, the threshold argument can be given as a single threshold between zero and one, a vector of thresholds between zero and one, or a positive integer representing the number of evenly spaced thresholds to calculate.In these tables, the AUC is constant for all thresholds.The AUC is the only one of these basic accuracy statistics that is threshold independent.Calculating the AUC for large datasets can be very memory intensive.Setting the argument find.auc= FALSE will turn off the AUC calculations, and increase the speed.

Graphing the error statistics as a function of threshold
Tables are useful for looking up specific values, but often it is easier to spot patterns visually in a graph.The PresenceAbsence package provides visual displays of error statistics.
In this dataset, JUSC2 and PICO have the same prevalence, but PICO has better model quality for all three models.This can be seen in these graphs in several ways.For PICO, Kappa reaches a higher maximum value, and Kappa stays at higher values for a greater range of possible thresholds.Also, the lines representing sensitivity and specificity cross at higher value, thus it is possible to choose a threshold that gives a high level of sensitivity without sacrificing specificity, and vice versa (Figure 5).

ROC plots and the AUC
ROC plots, with their associated AUC provide a threshold independent method of assessing model performance.ROC plots are produced by assessing the ratio of true positives to false positives for all possible thresholds.The AUC is equivalent to the chance that a randomly chosen plot with an observed value of present will have a predicted probability higher than that of a randomly chosen plot with an observed value of absent.
The plot from a good model will rise steeply to the upper left corner then level off quickly, resulting in an AUC near 1.0.A poor model (i.e. a model that is no better than random assignment) will have a ROC plot lying along the diagonal, with an AUC near 0.5 (Figure 6).

Calculating the optimal thresholds
It is possible to get rough ideas of good threshold choices by looking at the error graphs, and the accuracy and prevalence tables, however the PresenceAbsence library also includes the function optimal.thresholds to calculate tables of optimized thresholds, by a choice of 12 different criteria (Table 2).
By default, optimal.thresholdschecks 101 thresholds evenly spaced between zero and one, and thus finds the optimal threshold to 2 significant digits.Some of the optimization criteria (Sens=Spec, MaxSens+Spec, MaxKappa, MaxPCC, PredPrev=Obs, MinROCdist, Cost) can result in tied threshold values.In these cases the ties are averaged to find the optimal threshold.With criteria that are optimized by finding minimum or maximum values, the smoothing argument can also be employed to deliberately average the best thresholds.Set smoothing=10, for example, to find the average of the ten best thresholds for a given criterion.
If the sample size is large, the precision can be increased by setting the argument threshold to larger numbers.Instead of the default threshold=101, set it to threshold=1001, or threshold=10001.Note that the precision will still be limited by the sample size, as the upper limit on precision is set by the number of unique thresholds in the data set.
To find the theoretically 'best' thresholds, would require calculating every possible unique threshold (not necessarily evenly spaced).This is not included in the package, but if these were calculated, they could be fed into the function optimal.thresholdsvia the threshold argument.The twelve possible criteria for optimizing thresholds are summarized in (Table 2).The opt.methods argument is specified by a vector of criteria from this list.The methods can be given by number, for example opt.methods=1:10 or opt.methods=c (1,2,4).They can also be given by name, for example opt.methods = c("Default", "Sens=Spec", "MaxKappa").
The 12 criteria in more detail: Default -The default criterion of setting threshold = 0.5.
Sens=Spec -The second criterion for optimizing threshold choice is by finding the threshold where sensitivity equals specificity.In other words, find the threshold where positive observations are just likely to be wrong as negative observations.
MaxSens+Spec -The third criterion chooses the threshold that maximizes the sum of sensitivity and specificity.In other words, it is minimizing the mean of the error rate for positive observations and the error rate for negative observations.This is equivalent to maximizing (sensitivity + specificity − 1)), otherwise know as the Youden's index, or the True Skill Statistic.Note that while Youden's Index is independent of prevalence, Manel et al. (2001) found that using the sum of sensitivity and specificity to select a threshold does have an effect on the predicted prevalence, causing rare species to be given much lower thresholds than widespread species, as a result, the distribution of rare species can be over predicted.
MaxKappa -The forth criterion for optimizing the threshold choice finds the threshold that gives the maximum value of Kappa.Kappa makes full use of the information in the confusion matrix to asses the improvement over chance prediction.
MaxPCC -The fifth criterion is to maximize the total accuracy PCC.A warning: while it may seem like maximizing total accuracy would be the obvious goal, there are many problems with using PCC to assess model accuracy.For example, with species with very low prevalence, it is possible to maximize PCC simply by declaring the species absent at all locations.This is not a very useful prediction.
PredPrev=Obs -The sixth criterion is to find the threshold where the predicted prevalence is equal to the observed prevalence.This is a useful method when preserving prevalence is of prime importance.If you have observed prevalence data from another source, you can use the argument obs.prev to set the observed prevalence.Otherwise, obs.prev defaults to the observed prevalence from DATA.
ObsPrev -The seventh criterion is an even simpler variation, where you simply set the threshold to the Observed prevalence.It is nearly as good at preserving prevalence as method 6 and requires no computation.Again, if you have observed prevalence data from another source, you can use the argument obs.prev to set the observed prevalence.Otherwise, obs.prev defaults to the observed prevalence from DATA.
MeanProb -The eighth criterion sets the threshold to the mean probability of occurrence from the model results.
MinROCdist -The ninth criterion is to find the threshold that minimizes the distance between the ROC plot and the upper left corner of the unit square.
ReqSens -The tenth criterion allows the user to set a required sensitivity, and then finds the highest threshold (i.e. with the highest possible specificity) that still meets the required sensitivity.The user can decide that the model must miss no more than, for example 15% of the plots where the species is observed to be present.Therefore they require a sensitivity of at least 0.85.This will then find the threshold that has the highest possible specificity while still meeting the required sensitivity.This may be useful if, for example, the goal is to define a management area for a rare species, and one wants to be certain that the management area does not leave populations unprotected.
ReqSpec -The eleventh criterion allows the user to set a required specificity, and then finds the lowest threshold (i.e. with the highest possible sensitivity) that still meets the required specificity.The user can decide that the model must miss no more than, for example 15% of the plots where the species is observed to be absent.Therefore they require a specificity of at least 0.85.This will then find the threshold that has the highest possible sensitivity while still meeting the required specificity.This may be useful if, for example, the goal is to determine if a species is threatened, and one wants to be certain not to over inflate the population by over declaring true absences as predicted presences.
Note: For the ReqSens and ReqSpec criteria, the required sensitivity (or specificity) can be changed to meet specific user requirements by the use of the req.sens and req.spec arguments.If unspecified, the defaults are req.sens= 0.85 and req.spec = 0.85.Be aware, though, that if the model is poor, and the requirement is too strict, it is possible that the only way to meet it will be by declaring every single plot to be present (for ReqSens) or absent (for ReqSpec), which is not a very useful method of prediction.
Cost -The twelfth criterion balances the relative costs of false positive predictions and false negative predictions, as described by Fielding and Bell (1997).A slope is calculated as (FPC /FNC ) ((1 − prevalence) /prevalence), where FPC is false positive costs and FNC is false negative costs.To determine the threshold, a line of this slope is moved from the top left of the ROC plot, till it first touches the ROC curve.If the arguments for false positive costs FPC and FNC are missing, the function will assume costs are equal.
Once again, if you have observed prevalence data from another source, you can use the argument obs.prev to set the observed prevalence.Otherwise, obs.prev defaults to the observed prevalence from DATA.
The criterion Cost can also be used for C/B ratio analysis of diagnostic tests.In this case FPC is equivalent to C (the net costs of treating non-diseased individuals) and FNC is equivalent to B (the net benefits of treating diseased individuals).For further information on the Cost criterion see Wilson et al. (2004) and Cantor et al. (1999).

Adding optimal thresholds to graphs
The optimized thresholds can be added to the histogram plots, the error statistic plots, and to the ROC plots.To add optimal thresholds set the argument opt.thresholds=TRUE.(Note: If the argument opt.methods is supplied, opt.thresholds will default to TRUE, otherwise opt.thresholds defaults to FALSE.)When the argument add.opt.legend = TRUE a legend is added giving the symbols used for each threshold criterion, as well as the value of each optimized threshold.In this case, the legend is giving the optimized thresholds for the SGB model for PICO.quality models, notice how sensitivity and specificity cross at a higher value, and also, the error statistics show good values across a broader range of thresholds.
Note that while the symbols in the optimal threshold legend are correct for all 6 plots, the actual thresholds given in the legend are specific to that particular plot (in this case, species PICO, model SGB).In this particular case, if both legends had been included on all 6 plots, there would not be room to see the plots themselves.

See5
Threshold Accuracy Measures q q sensitivity specificity Kappa 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 SGB Threshold Accuracy Measures q q q q 0.50 Default 0.10 Sens=Spec 0.20 ReqSens 0.38 MaxKappa   Predicted Probability of Occurrence Observed Occurrence as Proportion of Sites Surveyed In the calibration plots provided in the PresenceAbsence package, the locations are grouped into bins based on their predicted values, and then the ratio of plots with observed values of present verses the total number of plots in that bin is calculated.The confidence interval for each bin is also included, and the total number of plots is labeled above each the bin.Here, we produce calibration plots for JUSC2 and PICO (Figure 13):

Observed vs. Predicted
Predicted Probability of Occurrence observed as proportion of bin q q q q q q 257 76 27 1111

Summary plots
Finally, the function presence.absence.summaryproduces a useful summary page of the four types of presence-absence plots, along with optimal thresholds and AUC value (Figure 14):

Conclusion
We have demonstrated the use of the PresenceAbsence package in a typical species mapping scenario.We have explored the model output both graphically and analytically, and examined the effects of optimizing the cut-off threshold by several different criteria.This analysis and decision making process would be similar in other binary classification datasets.We hope we have shown that the PresenceAbsence package provides a useful toolkit for map makers and map users when evaluating the probability surface output from a presence-absence model, and for translating the probability surface into species distribution maps for use in the field.

Figure 1 :
Figure 1: Presence-Absence histogram for three models and one species.

Figure 2 :
Figure 2: Presence-Absence histogram for three models and one species, with tallest bar truncated.

Figure 3 :
Figure 3: Presence-Absence histogram for the See5 model for three species with increasing prevalence.

Figure 4 :
Figure 4: Presence-Absence histogram for the See5 model for three species with similar prevalence but increasing model quality.

Figure 5 :
Figure5: Graph of error measures (sensitivity, specificity, and Kappa) as a function of threshold for two species with similar prevalence but differing model quality.

Figure 6 :
Figure 6: ROC plot for two species with similar model quality but differing prevalence.

Figure 7 :
Figure 7: ROC plot for two species with similar model quality but differing prevalence, with evenly spaced thresholds marked along the ROC curves.

Figure 8 :
Figure 8: ROC plot for three species with similar prevalence but increasing model quality.

Figure 9 :
Figure 9: Histogram plot for three species with similar prevalence but increasing model quality, with thresholds optimized by four criteria.

Figure 10 :
Figure 10: Error as a function of threshold for two species with similar prevalence, but differing model quality, with optimized thresholds marked along plots

Figure 11 :
Figure11: Error as a function of threshold for three species similar model quality but with increasing prevalence, with optimized thresholds marked on plots.

Figure 13 :
Figure 13: Calibration plot for two species with similar prevalence but different model quality.

Table 1 :
Species codes for the 13 tree species.
Column 1 is for the row ID's.In this case, The ID column is made up by the species code of each observation.Column 2 is the observed values where 0 = absent and 1 = present.If the observed values had been actual values (for example, basal area or tree counts) any values other than zero would have been treated by the functions as present.The remaining columns are for model predictions.Preferably these will be predicted probabilities ranging from zero to one.If all that is available are predicted presence-absence values, basic accuracy statistics can still be calculated, but you loose the ability to examine the effect of threshold choice, and most of the graphs are not meaningful.In this dataset, these are the model predictions from the three types of models: GAM, See5, and SGB.
DeLong et al. (1988)oose a threshold where the predicted prevalence equals the observed prevalence.To calculate the prevalence at particular thresholds, the threshold argument can be set to the values of interest, for example threshold = c(0.23,0.6,0.97)Thefunctionpresence.absence.accuracycalculatesbasicaccuracy measures (PCC, sensitivity, specificity, Kappa, and AUC) for presence-absence data, and (optionally) the associated standard deviations.The PresenceAbsence package uses the method fromDeLong et al. (1988)to calculate AUC and its associated standard deviation.By default the function will calculate all basic accuracy measures at the threshold of 0.5: We can look even closer at individual species to examine the effect of threshold choice on prevalence:R> sp <-1 R> DATA <-SPDATA[SPDATA$SPECIES == species[1], ]R> pred.prev<-predicted.prevalence(DATA=DATA, threshold = 11) R> pred.prev[,2:5]<-round(pred.prev[,2:5],digits = 2) R> print(paste("Species:", species[sp]))Setting threshold = 11 causes the function to calculate 11 evenly spaced thresholds between zero and one.If it is important that the model predicts the prevalence of a species correctly, you can

Table 2 :
Possible optimization criteria In this second example, the arguments req.sens, req.spec,FPC and FNC are used to supply these values: