Prevalence dependence in model goodness measures with special emphasis on true skill statistics

Abstract It has long been a concern that performance measures of species distribution models react to attributes of the modeled entity arising from the input data structure rather than to model performance. Thus, the study of Allouche et al. (Journal of Applied Ecology, 43, 1223, 2006) identifying the true skill statistics (TSS) as being independent of prevalence had a great impact. However, empirical experience questioned the validity of the statement. We searched for technical reasons behind these observations. We explored possible sources of prevalence dependence in TSS including sampling constraints and species characteristics, which influence the calculation of TSS. We also examined whether the widespread solution of using the maximum of TSS for comparison among species introduces a prevalence effect. We found that the design of Allouche et al. (Journal of Applied Ecology, 43, 1223, 2006) was flawed, but TSS is indeed independent of prevalence if model predictions are binary and under the strict set of assumptions methodological studies usually apply. However, if we take realistic sources of prevalence dependence, effects appear even in binary calculations. Furthermore, in the widespread approach of using maximum TSS for continuous predictions, the use of the maximum alone induces prevalence dependence for small, but realistic samples. Thus, prevalence differences need to be taken into account when model comparisons are carried out based on discrimination capacity. The sources we identified can serve as a checklist to safely control comparisons, so that true discrimination capacity is compared as opposed to artefacts arising from data structure, species characteristics, or the calculation of the comparison measure (here TSS).

Prevalence of different species may differ for two basic reasons: either sampling points are fixed, but different species occur with different frequency, or presence information of species is independent because of a presence-only collection scheme, which is often true for datasets originating from museum collections (Elith & Leathwick, 2007). It is difficult to imagine a project with real data, where each species has the same prevalence unless common species are resampled to low prevalence. The latter would however mean information reduction, which would be unnecessary if measures would not depend on prevalence.

| A critique to the design of Allouche et al. (2006)
The true skill statistics is defined based on the components of the standard confusion matrix representing matches and mismatches between observations and predictions (Fielding & Bell, 1997; Table 1).

$$\mathrm{TSS} = \mathrm{TPR} + \mathrm{TNR} - 1$$

where TPR is referred to in the literature as the true-positive rate or sensitivity, and TNR as the true-negative rate or specificity (Fielding & Bell, 1997).

As Allouche et al. (2006) did not appropriately prove that TSS is independent of prevalence and empirical experience indicates such an effect, there is a need to revisit prevalence dependence in TSS.
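As a minimal sketch of the binary case, the snippet below computes TSS from confusion-matrix counts using the standard definition (sensitivity plus specificity minus one) and rebuilds the matrix at several prevalence values while holding sensitivity and specificity fixed. The function names, the sample size of 1,000, and the rates 0.8 and 0.9 are illustrative assumptions, not values from the study. Holding both rates fixed is exactly the strict assumption under which TSS cannot respond to prevalence.

```python
def tss_from_counts(tp, fn, fp, tn):
    """True skill statistic from confusion-matrix counts: TPR + TNR - 1."""
    tpr = tp / (tp + fn)  # sensitivity: share of presences predicted present
    tnr = tn / (tn + fp)  # specificity: share of absences predicted absent
    return tpr + tnr - 1.0


def confusion_counts(n, prevalence, tpr, tnr):
    """Hypothetical helper: counts (tp, fn, fp, tn) for a binary model whose
    sensitivity and specificity are held fixed regardless of prevalence."""
    presences = round(n * prevalence)
    absences = n - presences
    tp = round(presences * tpr)
    tn = round(absences * tnr)
    return tp, presences - tp, absences - tn, tn


# Illustrative values only: same model quality, three different prevalences.
for prevalence in (0.1, 0.3, 0.5):
    counts = confusion_counts(1000, prevalence, tpr=0.8, tnr=0.9)
    print(prevalence, counts, round(tss_from_counts(*counts), 3))
```

The row totals of the matrix change with prevalence, but TSS depends only on the two within-row proportions, so it stays the same in this idealized setting. Any prevalence effect therefore has to enter through a change in those proportions or through the way TSS is calculated.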
We take the strategy of proceeding from simple cases toward complex ones. We assume that if prevalence dependence appears in a simple case, it is unlikely that it disappears in the corresponding more complex cases.
In case 1), we assume that the observed pattern coincides with the suitability. In such a case, the contingency   where x denotes the predicted value.

| The case of continuous predictions
If we have continuous probabilities as prediction, the equations are as follows: where x_c refers to the cutoff value corresponding to maximum TSS.
Predicted probability values were randomly chosen from the beta distribution with parameters given in Table 7, representing
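To make the role of the cutoff concrete, the following sketch scans every observed prediction value as a candidate cutoff and keeps the one that gives the highest TSS. It is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, a site is taken as predicted present when its value is at least the cutoff, and the four-site data set is invented for the example.

```python
def max_tss_with_cutoff(scores, labels):
    """Return (maximum TSS, cutoff) over all observed score values.

    Assumed convention: a site is predicted present if score >= cutoff.
    labels holds 1 for observed presence and 0 for observed absence.
    The cutoff is None if no candidate yields a TSS above zero.
    """
    n_pres = sum(labels)
    n_abs = len(labels) - n_pres
    best_tss, best_cutoff = 0.0, None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        tss = tp / n_pres + tn / n_abs - 1.0
        if tss > best_tss:  # strict comparison keeps the lowest best cutoff
            best_tss, best_cutoff = tss, cutoff
    return best_tss, best_cutoff


# Invented toy data: two absences followed by two presences.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(max_tss_with_cutoff(scores, labels))
```

Because the cutoff is chosen after seeing the data, the reported maximum is an optimized quantity. With few presences there are only a few attainable sensitivity levels, which is where a small-sample effect can enter.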

| Results
We found a response to prevalence changes in the maximum value of TSS for small sample sizes (Figures 2 and 3), which however decreased with an increase in sample size and approached the theoretically expected value. A sample size of 10,000 eliminated any TSS bias even for the worst model and even with the lowest prevalence, corresponding to 500 presences. A sample size of 1,000 with 50 presences showed

The f_0 and f_1 functions used in our simulations are specific cases of the beta distribution if α = 1 or β = 1. The table shows the corresponding other parameter of the beta distribution producing the probability function of selecting a certain probability value for presence observations. Selections for absence observations follow the opposite trend. The rbeta function in R generates random numbers with such distributions (Appendix S1).

There is abundant evidence against species closely following suitability patterns, including metapopulation theory (Hanski, 1991), extinction debt (Tilman, May, Lehman, & Nowak, 1994), and other considerations (Gu & Swihart, 2004; Gallardo & Albridge, 2013; Baross et al., 2015). It is also one of the default measures in BIOMOD, one of the most widespread SDM tools, and is also propagated in reviews (Liu, Newell, & White, 2016). While users of max TSS still assume that they use a prevalence-independent measure, we observed differences as large as almost 0.2 in the average maximum TSS due to differences in prevalence only, even in "good models" at the lowest sample size. Differences in maximum TSS as small as 0.001 and 0.06 have been interpreted as the model with the higher TSS being superior to the one with the lower maximum value (Coetzee et al., 2009 and Zurell et al., 2012, respectively). Therefore, the level of influence of prevalence detected for low sample sizes has a message for practice, too.
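The small-sample behavior of maximum TSS can be explored with a short simulation. The sketch below is a Python analogue of this kind of experiment, not a reproduction of Appendix S1. It assumes presences are drawn from Beta(2, 1) and absences from Beta(1, 2); the shape value 2, the repetition counts, and the function names are illustrative choices and are not taken from Table 7. Under these assumed distributions the theoretical TSS at cutoff c is 2c(1 - c), so the expected maximum is 0.5 at c = 0.5.

```python
import random


def max_tss(scores, labels):
    """Maximum TSS over all cutoffs, using one pass over sorted scores.

    Uses the identity TPR + TNR - 1 = TPR - FPR, where FPR = FP / absences.
    """
    n_pres = sum(labels)
    n_abs = len(labels) - n_pres
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    best = 0.0  # a cutoff above every score predicts all absent: TSS = 0
    i = 0
    while i < len(pairs):
        cutoff = pairs[i][0]
        # Move all sites tied at this score to "predicted present" together.
        while i < len(pairs) and pairs[i][0] == cutoff:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, tp / n_pres - fp / n_abs)
    return best


def mean_max_tss(n, prevalence, reps, rng, shape=2.0):
    """Average maximum TSS over repeated samples (shape is an assumed value)."""
    n_pres = max(1, round(n * prevalence))
    n_abs = n - n_pres
    labels = [1] * n_pres + [0] * n_abs
    total = 0.0
    for _ in range(reps):
        scores = [rng.betavariate(shape, 1.0) for _ in range(n_pres)]
        scores += [rng.betavariate(1.0, shape) for _ in range(n_abs)]
        total += max_tss(scores, labels)
    return total / reps


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
for n in (100, 1000, 10000):
    reps = max(10, 20000 // n)
    for prevalence in (0.05, 0.5):
        print(n, prevalence, round(mean_max_tss(n, prevalence, reps, rng), 3))
```

Because the maximum is taken over cutoffs chosen on the same sample, its expectation cannot fall below the theoretical value, and the excess should shrink as the number of presences grows. Comparing the printed averages across prevalence values at a fixed sample size gives a rough picture of how much of a difference in maximum TSS could come from prevalence alone.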