Selecting site characteristics at different spatial and thematic scales for shrubby cinquefoil (Potentilla fruticosa L.) distribution mapping

Abstract The largest natural population of shrubby cinquefoil (Potentilla fruticosa) in the Baltic States was observed in the field to reveal the scale-dependent explanatory value of site characteristics for subsequent spatial distribution modelling of the species. About 700 km was crossed during field observations in 2008–2014. Thinning of the raw field records to ensure a distance of at least 50 metres between each point yielded 1459 presences and 7327 absences. These occurrence data were related to present and historical land cover, soil, elevation, human population density, the proportion of presence sites, and P. fruticosa mean coverage in the neighbourhood. Boosted classification tree models were used to compare the value of 60 individual site features at thematically and spatially different levels of generalization as indicators of the species’ presence or absence. P. fruticosa presence is significantly non-random regarding most of the studied site features but only a few of these are valuable predictors. The proportion of presences in the neighbourhood had the highest indicative value. P. fruticosa occurrence also coincides with moist thin calcareous soils according to the soil map, with larger scrubland patches according to the topographical database, and with tussock areas according to a topographical map from the 1930s. The explanatory value of nominal site characteristics primarily drops when the most indicative category is merged with other classes to form a more general category. Site characteristics calculated at the observation point are not always the most effective predictors for P. fruticosa occurrence – features of the neighbourhood are related to the occurrence as well. The study area was classified into: confirmed absence area, unclear presence/absence area and probable presence area. Subsequent distribution modelling in the unclear area should be targeted on a species presence/absence, while abundance could be the priority within the probable presence area.


Introduction
The general approaches for species and habitat distribution mapping can be outlined as 1) direct fi eld observations, 2) remote detection by remote sensing and telemetry techniques, or 3) indirect estimation by experts, using indicators from statistical models or from similarity-based reasoning. Direct observation by an expert can be reliable, but covering any larger area by direct observation is labour-intensive. Therefore, indirect estimations are inevitably required to create a detailed distribution map for a larger area. Site characteristics related to the occurrence of the target species and its potential habitat, are used to estimate the likely absence, presence or abundance of the species.
For the application of numerical methods, detailed formal elementary features (e.g. site elevation above sea level or land cover/land use type according to a given classifi cation) are extracted from data layers representing environmental conditions in the study area. Among these, neighbourhood effects, i.e. features representing the site neighbourhood (e.g. proportion of suitable habitat or the existence of earlier presence records of the species within a certain radius), have confi rmed to be equally useful as the focal feature values (Mack & Harper, 1977;Guisan & Thuiller, 2005;Latimer et al., 2006).
The selection of the appropriate spatial and thematic resolution is one of the central issues in land cover and habitat mapping using remote sensing data (Ju et al., 2005) and also in landscape ecology (Turner et al., 1989;Wiens, 1989) and in distribution mapping. The predictive ability of spatial models depends on scale, as species expectedly have characteristic scales of response to their environment (Thuiller et al., 2003;Holland et al., 2004;Graf et al., 2005;Boscolo & Metzger, 2009). Vale et al. (2014) found that fi ne resolution models are more accurate at selecting marginal habitats, and distribution models with coarse resolution tended to overestimate species distribution at the edge of distribution area. However, the limited availability of high resolution data precludes its frequent use. As a rule, abiotic variables (e.g. climate) tend to be more important in continental or global models, while variables representing habitat and biotic interactions can be more important at fi ner spatial scales (Boulangeat et al., 2012).
Scale has spatial and thematic aspects. Spatial scale is described by grain size and spatial extent (O'Neill et al., 1986;Wiens, 1989). Grain is the resolution or minimum mapping unit of the data, while extent is the size of a mapped area. In addition, local extent has been defi ned as the radius or area of a kernel around each focal point (Thompson & McGarigal, 2002), whereas distance-dependent neighbourhood effects are related to scale by the neighbourhood extent. Thematic resolution is often limited by the available datasets. In the case of a nominal variable, the thematic resolution refers to the level of detail in the categories (categorical scale); in the case of a continuous variable, the thematic scale is expressed by measurement precision. The reliability of spatial predictions affected by thematic resolution has been paid much less attention than the effect of the spatial scale (Liang et al., 2013;Zhou et al., 2014).
By combining different data layers: their spatial scale and thematic generalization, kernels covering different extent from the surrounding area, and statistics calculated locally from these kernels, the number of possible numerical features for any geographical location approaches infi nity. However, it is not reasonable to include excessive number of features to prediction models because since irrelevant explanatory variables add noise to the predictions and increase the risk of model over-fi tting (Remm, 2004;Dormann, 2011;Ficetola et al., 2014), and also because the pre-processing of each characteristic to a numerical format is a time-consuming task. It is not always easy to identify a priori how many (and which) site features are actually relevant and should be used in predictive models.
In this experiment, comprehensive fi eld observation data of shrubby cinquefoil Potentilla fruticosa were used to fi nd out which site characteristics in which spatial and thematic scale are the best predictors of the species' presence and absence in the study region. These spatial features should be involved to distribution models in subsequent studies as explanatory factors, applied for creating detailed full-cover predictive distribution maps, and considered when planning protection measures for the species.

Study area
The study area covering 819 km² is located in north-western Estonia and is bounded by the Baltic Sea to the north. The capital of Estonia, Tallinn, lies on its eastern boundary ( Figure 1). Figure 1. Location of the study area (black rectangle). Joonis 1. Uurimisala (must ristkülik) paiknemine.
The study area contains the Vääna Land scape Reserve (4.09 km²), created to protect, inter alia, the largest natural population of P. fruticosa in the Baltic States. Similar alvar sites can be found elsewhere in western and northern Estonia, but the species does not grow there.

Shrubby cinquefoil habitat demands
Shrubby cinquefoil (Potentilla fruticosa L. syn. Dasiphora fruticosa (L.) Rydb.) (Rosaceae) is a perennial fl owering shrub mainly known as a decorative cultivar. Its natural populations are widespread in Asian mountains and in North America. Its distribution in Europe is sporadic (Elkington & Woodell, 1963). Shrubby cinquefoil is a protected plant in Estonia, where the only sustainable population is between Tallinn, Keila and Paldiski. Here, mainly on alvar grasslands situated on Middle and Upper Ordovician limestone, grows the largest natural population in the Baltic States.
P. fruticosa prefers open sites, although resists moderately dense scrub and can survive for decades under a young forest canopy. However, shade is considered the main limiting factor (Gorchakovskiy, 1960;Elkington & Woodell, 1963). Moderate grazing seems to be benefi cial for the species, suppressing potential competitors -lush herbs and bushes. The species' attitude to soil characteristics is not clear. According to current knowledge, at least northern European populations grow mainly in moist base-rich soil, although the plant is tolerant of slightly acid soils, drought and temporal fl ooding (Gorchakovskiy, 1960;Elkington & Woodell, 1963;Roland & Smith, 1969;Reier & Leht, 1999;Lonati et al., 2014).

Data and methods
The methodological framework of this study consists of the following stages. The relations between data pre-processing, data processing stages and intermediate results are given in the data fl ow diagram ( Figure 2 2) area, where the species occurrence is probable; 3) area where the species occurrence is unclear.

Field observations
Field observations were made when P. fruticosa was in bloom in the summers of 2008-2014 by conducting walking tours across the terrain. The sites for fi eld observations were dynamically planned according to the accumulating experience on P. fruticosa distribution and its habitat preferences. The survey was focused on sites that should be relatively suitable for the species, but where it had not been recorded. Locations characterized by site features excluding the species (e.g. currently cultivated fi elds and the south-west part of the study area) were paid less attention. Dynamic planning of observation tracks enabled to increase the species detection to the sampling effort ratio and to prevent collecting superfl uous amount of absence data, which is inevitable in case of regular and random sampling. The coordinates of P. fruticosa presence locations, the movement track, and selected typical absence locations were recorded using a Garmin Vista HC + GPS receiver. The total length of the observation tracks during 82 observation days was 700 km. The coordinates of a presence location were measured when standing on a shrubby cinquefoil bush; the species was not visible within about 20 m in actively selected typical absence locations. The relative area covered by the species at each presence site was recorded but not used in this study as a dependent variable.
The density of observations was not equal throughout the study area, since the areal proportion of habitat types, ground visibility, terrain roughness, restricted areas and probability of species occurrence were also considered when planning fi eld tours and moving across terrain. Private and industrial yards (43 km²), a gunnerypractice ground (7 km²), and the territory of a military air fi eld (9 km²) were the largest restricted parts of the study area. Private lands were generally not an obstacle for the fi eld survey, as it is permitted to access unrestricted private property in Estonia, unless the owner forbids it. Field movement was not restricted to tracks, since following roads and paths causes unequal likelihood of locations to be observed (sampling bias) (Kadmon et al., 2004;Albert et al., 2010). Moreover -intentional dodges were frequent when moving along a vehicle track.

Thinning
The raw fi eld data contained 58 842 track points automatically recorded by the GPS receiver and 5687 intentionally recorded observations (2854 presence and 2833 absence sites) in locations considered typical by the observer. A method for reducing the effect of uneven sampling, removing redundant points and evening the observation density is spatial thinning (Remm et al., 2009;Boria et al., 2014). The raw records were thinned using an online spatial data calculator (SDC) (Remm & Kelviste, 2014) to ensure at least 50 m distance between each accepted location in order to even out the density of observations, reduce the share of automatically recorded absences, and avoid spatially close records. Aiello-Lammens et al. (2015) published a thinning function spThin in R. Thinning in the SDC differs from it by: 1) enabling different source and target points, 2) enabling different types of input (including 1D, 2D and 3D), 3) by starting from the fi rst point in a list, not from the location with the greatest number of neighbouring occurrences.
The thinning in the SDC was a step-bystep process implemented in the following order: 1. Removal of intentionally recorded absence sites less than 50 m from presence sites. 2. Thinning of retained absence sites to ensure intervals of at least 50 m. 3. Removal of track points less than 50 m from the retained absence sites. 4. Removal of track points less than 50 m from the presence sites. 5. Thinning of retained absence points to ensure intervals of at least 50 m. 6. Thinning of presence sites to ensure intervals of at least 50 m. The retained intentionally observed and automatically recorded absence sites gave a total of 8317 absence exemplars, which, together with the 1480 retained presence sites (total = 9797), were used in the calculations ( Figure 3). The species cover estimations in the thinned locations are freely available as an archived dataset . The dataset advantages are the recorded absences; all observations made during the same season, during a relatively short period of years and predominantly by one person; data thinning reduced pseudoreplication and disproportion between the number of recorded presence and absence locations. The mean density of thinned records per terrestrial study area is 0.12 points per hectare (0.09 in forests and 0.08 on fi elds), while being higher in the typical P. fruticosa habitats: scrubland 0.44, natural grassland 0.27 and in open swampy ground 0.24 records per hectare.

Data layers
Areal categories from the Estonian National Topographic Database (ENTD) were used to describe present land cover and land use. The spatial resolution of this database corresponds to a 1: 10 000 map; data were from the year 2013. The areal categories used at different levels of thematic generalization are listed in the Supplement (Remm, 2015). The most detailed level includes 48 categories, the medium level, 15 and the most generalized level, 6 categories.
In addition to the present land cover, the main areal categories from a 1: 50 000 topographical map surveyed in the second half of the 1930s were included, since the present P. fruticosa distribution is presumably related to previous land use. This historical map is more generalized than the ENTD data; land cover and land use is fl exibly depicted by different combinations of topographical symbols, which were classifi ed into 18 categories at the most detailed level. These categories were digitized from scanned map sheets as vector polygons. The meaning of categories included from the historical topographical map and their number does not correspond to the classifi cation used in the ENTD, since the scale and mapping principles of these data sources do not match. E.g. the historical map includes a special symbol for tussocks, used either as a separate areal category or in combination with grassland, marshland or shrub symbols. Unfortunately, tussock areas are not represented by any single ENTD land cover category nor by a combination of categories.
Soil data were obtained from the Estonian 1: 10 000 soil map and land elevation from digital elevation models derived from detailed LiDAR measured raw data. The soil types with their names according the original map and the closest World Reference Base for Soil Resources (WRB) taxonomic unit are listed in the Supplement (Remm, 2015), since categories used in the Estonian soil map are not directly transferable to the WRB soil system. The above-mentioned land cover, soil and elevation data were obtained from the Estonian Land Board, and human population data from Statistics Estonia (Ministry of Finance). The human population data layer had a 1000 m grid interval, the other layers were rasterized to 10 m grids.

Site features
The recorded sites were described using 60 site characteristics (features) calculated from data layers representing land cover, soil and terrain properties, human population density and also P. fruticosa observation results from the vicinity (Table 1). A site feature was defi ned by: 1) the data layer and its thematic detail (resolution), 2) spatial scale, as the radius limiting the inclusion extent of the values, 3) a statistic calculated from the values within the radius in the layer.
The values of the site features at the thinned locations were calculated using the SDC. Radii of 0 (local value), 100, 200, 500 and 1000 m were used to derive features corresponding to different spatial scales, except for the human density layer, which has an original grid interval of 1 km. The statistics calculated within a given radius were the mean of values in a numerical layer and the most frequent (modal) category in the case of a nominal data layer. For local features, the closest value was read from the data layer. Land elevation was included as the mean elevation and as the relative elevation compared to the mean in a given radius. Different thematic resolutions were represented by merging initial detailed categories into a smaller number of more generalized units, as given in Remm (2015).
Selecting site characteristics at different spatial and thematic scales for shrubby cinquefoil (Potentilla fruticosa L.) distribution mapping

Presence locations / leiukohad
Absence locations / puudumiskohad per number of records applied within different radii, instead of the raw number of presences, was used to reduce the effect of variable survey intensity. The focal record was not included when calculating the proportion of presences in the neighbourhood of an observation site. The use of proportions instead of the number of presences minimizes the effect of different observation density

Estimating predictive values of features
The explanatory value of site features was compared by: 1) looking for feature values that directly indicate P. fruticosa presence or absence, 2) by the correctness of predictions using this single variable ( Figure 2). Feature values excluding species presence or excluding absence were tested for both categorical and numerical features. For categorical features, this is simple examination of presence and absence frequency in the categories. In the case of numerical features, value intervals including at least 1% of observations (100 cases) and containing no recorded presences were considered as excluding species presence. The opposite is an absence excluding interval, which should include ≥ 1% of cases and no absences. For fi nding the excluding intervals, a gradual search algorithm is available in the SDC (Regions of frequency → Code).
The algorithm works as follows: 1. Sort cases according to values. 2. Mark the fi rst case in each group of equal values. 3. Check correspondence to pre-determined proportion and frequency crite- ria while extending the value interval starting from a fi rst case. Groups of equal values are always included together. When looking for a simple but effective and universal model applicable for a bivariate dependent variable, where the frequency of categories (presence/absence) is unequal, using single numerical and categorical predictors, the boosted classifi cation tree (BCT) method was preferred since it is highly resistant to over fi tting (since subsampling is included to model calibration) and has confi rmed to be among the most effective prediction methods (Elith et al., 2006;Zurell et al., 2009). The predicted presence or absence was calculated from a BCT model per each feature at each spatial and thematic scale. Interactions of site characteristics were not studied, as the aim of this investigation was to compare the explanatory value of features one by one.
The modelling results were compared in three aspects: 1) goodness-of-fi t between observed and model-predicted presences and absences, 2) the proportion of true positive predictions among model-predicted presences, and 3) statistical signifi cance of the difference in feature values between P. fruticosa presence and absence sites. The True Skill Statistic (TSS) -also called Hanssen-Kuipers Skill Score (Hanssen & Kuipers, 1965) -was used as the objective function to compare the predictive ability of the BCT models. The TSS value is calculated as the proportion of true positive cases plus the proportion of true negative cases minus one. The TSS statistic is preferable as it does not depend on the proportion of categories, and integrates correct prediction of both presences and absences (McPherson et al., 2004).
Another statistic for the comparison of site features was indicator precision or positive predictive value (PPV), which is the proportion of true positives among all positive predictions. The PPV enables both feature-level estimates and the comparison of single categories of a nominal feature. In the case of a single category, PPV is the proportion of the presence sites among observed sites characterized by this particular presence-indicating category.
The statistical signifi cance of the predictors was estimated using the SDC, by comparing the frequency distribution of explanatory categories in presence sites and absence sites using the χ² test, and by comparing mean values of numerical variables using the Mann-Whitney U test.

Predictive values of site features
Although all the 60 explanatory variables were statistically signifi cantly (p < 0.001) related to the occurrence of P. fruticosa, as the number of observations is large and species occurrence does not correspond randomly to different feature values, most of them were found to be weak predictors (Table 2). P. fruticosa presence is characterized fi rst of all by a higher density of presences in the vicinity, especially within the nearest 100 m (TSS = 0.85, PPV = 0.70). The best single categories of land cover and soil, which predict the species presence sites correctly in most cases (PPV ≥ 0.5) if they are the modal category within the given radius, were: 1) land cover type scrubland (indicative in all radii, the highest PPV = 0.88 if radius = 200 m), 2) tussock surface according to the historical map (all radii, PPV = 0.85 if combined with grassland) and 3) relatively thin gleyic soil laying on limestone bedrock in the study area: (Gleyic Rendzic Leptosol Gh, PPV = 0.53 locally and Endogleyic Leptosol Khg, PPV = 0.52 if radius = 1000 m) ( Table 3).
The area where at least one of these four predictors is favourable covers 38.4 km². However, a single favourable predictor may be occasional, so, for this species, it is preferable to delineate the favourable area by the presence of at least two favourable conditions. The area where at least two of  these four predictors are favourable, covers 1.4% (11.4 km²) of the terrestrial study area and has a density of fi nd sites of 92.4% of all observed sites. Alternative categories of the same features cover larger parts of the study area but are weak predictors. E.g. modal land cover category "scrubland" coincides with P. fruticosa presences to a great extent but the predictive value of land cover data layer, if all categories are included, was weak at all spatial and thematic scales (TSS ≤ 0.40) ( Table 2). The binary variable Dominating historical land cover within 1000 m is tussock area is even more specifi c, having a high PPV = 0.83, but has a low TSS = 0.016. That means P. fruticosa can likely be found in large alvar patches mapped as tussock areas, but the feature is practically useless as an explanatory variable in all other regions of the study area.
There are also some less confi dent numerical predictors. P. fruticosa presence was recorded most often at altitudes of 19-35 m asl and in places where other presence sites occur within 1 km ( Figure  4). More than a half (54%) of all recorded presences are located in square kilometres with no permanent habitants.

Excluding values
No landscape feature defi nes P. fruticosa presence for certain (excludes absence), that is, the species is not constantly present at any values of site features. The absence excluding feature values are possible only in case of an extremely abundant species since whatever current or previous limiting factor can exclude the species despite optimal conditions concerning other site features.
Regardless of the large number of observations, species absence is also rarely confi rmed by map categories. Only a few of them meet the >1% of observations and no presences condition. P. fruticosa was never found where water is the most common land cover category within 100 mit avoids sea coasts and does not occur at lake shores. More frequent sandy soil types were also found to exclude P. fruticosa. As site features, these are Umbric Gleysol (LkG) and Endogleyic Umbric Podzol (Lkg) in all radii and Haplic Podzol (L) in radii up to 200 m. These excluding categories cover 5.1% (41.3 km²) of the terrestrial study area ( Figure 5). There are other map categories that never coincided with P. fruticosa sites, but the number of observations matching these areal categories did not meet the 1% frequency criterion. E.g. a few observations are recorded as being located in the sea according to the maps, but this is because of coast line drift due to sand movement and sediment deposition. The species was predominantly not found in places where no presences were recorded in the neighbourhood but, unexpectedly, only absence records in the vicinity are not a fi rm absence indicator, since there are exclusions -fi ve solitary presence locations have only absence records within 1 km.
Among values of the numerical variables, 53 excluding intervals, often disrupted by occasional presences, were de- limited ( Figure 4). In general, P. fruticosa is absent in the lowest parts the study area (coastal plains) and in the highest altitude regions at the southern boundary of the area. Extreme values of relative elevation also exclude species presence. P. fruticosa was not found growing naturally in square kilometres where the human population exceeded 1575 inhabitants. The occurrence of P. fruticosa remains undetermined in most (93.5%) of the study area ( Figure 5) due to: 1) spatially close presence and absence records, 2) missing direct observations nearby or 3) low predictive value of the site features.

Spatial and thematic resolution
According to the TSS, thematically and spatially more detailed features are by and large more useful in predicting P. fruticosa occurrence than more generalized site characteristics (Table 4). Although this is not always the case, e.g. scrubland as the modal category has the highest predictive value when applied at a radius of 200 m, but not focally and not within 100 m (Figure 6). That means that species presence is statistically related to larger scrubland patches dominating in land cover within some hundreds of metres, rather than to single small groves.
Spatial generalization of the historical map increases the proportion of false negative predictions, as the PPV values tend to be higher at more generalized spatial and thematic scales (Table 4). This trend in mean values is mainly based on the historical map -more general versions of the map highlight the largest tussock areas, which coincide with the P. fruticosa populations, while most of the P. fruticosa sites remain unrecognized. Consequently,

K. Remm
Indicator precision as PPV

Radius [m]
Scrubland according to the ENTD / põõsastik ETAK järgi Tussock surface in the hostorical map / mätlik ala ajaloolisel kaardil Gleyic Rendzic Leptosol / Gh muld Endogleyic Leptosol / Khg muld Proportion of fi nds / leidude osa spatial generalization of the historical map increases the proportion of false negative predictions. By virtue of a more generalized scale and lower planar precision, the characteristics derived from the historical map are less sensitive to spatial generalization compared to category merging (thematic generalization) (Table 4). P. fruticosa distribution largely matches the historical map tussock areas, a land cover category that was kept separate from other categories at all generalization levels. The explanatory value of the soil map drops abruptly when Gleyic Rendzic Leptosol is merged with other gleyic soils, and the value of the land cover decreases when merging scrubland with forest categories. A rule can be summarized, that the change in indicative value of a categorical layer, when altering the thematic detail, depends on how the most indicative single categories have been merged during generalization.

Discussion
P. fruticosa preference for Endogleyic Leptosol and Gleyic Rendzic Leptosol in our data confi rms the notion that the natural populations of the species in northern Europe prefer thin calcareous soils (alvars), although, according to previous authors and our observations of cultivars, the species is tolerant of different soil conditions and the main limiting factor is shade (Elkington & Woodell, 1963;Marosz, 2004).
The limiting role of shade is confi rmed by this study, as two of the best predictive site features (tussock signs in the historical map and scrubland as the present land cover) represent non-forest areas, since the tussock signs are on some occasions combined with grassland and bushes but never with signs of a forest area. The scrubland and forest categories are mutually exclusive categories also in the current land cover data. In nature, the undergrowth in forests is shaded by canopy but in alvar scrublands there is suffi cient light for P. fruticosa to grow, as the shade from junipers in most places is not as dense as from the forest canopy.
P. fruticosa's preference for alvar soils in natural and semi-natural habitats can be explained by a weaker competition for light, since soil at these sites has been unsuitable for the establishment of dense forest in the period after the last glaciation. According to this, the main natural threat to P. fruticosa in the study area is afforestation of alvar grasslands, and to a lesser extent, competition with junipers, deciduous bushes and herbaceous species. Even a reduction in browsing and the closing down of a cattle farm near the population core area will probably have a long-term negative effect due to the gradual afforestation of the pastures.
Several methodological problems are related to distribution monitoring, including the selection and critical assessment of data sources. For example, the tussock areas on the historical map largely match the current distribution of P. fruticosa, but not everywhere. The fi rst reason for this -a 1: 50 000 topographical map is spatially more generalized than the view of a fi eld mapper; secondly -according to the experience from fi eld observations, a mapped tussock area does not necessarily indicate P. fruticosa dwarf bushes, but can also denote Vaccinium uliginosum, Betula nana, Calluna vulgaris bushes or Molinia caerulea Carex cespitosa tussocs. There are tussock areas according to the historical map, where it is currently hard to assess whether P. fruticosa grew there 80 years ago or not (e.g built-up areas).
A number of studies indicate that a major source of sampling bias in the fi eld mapping of animals and plants is imperfect detection, which leads to under estimated occurrence of the study object (Kéry et al., 2005(Kéry et al., , 2010Bornand et al., 2014;Lahoz-Monfort et al., 2014), especially if its abundance is low (McCarthy et al., 2012). The detectability of even common, persistent vegetation species is far below 100% (Kéry et al., 2006;Clarke et al., 2012). The recorded absence of a focal species is likely a false absence where the environment is conducive and the species was observed in the neighbourhood (Manceur & Kühn, 2014); however, data gathering using prior expert knowledge and relationships between existing occurrence data and environmental variables improves detection rate of rare objects (Albert et al., 2010). Although recognition of P. fruticosa in summer is mostly easy, as the plants reach above the dense grass layer and are blossoming, a few false absence records presumably occur in the original fi eld records, but their share was presumably minimized by higher observation density at major distribution patches, combined with thinning out the excess absences and track points within 50 m of presence sites.
If we consider the width of the on-foot observed track to be approximately 10 m and ignore the partial overlap of the observed belt, the directly observed area still comprises only about 1% of the study area, showing that fi eld surveys alone cannot completely cover any larger study area in detail. Consequently, the distribution pattern of a species has to be modelled, not only in the case of global and regional studies, but also if landscape complexity and mapping resolution makes it unrealistic to cover the total study area with direct observations.
Sites that were relatively suitable for the species but where it had not been registered were preferably observed in order to increase detection relative to the time spent. Preferential sampling based on a priori knowledge is a frequent decision to maximize discovery and cover the targetspecifi c site variability (Luoto et al., 2001;Roleček et al., 2007;Giljohann et al., 2011;Steege et al., 2011). If we had applied a spatially random or regular sampling schema, most of our time in the fi eld would have been spent on sites unsuitable for the species, the proportion of absence records would have been higher, and many occur-rence sites would have remained undiscovered. The results of preferential sampling can better represent the species occurrence than most data in natural history museums and species occurrence registers commonly used in ecological and methodological studies. In this case, as: 1) the species absences were recorded; 2) all observations are made during the same season and during a relatively short period; 3) observations were made predominantly by one person; 4) data thinning reduced pseudoreplication and disproportion between the number of recorded presence and absence locations; 5) the use of proportions instead of the number of presences minimizes the effect of different observation density.
A special effort was made in this study to extract absence sites from recorded moving tracks. A species may absent in a location due to: 1) currently unsuitable environment; 2) dispersal limitations, historical factors, local extinctions or 3) due to incomplete and biased samples (Lobo et al., 2010). Some authors (Jiménez-Valverde et al., 2008;Gogol-Prokurat, 2011) even suggest discarding records on species' absence because of the high risk of false negative records. According to other authors, direct recording of absence sites reduces biases in predicted presence/absence (Phillips et al., 2009;Václavík & Meentemeyer, 2009;Phillips & Elith, 2013). If absence locations are not recorded in addition to presence sites, these should be generated as pseudo-absences, extracted at random from the sites in the study area where the species has not been recorded (Elith & Leathwick, 2007).
Registering absences during surveying can result in so-called zero-infl ated data dominated by absence records. A high overall prediction fi t is easy to obtain if a large part of the study area is clearly unsuitable for the species. To avoid the obscuring effect of surplus absences, eliminating surely expected absence areas has been suggested as the fi rst step in distribution modelling (Mullahy, 1986;Heilbron, 1994;Welsh et al., 1996;Barry & Welsh, 2002), although it is still not a common practice (Titeux et al., 2007). Another means to reduce the excess of absence records is spatial thinning as proposed in this study.
Site features for the recognition of P. fruticosa sites derived from topographical data layers at the most detailed spatial resolution were the most effi cient according to TSS, but not following PPV (Table  2 and 4). Relatively low values of the PPV statistic compared to the TSS point to prevailing false positive predictions -which means, the spatially detailed site features tend to over-estimate species occurrence. Spatially smoothed features (except elevation asl) are not able to distinguish small patches, but are more reliable in species occurrence predictions. The spatially most detailed remote sensing data layer has also not been the best indicator for identifi cation of vegetation units according to some other authors (Marceau et al., 1994;Ghosh et al., 2014). Supposedly, the optimal spatial scale has to be estimated empirically according to the dependent variable and the properties of explanatory data layers.
The predictive power of a categorical explanatory feature may depend remarkably on the order of merging the detailed categories. Formal merging of highly indicative categories of a categorical feature, while reducing thematic resolution, may increase or decrease the predictive value. The best option would be to reduce the only predictor to categories statistically matching the dependent variable. Unfortunately, such a perfect match cannot usually be found in real data.
In this study, categories of explanatory variables were merged according to prior knowledge about their meaning, which might not yield the best classifi cation for the particular prediction task. A statistical algorithm for merging categories of a predictive variable is Exhaustive CHAID (Chi-squared Automatic Interaction Detector), which tests all pairs of categories and merges the categories according to the statistical signifi cance to the classifi cation of the dependent variable until a single splitting pair remains (Kass, 1980). Exhaustive CHAID may require signifi cant computing time if the number of categories is large. Categories merged purely according to statistical signifi cance for a particular target may not be easy to interpret meaningfully and applicable for other data.
In this study, several excluding values of categorical site features were found. Setting value intervals that exclude the species occurrence on the axis of a continuous feature assumes splitting the axis and checking if any of the intervals meet the given frequency and data density premises. The task would be a trivial braking data to quantiles if splitting only according to the number of records is needed, or production of a histogram if the class boundaries were given. The combination of the two conditions complicates the issue as the number of possible (split) points on a continuous axis is infi nite. The proposed regions of frequency algorithm starts from data ordering and checks split points only at existing values.
Overlaying the spatial distribution of excluding and presence-predicting site features does not result in a detailed predictive distribution map, but outlines regions for further fi eld survey and special modelling effort. Delimiting the species-excluding area enables attention to be focussed on the unclear presence/absence area and on the likely presence area when creating and calibrating predictive distribution models in subsequent studies on detailed distribution of the species. Presence/absence modelling or more fi eld observations are needed for the undetermined area, while species abundance modelling could be a priority in the probable presence area.

Conclusions
1. All site features were signifi cantly related to the occurrence of P. fruticosa but most of them are weak predictors.
2. The best predictors of P. fruticosa occurrence are 1) the proportion of presences in the neighbourhood, 2) moist thin calcareous soils according to the soil map, 3) larger scrubland patches according to the topographical database, and 4) tussock areas according to the topographical map from the 1930s. 3. Thematically and spatially the most detailed site features are not always the best indicators. 4. The change in indicative value of a categorical layer, when altering the thematic detail, depends on how the most indicative single categories have been merged during generalization. 5. Restricting a species distribution modelling by excluding the area where the species surely cannot be found could support the modelling effort, although, in this case, the species occurrence remains undetermined in most (93.5%) of the study area. Combination of different predictors at different scales is needed for detailed distribution modelling. 6. Innovative algorithms and tools for preprocessing distribution modelling data were applied for: 1) delimiting the species confi rmed absence, probable and unclear occurrence area; 2) thinning of observed locations; 3) fi nding intervals in values of a numerical variable which match given ratio between states of a Boolean variable.