A predictive model for seasonal pond counts in the United States Prairie Pothole Region using large-scale climate connections

The Prairie Pothole Region (PPR), located in central North America, is an important region hydrologically and ecologically. Millions of wetlands, many containing ponds, are located here, and they serve as habitats for various biota and breeding grounds for waterfowl. They also provide carbon sequestration, sediment and nutrient attenuation, and floodwater storage. Land modification and climate change are threatening the PPR, and water and wildlife managers face important conservation decisions due to these threats. We developed predictive, multisite forecasting models using canonical correlation analysis (CCA) for pond counts in the southeast PPR, the portion located within the United States, to aid in these important decisions. These forecast models predict spring (May) and summer (July) pond counts for each region (stratum) of the United States Fish and Wildlife Service’s pond and waterfowl surveys using a suite of antecedent, large-scale climate variables and indices including 500 millibar heights, sea surface temperatures (SSTs), and Palmer Drought Severity Index (PDSI). Models were developed to issue forecasts at the start of all preceding months beginning on March 1st. The models were evaluated for their performance in a predictive mode by leave-one-out cross-validation. The models exhibited good performance (R values above 0.6 for May forecasts and 0.4 for July forecasts), with performance increasing as lead time decreased. This simple and versatile modeling approach offers a robust tool for efficient management and sustainability of ecology and natural resources. It demonstrates the ability to use large-scale climate variables to predict a local variable in a skilful way and could serve as an example to develop similar models for use in management and conservation decisions in other regions and sectors of the environment.


Introduction
Encompassing nearly 800 000 km 2 in North America [1], the Prairie Pothole Region (PPR) is important hydrologically and ecologically. It is the location of over 2.5 million wetlands [2] which are located in depressions, or potholes, left from the recession of the Laurentide ice sheet in the Pleistocene epoch [3]. It is the largest wetland complex in North America [4]. The depressions sometimes contain ponds or lakes, ranging in permanence from ephemeral to persistent, that serve numerous environmental purposes. Colloquially called the 'duck factory', it is a favorable habitat for the yearly breeding of over 50% of the North American duck population during spring and early summer [5].
Other biota, such as insects and riparian vegetation, also call these potholes their home [1,5,6]. In addition to being a very important habitat for various forms of life, the wetlands also provide carbon sequestration, sediment and nutrient attenuation, and floodwater storage [7]. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Wildlife managers in the PPR must develop and implement hunting and conservation rules and regulations and actions to preserve or maintain habitat. These decisions become more complicated in the face of threats to the ecosystems they manage. Near-term threats include land modification due to agriculture or oil/gas well production and climate variability, and longer-term threats include climate change. Managers must respond to both, though near-term threats typically take priority for addressing within agency resource constraints [8]. Complicating matters further, much of the land in the United States portion of the PPR is privately owned [9]. Thus, managers must work closely with the private land owners and convince them to enter conservation easements, typically allocated by the Conservation Reserve Program. Additionally, the program is losing federal funding, further hampering the process of allocating easements.
Agricultural modifications (e.g. tiling) and energy development (e.g. oil/gas wells) are affecting the PPR and have reduced the ecological productivity of the region. These anthropogenic modifications can diminish or drain wetlands entirely, alter their landscape, or introduce contaminants. Between 16% and 18% of the PPR was historically (circa 1850) covered by wetlands, but up to 65% had been drained by the mid-1980s, primarily due to agriculture [10]. Total wetland area declined between 1997 and 2009 by another 1.1%, or approximately 30 100 ha [2]. Wright and Wimberly [11] found that there was a net loss of 528 ha of grassland from 2006 to 2011. Agricultural modifications change the landscape, altering soil permeability, runoff pathways, and evapotranspiration [12]. A United States Geological Survey evaluation found that 34 of 48 water samples from the PPR were contaminated with brine [13]. These modifications decrease pond and wetland numbers and introduce contaminants, therefore decreasing the area for viable habitat of dependent species.
Average temperature and precipitation are increasing in the region due to climate change [14], and the region is experiencing much climate variability on both spatial and temporal scales [14][15][16][17]. The region is highly sensitive to annual variation in weather conditions, with these playing an integral role in the hydrologic productivity of the region [18][19][20]. Summer rainfall in the southeast PPR is extremely important because it can contribute approximately half of the annual precipitation [16,21]. Summer rainfall in the southeast PPR is projected to decrease [22] therefore increasing the danger this region faces.
In the face of these multiple threats, managers in the region must make informed choices on actions to protect and manage wetlands and implement recreational regulations. Several conservation strategies are undertaken to maintain and improve habitat, both ponds and surrounding grasslands, in order to maximize waterfowl nesting success, and are timed within a season to ensure that treatments do not adversely affect migratory birds [23]. For example, light to moderate grazing treatments are performed to ensure proper nesting habitat for migratory waterfowl [23,24], and are conducted after 1 June to ensure waterfowl are not disturbed by livestock [23,25]. Similarly, noxious weed control is administered to prevent further degradation of grasslands and wetlands [23], and haying/prescribed burns are executed after 1 August each season [23]. These decisions are made while considering near-term threats, longer-term threats, and agency resource constraints.
These management responses are on two distinct time scales. First, there are month-to-month/year-toyear actions that may include habitat restoration/ revegetation or management of invasive plants that often respond to seasonal variability in temperature and precipitation. Some decisions, such as determining hunting quotas, use information on prior year hunting totals in addition to upcoming season forecasts. Managers often make decisions early in the year on whether and where to implement these and other actions, making the best use of their resources. Second, there are longer-range planning decisions with effects over multiple years, such as easements, intended to protect wetlands in the face of land use conversion or agreements with landowners to apply treatments such as those described above.
The focus of this work is on predictive models that may help managers make decisions earlier in the year regarding what actions to choose by providing longer lead times on a sub-PPR scale in regions managers are accustomed to working with. These decisions will therefore be based on a better model of how observable, predictable atmospheric and hydroclimate metrics relate to pond habitats.
We, therefore, develop predictive models that forecast pond counts on a sub-PPR scale using a novel, reduced-dimension approach based on principal component analysis (PCA) and canonical correlation analysis (CCA). These models have varying lead times back to 1 March. Other available predictive models make a single, regional pond count forecast, do not include large-scale climate variables, make currentyear forecasts with only weeks of lead time, or some combination thereof [26][27][28]. Multiple linear regression was used by Larson [26] to determine the amount of pond count variability accounted for by local climate variables. The model used a suite of 10 predictors: previous year's pond counts plus seasonal and yearly averages of minimum/maximum temperature and precipitation. However, most of these variables were not available until May meaning forecasts would be issued at that time. Sorenson et al [28] used simple linear regression on pond and duck counts with concurrent May Palmer Drought Severity Index (PDSI). Forecasts with this model can be issued after May PDSI becomes available-approximately early June. In a more recent study, multiple linear regression was used on May pond count numbers from sub-regions of the PPR [27]. Variables included the preceding year's temperature and precipitation (May-April), the previous year's pond counts, a temperature and precipitation interaction term, and location (longitude and latitude). Forecasts with this model can be issued in early May. We seek to improve on these forecast models by developing predictive models using a novel approach not used previously for this purpose; the models are informed by a suite of predictors including large-scale climate indices, have lead times beginning 1 March (therefore providing longer lead time), and forecast pond counts on a sub-PPR scale.
We begin by outlining the study region and data incorporated in the work. Following that, we discuss the methods involved in creating predictors, developing our predictive models, and evaluating model performance. We conclude by providing results, comparing them to other similar models, and discussing their relevance for decision-making.

Study region and data
This study focuses on the southeast PPR [22] located within the United States (SEPPR). We exclude the portion that extends into Manitoba, Canada. This region is drawn in a dashed line in figure 1. Data were retrieved from various sources for this study (table 1). We prioritized use of temporally concurrent and spatially conterminous data.
Data for pond counts and duck populations were retrieved from two surveys conducted by the United States Fish and Wildlife Service (USFWS). In May, the  , and strata 43-49 were determined to be spatially coincident with the SEPPR (figure 1). Sea surface temperature (SST) anomalies and 500 millibar height (500 mb) anomalies were retrieved from the International Research Institute's (IRI) Data Library which compiles raw data from various sources into a common format. The SST anomaly data are a concatenation of Kaplan et al [29] and Reynolds and Smith [30]. The 500 mb anomaly data are from National Oceanic and Atmospheric Administration's (NOAA) Climate Data Assimilation System I (CDAS-1) data.
We retrieved PDSI data spanning from 1895 to 2017 from the NOAA National Centers for Environmental Information (NCEI) nClimDiv dataset [31]. The data were retrieved on a monthly scale by climate division. We chose 22 climate divisions to represent the SEPPR because they were spatially coincident.
May pond counts and duck population correlate strongly (linear correlation has possible values of −1 to 1) with a linear correlation (Helsel and Hirsch [32], Montgomery and Runger [33]) of 0.87 (figure 2). We can infer duck population with pond count because of this strong correlation, a connection that has been well-documented [22,34,35]. July pond counts also have high linear correlation with May pond counts (0.78, figure 2) allowing us to extend our modeling framework to predict July pond counts.
This connection between variables, however, comes with stipulations. Other factors such as wetland-cover type, emergent vegetation distribution and characteristics, and interactions among ducks can cause ducks to relocate [36]. Salinity of ponds also plays a role, because the plant communities are sensitive to the level of brine in ponds [37]. We acknowledge these confounding factors exist, but rely on the strong relationship for inferences.
Ponds and ducks have shown an increasing trend in the SEPPR (figure 2). Ponds in the SEPPR are increasing due largely to the increase in precipitation in the PPR [14] which is in part due to increases in the SEPPR in all seasons but summer [22]. Also attributable are conservation efforts over the recent decades. Duck counts have followed due to the observed relationship. Variability has increased slightly since the beginning of the observed period. The cause for this is an ongoing research area. Some potential explanations include the increase in extreme events due to climate change and effects due to land use conversion. Variability is expected to continue to increase in the region [38].

Methods
When forecasting one field of data with another, CCA [39] is a common approach. It has been used for multisite forecasting of precipitation (e.g. [40]), temperature (e.g. [40], [41]), streamflow (e.g. Salas et al 2010), SSTs (e.g. [42]), and wind (e.g. [43]). In this study, we use CCA to predict the field of pond counts using a suite of developed predictors. First, we describe how the suite of predictors was developed. This is followed by a description of the CCA process used to develop our predictive models. A more detailed description of these processes can be found in appendices A and B. We conclude this section with details on how the models were evaluated on both fit and skill.

Development of predictors
Pond count is physically linked to the amount of water available-a function of groundwater storage, direct precipitation, and snowmelt runoff [18]. This direct connection to precipitation motivated exploratory data analysis of the region's precipitation (per climate division). This analysis provided evidence that SEPPR summer precipitation variability is driven by 500 mb systems and SSTs in the Pacific and Atlantic Ocean, and it has strong connections to PDSI. We performed PCA [44] on the climate division summer precipitation data to find the leading modes of variability. These modes were then found to be correlated to SST and 500 mb anomalies thus establishing connections to these large-scale variables. It was also observed that PDSI was correlated to the leading modes. Connections were found during wet and dry years as determined by the leading mode. Informed by this, we use these three variables, developed into indices, as predictors for use in our predictive models.
Reference [28] based their models on PDSI because of its ability to serve as a single composite measure of dryness. It is a widely used index which estimates relative dryness based on current and antecedent temperature and precipitation [45]. In our models, it represents past local climate conditions. Large-scale climate features such as 500 mb and SST persist seasonally, thereby representing future conditions.
We used regions of highest magnitude linear correlation (Helsel and Hirsch [32], Montgomery and Runger [33]) between May pond counts and spacetime fields of 500 mb and SST anomalies to develop eight predictors (three SST, five 500 mb). The fields were for the preceding winter (December-February), March, and April. The high correlation between May and July pond counts (figure 2) allowed us to use the same predictors along with May pond counts for July models.
Regional PDSI values from 22 climate divisions representing the SEPPR were averaged to one value per year for winter, March, and April resulting in three PDSI predictors.

Predictive model
For our predictive model, PCA [44] is performed first on the two datasets (pond counts and predictors); both are scaled (by subtracting the mean and dividing by the standard deviation) prior to PCA. It is frequently used alongside CCA as a dimension-reducing technique because, typically, datasets used are large. Though our datasets are not large, we include PCA to allow managers to add more strata or predictors making our framework versatile.
The July survey contains missing values, across all strata for some years (e.g. 1989) and for select strata during specific years (e.g. 1958). Years in which data for all strata were missing were removed for figure 2. For CCA, years with missing values in at least one stratum were removed.
After performing PCA, a user-defined number of PCs (N pc ) are retained for CCA. Then, CCA is performed on the selected N pc to provide N pc canonical component (CCs) pairs. Then, a regression is performed with the sets of CCs allowing prediction of N pc from one dataset using the other. Finally, predicted values (scaled) can be found by multiplying the eigenvalues (from PCA) for the appropriate dataset by the predicted PCs to back transform then rescaled as desired. For plotting purposes (figures 4, 6-8), they were left scaled.
Pond count was chosen as the response variable because it is dependent on physical processes thus lowering the potential for confounding and hidden variables. Duck population, however, is more dependent on biological processes in addition to environmental conditions therefore increasing the confounding factors. The connection between the two variables (figure 2) allows managers to infer duck population based on pond count.
Models were developed forecasting both May and July pond counts for individual strata at the start of all preceding months beginning on 1 March. Predictors were included based on availability related to forecast date. For example, forecasts made on 1 April cannot include predictors developed using concurrent April SST. This resulted in a smaller suite of predictors for earlier forecasts (longer lead time).

Model performance
Model fit and forecast skill are measured by calculating the correlation coefficient, R, and root mean square error, RMSE (Helsel and Hirsch [32], Montgomery and Runger [33]). These two metrics, commonly used to evaluate regression models, were chosen because they provide a measure of contiguity between predicted and observed values-RMSE is a measure of difference between the two, and R measures how closely the two vary together. For model fit, all data are used to construct the model, predictions are made for all years, and the observed values are compared to the modeled values using R and RMSE. To test model performance in a predictive mode (i.e. model skill), leave-one-out cross-validation (LOOCV) was used.
In LOOCV, one of the observations is dropped. The model is then fitted to the remaining observations and the dropped value is predicted. We performed this process chronologically, starting with the first year of observations then ending with the last year, similar to jackknife [46]. However, jackknife is typically used for parameter estimation. R and RMSE were then calculated using all the LOOCV predictions and the observed values.

Model fit
In general, models fit well for most strata. For the May pond count models, most R values were above 0.65, even when forecasting in March (figure 3). The R values for the July forecast models were mostly above 0.5 including models with the longest lead time (figure 5). Strata with models that fit well were close to the 1:1 line when plotting observed versus modeled values (figures 4(a), (b), 6(a) and (b)) whereas the poorly fitting strata displayed obvious outliers (figures 4(c), (d), 6(c) and (d)). Stratum 47, located in eastern North Dakota, fit worst for most models (figures 3, 4(d), 5, 6(d)), and its fit was disparately low in comparison. Stratum 45 (figures 3(a), 4(b)) and 43 (figures 5(a) and 6(b)), located in northern and western North Dakota, fit poorly in May and July forecast models, respectively, when forecasting in March, but fit improved for the successive lead times. For most strata, R increased and RMSE (not shown) decreased with shorter lead time. The only exception was stratum 47 in May forecast models which had static fit metrics from April to May lead times.

Cross-validation forecast skill
Model performance in LOOCV mode was similar to fitted model results with a minor reduction in performance. For May pond count forecasts, most R values were above 0.6 and RMSE is below 1 (equivalent to 1 standard deviation since the data is scaled), including forecasting in March (figure 7). The R values for the July forecast models were mostly above 0.4, and RMSE was mostly below 1 (figure 8). Stratum 47 again performed weaker than other strata, and its forecast skill was low regardless of lead time. Strata 45 and 43 performed poorly in May and July forecast models, respectively, but forecast skill improved with lead time. As with model fit, R increased and RMSE decreased with shorter lead time for most strata.      Exceptions include stratum 47 in May models and strata 47 and 48 in the July models.

Comparison to other similar models
Our models provide longer lead times, forecast for multiple strata, include large-scale variables, and exhibit similar performance when compared to other pond forecast models available. A common measure of model skill is the variation explained by the model, measured by the coefficient of determination, R 2 (Helsel and Hirsch [32], Montgomery and Runger [33]). Our May forecast models, from longest to shortest lead time, explain 43%, 52%, and 58% (median across strata) of pond count variation, respectively. The July forecast models, from longest to shortest lead time, explain 29%, 46%, 51%, and 56% (median across strata) of pond count variation, respectively. Individual strata R 2 values can be seen in table 2. Larson's [26] best model explained 65% of variation in percent basins holding water, and the model given by Niemuth et al [27] explained 62% of pond number variation Sorenson et al [28] were able to explain 72% of May pond count variation using only May PDSI. Generally, our shortest lead time models perform similarly to these models, and the variation explained remains strong with longer lead time aside from the March forecast for July pond counts.
Stratum 47 has lower R 2 than other strata for both models. Exploratory data analysis reveals one possible explanation. There are two extreme years present within the May pond data (figure 4(d)) and removing these improves model performance. In the July data, removing one extreme year also improves model performance ( figure 6(d)). Therefore, the models are not capturing the extreme years.

Summary and discussion
We developed a suite of predictors for SEPPR pond count informed by previous work investigating largescale climate links to summer precipitation in the SEPPR. Using these predictors, we developed predictive models using CCA that can provide relevant information for SEPPR managers as they plan conservation efforts such as determining hunting quotas and the timing and location for rotational grazing, prescribed burns, haying, and noxious weed control.
In providing a better predictive tool, these models can help managers make more informed choices for actions to preserve and maintain habitat in the face of near-term threats. This tool provides longer lead times than currently available models and pond forecasts in regions consistent with areas that managers use-the survey strata. Managers may use our models for spatially explicit forecasts of where pond habitat might be better or worse (i.e. where there are fewer ponds and therefore pond area), and thus inform their planning at varying times in the seasonal decision-making process. Additionally, these models use variables that influence the climate variability of the region and capture seasonal climate information and trend. Capturing this variability is important because managers are very interested in the climate variability of the region [8]. These models exhibit high skill exemplified by high linear correlation values and low RMSE in LOOCV mode. With a few exceptions, this skill increased as lead time decreased.
Our CCA framework provides versatility that can be adapted to accommodate needs and interests of managers. Variables can easily be added (or removed) to either dataset. For example, other strata could be included for predictions as desired by managers. Other variables, such as snowmelt, could be added to the predictor dataset to incorporate their information in the model.
An important utility of these models lies in their ability to provide information on ecological health based on the strong connection between pond count and duck population (figure 2). Managers must still be mindful that duck population depends on other factors such as pond salinity (LaBaugh et al [47]), wetland-cover type, vegetation characteristics (also dependent on salinity), and interactions among ducks [36]. They should also be careful making these inferences in strata with lower model skill, but broad inferences can still be made with confidence. Further, studies providing similar pond count models do not include large-scale climate variables, have little lead time, do not forecast for individual strata, or a combination thereof. This work has a broader impact because we demonstrate the use of a powerful forecasting technique, CCA, and how it might be used for management decisions. This application could serve as an example to develop similar models for use in other management and conservation decisions in other regions of the world and other sectors of environment. Further, we demonstrate the ability to use large-scale climate variables to predict a local variable (pond count) in a skillful way. This method could be applied to other variables, some of which were referenced previously. We hope the technique will reach a different audience than has been in previous works. ification. Emily Silverman and Meghan Eyler assisted in providing the USFWS survey strata shapefile. We also thank Heather Yocum and Angela Boag for insights regarding PPR wildlife management.
We thank two anonymous reviewers whose comments helped to improve this manuscript.

Data availability
The data that support the findings of this study are openly available at the following URL/DOI.