Using spatial statistics to identify emerging hot spots of forest loss

As sources of data for global forest monitoring grow larger, more complex and numerous, data analysis and interpretation become critical bottlenecks for effectively using them to inform land use policy discussions. Here in this paper, we present a method that combines big data analytical tools with Emerging Hot Spot Analysis (ArcGIS) to identify statistically significant spatiotemporal trends of forest loss in Brazil, Indonesia and the Democratic Republic of Congo (DRC) between 2000 and 2014. Results indicate that while the overall rate of forest loss in Brazil declined over the 14-year time period, spatiotemporal patterns of loss shifted, with forest loss significantly diminishing within the Amazonian states of Mato Grosso and Rondônia and intensifying within the cerrado biome. In Indonesia, forest loss intensified in Riau province in Sumatra and in Sukamara and West Kotawaringin regencies in Central Kalimantan. Substantial portions of West Kalimantan became new and statistically significant hot spots of forest loss in the years 2013 and 2014. Similarly, vast areas of DRC emerged as significant new hot spots of forest loss, with intensified loss radiating out from city centers such as Beni and Kisangani. While our results focus on identifying significant trends at the national scale, we also demonstrate the scalability of our approach to smaller or larger regions depending on the area of interest and specific research question involved. When combined with other contextual information, these statistical data models can help isolate the most significant clusters of loss occurring over dynamic forest landscapes and provide more coherent guidance for the allocation of resources for forest monitoring and enforcement efforts.


Introduction
Tropical forests have undergone substantial changes over past decades, driven largely by the expansion of agricultural lands to satisfy global demand for commodities such as palm oil, timber, wood fiber, soy and beef (Ramankutty et al 2008). Tropical deforestation has multiple environmental impacts, including greenhouse gas emissions, biodiversity loss, and reduction of other ecosystem services such as carbon storage and water supply (Foley et al 2007). Recognizing the environmental, economic and social importance of these ecosystems and the human potential to change outcomes through improved land management policies, many national and international initiatives to reduce deforestation are underway. Commitments to deforestation-free supply chains have been made by many of the world's largest commodity suppliers (Supply Change/Forest Trends 2015), and recent international agreements include ambitious goals to reduce or end deforestation over the next 5-15 years (UNEP 2010, UN 2015, UN Climate Summit 2014, UNFCCC 2016. To understand whether we are making progress towards achieving these goals, monitoring systems that deliver timely and accurate information about forest dynamics become critical. Due in part to REDDþ 6 initiatives and guided by simultaneous advances in technology, many governments are developing forest monitoring systems based on the interpretation of imagery from Earth-observing satellites (Romijn et al 2015). At the global scale, Hansen et al (2013) examined Landsat satellite data to characterize global forest extent, annual loss, and gain from 2000 to 2012. These data and their annual updates through the year 2014, available on the online Global Forest Watch (GFW) platform, provide spatially detailed and timely information on forest dynamics that is both globally consistent and locally relevant. The GFW platform enhances the practical use of these data by providing solutions to the challenges often associated with big data including visualization, storage, analysis, sharing and querying.
These and other data products derived from satellite imagery have fundamentally changed the way the world's forests are monitored. But as sources of data become larger, more complex and more numerous, the ability to quickly explore and interpret patterns with confidence becomes a critical bottleneck for effectively using these data to inform forest policy and management decisions. The Hansen et al (2013) data include billions of individual pixels, and visually evaluating patterns in the data at multiple spatial and temporal scales quickly proves infeasible, even on an online platform optimized for this purpose. For example, figure 1(a) shows forest loss in the Sankuru As the extent of analysis becomes larger, spatial patterns and trends become more difficult to observe by visual inspection alone. 6 Reducing Emissions from Deforestation and Forest Degradation in developing countries, and the role of conservation, sustainable management of forests, and enhancement of carbon stocks. Nature Reserve in the Democratic Republic of the Congo (DRC) between 2000 and 2014. At this local scale, spatial patterns of loss are clear enough to identify where to prioritize interventions or target potential future loss. At the provincial (figure 1(b)) or national (figure 1(c)) scale, however, analysis by visual inspection alone no longer provides clear guidance on which areas to target. Furthermore, these visual inspections do not allow an understanding of more subtle, but still important, trends in the acceleration or deceleration of loss.
Several machine learning techniques and land cover change prediction models have been developed to project future deforestation based on where it has occurred historically (Harris et al 2008, Fuller et al 2011, Rosa et al 2013, Aguilar-Amuchastegui et al 2014. Despite differences in model structure and complexity, most require at least two land cover maps as well as spatial information about the biophysical and socioeconomic factors that correlate with observed change. These models are conceptually robust, but predictive accuracy is dependent on the availability of spatially-explicit information about drivers of land cover change for model calibration that may not be available. Spatial pattern analysis, while not inherently predictive, also has the potential to assist in rapid, consistent identification of priority areas for management intervention. For example, statistical methods have been used to identify trends in biodiversity (Myers 1988, Mittermeier 1998, Myers et al 2000, pollution (Yong-Hui et al 2010, Li et al 2014, Ding et al 2015, and crime (Grubesic 2006, Chainey et al 2008, Xiaoland and Grubesic 2010, Sangamithra et al 2012. In the context of forest conservation, spatial statistics can assist in quickly identifying spatiotemporal trends of forest loss without the explicit need for pre-existing information on what underlying factors are driving these trends. It is clear that tropical forest countries must drastically reduce deforestation over the next several years if the goal set in the New York Declaration on Forests to halve natural forest loss by 2020 is to be realized (Zarin et al 2016). To succeed, decision makers could benefit from accessible information not just about specific locations of historical forest loss, but also about broader trends in these data that can be detected early enough to influence a new trajectory.
The objective of this study was to develop a scalable methodological approach that allows for rapid evaluation of statistical trends in forest monitoring data. To demonstrate our approach, we use the 14-year annual time series forest loss data for Brazil, Indonesia and the Democratic Republic of the Congo (DRC) from Hansen et al (2013). Recent research has pointed out the significant roles these three countries play under various scenarios to reduce deforestation (Zarin et al 2016), but did not identify where, geographically, each country should focus policy within its borders to achieve these reductions. An approach that delivers rapid insight towards answering this question could assist in the creation of more timely and coherent policies on where to focus forest monitoring and enforcement efforts going forward.

Definitions
The term 'hot spot' has been used generically across disciplines to describe a region or value that is higher relative to its surroundings (Lepers et al 2005, Aben et al 2012, Isobe et al 2015. In a forest conservation context, WWF (2015) identified deforestation 'fronts' as broad regions of concern based on expert opinion and scenario analyses where available. Here, we define a hot spot as an area that exhibits statistically significant clustering in the spatial pattern of forest loss. Hot spots are locations where observed patterns are not likely the result of random processes or of subjective cartographic design decisions; they represent places where there are underlying spatial processes at work (Getis and Ord 1992). Emerging Hot Spot Analysis extends this definition to incorporate information about the temporal dimension of the data.
In this analysis, forest is defined at the 30 m Landsat pixel scale as areas with the biophysical presence of trees >5 m in height and with tree canopy density greater than 25% in the base year 2000; these areas may take the form of natural forests or tree plantations. Forest loss is defined as a 'stand replacement disturbance', meaning the removal or significant reduction of tree cover that can result from a variety of factors globally (Hansen et al 2013). Thus 'forest loss' as defined here does not always equate to deforestation, i.e. conversion of forest to a new, nonforest land use (FRA 2015, Tropek et al 2014. However, deforestation dominates the forest loss dynamic observed within the three study areas chosen for this analysis. We analyzed emerging hot spots of forest loss only, because the forest gain product of Hansen et al (2013) is not annual. Furthermore, data from Hansen et al (2013) are used here for illustrative purposes; these could be combined with other information on land use or forest type or replaced with other spatiotemporally explicit data where available (e.g. annual deforestation data for the Brazilian Amazon, 1988-2014, INPE 2016 to reflect alternate definitions of forest and forest loss/deforestation.

Study area
Emerging Hot Spot Analysis could be performed at a global scale using one set of standardized parameters across all geographic domains. However, to be relevant for land use policy, these analyses are likely to be more useful if they relate to dynamics occurring within specific domains of legal responsibility. Brazil, Indonesia and DRC consistently rank within the top ten countries globally for highest absolute forest loss Environ. Res. Lett. 12 (2017) 024012 (Hansen et al 2013), and are home to the largest remaining contiguous expanses of tropical forest (Potapov et al 2008). In addition to their conservation value, these countries also represent ideal candidates to test our methodology because each is dominated by a different forest disturbance dynamic, ranging from large tracts of forest cleared in Brazil for industrial agriculture to small patches of forest cleared in DRC to dynamic forest landscapes in Indonesia interspersed with tree plantations managed for industrial oil palm and wood pulp. We focus our analysis at the national scale for these three countries, but also demonstrate the utility of a scalable approach by generating results for a single province as well as a single protected area within DRC.

Input data and pre-processing
A summary of the processing steps used in our analysis is shown in figure 2. Input data consisted of forest extent and annual forest loss data produced by the University of Maryland and Google (Hansen et al 2013 and updates available at globalforestwatch.org). These are delivered as raster data at 30 m spatial resolution. Forest loss in the original data product that occurred outside the forest extent mask applied (>25% tree canopy density) was not considered in the analysis.
Before running statistical analysis, we transformed the data into a netCDF (Network Common Data Form) data cube structure by aggregating forest loss points in each country into space-time 'bins' with a spatial resolution of 2.5 km (figure 2). The value of each bin was assigned as the count (number) of forest loss incidents in the bin for a given year. The netCDF structure stores space as latitude and longitude coordinates and time (i.e. the year the loss was observed) as another dimension. By assigning a count of loss points to each bin for all locations containing forest in the year 2000, the trend in point counts over time could be evaluated. Unlike Brazil and DRC, data for Indonesia were not processed as a country. Rather, we processed Kalimantan and Sumatra (the two Indonesian islands with highest forest cover) independently (i.e. data points on each island did not affect analysis results for the other island) and other island groups were excluded from the analysis. The decision to aggregate data into 2.5 km bins was made after empirically testing bin sizes ranging from 1 km to 50 km; the final selection of 2.5 km preserved a varied frequency distribution of forest loss counts across bins and also preserved a useful interpretation of results that visually tracked known ground features such as lakes, rivers and non-forested areas. Small adjustments to bin size did not significantly impact final results.
Parallel processing and data management were conducted using a combination of several open-source software platforms, including Spark and Hadoop. The Esri ArcGIS Emerging Hot Spot Analysis geoprocessing tool was used for statistical analysis and ESRI software was used for map symbolization. All processing was done in an Amazon Web Services cloud computing environment.

Statistical analysis
The Emerging Hot Spot Analysis tool evaluates spatiotemporal patterns in forest loss in each country using a combination of two statistical measures: (1) the Getis-Ord Gi Ã statistic (Ord and Getis 1995) to identify the location and degree of spatial clustering of forest loss, and (2) the Mann-Kendall trend test (Mann 1945, Kendall andGibbons 1990) to evaluate temporal trends across the time series.
First, the Getis-Ord Gi Ã statistic measures the intensity of clustering of high or low values (i.e. counts of forest loss) in a bin relative to its neighboring bins in the data cube. The sum for a bin and its neighbors is compared proportionally to the sum of all bins. When a bin's sum is different than expected, and that difference is too large to be the result of random chance, a statistically significant Z score is the result. The Getis-Ord Gi Ã statistic generates Z scores (standard deviations) and P values (statistical probabilities) for each bin that indicate whether forest loss in a given bin is statistically clustered compared to loss in neighboring bins, as well as loss across the entire analysis domain. A Z score above 1.96 or below À1.96 means that there is a statistically significant hot spot or a statistically significant cold spot of forest loss at a significance level of P <0.05. The larger a bin's Z-score, the more intense the clustering of values (hot spot). Due to the cube structure of the data, neighboring bins exist both in time and in space. The Queens Case Contiguity method was used to define neighborhood size in space and temporal neighbors were defined using one prior time-step interval (1 year). Environ. Res. Lett. 12 (2017) 024012 Second, the Emerging Hot Spot Analysis tool uses the Mann-Kendall statistic (Mann 1945, Kendall andGibbons 1990) to test whether a statistically significant temporal trend exists through each bin's 14-year time series of Z-scores resulting from the Getis-Ord Gi Ã statistic. The most common method for examining the existence of a trend is simple linear regression, a parametric statistical technique that requires independent random samples drawn from normally distributed populations (Taylor and Taylor 1977). However, these requirements may or may not be met in wall-towall geospatial data sets (vs. a statistically drawn sample (Schlagel and Newton 1996). Under these conditions, the use of a non-parametric statistical test is more appropriate. The Mann-Kendall test, an application of Kendall's rank order correlation test to time series data (Bradley 1968), is a nonparametric test for zero slope of the linear regression of time-ordered data versus time (Gilbert 1987).
In this case, each bin with data represented its own independent time series. To evaluate temporal trends for each bin, each time step was compared to the one after it. If the Z-score in the second time step was larger than the first time step, the result was þ1 (increasing trend). If the Z-score in the second time step was smaller than the first time step, the result was À1 (decreasing trend). Each pair of time steps was compared over the 14-year series, generating the Mann-Kendall statistic with associated trend Z-score and p-value for each bin. The expected sum is zero, indicating no trend in the values over time. Based on the variance for the values in the bin time series and the number of time periods, the observed sum is compared to the expected sum (zero) to determine if the difference is statistically significant or not (P <0.05). The cluster and trend results from the Getis Ord Gi Ã and Mann-Kendall statistics are then used to categorize each bin.

Output maps
By default, the Emerging Hot Spot Analysis tool categorizes each bin into one of seventeen distinct categories that cover a range of scenarios: one category of non-significance along with eight hot spot and eight cold spot categories, each reflecting a different configuration of spatiotemporal significance (ESRI 2016). However, introducing too much nuance in output maps has the potential to detract attention away from the most relevant results for forest conservation. Thus we chose not to display cold spots and other areas with non-significant trends, and aggregated hot spots into five categories, each representing a different temporal state: new, sporadic, intensifying, persistent, and diminishing (table 1)

Brazil
Between 2000 and 2014, Brazil lost an average of 2.7 Mha yr À1 of forest (figure 3(a)), although overall rates of loss declined over this time period. The diminishing hot spot category is prominent within the Amazonian states of Rondônia and Mato Grosso, indicating the deceleration of the clustering of loss there. Meanwhile, the historical 'arc of deforestation' is expanding west through the state of Acre and advancing inwards toward more intact portions of the Amazon ( figure 3(b)). A new and significant cluster of forest loss appears in the state of Roraima, where new settlements are expanding from an existing road network ( figure 3(c)). Loss has also shifted towards the cerrado biome (Beuchle et al 2015), as evidenced by the large areas of new and intensifying hot spots there ( figure 3(d)). The overlap of the two predominant hot spot clusters with locations of tree plantations in the southern part of the country in the Mata Atlantica biome likely reflect plantation harvest cycles of the pine and eucalyptus plantations that are prevalent in the states of Parana, Rio Grande do Sul, and Santa Catarina (figure 3(e)). A location that is an on-again then off-again hot spot. Less than 12 of the 14 years have been statistically significant hot spots. Intensifying A location that has been a statistically significant hot spot for more than 12 of the 14 years (>90%), including the final time step (2014). In addition, the intensity of clustering of high counts in each time step is increasing. Persistent A location that has been a statistically significant hot spot for more than 12 of the 14 years (>90%), with no discernible trend indicating an increase or decrease in the intensity of clustering over time.

Diminishing
A location that has been a statistically significant hot spot for more than 12 of the 14 years (>90%). In addition, the intensity of clustering of high counts in each time step is decreasing, or the most recent year (2014) is not hot.
Environ. Res. Lett. 12 (2017) 024012 However, since oil palm is cut and regrown on 25-30 year cycles and the palm oil industry here is relatively nascent, the majority of the identified hot spots likely reflect natural forest loss to establish the plantations, rather than harvest cycles occurring within already existing plantations.

Democratic republic of the Congo
From 2000 to 2014 DRC lost an average of 0.57 Mha yr À1 , and the rate of forest loss between 2011 and 2014 increased by a factor of 2.5 ( figure 5(a)). Like Kalimantan and Sumatra in Indonesia, DRC contains vast areas of new hotspots, many of which are situated along the country's road network ( figure 5(b)). Unlike Indonesia, however, many more of these new hot spots intersect with intact primary forests. New hot spots also extend within the boundaries of protected areas such as the Sankuru Nature Reserve, the world's largest continuous protected area for great apes ( figure 5(c)). New and intensifying hot spots of loss radiate out from the city of Beni in the eastern province of North Kivu,

Discussion
Here we demonstrate the usefulness of Emerging Hot Spot Analysis as an approach for evaluating statistically significant clusters of forest loss. Unlike traditional predictive modeling approaches, our method and workflow can be implemented quickly and requires no Environ. Res. Lett. 12 (2017) 024012 information on the underlying drivers of loss, yet results can provide insight about potential future trajectories and locations of forest change. We apply this method at the national scale for three tropical forested countries over 14 years, but the approach is scalable to any spatial or temporal domain with more than ∼10 time steps. For very local applications, this type of analysis may not offer much advantage over viewing the data in its native resolution (in this case, 30 m). Emerging Hot Spot Analysis becomes much more valuable for evaluating phenomena occurring over broader scales than can be directly observed, and where limited time and resources may necessitate the prioritization of monitoring and enforcement efforts to target specific fronts of new or accelerated loss. Decision makers tasked with vetting proposed forest protection policies or commodity supply chain sourcing decisions through multiple stakeholder groups may benefit from the support of maps that are easy to interpret, yet grounded in robust statistical analysis. National and subnational governments seeking to obtain and receive results-based payments for REDDþ activities under the UNFCCC may also find the emerging hot spot framework useful for informing their national strategies and action plans.
The fixed scale of an emerging hot spot analysis is a critical concept to understand for proper interpretation of results, because emerging hot spot maps are dependent upon the extent of the analysis domain under consideration. For example, figure 6 shows how emerging hot spot maps differ slightly depending on the three different analysis domains defined in figure 1: a single protected area, a province, and all forests of DRC. Output maps differ in part because results for each bin are calculated relative to different 'global means' , representing the average loss count across the entire analysis domain.
Results are also sensitive to neighborhood size, or the distance over which each point is compared to all others. As neighborhood size increases, hot spots will become larger and fewer; smaller neighborhood sizes capture more localized trends. There is inevitably an element of subjectivity in choosing an appropriate value for neighborhood distance; Atkinson and Unwin (2002) state that 'subjective judgement based on a range of density surfaces is as good a method as any' for determining an appropriate neighborhood size. In Environ. Res. Lett. 12 (2017) 024012 practice, the scale of the analysis should influence the neighborhood size, and specific neighborhood distances used in this analysis are provided in figures 3-6. A range of neighborhood distances, and the impact on resulting maps, is also shown in figure 7. Despite the variations that emerge from the use of different neighborhood distances, locations of emerging hot spots are broadly consistent across maps. Using an intermediate neighborhood distance relative to the scale of analysis, as we do for generating results in figures 3-6, highlights specific local hot spots while also allowing other more general patterns to emerge.
All spatial clustering approaches, regardless of their theoretical underpinning, statistical foundation, or mathematical specification, have limitations in accuracy, sensitivity, and the computational effort required for identifying clusters. As a result, a major challenge in practice is determining which technique(s) will provide the most meaningful insights for a particular issue or context (Grubesic et al 2014). Contextual data layers are therefore extremely important for interpreting results, as is knowledge about the accuracy of the underlying data product used to generate statistical output. Across the tropical biome, user's and producer's accuracies of the forest loss data used for this analysis are 87.0 and 83.1%, respectively (Hansen et al 2013). Another more detailed accuracy assessment by Tyukavina et al (2015) shows that the original Hansen et al (2013) data product (version 1.0) underestimates loss in Brazil by 9%, overestimates loss in Indonesia by 9%, and underestimates loss in DRC by 65%. Lower accuracy in DRC was due to missed loss in landscapes dominated by smallholder rotation agriculture. Since original publication, the forest loss data of Hansen et al (2013) have been updated twice, and version 1.2 now includes loss up to the year 2014. The The framework outlined here demonstrates the added analytical value offered by forest monitoring data with high temporal cadence, or resolution. While we use a widely cited data product to demonstrate our approach (Hansen et al 2013) in three countries with relatively well studied forest loss narratives, any other data with a space and a time component can be evaluated using the same method. Because the Hansen et al (2013) forest loss product is binary (i.e. values represent the presence or absence of loss), counts of forest loss within each bin were used as the aggregation method. Depending on the specific research question, similar analysis could be run on other binary data products, such as active fire points (NASA FIRMS 2016) or weekly forest disturbance alerts . Data could also be aggregated by value rather (d) 50 km. As neighborhood distance increases, hot spots become larger and fewer; smaller neighborhood distances capture more localized trends.
Environ. Res. Lett. 12 (2017) 024012 than by count for continuous variables; for example, for applications targeting locations for interventions to reduce emissions from deforestation (REDDþ), rather than the area of deforestation, bin values could reflect the amount of biomass lost within a bin in a given year. In this example, point locations and years would be the same as for the forest loss data of Hansen et al (2013), but resulting hot spot maps would differ based on the aggregation method, especially in areas where there is substantial variation in forest biomass, such as in the Amazon vs. cerrado biomes of Brazil (figure 9) , Ometto et al 2013, Achard et al 2014. In the emerging hot spot map of biomass loss, the quantity of biomass lost within a bin drives the analysis, rather than the presence or absence of loss.
The maximum domain over which these types of analyses can be run was formerly defined by the processing power of the machine used to run the analyses. Recent and rapid advances in cloud computing and parallel processing have removed this barrier, such that complex data analysis over both small and large regions is possible and sensitivity of results to different parameters can be assessed. We have demonstrated an approach for quickly and dynamically assessing emerging clusters of forest loss, and the analysis can be re-run for new temporal windows as additional time steps are added. Moving forward, analysis of forest monitoring data using the presented approach could help corporate sustainability officers, environmental groups, resource managers, and governments gain rapid insights on where to focus conservation and management interventions for custom areas and time periods of interest.

Conclusions
The acquisition of satellite data with decreased latency times is providing massive streams of near real-time Environ. Res. Lett. 12 (2017) 024012 data to the remote sensing research community. The key to applying these new data streams in a forest conservation context will be to process and analyze them quickly, turning raw data into actionable insights that facilitate timely and well informed management decisions. Continued innovation in both software and cloud computing will enable the approach outlined here to be applied in novel ways to allow greater understanding of, and better management of, complex environmental systems.