Mapping oil palm density at country scale: An active learning approach

Accurate mapping of oil palm is important for understanding its past and future impact on the environment. We propose to map and count oil palms by estimating tree densities per pixel for large-scale analysis. This allows for fine-grained analysis, for example regarding different planting patterns. To that end, we propose a new, active deep learning method to estimate oil palm density at large scale from Sentinel-2 satellite images, and apply it to generate complete maps for Malaysia and Indonesia. What makes the regression of oil palm density challenging is the need for representative reference data that covers all relevant geographical conditions across a large territory. Specifically for density estimation, generating reference data involves counting individual trees. To keep the associated labelling effort low we propose an active learning (AL) approach that automatically chooses the most relevant samples to be labelled. Our method relies on estimates of the epistemic model uncertainty and of the diversity among samples, making it possible to retrieve an entire batch of relevant samples in a single iteration. Moreover, our algorithm has linear computational complexity and is easily parallelisable to cover large areas. We use our method to compute the first oil palm density map with $10\,$m Ground Sampling Distance (GSD) , for all of Indonesia and Malaysia and for two different years, 2017 and 2019. The maps have a mean absolute error of $\pm$7.3 trees/$ha$, estimated from an independent validation set. We also analyse density variations between different states within a country and compare them to official estimates. According to our estimates there are, in total, $>1.2$ billion oil palms in Indonesia covering $>$15 million $ha$, and $>0.5$ billion oil palms in Malaysia covering $>6$ million $ha$.


Introduction
Oil palm is the third largest oil crop in the world by planted area, and accounted for 35% of the vegetable oil production in the world in 2019 [1]. With the highest yield per hectare of any fat oil it is an attractive economic alternative in many tropical countries [2]. However, large-scale oil palm production in Malaysia and Indonesia is a potential driver of deforestation [3,4]. Several works relate oil palm development with long-lasting effects on the environment, including loss of biodiversity [5], poor air quality and high greenhouse gas emissions [6,7]. On the other hand, it plays an important role for several aspects of the socio-economical life in producer countries, including positive impacts on the welfare of smallholder producers [8]. Balancing the economic and social benefits of oil palm plantations with the impact it has on the environment is a challenging task. We refer the reader to [9] for a thorough analysis of the ethics around the palm oil industry.
Transparent, evidence-based policies for sustainable palm oil production are only possible with accurate maps at countryscale and beyond. Besides, to assess long-term impacts they shall be frequently updated. This calls for a highly automated, objective mapping process. Several authors have studied ways to map oil palm plantations with remote sensing data, e.g., [10,11,12]. They mostly focus on classifying the land cover into oil palms vs. background. Such a binary approach implicitly assumes that oil palms occur only as predominant land cover, in closed-canopy plantations with a certain minimum density, and is prone to missing smaller or less dense plantings. This may be a reason for the large variation between different estimates of planting area. We argue that a map of oil palm density (respectively, tree count per pixel) is preferable, as it is more robust to non-uniform densities and at the same time allows for a more detailed analysis.
A general bottleneck of land cover mapping with supervised machine learning is the need for accurate, manually collected labels [13]. Although for some applications public online data can be used, for instance from openstreetmap.org [14], no such data is available for more specific mapping tasks, including oil palm density. The standard procedure to ensure that the learned models generalise across large, geographically diverse regions is to collect a large and diverse set of manually annotated reference labels, but that brute-force approach is laborious, and therefore time-consuming and costly.
Active learning (AL) offers a possible solution, by focusing the (manual) labelling effort on a much smaller set of training examples that optimally represent the data distribution [15], thus reducing the total annotation effort required to achieve a given mapping accuracy. The starting point for AL is the observation that naively selected training samples usually are highly redundant, such that the marginal utility of each additional sample is low. The goal of AL is to accumulate the same evidence with a much smaller number of samples, by gradually adding new samples that are selected in an informed manner, so as to maximise their utility. The active construction of the training set is guided by the assumption that, in order to improve a (preliminary) prediction function, one must supply it with the correct answers for those inputs that it has not yet learned to handle. This leads to the following principles: 1. to decide which additional samples to label, prediction uncertainty is a useful indicator to identify samples with high marginal utility, respectively low redundancy with respect to the training data used so far. 2. if, for efficiency reasons, multiple samples are added in each round, then they should be as different from each other as possible, to avoid redundancy within the newly added batch.
The function that combines uncertainty and diversity into a score for selecting new samples is commonly termed the acquisition function.
In the context of mapping, the set of unlabelled samples corresponds to the entire target region, in our case more than 10 6 km 2 at 10 m ground sampling distance (GSD). Most existing AL methods were not designed for such extremely large datasets and do not scale well enough. For instance, many methods require that the model be retrained multiple times, and each time, it must predict the output for the entire unlabelled data set to pick new samples, which is extremely costly for large regions. Another strategy that makes it possible to extract a large and varied training set in a single shot requires the pre-computation of all pairwise similarities between candidates one may want to add in the next round [16] -the computational complexity of that operation is obviously quadratic in the data set size. To leverage AL for country-scale mapping, in our case of oil palms, one must design a strategy that remains viable with millions of pixels, which in practice means its computational complexity should scale (approximately) linearly with the area of the region (respectively, the number of pixels) and require as few active learning iterations as possible.
In this work we produce a country-scale map of oil palm density, by devising an active learning procedure that scales to large remote sensing datasets. The contribution of our paper is twofold: • We design a practical AL methodology for large-scale applications in remote sensing. The method relies on an implicit ensemble of randomly perturbed convolutional networks to estimate model uncertainty, together with an efficient core-set construction that uses distances between the data points' learned feature embeddings to efficiently assemble a representative and diverse training set. Our AL approach scales linearly with the number of unlabelled samples and requires only 2 processing cycles over the area of interest.
• By applying our AL system to Sentinel-2 imagery of Malaysia and Indonesia, we produce a 10 m GSD, wall-towall map of oil palm density for those two countries, which together account for 84% of the world's palm oil production 1 . In contrast to other oil palm mapping efforts such as [12,17,10] we map not only the presence/absence of oil palm plantations, but the tree density (oil palms per 100 m 2 pixel). The map thus provides richer information for further geo-spatial analysis. For instance, it can reveal local variations in planting patterns, production intensity, and potentially relations to yield. 2 2. Related work 2.1. Oil palm mapping Several authors have created maps of oil palm plantations from remote sensing data. Some use only Synthetic Aperture Radar (SAR) [18,11], which is unaffected by the frequent cloud coverage in tropical regions [10]. Perhaps even more popular is the combination of optical and Radar images [19,20,21,22,23,12,17,24]. Only a few works map oil palms at country-wide or continental scales. [10] provide a oil palm map over 15 countries from around the world using, ALOS-2/PALSAR-2 data from 2016 at a 25 m GSD. [17] use ALOS-2/PALSAR-2 and MODIS to generate a map of palm oil plantations in Malaysia and Indonesia at 100 m GSD, for the years 2001-2016. Recently, [12] provided a map that distinguishes between industrial and small-holder plantations. This map covers all areas suitable for palm oil production at a GSD of 10 m and was derived from Sentinel-1 (SAR) and Sentinel-2 (Optical) images from 2019.

Density estimation
Density estimation is a recurrent task in remote sensing, including for instance the estimation of population density [25,26] and canopy density [27,28]. Still, land cover is in most cases mapped only as a presence/absence label. Apparently species-specific tree density estimation has rarely been considered as an alternative. Most authors prefer to only detect trees (respectively forest or plantation areas), presumably because the size of individual trees is below the pixel footprint of wide-area satellite sensors. However, earlier work [29] has demonstrated that density estimation for sub-pixel trees is feasible. In agricultural applications in particular, density maps allow for a more advanced analysis than binary land-cover maps of the crop. This is potentially important for oil palm plantations, where the distance between trees is known to correlate with the yield per tree [30,31].

Active learning in remote sensing
A fairly large body of literature exists about active learning methods for remote sensing. Early works often relied on Support Vector Machines (SVM) [32,33,34,35]. Uncertain samples are identified as those that fall within the margin of the SVM classifier (and thus also have a high likelihood to become 1  We present the average density over 10m pixels for illustration. support vectors). Those works already recognised the importance of diversity and proposed measures to avoid that multiple similar training samples are added in the same iteration of the active learning loop [32]. In [33], the authors analyse differences and similarities between active learning and semisupervised learning and propose a method that combines them, leveraging the distance from a training sample to the support vectors as a measure of uncertainty to select new samples. In [36] Gaussian Processes were used for classification, since they offer a natural mechanism to represent model uncertainty and, thereby, select informative samples. Active learning with neural networks dates back to at least [37] in the remote sensing literature. While neural networks achieve excellent predictions in many image analysis tasks, they suffer from a well-known limitation that is critical in the context of AL, namely that their estimates of uncertainty are poorly calibrated. The standard work-around, already used by [37], is to employ stochastic ensembles of models to quantify prediction uncertainty. To reduce the computational cost, recent active learning schemes tend to avoid explicit multi-model ensembles and instead use more efficient approximations. For instance, [38] use Monte Carlo (MC) dropout [39] to estimate model uncertainty. Moreover, that paper compares different acquisition functions, and confirms that significant performance gains can be achieved with active (rather than random) selection of samples. [40] combine AL with domain adaptation to detect which samples of a new target domain need to be labelled to improve the performance of a pre-trained model on the target domain. They propose a method based on optimal transport that, however, relies on having a fully labelled training set in the source domain. [41] use AL for binary change detection between co-registered image pairs. To represent model uncertainty they use Monte Carlo Batch Normalisation (MCBN) [42], another form of randomisation within the network to approximate an ensemble. Besides showing that MCBN can indeed substitute an explicit ensemble without performance penalty, the authors also show that AL tends to balance the training examples if the class distribution is highly imbalanced -which is often the case in remote sensing when some comparatively rare target class should be separated from a dominant "background". [16] proposes a single shot active learning approach where a relatively large and diverse core-set of samples is extracted from the unlabelled dataset. The main characteristic of the method is that the active learning is not iterative: the core-set is extracted only once and neglects prediction uncertainty. The limitation with this approach is that the cost of building such a core-set is quadratic in the size of the unlabelled dataset.
A main challenge, without an accepted standard solution, remains the question how to scale AL to large-scale scenarios. The typical scenario in remote sensing is that the "unlabelled data", from which one has to select the samples to be annotated, is the entire area of interest with hundreds of millions of pixels. But most AL methods either are iterative, requir-ing a new prediction pass over the unlabelled set each time the model is retrained, or, in order to build larger effective sets for labelling, have a quadratic complexity with respect to the unlabelled dataset (i.e., at least implicitly they form and examine all possible pairs of samples in the unlabelled set). To remain tractable one would therefore have to limit them to a tiny portion of the mapping area, which, if done naively, runs the risk of missing important parts of the data distribution. We aim to solve these problems with an approximate measure of sample diversity that has computational complexity linear in the size of the unlabelled dataset, yet makes it possible to extract a diverse set of samples with few (in practice even a single) active learning iterations.

Area of study
This study focuses on a geographical area that comprises the countries of Malaysia and Indonesia. We produce a 10 m GSD, wall-to-wall map of oil palm density for those two countries, which together account for 84% of the world's palm oil production. This covers a land area of 1.3 · 10 6 km 2 or 1.3 · 10 9 Sentinel-2 pixels. See Figure 1 for an overview of the oil palm density map. According to our estimates there are, in total, >1.2 billion oil palms in Indonesia covering >15 million ha, and >0.5 billion oil palms in Malaysia covering >6 million ha. See more details in Section 5.

Sentinel-2 imagery
As input to our model, we use Sentinel-2 Level-2 bottomof-atmosphere reflectance images acquired in 2017 and 2019. The Level-2 images were obtained using sen2cor [43]. All channels are re-sampled to a common pixel size of 10 m with bicubic interpolation and stacked into 13-channel images. First, we filter out all pixels that are labelled as cloud, cloud shadow, water or snow in the scene classification layer provided by sen-2cor [43]. Moreover, we use only pixels with <50% cloud probability for training (according to the cloud mask from sen2cor), and only pixels with <10% cloud probability for validation and inference. We use a different cloud probability for training to be more robust to different cloud conditions. At test time, the stricter cloud threshold yields a cleaner estimate less influenced by clouds.
To create a dense map without gaps, we process each image separately and then average all predictions at each location over the entire year. Instead of directly mosaicing images into a single composite before processing, we process all images and perform a late fusion of results. Major advantages are that we avoid radiometric distortions from composite images with different atmospheric conditions and, more importantly, we ensure that the model learns to process images with a great variety of atmospheric conditions, such that it can be applied to any Sentinel-2 image without retraining.
We download and process Sentinel-2 images for the entire years 2017 and 2019. Using all available images allows us to compute dense, wall-to-wall maps without holes that provide an oil palm density estimate at every land pixel across Malaysia and Indonesia, despite the high cloud coverage that is common in tropical South-east Asia.

Reference data
To obtain reference oil palm density maps to train and evaluate our method, we rely on very high-resolution overhead imagery (GSD < 30cm). At that resolution, individual trees are clearly visible and we thus manually annotated a total of 3.4·10 6 oil palms. All annotations were made with images acquired in 2017. A 30-min training was required for students to learn to identify oil palm plantations and not confuse them with other crops such as coconut or sugar palms that are also present in the area of the study. Only general notions of suitable areas of oil palm plantations were required for labelling (e.g., tropical areas and elevations below 1500m [44]). We aimed at labelling blocks of at least 250 ha each spanning large geographical areas. Manual annotators were asked to label oil palm plantations and to include as background examples other types of land cover, such as coconut plantations and forests. Inside each block, we obtained the centre location of each oil palm as observed on the very-high resolution imagery; to obtain a smooth density map at Sentinel-2 resolution (GSD 10m) we applied a smoothing operation with a squared kernel of 20 × 20 meters on a high-resolution grid (GSD 0.625 m) of the oil palm locations and then summed the densities under each Sentinel-2 pixel. This results in local oil palm density maps that can be used for training and validation of our proposed method. See more details in Section 5.2 about the geographical split of train and validation areas.

Estimation of oil palm density
Following [29], instead of attempting to predict the locations of individual oil palms, which are smaller than the GSD of the input images, we predicted the count of oil palms per pixel. This way, the estimation of oil palm density can be solved as a supervised regression task. We used a Convolutional Neural Network (CNN), F(x), that predicts the count of oil palms per pixel,ŷ, using the raw intensities of Sentinel-2 image x. The architecture of the CNN (i.e., layers, hyperparameters) is based on our previous work [29]. F(x) consists of an input convolutional layer, and 15 residual layers of the form x out = h(x in ) + x in . The residual layers [45] learn additive updates to the input features, which has emerged as a particularly successful strategy for many computer vision and image analysis tasks. In our case, each function h(x) has three sequential convolutional layers. After every layer we include (i) batch normalisation [46], and (ii) a Rectified Linear Unit (ReLU) as activation function. Batch normalisation tracks the second-order statistics in each batch and uses them to standardise the values between different layers so as to reduce the sensitivity of the network to small changes between the batches of samples that are fed during training. ReLU is an elementwise non-linear transformation of the form x out = max(0, x in ) (it 3. Solve k-means with B centres using .  is a basic principle of neural networks to interleave linear layers with element-wise non-linearities to approximate complex functions [47]). Furthermore, we have two different branches as output: (i) y for the predicted count, and (i) y cl for a binary classification task between pixel with oil palms and the background. Note that at each layer we kept the same resolution of the image from input to output, different from other architectures that reduce the resolution of the input image with an encoder and then up-sample the resolution with a decoder [48,49,50] . This allowed us to retain the input resolution and map at the original GSD without upsampling issues. See Section Appendix A.1 for more details on the architecture in the supplementary material. To train the model parameters of F(x) we used a reference density map that was obtained by identifying individual oil palms in high-resolution imagery as described in Section 3.3. The loss function to train the model is a sum over the two branches, The right-most part of Eq. 1 is the cross-entropy (CE) between the predicted class and a binary indicator that separates background pixels with density 0 from oil palms. Specifically, we used F(x) to perform a regression for each pixel in the image, where we only predicted the number of oil palms per pixel in the image. For further details, refer to Rodriguez et al. (2018) [29]. For simplicity, from now on we do not distinguish between the density prediction and the auxiliary classification output and simply denote the network output as y.

Active learning
To start with, we defined our interest region X made up of a very large number N of samples x. In our case, each individual 10m pixel from Sentinel-2 Images in the area of study is one sample x. We created an Active set A for which we annotated a subset of x (i.e., Sentinel-2 pixel) with the desired outputs y (i.e., oil palm density) using an annotation budget B, such that we can fit the model parameters θ and then apply the model to all the remaining (unlabelled) samples. The goal of active learning is to cleverly pick the most informative training data points until exhausting the budget B, such that we capture the most relevant correlations between radiometric input patterns in x and output densities y.
This leads to a chicken-and-egg problem: in the absence of any information about the model F we cannot measure how important a sample x is for fitting it, except for the general intuition that the sample set should be diverse, so as to capture the variability of the inputs. Hence, one has to annotate an (ideally, small) initial training set L = {x l , y l } and train a preliminary model F 0 . That function can now be applied to the set U = {x u } of unlabelled samples to obtain estimatesŷ u = F 0 (x u ).

Acquisition function
The fundamental assumption of active learning is that one can determine fromŷ u which samples should be labelled to improve the model, even without having access to the true values y u (and thus to the correctness ofŷ u ). To that end the estimates are scored with an acquisition function g(ŷ), and the ones with the highest scores are used to construct A = {x * }, their corresponding labels y * are obtained with manual annotators, and a new model F 1 is obtained using L and the new samples {x * , y * } The function g(x) is typically based on the estimated uncertainty of the prediction, often in the form of a predicted variance σ 2 (ŷ). Following the reasoning that the model must be improved in those regions of the data space where it is uncertain [51,38,41].
Even if g(ŷ) is an effective proxy for the actual objective to decrease the prediction error, the described procedure is only optimal if new samples are added one by one until the available annotation budget B is exhausted at iteration F B θ . Unfortunately, it is not practically viable to run the computationally expensive model fitting every time a sample has been added, so one has to add multiple samples at a time to the training setideally one could add all B samples in one shot and retrain only once. This leads to a new problem: the set A containing the B highest scoring samples can (and often will) contain redundant patterns, i.e., the scores g(ŷ i ), g(ŷ j ) are similarly high because their corresponding image observations x i , x j are also similar. But showing the same pattern to the model multiple times at the cost of excluding other, yet unseen ones is wasteful and will undermine the idea of active learning.
Consequently, the acquisition function, in this case, should be a set function that evaluates the benefit of adding an entire set of training data points, rather than individual points.

Core-set approaches
A core-set is a smaller subset that summarises a large data collection in terms of the relevant properties for some taskfor instance in the case of clustering, a good core-set is one that yields clusters similar to the ones obtained by clustering the complete (usually much larger) data set [52]. In our case, a good core-set is a training set that leads to predicted outputs similar to those that would be learned if annotations were available for all data points. The use of core-sets for active learning was pioneered by [16]. However, their work has two major drawbacks: (i) it ignores the model uncertainty; and (ii) it does not scale to large data sets, as it requires an exhaustive set of pairwise (dis-)similarities between data points to assess diversity. This second point highlights a more general principle: set functions are in general more expensive to compute than per-point scores, so we need to design our acquisition function carefully to ensure it remains tractable even for large values of N and B.

Proposed acquisition function
To address the two limitations previously mentioned, we propose to construct A with samples to be labelled by using the following acquisition function: where µ is the centre of gravity (mean value) of the projected data points x j ∈ U in a suitable feature space z, and d z is the corresponding distance measure. Instead of computing distances directly in the input space of image intensities, we compute them in a space z that provides meaningful distances with respect to our task at hand. In our case the natural choice is to compute the (Euclidean) distances between deep activation maps in the deep network, see Section 4.3 below. The first term is simply the variance of the prediction (normalised over the unlabelled set) and corresponds to the standard assumption in active learning, that points with high uncertainty should be added to the training set. The price to pay is that the g(x) is no longer a true set function, and does not prevent the selected samples from lying far from the mean but close to each other in the data space.

Active learning set A construction
Our solution to re-introduce pairwise dissimilarity is to select a larger set of q candidate points (in our implementation, q = 1 · 10 5 ) with the highest scores g(x i ), then cluster them into B clusters with weighted k-means, assigning each sample the weight 1 q·g(x i ) . To construct A, from each cluster we retained only the candidate that lies nearest to the cluster centre. This simple trick ensures that no two training samples are closer to each other than the cluster radius. Note that, moreover, the clustering makes the selection more robust against outliers, which might otherwise dominate the core-set selection, as they tend to lie far from the mean µ. See Figure 2 for an overview of our entire workflow.

Implementation
After introducing the general framework, we go on to discuss practical details of the technical implementation, with a particular focus on large-scale deployment with images totalling hundreds of giga-pixels.

Uncertainty Estimation
To select informative samples, we must estimate the prediction uncertainty σ 2 (ŷ) (first term in eq. (2)). More specifically, we are looking for samples with high epistemic uncertainty [53] , i.e. the uncertainty that arises from the model parameters due to limited training data. A standard way to represent epistemic uncertainty numerically, if the model fitting includes stochastic elements, is to learn multiple instances of the model and view their predictions as Monte-Carlo samples from the posterior distribution of the outputŷ. For the case of neural networks, random initialisation and stochastic gradient descent provide the basis for such an ensemble strategy. A more efficient alternative is Monte-Carlo dropout [39], where the model ensemble is approximated by multiple runs of the same model, where in each run a given proportion of the model parameters is suppressed. In this way one avoids having to train multiple mod-els, at the cost of potentially less well-calibrated uncertainty estimates [54]. With an ensemble of T model predictions it is straight-forward to derive an ensemble predictionȳ = 1 T T t=1ŷ t and and uncertainty estimate Distance function and deep feature embedding To quantify the distance d z (x i , x j ) between two samples we used the learned feature representation in the second-last layer of the neural network F. To achieve a meaningful embedding that is representative across the complete area of study, we also included a pixel's location, such that the distance function can adapt to geographical variations. To inject the geographical location into the network we used Space2Vec encoding [55], which relies on a grid representation of relative and absolute positions and allows the model to learn correlations at different scales. Intuitively, Space2Vec uses each pixel (longitude and latitude) location and computes its angle with respect to different directions in space, relative distances from each directions are computed at different scales, this allows our model to incorporate information from other pixels at different scales. For more details refer to [55].
Overall, the following transformation is applied to an input x at location (x lat,lon ): where [r, e] denotes concatenation along the channel dimension. Space2Vec is trained on our small, initial training set L. The function F e just reads out the activations (i.e. the output a specific layer in the CNN) from the density prediction network F. The simple attention mechanism A consists of a 1 × 1 convolution followed by a sigmoid function and determines the relative importance (weight) of the features from each pixel in the image. The final embedding F z (x) is the product of the CNN activations and the location encoding; since the latter is a value between [0, 1], it can be interpreted as a re-weighting value for CNN activations. The similarity between two data samples is then defined as the squared Euclidean distance between their embeddings, d z (x i , x j ) = z i − z j 2 2 .

Large scale implementation
The second term of Eq. (2) is the distance from a sample x to the dataset mean µ. Computing each individual distance d z (x, µ) 2 as well as the cumulative distance can easily be parallelised across multiple machines for large-scale deployment. To do this one simply splits the area of interest into Q mutually exclusive regions and processes each of them on a different machine. We first computed the number of samples n, the sum v and sum of squares w of the deep embedding z, for all samples in Q. Then we collected all statistics from all regions and compute the global statistics N = Q n and µ = Q v/N. Now, in parallel for each region, we can compute the average distance to the mean d z (x q , µ) = w−2vµ+ Nµ 2 , and after collecting those distances from all regions we can compute the global distance to the mean and evaluate g(x) for all the regions Q. For uncertainty the procedure is straightforward as, in order to compute the global uncertainty, we only require the cumulative sum over the uncertainties inside each region Q. See Algorithm 1 in the supplementary material for the pseudo-code of active learning method and Section Appendix A.1 for more details on the architecture.

Active learning reference dataset
A common procedure to validate active learning methods is to take a fully annotated dataset and simulate the active learning procedure, by letting the acquisition function select new "unlabelled" samples and reading out their labels. However, that strategy is not viable for realistic, large-scale applications like ours. Labelling all oil palms in South-East Asia manually to benchmark our method is prohibitive. We thus restrict the validation experiments -and thus the labelling effort -to a smaller subset and label 126 blocks of at least 250 ha each in Sumatra, Indonesia and Peninsular Malaysia. Each block consists of approximately 160×160 pixels of Sentinel-2 data. As a base dataset, we randomly sample 10 training blocks and 10 validation blocks. The "unlabelled" dataset U is formed by the remaining 106 blocks. As in this setup the unlabelled dataset U is not large, we skip the k-means clustering step and directly use the top-k samples with respect to the acquisition function g(x) during active learning. For measuring the uncertainty of the estimatesŷ, we evaluated the quality of MC-dropout vs. an explicit ensemble. In line with other works, we find that uncertainty estimated with an explicit ensemble of 5 models showed  Using different sample sizes B, we evaluate how much active learning improves performance compared to the original base dataset, and compare it against two alternative approaches: Naïve selection: here we add the same number of samples as with active learning, but in this case we pick random clusters of samples that are geographically close; thus simulating a naïve annotator.
Manual selection: To manually sample the area of interest, an expert annotator should aim to cover as good as possible the area of interest in terms of visual diversity and geographical location. This is how the 106 manually selected available samples were constructed, so to simulate different number of samples, we uniformly choose from that manually curated set.
In both cases, we run the experiment with five different random seeds, to quantify how much variance is introduced by the randomness of the selection. Fig. 3 illustrates an example of the selected samples for the different strategies. As it can be seen the three methods behave differently. The naïve and manual annotation strategies behave according to their design: the former selects points close to each other, whereas the latter annotates points spread across the entire region. Active learning results in a mixed strategy that aims for geographical spread, but avoids certain regions where none of the samples has high uncertainty. Figure 4 shows the estimation error of the three methods for different annotation budgets B. The manual and naïve annotation strategies lead to similar performance, with only slightly lower errors for manual selection, within the standard deviation of different runs. Active learning, on the other hand, consistently reduces the error, at all annotation budgets. Note, no standard deviation can be given for AL, because in the closed world of the available 106 annotated samples the selection is  40 20 0 20 40 Trees/ha deterministic, and the same points will be picked in each run. We further note that the advantage of AL is smaller with very few samples and increases with the sample size B, a behaviour also observed in [16].

Large-scale mapping
We start with a manually labelled base dataset of 250 regions (each with at least 250 ha), out of which which we use 166 for training, and 84 for validation. After training a model on the base training dataset, we aim to predict and label 50 new active samples. We define each region q to have size 144 ha, which in Sentinel-2 pixels represents an image patch of 120 2 pixels. This results in 2 million different regions, after removing regions completely covered by water. We calculate the global mean µ of the z(x) embedding from all regions in the area of study, to compute the acquisition function g(x) for each region.  across the 12 • strips proportional to their land area. The active learning samples were then manually annotated, resulting in the final dataset shown in Figure 5. To visualise how diverse our chosen samples are compared to the rest of the dataset, we construct a t-SNE visualisation [56]. t-SNE non-linearly projects high-dimensional vectors (in our case the deep features F z (x) from each sample) down to a low-dimensional vector (in our case 2-D vectors) while minimally distorting their distances. Figure 6 shows the t-SNE visualisation of our core-set samples. The actively selected samples (shown as red points) are well distributed across all deep features (blue), which indicates a high diversity among the selected samples. Furthermore, note that the selected samples are not only on oil palm plantations as the model may also be uncertain about other types of vegetation. We therefore added many samples on forest or crops that do not contain any oil palm trees. In practice, we ended up with 46 labelled areas; this difference arose because our method chose samples in areas where no cloud-free high-resolution images were available for labelling.

Quantitative validation of oil palm densities
We quantitatively evaluate our oil palm density estimates using validation regions not shown to the model during training. All reference data was obtained by manually annotating individual trees in very high resolution overhead images (GSD <30cm). At the Sentinel-2 pixel size of 10 m the number of trees per pixel is low (generally <2), and the definition uncertainty of plantation boundaries is larger than the pixel size. Thus, we aggregate both ground truth counts and predict densities into tree counts per 10×10 pixel block (1 ha) for the validation 3 . Evaluating our error on 1 ha blocks allows us to compare if the method manages to retrieve the density for meaningful numbers of trees and avoid random fluctuations at pixel level. As shown in Fig. 7  over all validation sites is 7.30 trees/ha. For comparison, using only the base training dataset, the MAE was 10.14 trees/ha, in other words the error without AL samples would be 39% higher. Fig. 8 illustrates how the additional samples improve the prediction and reduce its uncertainty, in particular suppressing spurious low-density predictions in regions without oil palms. Fig. 9 shows the true and predicted densities for two example blocks, and the corresponding errors at ha level. Although some extreme density values are underestimated, the overall counts and structure of the plantations are retrieved with high correctness.
Since it is not feasible to manually label several billion individual oil palm trees of all Malaysia and Indonesia, the evaluation of oil palm density maps beyond the manually labelled validation samples has to remain limited to visual inspection and a qualitative analysis. We find that extending our initial training set of 166 blocks with just 46 additional blocks selected by AL greatly improves the map (Fig. 8).

Analysis of oil palm densities
To the best of our knowledge, we provide the first map of high-resolution oil palm densities across the major planting regions of South-East Asia. According to our estimate, there were 0.55 · 10 9 oil palms in Malaysia and 1.28 · 10 9 oil palms in Indonesia in 2017. Assuming a cut-off threshold of >0.2 trees/pixel for planted crop areas, we estimate the total area of palm oil plantations in 2017 to be 6.24 · 10 6 ha in Malaysia, respectively 16.18 · 10 6 ha in Indonesia. For 2019, we estimate 0.54 · 10 9 oil palms covering 6.17 · 10 6 ha in Malaysia and 1.23 · 10 9 oil palms covering 15.29 · 10 6 ha in Indonesia.
Per-pixel tree counts allow us to evaluate how the tree density varies across different locations, or in function of other geographical factors. We show densities for individual states of Malaysia and Indonesia in Fig. 10 for 2017 and total estimated oil palms and covered areas for Malaysia in Fig. 11 and Indonesia in Fig. 12. same year, we can compare the density distributions in smallholder plots versus large-scale plantations. By the definition of [12], smallholder oil palm plantations are smaller than 25 ha, have heterogeneous tree age, and are less structured in terms of shape and layout. Some of these features correlate with density. In Fig. 13 (top row), we can see industrial and smallholder plantations as mapped by [12]. Our findings support those empirical findings of [12]: smallholder plantations exhibit lower tree densities and strong, local density variations. Nonetheless, distinguishing smallholder plantations from large-scale plantations is difficult from Sentinel-2 satellite imagery. For example, there is no clear threshold to distinguish smallholder from industrial plantations based solely on density, because tree density inside industrial plantations can vary significantly, too ( Fig. 13 (bottom row)). On our maps one can also see density prediction highlights detailed structures inside individual plantations. In industrial plantations, for example, much lower densities are observed on the access roads between blocks of oil palms.

Comparison between 2017 and 2019 oil palm density maps
We did not observe systematic density shifts from 2017 to 2019, which indicates that our model generalises well across different years.This allows us to evaluate density changes between 2017 and 2019 as shown in Fig. 14. In Central Kalimantan, for example, the median increased from 1.035 to 1.055, which is most likely caused by new plantings with higher densities, as can be observed in the histogram. A detailed view of an example area with a large density shift is displayed in Fig. 15a, where higher density areas appear to correspond mostly to new plantations in 2019. Furthermore, in Malaysia some areas showed a decline in the amount of planted oil palms, corresponding mostly to replanting schemes (Fig. 15b).

Validation of predicted land-cover
Although our primary goal is density estimation, we can also compare our approach with land cover maps that only show binary presence/absence of oil palms. Recently [12] published a worldwide map that focuses on classifying smallholder vs. industrial plantations. See Figure 16 for a comparison in selected areas.
The Malaysian Palm Oil Board (MPOB) [58] provides monthly statistics about the area planted with mature oil palms in every state. In Figure 17 we compare their estimates, [12] and ours. For Indonesia, we obtain official reports for 2019 from the Ministry of Agriculture [59]. We compare in Table 1 those to area estimates from our map, with cut-off thresholds of >0.2 and >0.4 trees/pixel. The lower threshold yields estimates closer to the official statistics in Indonesia, where the area has apparently been over-estimated, both with respect to our estimates and to [12]. On the other hand, thresholding our estimates at 0.4 aligns better with [12]. These large differences in Indonesia could possibly be explained by the way the plantations are delineated for the official statistics.
Overall, we observed a slight reduction of total planted area of 1.15% for the whole country. For example, in Sabah, Malaysia we found a decrease of 6.95% (in contrast to official statistics reporting a decrease of only 1.9%). As illustrated in Fig. 15b changes can be due to replanting schemes on previously densely planted areas.

Discussion
Our experiments indicate that oil palm density can be efficiently retrieved at large scale from Sentinel-2 imagery. Perpixel tree density has a number of potential advantages over conventional presence/absence land cover maps. One advantage of density maps, compared to conventional land cover maps, is additional evidence about changes in palm tree density across Our work presented in this paper is limited to Malaysia and Indonesia, which produce over 80% of global palm oil. In order to inform at global level, we need to expand to further grow- 4 Figures for the total FFB yield of 2017, according to [58] ing regions like West Africa. We are currently using multispectral satellite images from the two Sentinel-2 satellites as data source. The major disadvantage of optical sensors is that they cannot see through the frequent cloud cover in tropical regions. We circumnavigate this shortcoming by computing density across stacks of Sentinel-2 images per location, in the hope that any point on the ground will be visible at least once per year. However, this incurs a substantial computational burden. Like [12] we will thus explore the possibility to add Sentinel-1 SAR imagery to our approach, to be more robust against dense cloud cover.
In Indonesia, official estimates were almost always higher than our estimates and those from [12]. In fact, different cut-off thresholds between oil palms and background for our method yield different estimates that range between the two sources. This could be an indication that the methodology used to report the official planted area differs from what we define as oil palm areas. For example, at a density threshold of 0.4, roads inside large-scale plantations would be excluded in our map, which is almost certainly not the case for the official statistics and for [12].
We will make all 10 m resolution maps for 2017 and 2019 available upon publication of this paper.

Conclusions
We have proposed an active deep learning approach with a computationally lightweight acquisition function g(x) that selects large datasets for labelling very efficiently. Our active learning strategy allows to iteratively chose an optimal set of samples for interactive labelling, striking a balance between high diversity and high uncertainty of the selected samples. Approach and algorithm are designed to scale to entire world regions with billions of individual object instances. We hope to have shown that active (deep) learning is an excellent tool for remote sensing applications to environmental problems at very large scale, where manual annotation of sufficiently large ground truth is prohibitive.
We have applied our AL method to compute the first dense, 10-meter resolution map of oil palm densities of the world's two main producers, Malaysia and Indonesia, with a Mean Absolute Error of ±7.30 trees/ha. According to our maps, in 2019 there were 0.54 billion oil palms covering 6.17 × 10 6 ha in Malaysia and 1.23 billion oil palms covering 15.29 × 10 6 ha in Indonesia.

Credit author statement
A.C.R was responsible for the design and development of the methods, running experiments and writing the paper. S.D., K.S. and J.D.W. were responsible for creating the research design, writing and editing. All authors discussed the results and contributed to the final manuscript Algorithm 1 Large-scale Active Learning Implementation Require: ensemble of T trained models F t , M machines, annotation budget B, Unlabelled dataset U Output: Set A = {x * } with B samples to be labelled by human experts.
1: Split U in Q regions. 2: for each region q ∈ Q do In parallel on M machines 3: for each sample x ∈ q do 4: Obtain predictionsŷ = t F t (x)/T 5: Compute uncertainty s = t (ŷ − F t (x)) 2

6:
Obtain pixel count in n x

7:
Obtain deep embedding statistics v = t F z t (x) and w = t F z t (x) 8: Compute region statistics s q = x∈q s, n q = x∈q n x , v q = x∈q v, w q = x∈q w 9: From each region, retrieve s q , n q , v q and w q 10: Compute total pixel count and global mean N = n q , µ = v q /N 11: for each region q ∈ Q do 12: Compute total distance to the global mean d z (x q , µ) = w q − 2v q µ + Nµ 2 13: For each region g(x h ) = s q q s q + d z (x q , µ) q d z (x q , µ) 14: Construct set C with the q · B regions with highest g score 15: Cluster set C into B clusters with weighted k-means, where each region has weight 1/(q · B · g(x i )) 16: Construct set A = {x * } by obtaining the samples x * closest to each of the B centroids.

Architecture details
We implement all our experiments in Tensorflow using Python, see Tab. A.2 for the specific architecture details of our network F. For training we use a patch size of 16 × 16 pixels, a batch size of 128, trained for 100 epochs and used ADAM optimiser 5 with a learning rate of 1 · 10 −4 . From all labeled areas we randomly extracted a total of 1·10 6 patches proportional to the area of each labeled area.

Appendix A.2. Quality of uncertainty estimates
Uncertainty estimates are only useful if they are reasonably well calibrated with respect to the validation error we might expect. In Figure A.18, we constructed a precision-recall curve 5  x represents an input image from Sentinel-2, the outputŷ is the estimated density for the corresponding input image x, y cl is the output of the auxiliary classification task, used only for training. Conv2D(n, k) is a convolutional operations with n filters and a filter of size k, before every Conv2D(n, k) a BatchNorm Layer is applied [46]. based on the predicted uncertainty and the residual values from validation data. The curve shows how the average MSE reduces by removing samples with predicted uncertainty above percentile thresholds. Aligned with the related works, we can see that the estimates of an ensemble of 5 models yield much better calibrated estimates for the uncertainty than a MC-dropout.