Country-wide data of ecosystem structure from the third Dutch airborne laser scanning survey

The third Dutch national airborne laser scanning flight campaign (AHN3, Actueel Hoogtebestand Nederland) conducted between 2014 and 2019 during the leaf-off season (October–April) across the whole Netherlands provides a free and open-access, country-wide dataset with ∼700 billion points and a point density of ∼10(–20) points/m2. The AHN3 point cloud was obtained with Light Detection And Ranging (LiDAR) technology and contains for each point the x, y, z coordinates and additional characteristics (e.g. return number, intensity value, scan angle rank and GPS time). Moreover, the point cloud has been pre-processed by ‘Rijkswaterstraat’ (the executive agency of the Dutch Ministry of Infrastructure and Water Management), comes with a Digital Terrain Model (DTM) and a Digital Surface Model (DSM), and is delivered with a pre-classification of each point into one of six classes (0: Never Classified, 1: Unclassified, 2: Ground, 6: Building, 9: Water, 26: Reserved [bridges etc.]). However, no detailed information on vegetation structure is available from the AHN3 point cloud. We processed the AHN3 point cloud (∼16 TB uncompressed data volume) into 10 m resolution raster layers of ecosystem structure at a national extent, using a novel high-throughput workflow called ‘Laserfarm’ and a cluster of virtual machines with fast central processing units, high memory nodes and associated big data storage for managing the large amount of files. The raster layers (available as GeoTIFF files) capture 25 LiDAR metrics of vegetation structure, including ecosystem height (e.g. 95th percentiles of normalized z), ecosystem cover (e.g. pulse penetration ratio, canopy cover, and density of vegetation points within defined height layers), and ecosystem structural complexity (e.g. skewness and variability of vertical vegetation point distribution). The raster layers make use of the Dutch projected coordinate system (EPSG:28992 Amersfoort / RD New), are each ∼1 GB in size, and can be readily used by ecologists in a geographic information system (GIS) or analytical open-source software such as R and Python. Even though the class ‘1: Unclassified’ mainly includes vegetation points, other objects such as cars, fences, and boats can also be present in this class, introducing potential biases in the derived data products. We therefore validated the raster layers of ecosystem structure using >180,000 hand-labelled LiDAR points in 100 randomly selected sample plots (10 m × 10 m each) across the Netherlands. Besides vegetation, objects such as boats, fences, and cars were identified in the sampled plots. However, the misclassification rate of vegetation points (i.e. non-vegetation points that were assumed to be vegetation) was low (∼0.05) and the accuracy of the 25 LiDAR metrics derived from the AHN3 point cloud was high (∼90%). To minimize existing inaccuracies in this country-wide data product (e.g. ships on water bodies, chimneys on roofs, or cars on roads that might be incorrectly used as vegetation points), we provide an additional mask that captures water bodies, buildings and roads generated from the Dutch cadaster dataset. This newly generated country-wide ecosystem structure data product provides new opportunities for ecology and biodiversity science, e.g. for mapping the 3D vegetation structure of a variety of ecosystems or for modelling biodiversity, species distributions, abundance and ecological niches of animals and their habitats.

How the data were acquired We acquired the raw LiDAR data (AHN3 point cloud dataset) from the repository of the PDOK webservices ( https://app.pdok.nl/ahn3-downloadpage/ ) using a script for automatic downloading (available on GitHub: https://github.com/eEcoLiDAR/downloadAHN ). We subsequently processed the multi-terabyte point clouds from AHN3 with the 'Laserfarm' workflow ( https://pypi.org/project/laserfarm/ ) into 25 raster layers of ecosystem structure at a national extent with 10 m spatial resolution. In brief, the Laserfarm workflow (1) splits the raw data into tiles of appropriate size based on a defined grid (re-tiling), (2) calculates the normalized vegetation height for each individual point as the height relative to the lowest point within a defined grid cell (normalization), (3) calculates 25 LiDAR metrics with a defined spatial resolution (feature extraction), and (4) merges all tiles for each metric into raster layers in GeoTIFF format (rasterization). A detailed description of this high-throughput LiDAR workflow is provided in a paper describing the Laserfarm design, implementation and its performance [1] . In addition to the 25 LiDAR metrics, we calculated the AHN3 point density (# of points) for each 10 m x 10 m grid cell using the point density feature from the 'Laserchicken' software [2] which is also incorporated into the Laserfarm workflow [1] . Finally, we derived a mask (with 10 m spatial resolution) capturing water surfaces and human infrastructures (e.g. buildings and roads) by aggregating and rasterizing the water, building and road polygons from the shapefiles of the 2018 Dutch cadaster data (TOP10NL, https://zakelijk.kadaster.nl/-/top10nl ). For this step, a Jupyter Notebook was employed which is available on GitHub ( https://github.com/eEcoLiDAR/AHN/tree/main/AHN-mask ).

Data format
Raster layers in GeoTIFF format (10 m spatial resolution) derived from (1) processing raw data (AHN3 point clouds) into LiDAR metrics capturing different aspects of ecosystem structure (25 raster layers), (2) calculating AHN3 point density (1 raster layer), and (3) aggregating water surfaces and human infrastructures (e.g. buildings and roads) from shapefiles of the Dutch TOP10NL cadaster data (1 raster layer).

Description of data collection
Only the AHN3 class '1: Unclassified' includes vegetation points. Normalizing the height (z-values) of all points in the AHN3 class '1: Unclassified' relative to the terrain surface was done with the lowest point within a 1 m × 1 m cell using the 'Normalize' module of 'Laserchicken' ( https://laserchicken.readthedocs.io/en/latest/#normalize ).

Data source location
The raw data (LiDAR point clouds) have been obtained by the third Dutch national airborne laser scanning flight campaign (AHN3

Value of the Data
• Ecosystem structure data are important for understanding, modelling and predicting biodiversity because the species richness, composition, distribution and abundance of organisms and their habitat preferences (e.g. nest sites and shelter), food provisioning and foraging are tightly linked to the horizontal and vertical heterogeneity of vegetation. • The physical structure of ecosystems also influences microclimates at the land-air interface via respiration, heat and energy exchange. This affects species behavior, growth, reproduction, and survival. Predictions of species and ecosystem responses to global change thus require high resolution data of ecosystem structure to account for temperature buffering near the ground and microrefugia within landscapes. • Measurements of the vertical structure of forests and other ecosystems are also critical for accurately assessing biomass and carbon storage, and how land use changes, ecosystem restoration or variations in climate may impact atmospheric CO 2 concentrations. • Spatially contiguous, high resolution raster layers of ecosystem structure will thus be beneficial for a range of users, including field biologists, ecologists, conservationists, ecological modelers, geoinformaticians, land managers and environmental system analysts. Researchers from other domains (e.g. hydrology and climatology) might also take advantage of such data. • Ecosystem structure data can be used as predictors in statistical models (e.g. profile models, regressions, machine learning and geographical models) to correlate species observations with environmental layers. Such predictive habitat distribution models aim to quantify and map the determinants of species' ecological niches and their ability to cope with climate or land use change.

Objective
This dataset was generated during the development of the Laserfarm workflow, a highthroughput pipeline for generating geospatial data products of ecosystem structure from airborne laser scanning point clouds. The aim was to create a high-resolution (10 m) geospatial dataset of 25 LiDAR metrics (i.e. raster layers in GeoTIFF format) from multi-terabyte LiDAR point clouds at a national extent (i.e. across the Netherlands), capturing various aspects of ecosystem structure, including vegetation height, cover and structural complexity of vegetation. This provides a standardized set of Essential Biodiversity Variables (EBVs) derived from LiDAR for the EBV 'Ecosystem Vertical Profile' [4 , 5] and thereby facilitates the monitoring and modelling of biodiversity and ecosystems [6][7][8] . The original research article related to this data publication describes the design principles, architecture, implementation and performance of the Laserfarm workflow, and the statistical relationships among the 25 LiDAR metrics. Here, we describe the specific details and mathematical description of each LiDAR metric, perform a validation with hand-labelled points to quantify the misclassification rate and the accuracy of the LiDAR metrics, and provide a mask of water surfaces and human infrastructures (buildings and roads) from the Dutch cadaster data to minimize inaccuracies related to misclassifications. The overall objective of this data paper is therefore to provide an open-access dataset of ecosystem structure variables for modeling species distributions [9 , 10] and ecological niches [11] , for analyzing biodiversity in relation to vegetation structure and land use [12] , and for mapping land cover types [13] and other habitat features such as hedges and tree lines [14] .

Raster layers of ecosystem structure
A total of 25 LiDAR metrics of ecosystem structure were calculated ( Table 1 ). The spatial resolution of grid cells was 10 m and the spatial extent was the whole Netherlands. The Table 1 Twenty-five LiDAR metrics capturing ecosystem structure in three key dimensions (ecosystem height, ecosystem cover and ecosystem structural complexity). All metrics were calculated with the normalized point cloud, using the Dutch AHN3 point clouds as input and the features from the 'Laserchicken' software ( https://laserchicken.readthedocs.io/en/latest/#features ). More details on metric calculation are provided on GitHub ( https://github.com/eEcoLiDAR/laserchicken ) and on the 'Laserchicken' documentation page ( https://laserchicken.readthedocs. io/en/latest/ ). The processed 10 m resolution GeoTIFF files are available from the Zenodo repository [3] .
where R i are the residual after plane fitting, and R the mean of residuals AHN3 point cloud was used as input (see above 'Data source location') and metric calculation was performed with the Laserfarm workflow ( https://pypi.org/project/laserfarm/ ), using the feature extraction module from the 'Laserchicken' software ( https://laserchicken.readthedocs.io/en/ latest/#features ). For each of the 25 calculated LiDAR metrics, the GeoTIFF file name, the Laserchicken feature name, the formula, a general description and its ecological relevance is provided ( Table 1 ). The metrics are grouped into three key dimensions of ecosystem structure, following a standardized framework of ecosystem structure variables in the context of EBVs [4] , namely ecosystem height, ecosystem cover and ecosystem structural complexity. All metrics were calculated with the normalized point cloud using predominantly vegetation points (class 'unclassified' from AHN3 classification provided by 'Rijkswaterstraat'), except the pulse penetration ratio which additionally requires ground points.

Validation
For generating the 25 LiDAR metrics of ecosystem structure we used the ASPRS classification code '1: Unclassified' [15] of the AHN3 point cloud to represent vegetation points. This may introduce biases into the generated data products if the class contains not only vegetation points but also other objects such as cars, fences, poles, boats, etc. To validate the derived LiDAR metrics of ecosystem structure we randomly selected 100 sample plots (10 m × 10 m each) across the Netherlands and hand-labelled the segmented point clouds (183,837 points in total) into three classes: vegetation, ground, and others (e.g. buildings, cars, fences). We then used these handlabelled points as ground truth and calculated two validation metrics: (1) the misclassification rate for each plot (i.e. the number of points incorrectly classified as vegetation / total number of points in the ASPRS class '1: Unclassified'), and (2) the accuracy of the 25 derived LiDAR metrics (i.e. the number of plots in which the LiDAR metric calculation did not differ between hand-labelled and originally classified points / total number of plots).
The 100 randomly selected 10 m × 10 m plots were widely spread across the whole Netherlands ( Fig. 1a ). Most plots (88%) did not contain any misclassification (i.e. exclusively true vegetation points within the ASPRS class '1: Unclassified'). Four plots (4%) had more than half of the points in the ASPRS class '1: Unclassified' incorrectly classified as vegetation. Across all 100 plots the misclassification rate was very low (0.05 ± 0.19, n = 100).
As a consequence of the low misclassification rate, only a few plots showed differences in the LiDAR metric calculation between the hand-labelled vegetation points and the points originally classified as '1: Unclassified' (see dots in Fig. 1b ). Overall, the accuracy of the generated LiDAR metrics was high (0.90 ± 0.04, n = 25 LiDAR metrics), ranging from 0.87-1. The number of plots with differences in LiDAR metric values and the degree of difference varied among LiDAR metrics ( Fig. 1b ). For instance, the LiDAR metrics Hmax and Hp95 showed the strongest differences among height-related LiDAR metrics, whereas BR_below_1 and BR_below_5 showed the strongest differences among metrics characterizing the density of vegetation points in certain vegetation layers ( Fig. 1b ). Closer inspection of specific plots with misclassifications showed that incorrectly classified points mainly belonged to boats, fences, and cars ( Fig. 1c ).

Point density
Besides the 25 LiDAR metrics, we additionally calculated the point density for each 10 m x 10 m grid cell to quantify the variability in available AHN3 point densities across the Netherlands. This represents the spatial distribution of the point cloud density and can be used for additional analyses, e.g. to test how LiDAR metrics of ecosystem structure vary with point densities. The AHN3 point density was calculated for each 10 m x 10 m grid cell using the 'point_density' feature from the 'Laserchicken' software ( https://laserchicken.readthedocs.io/en/latest/#features ). Formal description : N/A where N is the total number of points (including points from all classes from the classification, i.e. not only the points from the class 'unclassified') and A is the area (here: 10 m x 10 m grid cell).

Mask
We additionally derived a mask of water surfaces and human infrastructures (buildings and roads) from the Dutch cadaster data which allows the user to minimize inaccuracies related to misclassifications, e.g. ships on water surface, chimneys on roofs, or cars on roads which might incorrectly be considered as vegetation. A mask of water surfaces and human infrastructures (buildings and roads) was created from the shapefiles of the 2018 Dutch cadaster data (TOP10NL, https://zakelijk.kadaster.nl/-/top10nl ) using the same spatial resolution (i.e. 10 meter) and projected coordinate system (EPSG:28992 Amersfoort / RD New) as the other GeoTIFF files. We aggregated the water, building and road polygons from the TOP10NL shapefiles into one type, and rasterized them using a binary classification (1: water, building and roads; 0: other). The mask allows users to minimize errors in the generated country-wide ecosystem structure data because not all points in the ASPRS point class '1: Unclassified' are vegetation points ( Fig. 2 ).

Raw data
The whole LiDAR point cloud dataset from AHN3 is large and contains ∼700 billion points and ∼16 TB uncompressed data volume, available in 1,367 point cloud files (LAZ format). We downloaded all 1,367 files from the AHN3 repository using the PDOK webservices ( https: //app.pdok.nl/ahn3-downloadpage/ ) and a script for automatic download ( https://github.com/ eEcoLiDAR/downloadAHN ). Data were downloaded to the GRID storage infrastructure ( http: //doc.grid.surfsara.nl/en/latest/Pages/Advanced/grid _ storage.html ) from SURF, the trans-national IT infrastructure for the Dutch academic community ( https://www.surf.nl/en/ict-facilities ).

Processing
For processing the files, we set-up a cluster of 11 virtual machines (VMs) using the HPC Cloud from SURF ( https://userinfo.surfsara.nl/systems/hpc-cloud ). Each of the 11 VMs had 2 cores (thus a total of 22 cores), 32 GB or 64 GB RAM, and 256 GB local HDD. We used the Laserfarm workflow (version 0.1.5) available from PyPI ( https://pypi.org/project/laserfarm/ ), GitHub ( https: //github.com/eEcoLiDAR/Laserfarm ) and Zenodo ( https://zenodo.org/record/5636773 ) to process the multi-terabytes of AHN3 point clouds into GeoTIFF files of LiDAR metrics capturing ecosystems structure. The Laserfarm workflow is implemented in Python and makes use of opensource tools such as 'Laserchicken' [2] , the Point Data Abstraction Library (PDAL, https://pdal.io/ ), the Geospatial Data Abstraction Library (GDAL, https://gdal.org/ ), the Dask library [16] , and numerous packages hosted on the open source Python Package Index (PyPI, https://pypi.org/ ). All steps of the AHN3 processing using the Laserfarm workflow are available in Jupyter Notebooks ( https://github.com/eEcoLiDAR/AHN/tree/main/AHN3 ) and involved the following processing steps: As a first step ('Re-tiling'), we split the 1,367 LAZ files into a grid with 1 km × 1 km size (512 × 512 cells across the Netherlands), making use of the Dutch projected coordinate system (EPSG:28992 Amersfoort / RD New). The LAZ files were retrieved from the storage and then split using functionality from the PDAL library [17] . This resulted in 37,457 tiles for further processing. In the second step ('Normalization'), we calculated the normalized height for each individual point as the height relative to the lowest point within a 1 m × 1 m cell. This pipeline employs the 'Normalize' module of the 'Laserchicken' software [2] .
In the third step ('Feature extraction'), we calculated the LiDAR metrics using the 'Features' and 'Compute Neighbors' modules of the 'Laserchicken' software [2] . We focused on 25 LiDAR metrics capturing ecosystem height, ecosystem cover and ecosystem structural complexity (see details above in 'Data description'). We used the ASPRS classification code '1: Unclassified' [15] of the AHN3 point cloud to represent vegetation points. We defined the spatial resolution (grid cell size) for all metrics as 10 m × 10 m using the centroids of square cells and an infinite vertical extent as the volume geometry in the 'Laserchicken' software [2] . We finally generated 37,457 PLY files for each LiDAR metric and exported them to separate folders.
In the fourth step ('Rasterization'), we merged and exported the PLY tiles for each metric as a single-band GeoTIFF file in the Dutch projected coordinate system (EPSG:28992 Amersfoort / RD New).

Validation
We randomly located 100 plots (each of 10 m × 10 m size) across the Netherlands using the sampleRandom() function in R ( https://www.rdocumentation.org/packages/raster/versions/ 3.5-15/topics/sampleRandom ). We then segmented the point cloud of each plot from the raw AHN3 point clouds using the 'lasclip' tool from the Lastools software ( https://rapidlasso. com/lastools/ ), using the polygons of the sampled plots. We then hand-labelled all segmented point clouds (i.e. 183,837 points in total) into the classes of vegetation, ground, and others (e.g. buildings, cars, fences). This was done with the ArcGIS Pro interactive editing tool for LAS classification (see https://pro.arcgis.com/en/pro-app/latest/help/data/las-dataset/ interactive-las-class-code-editing.htm ). Each of the 25 LiDAR metrics was then calculated for each plot using the Laserfarm workflow, once using all points from the ASPRS point class '1: Unclassified' and once using only the vegetation points from the hand-labelled point clouds (as ground truth). The values of each LiDAR metric (in GeoTIFF layers from both the original point clouds and the hand-labelled point clouds) for each plot were then extracted using the extract() function in R ( https://www.rdocumentation.org/packages/raster/versions/3.5-15/topics/extract ).
The misclassification rate was calculated for each plot as the number of points which were incorrectly classified as vegetation divided by the total number of points in the ASPRS class '1: Unclassified'. Accuracy of the 25 LiDAR metrics was assessed by taking the number of plots in which the LiDAR metric calculation did not differ between hand-labelled and originally classified points (i.e. difference = 0) and dividing it by the total number of plots ( n = 100). This allowed us to quantify how accurately the 25 LiDAR metrics capture vegetation structure, and to what extend their values might be affected by non-vegetation points that remain in the class 'unclassified' of the AHN3 pre-classification.

Mask
The mask layer was generated using the water, buildings and road polygons from the TOP10NL cadaster data (see above 'Data source location'). We aggregated the polygons and exported them as a raster layer with a binary classification (1: water, building and roads; 0: other) and 10 m resolution, using a Jupyter Notebook ( https://github.com/eEcoLiDAR/AHN/tree/main/ AHN-mask ).

Ethics Statements
This work meets the requirements for ethical publishing ( https://www.elsevier.com/authors/ policies-and-guidelines ). The work does not include chemicals, procedures or equipment that have any unusual hazards inherent in their use, nor does it involve the use of animal or human subjects. No studies on patients or volunteers have been performed.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Country-wide data products for the ecosystem structure metrics derived from ALS data across the Netherlands (AHN3) (Original data) (ZENODO)