Comparison of Filters for Archaeology-Specific Ground Extraction from Airborne LiDAR Point Clouds

Identifying bare-earth or ground returns within point cloud data is a critical step for archaeologists who use airborne LiDAR data, yet there has thus far been very little comparative assessment of the available archaeology-specific methods and their usefulness for archaeological applications. This article aims to provide an archaeology-specific comparison of filters for ground extraction from airborne LiDAR point clouds. The qualitative and quantitative comparison of the data from four archaeological sites from Austria, Slovenia, and Spain should also be relevant to other disciplines that use visualized airborne LiDAR data. We have compared nine filters implemented in free or low-cost off-the-shelf software, six of which are evaluated in this way for the first time. The results of the qualitative and quantitative comparison are not directly analogous, and no filter is outstanding compared to the others. However, the results are directly transferable to real-world problem-solving: Which filter works best for a given combination of data density, landscape type, and type of archaeological features? In general, progressive TIN (software: lasground_new) and a hybrid (software: Global Mapper) commercial filter are consistently among the best, followed by an open source slope-based one (software: Whitebox GAT). The ability of the free multiscale curvature classification filter (software: MCC-LIDAR) to remove vegetation is also commendable. Notably, our findings show that filters based on an older generation of algorithms consistently outperform newer filtering techniques. This is a reminder of the indirect path from publishing an algorithm to filter implementation in software.


Introduction
Airborne light detection and ranging (LiDAR) is now widely used in archaeological prospection [1,2]. Archaeologists interpret enhanced visualizations of airborne LiDAR-derived high-resolution digital terrain models (DTMs) with a combination of perception and comprehension, e.g., [3,4]. The results have proven to be very successful in detecting archaeological features and have uncovered numerous new settlements and countless new archaeological features worldwide, e.g., recently [5–9].
Airborne LiDAR-derived data provide archaeologists with detailed topography, which is a powerful resource for interpretation [10]. The often described process from data acquisition to archaeological interpretation [11–16] can be divided for the purposes of this article into raw data processing (project planning, system calibration, data acquisition, georeferencing, flight strip adjustment), point cloud processing (ground point filtering, additional processing), creation of end products (archaeology-specific raster elevation model interpolation and its visualization), and archaeological interpretation (interpretative mapping, ground inspection, deep integrated multi-scale interpretation).
Remote Sens. 2020, 12
For archaeology, the most critical part of the workflow is ground point filtering [16], and this is the subject of our article. It is worth reiterating that this process involves assumptions and decisions during the entire workflow, but these are often underestimated or even ignored [12,13]. As some of the end products (for example, a simple hill-shaded representation of the DTM) are almost self-explanatory, this may lead to a certain ignorance of the underlying data processing strategies [16].
In comparison to other fields the use of airborne LiDAR-derived data in archaeology is specific in several ways. First, visual inspection of enhanced raster visualizations remains the key method. The digital surface model (DSM) used often combines a digital terrain model (DTM) with off-terrain archaeological features [16] and modern buildings [10,17,18]. Such a specific DSM has been termed the digital feature model (DFM) [19].
Secondly, morphologically speaking archaeological features are anomalies, i.e., sporadically occurring atypical features of the terrain. The direct application of existing generic methods developed to assess the average accuracy of the terrain [20] is, therefore, less than ideal. For example, in archaeology, low noise can be tolerated as long as the size of the distortion is significantly smaller than the archaeological features.
Third, the time, effort, equipment, and human resources invested in airborne LiDAR data processing are only a small part of a typical archaeological project. For example, a detailed desk-based assessment of 1 km² takes about one person-day, and even the most rudimentary field survey of the same area takes up to five person-days. It is therefore less important whether the filtering of the point cloud takes 4 or 400 s than whether a single feature is incorrectly represented. For this purpose, archaeology-specific workflows have been proposed for both custom [16] and general-purpose data [15], none of which are as yet generally accepted. Two reasons for this are that the proposed workflows are too complex and in parts based on costly software.
Finally, we are currently experiencing an unprecedented expansion of archaeological applications of airborne LiDAR data due to the recent widespread availability of low or medium density, nationwide, general-purpose data, especially in Europe and the USA. However, the acquisition and processing of this data has not been optimized for archaeology. In most cases, the data are suitable for (at least some) archaeological purposes if custom archaeology-specific processing is applied, e.g., [15,21].
From the above, it can be concluded that archaeology as a discipline will greatly benefit from a workflow for the archaeology-specific processing of low and medium density airborne LiDAR data based on the free or low-cost off-the-shelf software. To achieve this, an archaeology-specific comparison of filters for ground extraction from airborne LiDAR point cloud is a necessary first step.
For this purpose, we compare the performance of nine automatic filters implemented in free or low-cost off-the-shelf software. The goal of this comparison is three-fold: to determine the archaeology-specific performance of existing filters, to determine the influence of the point density on the filter performance, and to identify directions for future archaeology-specific research.
In the past, several studies have compared relevant filters implemented in free software [22–27]. However, none of them include newer filters, none of them use the standardized comparison method of Sithole and Vosselman [20], and none of them is archaeology-specific. The first point in particular makes our comparison timely and relevant beyond archaeology.
A note on terminology is appropriate at this point. In general, we have followed the terminology widely accepted in relevant scientific publications. We use the term algorithm to refer to theory (e.g., academic articles) and filter to implementation (e.g., in software). The term filtering is used to describe the process of applying the filter to the data. The ground extraction filters are used to classify each point in the point cloud as either ground or non-ground and no point is deleted (filtered). However, when the DTM is interpolated from the point cloud, only the points classified as the ground are used and the rest are omitted (filtered) from the process, cf. [20]. This leads to a certain ambiguity in the terminology, especially concerning the term filter/filtering.

Data
Test sites from Austria (AT), Slovenia (SI), and Spain (ES) were selected for their archaeological and morphological similarity, so that filter performance could be compared at different point densities. The additional test site (SI2) was chosen for its extremely dense vegetation, as an example of data with significant gaps in the coverage of the ground points. Each site is 500 × 500 m, but only the most relevant windows are shown in illustrations.
AT airborne LiDAR data was acquired in 2009. These data have a nominal average density of 4 pts/m² and an estimated horizontal and vertical root mean square error (RMSE) of 0.15 and 0.40 m, respectively. Ground point filtering is not documented [28]. Point cloud data are available via a formal request (we have acquired the data in the non-classified XYZ format, other formats may be available). The AT test site has 3,844,873 points with an average density of 18.79 pts/m², which is significantly higher than the nominal average density.
The SI data was acquired in 2014. These data have a nominal density of 5 pts/m² and an estimated horizontal and vertical RMSE of 0.09 m. Ground point filtering was performed with the proprietary filter gLidar [29] and is distributed via the eVode WMS (LAS or LAZ format, ISPRS classes 0-7 are populated). The SI1 test site has 2,809,344 points with an average density of 12.28 pts/m² and the SI2 has 2,756,756 points with an average density of 12.43 pts/m². It must be mentioned that due to the specific processing of the raw data, each point has a near-duplicate "shadow point". The "shadow point" is within 0.14 m horizontally and within 0.03 m vertically from the original point, but the distance and bearing to the original are not constant. This means that the functional point density is half of the above-mentioned, namely 6.14 pts/m².
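As an illustration, such near-duplicate "shadow points" could be thinned before any density-sensitive processing. The following is a minimal sketch under the stated tolerances (the function name and the brute-force O(n²) search are our own illustration, not part of any distribution pipeline):

```python
import numpy as np

def remove_shadow_points(pts, dxy=0.14, dz=0.03):
    """Drop near-duplicate 'shadow points' lying within dxy horizontally
    and dz vertically of an already kept point.
    Brute-force O(n^2) search; for illustration only."""
    keep = []
    for p in pts:
        dup = any(np.hypot(p[0] - q[0], p[1] - q[1]) <= dxy
                  and abs(p[2] - q[2]) <= dz
                  for q in keep)
        if not dup:
            keep.append(p)
    return np.array(keep)
```

A spatial index (grid hashing or a k-d tree) would be needed for clouds of the sizes reported above.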
The ES data was acquired in 2015, with an average density of 1 pt/m² and an altimetric RMSE of less than 0.20 m. Ground point filtering is not documented [30]. The point cloud data is distributed by the CNIG Download Center (LAZ format, ISPRS classes 0-7, 10, 12 are populated). The test site ES has 453,255 points with an average density of 1.83 pts/m².
The most important metadata are summarized in Table 1. It is not our intention to present the archaeology of the test sites in any detail, but merely to provide the key reference for each archaeological site and to illustrate the archaeological features (Figure 1). The test site AT is located south of Graz near the town Wildon on a rocky hill consisting of algal limestone and marlstone interspersed with layers of clay that rise above the Mur River valley [31]. It is covered with a dense, mature deciduous forest with little undergrowth. The multiperiod archaeological site Wildoner Schlossberg (N 46°53′05″, E 15°30′48″) is mainly known for two medieval castles and a settlement [32].
The neighboring test sites SI1 and SI2 are located in the Dinaric Karst region named after the Pivka River southwest of Ljubljana. Both sites are located on limestone formations from the Jurassic period, which rise above the karst field Pivška kotlina, known for its intermittent lakes [33]. SI1 is covered by decimated deciduous forest in the first stage of regrowth. It is the site of the prehistoric hillfort Kerin above Pivka (N 45°40′21″, E 14°11′40″) [34]. SI2 is characterized by several meadows, each surrounded by very dense deciduous hedges, and it is the site of the Roman period villa rustica Sela near Pivka (N 45°40′56″, E 14°12′25″).
Test site ES is located on the magmatic granite hillock midway between the valleys of the Ulla River and its tributary the Vea south of Santiago de Compostela in northwest Spain [35]. It is covered with open deciduous forest and with two small clearings to the east of the site. The archaeological site O Castelo (N 42°44′30″, E 8°33′02″) is a Roman military camp with a probable Iron Age predecessor [36].
On the above-mentioned sites, three out of four morphological types of archaeological features are distinguished:
1. features embedded in the ground (slight positive or negative bulge, typically up to 0.5 m rise over 5 to 20 m run; blue in Figure 1),
2. features partially embedded in the ground (positive or negative spike, typically more than 0.5 m rise over 5 m run; green in Figure 1),
3. standing features (an archaeological term for an off-terrain object characterized by a sharp discontinuity in the ground; red in Figure 1), and
4. large standing objects (large building-like structures).
Most of the archaeological features detected on airborne LiDAR data fall into the first two types. Examples of type 1 are remains of trenches, ditches, fossil fields, past land divisions, and ancient roads. Examples of type 2 features are dwellings, ramparts, terraces, and burial mounds. Type 3 features occur relatively rarely, mostly as isolated sections of walls (without a roof). From the perspective of point cloud filtering, they are non-ground objects. Examples of type 4 features include Maya monumental architecture at Aguada Fénix, Mexico [5] and Khmer temples in Angkor, Cambodia [7]. These are region-specific and are, to our knowledge, not present in European archaeology. Type 4 features are not the subject of this article since they demand additional processing steps that are not accomplished with ground extraction filters.

Evaluated Filters
We compare all filters known to us that are implemented in off-the-shelf free or low-cost software under active development or maintenance (Table 2; Appendix A). Accordingly, BCAL Lidar Tools [37] was not considered because its freeware standalone version has been deprecated. Also not considered were the legacy Whitebox GAT [38], ALDPAT [39], and LASground [40]. gLiDAR [41,42] was rejected because it was not designed for production and is arguably also a legacy program. Furthermore, the skewness balancing filter [43], as implemented in PDAL, was not included because it does not work as intended (in our attempts, all points below a certain height above sea level were marked as ground and all others as non-ground).
Archaeologists do not typically work with full-waveform airborne LiDAR data (but see [44]), therefore we do not assess post-processing methods for this specific category of airborne LiDAR data in the context of this article.
These and numerous other existing filters are based on algorithms that can be divided into five categories [26,45–47]. Morphological filtering algorithms are based on the concept of mathematical morphology. They retain terrain details by comparing the elevation of neighboring points [46,47]. A point is classified as ground if its height does not exceed the height of the eroded surface. The filtering is done by two elementary operations: a morphological erosion with the kernel function and a test for the difference between the original height and the eroded height of a point [48].
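The two elementary operations, erosion followed by a height-difference test, can be sketched on a rasterized minimum-height grid. This is a simplified illustration with assumed parameter names, not a reproduction of any evaluated filter:

```python
import numpy as np

def min_filter(grid, w):
    """Grey (minimum) erosion of a 2D grid with a w x w square kernel."""
    pad = w // 2
    padded = np.pad(grid, pad, mode="edge")
    out = np.full_like(grid, np.inf)
    for di in range(w):
        for dj in range(w):
            out = np.minimum(out, padded[di:di + grid.shape[0],
                                         dj:dj + grid.shape[1]])
    return out

def morphological_ground_filter(points, cell=1.0, window=3, dh_max=0.5):
    """Classify points (x, y, z) as ground (True) if their height stays
    within dh_max of the morphologically eroded minimum-height surface."""
    xyz = np.asarray(points, float)
    ij = np.floor((xyz[:, :2] - xyz[:, :2].min(axis=0)) / cell).astype(int)
    grid = np.full(ij.max(axis=0) + 1, np.inf)
    np.minimum.at(grid, (ij[:, 0], ij[:, 1]), xyz[:, 2])   # min height per cell
    grid[np.isinf(grid)] = grid[np.isfinite(grid)].min()   # fill empty cells
    eroded = min_filter(grid, window)
    # test: difference between original point height and eroded surface
    return xyz[:, 2] - eroded[ij[:, 0], ij[:, 1]] <= dh_max
```

The kernel size and height threshold embody the tradeoff discussed below: a larger window erodes steep terrain, a smaller one lets low vegetation pass.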
Progressive densification algorithms iteratively classify more and more points as ground by comparing the elevation of the points with values estimated by different interpolation methods. The initial set of seed ground points is often obtained from the lowest points in a large grid. Ground points are then densified in an iterative process, in which within every triangle the offsets of the candidate points to the triangle surface are analyzed. Points with offsets below the threshold are classified as ground and the algorithm proceeds with the next triangle. The new points are inserted into the triangulation. The process is repeated on the thus densified network until (i) there are no more points in a triangle, (ii) a certain density of ground points is reached, or (iii) all acceptable points are closer to the surface than a threshold value [26,46,47].
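The seed-then-densify loop can be illustrated with a simplified one-dimensional analogue that uses linear interpolation in place of a TIN. This is our own sketch with assumed parameter names, not the lasground_new implementation:

```python
import numpy as np

def progressive_densification_1d(x, z, seed_bin=20.0, dz_max=0.3, max_iter=50):
    """1D analogue of progressive TIN densification: seed with the lowest
    point per coarse bin, then iteratively accept points lying within
    dz_max of the surface interpolated between current ground points."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    ground = np.zeros(len(x), bool)
    bins = np.floor(x / seed_bin)
    for b in np.unique(bins):                 # seed: lowest point per bin
        idx = np.where(bins == b)[0]
        ground[idx[np.argmin(z[idx])]] = True
    for _ in range(max_iter):
        order = np.argsort(x[ground])
        surf = np.interp(x, x[ground][order], z[ground][order])
        new = (~ground) & (np.abs(z - surf) <= dz_max)
        if not new.any():                     # termination: no acceptable points
            break
        ground |= new                         # densify and re-interpolate
    return ground
```

In the real 2D case the interpolated line is replaced by triangle facets, and the offset test additionally considers angles to the triangle vertices.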
Surface-based filtering algorithms first assume that all points are ground and then gradually remove the points that are not. Typically, simple kriging is used in the first step to describe the surface using all points. An averaged surface is then created between the ground and non-ground points. Residuals are computed as the distances of the points from the averaged surface, and the points are weighted according to their residual values. Points with negative residuals are weighted more heavily and are considered ground points. There are many extensions and variants of the surface-based filters [26,46].
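The iterative reweighting can be sketched in one dimension with a weighted local average standing in for the kriged surface, loosely following a Kraus-Pfeifer-style weight function. Parameter values here are illustrative assumptions:

```python
import numpy as np

def surface_based_filter_1d(x, z, bin_w=5.0, a=1.0, b=4.0, g=-0.5, n_iter=3):
    """1D sketch of surface-based filtering: start from all points, then
    iteratively refit a weighted average surface and down-weight points
    lying above it. Points keeping a strong weight are ground."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    w = np.ones(len(x))
    bins = np.floor(x / bin_w).astype(int)
    for _ in range(n_iter):
        surf = np.empty(len(x))
        for b_id in np.unique(bins):
            m = bins == b_id
            surf[m] = np.average(z[m], weights=w[m])   # weighted local surface
        r = z - surf                                   # residuals to the surface
        # below-surface points (r <= g) keep full weight; weight decays above
        w = np.where(r <= g, 1.0, 1.0 / (1.0 + a * (r - g) ** b))
    return w > 0.5
```

Because the surface is refit after each reweighting, vegetation points lose influence iteratively rather than being cut by a single threshold, which is the adaptation-to-terrain property noted below.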
Segmentation-based filtering algorithms differ from the above-mentioned ones in that they first form homogeneous segments and then analyze segments rather than individual points. Segmentation can be performed using region-growing techniques, adopting either the normal vector or its change, resulting in either planar or smoothly varying surfaces, respectively. Alternatively, homogeneity can be achieved from height or height change only. The segmentation process has been described by the rationale that every point of a segment can be reached from any other point of the same segment, in the sense that it would be possible for a person to walk along such a path. Many implementations rasterize the data in the process and the segmentation methods are often derived from image processing [46].
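A toy one-dimensional sketch of the segment-then-classify idea follows: points are grouped into smooth segments by height change, and whole segments are then labeled at once. Real implementations use 2D region growing on normals or heights; the global-minimum reference used here is a deliberate simplification:

```python
import numpy as np

def segment_filter_1d(x, z, dz_seg=0.2, dz_ground=0.5):
    """Group consecutive points into smooth segments (height jump below
    dz_seg breaks a segment), then label entire segments as ground if
    their minimum height is near the lowest segment. Illustrative only:
    real filters judge segments against local, not global, context."""
    x, z = np.asarray(x, float), np.asarray(z, float)
    order = np.argsort(x)
    z_s = z[order]
    # a new segment starts wherever the height jump exceeds dz_seg
    seg = np.concatenate([[0], np.cumsum(np.abs(np.diff(z_s)) > dz_seg)])
    seg_ids = np.unique(seg)
    seg_min = np.array([z_s[seg == s].min() for s in seg_ids])
    ground_segs = seg_ids[seg_min <= seg_min.min() + dz_ground]
    ground_sorted = np.isin(seg, ground_segs)      # classify whole segments
    ground = np.empty(len(x), bool)
    ground[order] = ground_sorted                  # map back to input order
    return ground
```

Classifying segments rather than points is what limits the influence of individual outliers, as noted below for noisy SfM clouds.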
Other approaches have also been developed. Some are based on determining a set of features for each point depending on the distribution of neighboring points or as measured features of each point. A decision tree based on machine learning is then often used to classify points based on their feature values. Some methods are based on the slopes or curvatures between points [46].
Hybrid filters combine algorithms to merge the strengths of different approaches [46,47]. The filters used in this article and their specific implementation in the software are described in more detail in Appendix A.
When comparing the different categories of algorithms, the following points can be summarized. Morphological filters invariably impose a tradeoff: the erosion of the ground on steep slopes on the one hand and the inclusion of non-ground points in flat areas on the other have to be balanced on a case-by-case basis. A general advantage of both progressive densification and surface-based algorithms is that the geometric properties of the terrain surface and the point cloud data characteristics are separated to a certain extent. Surface-based algorithms produce more reliable results since they also consider the surface trend and, due to their iterative nature, allow for a better adaptation to the terrain. The segmentation-based algorithms have advantages mostly in urban areas where flat segments are prevailing [46]. Buján [47] recommends morphological filtering algorithms for terrains with small objects, progressive densification for urban areas and terrains with diverse objects, and surface- and segmentation-based algorithms for forests and steep slopes. To this, we may add that one of the key features of segmentation-based algorithms is that outliers (low and high) theoretically have a limited impact on the result. Thus, they should be suitable for handling data sets with significant noise, such as point clouds obtained by structure from motion (SfM).
Nevertheless, all algorithms present similar difficulties when implemented in sharply changing terrain, low vegetation, areas with dense non-ground features, and complicated landscapes [20,46,47,49,50]. The two-decade-old expectation that the fusion of multiple sources (e.g., RGB values or intensity) and integration of different methods will improve the performance [20,51,52] is yet to materialize in off-the-shelf software.
The long-term comparison of the performance of specific algorithms is also revealing. Between 1996 and 2007, uniform algorithms prevailed, with densification being the most important one. Between 2007 and 2012, hybrid algorithms prevailed, and between 2012 and 2016 there was a period of renewed interest in uniform algorithms [47]. Early studies showed that algorithms incorporating a concept of surface performed better than morphological algorithms. In the test environment, a gradual improvement in the performance of algorithms, regardless of the concept they are based on, has been observed over the past two decades [47].

Method for Quantitative and Qualitative Assessment
The above-mentioned studies that compared relevant filters [22–27] used different approaches to calibrating filters with custom settings, a process that depends strongly on the skillset of the operator. Some omitted calibration by using default settings to minimize the subjective influence of the operator [25]. However, not only are the results more accurate when filters are calibrated, but some filters are also more sensitive to calibration than others [27]. Accordingly, most studies compare calibrated filters [22,23,26]. Our decision to compare calibrated filters was ultimately guided by our search for the best workflow, which must always include calibration.
We have minimized the subjectivity of the calibration process through intensive testing (383 preliminary DTMs were created as a result). For most filters, we used the method of relevant incremental adjustments for each variable, starting with the default value until the best result was found. As the process was guided by experience, in most cases fewer than four attempts were required for each variable; a typical example is the calibration of the MCC variable(s) for case study AT. Such an approach was not possible with the SMRF, since out of five variables there are two inversely related pairs. This means that a pair of variables must be changed each time, resulting in almost infinite combinations. Thus, we tested 15 settings that were optimized for ISPRS reference data sets [53].
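The per-variable incremental adjustment can be sketched as a simple one-direction coordinate search. Here `run_filter` and `score` are hypothetical stand-ins for filtering a test site and evaluating the resulting DTM; the real procedure was guided by visual inspection rather than a numeric score:

```python
def calibrate_incremental(run_filter, score, defaults, steps, max_tries=4):
    """Adjust one variable at a time from its default, keeping each
    increment while the score improves (sketch: only increases each
    variable, and assumes variables are independent, unlike the SMRF)."""
    best = dict(defaults)
    best_score = score(run_filter(**best))
    for var, step in steps.items():
        for _ in range(max_tries):
            trial = dict(best)
            trial[var] += step                  # one incremental adjustment
            s = score(run_filter(**trial))
            if s <= best_score:                 # no improvement: stop this variable
                break
            best, best_score = trial, s
    return best
```

The SMRF's inversely related variable pairs break the independence assumption in this sketch, which is why a fixed set of pre-optimized settings was tested instead.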
The 36 best results, 9 filters each for AT (Figure 2), SI1 and SI2 (Figure 3), and ES (Figure 4), were compared qualitatively and quantitatively. The comparison was made by adapting the qualitative and quantitative methods of Sithole and Vosselman [20] for archaeology.
First, we manually classified a point cloud using context knowledge (AT, SI1, and SI2) and aerial photographs. The points were classified into ground (2) and non-ground (1) using the Global Mapper software package, and only after careful inspection were the classifications accepted. The qualitative assessment method is defined as a visual examination and comparison of the datasets. Five categories of circumstances in which the algorithms are likely to fail are observed:
1. outliers (low or high, occur mainly due to sensor errors);
2. object complexity (refers to buildings that are difficult to detect due to their size, low height, or complex shape);
3. detached objects (buildings on slopes, bridges, ramps);
4. vegetation (problematic categories are vegetation on slopes and low vegetation);
5. discontinuity (sharp changes in a slope, like cliffs and sharp ridges, are treated as buildings).
For the purposes of this study, we have added the archaeology-specific category: 6. archaeological features (type 1, type 2, and type 3 as defined above in Section 2.1).
Each observed category is rated as good (item is filtered most of the time), fair (item is not filtered a few times), or poor (item is not filtered most of the time). Influence on neighboring points is also rated as good (no influence), fair (small influence), or poor (large influence).
In a similar vein, for the purposes of this study, some of the categories are irrelevant either because all filters perform well (high outliers) or because they are irrelevant to archaeology (object complexity, detached objects). Nevertheless, where present, these categories were still evaluated but are marked in grey.
It should be noted that this method is subjective by design. In this particular case, it has been carried out by two archaeologists who have extensive experience with airborne LiDAR data. It is therefore to be expected that the visual assessment is biased towards the qualities that are archaeology-specific.
The quantitative assessment was done by generating cross-matrices of each generated classified point cloud against the manually classified point cloud. The cross-matrices were then used to evaluate Type I (rejection of bare-earth points, i.e., false negative) and Type II (acceptance of object points as bare-earth, i.e., false positive) errors. An example of a Type I error is a ground point classified as non-ground. An example of a Type II error is a non-ground point, for example one belonging to a building, classified as ground.
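The error rates derived from such a cross-matrix can be computed as follows (a sketch assuming boolean ground masks; the function name is our own):

```python
import numpy as np

def filter_error_rates(manual, predicted):
    """Cross-matrix of a filter's classification against the manual
    reference (True = ground). Type I: ground rejected (false negative);
    Type II: non-ground accepted as ground (false positive)."""
    manual = np.asarray(manual, bool)
    predicted = np.asarray(predicted, bool)
    t1 = np.sum(manual & ~predicted) / max(np.sum(manual), 1)    # rejected bare-earth
    t2 = np.sum(~manual & predicted) / max(np.sum(~manual), 1)   # accepted object points
    total = np.sum(manual != predicted) / len(manual)            # overall disagreement
    return t1, t2, total
```

Note that the Type I and Type II rates are normalized by the reference ground and non-ground counts respectively, so the total error is not simply their sum.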
Quantitative assessment was carried out on small areas of 50 × 50 m within archaeological sites. If we were to apply the original method directly, the archaeology-specific accuracy would be masked by the overall accuracy (similar to how low noise was masked in the original study).
The areas were selected according to two criteria:
• the archaeological features must cover at least 50% of the area, and
• the area represents the most difficult rather than the average situation.

Qualitative Assessment
The results of the qualitative assessment are summarized in Figure 5.

To facilitate comparison, the performance of each filter in each category was scored as follows: the filter with the best performance received nine points, the second-best eight points, and so on down to one point for the worst. The PTIN achieved the highest average, followed by the BMHF, and then the PMF and SBF. The most robust performers, with the fewest one-point scores, were PTIN and SBF, followed by SMRF and BMHF.
On average, filters performed better on dense data than on medium and low density data. By far the best on dense data was PTIN, followed by MCC and SBF, WLS, SegBF, and BMHF. PTIN and BMHF performed best on medium density data, followed by CSF and SMRF. For low density data, PMF, PTIN, and WLS were the best.
For vegetation removal, MCC was best, followed by PMF and then WLS and BMHF as a distant third. In the preservation of discontinuities, PTIN was perfect, while SBF and WLS were acceptable. In the retention of archaeological features, SMRF and CSF were almost perfect, closely followed by SBF, PTIN, and BMHF.
PTIN was the only filter consistently among the three best; the only comparable was BMHF, which struggled with the preservation of discontinuities and failed almost completely for low-density data.

Quantitative Assessment
The results of the quantitative assessment are summarized in Figure 6. To facilitate comparison, the performance of each filter in each category was evaluated as follows: the filter with the best performance and the lowest error received nine points, the second-best eight points, and so on; the filter with the highest error, the last of the nine filters evaluated, received one point. The best score in total error was obtained by SMRF, followed by a close group of BMHF, PTIN, SBF, and CSF. At the bottom of the group were PMF, SegBF, and, at some distance, WLS. The most robust performers were PTIN and BMHF (twice and once second best, respectively; never among the two worst), followed by SBF and CSF (never among the worst). The T1 error results are almost identical to the total error. The T2 error is very different, though: MCC and WLS are decidedly the best, followed by a narrow group of PTIN, BMHF, and PMF; SBF, CSF, SegBF, and, at some distance, SMRF are the worst.
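The nine-to-one scoring described above can be sketched as follows (filter names and error values in the example are illustrative, not the measured results):

```python
def rank_points(errors):
    """Convert each filter's error into ranking points: the lowest error
    receives as many points as there are filters (nine in this study),
    the highest error receives one."""
    order = sorted(errors, key=errors.get)       # filters from best to worst
    return {f: len(errors) - i for i, f in enumerate(order)}
```

Summing these points across categories or test sites yields the averages reported in the comparison figures.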

Discussion
The qualitative assessment revealed that some of the categories are irrelevant for this study, either because all filters perform well and would only obscure relevant differences (high outliers), because they are irrelevant to archaeology (detached objects), or because they are not the subject of this article (object complexity; see archaeological features type 4 in Section 2.1 above). Nevertheless, the results are included in the table where possible (there were no detached objects in our data) to comply with the original method but are omitted from the numerical average.
Extreme examples of very dense vegetation and low vegetation (SI2) remain unsolvable for all filters. The removal of vegetation on steep slopes also remains problematic, but with dense data (AT) it is solved well by all filters, as is discontinuity retention. There is a direct relationship between vegetation removal on the one hand and archaeology and discontinuity retention on the other: the more vegetation is removed, the more archaeology and discontinuities will be removed as well.

With a few exceptions, all filters performed well in the retention of archaeological features of types 1 (features embedded in the ground) and 2 (features partially embedded in the ground). However, the filters differ significantly in what this retention costs in terms of vegetation removal and discontinuity retention. It is not surprising that none of the filters fully preserved the type 3 features (standing features), as these are by definition off-terrain objects.
The results of the quantitative assessment are not clear-cut. Total error and the T1 error are almost identical, since the T1 error makes up the majority of the total error. The T2 error results are very different and, in the case of MCC, WLS, and SMRF, inverse. The latter is easily explained: the more ground points are classified (lower T1), the more are misclassified (higher T2), and vice versa. This means that MCC and WLS minimize the T2 error at the expense of a high T1 error, and SMRF does the opposite. This insight could be exploited in the form of a multi-step hybrid algorithm combining, for example, MCC and SMRF. On their own, MCC and WLS are suitable for high- and possibly medium-density data, but not for low-density data.
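The T1/T2 trade-off can be made concrete with the standard per-point error rates; the confusion counts below are hypothetical and only illustrate the inverse behavior discussed above.

```python
def filter_errors(gg, go, og, oo):
    """Per-point error rates for a ground filter: gg/go = ground points
    kept/rejected, og/oo = object (non-ground) points accepted/rejected."""
    t1 = go / (gg + go)                 # Type I: ground rejected as non-ground
    t2 = og / (og + oo)                 # Type II: non-ground accepted as ground
    total = (go + og) / (gg + go + og + oo)
    return t1, t2, total

# A permissive filter (low T1, high T2) ...
t1a, t2a, tot_a = filter_errors(gg=9_500, go=500, og=900, oo=9_100)
# ... versus a strict one (high T1, low T2) on the same 20,000 points.
t1b, t2b, tot_b = filter_errors(gg=8_500, go=1_500, og=100, oo=9_900)
```

Here the permissive filter even has the lower total error, which shows why total error alone cannot capture the trade-off.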
On balance, BMHF and PTIN are notably better than the rest, with SBF and PMF providing acceptable performance.

Conclusions
Qualitative and quantitative results are not directly comparable. Thus, no filter is simply better than the rest, but the commercial filters PTIN and BMHF are consistently at the top, followed by the open-source SBF. The ability of MCC to remove vegetation is very good, as is the SBF's retention of discontinuities and archaeological features. Unfortunately, BMHF, which is a hybrid of MCC and SBF, does not succeed in combining the strengths of both.
It is noticeable that in our assessment, filters based on newer algorithms (PMF, CSF, SMRF) are outperformed by some of the oldest (PTIN, SBF), which contradicts recently published findings [46]. This is a reminder of the indirect path from publishing an algorithm to its implementation as a filter. The latter needs to be refined and optimized, which seems to be best handled by commercial software developers. This insight puts into perspective the purpose of publishing ever more new algorithms that are only compared to published results, e.g., [47], but are neither implemented as off-the-shelf filters nor compared to them.
The qualitative assessment enables an archaeology-specific comparison of the existing filters. Its results are directly transferable to real-world problem-solving: which filter works best for a given combination of data density, landscape type, and type of archaeological features?
Concerning the influence of point density on filter performance, our results are consistent with the results presented by Sithole and Vosselman [20], namely in that the lower the point density, the worse the performance of all filters. This makes processing low-density data much more demanding.
As far as archaeological features are concerned, all filters handle the features embedded in the ground perfectly. Some filters (PMF, WLS, SegBF) struggle with semi-embedded features and all but two (SMRF, SegBF) erase standing features completely. An automatic solution for the handling of standing features has yet to be found. Therefore, a considerable manual classification effort is still inevitable [13,15].

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Description of and Experience with the Filters Used in the Case Studies
The descriptions of the filters are based on the cited references. The descriptions of the user-facing parameters, based on the documentation of the respective software (regular text), are followed by our own archaeology-specific experiences (in italics). All numerical values are given in map units unless otherwise stated. The exact values used in each of the best results are summarized in Table A1.
Appendix A.1.1. PMF
The progressive morphological filter removes off-ground objects by applying elevation-difference thresholds in gradually increasing window sizes. However, this implementation is not strict: it works at the point cloud level without any rasterization, the morphological operator is applied to the point cloud rather than to a raster, and the sequence of window sizes and thresholds is not computed as in the original algorithm but is freely user-defined.
Software: lidR 2.2.4 (R-CRAN module), cran.r-project.org. Price: free. Algorithm reference: [54]. User-facing parameters: (ws) Sequence of window sizes: smallest, largest, and step (default: 3, 12, 3). Decreasing the smallest window size had no noticeable effect on our results. Increasing the largest window to 18 brought modest improvements in most areas but was extremely detrimental to sharp ridges. Therefore, the default settings were used.
(th) Sequence of threshold heights above the parameterized ground surface for a point to be considered a ground return (default: 0.3, 1.5). Decreasing the lower threshold to 0.1 moderately improved all results. The higher threshold had the greatest influence of all parameters; the default value worked best for all terrain types except the sharpest ridges, where entire sections were misclassified. A value of 3.5 was the best compromise, with noticeable but tolerable T1 errors both on sharp archaeological features (i.e., ridges of the order of decimetres have their tops cut off) and on terrain ridges (i.e., ridges of the order of tens of meters have their tops cut off).
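For intuition, the core progressive-morphological logic can be sketched on a rasterized minimum surface (a simplification: the lidR implementation described above works on the point cloud itself, and the window/threshold pairs below are illustrative):

```python
import numpy as np
from scipy.ndimage import grey_opening

def pmf_grid(z, windows=(3, 9, 15), thresholds=(0.3, 1.5, 2.5)):
    """Simplified progressive morphological filter on a minimum-elevation
    grid `z`: a cell is flagged non-ground when morphological opening at a
    given window size lowers it by more than the matching threshold."""
    nonground = np.zeros(z.shape, dtype=bool)
    surface = z.astype(float)
    for w, th in zip(windows, thresholds):
        opened = grey_opening(surface, size=(w, w))
        nonground |= (surface - opened) > th
        surface = opened                      # progressively flattened surface
    return nonground

# Flat ground at 100 m with one 4 x 4 cell "building" 5 m high.
z = np.full((20, 20), 100.0)
z[8:12, 8:12] += 5.0
mask = pmf_grid(z)
```

The small window passes over the building (it is wider than 3 cells), and the 9-cell window removes it; gradually increasing windows are what lets the filter remove objects of different sizes without cutting into terrain.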
Appendix A.1.2. SBF
Slope-based filtering is based on the observation that a large height difference between two nearby points is unlikely to be caused by a steep slope in the terrain. The probability that the higher point is a ground point decreases as the distance between the two points decreases, so an acceptable height difference between two points can be defined as a function of the distance between them. The inter-point slope threshold and the inter-point height threshold must both be exceeded within a given radius for a point to be considered non-ground. SBF filters suffer from the challenge of applying a constant slope threshold to terrain with widely varying slopes.
To mitigate the above-mentioned challenge, this implementation first normalizes the underlying terrain with a white top-hat normalization. Then, points within a given radius that are both (i) above neighboring points by the minimum height threshold and (ii) have an inter-point slope greater than the slope threshold are considered non-ground points.
Software: WhiteboxTools (LidarGroundPointFilter), jblindsay.github.io. Price: free. Reference: [48]. User-facing parameters: (r) Search radius refers to the local neighborhood within which the inter-point slopes are compared (default: 2). Values in correlation with the point cloud density (about 7 times the cell size of the final DEM) were used.
(n) The minimum number of neighboring points within the search area; if fewer points are identified, the search radius is extended until this value is reached or exceeded (default: not set). A value of 3 led to the best results.
(st) Inter-point slope threshold (default: 45 deg). The default value works best for most data sets, except for low-density point clouds, for which the value must be significantly decreased.
(ht) Inter-point height threshold (default: 1). Suitable values, never exceeding the default, were inversely related to the point cloud density.
(sn) Slope normalization defines whether the initial white top-hat normalization is performed (default: on). Always on.
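The inter-point test described above can be sketched with a minimal brute-force pass over a toy point set (the top-hat normalization step is omitted; the points, thresholds, and helper name are illustrative):

```python
import math

def sbf_classify(points, radius=2.0, slope_deg=45.0, height=1.0):
    """Minimal slope-based filter sketch: a point is non-ground if, within
    `radius`, some lower neighbor makes both the inter-point height
    difference and the inter-point slope exceed their thresholds."""
    tan_st = math.tan(math.radians(slope_deg))
    nonground = []
    for i, (xi, yi, zi) in enumerate(points):
        flag = False
        for j, (xj, yj, zj) in enumerate(points):
            if i == j:
                continue
            d = math.hypot(xi - xj, yi - yj)      # horizontal distance
            dz = zi - zj                          # height above the neighbor
            if 0 < d <= radius and dz > height and dz / d > tan_st:
                flag = True
                break
        nonground.append(flag)
    return nonground

# Three gently sloping ground points and one 3 m "canopy" return above them.
pts = [(0, 0, 0.0), (1, 0, 0.1), (2, 0, 0.2), (1, 0.5, 3.0)]
labels = sbf_classify(pts)
```

Note how both conditions must hold: a large height difference alone (e.g., across a cliff face further than `radius` away) does not flag a point.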
The simple morphological filter filters the point cloud by applying a linearly increasing window and simple slope thresholding, together with image inpainting techniques. First, the lowest point within each cell is retained and empty cells are inpainted. Then a vector of window sizes (radii) is generated and gradually increased to the supplied maximum value. For each window, an elevation threshold is generated as the supplied slope parameter multiplied by the product of the window size and the cell size of the minimum surface grid. Based on the latter, a surface is generated using a disk-shaped element with a radius equal to the current window size. At each step, a cell is classified as non-ground if the difference between the current and previous surface is greater than the calculated elevation threshold. In an additional process, low outliers are removed. The implementation is not strict.
(sc) The elevation scalar parameter assists in the identification of ground points from the provisional DEM; it is inversely related to the scaling factor (default: 1.25). Values between 0 and 1.5 were used; values around 1.5 were optimal for dense data of forested steep slopes with very sharp edges.
(st) The slope threshold controls the ground/non-ground identification at each step and should approximate the maximum slope of the terrain; it is inversely related to the maximum window radius (default: 0.15 rise over run). Values between 0.05 and 0.28 were used, on average about 1/2 of the steepest slope; the lowest value was used for low-density data.
(w) The maximum window radius controls the opening operation at each iteration and corresponds to the size of the largest feature to be removed. This is inversely related to the slope threshold (default: 18.0). Predominantly values between 15 and 20 were used.
(t) The elevation threshold determines the final classification of the point as ground/non-ground based on the interpolated vertical distance from the minimum surface grid. This is inversely related to the scaling factor (default: 0.5). Values between 0.25 and 0.5 were used.
General note: the interconnectedness of parameters makes the search for optimal settings very time-consuming. Our approach was to test 15 optimal combinations of settings developed for the ISPRS reference data sets by Pingel et al. [53].
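The interplay of slope threshold, window radius, and cell size described above can be sketched on a grid; the inpainting and the final per-point classification step are omitted, and all values are illustrative:

```python
import numpy as np
from scipy.ndimage import grey_opening

def smrf_grid(z, cell=1.0, slope=0.15, max_radius=5):
    """Core SMRF loop (sketch): open the minimum surface with a disk of
    growing radius; the elevation threshold grows with the window as
    slope * radius * cell size."""
    nonground = np.zeros(z.shape, dtype=bool)
    prev = z.astype(float)
    for r in range(1, max_radius + 1):
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        disk = (xx**2 + yy**2) <= r**2            # disk-shaped element
        opened = grey_opening(prev, footprint=disk)
        threshold = slope * r * cell              # slope-scaled threshold
        nonground |= (prev - opened) > threshold
        prev = opened                             # surface from previous step
    return nonground

z = np.full((20, 20), 50.0)
z[5:8, 5:8] += 2.0                                # low, compact object
mask = smrf_grid(z)
```

The slope-scaled threshold is the key design choice: large windows tolerate large elevation differences, so sloping terrain survives while compact objects of any size are removed.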
(wparam) The W-parameter value of the weight equation (default: 2.5). Negligible improvements were achieved by reducing the value to 0.5 (in combination with the increase of the iterations).
(it) Number of iterations for the filtering logic (default: 5 iterations). Increasing the number of iterations up to 10 resulted in a moderate reduction of T2 errors, an increase beyond that resulted in negligible improvements; increasing the number of iterations extends the processing time linearly.
(olh) Outlier low/high: omits points with a height above ground below the low value or above the high value. Our tests did not show any noteworthy positive effect on low-noise removal or otherwise.
(t) Tolerance to the final surface for points to be considered ground/non-ground (by default, the weight value is used to filter the points). No improvements were achieved; the use of the default weight is recommended.
(fs) Final smooth refers to a slightly more aggressive smoothing of the intermediate surface model just before the final point selection process. Not to be used for archaeology.
Other switches (smooth, aparam, bparam) are not intended for use in a production environment.
Cloth simulation filter is based on a 3D computer graphics algorithm that is used to simulate cloth. First, a LiDAR point cloud is inverted and then a rigid cloth is used to cover the inverted surface. By analyzing the interaction between the cloth nodes and the corresponding LiDAR points, the locations of the cloth nodes are determined and used to generate an approximation of the ground surface. Finally, the ground points can be extracted from the point cloud by comparing the original points with the generated surface. All software implementations are strict.
Price: free. Algorithm reference: [58]. User-facing parameters: (s) Scene determines the rigidness of the cloth: very soft to fit steep slopes, medium for rugged terrain, and hard for flat terrain (default: medium). No discernible differences between very soft and medium were observed; hard was not suitable.
(r) Cloth resolution refers to the grid size of the cloth with which the terrain is covered, i.e., the distance between the particles in the cloth (default: 2.0). The same value as the resolution of the final DEM worked best; lower values (e.g., 1/2 the cell size of the final DEM) introduced artifacts. Lowering the value increases the processing time exponentially.
(it) Max. Iterations: maximum iterations for the cloth simulation (default: 500 iterations). A significant increase of the iterations (max. value 1000) produced modest improvements at the expense of a linear increase of the processing time.
(th) Classification threshold: the distance to the simulated cloth to classify a point cloud into ground and non-ground (default: 0.5). On flat terrain lower values (e.g., 0.15) slightly reduced T2 error, but on average or rugged terrain the default value is recommended.
(sp) Slope processing: reduces errors during post-processing when steep slopes are present (default: off). Switching on significantly improved handling of cliffs and low walls, but resulted in noticeable T2 errors, e.g., "tree stumps" on slopes and buildings on the flats.
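A toy one-dimensional version conveys the cloth idea; the real algorithm works on a 2D particle grid with configurable rigidness, and all names and values here are illustrative:

```python
def csf_1d(z, step=0.1, iterations=200, threshold=0.5):
    """Toy 1D cloth simulation in the spirit of CSF: invert the surface,
    let one cloth node per point fall under gravity, block it at the
    inverted surface, and pull interior nodes toward their neighbours
    (internal/rigidity forces). Points near the settled cloth are ground."""
    inv = [-v for v in z]                        # invert the point cloud
    cloth = [max(inv) + 1.0] * len(z)            # cloth starts above everything
    for _ in range(iterations):
        cloth = [c - step for c in cloth]                  # gravity
        cloth = [max(c, s) for c, s in zip(cloth, inv)]    # collision
        for i in range(1, len(cloth) - 1):                 # rigidity
            target = 0.5 * (cloth[i - 1] + cloth[i + 1])
            cloth[i] = max(inv[i], 0.5 * (cloth[i] + target))
    # classification threshold: distance between point and simulated cloth
    return [abs(c - s) <= threshold for c, s in zip(cloth, inv)]

# Flat ground with a single 5 m "tree" return in the middle.
labels = csf_1d([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
```

The rigidity term is what keeps the cloth from sinking into the pit that the inverted tree creates, which is exactly how CSF separates vegetation from terrain.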

Appendix A.4. Segmentation Based Filtering
A segmentation-based filter (SegBF) identifies ground points using a segmentation-based approach built on the assumption that the largest segment in any airborne LiDAR dataset is the ground surface. This implementation uses the random sample consensus (RANSAC) method to identify points within a point cloud that belong to planar surfaces; the segmentation is then performed with region-growing techniques based on normal vectors, resulting in planar segments.
In this implementation, the algorithm begins by attempting to fit the planar surfaces to all points within a user-defined radius. The planar equation is stored for each point for which a suitable planar model can be fitted. A region-growing algorithm is then used to match nearby points with similar planar models. The similarity is based on a maximum allowable angular difference between the plane normal vectors of the two neighboring points. Segment edges for ground points are determined by a maximum allowable height difference between neighboring points on the same plane.
Software: WhiteboxTools (LidarSegmentationBasedFilter), jblindsay.github.io. Price: free. Reference: [59,60]. User-facing parameters: (r) Search radius, within which planar surfaces are fitted (default: 5). Values lower than the default worked best, without correlation to either point density or terrain type; increasing the value lengthens the processing time exponentially.
(s) Maximum difference of normal vectors considered for similarity (default: 2 deg). Values between 1 and 2 worked best without correlation to point density or terrain type.
(h) Maximum difference in elevation between neighboring points in the same segment (default: 1). In all but one case, the value 0.5 worked best.
General note: in our best results there is a direct correlation between (r) and (s): (s) ≈ 1/2 (r).
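The region-growing step described above can be sketched as follows; the RANSAC plane-fitting stage is replaced here by precomputed per-point normals, and all points, normals, and names are illustrative:

```python
import math

def grow_segments(points, normals, radius=5.0, max_angle_deg=2.0, max_dz=1.0):
    """Region-growing sketch: a neighbour joins a segment when the plane
    normals differ by less than `max_angle_deg` and the heights by less
    than `max_dz`. The largest segment would be taken as ground."""
    cos_min = math.cos(math.radians(max_angle_deg))
    n = len(points)
    seg = [-1] * n                               # -1 = not yet assigned
    current = 0
    for seed in range(n):
        if seg[seed] != -1:
            continue
        seg[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            xi, yi, zi = points[i]
            for j in range(n):
                if seg[j] != -1:
                    continue
                xj, yj, zj = points[j]
                if math.hypot(xi - xj, yi - yj) > radius:
                    continue
                dot = sum(a * b for a, b in zip(normals[i], normals[j]))
                if dot >= cos_min and abs(zi - zj) <= max_dz:
                    seg[j] = current
                    stack.append(j)
        current += 1
    return seg

# Flat ground (normals straight up) plus one tilted "roof" point.
pts = [(0, 0, 0.0), (2, 0, 0.1), (4, 0, 0.0), (2, 2, 4.0)]
nrm = [(0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0), (0, 0.7, 0.714)]
segments = grow_segments(pts, nrm)
```

The ground points merge into one segment while the tilted point, failing both the normal-angle and height tests, forms its own.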
Appendix A.5. Other Filters
Appendix A.5.1. MCC
The multiscale curvature classification of ground points in airborne LiDAR point clouds iteratively classifies as non-ground the points that exceed positive curvature thresholds at multiple scales. It uses thin-plate spline interpolation to generate surfaces at different resolutions and progressive curvature tolerances to eliminate non-ground returns. During processing, the scale and initial curvature are varied over three domains to account for variable canopy configurations and their interaction with terrain slope: the initial scale is multiplied by 0.5, 1, and 1.5, while 0.1 is added to the initial curvature in each domain. The software implementation is strict.
Software implementation: MCC-LIDAR 2.2, sourceforge.net/projects/mcclidar. Price: free. Algorithm reference: [61]. User-facing parameters: (λ) The scale refers to the size of the search window used for the interpolation of the points (default: 1.5). Values between 1.0 and 2.0 worked best, so the default value is an excellent starting point for experimenting.
(t) Minimum height deviation from the interpolated surface for a non-ground point (default: 0.3). Values between 0.2 and 0.25 worked best, which means that the default value might be set slightly too high for archaeology.
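The three-domain loop described above can be sketched on a grid; a mean filter stands in for the thin-plate-spline surface, so this is a heavily simplified illustration, not the MCC-LIDAR implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mcc_grid(z, scale=1.5, curvature=0.3, cell=1.0):
    """Simplified multiscale curvature classification: cells exceeding the
    smoothed surface by the curvature tolerance are non-ground; the three
    scale domains adjust (scale, curvature) as described in the text."""
    nonground = np.zeros(z.shape, dtype=bool)
    surface = z.astype(float)
    for k, factor in enumerate((0.5, 1.0, 1.5)):
        win = max(3, int(round(scale * factor / cell)) | 1)  # odd window size
        tol = curvature + 0.1 * k                 # +0.1 per domain
        smoothed = uniform_filter(surface, size=win)
        above = (surface - smoothed) > tol
        nonground |= above
        surface = np.where(above, smoothed, surface)  # drop removed returns
    return nonground

# Flat terrain with one off-ground return poking above the surface.
z = np.full((9, 9), 10.0)
z[4, 4] += 2.0
mask = mcc_grid(z)
```

Removing classified returns before re-interpolating is what lets later, coarser domains operate on an increasingly vegetation-free surface.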
Appendix A.5.2. BMHF
Blue Marble Geographics' hybrid filter is a two-part process that first removes points that are likely not ground and then compares the remaining points to a modeled curved 3D surface within a window. It is an implementation of MCC that uses SBF for pre-processing to speed up the processing significantly.
Software: Global Mapper 21.x, bluemarblegeo.com. Price: $1098. Reference: bluemarblegeo.com. User-facing parameters: (ht) Maximum height delta refers to the expected elevation range of the ground in the point cloud (default: 50). The default value worked best for all datasets.
(st) Expected Terrain Slope, i.e., threshold value for the inter-point slope (default: 7.5 deg). Higher values worked best for dense datasets with steep slopes, lower values were used for low-density data sets and the lowest value for flat terrain.
(b) Maximum width of buildings; large values cause longer processing times (default: 100). Values between 10 and 25 worked well, but since this is an iterative process (larger values remove big and small buildings), the default value is recommended.
(bin) Base bin size refers to the size of the search window used to interpolate the points and can be set either in meters or in points (the average is calculated for the loaded data; default: 3 points). The default value of 3 points worked well; the value in meters is recommended when reproducible results are needed across different data sets; often 3 times the final DEM cell size worked best.
(h) Minimum height deviation from the interpolated surface for non-ground points; larger values (such as 1 to 3) are necessary for low-density point clouds (default: 0.3). The default value worked best in all data sets.
Table A1. Values used in each of the best results.

Tested Settings
Best Settings