Using CarcassonNet to automatically detect and trace hollow roads in LiDAR data from the Netherlands

.


Introduction
The ever-growing set of remotely sensed data offers opportunities to study the past on a landscape scale [1][2][3]. Especially, the advancement of Light Detection And Ranging (LiDAR) techniques [4][5][6] has facilitated the survey of archaeological traces that were hitherto difficult to investigate, due to forest and other vegetation cover [2,7]. For instance, hollow roads-traces of sunken cart track ways [8] caused by the repeated use of the same route-are hard to discern in the present-day landscape, but appear as clear longitudinal objects in LiDAR data (Fig. 1). While the manual analysis of remotely sensed data is a widespread practice in present-day archaeology [9], the sheer amount of available high-quality data necessitates the use of computer-aided methods for the (semi-)automatic detection of archaeological objects [10,11]. Recent years have seen a steady increase in the implementation of archaeological object detection techniques (see Reference [12] for an overview), with a trend towards Machine Learning and Deep Learning methods [13,14]. Implementations of Deep Learning in archaeology have predominantly used Convolutional Neural Networks (CNNs), a type of image feature extractor and classifier loosely inspired by the animal visual cortex [15]. These algorithms learn, comparable to other Machine Learning methods, to generalize from given examples rather than relying on human-defined parameters or rules. CNNs can be pre-trained on a large, generic dataset, and subsequently transfer learned or fine-tuned [16] on a relatively small, specific dataset. This is a clear advantage for domains where large, labelled datasets are scarce or non-existent, such as archaeology.
patterns, such as historical field systems and roads, has seen little application in the archaeological domain [22]. The few studies focussing on these patterns have generally used more traditional spatial and predictive modelling [23], or specialized (hand-crafted) feature extraction and analysis techniques to detect field systems [22,24] and roads [8,[25][26][27]. Nevertheless, the systematic mapping of hollow roads can provide information on past human movement and historical route networks on both an inter-site and inter-regional level [28]. It can offer a better understanding of human-landscape interactions [29,30], as routes both reflect and influence (large-scale) cultural and landscape processes [27], and play a crucial role in the exchange of resources, knowledge, and ideas [31]. Especially the mapping of regional and local roads can supplement and expand the knowledge gained from historical written sources and cartographic data [32]. Therefore, in the study of historic routes we can distinguish two relevant levels of information: (1) the roads themselves and their precise location in the landscape; and (2) the course of the roads and the resulting route network [30]. These require different types of output from a detection model (see Section 2.2.4). While Deep Learning has successfully been implemented for modern road detection [33], these methods do not translate well to the problem at hand due to differences between hollow and modern roads: (1) hollow roads lack the uniformity that modern roads possess; (2) due to the long temporal and repeated use of specific route zones [34], hollow roads tend to manifest in the shape of several linear tracks with slightly different orientation and with multiple overlaps between the individual tracks; (3) due to geomorphological (e.g., erosion and drift) and anthropogenic interference (e.g., agricultural and building activities), hollow roads are often only partially preserved [35] and regularly are dissected by modern landscape objects. In this paper, we present a novel approach to automatically detect and trace hollow roads in LiDAR data, using a combination of Deep Learning and image processing algorithms. This approach, which we named CarcassonNet, has been specifically developed for the archaeological domain, focusing on being computationally light and suited for reconstructing partially preserved and intersected (hollow) roads. In the next Section (2), the research area and data is introduced, followed by an overview of CarcassonNet. In the subsequent Sections 3 and 4 the results of an experimental evaluation will be presen-ted and discussed. The paper finishes with future developments planned (Section 5).

Data & research area
In this research, three hollow road rich subareas (called Speuld, Terlet, and De Ginkel; in total 93.75 km 2 ) were investigated on the Veluwe, a region (circa 2200 km 2 ) in the central part of the Netherlands (Fig. 2). The Veluwe consists predominantly of forest and heathland, interspersed with agricultural fields and villages and towns of various size (for a detailed overview of the area see [12,20]). The Veluwe holds one of the densest concentrations of archaeological objects in the Netherlands, including prehistoric barrows [36], (post)medieval charcoal kilns [37], and (modern) traces of conflict [38]. Nowadays, the majority of these objects can be found under heath or forest cover. While this has almost certainly contributed to their preservation, this also hinders the physical investigation of these objects and restricts the survey of the surrounding landscape for potential new archaeological objects (see also Reference [7]). Previous research has shown that the Veluwe holds a legion of hollow roads [27], which thus far have gained little scientific attention. Unfortunately, no comprehensive overview of hollow roads on the Veluwe is available. Therefore, a reference standard [39] was manually created by the first author, who has ample experience in analysing LiDAR data and considerable knowledge of the archaeology of the research area (see also Section 2.2.1).
LiDAR data (Fig. 2, Table 1) of the area is freely available from the online repository PDOK [18] and the Actueel Hoogtebestand Nederland [41]. Hollow roads are distinguishable in LiDAR data as longitudinal objects (Fig. 1). These sunken lanes result from the intensive use of specific routes, as travellers preferred to keep to the beaten track to make for greater speed and reduce the risk of getting lost. Wherever the soil was soft and could be turned aside, a series of parallel lanes would develop. When a track became impassable due to seasonal conditions (e.g., weather) or general wear over time, travellers were forced to shift to an adjacent lane, resulting in the pattern of parallel cart-ruts [42], also called route zones [34]. It has been suggested that some of the hollow roads on the  [18]; coordinates in Amersfoort/RD New, EPSG: 28,992; amended from Reference [12]).

Table 1
The parameters of the LiDAR data used in this research, after Reference [40].

CarcassonNet
For the approach presented in this paper inspiration was taken from the well-known board game Carcassonne by Klaus-Jürgen Wrede [44], hence the name CarcassonNet. In this game, a medieval landscape, including roads, is constructed by laying small, squareshaped cards, so-called terrain tiles, end-to-end. While the roads depicted on these tiles differ little individually, extensive road networks of varying shapes can be constructed. The same applies for hollow roads: while these differ when examined as whole objects, the variation between small sections of hollow roads is minimal. Therefore, in CarcassonNet individual sections, as opposed to the whole hollow roads, are used as input images (comparable to the approach taken for the detection of Celtic fields in Reference [20] and for forest roads in Reference [45]). This has several advantages: (1) the use of sections (and therefore multiple pieces per single hollow road) makes it much more cost-effective to create a sufficient training dataset for Deep Learning algorithms; and (2) by using this method, a CNN is only used for image classification, a relatively simple task, instead of more complicated tasks such as segmentation. Another benefit of using classification over segmentation is that hollow roads are difficult to define on a pixel level because of their heterogeneous nature, especially if these traces have been 'transformed' by various natural and anthropogenic processes over time (see also Reference [46]). Therefore, using a CNN for classification will produce better detection results at a lower cost/effort [47]. CarcassonNet consists of three steps: preprocessing, classification, W.B. Verschoof-van der Vaart, J. Landauer Journal of Cultural Heritage xxx (xxxx) xxx-xxx and post-processing (Fig. 3). In the following these steps will be discussed in detail.

Preprocessing
The LiDAR data was loaded into QGIS 3.4 Madeira [48] and processed with the Fill nodata processing tool to reduce the number of no-data points. For the manual labelling, the data was visualized with the Local Relief Model [17] and Openness [49] visualizations from the Relief Visualisation Toolbox 1.3 [5], based on the suitability of these visualizations to represent hollow roads. All hollow roads within the data were manually labelled in QGIS (also see Section 2.3), resulting in a reference standard consisting of a geospatial vector file (shapefile) with polygons demarcating hollow roads.
In earlier research it was noted that using the Digital Terrain Model (DTM) instead of visualized LiDAR data resulted in improved performance of Deep Learning models [50]. Therefore, for the training of our CNN, two datasets were created. In one dataset the LiDAR data was visualized with the hillshade visualization from the Relief Visualisation Toolbox 1.3 [5]. The second dataset consisted of DTM data. The performance of both models was evaluated (see Section 3.2). The data (in Float32 format) in both datasets was split into snippets of 64 × 64 pixels with a 32-pixel overlap to all sides. The size of the snippets is based on the smallest size that yielded useful results. Subsequently, the snippets were normalized by subtraction of the central pixel value so that each snippet has pixel (or grayscale) values between 0 and 255. In order to create a binary dataset, with 'hollow road' and 'empty' snippets, a percentage of overlap was computed between every individual snippet and the polygons in the reference standard. This resulted in a dataset containing circa 10,000 positive examples and approximately 383,000 negative examples ( Table 2).
As only a tiny fraction of the landscape consists of hollow roads, clearly a large imbalance between 'hollow road' and 'empty' examples was to be expected. Such an imbalanced dataset can Table 2 The amount and ratio of snippets classified as "hollow road" or "empty" in the original, unbalanced dataset and the balanced dataset (used in this research).

Classification
Original exert a major impact on the classification capacity of Deep Learning models and can result in bias and low performance [51]. Various approaches have been proposed to address this problem (see Reference [52] for a recent overview). In this research data pruning [53] and down and upsampling [54] were used to 'balance' the dataset. Empirically we found that placing only snippets with an overlap of 95% or more with the mapped hollow roads in our reference standard in the 'hollow road' class and only those with an overlap of 5% or less into the 'empty' class provided the best results when training the CNN. Snippets in between these thresholds were pruned from the dataset in order to make the two classes more distinguishable. Subsequently, the majority class ('empty') was downsampled, i.e., a large random portion of the examples (circa 98%) was removed, and the minority class was upsampled, i.e., randomly duplicated, until both classes contained approximately the same number of instances. This resulted in a dataset of circa 32,000 snippets, with circa 16,000 snippets in both classes (see Table 2). 80% of the snippets (circa 25,900) in the balanced dataset were randomly assigned to the training dataset. The remaining 20% (circa 6470) formed the validation dataset, used to evaluate the quality of the CNN during training.

Classification
For the classification step a Resnet-34 CNN architecture [55], developed using the Fast.ai library [56] on top of Facebook's PyTorch Artificial Intelligence (AI) development framework [57], was used. The Fast.ai library was chosen because it incorporates many stateof-the-art tuning methods to improve the performance of CNNs [58].
The CNN was pre-trained on ImageNet [59] and transfer learned [16] on our own balanced dataset (see above). Transfer learning was done following the 'progressive resizing' training scheme recommended in Fast.ai [56], using three separate phases. In the first phase, the CNN was trained for five epochs, during which the snippets in the training dataset retained their original size (64 by 64 pixels). In the second phase, the CNN was trained for four epochs but, contrary to the prior phase, the snippets in the training dataset were upscaled to 128 by 128 pixels. In both training phases all layers of the CNN were frozen (untrainable), except for the last fully connected layers. In the final phase, all layers of the CNN were unfrozen and the model was trained for three epochs on the upscaled snippets of 128 by 128 pixels. During training, the Adam optimization algorithm was used [13]. Discriminative layer training [60] and data augmentation (horizontal and vertical flipping [13]) were implemented to further improve the performance.

Location-Based Ranking
During the development of CarcassonNet the main sources of false positives were analysed. This showed that 'objects of confusion', with morphology comparable to hollow roads, generally caused these incorrect detections. Therefore, Location-Based Ranking (LBR; [21]) was implemented, to reduce these false positives and improve performance. LBR involves determining, ranking, and mapping of (present-day) landscape characteristics, such as subsoil and land-use, which have had an impact on the preservation and/or visibility of archaeology and result in objects of confusion Table 3 The two ranks used in the Location-Based Ranking map for hollow roads on the Veluwe. Rank 1 has high preservation/visibility for hollow roads, while rank 2 has low preservation/visibility.

Rank
Landscape Feature Source Other - (for a detailed overview see Reference [21]). To determine the principal landscape characteristics involved, a broad-brush landscape characterization [1] was performed. This evaluation identified five processes as being the most detrimental to the preservation and visibility of hollow roads on the Veluwe and causing the most false positives: (post)medieval agricultural areas (plaggen soils; [61]), modern agricultural fields, disturbed areas (quarries, sod cutting, etc.) and to a lesser extent urbanized or built-up areas and modern paths and roads. The best chance for survival of hollow roads can be found in heathland. Based on this a ranked map of the research area was created, on which the assigned ranks correspond to the potential for the occurrence of hollow roads within that zone. Subsequently, hollow road detections were compared to this map and assigned different ranks. Detections in high-ranking zones (rank 1) are more likely to be hollow roads, while detections in low-ranking zones (rank 2) have a much higher likelihood of being false positives. Therefore, LBR can be used to reduce the amount of false positives by ignoring detections in low-ranking zones. Based on the above, a simple two-tiered ranking map (Table 3 and Fig. 4) for the research area was developed, using open-source geo(morph)ological and topographical data from the online spatial data repository PDOK [18].
To test the validity of the proposed LBR map for the research area, the locations of all extant hollow roads in the Speuld subarea (see Fig. 4) were ranked. Table 4 shows the results of the ranking: more than 98.5% of the area covered with hollow roads can be assigned to the top rank (1). This shows the effectiveness of the ranking system and the landscape characteristics chosen. Furthermore, it demonstrates that by only considering rank 1 detections-ignoring detections in rank 2-the number of missed hollow roads will be low, while the number of false positives, caused by these zones, will be reduced.

Post-processing
To adequately investigate the location of the roads themselves, the course of the roads, and the resulting route network (see Section 1) requires different types of output of the post-processing step of CarcassonNet. For the location, areas demarcating the actual roads are sufficient. These areas could be used as a proxy for the road courses, but to increase the usability of the results for the interpretation of the wider route network, these polygons should be converted into lines.
Therefore, in the post-processing step the results of the preceding classification step, i.e., snippets with a class label and a confidence score, are converted into geospatial polygons and subsequently into lines, both usable in a GIS (Fig. 5). All snippets classified as roads are loaded into QGIS 3.4 Madeira [48]. These raster files are turned into vectors, combined into larger polygons, and turned from multipart to individual singlepart features with the Tileindex, Dissolve, and Multipart to singlepart processing tools. Subsequently, these polygons are turned into lines using the approach taken by Reference [62]. A 'skeleton' is created from the polygons with the scikit-image Skeletonize package and subsequently rendered into a graph structure with the sknw package in Python 3.7. The result is a set of closely spaced points depicting the course of the detected roads. These are turned into vectors with the Buffer and Dissolve processing tools in QGIS.

Feedback loop
In the training process of CarcassonNet, a feedback loop between the archaeological interpreter and the classification algorithm was implemented to enhance the quality of the (training) dataset and the classification model (Fig. 6). Initially, the first author rapidly labelled the hollow roads within the dataset (see Section 2.2.1). Subsequently, the algorithm was trained and tested on this data. The incorrect detections (false positives) were manually compared to the LiDAR data to determine their cause. This resulted into multiple unlabelled hollow roads, also called 'new positives' [63], that were missed during the initial tagging. The training dataset was updated with these new positives and used to retrain the algorithm. This process was repeated two times till no new positives were detected. This approach allows for the rapid labelling of the initial training dataset, reducing the time needed for preprocessing. Furthermore, it results in the improved performance and a thorough training dataset.

Metrics
To evaluate the workflow, the areas (in m 2 ) of false negatives (FN), false positives (FP), true negatives (TN), and true positives (TP) were determined, following the approach taken for Celtic fields in Reference [21]. The following metrics were calculated: Precision (P), Recall (R), F1-Score (F1), Accuracy (ACC), and Matthews Correlation Coefficient (MCC; Equations 1-5). Precision measures how many of the selected items are relevant. Recall gives a measure of how many relevant objects are selected. The F1-score is the harmonic average of the precision and recall and a measure of the model's performance [64]. Accuracy gives the ratio of correctly predicted instances to the total of instances in the dataset. MCC is a measure of the correlation between the observed and predicted binary (two-class) classification, even if the classes are very imbalanced [65]. Therefore, MCC is a more reliable indicator of quality of the model over the other metrics used [51]. Precision, recall, F1score, and accuracy are restricted between 0 and 1, while MCC is bound between −1 and 1. For all metrics higher values indicate a better performance. We are aware that the above metrics, while adequate to evaluate classification tasks, are suboptimal to access the quality of connected road networks [62,66]. However, the above metrics are more established in remote sensing and archaeological Machine Learning research and are therefore easier to assess and compare with work from other authors (see also Reference [20]).

Experimental evaluation
In our experiments, the ResNet-34 CNN was trained as detailed above on the balanced dataset. Subsequently, the trained CNN was evaluated on the original dataset from the Veluwe (see Table 2). This dataset was used as an unbalanced dataset better represents the real-world situation of scarce archaeological objects in a complex landscape (see Reference [21]). All experiments were performed on Table 5 the results of the experimental evaluation showing the difference in performance between visualized data (HS) and digital terrain model data (DTM) and the use of Location-Based Ranking (LBR). a NVIDIA GTX 1060 GPU. During the experimental evaluation the confidence threshold was empirically set to 0.80. The results of the experimental evaluation are shown in Table 5. CarcassonNet has a recall (R) of 0.51 and a precision (P) of 0.36. It reaches a F1-score of 0.42, an accuracy of 0.89, and a MCC of 0.38 on the dataset with hillshade visualization (Table 5). When using DTM data (instead of the hillshade) the performance increases with circa 6 points to an F1-score of 0.48 and a MCC of 0.44. Implementing LBR further increases the performance with circa 1-2 points to a MCC of 0.39 for hillshade and 0.47 for DTM. Table 5 clearly shows the problem of using the accuracy and F1-score metrics on an imbalanced dataset. The accuracy is very high, due to a bias towards the majority category (in this case TN). On the other hand, the recall, precision, and F1-score are relatively low, because these metrics are dependent on which class is defined as positive (and ignores the TN category). Therefore, the only metric that gives a reliable result is MCC, as it takes into account the balance ratios of all four categories (TP, FN, TN, and FP; [67]). Visual analyses of the results show that in general larger or wider bundles of multiple hollow roads are well recognized by the model (Fig. 7). In contrast, smaller bundles or individual hollow roads that are significantly narrower than the snippet size (64 by 64 pixels) are often missed or only partially detected. This might seem related to the overlap threshold used in the preprocessing step (see Section 2.2.1). However, lowering the overlap threshold to 25, 50, or 75% only resulted in a decline in performance of the model and no improvement in the prediction quality of 'thinner' roads. Therefore, the overlap threshold seems to be of no influence to the detection of smaller bundles or individual hollow roads. Conceivably, the balance of the classes in the dataset might be of influence to the performance and accuracy. False positives are generally caused by objects of confusion, which are morphologically comparable to hollow roads (Fig. 8). There seems to be no clear pattern in these, as many of the common objects of confusion are removed through LBR. Most of the false positives consist of individual or a few detections (or snippets) clustered together, whereas most true positives consist of a larger amount of clustered detections.
The results of the post-processing step of CarcassonNet (Figs. 5, 7, and 9) show that the conversion of the detections into polygons and lines works adequately on individual or smaller Table 6 The results of the implementation of the feedback loop per subarea.

Subarea
Original (m 2 ) Updated (m 2 ) Difference (m 2 ) Difference (%) bundles of hollow roads. However, when a polygon of a large bundle of hollow roads (with many roads running parallel) is converted to lines a problem arises in which branches, perpendicular on the course of the road, are generated (see Fig. 9). This reduces the readability of the road network. Further research is needed in order to resolve this problem.

Evaluation of the feedback loop
During the development of CarcassonNet, a feedback loop was used to detect new positives and update the training dataset (see Section 3.2). It was assumed that this approach would improve the quality of the dataset and the performance of the CNN model. In Table 6 the total area of the manually labelled road polygons in the original and updated datasets is shown per subarea (see Fig. 2). The difference between the datasets, circa 397,000 m 2 , equals the total area of new positives that were identified. This shows that in the initial labelling on average 5.5% of the hollow roads in the Veluwe dataset were missed. Furthermore, two iterations of the feedback loop improved the performance (MCC) of CarcassonNet by circa 0.2. These results show the benefit of using the feedback loop in the development of datasets for automated detection methods.

CarcassonNet versus modern road detection approaches
The results of the experimental evaluation (Table 5) show that CarcassonNet is able to detect hollow roads in LiDAR data. The strategies implemented in this research resulted in a performance (MCC) of 0.47. Validating the performance of CarcassonNet by comparing it with other automated historical road detection methods proves difficult, as the few studies do not provide (comparable) metrics [26,27,68]. Vletter estimates a precision of over 0.80 for the approach presented in Reference [68] (i.e., over 80% of the detections are segments of roads or paths). However, no other metrics are given. The performance (F1) of recent Machine Learning and Deep Learning detection methods for modern roads in remotely sensed data varies widely, between 0.67 and 0.96 (Table 7). Our approach (F1-score of 0.50) seems to be lower. The main reason for this is the difference between hollow roads and modern roads, as hollow roads consist of dissected, re-used, overlapping tracks that lack uniformity (see Section 1). These factors make it much more complicated to adequately detect hollow roads, compared to modern roads. Furthermore, as shown above the F1-score is not an adequate metric to evaluate the performance of binary road classification. Unfortunately, it has been impossible to calculate the MCC  [17], showing the areas classified as hollow road by CarcassonNet in blue and the reference standard in green (source of the height model: [18]; coordinates in Amersfoort/RD New, EPSG: 28992).  for the below methods, as no exact information on the amount of TP, FN, TN, and FP was given.

DTM versus visualized LiDAR data
The results show that the performance of CarcassonNet varies between visualized and DTM data ( Table 5). This is most likely related to information loss, as proposed by Shannon's channel coding theorem [73]. This theorem from Information Theory states that each processing of information from one representation into ano-ther (in our case from DTM to visualized LiDAR data) reduces, for instance due to rounding errors, but never improves the information content. Therefore, the DTM will contain more informational, less noisy data, as compared to the visualized data.

Archaeological implementation
The method and results presented in this research can be used to get a better understanding of (post)medieval routes on the Veluwe. By combining our mapped (regional and local) roads with informa- Fig. 9. Excerpt of LiDAR data from the Speuld subarea, visualized with Local Relief Model [17], showing the lines depicting the route network generated by CarcassonNet in blue and the reference standard in green (source of the height model: [18]; coordinates in Amersfoort/RD New, EPSG: 28992). tion gained from historical written sources, cartographic data, and modelling [27] a more complete picture of the route networks on the Veluwe can be gained. This will enable the investigation of the relative chronology of these roads (see Reference [74]). Furthermore, our method offers opportunities to efficiently investigate archaeological hypotheses on a landscape scale. For instance, the relation between prehistoric barrows (as possible road markers) and hollow roads [75] can be investigated. Also, the relation between hollow roads and forests, which according to Jager were avoided on the Veluwe due to active highwaymen [42], can be investigated.

Conclusions
In this research we presented a novel approach, named Carcas-sonNet, to automatically detect and trace hollow roads in LiDAR data, using a combination of Deep Learning CNNs and image processing algorithms. CarcassonNet consists of three steps: preprocessing, classification, and post-processing. Contrary to other research, CarcassonNet uses individual sections of roads as input, instead of whole roads. This makes it much more cost-effective to create a sufficient training dataset, and makes the classification task (performed by the neural network) relatively simple, leading to better detection results at a lower cost/effort. The output of CarcassonNet consists of two types of geospatial vectors to efficiently study the roads themselves and their precise location in the landscape (polygons), and the course of the roads and the resulting route network (lines). Experiments show that CarcassonNet is able to effectively detect hollow roads in data from the Veluwe, with a MCC score of 0.47. These experiments also show that using the Digital Terrain Model, instead of visualized LiDAR data (i.e., hillshade) improves the performance with circa 6 points. Implementing Location-Based Ranking, to reduce the number of false positives, further increases the performance with circa 2 points.
Further research will focus on the improvement of the performance of CarcassonNet on small, individual hollow roads, for instance by changing to a segmentation architecture, such as U-Net [76], or by implementing segmentation in the post-processing step. Also, further incorporation of CarcassonNet into QGIS is envisioned. Finally, the development of CarcassonNet is part of an on-going PhD research (in the Data Science Research Programme at the Faculty of Archaeology at Leiden University, the Netherlands) to explore the potential of Deep Learning for the mapping of archaeological objects in remotely sensed data. Earlier research focussed on detecting barrows, Celtic fields, and charcoal kilns, resulting in the WODAN workflow [20,21]. Future research will involve the use of both models in conjunction to detect archaeological objects in a case study area, and investigate the quantitative and qualitative knowledge gain of employing automated detection methods.

Funding
This work was in part supported by the Data Science Research Programme (Leiden University).