Improved supervised classification of bedrock in areas of transported overburden: Applying domain expertise at Kerkasha, Eritrea

A regional bedrock map provides a foundation from which to build geological interpretations. However, rapid and accurate bedrock mapping in an area that lacks outcrop is a common problem, especially in regions with sparse data. A historic bedrock map from an Au and base metal project in the Kerkasha district, Eritrea, is significantly improved by predicting bedrock distribution in areas previously mapped as transported overburden. Publicly-available remote sensing data (DTM and ASTER) were combined with airborne geophysical data (magnetics and radiometrics) to provide features for bedrock prediction. Remote sensing data were pre-processed using Principal Components Analysis (PCA) to yield an equal number of principal components (PCs) as input features. Four iterations were trialled, using different combinations of remote sensing PC features. The two initial trials used all available remote sensing data but compared results when feature ranking and selection were applied to reduce the number of PCs used for training and classification. The subsequent two trials used subsets of the available remote sensing data, selected based on domain expertise (i.e., the domain-specific knowledge of a geologist), with all respective PCs retained. Five-fold cross-validation scores were highest when DTM, magnetics, and radiometrics data were included as input features. However, qualitative visual appraisal of predicted results across trials, complemented by maps of class membership uncertainty (using a measure of entropy), indicates that geologically-meaningful results are also produced when radiometrics are omitted and only the DTM and magnetics are used. The study concludes with a generalised workflow to assist geologists who are seeking to improve the bedrock interpretation of areas under cover in a single area of interest. Domain expertise is shown to be critical for the selection of appropriate input features and validation of results during predictive lithologic mapping.


Introduction
Supervised machine learning (ML) using remote sensing data provides an objective and reproducible way to create predictive lithological maps. In particular, the Random Forests (RF) classification algorithm is an effective means to produce machine-assisted maps of bedrock distribution (Harris et al., 2011; Behnia et al., 2012; Cracknell et al., 2014; Harris and Grunsky, 2015; He et al., 2015). The statistical comparison between supervised classification models is traditionally achieved using cross-validation (Stone, 1974; Moore, 2001; Forman and Scholz, 2010). However, machine-assisted maps are spatial interpretations of geology which must conform to geological constraints (e.g., cross-cutting relationships); therefore they can also be evaluated subjectively (Mather and Koch, 2011; Harris et al., 2014; Brungard et al., 2015; He et al., 2015). This study focuses on how bedrock classification model performance can be improved and assessed using domain expertise (i.e., the domain-specific knowledge of a geologist), by combining quantitative and qualitative approaches.

Background
Machine learning input data, called features, can be more effective if they have been pre-processed (Liu and Motoda, 1998). Feature construction is a process to represent information about the relationships between features and augment the feature space by inferring or creating additional features (Matheus, 1991; Wnek and Michalski, 1994). Principal Components Analysis (PCA) is a method to represent data as a linear recombination of features, where the resulting eigenvectors and eigenvalues relate to dataset variance (Pearson, 1901; Jolliffe, 2002, 2011). This approach can have potential benefits when applied to geoscientific remote sensing data used to inform ML. One common application is to reduce feature dimensionality, in order to limit overfitting during supervised learning (Bellman, 1961; Hughes, 1968; Halmy and Gessler, 2015; Raczko and Zagajewski, 2017). Another application is geological feature construction for qualitative appraisal of geological processes or supervised classification (Burl et al., 1998; Crosta et al., 2003; Tesfahun and Bhaskari, 2013; Mustafa et al., 2017). Furthermore, PCA has been demonstrated as an effective means of noise removal before spectral-image classification (Rodarmel and Shan, 2002; Mather and Koch, 2011).
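As a minimal sketch (using synthetic data, not the datasets of this study), PCA with scikit-learn yields the loadings (eigenvectors) and explained-variance ratios referred to above; when as many PCs are retained as input features, all dataset variance is conserved:

```python
# Sketch: PCA as a linear recombination of features (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                               # 500 samples, 5 hypothetical features
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)   # a correlated pair

pca = PCA(n_components=5)       # retain as many PCs as input features
scores = pca.fit_transform(X)   # samples projected onto the PCs

loadings = pca.components_                  # eigenvectors (one row per PC)
explained = pca.explained_variance_ratio_   # proportion of variance per PC
print(explained.sum())                      # 1.0 when no PCs are dropped
```

Dropping trailing PCs (feature reduction) discards the variance they carry, which is the trade-off the trials below compare.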
The relationship between data-driven and knowledge-driven feature selection has been widely discussed (Liu and Motoda, 1998; Yang and Honavar, 1998; Guyon and Elisseeff, 2003; Kalousis et al., 2007; Gregorutti et al., 2017). Random Forests provides built-in feature selection through a bootstrap-aggregation process referred to as bagging (Breiman, 1996). Bagging uses approximately two-thirds of the training samples (the in-bag samples), obtained via random sampling with replacement, to create each prediction. At each node in a decision tree, a random selection of predictor features is made, from which the optimal split on the best available feature is found. However, feature selection can also be applied manually before classification, if domain knowledge is used to choose input data. For example, maps of feature data or mapped PCs can be examined and related to geological entities (Grunsky and Smee, 1999; Reimann, 2005; Grunsky, 2010).
The RF classification algorithm employs an ensemble of decision trees to categorise data (Breiman, 2001). These trees comprise a set of hierarchical conditions applied consecutively, from a parent node to a child node, to make predictions of the class labels contained within a set of training data. Each decision tree in the ensemble is trained on a bootstrapped subsample of the training data. Individual predictions are then aggregated to produce a final classification: the final prediction output of RF is the category predicted by the majority of the decision trees. Class membership probabilities are calculated from the proportion of predictions produced by all the decision trees for a possible class. Entropy quantifies the disorder of classification results and is a function of the number of candidate classes and the class membership probability of samples (Cracknell and Reading, 2013; Kuhn et al., 2016).
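The vote-aggregation idea can be sketched with scikit-learn on toy data (illustrative only; not the study's classifier): `predict_proba` returns the tree-vote proportions described above, and the majority-vote class is the one with the highest proportion.

```python
# Sketch: RF class membership probabilities as tree-vote proportions (toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = rf.predict_proba(X[:5])  # proportion of trees voting for each class
pred = rf.predict(X[:5])         # majority-vote class label

# The predicted class is the class with the highest membership probability
assert (proba.argmax(axis=1) == pred).all()
```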

Case study area
The geology of Eritrea is characterised by the Precambrian basement of the Arabian Nubian Shield, which is unconformably overlain by Mesozoic to Cainozoic rocks. The country's regional geology is divided into four distinct terranes: Nakfa, Adobha Abiy, Arag and Hagar (Fig. 1; Drury and Berhe, 1993; De Souza Filho and Drury, 1998; Drury and De Souza, 1998). Each terrane is structurally bounded by approximately north-to-south trending shear zones. The late Neoproterozoic collision between East and West Gondwana concentrated transpression in the juvenile crust of the Arabian Nubian Shield in Eritrea along at least two steep, curvilinear crustal-scale belts, the Augaro-Adobha Belt (AAB) and the Asmara-Nakfa Belt (ANB; Johnson et al., 2011; Fritz et al., 2013). Between these belts is a domain of relatively lower metamorphic grade, which hosts the Kerkasha case study area.
The ~1000 km² Kerkasha mineral exploration leases (Fig. 2) are situated ~200 km west of Asmara. Approximately 39% of the area has been mapped as transported cover (alluvium or colluvium). The landscape is characterised by low-lying, monotonous plateaus (900-1400 m above sea level) bordered by steep slopes. The Kerkasha project is within the Nakfa terrane and comprises Neoproterozoic volcano-sedimentary units adjacent to mafic to felsic intrusions. Local geology has been interpreted as granitoid, granodiorite, and diorite intrusions emplaced into pre-tectonic metasedimentary and mafic-to-felsic volcanic assemblages (Fig. 2; Internal Company Report, 2011, 2012). The stratigraphy is overprinted by numerous deformation events that locally obscure the primary fabric of rocks, especially in metasedimentary and metavolcanic rocks. However, the granitoids are comparably less deformed than the volcanics.

Data and methods
Data used in this study are presented in Table 1 and summarised in Fig. 3, with more detailed information given in Appendix A: Data. Data processing methods are summarised in Fig. 4, and are based on RF bedrock mapping classification studies (cf. Cracknell and Reading, 2013; Cracknell and Reading, 2014; Cracknell et al., 2014; Carranza and Laborte, 2015; Harris and Grunsky, 2015; Kuhn et al., 2016, 2018). The workflow was scripted in the Jupyter programming interface (Pérez and Granger, 2007), using the Python programming language and the numpy (Van Der Walt et al., 2011), pandas (McKinney, 2010), scikit-learn (Pedregosa et al., 2011), and imbalanced-learn (Lemaître et al., 2017) library modules. Geographic Information Systems (GIS) work was completed in Quantum GIS (QGIS; the Open Source Geospatial Foundation Project: http://qgis.osgeo.org).
Remote-sensing information is used to inform model training and testing, and to predict features of the unknown data. The dataset includes a digital terrain model (DTM) from the Shuttle Radar Topography Mission (SRTM; Farr et al., 2007), Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER; Abrams, 2000) thematic maps, and magnetics and radiometrics from a helicopter-borne Versatile Time-domain Electromagnetic (VTEM) survey. Elevation, magnetic, and radiometric data were selected as input features because of their demonstrated efficiency for bedrock prediction under transported cover (Yu et al., 2012; Cracknell and Reading, 2014; Cracknell et al., 2014; Harris and Grunsky, 2015; Kuhn et al., 2018). ASTER multispectral reflectance data were included for their potential to discriminate bedrock in arid environments (Yamaguchi and Naito, 2003; Rowan et al., 2005; Qari et al., 2008). A pre-existing interpreted geological map was used to provide bedrock training classes. Extracted classes were spatially matched to remote-sensing raster images by stacking data layers in QGIS (Table 1) and sampling to a 100 m spaced grid of points. The workflow was applied four times to these point data (trials I to IV; Fig. 4). Extraction of data from the stacked GIS resulted in 97,737 points. Bedrock unit codes and class frequency are given in Table 2.
Irregular coverage of raster layers meant that some features could not be extracted near the license boundary margin, so these points were removed from the dataset. Rows marked as transported overburden (colluvium and alluvium) were dropped from the train-test dataset for bedrock predictions. These samples, mapped as transported material, were set aside for later classification. The minority class (Granitoid 3, gp; n = 4) was combined with another class (Granitoid 1, g; n = 2007), the latter of which is mapped in the vicinity of that unit. After filtering, 96,206 samples remained. Before each run, a full Singular Value Decomposition (SVD) of the data was performed as part of PCA. As data should be normalised before PCA (Wettschereck et al., 1997), each feature was independently scaled using a robust z-score approach (Samuelson, 1968), i.e., setting the median to 0 and scaling the data span to the interquartile range.
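The robust z-score described here (median to 0, span scaled to the interquartile range) corresponds to scikit-learn's `RobustScaler`; a minimal sketch on an outlier-bearing hypothetical feature:

```python
# Sketch of the robust z-score: centre on the median, scale by the IQR.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # feature with an outlier
scaler = RobustScaler()            # median -> 0, IQR -> 1 by default
Xs = scaler.fit_transform(X)

print(scaler.center_)              # the feature median (here 3.0)
assert np.isclose(np.median(Xs), 0.0)   # median mapped to zero
```

Unlike a conventional z-score, the outlier does not inflate the scale factor, which is why this approach suits skewed geophysical features.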
There was a three order of magnitude difference between the new minority class (Metasediments, s; n = 68) and the majority class (Intermediate metavolcanics, via; n = 13,750). Over- and under-representation of class membership has the potential to bias machine learning algorithms and can lead to spurious results (Henery, 1994; Japkowicz and Stephen, 2002; Wang and Yao, 2012; Cracknell and Reading, 2014). Many options are available to address class imbalance before classification. These include the combination of under-sampling over-represented classes and over-sampling under-represented classes (Wang and Yao, 2012; Barua et al., 2014; Agrawal et al., 2015; Abdi and Hashemi, 2016; Gosain et al., 2016; Sáez et al., 2016; Mustafa et al., 2017; Fernández et al., 2018). We opted to use a combination of under-sampling and synthetic over-sampling to create a balanced train-test dataset. To balance the number of samples for each rock type, the majority class was first under-sampled using Tomek links (Tomek, 1976; Kubat and Matwin, 1997). Tomek links are pairs of similar instances in feature space that represent different classes. The removal of these instances increases the distance in feature space between the two classes, facilitating the classification process. After resampling using Tomek links, the majority class was still two orders of magnitude greater than the minority class (Table 2). Classes with sample frequency greater than the median (n = 821) were then reduced to the median value by randomly removing samples. Classes with sample frequency less than the median were over-sampled to the median value using SMOTE (Synthetic Minority Over-sampling Technique; Chawla et al., 2002), followed by removing similar synthetic samples using Tomek links. In this regard, Tomek links is used to remove similar samples created after applying SMOTE over-sampling (Lemaître et al., 2017). The final number of samples available for training and testing the RF classifier was 17,358. Changes in sample frequency are given in Table 2.

Fig. 2. Historic geological mapping of the Kerkasha project, produced from outcrop mapping and aerial magnetic data (Internal Company Report, 2012). Transported overburden (alluvium/colluvium) covers about 39% of the map area (plotted as white regions). Training data for classification is taken from a 100 m by 100 m grid of points over mapped rock types and excludes areas of transported overburden. Coordinates are given in WGS84/UTM Zone 37N for this, and following, figures.
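The balancing idea of undersampling to the median class frequency and SMOTE-style oversampling up to it can be sketched in plain numpy (this is an illustration with hypothetical class counts, not the imbalanced-learn implementation used in the study, and it omits the Tomek-link cleaning steps):

```python
# Numpy sketch: balance classes to the median class frequency.
import numpy as np

rng = np.random.default_rng(42)
classes = {"via": rng.normal(0, 1, (1375, 2)),   # hypothetical majority
           "g":   rng.normal(3, 1, (200, 2)),
           "s":   rng.normal(6, 1, (7, 2))}      # hypothetical minority

median_n = int(np.median([len(v) for v in classes.values()]))  # target frequency

balanced = {}
for name, X in classes.items():
    n = len(X)
    if n > median_n:            # random undersample down to the median
        X = X[rng.choice(n, median_n, replace=False)]
    elif n < median_n:          # SMOTE-style oversample: interpolate between
        a = X[rng.integers(0, n, median_n - n)]      # randomly paired samples
        b = X[rng.integers(0, n, median_n - n)]
        synth = a + rng.random((median_n - n, 1)) * (b - a)
        X = np.vstack([X, synth])
    balanced[name] = X

assert all(len(v) == median_n for v in balanced.values())
```

In practice the study chains `TomekLinks`, random undersampling, and `SMOTE` via imbalanced-learn, but the target-frequency logic is the same.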
For each trial, the number of PCs was initially set equal to the number of input features, to account for 100% of data variance. Trial I used all 23 available input features (DTM, magnetics, radiometrics, and ASTER) as test and train data, and the top-ranked PCs. The number of important features was selected using an F1 score (the harmonic mean of precision and recall; Rijsbergen and Joost, 1979; Powers, 2011), based on 23 classification trees. Trial II used all 23 available input features, and 23 PCs as training features. Trial III used DTM, magnetics, and radiometrics (13 input features), and the 13 PCs generated from these as training features. Trial IV used DTM and magnetics (seven input features), and the seven PCs generated from these as training features.

RF model building and validation
The application of RF to lithological mapping is well described in the literature, as are discussions on the selection of RF hyperparameters (Carranza, 2002; Waske et al., 2009; Cracknell and Reading, 2014; Carranza and Laborte, 2015; Harris and Grunsky, 2015; Rodriguez-Galiano et al., 2015; Harvey and Fotopoulos, 2016; Kirkwood et al., 2016; Kuhn et al., 2016, 2018; Ordoñez-Calderón and Gelcich, 2018). The most important hyperparameters relate to the number of trees, the number of features considered at each split, the maximum depth of individual trees, and the minimum number of samples in terminal nodes.

Table 1
Summary of data used for study. Historic mapping is sourced from outcrop observations, coupled with airborne magnetics from a VTEM survey (line spacing of 200 m; Bell, 2011). The Digital Terrain Model is sourced from the Shuttle Radar Topography Mission (Farr et al., 2007). Magnetic and radiometric features were produced via airborne VTEM survey. Mineralogical features were produced via ASTER satellite.

The RF algorithm hyperparameters were set as follows: 500 decision trees were trained for each classifier, and the square root of the feature number (rounded to a whole integer) was used as the number of features considered at each split. A fixed random seed of 42 was used. The Gini index (Gini, 1912; Breiman, 1984) was used as the criterion for impurity at splitting nodes. Decision trees were not pruned, and no limitation was placed on the minimum number of samples within a node.
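As a sketch, these settings correspond to the following scikit-learn configuration (the study's original script is not reproduced here):

```python
# The stated RF hyperparameters expressed with scikit-learn.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,       # 500 decision trees per classifier
    max_features="sqrt",    # sqrt of the feature count considered at each split
    criterion="gini",       # Gini index as the impurity criterion
    max_depth=None,         # trees are not pruned
    min_samples_leaf=1,     # no limit on the samples within a terminal node
    random_state=42,        # fixed random seed
)
```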
In trial I, an optimal number of PC features was selected by first ranking input features using measures of mean impurity decrease during sample splits at RF nodes. Mean impurity decrease was measured using the Gini index (Breiman, 2001). An optimal number of features was then selected using an "F1 sweep" (where an F1 measure is calculated iteratively for a forest of trees, where the number of trees is equal to the number of features and each tree is only split once). For trials II to IV, all PC features were used.
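The impurity-based ranking is exposed by scikit-learn as `feature_importances_` (mean decrease in Gini impurity, normalised to sum to one); a toy-data sketch of ranking features this way:

```python
# Sketch: rank features by mean impurity decrease (toy data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Indices of features, most important first
ranked = np.argsort(rf.feature_importances_)[::-1]
print(ranked[:3])   # the top-ranked features
```

The importances sum to 1, so each value can be read as a feature's share of the total impurity decrease across the forest.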
Classification-model assessment involved five-fold cross-validation on the balanced training data (Table 3). Precision and recall are given in Table 4, for each lithology, as a way to represent the map producer's and user's accuracies (Story and Congalton, 1986). Metrics used to score models were precision, recall, overall accuracy, and the F1 measure, averaged over the five iterations of the k-fold cross-validation (Table 6).
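A minimal sketch of this scoring scheme with scikit-learn's `cross_validate` (toy data, not the Kerkasha dataset; macro averaging is assumed here for the multi-class precision, recall, and F1):

```python
# Sketch: five-fold cross-validation with the four scoring metrics.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)
scoring = ["precision_macro", "recall_macro", "accuracy", "f1_macro"]
cv = cross_validate(RandomForestClassifier(random_state=0), X, y,
                    cv=5, scoring=scoring)

for s in scoring:   # each metric averaged over the five folds
    print(s, cv[f"test_{s}"].mean())
```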
Recall (Rijsbergen and Joost, 1979; Powers, 2011) is defined as:

Recall = TP / (TP + FN)

F1 (Rijsbergen and Joost, 1979; Powers, 2011) is defined as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

where TP is the number of true positives, FN is the number of false negatives, and Precision = TP / (TP + FP), with FP the number of false positives. After cross-validation scoring in each trial, all training data were used to train a classification model. This model was applied to predict bedrock classes in the original unbalanced data (including the samples tagged as transported cover) to create predictions for unknown data (i.e., transported overburden), with median entropy and variance of entropy tabulated for comparison to other metrics.
Finally, classified points and entropy values were plotted in map space. Conventional geology maps display confidence by separating observations from inferences, e.g., solid versus hashed geological contact lines, or inclusion of field-observed outcrop polygons within mapped geological domains. For RF-assisted bedrock prediction maps (machine-assisted maps), information entropy (Shannon, 1948) provides an effective way to measure uncertainty. Entropy is defined as:

H = −k Σ (i = 1 to n) p_i log(p_i)

where the class membership probability is given by p_i at location (sample) i, n is the number of candidate classes, and k is a constant, typically 1 (Shannon, 1948).
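A minimal sketch of this entropy calculation (assuming k = 1 and normalising by log n so that values span 0 for a certain prediction to 1 for maximal disorder, consistent with the 0-1 entropy ranges reported here):

```python
# Sketch: normalised Shannon entropy of class membership probabilities.
import numpy as np

def entropy(proba, k=1.0):
    """proba: (n_samples, n_classes) array of class membership probabilities."""
    p = np.clip(proba, 1e-12, 1.0)            # avoid log(0)
    h = -k * np.sum(p * np.log(p), axis=1)    # Shannon entropy per sample
    return h / np.log(proba.shape[1])         # normalise to [0, 1]

certain = np.array([[1.0, 0.0, 0.0]])   # one certain class -> entropy ~0
uniform = np.array([[1/3, 1/3, 1/3]])   # maximal disorder   -> entropy 1
print(entropy(certain)[0], entropy(uniform)[0])
```

In practice `proba` would be the RF class membership probabilities (vote proportions) for each map pixel.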

Results
Fig. 5 gives an example of scaled PC loadings. Eigenvalues for these PCs are presented in map space in Fig. 6. Full results from PCA are given in Appendix B: Principal components analysis. In trial I, PC1 is dominated by the negative covariance of ASTER ferric Fe composition and Fe-MgOH content (Fig. 5A), PC2 is dominated by RTP1VD and RTP2VD (Fig. 5B), and PC3 is dominated by RTPTILT, RTP2VD, AS, and ASVI (Fig. 5C). PC7 shows a range of covariance relationships, with the strongest negative covariance between DTM (positive) and Th (negative). Relative covariance for PC8 is similarly mixed, with the strongest negative correlation between RTP2VD and eU/eTh (positive), and eU and eTh (negative). Finally, PC23 is dominated by AS (positive) and ASVI (negative). The top-ranked PC for trials III and IV is shown in Fig. 5E and F, respectively. The highest-ranked PC feature for each trial is listed in Table 5, along with the strongest eigenvalue loading of input data.
The features that contribute the most information during RF classification are shown in Fig. 7A-C, and relative feature importance is indicated in Fig. 7D-F. In trials I and II, eight features provide the majority of information useful for bedrock classification (Fig. 7A). PC1 accounts for the majority of dataset variance (31.8%) but is ranked as the 9th most important variable of 23 total PCs, PC2 accounts for the second-most dataset variance (20.3%) and is ranked 18th, and PC3 (13.0% of dataset variance) was ranked as the most important feature for classification (Fig. 7B). PC23 accounts for the least dataset variance (5.1 × 10⁻³%) but was ranked as the 4th most important variable for RF classification.
In trial III, PC2 was ranked most highly by RF and accounts for 25% of the dataset variance. In trial IV, PC4 was ranked most highly and accounts for 5.8% of dataset variance. Map patterns for selected PCs are presented in Fig. 6. In trials I and II, PC1 (Fig. 6A) has high values that are mainly coincident with alluvium (Fig. 2). PC2 (Fig. 6B) very faintly indicates the edges of mapped units shown in Fig. 2. PC3 (Fig. 6C) has the highest values coincident with granitoid rocks as mapped in the southwest-central map area (Fig. 2). PC23 (Fig. 6D) represents a northwest to southeast striping of high and low values that does not correspond with any mapped geological features. Trends of these stripes are oriented at a high angle to the dominant structural grain of the map area (Fig. 2). In trial III, PC2 (Fig. 6E) is nearly identical to PC3 from trials I and II. In trial IV, PC4 (Fig. 6F) is visually similar to the pattern of the DTM (Fig. 3A).
The input features and PCs used for model training and sample classification are summarised in Table 6. For trials II to IV, the number of PCs was set equal to the number of input features: trial II = 23, trial III = 13, and trial IV = 7 (i.e., so that all information is conserved).
For each trial, the number of selected features was: trials I and II = 5, trial III = 4, and trial IV = 3. Cross-validation results from all folds are given in Appendix C: Cross Validation Results. The producer's and user's accuracies (precision and recall) for each classified lithology are presented in Table 4, with trial III presenting the most instances of highest values. Scoring metrics from five-fold cross-validation for each trial are presented in Table 6. An example of cross-validation results is given for trial III (which had the optimal values of precision, recall, overall accuracy, and F1). The ranges of these were: precision, from 78.87% (trial IV) to 86.89% (trial III); recall, from 79.85% (trial IV) to 87.25% (trial III); overall accuracy, from 76.93% (trial I) to 86.98% (trial III); and F1, from 78.75% (trial IV) to 86.79% (trial III). Trial III produced the highest cross-validation evaluation metrics and trial I produced the lowest cross-validation metrics (Table 6).

Fig. 4. The workflow of methods used for this study. In the classification step, train-test data are used to produce a RF classifier. As input data are preprocessed, numerical models are created so that data transformations can be applied to unknown data in the prediction stage.
Bedrock prediction maps are presented in Fig. 8. These show progressively less speckling of classified pixels in successive trials. Entropy values for the predicted bedrock maps are presented in Fig. 9. The median of entropy and variance of entropy are presented in Table 6. This measure of uncertainty ranges from 1 (low confidence) to 0 (high confidence). The lowest median entropy was for trial IV (0.626) and the highest was for trial II (0.751). The lowest entropy variance was for trial I (0.036) and the highest was for trial III (0.048).

PCA and retaining all PCs from a full data input set
Machine learning classification of bedrock, using remote sensing data, has improved averaged scoring metrics when all PCs are retained after PCA decomposition (Table 6; trial II vs. I). This finding supports conclusions from remote sensing land-cover classification studies (Rodarmel and Shan, 2002; Castaings et al., 2010; Mather and Koch, 2011), and contrasts with conclusions in the discipline of computer sciences, where feature reduction after PCA is recommended (Tesfahun and Bhaskari, 2013). The linear recombination of covarying features assists classification via the improved separation of classes in feature space (Fortuna and Capson, 2004) and enhancement of signals which relate to target classes (Rodarmel and Shan, 2002; Mather and Koch, 2011). Ranked plots of PC importance (Fig. 7) highlight that explained variance does not correlate with target feature properties. This can be interrogated using mapped eigenvalues for PCs, most of which represent different geological features (e.g., bedrock or regolith domains; Fig. 6).

Domain knowledge assisted feature selection
Bedrock classification results were improved when a reduced set of remote sensing input data was used during PCA, and all resulting PCs retained for model building and classification (Table 6; trials III and IV vs. II). The feature ranking inherent to RF is not sufficient to discriminate spurious features: e.g., the trials that include ASTER data (trials I and II) produced a PC with a striped map pattern, interpreted as a remote sensing artefact (PC23; Fig. 6D). While this feature should logically be down-ranked during model training, it was ranked the 4th most important of 23 total features (Fig. 7D). There is an advantage in comparing the feature ranking of PCs (Fig. 7), PC maps (Fig. 6), and ranked-scaled eigenvectors (Fig. 5). These plots represent relationships between target classes (bedrock), feature space (petrophysical qualities), and the function of the RF model in relating these together. Considered in this way, these three plots provide a means for domain experts to understand and contribute towards bedrock classification studies, especially in situations where a technical background in the subject might be lacking.

Table 2
Lithological unit codes and class balancing. Original refers to the sample points created in a grid over the study area, representing the train-test data. NaN-drop refers to omitted rows of data, where entries marked as 'Not A Number' were filtered, e.g., missing geospatial information from the margin of the study area. Tomek refers to Tomek links, used to eliminate samples from different classes that are very similar in feature space. The column Rand. Undersample denotes the random removal of samples from classes over a median value (n = 821) until they have a frequency of that value (i.e., 821). SMOTE-Tomek refers to a method of over-sampling (minority classes) to a target frequency, in this case 821. Then, synthetic samples are reduced using Tomek links.

Evaluating RF bedrock classification
The features (and data) most highly ranked during classification trials indicate that radiometrics assist in discriminating the most map pixels (Table 5). Classification metrics from five-fold cross-validation in the present study show an improvement of results between trials I to III and a poorer result for trial IV, making trial III the best performing model (Table 6), although trial IV also produced geologically reasonable results (Fig. 8). Because no single model is absolutely correct in a suite of competing models (Elith et al., 2002), we consider that maps from trials III and IV vary but represent complementary products for interpreting local bedrock geology.
Speckling (or noise) in a classified bedrock map relates to isolated pixels or small groups of pixels which do not match the surrounding majority. Reasons for speckling include geological units with a size below or near the resolution of feature data, or local incorrect classification. Speckling is progressively reduced from trial I to IV in the central region previously mapped as colluvium (Fig. 8, location 1). The high degree of speckling in trials I and II is likely related to the inclusion of ASTER data, which represents surficial material with a mixed lithological signal. Based on a geological rationale, the reduction of speckling and increase in spatial contiguity for resultant maps indicates improved classification results. An indication of classification accuracy is present where predictions under overburden match adjacent existing mapping, e.g., Fig. 8 at location 1. Results should also conform to the geological character of an area, in terms of faulting or ductile deformation. An apparent sinistral offset of ~5 km for metabasalt (vm) is present in Fig. 8A, location 3. However, we note that these kinematics conflict with the nearby apparent dextral offset of intermediate metavolcanics (vsi); fieldwork is required to unravel the relevance or relative timing of stratigraphy and faulting. Nevertheless, all units generally conform to the regional structural grain of the area, with pre-tectonic units trending approximately northeast to southwest (e.g., Fig. 8D at location 4).

Table 3
An example confusion matrix for fold 1, trial III, which tested DTM, magnetics, and radiometrics as input features (n = 13) with 13 PCs. The producer's accuracy (the complement of Type I error; precision) and the user's accuracy (the complement of Type II error; recall) are given for each classified lithology, per trial run. The highest accuracy of a given type for each lithology is given in bold, and the frequency of top scores for each trial is given as Freq. A summary of performance measures for the four trial runs is also presented: scores are averaged over five folds. Definitions for precision, recall, and the F1 score are given in the text. These range from 0 to 1, with an optimal value of 1. The median entropy for each run indicates an average measure of overall entropy, ranging from 0 to 1, with an optimal value of 0. The variance of entropy provides a relative indication of whether entropy values are uniformly distributed (low values) or vary between predicted classes (higher values).
Results from trials III and IV exhibit the greatest spatial cohesion of predicted classes, the least speckling of classified pixels, and the highest tendency to conform to the regional tectonic fabric. However, the results from these trials differ. Trial III (Fig. 8C, location 5) does not include a mafic unit observed in trial IV (Fig. 8D, location 5). This discrepancy might result from an erroneous radiometric signature. In contrast, trial IV (Fig. 8D, location 6) does not predict a granitoid unit observed in trial III, which may relate to the necessity of radiometrics to predict that entity (Fig. 8C, location 6).
The lowest average entropy of predicted results, in trial IV, indicates the lowest class membership uncertainty among the predicted maps (Table 6). Low entropy approximates higher confidence in the bedrock prediction map (Cracknell and Reading, 2013; Kuhn et al., 2016). The entropy function is influenced by the number of input features, so absolute entropy values should not be compared when input features vary between models. However, relative patterns of high, low, or mixed entropy represent a qualitative indication of model performance and geological domains (Fig. 9). Speckled entropy in Fig. 8, location 1, likely relates to poor feature representation of bedrock (due to confounding overlying cover) or the non-homogeneous petrophysical character of the volcaniclastic target class (e.g., internal bedding or facies changes). Entropy value variance for trials I to IV ranges from 0.037 to 0.044 (Table 6). Changes to entropy variance indicate the partitioning of variance into geological units, apparent as increasing contrast of high and low entropy values between units from Fig. 9A-D. These mapped zones of contrasting entropy complement the workflow of Fig. 4, which adds a suggested approach for producing "machine-assisted maps" by manually combining the outputs of RF bedrock classification with maps of classification entropy.
The above discussion extends the understanding that cross-validation metrics cannot replace or represent the qualitative appraisal of a geological map by a domain expert, i.e., a geologist. In other domain-specific examples, RF classification maps that use a purely data-driven approach of quantitative model ranking produce results less efficiently than those guided by domain expertise. For example, during RF landslide characterisation (Marjanović et al., 2011; Pham et al., 2016) or soil composition mapping (Dornik et al., 2017), model-scoring metrics help to guide the selection of a classification algorithm which produces practical map products. However, in related studies of landslide maps (Goetz et al., 2015) and soil maps (Woznicki et al., 2019) which include domain knowledge at the methodology and interpretation stages, there is additional value in that feature selection is logical (using domain knowledge) and more efficient (using a lower number of relevant features than would be included using a "naïve" approach). In the latter case of Dornik et al. (2017), 23 additional variables were required during classification to represent what that data-driven study referred to as "local knowledge" (as compared to an earlier study by Brungard et al., 2015).

Conclusions
A historic bedrock map was significantly improved by predicting bedrock distribution for areas previously mapped as transported overburden using Random Forests supervised classification. This result was achieved using DTM + magnetic ± radiometric data and was assessed through quantitative and qualitative means. Principal components analysis is shown to assist Random Forests bedrock mapping when all principal components are retained.
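The concluding workflow (PCA preprocessing with all components retained, followed by Random Forests classification) can be sketched as a pipeline. The synthetic array stands in for stacked raster bands and is an assumption, not the study's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))              # stand-in for DTM + magnetics bands
y = (X[:, 0] - X[:, 1] > 0).astype(int)    # synthetic bedrock classes

# n_components=None keeps every principal component, mirroring the
# trials in which all PCs were retained as input features.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=None),
                      RandomForestClassifier(n_estimators=200, random_state=0))
model.fit(X, y)
```

Dropping input rasters on geological grounds (as in trials III and IV) amounts to slicing columns out of `X` before fitting, rather than discarding PCs afterwards.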
The inclusion of ASTER data produced inferior bedrock maps, despite the demonstrated use of ASTER data for geological mapping in similar regions. Although Random Forests can identify relevant features during model training, this is not sufficiently effective that all features should always be presented during model building. Iterative feature selection, combined with assessment of maps by operators with geological domain expertise, yields measurably better results.

Fig. 1. Map of northwest Eritrea, showing major shear/fault and transpressional zones (modified from Johnson, 2017). The Kerkasha project area is within the Nakfa terrane, approximately 55 km southeast of the Bisha VMS mine.

Fig. 3. Examples of feature inputs used for Random Forests prediction, representing the four groups of remote sensing data. A. Digital elevation model (SRTM). B. Potassium radioactivity (K Rad) from the VTEM survey. C. Reduced to Pole (RTP) magnetics from the VTEM survey. D. Magnesium-hydroxyl composition (MgOH comp) calculated from ASTER image band data.

Fig. 5. Scaled loadings for PCs, showing the ranked covariance between input variables. Figures A. to D. (trials I and II, using all input features) show the first three PCs and the last PC. Figures E. (trial III) and F. (trial IV) show PC2 and PC4, respectively, based on the RF feature ranking given in the results section.
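One common way to compute scaled loadings like those in Fig. 5 is to weight each PCA component vector by the square root of its explained variance; this is a generic sketch on synthetic data, not necessarily the exact scaling used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # two correlated inputs

pca = PCA().fit(X)
# Scaled loadings: rows are input variables, columns are PCs; larger
# magnitudes mean a variable contributes more to that component.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# The two correlated inputs load on PC1 with the same sign, which is
# the kind of covariance pattern Fig. 5 is read for.
```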

Fig. 6. Example maps showing scaled PC loadings from data input, from trials I and II, corresponding to Fig. 5.

Fig. 7. A summary of feature ranking using Random Forests. Figures A. to C. show the diminishing improvement to classification accuracy as more features are added in each trial; for example, after eight features are used in A., classification improvement is marginal. Figures D. to F. show the ranking of feature importance; for example, PC3 is ranked as the most important feature for classification in A. PC values are unique to each of D., E., and F. (i.e., they have been computed for each unique set of input features).

Fig. 8. Results from Random Forests classification, showing a progressive increase in the spatial cohesion of results in areas overlain by transported overburden. Locations 1-6 are discussed in the text. A. Using all available training features (DTM, magnetics, radiometrics, and ASTER) and the top-ranked eight PCs from RF metrics. B. Using identical training features as A. and all PCs. C. Using all PCs created from DTM, magnetics, and radiometrics input features. D. Using all PCs created from DTM and magnetics as input features.

Fig. 9. Entropy maps derived from the predictions shown in Fig. 8. Low entropy corresponds to low uncertainty in prediction results (stable class membership probability for the dominant class), while high entropy corresponds to high uncertainty in prediction results.

Table 5
A summary of the highest-ranked PC feature by RF during model training for each trial, and the corresponding strongest loadings for the input data.