Combined land surface emissivity and temperature estimation from Landsat 8 OLI and TIRS

Remote sensing of Land Surface Temperature (LST) generally requires atmospheric parameters and the emissivity ( ) of the target to be estimated. The atmospheric upand downwelling radiances and transmittance can be accurately modelled using radiative transfer models and profiles of relative humidity and temperature, either measured by radiosonde probes or retrieved from assimilating weather models. The estimation of is a large source of uncertainty in the resulting LST product, and there are various approaches using multi-angle observations, multispectral optical or multispectral thermal infrared imagery. In this paper, the estimation of LST from the Thermal InfraRed Sensor (TIRS) on board Landsat 8 is evaluated using more than 6 years of in situ temperature measurements from a network of 14 Autonomous Weather Stations (AWS) in Belgium. is estimated from concomitant atmospherically corrected imagery from the Operational Land Imager (OLI) using two new neural network approaches trained on ECOSTRESS spectra, and an established NDVI based method. Results are compared to using = 1 and the ASTER Global Emissivity Dataset. LST retrievals from L8/TIRS perform well for all emissivity data sources for> 500 matchups with AWS subsoil temperature measurements: Mean Differences 0.8–3.7 K and unbiased Root Mean Squared Differences of 2.9–3.5 K for both B10 and B11. The use of unity emissivity gives the best results in terms of MD (0.8 K) and unb-RMSD (3 K). Similar ranges of unbRMSD are found for> 500 matchups with broadband radiometer temperatures (2.6–3.1 K), that have lower absolute MD values (−2.2–0.6 K). For the radiometer temperatures, both the neural net approaches gave lowest MD, in the best case ±0.1 K. The present investigation can hence recommend the neural nets to derive for the retrieval of LST over the AWS in Belgium. Using published matchup results from other authors however, no single source of data performed better than = 1, but this could be due to their low number of matchups. Further efforts for estimating representative pixel average emissivities are needed, and establishing a denser in situ measurement network over varied land use, with rather homogeneous land cover within a TIRS pixel, may aid further validation of a per pixel and per scene estimates from multispectral imagery. AWS data seems valuable for evaluation of satellite LST, with the advantage of a much lower cost and higher potential matchup density compared to conventional radiometers.


Introduction
Satellite observations in the thermal infrared (TIR) part of the spectrum, with typical wavelengths in the atmospheric windows between 3-5 and 8-15 µm are used to retrieve surface temperatures (ST) for land (LST) and water (WST) targets (Wan et al., 2004;Wan, 2014;Sekertekin and Bonafoni, 2020;Vanhellemont, 2020). As surface temperature drives heat exchange between the surface and the atmosphere the remote estimation of LST can be a very powerful tool in understanding climate and ecological processes. In the aquatic environment, WST can be an important driver of nutrient mixing and can structure phytoplankton communities (Wiltshire et al., 2008;Trombetta et al., 2019). Water temperature in lakes and reservoirs may be a driver for the occurrence of harmful algal blooms and WST could be used in predictive models (Schaeffer et al., 2018;Wynne et al., 2013). In terrestrial applications, soil moisture and evapotranspiration for example can be inferred from satellite estimated LST (Sandholt et al., 2002;Anderson et al., 2012), and the remotely sensed LST could hence be crucial for synoptic monitoring of drought conditions (Karnieli et al., 2010;Wan et al., 2004) and crop health (Anderson and Kustas, 2008; T 2007) and geothermal activity (Lagios et al., 2007;Mia et al., 2014;Mia et al., 2017;Chan et al., 2018). High resolution satellite derived LST can also be used to study the effects of urban heat islands Liu and Zhang, 2011) and heatwave risks (Dousset et al., 2011;Buscail et al., 2012).
The TIR radiance observed by satellite sensors is a function of atmospheric and surface (emissivity and kinetic temperature) properties. Parameters needed for the atmospheric correction of top-of-atmosphere TIR (atmospheric transmittance, , downward and upward atmospheric radiances, L d and L u ) can be reliably estimated using radiative transfer simulations and atmospheric profiles of relative humidity and temperature, either from radiosondes or assimilated weather models (Barsi et al., 2005;Pérez-Planells et al., 2015;Vanhellemont, 2020). Alternative methods using a split window approach (Caselles et al., 1998;Wan, 2014) require multiple TIR bands and knowledge of the water vapour content of the atmosphere. To retrieve the surface temperature, the emissivity needs to be estimated, either simultaneously or from external data sources. The emissivity of a surface is defined as the ratio of the radiance emitted by a surface to the radiance emitted by a black body at the same temperature. The accurate estimation of emissivity is essential, as a 1% error on the emissivity can lead to errors up to 1 K in the derived LST product (e.g. Wan (2014Wan ( , 2017). Typically, emissivity is modeled using an image classification method (Snyder et al., 1998), the Normalised Difference Vegetation Index (NDVI) and/or fraction of vegetation cover (FVC) if concurrent multispectral optical imagery is available (Sobrino and Raissouni, 2000;Chen et al., 2016;Sekertekin, 2019;Ren et al., 2017) or imposed from external data sources, such as the MODIS (Wan et al., 2004) or ASTER (Hulley et al., 2015) emissivity products. A review and sensitivity analysis of different methods of emissivity estimation from NDVI is given in recent publications (Gao et al., 2020;Sekertekin and Bonafoni, 2020). MODIS and ASTER have multispectral TIR instruments that can be used to retrieve emissivity using the Temperature Emissivity Separation (TES, Gillespie et al., 1998) or Temperature Independent Spectral Indices of Emissivity (TISIE, Li and Becker, 1993) approaches. A review of TES and TISIE methods is given in Li et al. (2013), but are not discussed further due to the focus on Landsat type instruments in the present paper. There can be significant anisotropy of the emissivity especially for viewing angles > 30° (Ren et al., 2011;Li et al., 2013;García-Santos et al., 2015), which can be ignored for largely nadir-viewing instruments such as those on Landsat.
Water targets are generally used for calibration and validation of atmospheric correction methods and surface temperature retrieval, due to the constant emissivity of liquid water and the generally low spatial heterogeneity of temperature within a satellite pixel (e.g. Barsi et al. (2005)). For water targets, the high resolution thermal imagers on the Landsat series of satellites have generally a better than 1 K Root Mean Squared Difference (RMSD) compared to in situ temperature measurements (Cook et al., 2014;Malakar et al., 2018;Vanhellemont, 2020).
Land pixels generally aggregate information for a large variation of heterogeneous materials (buildings, trees, roads, fields, water). Because the content of a TIR satellite pixel is heterogeneous, the emissivity for a given pixel is an aggregation of the different pixel contents and geometry (Yang et al., 2015), and it may be a challenge to establish an appropriate pixel specific emissivity. Furthermore, land targets are typically much more variable in space and time due to seasonal effects (growth, snow cover, flooding), land use variability (ploughing, crop growth and harvesting), and the emissivity of soils can increase based on moisture content, leading to up to 2 K errors on LST estimates (Mira et al., 2007;García-Santos et al., 2013). As a result of these effects, validation over land sites typically gives larger RMSD compared to in situ measurements compared to water sites, ranging from 2-4 K for Landsat type sensors (García-Santos et al., 2018;Sekertekin, 2019). In a comprehensive validation exercise of LST derived from multiple coarser resolution sensors (AATSR, GOES, MODIS, and SEVIRI), Martin et al. (2019) found daytime accuracy within 4 K, and nighttime accuracy within 2 K. This paper presents an alternative emissivity estimation from multispectral imagery, using a machine learning approach (neural network) and a spectral library. The method is demonstrated using Landsat 8 (L8) which has on board a 9 band visible to shortwave infrared imager, OLI, and a two band thermal infrared sensor, TIRS. Atmospherically corrected OLI imagery acquired at the same time as the TIRS imagery is used in combination with the neural network to estimate emissivity in both TIRS bands. Emissivity estimates are compared to other emissivity sources, i.e. the ASTER GDEM dataset and an NDVI based approach, and are used to retrieve single band LST from both Bands 10 and 11 in the L8 "Collection 1" Level 1 data using the TIR atmospheric correction method presented in Vanhellemont (2020). Split channel algorithms were not used due to the poor performance found for the coefficients provided by Du et al. (2015) as used by Gerace and Montanaro (2017) after their stray light correction (Vanhellemont, 2020). LST for all emissivity sources are compared to in situ subsoil temperature measurements at 14 sites operated by the Royal Meteorological Institute of Belgium (RMI), and to temperatures derived from broadband radiometers at 12 of the 14 sites. The use of these regular weather monitoring sites, which can provide many more matchups, as well as matchups from radiometers presented by previous studies are evaluated for satellite validation. The varied locations of the different sites (Bertrand et al., 2013) may be differently impacted by seasonal variability in the emissivity estimations.

In situ data
In situ temperature measurements were obtained from 14 AWS operated by the RMI across Belgium (Table 1 and Fig. 1). This dataset covers the full Landsat 8 mission and includes hourly measurements of air temperature at 1.5 m above the surface, and ground temperatures (1) above grass, (2) above bare soil, and (3) 5 cm below bare soil. Details on the automated quality control of the air and soil temperatures can be found in Bertrand et al. (2013Bertrand et al. ( , 2015. Temperatures are measured with a PT100 probe, which has an accuracy of < 0.1 K or < 0.1% of span, and the total accuracy of the system is estimated to be 0.3 K. 12 out of the 14 sites used here also have a Campbell Scientific CNR-1 broadband (0.3-3 µm and 5-50 µm) Net Radiometer with upand downward facing instruments installed. The irradiance measurement from the downward facing thermal instrument (E, W/m 2 ) can be used to compute the ground surface temperature, T (K), according to the Stefan-Boltzmann law: (1) with the Stefan-Boltzmann constant, taken here as 5.67·10 8 W/m K 2 4 . Temporal coverage of the individual measurements vary per site, and the closest bounding temperature measurements within one hour of the satellite overpass were linearly interpolated to the overpass time. Soil and air temperatures are measured using contact probes, and those in situ temperature measurements are perhaps not directly comparable to satellite-measured skin temperature, and a better performance is expected for the broadband radiometer-derived temperatures. From the contact thermometer data, the measurements below bare soil are likely most comparable to the satellite measured skin temperature, if the soil is not frozen (Jin and Mullens, 2014), as they are less sensitive to small scale air fluctuations and direct insolation. The air temperature usually differs from the skin temperature due to surface and atmospheric conditions (Jin et al., 1997), and generally a lag is observed between skin temperatures and soil and air temperatures (Jin and Mullens, 2014). The air, grass and soil temperatures are quite sensitive to direct solar heating and are hence not used here. The presence of buildings and trees, and their cast shadows, in the satellite pixel may cause significant differences between the AWS temperatures and satellite measured skin temperatures, as the 120 × 120 m TIRS pixel may not be homogeneous around the site. The landscape in Belgium is very fragmented as a result of its long history and (lax) spatial planning policies (Albrechts et al., 2003), which results in a changing of emissivity at short spatial scales, and likely significant sub-pixel variability, and complex thermal adjacency effects. This provides a challenge for the validation of satellite imagery, but the present investigation may provide performance indicators in the typical use of satellite LST in such a fragmented landscape, even though the AWS are installed on long-term stable meteorological observation sites. The dense time series in time and space provides excellent opportunities for large scale matchup analysis, and can give an indication whether these types of measurements can be used for evaluation of satellite derived LST. For water applications for example, bulk temperatures are often used for validation, and corrected with a simple offset to obtain equivalent skin temperatures (Donlon et al., 2002;Cook et al., 2014). Furthermore, the installation and maintenance of radiometer systems is more expensive than these AWS or automated contact thermometers, and many national institutes already have an established network of soil and subsoil temperature and irradiance measurements. The use of these AWS is here suggested as addition to, and not a replacement for, validation sites with high performance thermal radiometers such as those in SURFRAD.

Satellite data
The available "Collection 1" daytime data combining both sensors on Landsat 8, OLI and TIRS, were obtained from Google Earth Engine (GEE, Gorelick et al., 2017). Imagery was provided in standard USGS L1T format, i.e. terrain corrected Top-of-Atmosphere (TOA) radiances/ reflectances in GeoTiFF format as produced from the Landsat Product Generation System (LPGS) version 2.7.0 (imagery before October 2017), 13.0.0 (October 2017 -April 2018), and 13.1.0 (imagery after April 2018). OLI is a push broom imager with 9 spectral bands (B1-9) in the visible to near-infrared part of the spectrum, with 8 bands at 30 m and 1 panchromatic band at 15 m spatial resolution. TIRS is a two band push broom imager with bands B10 and B11 centred on 10.9 (10.6-11.2) and 12.0 (11.5-12.5) µm, and records data at 100 m spatial resolution. TIRS data is resampled to 30 m and collocated on the OLI grid. OLI data from the 30 m bands excluding the cirrus band at 1.37 µm were atmospherically corrected using the Dark Spectrum Fitting Atmospheric Correction (DSF, Vanhellemont and Ruddick, 2018; Vanhellemont, 2019) using a fixed path reflectance for a 12 by 12 km Region Of Interest (ROI) centred on the measurement sites. Only scenes with > 75% ROI coverage were retained, i.e. removing edge-of-swath scenes where the station is not in the scene. Quality control was performed separately for the computation of the ROI mean and median reflectance datasets and the matchups. Mean and median reflectance datasets (e.g. Fig. 2) were computed per ROI based on quasi cloud-free images, where a simple cloud filtering was performed using thresholds on the blue and cirrus bands: scenes with > 10% of the ROI with t 1373 nm > 0.015 or s 443 nm > 0.15 were removed. Data filtering for the matchups was performed using an additional quality filtering; matchups were excluded when the mean of a 11 × 11 pixel box centred on the station had t 1373 nm > 0.01 (to exclude cirrus contamination), a maximum visible band (443, 483, 561, 655 nm) s > 0.12 (to exclude direct cloud cover over the site), and the subscene interquartile range of s 443 nm > 0.035 (to exclude scenes with scattered clouds). TOA thermal radiances (L t ) were also extracted for TIRS B10 and B11. Parameters required for the atmospheric correction of the thermal channels were derived using the method presented in Vanhellemont (2020). Subscene center ERA5 reanalysis profiles from the 0.25x0.25 degree data provided by the European Centre for Medium-Range Weather Forecasts (ECMWF, Malardel et al., 2016) and the libRadtran (version 2.0.2) radiative transfer code (Mayer and Kylling, 2005;Emde et al., 2016) were used to compute the atmospheric transmittance ( ), and up-(L u ) and downwelling (L d ) radiances. The TOA thermal radiance is a measurement of the combined surface-emitted radiance (L s ) and atmosphere: and in order to derive land surface temperature, the emissivity ( ) per pixel needs to be estimated, which is detailed in the next section.

Emissivity data
Emissivity was estimated from OLI surface reflectances ( s ) using a neural network approach. Reference spectra were obtained from the  Table 1. ECOSTRESS spectral library (Baldridge et al., 2009;Meerdink et al., 2019). 1072 out of the 3403 ECOSTRESS spectra with complete spectral coverage from the visible to thermal infrared (wavelengths 0.3-15 µm) were resampled to the OLI and TIRS relative spectral response functions. A subset of 839 spectra (Fig. 3) from the "manmade", "rock", "soil", "vegetation", and "water" classes (excluding "meteorites", "nonphotosyntheticvegetation", and "mineral") were finally retained to train a neural network (Net1) using the Keras (Chollet, 2015) library and TensorFlow (Abadi et al., 2015) backend. A second network (Net2) was trained using a subset of 561 spectra, also excluding the "rock" class. The input reflectance spectra were normalised using the spectral mean and standard deviation in each OLI band. The neural network has n input neurons, for the s in the VNIR-SWIR OLI bands, and output neurons, for the in both TIRS bands. Input neurons either included (n = 8) or excluded (n = 7) the panchromatic channel on OLI, band 9 (cirrus at 1.37 µm) was excluded due to the low atmospheric transmittance. The mean squared error loss function and adam optimiser were used for the output neurons. Various combinations of deep (4-8) and wide (16-256) hidden layers were tested. 1000 training epochs with a batch size of 20 were ran per set up, with datasets randomly split in 90% training and 10% validation data in each epoch.
For each ROI, emissivity was estimated using the selected neural networks from the OLI s imagery on a scene by scene basis, as well as for the cloud filtered per-pixel median s over the mission lifetime. Pixels with s 1609 nm < 0.02 are considered to be water, and their emissivity is fixed to band averaged values from Vanhellemont (2020), 0.9926 and 0.9877 for B10 and B11 respectively. The scene specific estimates are included in order to accommodate for seasonal (e.g. snow, droughts) or rapid (e.g. crop harvest, ploughing, flooding) changes where an average dataset may be insufficient. The mission lifetime median dataset is likely biased to clear weather (i.e. summer) conditions. In Fig. 2 the differences between the median and scene specific coverage for BUZENOL are illustrated. There are marked differences in field state, where in the scene specific RGB composites many more fields have been recently ploughed and are hence bare soil. The forest appears darker in the median composite than in the spring image, due to the bias to summer condition due to cloud coverage. The differences between pine and deciduous stands are more pronounced in the spring and median images than in the summer image. Fields appear to be more yellow or brown in the summer image than in the spring image, which could indicate changes in soil water conditions. Emissivity data for each ROI were also obtained from the ASTER Global Emissivity Dataset 100meter V003 (AG100, Hulley et al., 2015), resampled to the OLI 30 m grid using cubic interpolation. The simplified Normalised Difference Vegetation Index (NDVI) approach (Sobrino and Raissouni, 2000;Sobrino et al., 2008;Jiménez-Muñoz et al., 2008) was also used to estimate emissivity: with s the average soil emissivity (0.971 and 0.977 in B10 and B11) and v the average vegetation emissivity (0.987 and 0.989 in B10 and B11) taken from Skoković et al. (2014). FVC is the Fraction of Vegetation Cover (Carlson and Ripley, 1997): where NDVI min and NDVI max give the NDVI ranges used between bare soil (< 0.18) and fully vegetated (> 0.85), and NDVI is computed from OLI red (655 nm) and NIR (865 nm) surface reflectances: FVC is set to 0 for NDVI < NDVI min and to 1 for NDVI > NDVI max . For FVC = 0 the is estimated separately for B10 and B11 according to Skoković et al. (2014): Finally, temperatures were also computed using unity emissivity.

Statistics
Linear correlation statistics and Reduced Major Regression (RMA) lines were computed for the matchups, and the Root Mean Squared Difference (RMSD), Mean Difference (MD), and unbiased RMSD (unb-RMSD) are given as indication of the total error, systematic error (bias), and precision:

Emissivity estimates
The hidden layer construction of the neural network was varied between widths of 16, 32, 64, 128, and 256 neurons, and depths of 4, 5, 6, 7 and 8 layers. The performance of the neural net was evaluated against the expected reflectance from the ECOSTRESS library, with the 64 × 4 net giving highest r 2 (87% and 86%) and lowest RMSD (1.19 and 1.02) in both TIRS bands for the first subset of spectra (Fig. 4). The second subset of ECOSTRESS spectra (excluding the "rock" spectra) gave similar performances (r 2 82%, RMSD 1.12-1.13, Fig. 5). The inclusion of the panchromatic band only marginally changed the model performance using the ECOSTRESS library, and hence was excluded from the application to satellite imagery due to the larger uncertainty on the atmospheric correction of the panchromatic band.
The emissivity estimated from the median ROI s compares well to the emissivity from the ASTER GED (an example for BUZENOL is shown in Fig. 6). Although a larger spread is found for the OLI derived emissivity (e.g. Fig. 7) compared to the ASTER GED, the Mean Average Relative Differences (MARD) are about 1.3-2.6% in B10 and 0.8-2.0% in B11 for Net1 and 0.8-2.2% in B10 and 0.6-1.8% in B11 for Net2, with slight variability between sites. Thanks to the higher OLI resolution (30 m) compared to the ASTER GED (100 m), individual fields or tree stands, and manmade materials may be better resolved spatially. The OLI data range is more recent (2013-2019) than the one used for the ASTER GED (2000-2008), a decade during which land use has changed significantly. There are some differences between the ASTER and TIRS spectral responses that are not taken into account. The OLI derived emissivity is generally a bit lower than the ASTER GED data (maximal MD −0.024 in B10 and −0.013 in B11 for Net1), likely as a direct result of the spectral response differences and the neural network performance. Compared to the ASTER GED dataset, the different nets performed differently across the sites, depending on the ROI scene contents. Generally the second net gave lower differences with the ASTER GED compared to the first net, except for regions with large forest coverage (e.g. BUZENOL). The lower emissivity estimated from the neural nets will lead to higher LST estimates compared to using unity or ASTER emissivity.
A time-series of B10 emissivity for several pixels in the BUZENOL site is shown in Fig. 8. The median emissivities from both nets are similar to those from the ASTER GED, with Net2 giving emissivities about 0.01 lower than ASTER. The two pine stand pixels show a slight variability of emissivity through the year, with highest values reached during spring and early summer. The second pine stand was cut down in 2018, and the emissivity drops closer to values associated with bare soils or barren terrain. Net1 gives much lower emissivity for the barren pixels, likely since it confuses bare soil with the rock spectra (with generally lower, or more variable emissivity) also included in the model. The NDVI based model generally fluctuates between the imposed soil (0.971) and vegetation (0.987) emissivities, and also clearly shows the bare soil period for the second pine stand. A larger standard deviation in the 3 × 3 pixel boxes are found for the neural nets, likely due to the fixed emissivity range of the NDVI method, and the coarser resolution of the ASTER GED. For the Field pixel larger variability is found, cycling between tilled and planted fields. The NDVI and Neural Net methods give different results for separate tilling operations: going from crop to tilled in April 2015 decreases the NDVI but increases the estimate from both Nets, while the tilling operation in April 2017 has the NDVI going from low to high, and the nets (esp. Net2) give the inverse pattern.

Surface temperature
Surface temperatures are computed using the simulated up-and down-welling radiance and transmittance of the atmosphere, and six estimates of emissivity: (1) unity, = 1, (2) ASTER GDEM, (3) perscene OLI NDVI derived, and (4-7) per scene and ROI-average OLI derived from both neural nets. An example of an LST product for BUZENOL site is shown in Fig. 9. When using unity and ASTER GED emissivity, lower temperatures are found compared to the OLI derived emissivity, which is caused by the OLI products having an overall lower emissivity. The higher resolution of the OLI product compared to ASTER GED or the TIRS native resolution gives a sharper derived temperature product.
Matchups with the AWS data from 12 and 14 sites operated by the RMI were identified, and quality controlled using the OLI reflectance data. The number of potential matchups was decreased from 1972/ 1848 (scenes with bounding in situ measurements) to 537/501 (scenes passing automated quality control) for the radiometer and subsoil in situ measurements. A summary of the performance is given in Table 2 and 3, with scatter plots in Fig. 10 and 11. A strong linear relationship is found between satellite and in situ measurements, with high correlation coefficients (r 2 0.90-0.95), and near unity slope (0.94-1.04).
For the broadband radiometer derived temperatures, the RMSD are quite close between all the datasets (2.8-3.5 K), with a cold bias (negative MD) for unity, ASTER and NVDI emissivity (-0.9 to −2.2 K). Net1 and Net2 give the lowest absolute MD, with Net1 biased 0.6 K warm for B10 and ± 0.1 K for B11. For Net2 the MD is ± 0.1 K for B10 and −0.4 K for B11. The RMSD and MD are directly related to the range in the emissivity dataset used, as the emissivity is the only change between the comparisons. For these matchups, the MD seems to be the distinguishing statistic, as the performance for other statistics is very similar. The unb-RMSD can be slightly higher for the Nets than for the other datasets, due to the spatial variability inherent in the OLI data, and the larger range of emissivities from the Net outputs. Both nets give different emissivity estimates in both bands, and it seems like Net1 gives a better B11 estimate than Net2 (MD ± 0.1 K compared to −0.4 K), and Net2 gives a better B11 estimate than Net1 (MD ± 0.1 K compared to 0.6 K).
For the subsoil temperature, the RMSD across all emissivity datasets is quite large (3-5 K). The MD (0.8-3.7 K) and hence difference between unb-RMSD and RMSD (generally 0.1-1.5 K) are also larger for the subsoil temperatures than for the radiometers. In terms of scatter, overall performances are quite similar for the radiometer (unb-RMSD 2.6-3.1 K) and subsoil (unb-RMSD 2.9-3.5 K) temperatures. For MD, the performance of the former is better, with the bias with subsoil temperatures about 3 K.
In order to evaluate the performance of the emissivity estimates for in situ measurements with high performance radiometers that measure the skin temperature, the results from Sekertekin (2019) Barsi et al., 2003;Barsi et al., 2005) for the atmospheric correction. The former used an NDVI based estimate for emissivity, while the latter used the ASTER GED. Table 4 replicates the results by Sekertekin (2019) for all twenty matchups and for the subset of fifteen they deemed more reliable, and includes results from the present study for unity and scene based emissivity for both Nets. The results for the per scene estimated emissivity compare well with those from Sekertekin (2019), with unb-RMSD of 2.2 K for the full matchup dataset and 1.6-1.7 K for the subset. Surprisingly, the unity emissivity gives the best performance, outperforming both other methods that have a variable emissivity, especially in terms of slope and MD. The lowest unb-RMSD is found for the Net1 estimated emissivity. Overall results from the RMI AWS sites give similar slopes, but with higher unb-RMSD and MD for the subsoil matchups (both by about 1 K). Matchups with AWS radiometer temperature show slightly higher unb-RMSD, but lower absolute MD. Matchups from García-Santos et al. (2018) processed using the present study for unity and scene based emissivity for both Nets are shown in Table 5. García-Santos et al. (2018) found the Single Channel and Split Window Algorithms (SCA Fig. 4. A comparison of the predicted and expected reflectance outputs from the 7 band 64 × 4 neural network for both TIRS bands, B10 (left) and B11 (right), using the first subset of spectra (Net1). These plots show the results after 1000 training epochs with 90-10% training-validation split in each epoch.   and SWA) to outperform the results obtained from the ACPC in terms of RMSD (1.6-2 K and 2.0-2.3 K respectively), but the ACPC giving the lowest bias (-0.1 K). With the results from the present study, lowest unb-RMSD is found again for unity emissivity, but the MD from both neural nets is significantly lower (0.7 K for Net1 and 0.02 K for Net2), similar to what was found for the AWS matchups. Using the ASTER GED emissivity computed by García-Santos et al. (2018), the performance is practically the same for the present method and the ACPC (similar to findings in Vanhellemont (2020)). The linear regression slopes are < 1 for these matchups, largely due to an underestimation of high temperatures for unity emissivity, and an overestimation of low temperatures for the neural nets (not shown). Differences in unb-RMSD are quite low, indicating that the different methods have a similar relative noise level, especially methods using a fixed or unity emissivity. The differences between the unb-RMSD and the RMSD reflect the impact of bias originating from the use of different emissivity datasets, also represented in the variability of the MD.

Perspectives
Emissivity can be estimated from multispectral imagery and a neural network approach trained on a spectral library. There are clear conceptual advantages of estimating the emissivity per scene, but some contradictory patterns are found between the NDVI approach and both nets, and may depend on scene specific characteristics or surface properties, or the imposed limits of the NDVI approach (Gao et al., 2020). The NDVI approach has been previously found to be unsuitable  for impervious materials (Chen et al., 2016). The performance of the neural nets may be impacted by the imbalance between the number of spectra in the different classes in the training dataset, and both the nets and the NDVI based method can be impacted by errors in the atmospheric correction of OLI imagery. The inclusion of rock spectra in Net1 may be less appropriate for Belgian sites, as they are typically a patchwork of fields, bare soils, trees, and man-made materials. Further sensitivity analyses need to be performed using long term soil temperature measurements in other countries with more varied surface characteristics.
Single band LST derived from L8/TIRS using atmospheric parameters computed using ERA5 reanalysis profiles of relative humidity and temperature was evaluated for 14 sites across Belgium that autonomously measure air, soil, grass and subsoil temperatures, and 12 sites that measure broadband irradiance sky-and ground irradiances. Strong linear correlations were found between LST and all measured temperatures, indicating that satellite-derived LST could be used to model any of the parameters. Excellent performance was found for the matchups of LST with the broadband radiometer data, with for B10 the neural nets giving unb-RMSD of 2.7-3.1 K, and MD 0.6 K for Net1 and ±0.1 K for Net2. Good performance was also found for the subsoil temperatures which gave near unity slopes, and an unb-RMSD of 2.9-3.5 K across bands and emissivity sources, with a 0.8-3.7 K warm bias. (Caused by the in situ subsoil temperatures being colder than the in situ radiometer temperatures.) B10 and B11 matchups gave very similar performances, and it seems the calibration and stray light correction efforts (Gerace and Montanaro, 2017) into the "Collection 1" data make both bands suitable for LST estimation. These results are on par with other recent validation results from L8/TIRS using various correction methods an in situ thermal radiometers (García-Santos et al., 2018;Sekertekin, 2019;Meng et al., 2019), and the present study illustrates how global networks of AWS could aid in validation of satellite derived LST, by complementing radiometer measurements with higher associated costs.
Although the performance of the LST products compared to in situ measurements was similar across various emissivity datasets: (1) unity emissivity, (2) ASTER GED, (3) NDVI based, and the ECOSTRESS based neural networks developed in the present study, (4) Net1, scene s , (5) Net1, median s , (6) Net2, scene s , and (7) Net2 s , median, the best performance (lowest MD, unb-RMSD, offsets and unity slopes, with highest r 2 ) was found for the neural nets in comparison with the AWS radiometer data (with MD ± 0.1 K for Net2), and for unity emissivity for the AWS subsoil data. Similarly, with the matchup dataset with SURFRAD data published by Sekertekin (2019) and UIB data by (García-Santos et al., 2018), and summarised in Tables 4 and 5, the best performance was found for unity emissivity, although they provided a more limited number of matchups. The RMSD values are lower for unity emissivity likely due to the lack of noise introduced by the spatial variability present in the other datasets. Of the neural networks developed in the present paper the net excluding the "rock" spectra gave slightly better performances, i.e. the absolute MD for Net2 is lower than that for Net1 for B10. It should be noted that the AWS are situated in long-term meteorological sites, that have a largely unchanging coverage through the year, which may not fully demonstrate the power of a scene by scene emissivity estimate. The study sites in the present paper have significant vegetation coverage inside and surrounding the TIRS Table 2 Summary of the matchups with the broadband radiometer temperature measurements at the Belgian AWS. n is the number of matchups, m and b the slope and offset of the Reduced Major Axis regression, r 2 is the square of Pearsons linear correlation coeffient. RMSD and MD are the Root Mean Squared and Mean Differences between satellite and in situ measurements (satellite -in situ), with unb-RMSD the unbiased RMSD.  pixel, and the comparison should be extended to non-vegetated and homogeneous sites. The MD for the neural nets is about 1 K higher (and hence closer to 0 for the broadband radiometer data) than those when using the NDVI and ASTER GDEM emissivity estimates, although giving approximately the same unb-RMSD. This higher MD is consistent with the generally lower emissivities retrieved from the neural nets.

Conclusions
• Neural networks trained on the ECOSTRESS spectral library can reliably (1% RMSD) estimate emissivity in thermal infrared bands from multispectral visible to shortwave infrared reflectance, with a low negative bias. The estimation of emissivity in two thermal channels on TIRS from concomitant 7 band OLI s imagery is demonstrated.
• The OLI and neural network derived emissivity gives low mean differences compared to the ASTER Global Emissivity Dataset (0.6-2.6% MARD), but a larger emissivity range, potentially due to the higher spatial resolution of the OLI dataset. TIRS derived surface temperatures using the OLI emissivities have a crisper appearance when using the OLI derived emissivity, and the OLI derived emissivity may lead to a 'sharpening' of the TIRS data.
• TIRS derived LST compare very well to in situ measurements of broadband radiometer temperatures at 12 Automated Weather Stations in fragmented landscapes across Belgium. Matchup results for B10 show near unity slopes, with an unb-RMSD of 2.6-3.1 K and a MD of −2.2-0.6 K, with the best performing emissivity source (Net2) giving an unb-RMSD around 2.8 K and MD of ±0.1 K, depending on whether the scene-specific or mission median emmisivity was used. TIRS derived LST also compare well to in situ measurements of subsoil temperatures at 14 AWS. Matchup results with subsoil measurements show an unb-RMSD around 3 K and MD from 1-4 K with near unity slopes.
• There were only small differences between using the neural network derived emissivity from the per-scene or the mission median reflectance, with the per-scene emissivity generally having a lower MD and slightly higher RMSD compared to the mission median. This indicates that the results from the mission median data could also be used for processing night-time data in the absence of optical data.
• Better performances may be achieved using high performance in situ radiometers which measure the land surface skin temperature, even though the results achieved here are comparable to matchup analyses done with in situ radiometers (García-Santos et al., 2018;Sekertekin, 2019). The use of temperature measurements from dense networks of meteorological observations does significantly increase the number of matchups, and can complement the observations from sites equipped with thermal radiometers. > 2000 potential matchups of satellite overpasses with bounding in situ measurements were identified, with > 500 remaining after automated quality control (cloud masking).
• Although good performance was found for the broadband radiometer matchups and the neural network derived emissivity, for the datasets presented by Sekertekin (2019) andGarcía-Santos et al. (2018), and the subsoil temperature matchups presented here, there is no obvious performance improvement of using a non-unity emissivity value. Further investigations could be the measurement of temperature in contrasting land covers within the same region of interest, to establish the importance of a spatially and temporally variable emissivity. The impacts of the number of matchups and range in the validation data need to be further investigated, and

Table 4
The matchups with SURFRAD data as presented by Sekertekin (2019), and results from the present study. perhaps the study could be extended to include the meteorological measurements from other countries.
• Quality control of the imagery was automated due to the large amount (> 2000) of potential matchups. Some scenes that would not pass manual quality assessment, e.g. presence of very thin clouds that cannot easily be detected spectrally, are still included in the matchup exercise. Further improvement of the validation results are expected if automated quality control can be improved.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.