Evaluation of the relationships and uncertainties of airborne and ground-based sea ice surface temperature measurements against remotely sensed temperature records

ABSTRACT Sea ice surface temperature (IST) is an important indicator of environmental changes in the Arctic Ocean. In this study, the relative performance of four mainstream IST records, i.e. airborne IST, infrared radiometer measured IST (IR IST), longwave radiation derived IST (LWR IST), and snow and ice mass balance array buoy derived IST (Buoy IST), was evaluated against the MODIS IST product. Bias, standard deviation (STD), and root mean square error (RMSE) were used to evaluate the data quality. Results revealed that airborne IST had the best accuracy: it was 0.21 K colder than MODIS IST, with an STD of 1.46 K and an RMSE of 1.47 K. The ground-based ISTs were biased relative to one another, but all were warmer than the MODIS IST. Among the ground-based records, the IR IST had the best overall accuracy (bias = 0.55 K; STD = 1.52 K; RMSE = 1.61 K), while the LWR IST was the noisiest measurement, with the largest percentage of outliers. In addition, co-located IR and LWR ISTs were more consistent with each other than any evaluated IST was with MODIS IST (correlation coefficient = 0.99). Airborne and IR ISTs are thus the preferred choices for monitoring the rapidly changing Arctic sea ice, together with satellite observations.


Introduction
Arctic sea ice has been undergoing continuous loss in thickness, extent, and volume since the start of the satellite era, together with rising surface temperature (Comiso and Hall 2014;Box et al. 2019;Parkinson and DiGirolamo 2021;Wang et al. 2022). The declining sea ice cover has significant climate effects far beyond the Arctic region (Vihma 2014;Gao et al. 2015). To analyze the drastic changes in Arctic sea ice, a comprehensive observation system consisting of both in situ measurements and satellite observations is needed. Among the essential climate variables of the Arctic Ocean, sea ice surface temperature (IST) is a crucial thermodynamic parameter related to the longwave radiation of the Arctic Ocean, and thus has a significant influence on the surface mass and energy balance (Cheng et al. 2008;Lei et al. 2014). A precise and accurate IST product will benefit studies on retrieving other key sea ice parameters (e.g. ice thickness), as well as the development and upgrade of regional and global climate models (Lavergne et al. 2022;Rasmussen et al. 2018;Sun et al. 2022).
Satellite and in situ observations are two major data sources for monitoring surface skin temperature. With prior knowledge of surface emissivity, the split window algorithm can retrieve sea and land surface temperature (SST and LST) from remotely sensed thermal infrared imagery (Li et al. 2013). The same technique has also been adopted to routinely retrieve IST from the thermal infrared channels of the Advanced Very High Resolution Radiometer (AVHRR), Moderate Resolution Imaging Spectroradiometer (MODIS), and Visible Infrared Imaging Radiometer Suite (VIIRS) sensors (Key and Haefliger 1992;Hall et al. 2004;Dybkjaer, Tonboe, and Høyer 2012;Key et al. 2013). These satellite-based IST products have been comprehensively compared and validated against multiple sources of simulation data, in situ measurements, reanalysis datasets, and other satellite-derived IST products (e.g. Key et al. 1997;Son, Kim, and Lee 2018;Liu, Dworak, and Key 2018;Høyer et al. 2019;Fan et al. 2020). For MODIS IST, the model uncertainty of the IST retrieval algorithm is about 0.1-0.3 K (Hall et al. 2004). Cross-validation against VIIRS IST revealed a bias of 0.12 K and an RMSE of 1.02 K (Liu et al. 2015). Higher uncertainties, ranging from 1.0 K to 4.0 K, were obtained when in situ data were adopted as the validation dataset (e.g. Scambos, Haran, and Massom 2006;Hall et al. 2008;Li et al. 2020). In contrast, field-measured IST records in the Arctic Ocean are generally treated as ground truth even though their uncertainties are unknown, because regular device maintenance is difficult in such a harsh environment. Previous studies have confirmed that in situ IST measurements in the Arctic Ocean are sensitive to changes in environmental conditions such as solar heating and snow cover (Hall et al. 2015). The different error sources of in situ data make it questionable whether they are consistent or comparable with each other.
Furthermore, most in situ data are acquired on sparse sites scattered on floating ice, making it even harder to compare them with each other.
Numerous studies have been conducted to evaluate the quality and consistency of in situ SST and LST measured by different technologies, encouraging the use of both in situ and satellite data. High quality satellite SST can be adopted as reference data to estimate the uncertainties of in situ measurements because it is independent of the in situ observation network (Kennedy 2014). Taking the satellite SST as a benchmark, Emery et al. (2001) compared buoy and ship measured SST, and suggested that ship SST is generally warmer and noisier than buoy SST. Similar workflows were also applied by Atkinson et al. (2013) to build a robust quality control procedure for in situ measurements, and by Donlon et al. (2002) to select a high quality validation dataset for satellite SST under different environmental conditions. Using two satellite SST datasets as reference, Castro, Wick, and Emery (2012) found that the manufacturer and the program are not the major sources of uncertainty in buoy measured SST, and that calibration issues and the physical and geographical environment account for the decreasing accuracy of buoy SST. In contrast to SST, LST has larger spatial variability, and the scale issue is important to the consistency of different LST data sources. Krishnan et al. (2015) compared the performance of LST measurements at three scales (i.e. in situ point LST, airborne LST, and satellite LST) and suggested that point and airborne LST agree better with each other than with satellite LST. To further assess the quality of LST data from different sensors, Krishnan et al. (2020) conducted a field experiment and concluded that LSTs measured by different infrared temperature sensors agree well with each other. They also found that different types of LST data agree better at nighttime, while longwave radiation-based LST differs considerably from the infrared sensor and camera-derived LST during the daytime.
Research on the uncertainties and correlation of different field-measured ISTs remains insufficient. Instead, studies have largely focused on the differences and the relationship between near-surface air temperature (SAT) and actual IST (e.g. Høyer et al. 2017;Nielsen-Englyst et al. 2019), because the SAT is generally adopted as a substitute for IST under several criteria (e.g. Hall et al. 2004). To address this research gap, this study, for the first time, evaluated the relative performance of four mainstream IST measurement technologies in the Arctic Ocean: airborne IST, infrared radiometer measured IST (IR IST), longwave radiation derived IST (LWR IST), and snow and ice mass balance array buoy derived IST (Buoy IST). Owing to its extensive spatial coverage and good accuracy, the MODIS IST product was adopted as an independent reference dataset. Three criteria, the bias, standard deviation (STD), and root mean square error (RMSE), were selected to evaluate the accuracy, precision, and uncertainty of those measurements following Guillevic et al. (2018). We expect that this work can guide the deployment and maintenance of the polar observation system. Moreover, the selected high quality in situ observations could further benefit the optimization of temperature retrieval algorithms in extremely cold environments as well as the calibration of newly launched satellite sensors.

Airborne IST
Airborne IST records from the Operation IceBridge (OIB) campaign were adopted in this study. The OIB campaign was the most comprehensive airborne survey mission for polar research. During the years 2012-2014 and 2017-2019, the surface temperature was acquired by a Heitronics Series II Infrared Radiation Pyrometer (hereafter KT-19) onboard the National Aeronautics and Space Administration (NASA) P3 aircraft. The KT-19 radiometer measures the brightness temperature (Tb) of the surface beneath the aircraft at wavelengths between 9.6 and 11.5 microns with a frequency of 10 Hz. By setting a constant ice surface emissivity value of 0.97, the Tb received by the KT-19 radiometer is directly inverted to IST (Studinger 2020). While the resolution of the KT-19 radiometer is 0.01°C, other error sources, such as clouds beneath the aircraft and the variance of ice surface emissivity, could decrease the accuracy of the retrieved IST. The spatial resolution of the airborne IST data depends on the flight height. The footprint of airborne IST is approximately 15 m when the aircraft is flying 450 m above ground level. In this study, all the airborne IST data from the OIB KT-19 radiometer available during the six years in the Arctic Ocean were selected to quantify the performance of aircraft measured IST. The OIB airborne IST has a wide spatial distribution, covering the Central Arctic Ocean, the Beaufort Sea, and the coast and north of Greenland (Figure 1). However, it should be noted that the airborne IST records were only captured during the Arctic winter and spring (from March to May). In this work, 18038 Airborne-MODIS IST data pairs were matched and analyzed.
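The emissivity correction described above can be illustrated with a minimal sketch. It assumes a monochromatic Planck inversion at a single effective wavelength of 10.5 microns (roughly the centre of the 9.6-11.5 micron band) and neglects reflected sky radiance; neither assumption is taken from the OIB processing chain, and the function names are hypothetical.

```python
import math

# Physical constants (SI units)
H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
KB = 1.380649e-23    # Boltzmann constant, J/K


def planck_radiance(temp_k, wavelength_m):
    """Blackbody spectral radiance at one wavelength (W m^-2 sr^-1 m^-1)."""
    a = 2.0 * H * C**2 / wavelength_m**5
    b = H * C / (wavelength_m * KB * temp_k)
    return a / math.expm1(b)


def inverse_planck(radiance, wavelength_m):
    """Temperature (K) of a blackbody emitting the given spectral radiance."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (wavelength_m * KB * math.log1p(a / radiance))


def tb_to_ist(tb_k, emissivity=0.97, wavelength_m=10.5e-6):
    """Convert brightness temperature to surface temperature.

    Assumption: the measured radiance equals emissivity * B(IST), i.e.
    reflected downwelling radiance and atmospheric effects are ignored.
    The effective wavelength is an illustrative choice.
    """
    measured = planck_radiance(tb_k, wavelength_m)
    return inverse_planck(measured / emissivity, wavelength_m)


if __name__ == "__main__":
    # With emissivity below 1 the retrieved IST is warmer than Tb.
    print(tb_to_ist(250.0))
```

Under these assumptions an emissivity of 0.97 raises the retrieved temperature by somewhat more than 1 K near 250 K, which gives a sense of how strongly an emissivity error can affect the airborne IST.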

Ground-based IST
The IR IST is directly measured by an infrared radiometer. The IR IST adopted in this study was acquired from an automatic weather station (AWS) mounted with a non-contact IRR-P infrared temperature sensor. This AWS was deployed in the Beaufort Sea on August 19, 2014, during the 6th Chinese Arctic Research Expedition (CHINARE-2014), and remained functional until May 26, 2015. The IRR-P sensor detects infrared radiation from 8 to 14 microns and measures temperatures ranging from −40°C to 70°C with a nominal accuracy of ±0.5°C. This AWS also measured other meteorological parameters hourly, including SAT at 2 and 4 m height, wind speed and direction, subsnow temperature at 10 and 40 cm depth, radiation, and near surface air pressure (Pan 2015). In this work, 696 IR IST records were compared with MODIS IST.
Given its association with the surface energy budget, IST can be derived from longwave radiation (LWR) by inverting the Stefan-Boltzmann Law. In this study, the surface broadband longwave radiation collected during the Norwegian young sea ICE Campaign (N-ICE 2015) (Hudson, Cohen, and Walden 2016) and by the previously mentioned AWS during CHINARE-2014 was used to derive LWR IST. Both campaigns used Kipp and Zonen pyrgeometers to measure the upward and downward surface longwave radiation fluxes. The spectral range of the pyrgeometer is 4.5-42 microns, wider than that of the infrared radiometer. The N-ICE 2015 radiation station functioned from January 21, 2015 to June 19, 2015, and reported data every minute, while CHINARE-2014 provided hourly radiation information together with IR IST. More details on the devices and their deployment can be found in Walden et al. (2017) and Pan (2015). In this work, 137 points of N-ICE 2015 and 692 points of CHINARE-2014 LWR IST measurements were matched with the reference IST.
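As an illustration of the Stefan-Boltzmann inversion, the sketch below uses the common broadband formulation in which the upward flux is the sum of surface emission and reflected downward flux. The emissivity value of 0.99 and the function name are assumptions for illustration; the study does not state the exact formulation or emissivity used.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def lwr_to_ist(lw_up, lw_down, emissivity=0.99):
    """Surface temperature (K) from broadband longwave fluxes (W m^-2).

    Assumed model: lw_up = emissivity * SIGMA * T**4
                           + (1 - emissivity) * lw_down
    The emissivity default is illustrative, not taken from the study.
    """
    emitted = lw_up - (1.0 - emissivity) * lw_down
    return (emitted / (emissivity * SIGMA)) ** 0.25


if __name__ == "__main__":
    # Example fluxes for a cold snow surface under a clear sky.
    print(lwr_to_ist(lw_up=240.0, lw_down=180.0))
```

Because temperature scales with the fourth root of the flux, a flux error of a few W m^-2 translates into an IST error of several tenths of a kelvin at typical Arctic temperatures.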
Buoys equipped with thermistors are the most direct tool for automatically measuring ice temperature. In this work, five snow and ice mass balance array (SIMBA) buoys deployed by the N-ICE 2015 campaign were used to derive Buoy IST (Itkin et al. 2015). The SIMBA buoys are equipped with 240 thermistors along a 480 cm vertical string that record temperature every 6 h. The resolution of an individual thermistor is 0.0625°C, but the actual accuracy of the measured temperature could be lower, especially in extremely cold environments (Jackson et al. 2013). The locations of the air-snow, snow-ice, and ice-ocean interfaces can be identified from the temperature gradient and thermal diffusivity along the thermistor string; thus, SIMBA buoys are widely used in snow and sea ice thickness retrieval (e.g. Provost et al. 2017;Zuo, Dou, and Lei 2018;Liao et al. 2019). For this work, we manually identified the air-snow interface of each SIMBA buoy and took the temperature measured by the sensor adjacent to this interface as the surface temperature. Note that the retrieved Buoy IST is the temperature of the air-snow interface directly measured by a thermistor, which is different from the radiometric temperature or snow surface skin temperature given by the non-contact airborne, IR, LWR, and MODIS ISTs. Although the ISTs measured by SIMBA buoys and by the other approaches are not identical, they are close enough to be often treated equally as skin temperature (e.g. Cheng et al. 2014). Only 49 Buoy IST records were used in this study. Although numerous other types of buoys available at the International Arctic Buoy Programme (IABP, Rigor, Clemente-Colon, and Hudson 2008) website also provide IST regularly, they were not included in this research because their surface temperature is derived from a sensor on the bottom of the buoy hull.
Once the buoy is covered by snow, the reported surface temperature becomes much warmer than the actual IST, and there is no practical way to correct that internal temperature into surface temperature (Høyer et al. 2017;Yu et al. 2021). Details of the airborne and ground-based measurements are summarized in Table 1.

Satellite data
The recently updated version 6.1 Aqua/MODIS IST product (MYD29; Hall and Riggs 2021) was selected as the reference IST. The Terra/MODIS IST product (MOD29) was not used here, as the MYD29 product performs better than the MOD29 in the lower temperature range because of the lower saturation temperature (Hall et al. 2015). For every scene of the MYD29 image, the corresponding cloud mask product (MYD35_L2; Ackerman 2015) and geolocation product (MYD03; MODIS Characterization Support Team 2017) were also downloaded to derive auxiliary information (i.e. detailed cloud mask, solar zenith, and sensor zenith). All MODIS data were re-projected and gridded into WGS 84/NSIDC Equal-Area Scalable Earth Grid 2.0 North (EASE-Grid 2.0 North), a typical Lambert Azimuthal Equal Area projection suitable for polar research (Brodzik et al. 2012). The spatial resolution of the re-projected MODIS imagery was set to 1000 m in accordance with the default resolution of the MODIS IST product.
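For readers unfamiliar with this projection, the following sketch implements the forward north-polar Lambert Azimuthal Equal Area equations on the WGS 84 ellipsoid (central meridian 0°) and converts the projected coordinates to row and column indices. The grid origin of ±9000 km is an assumption borrowed from the standard EASE-Grid 2.0 layouts; the actual extent of the 1000 m grid used in the study is not stated, and the function names are hypothetical. In practice a projection library would normally be used instead.

```python
import math

A = 6378137.0              # WGS 84 semi-major axis (m)
F = 1.0 / 298.257223563    # WGS 84 flattening
E2 = F * (2.0 - F)         # first eccentricity squared
E = math.sqrt(E2)


def _q(lat_rad):
    """Authalic latitude function used by the ellipsoidal LAEA projection."""
    s = math.sin(lat_rad)
    return (1.0 - E2) * (
        s / (1.0 - E2 * s * s)
        - math.log((1.0 - E * s) / (1.0 + E * s)) / (2.0 * E)
    )


QP = _q(math.pi / 2.0)  # value of q at the North Pole


def latlon_to_ease2_north(lat_deg, lon_deg):
    """Forward north-polar LAEA projection; returns (x, y) in metres."""
    rho = A * math.sqrt(max(QP - _q(math.radians(lat_deg)), 0.0))
    lam = math.radians(lon_deg)  # central meridian is 0 degrees
    return rho * math.sin(lam), -rho * math.cos(lam)


def xy_to_rowcol(x, y, res=1000.0, x_min=-9000000.0, y_max=9000000.0):
    """Row/column of the cell containing (x, y); grid origin is assumed."""
    col = int(math.floor((x - x_min) / res))
    row = int(math.floor((y_max - y) / res))
    return row, col


if __name__ == "__main__":
    x, y = latlon_to_ease2_north(74.0, -150.0)  # a point in the Beaufort Sea
    print(x, y, xy_to_rowcol(x, y))
```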

Data processing
The IR, LWR, and Buoy IST data are all point measurements. For the IR and LWR IST data from the CHINARE-2014 AWS and the N-ICE 2015 Buoy IST, a maximum time gap of 30 min was allowed. If the time lag between the MODIS and in situ data was shorter than 30 min, the latitude and longitude of the in situ measurement were converted into X-Y coordinates and then into the row and column of the corresponding re-projected MODIS image. Given that the N-ICE 2015 radiation station reported data every minute, we only processed those N-ICE 2015 LWR ISTs recorded at the time when the MODIS images were captured. In these cases, one MODIS pixel can only match a single in situ measurement. Unlike the in situ measurements mentioned above, the airborne KT-19 radiometer moved with the aircraft and recorded the temperature every 0.1 s (10 Hz); thus, a MODIS pixel could match more than one airborne IST record. The 30-minute time gap threshold was again used, and the airborne IST records matched with the same MODIS pixel were averaged before further analysis.
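The airborne matching step can be sketched as follows. The input layout (time in seconds, pixel row and column, IST in kelvin) and the function name are hypothetical; only the 30-minute threshold and the per-pixel averaging come from the description above.

```python
from collections import defaultdict

MAX_LAG_S = 30 * 60  # 30-minute tolerance, in seconds


def match_airborne_to_pixels(samples, overpass_time_s):
    """Average airborne IST samples that fall in the same MODIS pixel.

    samples: iterable of (time_s, row, col, ist_k) tuples (assumed layout).
    Returns {(row, col): (mean_ist_k, n_samples)} for samples acquired
    less than 30 minutes from the MODIS overpass.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for time_s, row, col, ist_k in samples:
        if abs(time_s - overpass_time_s) < MAX_LAG_S:
            acc = sums[(row, col)]
            acc[0] += ist_k
            acc[1] += 1
    return {pix: (total / n, n) for pix, (total, n) in sums.items()}


if __name__ == "__main__":
    demo = [(0, 5, 5, 250.0), (60, 5, 5, 252.0), (4000, 5, 5, 300.0)]
    print(match_airborne_to_pixels(demo, overpass_time_s=30))
```

Keeping the sample count alongside the mean is convenient because later analyses group pixels by the number of airborne records they contain.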
As a tradeoff between data quality and quantity, pixels flagged as possibly cloudy are also interpreted as clear pixels and used to produce the MODIS IST product. In addition, sensor zenith and solar zenith filters were used to refine the MODIS IST product. The solar zenith filter removes data with solar zenith angles between 80 and 90 degrees, given that low sun conditions could increase the random errors in satellite-based ice and cloud products (Liu et al. 2015). The sensor zenith filter was set to 50 degrees because the snow surface emissivity decreases abruptly with increasing sensor zenith angle in the thermal infrared channels used for IST retrieval (Hori et al. 2006, 2013). In this study, the MODIS IST was used as the reference temperature, and all three criteria were employed to eliminate potentially abnormal pixels. Additional quality control (QC) processing is generally adopted to refine the data quality in case of undetected residual cloud and other potential observation errors. To better evaluate the performance of SST acquired by different buoys, Castro, Wick, and Emery (2012) discarded measurements with biases exceeding five sigma compared to the reference climatological and satellite SST. As for the validation of satellite LST, Duan et al. (2019) recommended the 3σ-Hampel identifier, which is more robust than the 3σ-edit rule because gross errors result in biased estimates of the mean and standard deviation (Davies and Gather 1993). In this work, the 3σ-Hampel identifier was adopted to refine the quality of matched data pairs. According to Pearson (2002), assuming that $\{x_i\}$ is the data sequence of the temperature difference between satellite and in situ observations, and $x_m$ is their median value, the standard deviation $S$ of $\{x_i\}$ can be estimated as:
$$S = 1.4826 \times \mathrm{median}\{|x_i - x_m|\}$$
Those matched points with $x_i > x_m + 3S$ or $x_i < x_m - 3S$ were discarded as abnormal data. The distribution of matched data is presented in Figure 1.
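A minimal sketch of the 3σ-Hampel screening is given below. It uses the usual scale estimate based on the median absolute deviation, with the factor 1.4826 that makes it consistent with the standard deviation for normally distributed data; the function name and the example values are illustrative only.

```python
import statistics


def hampel_keep_mask(diffs, n_sigma=3.0):
    """Flag which temperature differences pass the Hampel identifier.

    diffs: sequence of (in situ or airborne IST) minus (MODIS IST) values.
    Returns a list of booleans; False marks a discarded (abnormal) pair.
    """
    x_m = statistics.median(diffs)
    # Robust scale estimate: 1.4826 * median absolute deviation.
    s = 1.4826 * statistics.median([abs(x - x_m) for x in diffs])
    return [abs(x - x_m) <= n_sigma * s for x in diffs]


if __name__ == "__main__":
    demo = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 8.0]
    print(hampel_keep_mask(demo))
```

Because both the centre and the scale are medians, a single gross error such as the 8.0 K difference in the example does not inflate the rejection threshold, which is the property that motivates preferring this identifier over the 3σ-edit rule.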

Comparison between airborne and MODIS IST records
The airborne IST measurements were compared with the MODIS IST values. In total, 19068 OIB airborne-MODIS IST pairs from 2012-2014 and 2017-2019 were matched after applying the cloud, solar, and sensor masks, and 18038 of them passed the QC procedure. The results are summarized in Table 2.
In general, the airborne IST agreed well with the satellite IST from Aqua/MODIS, although the values were slightly colder. The correlation coefficient (r) between them was 0.97. Before QC, the bias, calculated as airborne IST minus MODIS IST, was −0.14 K, while the STD and RMSE values were both 2.00 K. The influence of the 5.4% outliers that did not pass the QC was not significant. After QC, the bias, STD, and RMSE values were −0.21, 1.46, and 1.47 K, respectively. The scatter plot (Figure 2a) and histogram (Figure 2b) show that the relationship between Airborne-MODIS IST pairs has no apparent temperature dependence. The biases between them roughly followed a normal distribution, but the tails of the distribution, which should be treated as valid data, were removed by the 3σ-Hampel identifier. We then estimated the annual error distribution of airborne IST. The annual bias and STD are depicted in Figure 2c. RMSE values were not included in this figure because they were very close to the STD. In general, a larger absolute bias corresponded to a larger STD. The bias values ranged from −0.54 K (2019) to 0.38 K (2018), while the STD values were between 0.88 K (2018) and 1.63 K (2013). In addition, no apparent relationship between the sea ice type and the accuracy of OIB airborne IST was found (see supplement).
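The three criteria can be computed as in the sketch below. It assumes the population form of the standard deviation (division by n), for which RMSE² = bias² + STD² holds exactly; whether the study used the population or the sample form is not stated, and for sample sizes of this order the difference is negligible. The function name is hypothetical.

```python
import math


def bias_std_rmse(test, reference):
    """Bias, STD, and RMSE of (test - reference), all in the input units."""
    d = [t - r for t, r in zip(test, reference)]
    n = len(d)
    bias = sum(d) / n
    std = math.sqrt(sum((x - bias) ** 2 for x in d) / n)  # population STD
    rmse = math.sqrt(sum(x * x for x in d) / n)
    return bias, std, rmse


if __name__ == "__main__":
    b, s, r = bias_std_rmse([250.2, 251.9, 249.0], [250.0, 252.5, 249.5])
    print(b, s, r)
    # Consistency check on the reported post-QC airborne values.
    print(math.hypot(-0.21, 1.46))
```

Combining the reported post-QC bias (−0.21 K) and STD (1.46 K) in this way gives approximately 1.475 K, in line with the reported RMSE of 1.47 K once rounding of the inputs is taken into account. It also explains why the RMSE is very close to the STD whenever the bias is small.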

Comparison between in situ and MODIS IST records
The ground-based IR, LWR, and Buoy IST records were compared with the MODIS IST. All three in situ IST records are point measurements, and one in situ data point can only match one satellite pixel. Of the 753 IR and LWR IST points (from CHINARE-2014), 159 LWR IST, and 54 Buoy IST measurements (from N-ICE 2015) that were analyzed, 696 CHINARE-2014 IR IST, 692 CHINARE-2014 LWR IST, 137 N-ICE 2015 LWR IST, and 49 Buoy IST records passed the QC. The results are summarized in Table 3, and the data analysis before QC is provided in the supplement.
In general, all in situ ISTs were warmer than the MODIS IST, with positive (warm) bias values ranging from 0.55 K (IR IST) to 1.74 K (Buoy IST). The in situ ISTs were consistent with the MODIS IST, with correlation coefficients higher than 0.9. The IR IST was the most stable and accurate among the in situ IST data, with the smallest filter fraction of outliers (7.57%, calculated as the number of data pairs removed by the QC procedure divided by the original number of matched data pairs) and the smallest bias (0.55 K). The STD and RMSE of the IR IST were 1.52 and 1.61 K, respectively, slightly higher than those of the airborne IST. The LWR IST data were generally noisier and more prone to outliers, as 8.10% of the CHINARE-2014 and 13.84% of the N-ICE 2015 data were filtered by QC. When QC was not applied, the overall accuracy of the N-ICE 2015 LWR IST was the worst among the in situ data (bias = 1.95 K; STD = 3.79 K; RMSE = 4.25 K). After applying QC, the LWR ISTs had good precision (CHINARE-2014: bias = 1.47 K, STD = 1.39 K, RMSE = 2.02 K; N-ICE 2015: bias = 0.90 K, STD = 1.17 K, RMSE = 1.48 K). The Buoy IST exhibited similar precision (STD = 1.62 K) but had the worst accuracy and the largest uncertainty (bias = 1.74 K; RMSE = 2.37 K). The large uncertainty can be attributed to the inconsistency between different buoys and to the procedure for retrieving IST from the temperature chains. The scatter diagrams and error distribution histograms of the in situ measurements are shown in Figure 3. No apparent temperature dependence was found for the IR and LWR IST data, and most Buoy IST data were found in the low-temperature range (< 260 K). We also found that the temperature difference for LWR IST was not symmetrical, with more data points having positive biases (Figure 3f and Figure 3g).
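As a worked check of the filter fraction, the short sketch below recomputes it from the matched and retained counts quoted in the text. The function name is hypothetical; the counts are those reported in this paper.

```python
def filter_fraction(n_matched, n_passed):
    """Percentage of matched pairs removed by QC."""
    return 100.0 * (n_matched - n_passed) / n_matched


# (matched before QC, retained after QC), as reported in the text
COUNTS = {
    "CHINARE-2014 IR IST": (753, 696),
    "CHINARE-2014 LWR IST": (753, 692),
    "N-ICE 2015 LWR IST": (159, 137),
    "N-ICE 2015 Buoy IST": (54, 49),
}

if __name__ == "__main__":
    for name, (matched, passed) in COUNTS.items():
        print(f"{name}: {filter_fraction(matched, passed):.2f}%")
```

The first three values reproduce the reported 7.57%, 8.10%, and 13.84%. The Buoy IST counts imply a filter fraction of about 9.3%, although with only 54 matched points this figure is sensitive to individual records.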

Comparison between IR and LWR ISTs
During CHINARE-2014, the infrared radiometer and the longwave radiation pyrgeometer were mounted on the same AWS, allowing a direct comparison between the IR and LWR IST records. This study analyzed 6731 IR-LWR IST data pairs from August 2014 to May 2015. The IR and LWR ISTs were acquired at the same time and location, and both are point measurements; thus, the comparison between them directly reveals the difference between these two IST measurement technologies.

Table 3. The r, bias, STD, and RMSE of in situ IR, LWR, and Buoy IST records with respect to MODIS IST. The bias is calculated as in situ IST minus MODIS IST. All data analyzed were quality controlled using the 3σ-Hampel identifier. The count is the number of data pairs. The filter fraction is calculated as the number of data pairs removed by the QC procedure divided by the original number of matched data pairs.
The results show that the IR and LWR IST records generally agreed well in all temperature ranges, with a correlation coefficient (r) higher than 0.99. In most cases, the LWR IST was warmer than the IR IST (Figure 4a), with a bias of 0.75 K (LWR IST minus IR IST), median error of 0.62 K, STD of 0.73 K, and RMSE of 1.04 K. The STD and RMSE values here were smaller than those of the airborne and all ground-based ISTs with respect to MODIS IST, indicating that the in situ ISTs were more consistent with each other than with the satellite IST product. The maximum negative and positive differences between the LWR and IR IST were −8.24 and 5.88 K, respectively. The bias histogram presented in Figure 4b suggests that the differences did not follow a normal distribution and that the few large negative differences were mainly gross errors. We then analyzed the time dependency of the difference between the IR and LWR IST records. As shown in Figure 4c, the bias, STD, and RMSE values between the IR and LWR IST measurements slightly increased after the AWS was deployed in August. For each month, the bias was smaller than 1.0 K, and the STD and RMSE were smaller than 1.5 K. The best agreement between the IR and LWR IST records occurred in November, with the smallest bias, STD, and RMSE values, followed by February.

Scale and spatial analysis
The representativeness of ground-based measurements has been questioned in several studies (e.g. Guillevic et al. 2012), as the spatial resolution of satellite images is much coarser than that of pointwise in situ observations. The scale issue, defined as the contrast of surface characteristics or information at different scales (Wu and Li 2009), raises the question of whether dense airborne IST can better represent the physical property described by low-resolution satellite image pixels.
In this study, we analyzed the representativeness of airborne observations. The spatial resolution of MODIS is 1×1 km, much coarser than that of the airborne data (around 15 m), which means that one MODIS pixel can match more than one airborne IST data point. In theory, for a homogeneous ice surface, the difference between single airborne and MODIS IST values should be negligible. But for heterogeneous regions, such as lead areas and the marginal ice zone, more airborne data points could capture more surface types in the corresponding MODIS pixel and have better precision than a single data point. To verify this assumption and check the representativeness of pointwise in situ data, the matched Airborne-MODIS IST data pairs were categorized by the number of airborne IST data points contained in one MODIS pixel with a step length of 10, and the bias and STD were calculated for each category (Figure 5). In general, the variation of the STD values was not pronounced. When the number of airborne IST data points within one MODIS pixel was less than 50, the STD decreased as the number of airborne IST data points increased, indicating that more airborne IST measurements could better describe the variations in surface temperature. However, after reaching 50, as the number of airborne IST data points increased, the local variation of IST within a MODIS pixel stopped rising. This means that the scale effect has a limited range of influence. In addition, when the number of airborne IST data points within one MODIS pixel was between 30 and 100, the STDs were better than 1.46 K, which is the average precision for airborne IST data.

Figure 5. Relationship between the number of airborne IST data within each MODIS pixel and corresponding bias/STD values. The blue asterisk is the STD value, and the red cross refers to bias. The black dashed line is the y = 1.46 K line, which was the average STD value of all airborne IST with respect to MODIS IST.
We thus suggest that in order to fully characterize the overall features of the sea ice surface, more than 30 measurements (airborne or in situ) are needed for each square kilometer. However, no apparent relationship between data number and bias was found.
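The categorization by sample count can be sketched as below. The bin edges (1-10, 11-20, and so on) and the input layout are assumptions; the text only specifies a step length of 10.

```python
import math
from collections import defaultdict


def stats_by_sample_count(pairs, step=10):
    """Bias and STD of airborne-minus-MODIS IST, grouped by sample count.

    pairs: iterable of (n_airborne, airborne_mean_ist, modis_ist) tuples,
    one per MODIS pixel (assumed layout).
    Returns {(n_low, n_high): (bias, std, n_pixels)}.
    """
    bins = defaultdict(list)
    for n_airborne, airborne_ist, modis_ist in pairs:
        bins[(n_airborne - 1) // step].append(airborne_ist - modis_ist)
    result = {}
    for b, d in sorted(bins.items()):
        bias = sum(d) / len(d)
        std = math.sqrt(sum((x - bias) ** 2 for x in d) / len(d))
        result[(b * step + 1, (b + 1) * step)] = (bias, std, len(d))
    return result


if __name__ == "__main__":
    demo = [(5, 251.0, 250.0), (8, 249.0, 250.0), (15, 252.0, 250.0)]
    print(stats_by_sample_count(demo))
```

Reporting the number of pixels in each category alongside the statistics helps to judge whether an apparent change in STD between categories is supported by enough data.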
As mentioned in Castro, Wick, and Emery (2012), physical and geographical surroundings have an important influence on the accuracy of SST measured by buoys. Thus, it is necessary to examine the spatial character of the error distribution for IST research. In subsequent analyses, only the MODIS pixels matching more than 30 airborne IST data points were selected, and the absolute difference between airborne and satellite IST was calculated. To avoid data overlap and better illustrate the spatial distribution of the difference between airborne IST and MODIS IST, the absolute difference was gridded into a 25-km grid, and the results are displayed in Figure 6. Most pixels with high bias (red pixels in Figure 6) were found along the coasts of Greenland, which could be attributed to the unstable environmental conditions during takeoff and landing.

Figure 6. Distribution of mean absolute bias between airborne IST and MODIS IST. The original data were downscaled into a 25 km×25 km grid.
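The gridding of the absolute difference can be sketched as follows, assuming the matched pixels are available in projected coordinates (metres) and that each 25 km cell stores the mean of the absolute differences falling inside it. The input layout and the function name are hypothetical.

```python
import math
from collections import defaultdict


def grid_mean_abs_diff(points, cell_m=25000.0):
    """Mean absolute airborne-minus-MODIS difference per grid cell.

    points: iterable of (x_m, y_m, airborne_ist, modis_ist) tuples in a
    projected coordinate system (assumed layout).
    Returns {(ix, iy): mean_abs_diff}, where ix and iy are cell indices.
    """
    cells = defaultdict(list)
    for x, y, airborne_ist, modis_ist in points:
        key = (math.floor(x / cell_m), math.floor(y / cell_m))
        cells[key].append(abs(airborne_ist - modis_ist))
    return {key: sum(v) / len(v) for key, v in cells.items()}


if __name__ == "__main__":
    demo = [(1000.0, 1000.0, 251.0, 250.0), (2000.0, 24000.0, 248.0, 250.0)]
    print(grid_mean_abs_diff(demo))
```

Using `math.floor` rather than integer truncation keeps the cells contiguous across the projection origin, where the coordinates change sign.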

Impact of filter criteria
Following previous works, we adopted the 3σ-Hampel identifier to remove residual cloud pixels in the satellite images and gross errors in the in situ observations. However, in practical use, it is unrealistic to assign an in situ observation to every satellite image pixel and vice versa. To better exploit satellite data, it is necessary to assess the performance of the strict cloud mask, solar zenith mask, and sensor zenith mask, which rely only on satellite data. An efficient mask for satellite images should remove most of the degraded pixels contaminated by clouds and other issues while keeping valid data. To evaluate the performance of the three commonly used masks, we calculated their filter fractions and the resulting changes in bias and STD based on the OIB airborne IST. The filter fraction is defined as the number of filtered data pairs divided by the number of matched data pairs. The results are listed in Table 4, and the distributions of excluded and remaining data using these filters are presented in Figure 7. The strict cloud mask was the most efficient mask in improving the data quality of MODIS IST and retained the most valid data, as the STD between MODIS and airborne IST decreased from 2.28 K to 2.09 K with only 6.15% of data pairs filtered. Both the solar and sensor masks were inefficient in refining the quality of MODIS IST. The filter fractions for the solar and sensor masks were 38.00% and 23.36%, respectively, much higher than that of the strict cloud mask, and they yielded higher STDs. When all three criteria were adopted, although better results were obtained (STD = 2.00 K), more than half of the matched data pairs were eliminated. We conclude that a strict cloud mask is necessary in any situation, while solar and sensor zenith masks should be used selectively to balance data quality and quantity.
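The mask evaluation can be illustrated with the sketch below. The record layout and the cloud flag labels are hypothetical, and the strict cloud mask is assumed here to keep only pixels labelled as clear; the solar and sensor thresholds follow the values given in the data processing section (solar zenith between 80 and 90 degrees removed, sensor zenith above 50 degrees removed).

```python
import math


def keep_pair(pair, cloud=True, solar=True, sensor=True):
    """Return True if a matched pair survives the selected masks."""
    if cloud and pair["cloud_flag"] != "clear":  # assumed flag label
        return False
    if solar and 80.0 <= pair["solar_zenith"] <= 90.0:
        return False
    if sensor and pair["sensor_zenith"] > 50.0:
        return False
    return True


def evaluate_masks(pairs, **switches):
    """Filter fraction (%) and STD of the remaining IST differences.

    pairs: list of dicts with keys 'diff', 'cloud_flag', 'solar_zenith',
    and 'sensor_zenith' (assumed layout).
    """
    kept = [p["diff"] for p in pairs if keep_pair(p, **switches)]
    fraction = 100.0 * (len(pairs) - len(kept)) / len(pairs)
    if not kept:
        return fraction, float("nan")
    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((d - mean) ** 2 for d in kept) / len(kept))
    return fraction, std


if __name__ == "__main__":
    demo = [
        {"diff": 0.5, "cloud_flag": "clear", "solar_zenith": 70.0, "sensor_zenith": 20.0},
        {"diff": 4.0, "cloud_flag": "probably_cloudy", "solar_zenith": 60.0, "sensor_zenith": 30.0},
    ]
    print(evaluate_masks(demo, cloud=True, solar=False, sensor=False))
```

Evaluating each mask separately and then in combination, as done for Table 4, makes the tradeoff explicit: a mask is worthwhile only if the reduction in STD justifies the fraction of data it removes.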

Conclusion
In this study, we evaluated the correlation and uncertainties of airborne and ground-based IR, LWR, and Buoy IST records with respect to MODIS IST products in the Arctic Ocean. Our results show that IST data from various measuring principles are biased against each other and hold different uncertainties. The ground and airborne IST records showed good agreement with MODIS IST. Their correlation coefficients were higher than 0.9, and the bias values ranged from −0.21 K to 1.74 K, the STD values from 1.17 K to 1.62 K, and the RMSE values from 1.47 K to 2.37 K. The airborne IST yielded the best overall accuracy and smallest uncertainty, and was nearly unbiased relative to MODIS IST, with a cold bias of 0.21 K, low STD of 1.46 K, and RMSE of 1.47 K, partly because the scale of airborne IST is closer to that of satellite IST data. The airborne data acquired at the beginning and end of a flight course should be eliminated to improve data quality. All ground-based ISTs were warmer than MODIS ISTs. Among them, the IR IST had the best overall accuracy and was more stable than other ISTs, with a bias of 0.55 K, STD of 1.52 K, and RMSE of 1.61 K, while the LWR IST tended to suffer from gross errors. For the N-ICE 2015 campaign, 13.84% of LWR IST records were filtered as outliers, and for CHINARE-2014, the LWR IST was also noisier than the co-located IR IST. Once the anomalous data were removed, the LWR IST showed good precision, with an STD of 1.39 K for CHINARE-2014 and 1.17 K for N-ICE 2015. The Buoy IST had similar precision compared to other IST records (STD = 1.62 K); however, the large bias (1.74 K) and consequent RMSE (2.37 K) decreased its overall accuracy. The concurrent and co-located IR and LWR IST records agreed better with each other than any ground or airborne IST record did with MODIS IST (r = 0.99, bias = 0.75 K, STD = 0.73 K, RMSE = 1.04 K).
We also found that the strict cloud mask is the most efficient filter to refine the data quality of satellite-based IST, while the improvements from solar and sensor zenith masks are not significant. Our findings could encourage subsequent works to fully exploit existing IST records from multiple sources, including satellite, airborne, and ground measurements.

Data availability statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.