Validation of Himawari-8 Sea Surface Temperature Retrievals Using Infrared SST Autonomous Radiometer Measurements

This study evaluates five years (2016–2020) of Himawari-8 (H8) Sea Surface Temperature (SST) Level 2 Pre-processed (L2P) data produced by the Australian Bureau of Meteorology (Bureau) against shipborne radiometer SST measurements obtained from the Infrared SST Autonomous Radiometer (ISAR) onboard the research vessel RV Investigator. All data sets employed in this study went through careful quality control before use, and only the most trustworthy measurements are retained. Based on a large matchup database (31,871 collocations in total, including 16,418 during daytime and 15,453 during night-time), the Bureau H8 SST product is found to be of good quality, with a mean bias ± standard deviation (SD) of −0.12 °C ± 0.47 °C for the daytime and −0.04 °C ± 0.37 °C for the night-time. The performance of the H8 data is also examined under different environmental conditions, as determined by the observations obtained concurrently from RV Investigator. Daytime and night-time satellite data behave slightly differently. During the daytime, a cold bias can be seen under almost all environmental conditions, including for most values of wind speed, SST, and relative humidity. The night-time H8 SST product, on the other hand, is more stable under most meteorological conditions, with the mean bias usually close to zero.


Introduction
Sea surface temperature (SST) has long been recognized as an essential climate variable [1,2]. Governed by both atmospheric and oceanic processes, SST plays a significant role in air-sea interaction, including the exchange of heat, momentum, moisture, and gases. Its variability patterns are closely related to, or used to define, critical climate phenomena on a global scale, such as the El Niño-Southern Oscillation (ENSO), or on a smaller basin scale, such as the Pacific Decadal Oscillation (PDO) [3]. SST also functions as a thermal expression of subsurface dynamics such as fronts and eddies [4].
The Advanced Himawari Imager (AHI) onboard the Japanese geostationary satellite Himawari-8 (H8) was launched in October 2014. The AHI is an optical radiometer with an observation frequency of every 10 min for the full disk. It has 16 spectral bands from visible and infrared (IR) wavelengths, five of which (centred at 3.9, 8.6, 10.4, 11.2, and 12.4 µm) provide information for SST retrievals [5]. H8 has provided SST observations over the western Pacific and eastern Indian Ocean with a spatial resolution of 2 km at nadir and temporal resolution of 10 min since July 2015, when it became operational. Currently, there are a number of agencies operationally generating H8 SST data sets, including the Japan Aerospace Exploration Agency (JAXA) [5], the National Oceanic and Atmospheric Administration (NOAA) [6], and the Australian Bureau of Meteorology (Bureau) [7], employing different retrieval algorithms. SST retrievals from JAXA and NOAA (namely the Advanced Clear-Sky Processor for Oceans, ACSPO, products) have been validated in multiple studies against drifting and/or mooring buoys [5,6,[8][9][10],

Ship Observations
Skin SST measurements, which serve as the "ground truth" in this study, are obtained from an Infrared SST Autonomous Radiometer (ISAR) model 5D onboard Australia's Marine National Facility, RV Investigator ([24]; http://imos.org.au/facilities/shipsofopportunity/sstsensors/, accessed on 20 April 2023). As a self-calibrating instrument, the ISAR contains two blackbody reference cavities, a rotating gold mirror, and one single-channel radiometer that operates within the 9.6–11.5 µm spectral band. ISAR can measure SSTs at the same depth (~10–20 µm) as IR radiometers onboard satellites to an accuracy of ~0.1 K root-mean-square error [25], making it an excellent choice for the validation of satellite retrievals. Before and after each cruise, the ISAR instrument is calibrated against a second-generation Concerted Action for the Study of the Ocean Thermal Skin (CASOTS-II) reference blackbody [26]. For this study, we use the delayed-mode, recalibrated ISAR data.
Also onboard RV Investigator, and measuring SST at a depth of ~7–10 m, is a SeaBird SBE38 temperature sensor (https://www.seabird.com/sbe-38-digital-oceanographic-thermometer/product?id=60762467703, accessed on 20 April 2023), located within the thermosalinograph water intake in the vessel's drop keel [27]. The SBE38 sensor samples the water temperature every 5 s, and the samples are averaged over one minute. It is calibrated on a yearly basis by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) Oceanographic Calibration Facility. In this study, only those depth SSTs obtained simultaneously with ISAR skin SSTs are retained [27].
In addition to the SST data sets, several environmental variables that are measured concurrently on the ship are also included in this study. These include wind speed, relative humidity (RH), air temperature, and shortwave flux ([28]; https://imos.org.au/facilities/shipsofopportunity/airseaflux, accessed on 20 April 2023). The wind speed U is measured by an ultrasonic sensor at a height of ~24.5 m above the summer load line, from which the 10 m wind speed U10 is computed [29]. All ISAR, SBE38, and meteorological data are sourced from the IMOS THREDDS directories of the Australian Ocean Data Network [30].
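The text does not state how U10 is derived from the wind measured at ~24.5 m, beyond citing [29]. As a rough illustration only, the sketch below uses a neutral-stability logarithmic wind profile. This is a common textbook simplification, not necessarily the method of [29]. The function name `wind_at_10m` and the roughness length `z0` are assumptions introduced here.

```python
import math

def wind_at_10m(u_z, z=24.5, z0=1.52e-4):
    """Adjust a wind speed measured at height z (m) to 10 m.

    Assumes a neutral logarithmic wind profile. z0 is an assumed
    sea-surface roughness length in metres, not a value from the paper.
    """
    if z0 <= 0 or z <= z0:
        raise ValueError("heights must satisfy z > z0 > 0")
    return u_z * math.log(10.0 / z0) / math.log(z / z0)

# Example: an 8 m/s wind at the sensor height maps to a slightly lower U10.
u10 = wind_at_10m(8.0)
```

A stability-dependent bulk algorithm would give somewhat different values under strongly stable or unstable conditions, so this should be read as a first-order approximation.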

Quality Control and Matchups
All datasets go through careful quality control before they are used. For the satellite dataset, only the highest-quality-level (QL = 5) SST measurements are retained. In the Bureau H8 L2P SST dataset, the quality level is determined by the probability of cloud clearance and the sensitivity of simulated SST to real SST, and a chi-squared test is adopted to compare the size of any discrepancies between the simulated and the real SSTs [7,31]. To further eliminate any potential cloud contamination, only pixels whose clear-sky probability (the probability_of_clear variable in the L2P files) equals 1 are kept. SSES-bias-corrected SSTs are used for this study.
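The satellite-side filtering described above can be sketched as follows. This is a minimal illustration on plain arrays, not the processing code used in the study. Only `probability_of_clear` is named in the text; the other argument names follow common GHRSST L2P variable naming and are assumptions, as is the convention that the bias-corrected SST is the retrieved SST minus the SSES bias.

```python
import numpy as np

def select_h8_sst(sst, sses_bias, quality_level, probability_of_clear):
    """Return (bias-corrected SST, keep mask) for H8 L2P pixels.

    Pixels are kept only if quality_level == 5 and probability_of_clear == 1.
    The correction sst - sses_bias is an assumed convention (GHRSST style).
    """
    sst = np.asarray(sst, dtype=float)
    sses_bias = np.asarray(sses_bias, dtype=float)
    quality_level = np.asarray(quality_level)
    probability_of_clear = np.asarray(probability_of_clear, dtype=float)

    keep = (
        (quality_level == 5)
        & (probability_of_clear == 1.0)
        & np.isfinite(sst)
        & np.isfinite(sses_bias)
    )
    corrected = np.where(keep, sst - sses_bias, np.nan)
    return corrected, keep

# Three hypothetical pixels: only the first passes both filters.
corrected, keep = select_h8_sst(
    sst=[290.0, 291.0, 292.0],
    sses_bias=[-0.1, 0.2, 0.0],
    quality_level=[5, 4, 5],
    probability_of_clear=[1.0, 1.0, 0.9],
)
```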
For the ISAR data used in this study, the total expanded uncertainty for the skin SST value is calculated using the uncertainty code produced by Werenfrid Wimmer [25]. The ISAR total uncertainty is defined such that the measured SST differs from its true value by less than the stated uncertainty in 95% of cases, and can be considered as about 2 times the SD. In this study, only ISAR skin SSTs with a total uncertainty ≤0.2 °C are used. This threshold is adopted from [27], where several tests were performed before the 0.2 °C threshold was determined to be sufficiently strict without discarding too many good measurements. For U10 < 10 m s⁻¹, there is little dependency of the total uncertainty in ISAR data on wind speed, and the uncertainty is relatively stable at around 0.12 °C [27]. As the wind strengthens, ship roll starts to affect the performance of the instrument. When U10 is >15 m s⁻¹, the total uncertainty increases sharply [27]. Therefore, in addition to requiring the total uncertainty to be ≤0.2 °C, this study also restricts the wind speed to below 15 m s⁻¹ to ensure that only the most trustworthy ISAR measurements are included.
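The two ISAR acceptance criteria above (total uncertainty ≤0.2 °C and U10 below 15 m s⁻¹) amount to a simple mask. The sketch below is illustrative only; the function and argument names are hypothetical and the inputs are assumed to be aligned arrays.

```python
import numpy as np

def select_isar(skin_sst, total_uncertainty, u10,
                max_uncertainty=0.2, max_wind=15.0):
    """Boolean mask of ISAR skin SSTs that pass the uncertainty and wind limits."""
    skin_sst = np.asarray(skin_sst, dtype=float)
    total_uncertainty = np.asarray(total_uncertainty, dtype=float)
    u10 = np.asarray(u10, dtype=float)
    # Comparisons with NaN evaluate to False, so missing values are rejected.
    return (
        np.isfinite(skin_sst)
        & (total_uncertainty <= max_uncertainty)
        & (u10 < max_wind)
    )

# Hypothetical samples: high uncertainty, strong wind, and a missing SST are rejected.
isar_mask = select_isar(
    skin_sst=[25.1, 25.2, 25.3, np.nan],
    total_uncertainty=[0.12, 0.25, 0.15, 0.10],
    u10=[5.0, 6.0, 16.0, 4.0],
)
```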
Depth SSTs from SBE38 also went through a series of quality control (QC) steps as adopted in the IMOS data processing system [28]. The QC procedure consists of a physical bounds check (−2 to 40 °C), time and geolocation check, platform speed check, water temperature check, exhaust contamination check, air temperature and wind check, statistical check, and climatological bounds check [28]. Only depth SSTs that have passed all checks are retained.
Following the above QC steps, matchups between the SST measurements (H8, ISAR, and SBE38) and the environmental conditions (wind speed, relative humidity, shortwave radiation, and air temperature) are generated for each cruise from 2016 to 2020. Due to the limited availability of high-quality, reprocessed ISAR SST data in 2021 and 2022, we do not validate the satellite SSTs for these two years in this study. The temporal window is set as ±5 min around the H8 SST observation time, and the spatial window is set to 5 km. During the collocation process, only the best pair is retained. Here, "best pair" means that only the single match between an H8 and an ISAR measurement that is closest in space and time is retained. We do not keep multiple ISAR measurements that match a single satellite measurement, or multiple satellite measurements that match a single ISAR measurement.
Remote Sens. 2023, 15, 2841
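The one-to-one "best pair" collocation rule described above can be sketched as follows. The text does not say how closeness in space and time is combined, so the normalised score used here (time offset divided by 5 min, distance divided by 5 km) and the greedy assignment are assumptions, as are all function names. The brute-force pairwise matrix is only practical for small inputs.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def best_pairs(sat_t, sat_lat, sat_lon, ship_t, ship_lat, ship_lon,
               max_dt_s=300.0, max_dist_km=5.0):
    """Return one-to-one (satellite index, ship index) matchups.

    Times are in seconds. Candidates must lie within both windows; the
    closest candidates are accepted first and each index is used once.
    """
    sat_t, sat_lat, sat_lon = (np.asarray(v, dtype=float)
                               for v in (sat_t, sat_lat, sat_lon))
    ship_t, ship_lat, ship_lon = (np.asarray(v, dtype=float)
                                  for v in (ship_t, ship_lat, ship_lon))

    dt = np.abs(sat_t[:, None] - ship_t[None, :])
    dist = haversine_km(sat_lat[:, None], sat_lon[:, None],
                        ship_lat[None, :], ship_lon[None, :])
    ok = (dt <= max_dt_s) & (dist <= max_dist_km)

    # Assumed combined closeness score: both terms are scaled by their window.
    score = np.hypot(dt / max_dt_s, dist / max_dist_km)
    candidates = np.argwhere(ok)
    order = np.argsort(score[ok], kind="stable")

    used_sat, used_ship, pairs = set(), set(), []
    for i, j in candidates[order]:
        i, j = int(i), int(j)
        if i not in used_sat and j not in used_ship:
            used_sat.add(i)
            used_ship.add(j)
            pairs.append((i, j))
    return pairs

# Two satellite pixels and four ship samples (hypothetical values).
pairs = best_pairs(
    sat_t=[0.0, 600.0], sat_lat=[-18.0, -18.0], sat_lon=[146.0, 146.01],
    ship_t=[60.0, 120.0, 630.0, 5000.0],
    ship_lat=[-18.0, -18.001, -18.0, -20.0],
    ship_lon=[146.0, 146.0, 146.01, 150.0],
)
```

A greedy assignment is simple to reason about but is not guaranteed to minimise the total score over all pairs; an optimal assignment method could be substituted if that matters.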
The ISAR instrument is known to have performed suboptimally on some cruises, for reasons that are not yet fully understood, such as issues related to the mirrors or electronics. These suboptimal ISAR data may still escape the total uncertainty check, as the total expanded uncertainty estimate strongly depends on the roll of the ship [25]. Therefore, before obtaining the final matchups, we conduct a final visual QC on ISAR SSTs by comparing them with the collocated SBE38 depth temperatures for each individual cruise under night-time, well-mixed (U10 > 2 m s⁻¹) conditions only, when the only temperature difference should be the cool skin effect [32].
In this study, daytime is defined as solar zenith angle (SZA) < 90° and night-time as SZA > 110°, to avoid the twilight times. SZA is calculated as $\cos\theta_s = \sin\alpha_s = \sin\Phi\,\sin\delta + \cos\Phi\,\cos\delta\,\cos h$, where $\theta_s$ is the solar zenith angle, $\alpha_s$ is the solar altitude angle ($\alpha_s = 90^\circ - \theta_s$), $\Phi$ is the local latitude, $\delta$ is the current declination of the sun, which is determined by the day of the year, and $h$ is the hour angle in local solar time (LST, determined by longitude; $h = (\mathrm{LST} - 12) \times 15^\circ$). All angles are converted to radians when the trigonometric functions are evaluated.
If the average night-time SST difference (ISAR minus SBE38) for a certain cruise is unrealistically large, then the whole cruise, including the daytime measurements, is considered untrustworthy and is discarded. The comparisons between ISAR skin SST and SBE38 depth SST for all cruises between 2016 and 2020 under well-mixed night-time conditions are shown in Table 1. As expected, most of the average night-time temperature differences are minor, between about −0.38 °C and 0.05 °C, due to the cool skin effect [32][33][34]. However, five cruises stand out with extraordinarily large negative temperature differences (absolute values > 0.8 °C; red rows in Table 1).
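The day/night classification based on the SZA formula above can be sketched as follows. The text only says that the declination depends on the day of the year and that LST depends on longitude, so two simplifications are assumed here: Cooper's approximation for the declination, and LST taken as UTC plus longitude/15 h with no equation-of-time correction. The function names are hypothetical.

```python
import math

def solar_zenith_deg(lat_deg, lon_deg, day_of_year, utc_hour):
    """Solar zenith angle in degrees from
    cos(theta_s) = sin(lat) sin(decl) + cos(lat) cos(decl) cos(h)."""
    # Assumed declination model (Cooper's approximation), in radians.
    decl = math.radians(23.45) * math.sin(2 * math.pi * (284 + day_of_year) / 365.0)
    # Assumed local solar time: UTC shifted by longitude, no equation of time.
    lst = (utc_hour + lon_deg / 15.0) % 24.0
    hour_angle = math.radians((lst - 12.0) * 15.0)
    lat = math.radians(lat_deg)
    cos_sza = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sza))))

def day_night_flag(sza_deg):
    """Classify using the thresholds in the text: day < 90 deg, night > 110 deg."""
    if sza_deg < 90.0:
        return "day"
    if sza_deg > 110.0:
        return "night"
    return "twilight"

# Near 18 S, 146 E on day 286 (12 October 2016), at local solar noon and midnight.
noon_utc = 12.0 - 146.0 / 15.0
sza_noon = solar_zenith_deg(-18.0, 146.0, 286, noon_utc)
sza_midnight = solar_zenith_deg(-18.0, 146.0, 286, noon_utc + 12.0)
```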
These biases are more than three times larger than the average night-time cool skin amplitude (~−0.23 °C as observed in [27]); therefore, these five cruises are deemed to be of questionable quality and are removed from the final collocations. A threshold of three times the average night-time cool skin amplitude is considered justifiable and serves the purposes of this study well.
Table 1. Comparison between ISAR skin SST and SBE38 depth SST for all cruises between 2016 and 2020 under well-mixed (U10 > 2 m s⁻¹) night-time conditions only. Statistics include number of matchups (N), bias (ISAR-SBE38 SSTs), SD of biases, and cruise date period. Rows in red indicate the cruises to be discarded because of the abnormally large mean biases between ISAR and SBE38 SSTs.
The final data set contains 31,871 matchups, including 16,418 during daytime and 15,453 during night-time. A flowchart of the quality control measures on different data sets and the matchup process is shown in Figure 1. For the remainder of the study, all analyses are based on this database unless explicitly stated otherwise, and no additional wind speed filter is applied.
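The cruise-level screening described above (discarding a whole cruise when its mean night-time ISAR minus SBE38 difference is abnormally large) can be sketched as below. In the study this was a visual QC step, so the automated version here is only an illustration. The 0.8 °C threshold is taken from the absolute values quoted in the text. The function name and cruise labels are hypothetical, and the input is assumed to be already restricted to night-time samples with U10 > 2 m s⁻¹.

```python
import numpy as np

def screen_cruises(cruise_ids, isar_minus_sbe38, threshold=0.8):
    """Return {cruise_id: (mean difference, keep flag)} per cruise.

    A cruise is kept only if the absolute mean difference does not
    exceed the threshold (in deg C).
    """
    cruise_ids = np.asarray(cruise_ids)
    diff = np.asarray(isar_minus_sbe38, dtype=float)
    result = {}
    for cid in np.unique(cruise_ids):
        mean_diff = float(np.nanmean(diff[cruise_ids == cid]))
        result[str(cid)] = (mean_diff, abs(mean_diff) <= threshold)
    return result

# Hypothetical cruises: "cruise_A" shows a normal cool skin signal, "cruise_B" does not.
screening = screen_cruises(
    ["cruise_A", "cruise_A", "cruise_A", "cruise_B", "cruise_B"],
    [-0.20, -0.25, -0.30, -0.90, -1.10],
)
```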

A Diurnal Variation Case Study
A case study to evaluate the three SST datasets, i.e., Bureau H8 L2P, ISAR skin, and SBE38 depth SST measurements, is undertaken during a strong diurnal variation (DV) event that occurred on 12 October 2016 near the north-eastern Australian coast (~146°E, 18°S) [11]. Part of the motivation for this case study is to compare with [11], which also examined this DV event using the same data sets, except that the H8 SST retrievals were sourced from JAXA. The time series of the SSTs for this date are shown in Figure 2. First, it is encouraging to see that the Bureau H8 SSTs follow the ISAR skin SSTs very closely for the whole day (Figure 2a). Similar to the JAXA product (Figure 2b), the Bureau H8 SSTs also capture the DV amplitude well (Figure 2a). Since both studies use the same data source for ISAR and SBE38 SSTs, the respective time series (red and green dots) should in theory be identical in both panels. However, in order to retain the full data set and show the DV cycle in more detail, we do not apply any QC to the data sets for this particular case study, hence the slight differences, such as the gaps at different hours, between the red and green dots in the two panels. The inconsistency could also be partially due to the different availabilities of the two satellite SST data sets during collocation, as their retrieval algorithms and collocation methods differ. Furthermore, in both panels, for LST before 10 h and after 18 h, the SBE38 depth SSTs (green dots) are usually warmer than the ISAR and H8 SSTs by a few tenths of a degree, which can be a good indicator of the cool skin effect. The difference between SBE38 and ISAR skin SST reaches a maximum of ~0.4–0.5 °C at around local sunrise (~5–7 LST), which is consistent with the findings in [27].


Figure 2. A case study showing a strong DV event that occurred on 12 October 2016 over the north-eastern Australian coast (~146°E, ~18°S). The upper panel (a) is the result from this study and the lower panel (b) is from [11] (their Figure 6). Identical ISAR and SBE38 SSTs are used in both studies, but the H8 SSTs in [11] are sourced from JAXA. Note that there is no QC applied on any of the SST data sets for this case study (upper panel).

Statistics
With the confidence in the Bureau H8 L2P SST product gained from the above case study, we proceed to calculate statistics between ISAR and H8 skin SSTs based on the matchup database. The paths for all the voyages in the quality-controlled collocation set are shown in Figure 3, for daytime and night-time separately, with the colour representing the temperature difference between H8 and ISAR SST. Note that a few cruises crossed the Southern Ocean and approached the Antarctic coast, but these are filtered out by the relatively strict set of QC methods in this study. Most of the cruises retained are along the Australian coasts. Figure 3a shows that there are more cold ΔTs (H8-ISAR) than warm ΔTs in the daytime, especially over the western Australian coast. In the night-time, there are more ΔTs close to zero (Figure 3b).
Yearly and overall statistics, including the number of matchups (N), mean and median bias (satellite minus ISAR SST), SD and robust SD (RSD), and the coefficient of determination (square of correlation coefficient; R²), for daytime and night-time collocations between the H8 and ISAR SSTs are shown in Table 2. Here, an RSD is defined as the Median Absolute Difference (MAD) multiplied by 1.48. To obtain a MAD, we first calculate the median of all the ΔT (H8-ISAR) values. Then, the differences between this median and all ΔT values are calculated. Finally, we take the median of the absolute values of those differences as the MAD. For the daytime matchups, the mean and median biases are both −0.12 °C, with an SD/RSD of 0.47 °C/0.31 °C. This negative daytime mean/median bias in H8 can be seen in all the years, with the largest being −0.21 °C/−0.17 °C in 2016 and the smallest being −0.08 °C/−0.06 °C in 2019. The statistics are better in the night-time, which should be largely attributed to the adoption of the extra 3.9 µm channel in the night-time retrieval algorithm. The overall mean and median biases are −0.04 °C and 0.03 °C, respectively, much smaller in magnitude than those in the daytime. Night-time SD and RSD also improve on the daytime values, being 0.37 °C and 0.24 °C, respectively.
One year that stands out in Table 2 is 2016. The daytime mean bias is the largest (−0.21 °C) of all years and the coefficient of determination is the lowest (0.95). Most interestingly, the night-time satellite data in 2016 are significantly colder than in the other years (mean bias of −0.16 °C), although the median bias of −0.06 °C indicates that this could be due to cold outliers. Therefore, we examine time series of night-time ΔT (H8-ISAR) for each year, and the plots are shown in Figure 4. It is evident that there are significantly more cold outliers (defined as ΔT < −1 °C here) in the night-time H8 data when compared with ISAR SSTs in 2016. The percentage (6.2%) is far higher than in any other year. Given the relatively stable behaviour of ISAR SSTs over the years when compared with SBE38 depth SSTs, as illustrated in Table 1 (cruises that are retained), these large cold biases in night-time satellite data appear to be real signals and merit further investigation.
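The validation statistics defined above (mean and median bias, SD, RSD as 1.48 times the MAD, R², and the share of cold outliers with ΔT < −1 °C) can be computed along the following lines. This is a minimal sketch with a hypothetical function name. The text does not say whether the SD uses the sample or population form, so the sample SD (`ddof=1`) is an assumption.

```python
import numpy as np

def matchup_stats(h8_sst, isar_sst):
    """Summary statistics of delta T = H8 minus ISAR for collocated SSTs."""
    h8 = np.asarray(h8_sst, dtype=float)
    isar = np.asarray(isar_sst, dtype=float)
    valid = np.isfinite(h8) & np.isfinite(isar)
    h8, isar = h8[valid], isar[valid]

    delta = h8 - isar
    median = np.median(delta)
    mad = np.median(np.abs(delta - median))  # median absolute difference
    r = np.corrcoef(h8, isar)[0, 1]

    return {
        "n": int(delta.size),
        "mean": float(delta.mean()),
        "median": float(median),
        "sd": float(delta.std(ddof=1)),      # assumed sample SD
        "rsd": float(1.48 * mad),
        "r2": float(r ** 2),
        "cold_outlier_pct": float(100.0 * np.mean(delta < -1.0)),
    }

# Five hypothetical matchups, one of which is a cold outlier.
stats = matchup_stats(
    h8_sst=[19.8, 20.9, 22.0, 23.1, 22.5],
    isar_sst=[20.0, 21.0, 22.0, 23.0, 24.0],
)
```

Because the RSD is built from medians, the single cold outlier in this example inflates the SD far more than the RSD, which is why the two are reported side by side.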
The distributions of daytime and night-time ΔTs (H8-ISAR) are shown in Figure 5. Most daytime ΔTs sit within the −0.2 °C to −0.1 °C range, corresponding to the negative mean bias (−0.12 °C), whereas the peak for the night-time ΔT distribution is seen at the 0–0.1 °C bin. As expected, daytime ΔTs have more outliers in both cold and warm tails than night-time data, hence the larger SD as observed in Table 2. There are few night-time ΔTs > 1 °C outliers but the daytime population at this warm tail is still noticeably large.
Density distributions of daytime and night-time ΔTs (H8-ISAR) are further analysed against ISAR SST conditions (Figure 6). For both daytime (Figure 6a) and night-time (Figure 6b), most of the matchups are obtained under SST conditions between 15 °C and 32 °C, especially within the 25 ± 2 °C and 30 ± 2 °C ranges. More collocations for SST < 15 °C conditions are observed at night-time than in the daytime. There are significantly more warm outliers (ΔT > 1 °C) in daytime than in the night-time, as also shown in Figure 5.

Validation under Different Environmental Conditions
The performance of the Bureau fv02 H8 L2P SST data is further evaluated under different environmental conditions, which are determined by the observations obtained simultaneously with the ISAR SSTs from RV Investigator.

Performance against Wind Speed
Since the AHI sensor onboard H8 samples SST at the same depth as the ISAR, we should in theory be able to compare these measurements directly under any wind conditions. Many satellite SST validation studies against depth SSTs exclude calm wind conditions (i.e., U10 < 2 m s⁻¹ at night or < 6 m s⁻¹ in the day) to avoid DV contamination, which is not necessary here. Figure 7a shows that under most wind conditions (U10 < 10 m s⁻¹), daytime mean ΔTs are consistently below the zero line, at −0.2 °C to −0.1 °C. There is no clear dependency of daytime ΔT on wind speed, which is particularly encouraging to see under calm conditions (0 < U10 < 6 m s⁻¹) when a DV may develop. When U10 is >10 m s⁻¹, the number of collocations reduces sharply, leading to the fluctuation in the mean biases. In Figure 7b, we see that the night-time ΔTs follow the zero line more closely throughout the wind speed range (0 < U10 < 15 m s⁻¹), indicating the good performance of the satellite product in the night-time when compared against ISAR skin SSTs.
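Dependence analyses of this kind rest on statistics of ΔT computed within bins of an environmental variable. A minimal sketch of that binning is given below; the same routine would apply to SST, local solar time, or relative humidity. The function name and the 1 m s⁻¹ bin width in the example are assumptions, since the bin widths are not stated in the text.

```python
import numpy as np

def binned_bias(x, delta_t, edges):
    """Per-bin (lower edge, upper edge, count, mean, SD) of delta T.

    Bin i covers edges[i] <= x < edges[i + 1]; values outside are ignored.
    """
    x = np.asarray(x, dtype=float)
    delta_t = np.asarray(delta_t, dtype=float)
    edges = np.asarray(edges, dtype=float)
    idx = np.digitize(x, edges) - 1

    rows = []
    for i in range(len(edges) - 1):
        sel = (idx == i) & np.isfinite(delta_t)
        n = int(sel.sum())
        mean = float(delta_t[sel].mean()) if n else float("nan")
        sd = float(delta_t[sel].std(ddof=1)) if n > 1 else float("nan")
        rows.append((float(edges[i]), float(edges[i + 1]), n, mean, sd))
    return rows

# Hypothetical wind speeds (m/s) and delta T values, binned in 1 m/s steps.
rows = binned_bias(
    x=[0.5, 1.5, 1.7, 3.2, 14.0],
    delta_t=[-0.1, -0.2, -0.4, 0.0, 0.3],
    edges=[0.0, 1.0, 2.0, 3.0, 4.0],
)
```

Reporting the count alongside each mean makes it easy to see where fluctuations in the bias simply reflect sparsely populated bins.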

Performance against SST
The dependency of ΔTs on different SST conditions is shown in Figure 8. For both daytime and night-time, as most of the collocations are obtained around Australian coasts, temperatures colder than 18 °C are scarce, hence the jumps in biases and SDs for those bins (most obvious at night-time in Figure 8b). For warmer waters (ISAR SST > 20 °C), daytime satellite retrievals underestimate the temperatures most of the time, with a deep trough at around 28 °C (ΔT being ~−0.35 °C). When SST exceeds 28 °C, the daytime underestimation eases and the mean bias approaches, or even slightly exceeds, 0 °C. Two large positive ΔTs are observed when SST is >32 °C. However, the collocation numbers are small and the SDs are very large (Figure 8a). In the night-time, the mean ΔTs are much closer to 0 °C and the behaviour is consistent throughout the SST range (0–33 °C) with only minor fluctuations (Figure 8b).

Performance against Local Solar Time
Combining the daytime and night-time data, we now validate H8 SSTs against local solar hours. In Figure 9a, most of the mean biases, ΔTs, are slightly negative (−0.05 °C to −0.20 °C) in the daytime but are much closer to 0 °C during twilight and night-time. The underestimation of H8 SSTs exists consistently in the daytime, with a noticeable dip in mean ΔT to around −0.22 °C (Figure 9a). Given that the shortwave radiation is typically highest around 12 h LST, it is natural to ask whether there is any correlation between ΔT and the shortwave flux. We plot the daytime ΔTs against solar shortwave radiation in Figure 9b. When the matchups are abundant (shortwave radiation < 1100 W m⁻²), there is no discernible dependency of ΔT on shortwave radiation.



Performance against Relative Humidity
Water vapor in the atmosphere is known to pose challenges for accurate satellite SST retrieval in the IR spectral band [35]. The dependence of ΔT on relative humidity (RH) is examined to evaluate the performance of Bureau fv02 H8 L2P SST retrievals. As shown in both panels in Figure 10, the most common RH values range from 40% to 90%, peaking at 70–75% in the daytime (Figure 10a) and at 75–80% in the night-time (Figure 10b). In the daytime, when RH is <60%, H8 tends to underestimate water temperatures, with biases of −0.25 °C to −0.05 °C. When RH is >60%, the absolute values of the negative biases decrease, and the matchup sizes are also significantly larger (Figure 10a). During the night, most of the mean ΔTs are very close to the zero line (Figure 10b). The jumps at both ends should again be due to the small matchup sizes.


Discussion
The overall statistics for the Bureau H8 fv02 L2P SST product when compared against ISAR SSTs are comparable to those of previous studies. In this study, with 16,418 collocations, the daytime mean and median biases are both −0.12 °C, with an SD/RSD of 0.47 °C/0.31 °C. Night-time statistics are better, with a matchup size of 15,453, a mean/median bias of −0.04 °C/0.03 °C, and an SD/RSD of 0.37 °C/0.24 °C. For example, mixing daytime and night-time data together, Ref. [5] found a bias of −0.16 °C with a root-mean-square error of 0.59 °C when comparing JAXA H8 SSTs against drifting buoys and tropical moorings for June to September 2015. In addition, Ref. [8] validated the NOAA ACSPO H8 SSTs against tropical moorings and buoys in the Australian Great Barrier Reef region for August to October 2015, and found that H8 SST has a mean bias of 0.18 °C with an SD of 0.53 °C for both daytime and night-time data. Importantly, note that these two studies evaluate H8 data against depth in situ SSTs.
A more direct and fair comparison can be made between this study and [11], which used the same source of ISAR and SBE38 depth SSTs to validate the JAXA H8 SST data set for 2016 and 2017. With 2701 matchups (without differentiating daytime and night-time), Ref. [11] found a mean bias of 0.09 °C with an SD of 0.30 °C between H8 and ISAR SSTs. Their SD is smaller than both the daytime (SD = 0.47 °C) and night-time (SD = 0.37 °C) SDs in this study. They also validated daytime and night-time H8 data separately in a case study using ISAR data from only one cruise (IN2016_V05). To make the comparison most relevant, we conducted the same case study, also using only voyage IN2016_V05, and the results are shown in Table 3. Ref. [11] found that, on average, daytime JAXA H8 SSTs are 0.13 °C warmer than the ISAR skin SST, and that the night-time mean bias is 0.00 °C (their Table 3). Instead of an overestimation, in this study we found an underestimation of SST in both daytime and night-time Bureau H8 SST data. This is consistent with the negative biases found for 2016 in Table 2 and Figure 4 of this study. Their daytime SD/RSD is 0.27 °C/0.26 °C and their night-time SD/RSD is 0.25 °C/0.21 °C, all smaller than those in this study. There are several reasons for such differences. First, the two agencies adopt different satellite SST retrieval algorithms. Different channels are used to retrieve SST at JAXA (10.4, 11.2, and 8.6 µm) and at the Bureau (0.64, 0.86, 10.4, and 12.4 µm for daytime; 3.9, 10.4, and 12.4 µm for night-time). The radiative transfer model (RTM) versions (JAXA employing RTTOV 10.2 and the Bureau using RTTOV 12.3), NWP inputs, and cloud masking methods all differ between JAXA and the Bureau. Second, the spatial windows used during collocation are different: we use a 5 km threshold, whereas they used 0.02°.
Third, Ref. [11] required that all pixels within a 7 by 7 pixel window centred on a target QL = 5 satellite pixel should also be of QL = 5, a requirement that is not adopted in this study. The 7 by 7 pixel window in [11] would have significantly reduced the number of matchups with cloud-affected (and therefore colder) H8 SST retrievals. Although the QC applied to the ISAR SST data is the same in both studies, no exclusion of high-wind conditions (U10 > 15 m s⁻¹) was performed in [11]. In addition, the matchup sizes in the two studies are different, with our collocation sizes being significantly larger than those in [11]. Finally, there might be more cold outliers in the Bureau fv02 H8 L2P SST data set for 2016, as seen in Figure 4, which warrants further investigation.
In addition to the environmental conditions listed in the Results section (wind speed, SST, LST, RH), we have also examined the performance of the satellite data against air-sea temperature differences (∆T_air-sea; ISAR SSTs are used here to represent sea surface temperatures). Although it has been shown that a small ∆T_air-sea affects the physical satellite SST retrieval accuracy only marginally, a large ∆T_air-sea could still potentially have a significant effect, especially in coastal regions where greater extremes may exist [36,37]. However, in this study, no noticeable dependency of ∆T on ∆T_air-sea is observed, as most mean ∆T values fall just below the 0 °C line.
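To make the difference between the two spatial collocation criteria discussed above more concrete (a 5 km threshold in this study versus 0.02° in [11]), the following sketch compares them for a single satellite pixel and ship position. It rests on two assumptions that are not confirmed by the text: the 5 km threshold is treated as a great-circle distance, and the 0.02° criterion is treated as a box in latitude and longitude. All function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_km(lat_s, lon_s, lat_i, lon_i, max_km=5.0):
    """Distance-based criterion (assumed form of the 5 km threshold)."""
    return haversine_km(lat_s, lon_s, lat_i, lon_i) <= max_km

def within_deg(lat_s, lon_s, lat_i, lon_i, max_deg=0.02):
    """Angular box criterion (assumed form of the 0.02 degree threshold)."""
    return abs(lat_s - lat_i) <= max_deg and abs(lon_s - lon_i) <= max_deg

# Hypothetical pixel and ship positions, 0.03 degrees apart in latitude.
pixel = (-20.00, 150.00)
ship = (-20.03, 150.00)
accepted_5km = within_km(*pixel, *ship)
accepted_002deg = within_deg(*pixel, *ship)
```

Since 0.02° of latitude corresponds to roughly 2.2 km, a pair separated by 0.03° passes the distance test but fails the angular one under these assumptions. The tighter window admits fewer matchups, which is in line with the smaller matchup size reported in [11].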

Conclusions
Five years (2016-2020) of the Bureau of Meteorology's physically retrieved Himawari-8 SST L2P product have been evaluated, for daytime and night-time separately, against ISAR skin SST measurements obtained from RV Investigator. The cruise tracks are mostly along Australian coasts. After quality control on all SST (ISAR, H8, and SBE38 depth) measurements, a large matchup database is generated, with 16,418 collocations for the daytime and 15,453 for the night-time. A case study is carried out first, in which the Bureau H8 L2P SST product captured a strong DV event that occurred on 12 October 2016 over the north-eastern Australian coast. Statistically, for the daytime, an overall negative mean bias (−0.12 °C) is observed between the satellite and ISAR SSTs, with an SD/RSD of 0.47 °C/0.31 °C. Night-time statistics are better, with a mean/median bias of −0.04 °C/0.03 °C and an SD/RSD of 0.37 °C/0.24 °C. These results are comparable to previous studies. Over the study period (2016-2020), daytime H8 SST data underestimate the temperatures by more than one tenth of a degree Celsius in most years. In contrast, the night-time mean ∆T is close to 0 °C for the whole period except for 2016, when a negative mean/median bias of −0.16 °C/−0.06 °C is observed due to the abnormally large number of cold outliers (Figure 4).
When validated under different environmental conditions, daytime and night-time H8 SST data behave slightly differently. A cold bias can be seen during daytime under almost all environmental conditions, including for most values of wind speed (U10 < 15 m s⁻¹; Figure 7a), SST (ISAR SST < 30 °C; Figure 8a), and relative humidity (Figure 10a). On the other hand, the performance of the night-time H8 SST product is more stable under nearly all meteorological conditions, with the mean bias usually very close to 0 °C.
Overall, this study shows that the ISAR skin SSTs produced by the Bureau and CSIRO are of good quality and can be confidently used as a fiducial source of 'ground truth' for satellite IR SST retrieval validations. The Bureau fv02 H8 L2P SST product displays high sensitivity to skin temperature variation and shows good overall performance, especially during night-time, under different environmental conditions. Nonetheless, the daytime underestimation of SSTs found in this product merits further investigation. The unusually large number of cold outliers during night-time in 2016 is another interesting topic for a future study. Identifying the reasons could help to improve the retrieval process.

Data Availability Statement: The fv02 H8 L2P data are available from the National Computational Infrastructure (Bureau of Meteorology, 2022). The ISAR, SBE38, and meteorological data were collected and processed by the Australian Marine National Facility and the Bureau, and sourced from the IMOS ISAR_QC Thredds directories from the Australian Ocean Data Network (http://thredds.aodn.org.au/thredds/catalog/IMOS/SOOP/SOOP-ASF/VLMJ_Investigator/meteorological_sst_observations/YYYY/ISAR_QC/catalog.html, where "YYYY" is "2016" through to "2020", accessed on 20 April 2023).