A 10-year Metocean dataset for Laguna Madre, Texas, including for the Study of Extreme Cold Events

Coastal observations along the Texas coast are valuable for many stakeholders in diverse domains. However, the management of the collected data has been limited, creating gaps in hydrological and atmospheric datasets. Among these, water and air temperature measurements are particularly crucial for water temperature predictions, especially during freeze events. These events can pose a serious threat to endangered sea turtles and economically valuable fish, which can succumb to hypothermic stunning, making them vulnerable to cold-related illness or death. Reliable and complete water and air temperature measurements are needed to accurately predict when cold-stunning events will occur. To address these concerns, this paper describes the method used to create a complete 10-year dataset that is representative of the Upper Laguna Madre, TX, using multiple stations and various gap-filling methods. The raw datasets consist of a decade's worth of air and water temperature measurements within the Upper Laguna Madre from 2012 to 2022, extracted from the archives of the Texas Coastal Ocean Observation Network and the National Park Service. Large portions of data from the multiple stations were missing from the raw datasets; therefore, a systematic gap-filling approach was designed and applied to create a near-continuous dataset. The proposed imputation method consists of three steps: a short-gap interpolation method, followed by a long gap-filling process using nearby stations, and finally a second short-gap interpolation pass. This approach was evaluated by creating random artificial gaps within the original datasets, filling them using the proposed method, and assessing the results using various performance metrics. The evaluation results help to ensure the reliability of the newly imputed dataset and the effectiveness of the data imputation method.
The newly created dataset is a valuable resource that transcends the local cold-stunning issue, offering viable utility for analyzing temporal variability of air and water temperatures, exploring temperature interdependencies, reducing forecasting uncertainties, and refining natural resource and weather advisory decision-making processes. The cleaned dataset with minimal gaps (<2%) is ready and convenient for artificial intelligence and machine learning applications.


(5) enhance water and natural resource and risk management decisions during freeze or drought events.
• The most significant contribution of this paper is the creation of a complete 10-year time-series dataset. A minimal-gap (<2%) dataset is highly valuable for the calibration of Artificial Intelligence (AI) models.
• This dataset can be valuable to data scientists, natural and water resource managers, climate scientists, forecasters, and others who are in need of reliable air and water temperature data.
• The imputed dataset provides reliable air and water temperature information in one of the most important development areas for juvenile endangered green sea turtles in the western Gulf of Mexico.

Data Description
The dataset presented in this article is representative of hydrological and atmospheric conditions within the Laguna Madre, TX, a shallow estuarine system located in southern Texas. Water temperatures can change very rapidly in the Laguna Madre because of the cooling air temperatures brought in by cold fronts but also because of the hydrodynamics of the Laguna Madre itself (e.g., wind-driven and well-mixed, shallow, restricted flow from the Gulf of Mexico [GoM]). Given the climatic conditions of the area, the lagoon system is sometimes susceptible to freezing air temperatures when cold fronts travel toward the coast during the cold season, impacting water temperatures [9]. Climatic and oceanic factors such as air temperature, sea surface temperature, barometric pressure, wind direction, and wind speed influence cold-stunning events along the Texas coast [9]. However, Tissot et al. showed that air temperature was by far the main forcing on water temperatures in the Laguna Madre (with the exception of waters by deep draft ship channels, e.g., Brownsville ship channel) [10]. Cold fronts can lower air temperatures by more than 10 °C in less than 24 hours [9], significantly decreasing water temperature in the Laguna Madre [10]. These conditions can cause threatened green sea turtles and other marine life to become "cold-stunned," no longer capable of moving or protecting themselves.
The dataset described in this article consists of 10 years of air and water temperature measurements from 2012 to 2022 extracted from the Texas Coastal Ocean Observation Network (TCOON) [7], initially used to forecast water temperatures in the area of interest. TCOON has been noted as a valuable hydrological/environmental data retrieval tool since 1991 for the state of Texas, collecting water level, wind speed, barometric pressure, salinity, water quality, and other environmental data at several locations along the Texas coast [8]. TCOON has been utilized by the National Oceanic and Atmospheric Administration (NOAA), US Army Corps of Engineers (USACE), and the Conrad Blucher Institute (CBI) for many applications, resulting in many benefits to the agencies (e.g., Texas General Land Office, Texas Water Development Board) and communities that each TCOON station serves. However, the maintenance of TCOON was temporarily halted starting in 2014 for one or more years, depending on location, before data collection resumed. The 2014 halt, occasional extreme events, data transmission problems, and the harshness of the coastal environment reduced data quality, leading to large gaps of missing data and at times erroneous data. This reduction in data quality along the Texas coast has limited the usability and reliability of the data for a diverse set of users. This paper focuses on enhancing the usability of air temperature (ATP) and water temperature (WTP) data acquired from TCOON by combining statistical processing with data from highly correlated locations (depending on the variable and location; Table 1). The goal is to improve its applicability for diverse analysis and forecasting models, aiming to restore its value in scientific research, analysis, and various management decision-making processes.

Study location - Laguna Madre estuarine system
The Laguna Madre is characterized as a shallow (1.2 m [1]) estuarine system that is divided into two sections: the upper and lower Laguna Madre. Both sections cover approximately 1133 km² [5], separated by an extensive area of wind tidal flats and hydrologically connected by the Gulf Intracoastal Waterway (GIWW), also known as the "Land Cut". The estuarine system has highly restricted flows in and out of the GoM, with only three outlets that allow for water transfer from the bay to the Gulf: Brazos Santiago Pass, Mansfield Channel, and Packery Channel [9]. Both sections of Laguna Madre also have minimal freshwater inflow, historically often expressing a negative freshwater inflow balance [12]. Because of this, the system is known to be one of the six most hypersaline lagoons in the world, with salinity levels ranging from 26 to 50 g/kg depending on local rainfall [9,12]. During the passage of cold fronts, water temperatures in Laguna Madre are driven by generally homogeneous air temperatures brought in by cold fronts and can be considered homogeneous as well [9]. Despite these harsh saline conditions and occasional extreme cold events, the Laguna Madre is an extremely productive bay system, home to numerous commercially and ecologically valuable marine species. There are approximately 9 present and historical TCOON and National Park Service (NPS) stations placed within the Laguna Madre system (Fig. 1).

Data acquisition
Hourly air and water temperature time-series data from TCOON and NPS stations within the upper Laguna Madre were acquired (lighthouse.tamucc.edu). The selected locations are the South Bird Island, Packery Channel, Baffin Bay, and NPS-South Bird Island [NPS-SBI] stations. The data acquired from the multiple stations were analyzed to assess the variability and heterogeneity of water and air temperatures between each station in order to understand the range of suitability of the nearby stations for potential data imputation.

Percentage of missing data
The unprocessed 2010-2022 air and water temperature dataset from all stations contained substantial proportions of missing data (Table 2). Within the initially acquired data, records prior to 2012 were more than 90% missing and were therefore excluded.
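The per-year missing percentages reported above can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `missing_percent_by_year` is hypothetical, and it assumes the series has already been placed on a complete hourly grid with `NaN` marking missing records.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def missing_percent_by_year(timestamps, values):
    """Return {year: percentage of NaN values} for a gridded hourly series."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for stamp, value in zip(timestamps, values):
        total[stamp.year] += 1
        if math.isnan(value):
            missing[stamp.year] += 1
    return {year: 100.0 * missing[year] / total[year] for year in sorted(total)}


# Toy example (made-up values): 10 hourly records across a year boundary.
start = datetime(2012, 12, 31, 19)
timestamps = [start + timedelta(hours=h) for h in range(10)]
values = [15.0, math.nan, 14.6, 14.5, math.nan,
          14.1, 14.0, math.nan, 13.8, 13.7]
percent_missing = missing_percent_by_year(timestamps, values)
```

A year would then be retained or excluded by comparing its percentage against the chosen threshold.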

Experimental design
The primary objective is to create a dataset that is representative of the upper Laguna Madre with minimal gaps (<2%) for each year within the time series. To this end, Pearson correlations were computed between every pair of candidate stations. Each station combination had correlation values higher than 99% for both air and water temperatures (Table 1), which justifies the use of the selected stations in the proposed data imputation methods. After the imputation methods were applied and the final missing percentages were computed for each combination, the imputed dataset with the lowest percentage of missing data was selected for each of the two variables. All imputation and evaluation methods were implemented in the Python programming language.
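As an illustration of the station screening step, the sketch below computes a Pearson correlation (expressed in %) between two stations. The function name `pearson_percent` and the choice to use only hours where both stations report a value are assumptions; the source does not state how missing records were handled when computing Table 1.

```python
import math


def pearson_percent(a, b):
    """Pearson correlation in %, over hours where both series have data."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2:
        return math.nan
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    if var_a == 0 or var_b == 0:
        return math.nan  # correlation is undefined for a constant series
    return 100.0 * cov / math.sqrt(var_a * var_b)


# Toy example (made-up values): the nearby station reads 0.5 °C warmer.
main_station = [20.0, 19.0, math.nan, 17.0, 16.0]
nearby_station = [20.5, 19.5, 18.5, math.nan, 16.5]
r_percent = pearson_percent(main_station, nearby_station)
```

A constant offset between stations does not lower the correlation, which is why the long-gap method additionally applies a bias adjustment.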

Gap-filling methods
Two different processes were used to gap-fill missing data within the 2012-2022 air and water temperature dataset, depending on the length of the gap. The gaps were therefore classified as short or long. Short and long gaps for missing air and water temperatures were defined by the dynamics of the local physical conditions of the Laguna Madre system. Short gaps were defined as gaps less than or equal to 3 hours for air temperature and 5 hours for water temperature. Any gaps longer than these limits were defined as long gaps.
Short-Gap Interpolation Method: Short gaps were filled by linear interpolation. To interpolate a short gap, the average of the last three measurements before the gap and the average of the first three measurements after the gap were computed. The two computed averages were used as the first and the last interpolated values within the gap. Using these three-value averages, rather than the single measurements immediately before and after the gap, added robustness to the interpolation approach (Fig. 2).
Although this approach is viable for a majority of the dataset, it was not found to be suitable for extreme cold events, where water and air temperatures drop significantly and very rapidly [9]. Studies show that air temperatures in the area can drop by more than 10 °C in less than 24 hours [2,9]. To ensure that the proposed approach would not fail in these scenarios, it was applied only when the following conditions were met: (1) the range of the three values before the beginning of the gap and the range of the three values after the end of the gap are each smaller than 1.5 °C; (2) the absolute difference between the mean values before and after the gap is smaller than 1.5 °C. If these conditions were not met, the short gaps were not filled with the proposed method.
Long-Gap Imputation Method: Once short-gap interpolation had been applied to all selected stations, long-gap imputation was implemented based on all combinations above (Table 2). The stations where gap-filling was applied are referred to as the main stations, and the stations used to fill the gaps are referred to as the nearby stations. Some of the long gaps in the dataset extended over multiple days, making linear interpolation unreliable in these cases. Long gaps in the main datasets were thus filled with the measurements of the selected nearby stations after a linear adjustment for the bias between stations at the start and end of each gap.
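The short/long classification above can be sketched as follows. The thresholds (3 hours for ATP, 5 hours for WTP) come from the text; the function names and the convention that one list element equals one hourly record, with `NaN` for missing values, are assumptions made for illustration.

```python
import math

# Maximum short-gap lengths in hours, as stated in the text.
SHORT_GAP_MAX_HOURS = {"ATP": 3, "WTP": 5}


def find_gaps(values):
    """Return (start_index, length) for each run of consecutive NaNs."""
    gaps, start = [], None
    for i, value in enumerate(values):
        if math.isnan(value):
            if start is None:
                start = i
        elif start is not None:
            gaps.append((start, i - start))
            start = None
    if start is not None:  # series ends inside a gap
        gaps.append((start, len(values) - start))
    return gaps


def classify_gaps(values, variable):
    """Label each gap 'short' or 'long' for the given variable."""
    limit = SHORT_GAP_MAX_HOURS[variable]
    return [(start, length, "short" if length <= limit else "long")
            for start, length in find_gaps(values)]


# Toy hourly series (made-up values) with a 3-hour and a 4-hour gap.
nan = math.nan
series = [20.0, nan, nan, nan, 19.0, nan, nan, nan, nan, 18.0, 17.0]
atp_gaps = classify_gaps(series, "ATP")
wtp_gaps = classify_gaps(series, "WTP")
```

The same 4-hour gap is long for air temperature but short for water temperature, reflecting the slower response of the water column.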
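A minimal sketch of the short-gap interpolation with its two stability conditions is given below. It is one reading of the description, not the authors' implementation: the function name `fill_short_gap` is hypothetical, and the two three-value averages are assumed to act as anchors at the observations adjacent to the gap, with values interpolated linearly between them.

```python
import math

MAX_SPREAD_C = 1.5  # threshold in °C from the two conditions in the text


def fill_short_gap(values, start, length):
    """Fill values[start:start+length] in place; return True if filled."""
    end = start + length  # index of the first observation after the gap
    before = values[max(0, start - 3):start]
    after = values[end:end + 3]
    # Three valid observations are required on each side of the gap.
    if len(before) < 3 or len(after) < 3:
        return False
    if any(math.isnan(v) for v in before + after):
        return False
    mean_before = sum(before) / 3
    mean_after = sum(after) / 3
    # Condition (1): both three-value windows are stable.
    stable = (max(before) - min(before) < MAX_SPREAD_C
              and max(after) - min(after) < MAX_SPREAD_C)
    # Condition (2): the two window means are close to each other.
    close = abs(mean_before - mean_after) < MAX_SPREAD_C
    if not (stable and close):
        return False  # likely a rapid temperature change; leave the gap
    # Assumed anchoring: means sit at indices start-1 and end.
    step = (mean_after - mean_before) / (length + 1)
    for k in range(length):
        values[start + k] = mean_before + step * (k + 1)
    return True


nan = math.nan
# Calm conditions (made-up values): the 3-hour gap is filled.
calm = [20.0, 20.2, 20.1, nan, nan, nan, 19.9, 19.8, 20.0]
filled_calm = fill_short_gap(calm, start=3, length=3)
# Cold-front passage (made-up values): the gap is left unfilled.
front = [20.0, 19.0, 18.0, nan, nan, 14.0, 13.0, 12.5]
filled_front = fill_short_gap(front, start=3, length=2)
```

Gaps rejected here remain missing and can still be handled by the long-gap step if a nearby station has data.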
To apply the linear adjustment used for the long gap-filling process, the averages of the last three measurements before the start of the gap and the first three measurements after the end of the gap were computed for both the main and nearby stations. The difference between the average measurement of the main station and that of the nearby station was then calculated on each side of the gap. The correction value was obtained by averaging these before-gap and after-gap differences. This correction value was added to the nearby station measurement to obtain the value used to fill the missing measurement. As with the short-gap interpolation approach, the long-gap imputation approach worked for most cases; however, the method failed when sudden changes in temperature occurred. To ensure that the proposed approach would not fail in these scenarios, it was applied only when the following condition was met: the difference between the average of the three values for both the original and nearby station before the beginning and after the end of the gap was smaller than 1.5 °C. The linear interpolation method used for short gaps was then applied again after long gaps were filled to account for new short gaps that formed during data substitution.

Table 3
Percentage of missing values before and after the long gap-filling method has been applied for all station combinations. The main stations are listed first, and the nearby station used for gap-filling follows the dash (-).

Final missing data percentages after data imputation methods: Once the data imputation methods were complete, the final percentages of missing air temperature data showed that using the Packery Channel dataset as the main station in combination with Baffin Bay resulted in the lowest percentage of missing air temperature data (Table 3). Likewise, using the NPS-SBI dataset as the main station in combination with the South Bird Island station resulted in the lowest percentage of missing water temperature data (Table 3). Thus, the final dataset consists of air and water temperature measurements from the Packery Channel and NPS-SBI stations, respectively. The Packery Channel measurements were filled using Baffin Bay, whereas the NPS-SBI measurements were filled using the SBI station. The dataset included only the years that contained less than 2% of missing data after completion of the gap-filling process. All years from 2012 to 2022 contained less than 1.2% of missing data with the exception of 2021 (15.5% of missing data; Table 3), which was therefore excluded from the final dataset.
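The long-gap linear adjustment described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `fill_long_gap` is hypothetical, and the 1.5 °C condition is read here as requiring the main-minus-nearby difference of the three-value averages to be below 1.5 °C in absolute value both before and after the gap.

```python
import math

MAX_OFFSET_C = 1.5  # threshold in °C from the text


def _mean3(window):
    """Mean of a three-value window, or None if it is incomplete."""
    if len(window) < 3 or any(math.isnan(v) for v in window):
        return None
    return sum(window) / 3


def fill_long_gap(main, nearby, start, length):
    """Fill main[start:start+length] from nearby plus a bias offset."""
    end = start + length
    main_before = _mean3(main[max(0, start - 3):start])
    nearby_before = _mean3(nearby[max(0, start - 3):start])
    main_after = _mean3(main[end:end + 3])
    nearby_after = _mean3(nearby[end:end + 3])
    if None in (main_before, nearby_before, main_after, nearby_after):
        return False
    diff_before = main_before - nearby_before
    diff_after = main_after - nearby_after
    # Assumed reading of the condition: stations agree on both sides.
    if abs(diff_before) >= MAX_OFFSET_C or abs(diff_after) >= MAX_OFFSET_C:
        return False
    offset = (diff_before + diff_after) / 2  # averaged bias correction
    for i in range(start, end):
        if not math.isnan(nearby[i]):  # nearby gaps stay missing
            main[i] = nearby[i] + offset
    return True


nan = math.nan
# Toy example (made-up values): a 6-hour gap at the main station.
main = [18.0, 18.2, 18.1] + [nan] * 6 + [16.0, 16.2, 16.1]
nearby = [18.5, 18.7, 18.6, 18.4, 18.0, 17.5,
          17.2, 16.9, 16.8, 16.7, 16.9, 16.8]
filled = fill_long_gap(main, nearby, start=3, length=6)
```

Hours where the nearby station is also missing remain unfilled, which is consistent with the need for a second short-gap interpolation pass afterwards.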

Table 4 shows the significant effect that the gap-filling approach had on the original datasets. In the original datasets, only 5 years had both water and air temperature data with less than 2% of missing values. After the imputation methods were applied, all years except 2021 met this threshold, resulting in 10 years of data with less than 2% missing data (Table 4). This is a significant improvement for water and air temperature data within the Laguna Madre and is extremely valuable for artificial intelligence (AI) and machine learning (ML) modeling, particularly when continuous time-series inputs are necessary, such as for long short-term memory models [3], recurrent neural networks [6], and transformer architectures [11].

Evaluation of gap-filling method
In order to evaluate the proposed data imputation method, both the NPS-SBI and Packery Channel datasets were used. Random artificial gaps were created, representing up to 10% of the dataset size for each year. These gaps were then filled using the proposed methods and evaluated using various metrics (e.g., mean absolute error [MAE] (Eq. 4), root mean squared error [RMSE] (Eq. 3), maximum 10% error [ME10] (Eq. 6)) to determine the reliability and validity of the method. The short- and long-gap imputation methods were evaluated separately. For the short-gap interpolation evaluation, 3-hour gaps were created for ATP measurements and 5-hour gaps were created for WTP measurements. These gaps were placed randomly for each year and variable. Because this assessment used the maximum short-gap length for both ATP and WTP, interpolation errors for shorter gaps would be expected to be slightly lower. For the long-gap imputation evaluation, gaps ranging from 6 to 168 hours were randomly created, both in length and placement, within the WTP and ATP time series. This range is representative of 95% of the long gaps observed within the original dataset and was used to ensure a broad representation of the potential missing value scenarios. Both evaluation methods were applied thirty times in order to capture the variability of the observed errors (e.g., mean ± standard deviation), which were computed using the metrics noted above, where x_i is the observed value, x̂_i is the interpolated value, and n is the number of data points.
Results for the short-gap interpolation method for the 30 trials show that ATP MAE values (Eq. 4) for all years were below 0.50 °C, while the maximum 10% mean error (Max10%(MAE)) (Eq. 6) averaged 1.12 ± 0.03 °C (Table 5) for all years. WTP results for the short-gap interpolation evaluation were similar, with MAE values below 0.50 °C and Max10%(MAE) values no higher than 1.40 °C (Table 6) for the full WTP dataset.
Results for the long-gap imputation method for the 30 trial runs show that ATP MAE values averaged 0.87 ± 0.14 °C for the full ATP dataset (Table 7). Max10%(MAE) values averaged 2.63 ± 0.34 °C for all years (Table 7). WTP results for the long-gap imputation method showed MAE values that averaged 0.88 ± 0.69 °C for the full WTP dataset (Table 8). Max10%(MAE) averaged 2.99 ± 1.51 °C for all years (Table 8). These results justify the application of the proposed data imputation approach.
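The evaluation procedure can be sketched as below. The MAE and RMSE functions follow their standard definitions. The `max10_error` function is only an assumed reading of ME10 (the mean of the largest 10% of absolute errors), since the defining equation is not reproduced here. The helper `make_artificial_gaps` and its parameters are likewise hypothetical.

```python
import math
import random


def mae(observed, estimated):
    """Mean absolute error."""
    return sum(abs(o - e) for o, e in zip(observed, estimated)) / len(observed)


def rmse(observed, estimated):
    """Root mean squared error."""
    squared = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return math.sqrt(squared / len(observed))


def max10_error(observed, estimated):
    """Assumed ME10: mean of the largest 10% of absolute errors."""
    errors = sorted((abs(o - e) for o, e in zip(observed, estimated)),
                    reverse=True)
    k = max(1, (len(errors) + 9) // 10)  # integer ceiling of 10%
    return sum(errors[:k]) / k


def make_artificial_gaps(values, gap_length, fraction, rng):
    """Return a copy with random fixed-length NaN gaps and their indices."""
    gapped = list(values)
    target = int(fraction * len(values))  # at most this many masked points
    masked = set()
    attempts = 0
    while len(masked) + gap_length <= target and attempts < 10 * len(values):
        attempts += 1
        # Keep three points free at each end for the averaging windows.
        start = rng.randrange(3, len(values) - gap_length - 3)
        # Skip placements that touch or adjoin existing missing values.
        padded = range(start - 1, start + gap_length + 1)
        if any(math.isnan(gapped[i]) for i in padded):
            continue
        for i in range(start, start + gap_length):
            gapped[i] = math.nan
            masked.add(i)
    return gapped, sorted(masked)


# Toy example: a smooth synthetic series with 3-hour artificial gaps.
truth = [20.0 + math.sin(i / 10.0) for i in range(200)]
gapped, masked = make_artificial_gaps(truth, 3, 0.10, random.Random(0))

# Metrics on made-up numbers: errors are 0 (x8), 1 and 3.
obs = [10.0] * 10
est = [10.0] * 8 + [11.0, 13.0]
scores = (mae(obs, est), rmse(obs, est), max10_error(obs, est))
```

In a full evaluation, the masked positions would be filled by the imputation method, the metrics computed against the withheld values, and the procedure repeated (thirty times in the text) to report a mean and standard deviation.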

Table 8
Evaluation metrics for the long gap-filling approach for WTP measurements when using NPS-SBI as the main station and SBI as the adjacent station (i.e., mean ± standard deviation of 30 trial runs).

Limitations
One limitation of the proposed imputation method is the need for highly correlated nearby stations to apply the long gap-filling approach. If no nearby station exists, or if the nearby station data are not of good quality during the main station gaps, the long gap-filling approach cannot be applied. Another limitation is that the proposed gap-filling approach cannot be applied when the missing data correspond to extreme events.

Fig. 1. Map of water stations located in Laguna Madre, TX. Stations that were used for the imputation process are labeled in red, while the remaining stations that are not used for the newly gap-filled dataset are labeled in purple.

Fig. 2. (A) Normal interpolation method versus (B) interpolation method using short gap method with linear adjustment.

© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Specifications Table
Type of data: Table
Data collection: Air and water temperature data were acquired from the records of the Texas Coastal Ocean Observation Network (TCOON) and the National Park Service (NPS). All data collected by TCOON follow the National Ocean Service standards including instrumentation, data collection procedures, periodic inspections and maintenance, and metadata collection.
conrad-blucher-institute/LagunaMadreWaterAirTempDataCleaner (github.com)

Table 1
Pearson correlation coefficients (%) of (A) air and (B) water temperature measurements (°C) of various stations located in the Upper Laguna Madre, including South Bird Island (SBI), Packery Channel, Baffin Bay (BB), and National Park Service (NPS)-SBI stations.

Table 2
Percentages (%) of missing values for the original datasets of the South Bird Island (SBI), Packery Channel, Baffin Bay (BB), and National Park Service-South Bird Island (NPS-SBI) stations per year.

Table 4
Percentage (%) of missing values for (1) the original datasets before imputation methods were employed and (2) the final datasets after imputation methods were employed.

Table 5
Evaluation metrics for the short gap-filling approach for Packery Channel ATP measurements (i.e., mean ± standard deviation of 30 trial runs).

Table 6
Evaluation metrics for the short gap-filling approach for NPS-SBI WTP measurements (i.e., mean ± standard deviation of 30 trial runs).

Table 7
Evaluation metrics for the long gap-filling approach for Packery Channel ATP measurements when using Packery Channel as the main station and Baffin Bay as the adjacent station (i.e., mean ± standard deviation of 30 trial runs).