Assessment of a portable UV–Vis spectrophotometer's performance in remote areas: Stream water DOC, Fe content and spectral data

This paper presents data for assessing the performance of a portable UV-Vis spectrophotometer in predicting stream water DOC and Fe content. The dataset contains DOC and Fe concentrations determined by laboratory methods, in-situ and ex-situ spectral absorbances, and monitored environmental variables such as water depth, temperature, turbidity and voltage. Measurements in the Yli-Nuortti river (Cold station, Finland) were made during the hydrological year 2018-2019, and those in Krycklan (C4 and C5, Sweden) during the hydrological years 2016-2019. The data analyses were conducted with the ‘pls’ and ‘caret’ packages in R. The correlation coefficient (R), root-mean-square deviation (RMSD), standard deviation (STD) and bias were used to check the performance of the models. This dataset can be combined with datasets from other regions around the world to build more universal models. For discussion and more information on the dataset creation, please refer to the full-length article “Assessment of a portable UV–Vis spectrophotometer's performance for stream water DOC and Fe content monitoring in remote areas” [1].

© 2021 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Table
Subject: Environmental Science
Specific subject area: Water Science and Technology
Type of data

Value of the Data
• The data can be used to build accurate and unbiased models for DOC prediction across multiple watersheds in Northern Fennoscandia, and these models could be extrapolated from one watershed to another even without site-specific calibration for DOC.
• Scientific guidance could be provided to the water industry and hydrological researchers on applying portable UV-Vis spectrophotometers for different purposes.
• When similar research is conducted in other regions of the world in the future, these data can be combined with the new data to test the generality of the proposed models for DOC prediction.

Data Description
The development of continuously operating water quality sensors has led to a transition from studying long-term trends and seasonal patterns to the investigation of highly dynamic phenomena, such as storm events and diurnal patterns, using high-frequency in situ measurements [3]. With the currently available technology and decreasing costs, in situ sensors are more frequently used for monitoring, especially in remote areas [4][5][6]. Although large amounts of data present challenges regarding storage, processing, and analysis [7], long-term monitoring datasets provide an opportunity for detailed investigations of hydrological and biogeochemical processes in dynamic systems [6, 8, 9]. This paper presents data for the assessment of a portable UV-Vis spectrophotometer's performance in predicting stream water DOC and Fe content in remote areas. Fig. 1 shows the site locations in Finland [2]. The comparison of in-situ and ex-situ spectral absorbance shows the performance of the in-situ S::CAN (Fig. 3). The details of the data sets used for modelling are listed in Table 1. The performance of the DOC and Fe prediction models is shown in Tables 2-4 (for DOC) and Tables 5-7 (for Fe), respectively. Raw data for each step of the analysis are recorded in four Excel files, which are available at the direct URL (https://data.mendeley.com/datasets/f67dw4hccv/1) [2]. Excel1 (In-situ & ex-situ absorbance) includes spectral absorbances measured by two methods (S::CAN and UV-1800) and

Table 1
List of 5 data sets, the training and testing set of each data set for modelling

Data set | Training set | Testing set
1 (C4, C5 & Cold station) | 75% of observations randomly selected from data set 1 | The remaining 25% of the observations
2 (C4, C5 & Cold station) | The observations from C4 and C5 | The observations from Cold station
3 (C4 & C5) | 75% of observations randomly selected from data set 3 | The remaining 25% of the observations
4 (C4 & Cold station) | 75% of observations randomly selected from data set 4 | The remaining 25% of the observations
5 (C5 & Cold station) | 75% of observations randomly selected from data set 5 | The remaining 25% of the observations
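The two kinds of split listed in Table 1 (a random 75%/25% split and a split that holds out one site) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names, the record layout and the random seed are assumptions made for illustration, and only the site labels and the 75% fraction come from the table.

```python
import random


def random_split(observations, train_fraction=0.75, seed=0):
    """Shuffle a copy of the observations and cut it at train_fraction."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(observations)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def site_split(observations, test_site):
    """Hold out every observation from one site (as in data set 2)."""
    train = [o for o in observations if o["site"] != test_site]
    test = [o for o in observations if o["site"] == test_site]
    return train, test


# Hypothetical records: 4 observations per site, 12 in total.
records = [{"site": s, "id": i} for i, s in enumerate(["C4", "C5", "Cold"] * 4)]

train, test = random_split(records)                   # data sets 1, 3, 4, 5
site_train, site_test = site_split(records, "Cold")   # data set 2
```

For data sets 3 to 5, the same random split would simply be applied after restricting the records to the two sites in question.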

Sampling and filtration
Before sample collection, the sampling bottles and reagent containers were cleaned in a Deko-2000 washer with detergent and soaked for at least 24 h in 2% HNO3, then rinsed six times with Milli-Q water. Glassware was additionally pre-combusted for 4 h at 450 °C before use.
At Cold station, water was sampled monthly in winter and fall, once a fortnight in spring, and every week in summer. In Krycklan, sampling was done monthly during winter, once a fortnight during summer and fall, and every third day during the spring flood. The water samples were filtered through a filtration assembly with Whatman GF/F Glass Microfiber Filters (pore size 0.45 μm). To precondition the filtration system and avoid contamination from the filter, 30 ml of sample water was filtered and then discarded. As the sites are located in remote areas, samples for absorbance measurements were preserved using ZnCl2 and then stored at 4 °C until laboratory analysis. Samples for DOC and Fe measurements were frozen until further analysis.

Laboratory measurements of spectral absorbance, DOC and Fe
After sample collection and preparation, spectral absorbance was measured with a laboratory benchtop spectrophotometer (UV-1800, Shimadzu, Kyoto, Japan) between 200 and 800 nm with a 10 mm pathlength quartz cell (acquisition step: 1 nm, scan speed: slow).
In Finland, dissolved organic carbon (DOC) was determined by thermal oxidation coupled with infrared detection (Multi N/C 2100, Analytik Jena, Germany) following acidification with phosphoric acid. Fe concentrations were determined colorimetrically with ferrozine, from the absorbance at 562 nm measured with a Victor3 1420 Multilabel Counter (PerkinElmer) [10]. In Sweden, DOC was measured with a Shimadzu TOC-5000 using catalytic combustion [11]. Fe was analysed using Inductively Coupled Plasma Optical Emission Spectroscopy (ICP-OES Varian Vista Pro Ax) [12].

In-situ measurement of spectral absorbance and validation
At the sites, portable multi-parameter UV-Vis sensors (spectro::lyser, S::CAN Messtechnik GmbH, Austria) were applied as an emerging technology to monitor the water condition. UV-Vis sensors can determine the real-time spectral absorbance of water [5]. Algorithms then calculate DOC and Fe concentrations based on the absorbance at a specific wavelength or at multiple wavelengths. Three UV-Vis sensors were installed, one in the Yli-Nuortti river on June 12, 2018 and two in the Krycklan catchments on May 9, 2016. They measured absorbance across the UV-Vis range (220-732.5 nm, at 2.5 nm intervals) every 15 minutes and recorded these values in an internal datalogger. Water depth, temperature, turbidity and the voltage of the S::CAN were recorded simultaneously.
Unlike the laboratory benchtop spectrophotometer, the in-situ S::CAN measured unfiltered water directly and was more sensitive to environmental changes such as water temperature, ambient sunlight and power supply. Therefore, the laboratory-measured absorbance was used to check the performance of the S::CAN and to validate the quality of the real-time spectral absorbance for DOC and Fe prediction (Fig. 3).

Modelling for DOC and Fe prediction
The real-time absorbance (every 15 min) from the S::CAN at C4, C5 and Cold station was aggregated into daily data, then merged with the laboratory-measured DOC (n = 183) and Fe (n = 142) by date. The absorbance values from 220 nm to 732.5 nm at 2.5 nm intervals (207 variables) were used as input data for the Fe analyses, while wavelengths shorter than 250 nm were excluded from the DOC analyses (194 variables) because inorganic substances can cause interference at the lower end of the UV-Vis range [13].
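The aggregation and merging step described above can be sketched in Python as follows. The text does not state how the 15-minute scans were combined into daily data, so the daily mean used here is an assumption, as are the function names, the data layout and all numbers; the sketch is not the authors' R workflow.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_mean_absorbance(scans):
    """Average 15-minute scans into one spectrum per calendar day.

    scans: list of (timestamp, {wavelength_nm: absorbance}) tuples.
    Assumes every scan reports the same set of wavelengths.
    """
    by_day = defaultdict(list)
    for timestamp, spectrum in scans:
        by_day[timestamp.date()].append(spectrum)
    return {
        day: {w: mean(s[w] for s in spectra) for w in spectra[0]}
        for day, spectra in by_day.items()
    }


def merge_with_lab(daily, lab_values, min_wavelength=None):
    """Pair each lab value with the same-day spectrum, optionally dropping
    wavelengths below min_wavelength (e.g. 250 nm for the DOC analyses)."""
    rows = []
    for day, value in lab_values.items():
        if day not in daily:
            continue  # no sensor data on that sampling day
        spectrum = {
            w: a for w, a in daily[day].items()
            if min_wavelength is None or w >= min_wavelength
        }
        rows.append((day, spectrum, value))
    return rows


# Made-up numbers purely for illustration (two wavelengths, three scans).
scans = [
    (datetime(2018, 6, 12, 0, 0), {220.0: 1.0, 250.0: 0.5}),
    (datetime(2018, 6, 12, 0, 15), {220.0: 3.0, 250.0: 0.75}),
    (datetime(2018, 6, 13, 0, 0), {220.0: 2.0, 250.0: 0.25}),
]
lab_doc = {datetime(2018, 6, 12).date(): 8.1}  # hypothetical DOC value

daily = daily_mean_absorbance(scans)
rows = merge_with_lab(daily, lab_doc, min_wavelength=250)
```

For the Fe analyses, the same merge would be called without `min_wavelength`, so that the full spectrum is kept.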
We used three methods: multiple stepwise regression (MSR), partial least-squares regression (PLS), and principal component regression (PCR). These methods were selected due to their applicability to data sets containing collinear variables and data sets that may contain a larger number of independent variables than observations. The laboratory-measured DOC and Fe concentrations were always the dependent variables, and the absorbance values at different wavelengths were the independent variables. The models rely on splitting the data into a training and a testing data set. We tried 5 different splits of the data (Table 1). The performance of the DOC prediction models (PLS, PCR, MSR) based on data sets 1 and 2 is shown in Tables 2 and 3, while that of the MSR model based on data sets 3 to 5 is shown in Table 4. Similarly, the performance of the Fe prediction models (PLS, PCR, MSR) based on data sets 1 and 2 is shown in Tables 5 and 6, while that of the MSR model based on data sets 3 to 5 is shown in Table 7.
The 'pls' package [14] in R [15] was used for the PCR and PLS analyses. Coefficients and p-values were estimated by the jackknife t-test method using the 'jack.test' function in the 'pls' package. MSR analyses were performed with the 'caret' package [16] in R [15]. The correlation coefficient (R), root-mean-square deviation (RMSD), standard deviation (STD) and bias were used to check the performance of the models.

Table 4
The goodness of fit statistics of the MSR regression estimating DOC by spectral absorbance for different data sets. Data set 3 is observations from C4 and C5; Data set 4 is observations from C4 and Cold station; Data set 5 is observations from C5 and Cold station. The training set contains 75% of the observations, randomly selected from each data set, and the testing set contains the remaining 25% of the observations.

Table 7
The goodness of fit statistics of the MSR regression estimating Fe by spectral absorbance for different data sets. Data set 3 is observations from C4 and C5; Data set 4 is observations from C4 and Cold station; Data set 5 is observations from C5 and Cold station. The training set contains 75% of the observations, randomly selected from each data set, and the testing set contains the remaining 25% of the observations.
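As an illustration of the four performance statistics used to evaluate the models (R, RMSD, STD and bias), the following Python sketch computes them for a pair of observed and predicted series. The exact definitions are not given in this text, so several choices here are assumptions: bias is taken as mean(predicted) minus mean(observed), STD as the standard deviation of the predicted values, and variances are divided by n rather than n - 1. The function name and the numbers are hypothetical.

```python
import math


def performance(observed, predicted):
    """Return R, RMSD, STD and bias for paired observed/predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n

    # Bias: mean difference between prediction and observation (assumed sign).
    bias = mean_p - mean_o

    # RMSD: root of the mean squared difference.
    rmsd = math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)

    # Population standard deviations (divide by n, an assumption).
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)

    # Pearson correlation coefficient.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    r = cov / (std_o * std_p)

    return {"R": r, "RMSD": rmsd, "STD": std_p, "bias": bias}


# Made-up values: every prediction is exactly 1 unit too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
stats = performance(observed, predicted)
```

In this toy case the predictions track the observations perfectly apart from a constant offset, so R is 1 while RMSD and bias both equal the offset; this shows why the correlation coefficient alone is not enough to judge a model.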