A Data-Based Framework for Identifying a Source Location of a Contaminant Spill in a River System with Random Measurement Errors

This study addresses the problem of identifying the source location of a contaminant spill in a river system when a sensor network returns observations containing random measurement errors. To solve this problem, we suggest a new framework comprising three main steps: (i) spill detection, (ii) data preprocessing, and (iii) source identification. Specifically, we applied a statistical process control chart to detect a contaminant spill with measurement errors while keeping the false alarm rate at less than or equal to a user-specified value. After detecting a spill, we generated a nonlinear regression model to estimate a breakthrough curve of the observations and derive a characteristic vector of the estimated curve. Using the characteristic vector as an input, a random forest model was constructed with the sensor raising the first alarm. The model provides output values between 0 and 1 to represent the possibility of each candidate location being the true spill source. These possibility values allow users to identify strong candidate locations for the spill. The accuracy of our framework was tested on part of the Altamaha River system in Georgia, USA.


Introduction
Water is a crucial resource for both public health and ecological life. Since the amount of fresh water is decreasing at the same time that population, industrialization, and environmental pollution are increasing, the importance of water quality monitoring is attracting more attention. Based on improvements to real-time sensor and data analysis technologies that enable people to monitor water quality more effectively, the problem of identifying the source location of a contaminant spill has also been extensively studied by researchers. Most previous studies related to this problem have addressed two types of water systems: groundwater and rivers. Optimization algorithms, such as linear/nonlinear programming and meta-heuristics, have been commonly used to identify contaminant spill locations in groundwater systems, as shown by Aral and Guan [1], Aral et al. [2], Gorelick et al. [3], Singh and Datta [4], and Sun et al. [5]. Additionally, statistical methods (such as a backward probability model approach [6,7] and a geostatistical approach [8]) and machine learning techniques including artificial neural network models have been used by Singh and Datta [9], Singh et al. [10], and Srivastava and Singh [11] to address similar problems.
Due to the size and complexity of the problem, fewer studies have been conducted for rivers than for groundwater systems. Boano et al. [12] employed a geostatistical approach that generates previous information about a pollutant. Chen et al. [13] applied multivariate statistical methods to determine the spatial and temporal variations of water quality, then identified the contaminant source location. The backward probability method was also used by Ghane et al. [14] and Telci and Aral [15], while Lee et al. [16] recently provided a method based on random forest models to identify the source location of a contaminant spill in a river system. The random forest model is a famous classification model that consists of a set of tree-structured classifiers. It is known to have several advantages, such as computational efficiency, accuracy, and robustness, compared with other models. One may see Breiman [17] for an overview of the random forest model. For source identification in a river system, the random forest models provide values between 0 and 1, indicating the possibility that a candidate location is the true spill location, while the backward probability method provides only the rankings of candidate locations.
Many studies rely on hydrodynamic simulation models that provide contaminant concentration levels in a water system to develop and test their methods. It is difficult to apply these methods, however, because, in practice, the observed concentration levels often contain random measurement errors. As mentioned by Kim et al. [18], the results from a case considering observations with random measurement errors may be totally different compared with those from a case considering ideal observations generated by the simulation models without measurement errors (e.g., the concentration levels are exactly 0 when no spill event occurs). To obtain reliable results despite random measurement errors, one may need to control false alarm rates, which can be done by statistical process control (SPC) charts. A good review of SPC charts is provided by Montgomery [19]. Among the various SPC charts, we focus on the CUSUM chart developed by Kim et al. [20] due to the following advantages: (i) it enables users to derive a threshold value with a simple analytical method and (ii) it can deal with autocorrelated observations that follow a non-normal distribution.
This study, therefore, considers the problem of identifying a spill source location in a river system in the presence of random measurement errors. To the best of our knowledge, this is the first work considering random measurement errors in the source identification problem. To solve this problem, we suggest a new framework comprised of three main steps: (i) detecting a contaminant spill, (ii) preprocessing obtained contaminant spill data, and (iii) identifying the spill source location via random forest models. Specifically, we first use the CUSUM chart developed by Kim et al. [20] to detect a contaminant spill, where the false alarm rate is guaranteed to be less than or equal to a user-specified level. Then, we obtain a set of observation data including all information about the contaminant spill and use a robust locally-weighted regression model [3] to estimate the profile of the observations, called the breakthrough curve. Finally, using profile characteristics as inputs, we develop a classification method using random forest models to measure the possibility that each candidate spill location is truly the correct location of the contaminant source. Based on the possibilities, we can consistently identify strong candidate locations for the spill.
This paper is organized as follows. Section 2 describes the problem with notations and assumptions and provides a method for obtaining the observation data. In Section 3, we suggest a new framework to identify the source location of a contaminant spill in the presence of random measurement errors. Section 4 presents experimental results of a case study applied to a part of the Altamaha River system located in Georgia, USA. Concluding remarks follow in Section 5.

Problem Description
We consider a river system including N possible candidate locations for a contaminant spill. We index the locations by integers starting from 1 and call these integers location indices. In order to monitor the water quality and detect the spill event, K number of sensors are installed at a subset of the possible spill locations to report the concentration levels of the contaminant at each discretized time t.
Let D denote the index set of all possible locations, i.e., D = {1, 2, . . . , N}. For 2 ≤ K ≤ N given, the location of the K sensors are represented by a vector y = (y 1 , . . . , y K ), such that y j ∈ D for all j = 1, . . . , K. (We assume y 1 < y 2 < . . . < y K to avoid the repetition of representations). At each time t, an observation X t y j is returned by the sensor installed at y j location, and the values from all K sensors at time t are represented by the observation vector: . . .
Since the true spill location is unknown and a contaminant spill can occur at any of the candidate locations, the possibility that location d ∈ D is the true source of the spill, P(d), can be evaluated. Note that 0 ≤ P(d) ≤ 1 for all d ∈ D. The closer P(d) is to 1, the more likely that d is the source location of the spill. Let P denote a vector of P(d) as follows: . . .

P(N)
The problem considered in this paper is to construct a data-driven framework for the purpose of identifying the source location of a contaminant spill. We develop a model that calculates P based on the recent observation vectors, X t−ω+1 (y), . . . . . . , X t (y) with a pre-specified window length ω, when K and y are given.
To construct the data-driven framework and test its performance, preparation of a large dataset is required. In the next subsection, we briefly explain the data acquisition method by incorporating random measurement errors with values obtained from a hydrodynamics simulation.

Data Acquisition with Simulation
This paper supposes that an observation X t y j may contain some measurement error for all t and j = 1, . . . , K. In order to obtain realistic X t y j values reported by the sensors with random measurement errors, we adopt the model from Kim et al. [18] as follows: where ξ t y j is the true concentration level of the contaminant and ε ξ is a random measurement error that depends on the value of ξ t y j .
For obtaining the realizations of ξ t y j values, a popular simulation software package called the Storm Water Management Model (SWMM) provided by the United States Environmental Protection Agency, Durham, NC, USA, is used. The SWMM is developed by the United States Environmental Protection Agency to simulate hydrodynamics and contaminant transport in river systems under dynamic flow, including the various rainfall and watershed conditions described by Rossman [21]. The SWMM requires inputs of variable information related to the properties of random contaminant spills and rainfall events derived from historical data in addition to fixed information related to geologic/geometric properties and fundamental river system hydrodynamics. After executing the SWMM software using the inputs, we can obtain concentration levels from each candidate location at every inter-reporting time of the simulation clock.
As shown in Kim et al. [18], the measurement error ε ξ is quantified based on two perspectives: accuracy (i.e., the bias of the measurement error) and precision (i.e., the variability of the measurement errors). Let µ ξ denote the level of accuracy and σ ξ denote the level of precision of the sensor at location y j at time t. For given values of µ ξ and σ ξ , ε ξ is assumed to be independent and identically distributed along with the Laplace distribution. Therefore, the probability density function of ε ξ is specified as follows:

Overall Description of the Proposed Framework
In this subsection, we provide notations and an overall description of our framework to identify the source location of a contaminant spill in the presence of random measurement errors. A list of notations needed to describe our framework is as follows: t the discretized time index at which each sensor returns an observation; y * the location index of the sensor that raises the first alarm; τ * a the point of time when the first alarm is on by the sensor at y * ; τ * b the point of time when the first alarm is off by the sensor at y * ; x th the concentration level reported by the sensor at y * at time τ * a (i.e., X τ * a (y * )); ω a pre-specified window length for spill detection; a lag parameter pre-designated by users to determine τ * b ; a vector representing the curvature characteristics of ψ(t); D y j the set of candidate spill locations that can be identified only by the sensor at y j ; and φ y j the random forest model corresponding to the sensor at y j . Figure 1 shows the overall description of our framework. Note that the framework consists of three main steps for (i) detecting a spill, (ii) preprocessing the set of obtained data, and (iii) identifying the source location with a classification model.

Overall Description of the Proposed Framework
In this subsection, we provide notations and an overall description of our framework to identify the source location of a contaminant spill in the presence of random measurement errors. A list of notations needed to describe our framework is as follows: the discretized time index at which each sensor returns an observation; * the location index of the sensor that raises the first alarm; * the point of time when the first alarm is on by the sensor at * ; * the point of time when the first alarm is off by the sensor at * ; the concentration level reported by the sensor at * at time * (i.e., * ( * )); a pre-specified window length for spill detection; ℓ a lag parameter pre-designated by users to determine * ; ( ) a nonlinear regression model for ( * ) in the time period [ * , * ]; ( * ) a vector representing the curvature characteristics of ( ); the set of candidate spill locations that can be identified only by the sensor at ; and the random forest model corresponding to the sensor at . Figure 1 shows the overall description of our framework. Note that the framework consists of three main steps for (i) detecting a spill, (ii) preprocessing the set of obtained data, and (iii) identifying the source location with a classification model. Sections 3.2-3.4 provide details about each of the three steps, respectively.
Set ← + 1. end while Set * as the sensor location at which the first alarm is raised. Update * as the current time index and set as * ( * ).

Spill Detection
Detecting a contaminant source location under measurement errors is more complicated than detecting one without measurement errors. If there is no measurement error, the observed value is the same as the true contaminant concentration level and a sensor can simply raise an alarm when the observed value exceeds 0. In this case, an alarm is always a true alarm providing notification that a spill event has happened somewhere upstream. With measurement errors, however, nonzero values can be reported by sensors even when no spill event has occurred. In this case, users should be aware of the risk of false alarms and must carefully determine the threshold value for triggering an alarm. (A high threshold value increases the chance of missing a spill event, while a low threshold value subjects users to frequent false alarms.) In order to detect a contaminant spill under random measurement errors, a statistical process control (SPC) chart with a carefully designed threshold can be used as a monitoring tool. The SPC chart was originally developed to detect a change in the mean of monitoring statistics while maintaining a targeted rate of false alarms. Among the various SPC charts, this paper introduces the CUSUM chart developed by Kim et al. [20] because it provides a simple analytical method for identifying a threshold value even for autocorrelated observations that follow a non-normal distribution. The target false alarm rate is proportional to a target in-control average run length (ARL), denoted by ρ, which represents the average number of observations until a false alarm is raised when the true mean of monitoring statistics is the same. Based on the target ρ given (i.e., the target false alarm rate given), Kim et al. [20] provide the threshold value H by solving the following equation: where ς represents a reference parameter and σ 0 is the known standard deviation of the random measurement errors. We use ς = 0.1, which is recommended by Kim et al. [20]. We then construct our CUSUM statistics S + t y j for the sensor at location y j at time t with the monitoring window length ω as follows:

Constructing CUSUM monitoring statistics
Once the threshold value H from Equation (5) and the monitoring statistic S + t y j are ready, an alarm is raised by a sensor at location y j at time t, when If the first alarm is raised by a sensor, we record the location of the sensor as y * , the current time as τ * a , and the concentration level X τ * a (y * ) as x th , before proceeding to the data preprocessing step.

Data Preprocessing
The main purpose of the data preprocessing step is to prepare refined input data for the random forest models to identify the source of the contaminant spill in the next step. This preparation is done by modeling the breakthrough curve of the observed concentration levels over time after the alarm [16] and deriving the characteristics of the curve. To model the breakthrough curve that contains the most important information about the spill event, we must first define the start and end times of the curve. Because of random measurement errors, positive values of observations can be returned by a sensor even without any spill event. Thus, the time period considered for the curve should not be chosen simply as a period having positive values for returned observations. Instead, our framework suggests a heuristic approach to defining the start time of the curve as τ * a and the end time as where is a lag parameter pre-designated by users. See Figure 2 for examples of X t (y * ) over time, τ * a and τ * b . values for returned observations. Instead, our framework suggests a heuristic approach to defining the start time of the curve as * and the end time as * ≡ min { | ( * ) < for = − ℓ, … , and * }, where ℓ is a lag parameter pre-designated by users. See Figure 2 for examples of ( * ) over time, * and * . Since random measurement error is similar to a noise factor, instead of using raw data { ( * ); = ( * ), ⋯ , ( * )} to represent the breakthrough curve, we used a nonlinear regression model to refine the curve. By reducing the impact of the random measurement errors with the regression model, a smoother and clearer breakthrough curve can be obtained. Among the various nonlinear regression models, we employ the robust locally-weighted regression model developed by Cleveland [22]. In this model, the independent variable is and the dependent variable is ( * ) for = * , ⋯ , * . Let ( ) denote the concentration level estimated by the regression model at . In Figure 2, the solid curve provides an example of the curve ( ) over time = ( * ), ⋯ , ( * ).
After obtaining the estimated breakthrough curve ( ), we need to derive its characteristics. We basically introduce two quantitative characteristics of ( ), denoted by ( * ) and ( * ), which represent the total area and the time-averaged area between the horizontal axis and the estimated breakthrough curve ( ), respectively [10]. Specifically, they are calculated as follows: In addition to ( * ) and ( * ), we also consider the central statistical moment, standard deviation, skewness, and kurtosis of ( ) as the main characteristics of the curve as described by Telci and Aral [15]. The first moment of the estimated breakthrough curve is denoted by ( * ) and calculated by * Figure 2. An example plot for X t (y * ), τ * a , τ * b , and ψ(t).
Since random measurement error is similar to a noise factor, instead of using raw data X t (y * ); t = τ a (y * ), · · · , τ b (y * ) to represent the breakthrough curve, we used a nonlinear regression model to refine the curve. By reducing the impact of the random measurement errors with the regression model, a smoother and clearer breakthrough curve can be obtained. Among the various nonlinear regression models, we employ the robust locally-weighted regression model developed by Cleveland [22]. In this model, the independent variable is t and the dependent variable is X t (y * ) for t = τ * a , · · · , τ * b . Let ψ(t) denote the concentration level estimated by the regression model at t. In Figure 2, the solid curve provides an example of the curve ψ(t) over time t = τ a (y * ), · · · , τ b (y * ).
After obtaining the estimated breakthrough curve ψ(t), we need to derive its characteristics. We basically introduce two quantitative characteristics of ψ(t), denoted by B 1 (y * ) and B 2 (y * ), which represent the total area and the time-averaged area between the horizontal axis and the estimated breakthrough curve ψ(t), respectively [10]. Specifically, they are calculated as follows: In addition to B 1 (y * ) and B 2 (y * ), we also consider the central statistical moment, standard deviation, skewness, and kurtosis of ψ(t) as the main characteristics of the curve as described by Telci and Aral [15]. The first moment of the estimated breakthrough curve is denoted by µ(y * ) and calculated by The estimated kth central moment of ψ(t), for k = 2, 3, . . ., is denoted by m k ψ (y * ) and calculated by The standard deviation, skewness, and kurtosis of the estimated curve, denoted by B 3 (y * ), B 4 (y * ), and B 5 (y * ) respectively, are calculated as follows: Based on Equations (8)- (14), the curvature characteristics of ψ(t) are summarized into a five-dimensional vector as follows: B(y * ) = (B 1 (y * ), B 2 (y * ), B 3 (y * ), B 4 (y * ), B 5 (y * )).

Source Identification
To identify the location of the contaminant source, we employed the random forest model shown in Lee et al. [16]. As inputs of the random forest model, Lee et al. [16] used not only characteristics of the breakthrough curves, but also time differences in alarms between any pair of sensors. Obtaining such relative time information, however, is often time-consuming and not helpful for improving source identification with random measurement errors. Therefore, we used only the characteristics of the estimated breakthrough curve B(y * ) as inputs to our random forest models.
In this study, we partitioned the whole river region into K sub-regions, each of which was monitored by a single sensor. Recall that D y j is defined as a set of candidate spill locations that can be identified only by the sensor located at y j . For example, when N = 19, K = 2 and y = (9, 19), D(9) = {1, 2, · · · 14} and D(19) = {15, 16, 17, 18, 19} as in Figure 3. The sensor located at 9 raises the first alarm if a spill event occurs at any locations in D(9), while the sensor located at 19 raises the first alarm if a spill occurs at any locations in D (19). In our framework, a random forest model was constructed corresponding to each D y j by using simulated training data that includes random measurement errors. Let φ y j denote the random forest model regarding D y j . can be identified only by the sensor located at . For example, when = 19, = 2 and = (9, 19), (9) = {1,2, ⋯ 14} and (19) = {15, 16, 17, 18, 19} as in Figure 3. The sensor located at 9 raises the first alarm if a spill event occurs at any locations in (9), while the sensor located at 19 raises the first alarm if a spill occurs at any locations in (19). In our framework, a random forest model was constructed corresponding to each by using simulated training data that includes random measurement errors. Let denote the random forest model regarding . The random forest model consists of several tree classifiers. Suppose that we have Γ number of datasets sampled from the training data using a bootstrapping technique. Since a tree classifier is generated from each sample dataset, there are Γ number of tree classifiers in the random The random forest model φ y j consists of several tree classifiers. Suppose that we have Γ number of datasets sampled from the training data using a bootstrapping technique. Since a tree classifier is generated from each sample dataset, there are Γ number of tree classifiers in the random forest model. A tree classifier has internal nodes and terminal nodes. At each internal node, Λ number of input variables are randomly selected and linearly combined with coefficients. The linear combination of the selected input variables is checked if its value is above a certain threshold constant, and we move to the next node according to the result. To set the threshold constant and the coefficients of the linear combination at each internal node, a randomized node optimization algorithm is used [23]. Each terminal node represents one of the locations in D y j as the identified spill source location, and no additional decision or action is taken. Figure 4 shows a part of the tree classifier constructed for D (19) from Figure 3 as an example. Notably, different tree classifiers have different structures (e.g., the numbers of nodes and arcs) and logic, with different threshold constants and linear combinations for each internal node of the classifiers. If an input vector is given in the model, each tree classifier individually nominates one of the candidate locations as the identified contaminant source location. The random forest model φ y j collects the votes from all tree classifiers and calculates P(d) for d ∈ D y j proportional to the number of votes from Γ. The model returns P(d) = 0 for d ∈ D c y j . Figure 5 depicts the overall structure of a random forest model.
As remarked on by Lee et al. [16], Γ (i.e., the number of tree classifiers) and Λ (i.e., the number of input variables randomly selected at each internal node) affect the performance of the random forest model. As Γ increases, the error of the model gradually decreases and converges to a certain value; therefore, we use a sufficiently large value for Γ. Like Lee et al. [16], we selected Λ value as that with minimum error among the possible integers in As remarked on by Lee et al. [16], Γ (i.e., the number of tree classifiers) and Λ (i.e., the number of input variables randomly selected at each internal node) affect the performance of the random forest model. As Γ increases, the error of the model gradually decreases and converges to a certain value; therefore, we use a sufficiently large value for Γ. Like Lee et al. [16], we selected Λ value as that with minimum error among the possible integers in [0.5√ , 2√ ] as well as [0.5( + 1), 2( + 1)], where is the number of variables in the input vector.

Simulation Setup for the Targeted River System
In our case study, we applied our framework to monitor a part of the Altamaha River system in the state of Georgia, USA. The whole river system is approximately 760 km long and has over 36,260 km 2 area with 60 reaches and 62 junctions. The fixed information for the SWMM, such as geometric, geologic, and hydrodynamic data, was obtained from the United States Geological Survey in the National Elevation Dataset, and spill and rainfall events are considered variable information for the SWMM. Spill starting time and intensity followed uniform distributions in [0, 10] days and [10, 1000] grams per liter, respectively. The rainfall pattern was randomly selected among five rain patterns for each of 10 predesignated sub-catchments for the river. See Park et al. [24] and Telci et al. [25] for more

Simulation Setup for the Targeted River System
In our case study, we applied our framework to monitor a part of the Altamaha River system in the state of Georgia, USA. The whole river system is approximately 760 km long and has over 36,260 km 2 area with 60 reaches and 62 junctions. The fixed information for the SWMM, such as geometric, geologic, and hydrodynamic data, was obtained from the United States Geological Survey in the National Elevation Dataset, and spill and rainfall events are considered variable information for the SWMM. Spill starting time and intensity followed uniform distributions in [0, 10] days and [10,1000] grams per liter, respectively. The rainfall pattern was randomly selected among five rain patterns for each of 10 predesignated sub-catchments for the river. See Park et al. [24] and Telci et al. [25] for more details about the river system and the corresponding SWMM model.
For each run, the SWMM monitors hydrodynamics and contaminant levels of spills and rainfalls for 40 days. The related quantitative values (e.g., concentration levels, flow rates, and the amounts of overflows) are reported every 15 min in the simulation clock at each candidate location. Each training and test dataset is obtained by adding random errors whose density function is provided in Equation (4) to the concentration levels returned by the SWMM.

Experimental Setup
We consider a part of the Altamaha River as a targeted study area, as seen in Figure 6. The study area includes 53 candidate spill locations (i.e., D = {1, 2, . . . , 53}), which are marked as gray circles in Figure 6. Based on research from Lee et al. [16], we set the number of sensors K = 6 and their locations, y = (9, 19, 26, 33, 46, 53). As mentioned in Section 3.4, the whole study area was divided into six sub-regions that were independently monitored by each sensor. The corresponding D(y i ) values are represented in Figure 6.
For the random measurement errors, we consider two configurations to evaluate the impact of the errors: (i) low bias and low variability, denoted by (L, L), and (ii) high bias and high variability, denoted by (H, H). Specifically, we use µ ξ = 0.002ξ t (y j ) and σ ξ = 0.005 + 0.02ξ t (y j ) for the (L, L) configuration and use µ ξ = 0.05ξ t (y j ) and σ ξ = 0.015 + 0.06ξ t (y j ) for the (H, H) configuration as in Kim et al. [18]. For each of the random measurement error configurations, we searched the threshold value H of the CUSUM chart by using Equation (5) with ρ = 1000 (days) for each sensor (i.e., equivalent to the type I error, approximately 1.04 × 10 −5 ). Under each configuration of measurement errors, we identified the control limit, H, for the CUSUM chart using Equation (5) when ρ = 1000. Thus, we set H = 0.228 for the (L, L) configuration and H = 0.684 for the (H, H) configuration. The monitoring window length for the CUSUM chart, ω, is set to 10 days and is set to 7. Table 1 summarizes information about random forest models constructed by evaluating D(y j )s. For each sensor location y j , the random forest model φ(y i ) is constructed (and trained) by using a D(y i ) × 500 number of simulation datasets where |·| represents the cardinality of a set. As mentioned in Section 3.4, we checked various values of Γ (i.e., the number of tree classifiers in a random forest model) and Λ (i.e., the number of input variables randomly selected at each internal node), and finally selected Γ = 500 and Λ = 4 for all models. We used the "StatsModels" package to generate each robust, locally-weighted regression and the "scikit-learn" package to train each random forest model in Python version 0.21.2 (provided by the Python Software Foundation, Beaverton, OR, USA) with a personal computer (Intel Xeon E5-1650 CPU; RAM 64GB). The average time required to generate each random forest model was approximately 10.7, 3.1, 3.9, 3.8, 10.5, 5.2 s. Table 1 also provides the out-of-bag (OOB) errors for each random forest model, which was calculated by the ratio misclassified datasets to the total number of training datasets for φ(y i ).
to generate each robust, locally-weighted regression and the "scikit-learn" package to train each random forest model in Python version 0.21.2 (provided by the Python Software Foundation, Beaverton, OR, USA) with a personal computer (Intel Xeon E5-1650 CPU; RAM 64GB). The average time required to generate each random forest model was approximately 10.7, 3.1, 3.9, 3.8, 10.5, 5.2 s. Table 1 also provides the out-of-bag (OOB) errors for each random forest model, which was calculated by the ratio misclassified datasets to the total number of training datasets for ( ).

Results
We ran our framework on 53 × 100 test datasets (i.e., 100 spills occurred at each of 53 candidate locations) to evaluate the performance of source identification. Figures 7 and 8 show the identification accuracy rate (in y-axis) for the true spill location (in x-axis) under the (L, L) and (H, H) configurations, respectively. As the result of each test dataset, we can obtain a list of locations from the most promising to the least promising regarding the highest to lowest values of P(d) returned by our framework. Figure 7a presents the percentage of times (i.e., the rate) that the correct source location is the first place on the list. Figure 7b,c presents the percentage of times (i.e., the rate) that the correct source location is within the top two or three places on the list, respectively. Figure 7a shows that spills occurred at about one-fifth of the locations are identified with a relatively low accuracy rate of less than 60% while the spills that occurred at about half of the locations are identified easily with an accuracy rate of more than 90%. However, regarding Figure 7b,c, high accuracy rates of more than 90% are achieved for all 53 locations. In sum, no matter where the spill occurs, users can find the true spill location with more than 90% probability, if they visit the top three locations on the list. Figure 8 shows a similar pattern to Figure 7, but the identification accuracy rate in Figure 8 is slightly worse than that in Figure 7 because of higher bias and the variability of random measurement errors.   The random forest model is advantageous because it enables users to analyze the detailed possibility of the correct selection (e.g., P(d)). By comparing the obtained P(d) values, users can pick stronger candidates for the spill quickly and efficiently. For example, see Figure 9 representing the averaged P(d) values (denoted byP(d)) at each location under the (L, L) configuration for some related pairs of locations. Let us consider the first pair (i.e., locations 4 and 5) in Figure 9, as an example. When a spill occurs at location 4 or 5, both locations 4 and 5 achieve significantly higher P(d) values than other locations. This implies that the spill at location 4 can often be confused with the spill at location 5. However, it implies that locations 4 and 5 can be recognized as promising spill locations when compared with other locations, regardless of where the true spill location is. (Note that similar analyses can be done for other pairs in Figure 9.) Thus, users can distinguish a group of stronger candidates based on gaps among P(d) values even when the true spill location is unknown, which is especially helpful in practical situations for finding unknown source locations under measurement errors.

Conclusions
This study proposed a new framework to identify the source location of a contaminant spill in a river system with random sensor measurement errors. In the framework, one may first detect a contaminant spill using the CUSUM chart while the type I error of detection is controlled. The framework generates a nonlinear regression model for observation data with random errors to estimate a breakthrough curve, and derives characteristics of the estimated breakthrough curve for use as inputs to random forest models. After selecting one random forest model corresponding to the sensor that detected the spill, values between 0 and 1 are evaluated for the possibility of each candidate location being the true contaminant source location. Based on the detailed values, users can judge how likely each location is to be the true spill location, and analyze the gaps among the values to distinguish the most promising candidate locations instead of just picking locations regardless of the possibility. The test results of applying our framework to the Altamaha River system in the USA show that our framework performs well on the identification of a spill source location, even with measurement errors. In the test results, the true spill location is listed within the top three promising locations with more than 90% probability in most of the cases considered.
Our framework was originally proposed to identify a contaminant spill source location in a river system. However, beyond water monitoring systems, the framework can be applied to other areas including traffic and transportation, manufacturing process monitoring, and telecommunications.
While conducting this study, we determined that the identification accuracy of the framework

Conclusions
This study proposed a new framework to identify the source location of a contaminant spill in a river system with random sensor measurement errors. In the framework, one may first detect a contaminant spill using the CUSUM chart while the type I error of detection is controlled. The framework generates a nonlinear regression model for observation data with random errors to estimate a breakthrough curve, and derives characteristics of the estimated breakthrough curve for use as inputs to random forest models. After selecting one random forest model corresponding to the sensor that detected the spill, values between 0 and 1 are evaluated for the possibility of each candidate location being the true contaminant source location. Based on the detailed values, users can judge how likely each location is to be the true spill location, and analyze the gaps among the values to distinguish the most promising candidate locations instead of just picking locations regardless of the possibility. The test results of applying our framework to the Altamaha River system in the USA show that our framework performs well on the identification of a spill source location, even with measurement errors. In the test results, the true spill location is listed within the top three promising locations with more than 90% probability in most of the cases considered.
Our framework was originally proposed to identify a contaminant spill source location in a river system. However, beyond water monitoring systems, the framework can be applied to other areas including traffic and transportation, manufacturing process monitoring, and telecommunications.
While conducting this study, we determined that the identification accuracy of the framework can be improved if the exact starting and ending times of the breakthrough curve caused by the spill event are known. Further research on improving the accuracy of these time indices is ongoing.
Author Contributions: C.P., M.L.L., and J.H.K. designed the study performed the experiments, and analyzed the results; C.P. and M.L.L. wrote and revised the paper.