Detection of Wastewater Pollution Through Natural Language Generation With a Low-Cost Sensing Platform

The detection of contaminants in several environments (e.g., air, water, sewage systems) is of paramount importance to protect people and predict possible dangerous circumstances. Most works do this using classical Machine Learning tools that act on the acquired measurement data. This paper introduces two main elements: a low-cost platform to acquire, pre-process, and transmit data to classify contaminants in wastewater; and a novel classification approach to classify contaminants in wastewater, based on deep learning and the transformation of raw sensor data into natural language metadata. The proposed solution presents clear advantages against state-of-the-art systems in terms of higher effectiveness and reasonable efficiency. The main disadvantage of the proposed approach is that it relies on knowing the injection time, i.e., the instant in time when the contaminant is injected into the wastewater. For this reason, the developed system also includes a finite state machine tool able to infer the exact time instant when the substance is injected. The entire system is presented and discussed in detail. Furthermore, several variants of the proposed processing technique are also presented to assess the sensitivity to the number of used samples and the corresponding promptness/computational burden of the system. The lowest accuracy obtained by our technique is 91.4%, which is significantly higher than the 81.0% accuracy reached by the best baseline method.

wastewater (WW) is particularly important [3]. WW is the water that has already been used for some purpose (civil or industrial uses) and must be subjected to purification before being returned to the natural cycle. To function at their best and effectively, the purification systems must know a priori the type of substances mixed with the water. It follows that a purification system for water for industrial use will be different from a purification plant for water for civil use. Hence, there is a strong need for protocols to promptly detect incompatible substances, to guarantee the correct and effective operation of purification plants [4].
Currently, this is solved by organizing periodic monitoring activities at particular points of the water path, which are carried out by the control institutes in charge using specialized laboratory instruments. Although this is an effective method, the quality of the water between two consecutive checks is unknown, and the checks may not be frequent enough to promptly identify problems. The ideal solution would combine automated, continuous, and distributed early-warning monitoring with periodic manual checks carried out by the control institutes.
To solve the problems of cost and installation of a distributed and continuous monitoring system, it is necessary to resort to low-cost and IoT-ready systems [5], which are able not only to collect environmental data but also to process them relying on centralized data collection and elaboration points.
In this context, the data collected from the sensors need to be processed by an algorithm that is used to analyze and forecast the presence (or absence) of polluting substances in the WW. Current state-of-the-art systems for this task rely on machine learning algorithms such as decision trees [6], [7].
In this paper, we propose a novel system based on deep learning, and in particular on causal generative models developed for natural language tasks, for the detection and classification of pollutants in WW, starting from the data collected by a multisensory system based on SENSIPLUS (Sensichips srl, Pisa, Italy) [8]. Note that the present paper does not present the infrastructure necessary for data transport as any solution based, for example, on MQTT or message queuing protocols could be used for this purpose.
The effectiveness of the proposed classifier is tested against a set of state-of-the-art baselines on a dataset created in collaboration with Sensichips s.r.l. and made available to the scientific community [9]. Results show that the proposed methodology outperforms the baseline methods and its effectiveness allows for practical usage of the developed methodology.

II. RELATED WORK
The monitoring of wastewater is a widely discussed topic in the scientific literature. In particular, several kinds of technologies contribute to developing sensors that discriminate and classify undesired substances to ensure an adequate water quality level. Some of the authors developed systems able to monitor both water and air thanks to the SENSIPLUS platform [10], [11], [12], [13]. The monitoring outputs can vary, ranging from a classification of the pollutants to a simple binary decision on the presence of contaminants in general. Precise solutions to specific problems are often preferred to the development of generic monitoring systems that can work properly in very wide contexts. As an example, Lim [14] describes a system to detect pollutants in the WW framework, although the distinction between different substances is missing and the technologies appear outdated nowadays. A different approach is taken by Lepot et al. [15], where the presence of illegal connections in the sewage system is monitored using an infrared camera. Ji et al. [16] present an image processing system intended to estimate the WW amount without distinguishing among substances. The cameras adopted to acquire images do not suffer from sensor corrosion problems, but they require a high energy budget, which puts the system far from the low-cost condition. There are other cases where the classification accuracy is very high but the energy/cost constraints are not taken into account. This is the case of Pisa et al. [17], who developed a system to detect ammonium and total nitrogen based on another one that is more broadly designed to detect all components derived from nitrogen. Drenoyanis et al. [18] propose an interesting portable device to monitor sewer pumping station pumps in order to generate alarms whenever anomalies are detected. The system is surely of great interest, but it does not include any pollutant classification stage.
In terms of processing techniques, to the best of our knowledge, this is the first work leveraging natural language processing techniques, and in particular causal models developed for natural language generation, for the task of detecting WW pollution. Nevertheless, in the literature we can find examples of the usage of natural language processing techniques and language models for non-canonical tasks. Language models have been used in the medical domain after the application of a ''reverse encoding'' (i.e., translating codes back to their description) for the classification of diagnostic tests [19], [20], [21] and for diagnostic rule encoding [22]. Furthermore, they have been used with a similar technique for the task of human mobility forecasting [23], [24]. More generally, transformer-based models originally designed for NLP tasks have demonstrated successful applications in a wide variety of non-NLP tasks [25], including: images [26], [27], [28], videos [29], [30], [31], speech and audio recognition [32], [33], conversational systems [34], [35], recommender systems [36], [37], reinforcement learning [38], [39], graphs [40], [41], protein structure predictions [42], [43], autonomous driving [44], [45], and anomaly detection problems [46], [47].

III. SYSTEM AT A GLANCE
The proposed system is end-to-end and contains hardware and software components, which are detailed in the following. VOLUME 11, 2023 50273 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

A. HARDWARE
The hardware part of the acquisition chain can be seen in Figure 1, where the following components are depicted: the Smart Cable Water (SCW), which is the sensing element; the SENSIBUS cable, a proprietary one-wire cable that allows communication with and control of the SCW; and a Micro Control Unit (MCU) with onboard firmware for controlling the SCW and for gathering and transmitting data to the cloud.
The SCW is a low-cost multi-sensory proprietary system (https://sensichips.com/smart-cable-water/), based on SENSIPLUS technology, capable of carrying out Electrochemical Impedance Spectroscopy (EIS) and Voltammetry measurements. SENSIPLUS is a proprietary technology of Sensichips s.r.l. developed in collaboration with the University of Pisa [8].
The SCW is equipped with multiple sensors consisting of 6 Interdigitated Electrodes (IDE) realized on a base made of copper and functionalized with Gold, Copper Oxide, Platinum, Silver, Nickel, and Palladium (see Figure 2). These sensitive elements, in conjunction with the EIS available on the chip, constitute the sensors adopted for the collection of the samples present in the dataset. In detail, measurements have been performed on five of these sensors. The IDEs metalized with Platinum and Gold have been analyzed at two specific frequencies, 200 Hz and 78 kHz, while Copper, Silver, and Nickel have been analyzed only at 200 Hz. Different stimulus frequencies allow the exploitation of different frequency responses, since the interactions between the metals on the sensors and the pollutants vary according to this parameter. In total, 12 quantities proportional to the capacitance and resistance of the IDEs were acquired. Table 1 reports the correspondence between IDEs and frequencies.

B. ACQUISITION AND PRE-PROCESSING SOFTWARE
The software components of the elaboration chain can be seen in Figure 3 where the following components are visible: the C API implemented as firmware for the MCU, a Finite State Machine (FSM) for baseline acquisition and injection detection of substances, the classification system. The MCU's firmware controls the SENSIPLUS chip through the SENSIBUS channel, collecting raw data and transmitting it to the computational module. The computational module (that could be a workstation connected through USB or systems in the cloud connected with TCP/IP through Wi-Fi) is responsible for running the FSM and the classification system. The classification system is described in Section V-D, while the FSM is described in the following.
The FSM is represented in Figure 4 and works in two steps:
• baseline extraction: a baseline signal is extracted to normalize raw data;
• forwarding decision: for each sample, the FSM decides whether to forward it to the classifier, also providing the injection time.
The FSM generates the baseline signal $b_t$ by an Exponential Moving Average (EMA), whose update is governed by the current state of the FSM:
$$b_t = \alpha\, s_t + (1 - \alpha)\, b_{t-1} \tag{1}$$
where $s_t$ are the sensors' raw data at time $t$; {WT, BA, BT, BSP, BS} are the possible states of the FSM and correspond respectively to {Wait, Baseline Acquisition, Baseline Tracking, Baseline Suspended, Baseline Stopped}. The $\alpha$ parameter of the EMA is the reciprocal of $EMA_c$, which has been empirically set to 25. The normalized signal $f_t$ forwarded by the FSM is evaluated as
$$f_t = \frac{s_t}{b_t} \tag{2}$$
where $s_t$ is the raw data collected from the sensors and $b_t$ is the baseline signal computed as described by Equation 1. $f_t$, $s_t$, and $b_t$ are $n$-dimensional vectors, with $n$ equal to the number of sensors (see Table 1); the division in the equation is element-wise.
Thanks to this baseline, the system can mitigate sensor drift, environmental noise, signal spikes, and variability between sensors.
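As a minimal sketch of how this normalization absorbs slow drift, the following Python fragment applies an EMA baseline to a single sensor channel and divides each raw sample by it. It assumes the standard EMA recursion with a smoothing factor of 1/25 and a baseline initialized to the first sample; the function name `ema_normalize` and the synthetic drifting signal are illustrative and not taken from the original system.

```python
def ema_normalize(samples, ema_c=25):
    """Return (baseline, normalized) lists for one sensor channel.

    Assumes the standard EMA recursion b_t = alpha*s_t + (1-alpha)*b_{t-1}
    with alpha = 1/ema_c. Initializing the baseline with the first sample
    is an assumption made for this sketch.
    """
    alpha = 1.0 / ema_c
    baseline, normalized = [], []
    b = samples[0]
    for s in samples:
        b = alpha * s + (1.0 - alpha) * b   # EMA update
        baseline.append(b)
        normalized.append(s / b)            # ratio to the baseline
    return baseline, normalized


# Synthetic channel with a slow linear drift (illustrative values only).
drifting = [100.0 + 0.01 * t for t in range(600)]
_, f = ema_normalize(drifting)
# Largest deviation of the normalized signal from 1 once the EMA has settled.
max_dev = max(abs(x - 1.0) for x in f[100:])
```

Under these assumptions the normalized signal stays close to 1 while the raw signal drifts, which is why a fixed threshold on the distance from the vector of ones can be used downstream.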
In the schema of the FSM reported in Figure 4, $t$ is the current time, $\tau$ is a threshold empirically set to 0.05, and $d_t$ is the Euclidean distance in the feature space between $f_t$ and a vector of ones, denoted as $u$.
When $b_t$ is equal to $s_t$ in Equation 2, $f_t$ coincides with the vector of ones $u$. For this reason, the Euclidean distance is computed with respect to $u$, and $d_t$ equal to zero means that the baseline signal $b_t$ is perfectly tracking the sensor signals $s_t$:
$$d_t = \lVert f_t - u \rVert_2 \tag{3}$$
The WT state is conceived to ''fill'' the EMA; the BA state is reached automatically after $EMA_c$ samples. In the BA state, the FSM starts to follow the signal, waiting for good tracking, which is assessed by analyzing the distance from the baseline. Once the variability of the distance, computed as its mean plus three times its standard deviation, is below the threshold $\tau$ (empirically set to 0.05), the system moves to the BT state. The system then checks whether a substance has been spilled into the water by testing whether the current distance is greater than $\tau$. When the signal moves away from the baseline, the state becomes BSP. In the BSP state, the system checks that the current distance remains above the threshold for five consecutive samples, after which it moves to the BS state; otherwise, it returns to the BT state (to avoid confusing the spill of a substance with a measurement spike or noise). Finally, when the FSM reaches the BS state, the current normalized sample $f_t$ is forwarded to the classification module.
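The state transitions described above can be sketched in Python as follows. This is not the authors' implementation: the class name `InjectionFSM`, the length of the window used for the mean-plus-three-standard-deviations check, the choice to freeze the baseline in the BSP and BS states, and the convention that the first sample above the threshold both counts toward the five confirmations and marks the injection time are all assumptions made for illustration.

```python
import math
from collections import deque
from statistics import mean, pstdev


class InjectionFSM:
    """Illustrative WT -> BA -> BT -> BSP -> BS state machine."""

    def __init__(self, ema_c=25, tau=0.05, confirm=5, window=25):
        self.alpha = 1.0 / ema_c
        self.ema_c, self.tau, self.confirm = ema_c, tau, confirm
        self.state = "WT"
        self.baseline = None
        self.seen = 0                        # samples consumed in WT
        self.recent = deque(maxlen=window)   # recent distances (BA state)
        self.above = 0                       # consecutive samples above tau
        self.injection_time = None

    def _track(self, s):
        # EMA update of the baseline (first sample initializes it).
        if self.baseline is None:
            self.baseline = list(s)
        else:
            self.baseline = [self.alpha * x + (1 - self.alpha) * b
                             for x, b in zip(s, self.baseline)]

    def step(self, t, s):
        """Process raw sample s at time t; return f_t once in BS, else None."""
        if self.state in ("WT", "BA", "BT"):
            self._track(s)                   # baseline frozen in BSP and BS
        f = [x / b for x, b in zip(s, self.baseline)]
        d = math.sqrt(sum((v - 1.0) ** 2 for v in f))
        if self.state == "WT":
            self.seen += 1
            if self.seen >= self.ema_c:
                self.state = "BA"
        elif self.state == "BA":
            self.recent.append(d)
            full = len(self.recent) == self.recent.maxlen
            if full and mean(self.recent) + 3 * pstdev(self.recent) < self.tau:
                self.state = "BT"
        elif self.state == "BT":
            if d > self.tau:
                self.state, self.above, self.injection_time = "BSP", 1, t
        elif self.state == "BSP":
            if d > self.tau:
                self.above += 1
                if self.above >= self.confirm:
                    self.state = "BS"
            else:                            # spike or noise: resume tracking
                self.state, self.injection_time = "BT", None
        return f if self.state == "BS" else None


# Two synthetic channels: stable for 200 samples, then a step change.
fsm = InjectionFSM()
signal = [[100.0, 50.0]] * 200 + [[120.0, 60.0]] * 20
forwarded = [t for t, s in enumerate(signal) if fsm.step(t, s) is not None]
```

In this sketch a single outlier sends the machine to BSP and straight back to BT, while a sustained change is confirmed after five samples and only then are normalized samples forwarded.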

C. THE CLASSIFICATION MODULE
The proposed classification module is based on deep learning for natural language processing, and in particular on Transformer-based [48] models. We employ T5 [49], a large text-to-text language model pre-trained on a multi-task mixture of unsupervised and supervised tasks: the former are de-noising objective tasks, while the latter are text-to-text language modeling tasks. For a complete overview of tasks and prompts, please refer to Raffel et al. [49, Appendix].
The T5 model architecture is similar to that of a general Transformer model: it is composed of a stack of encoder blocks, which transform the input text into a latent representation, and a stack of decoder blocks, which translate the latent representation into a new output text. Each block comprises a self-attention module, optional encoder-decoder attention, and a feed-forward network. Since it is a text-generation model, it takes a textual input and generates a textual response.
We leverage the pre-training knowledge of the model and adapt it to the task of substance prediction by textifying the raw sensors' input and training the model to produce a string stating the name of the pollutant present in the wastewater. All the parts of the classification module are detailed in Section V.

IV. DATA
In this section, the acquisition process for dataset creation is described in all its aspects.

A. SUBSTANCES
The dataset used in this work aims to identify pollutants in WasteWater (WW), paying attention to spills of chemical compounds that could compromise public safety and/or the efficiency of purification systems. The acquisitions were made in the laboratory to simplify data collection. Measurements at experimental sites were excluded for two reasons: to ensure safety due to biological risks related to the presence of unknown bacteria or pollutants and to have controllable measurement conditions. In fact, the composition of WW is not stable over time, for example, due to atmospheric events such as rain. In detail, all the samples were acquired between 2019 and 2021 in two different laboratories in Poland and in Italy and were recently made public [9]. Table 2 reports the substances used. The dataset consists of 10 acquisitions for each substance (including the WW or background) and was obtained using the measurement protocol described below.

B. THE DATASET
To create the dataset used in the experimental part of this paper, we employed a measurement system composed of a PC as control device, a micro-controller (ESP8266, https://www.wemos.cc/en/latest/d1/d1_mini.html) which manages the communication between the PC and the multi-sensor system, and the SCW, which acquires the sensors' signals. The different substances have been injected into a beaker containing 300 ml of WW, where the SCW is immersed. A magnetic stirrer was used to simulate the movement of the WW, ensuring the same conditions for each measurement session. The rotation of the 25 mm long anchor was set at 50 rpm so as to reduce the presence of air bubbles that could make the measurement noisy (turbulent regime). The acquisition of the samples present in the dataset was carried out according to a measurement protocol divided into two steps: 1) initially, 600 samples are collected in WW, in warm-up mode; 2) subsequently, the substance of interest is injected, and an additional 1000 samples are collected.
This protocol was repeated ten times for each substance (ten acquisitions for each substance). A total of 1600 samples are therefore collected for each acquisition of each substance, with a sampling period of about 1.6 seconds, for an overall run time of about 40 minutes per acquisition (Figure 5).

V. EXPERIMENTS

A. PROBLEM FORMULATION
We have the measurements obtained on 10 substances plus the background substance (i.e., WW). To obtain stable results and limit the effects of randomness and bias, we adopt the k-fold validation methodology. Specifically, we implement a 10-fold validation process in which the acquisition used for testing rotates from 1 to 10, while the remaining acquisitions are used for training. This approach ensures that the samples from every acquisition are used for testing exactly once and that the overall performance metrics are averaged across all 10 test sets, yielding a robust evaluation of the models' effectiveness. For each substance we have 600 samples collected in the so-called ''warm-up mode'', which means that in that period of time the monitoring sensors have been exposed to WW only. Then, we also have 1000 samples collected after substance injection, which means that in this period of time the sensors have been exposed to both wastewater and the specific substance (if injected). Following the k-fold technique detailed above, for each substance (including WW), 9 acquisitions have been used as the training set and 1 acquisition has been used as the test set. The effectiveness of the models is thus evaluated on the average over the test acquisitions, and only on the 1000 samples collected after the warm-up phase. Details on the composition of the dataset folds can be found in Table 3.
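The fold rotation can be sketched as follows. This is only an illustration of the scheme described above, assuming acquisitions are indexed from 0 to 9 and that the same acquisition index is held out for every class; the helper name `make_folds` and the constant names are hypothetical, and the actual fold composition is the one given in Table 3.

```python
N_CLASSES = 11          # 10 substances plus the WW background
N_ACQUISITIONS = 10     # acquisitions per class
WARMUP, POST = 600, 1000


def make_folds(n_acq=N_ACQUISITIONS):
    """Yield (train_ids, test_id): one acquisition index held out per fold."""
    for test_id in range(n_acq):
        train_ids = [a for a in range(n_acq) if a != test_id]
        yield train_ids, test_id


folds = list(make_folds())

# Only post-injection samples of the held-out acquisition are scored.
test_samples_per_fold = N_CLASSES * POST
# Raw samples in the whole dataset, warm-up included.
total_samples = N_CLASSES * N_ACQUISITIONS * (WARMUP + POST)
```

Holding out whole acquisitions, rather than random samples, keeps temporally adjacent and therefore highly correlated samples of the same run from appearing in both the training and the test set.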

B. METRICS
To evaluate the performance of the model, we rely on different effectiveness metrics, aimed at measuring different aspects of the model's effectiveness. We use the following notation for the sets of correctly and incorrectly identified samples:
• TP (True Positives): samples correctly identified as belonging to a substance of interest.
• FP (False Positives): samples collected in WW, but classified as one of the 10 substances of interest.
• FN (False Negatives): samples collected in the presence of a substance of interest but classified as WW.
Furthermore, we use the following notation to denote the metric averaging method:
• $\mathrm{metric}_m$: micro-averaged metric; we aggregate the contributions of all classes to compute the average metric.
• $\mathrm{metric}_M$: macro-averaged metric; we compute the metric independently for each class and then take the average, hence treating all classes equally.
• $\mathrm{metric}_W$: weighted metric; each class contribution is weighted by the relative number of samples available for such a class.
We also compute the following metrics:
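The three averaging schemes listed above can be illustrated with a small Python sketch that takes F1 as the example metric. It assumes the usual multi-class convention in which TP, FP, and FN are counted per class in a one-vs-rest fashion from a confusion matrix, which is a simplification of the WW-centered definitions given earlier; the function name `f1_averages` and the toy matrix are illustrative only.

```python
def f1_averages(confusion):
    """Return (micro, macro, weighted) F1 from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    Per-class counts are one-vs-rest (an assumption of this sketch).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    f1s, supports = [], []
    tp_sum = fp_sum = fn_sum = 0
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        supports.append(sum(confusion[k]))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)   # pooled counts
    macro = sum(f1s) / n                                  # classes equal
    weighted = sum(f * s for f, s in zip(f1s, supports)) / total
    return micro, macro, weighted


# Toy 3-class matrix with unbalanced class sizes (not data from the paper).
toy = [[90, 10, 0],
       [5, 40, 5],
       [0, 0, 10]]
micro_f1, macro_f1, weighted_f1 = f1_averages(toy)
```

With unbalanced classes the three values differ, because the macro average gives the smallest class the same weight as the largest one, whereas the micro and weighted averages are dominated by the most populated classes.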

C. BASELINE METHODS
A set of learning algorithms was selected and applied to the collected dataset to compare the proposed solution with standard Machine Learning techniques and obtain a reference baseline. The choice was made to have a sufficiently exhaustive representation of different approaches with different complexity. As a result, we adopted algorithms belonging to the following categories: boosting, bagging, tree-based, instance-based, kernel-based, Artificial Neural Networks, and ensemble classifiers. As for boosting, we selected AdaBoost [50] with different types of weak classifiers: decision stump, J48 tree [51], and a more complex Random Forest [52]. For Bagging [53], we selected REPTree, a simple tree learner that uses the information gain heuristic to choose an attribute and a binary split on numeric attributes (faster than C4.5) [54]. For decision tree-based algorithms, we chose Random Forest [52]. The classic k-nearest neighbors algorithm (KNN) [55] was selected for the instance-based algorithms, Support Vector Machines (SVM) [56] for the kernel-based algorithms, and a Multi-Layer Perceptron (MLP) [57] for the Artificial Neural Networks category. Finally, a majority vote among MLP, KNN, SVM, and Random Forest was used for the ensemble-based category.
The selected ML algorithms were preliminarily optimized through a grid search on their respective hyperparameters. Table 4 shows, for each algorithm, which hyperparameters were selected. The WEKA (Waikato Environment for Knowledge Analysis, version 3.8.6) implementation of these algorithms was used in the experimental phase [58]. The input of the ML algorithms is the feature vector calculated in Equation 2, which therefore contains the instantaneous measurement of all the sensors identified in Table 1, following normalization with respect to the baseline. Note that, in order to maintain a fair comparison, the baselines and the proposed model (discussed in the next sections) were trained on the identical dataset using the same features; this ensures that any observed performance differences are attributable to the algorithms rather than to data or feature discrepancies.

D. TEXTIFICATION OF THE INPUT
In contrast to the baseline methods, our proposed T5 model requires textual input to be trained for natural language generation. For this reason, we first need to define a methodology to describe the input features (i.e., the observations of the set of sensors) in natural language. To this aim, we rely on the so-called ''textification'' (or prompting) of the input features, an approach that has been successfully applied in the medical domain, in particular in the automatic encoding and prediction of diagnostic texts [19], [20], [21], as well as in human mobility forecasting [23], [24].
This transformation essentially takes an array of floating point values corresponding to the input features and translates it into a text, which then serves as the input of our model.
Our approach works as follows. First, let us recall that each measurement is made of 1600 timestamps, with $t \in [1, 600]$ indicating the warm-up phase where the sensors are exposed to wastewater only, the injection happening at $t = 600$, and $t \in [601, 1600]$ indicating the phase after injection, where the sensors are exposed to wastewater and the substance.
For each acquisition (out of the 10 present in each training set), we sample two timestamps: $t_b$ in the warm-up phase, and $t_a$ in the phase after injection. Then, for each of the sensors, we create a piece of text according to a fixed pattern. We repeat the process for all the acquisitions present in our dataset. Note that the proposed methodology allows sampling multiple $t_b$ points to predict the substance present at $t_a$; in this case, the final prediction is obtained by taking the mode (i.e., majority voting) over the different predictions made by the model for the same acquisition.
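To make the idea concrete, the sketch below samples a warm-up and a post-injection timestamp and renders the corresponding normalized readings as one sentence. The sentence template, the sensor labels, and the function names `sample_pair` and `textify` are hypothetical and chosen only for illustration; they should not be read as the pattern actually used to train the model. Only the ''predict:'' prefix is taken from the description of the model input.

```python
import random


def sample_pair(rng, warmup=600, total=1600):
    """Sample t_b in the warm-up phase and t_a in the post-injection phase."""
    t_b = rng.randint(1, warmup)            # inclusive bounds
    t_a = rng.randint(warmup + 1, total)
    return t_b, t_a


def textify(sensor_names, f_before, f_after):
    """Render two normalized readings as text (hypothetical template)."""
    parts = [
        f"{name} was {b:.3f} before injection and {a:.3f} after injection"
        for name, b, a in zip(sensor_names, f_before, f_after)
    ]
    return "predict: " + "; ".join(parts)


rng = random.Random(0)
t_b, t_a = sample_pair(rng)

# Hypothetical labels and values for two of the measured quantities.
names = ["platinum at 200 Hz", "gold at 78 kHz"]
prompt = textify(names, [1.0, 0.998], [1.25, 0.91])
```

Drawing several `(t_b, t_a)` pairs per acquisition in this way yields several training sentences for the same target substance, which is what makes the sampling usable as a simple form of data augmentation.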

E. MODEL TRAINING AND INFERENCE
We develop our model using the PyTorch and HuggingFace frameworks. All the data and code used in the paper are made available at: (to be inserted upon acceptance.) We rely on the T5-base model, which is composed of encoder and decoder stacks comprising 12 blocks each. Each block contains self-attention mechanisms, optional encoder-decoder attention, and a feed-forward network. Each attention head is of dimension 64, while embeddings are of dimension 768.
The final model has about 220 million parameters. We initialized the model weights with the pre-trained ones of the original T5 model. To feed the textual input to the model we used the custom prefix ''predict:'', and we used the strings ''input:'' and ''target:'' to discern between the model input and the target. We train the model for 3 epochs on a Linux server equipped with 16x Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz, 64GB of RAM, and 2x NVIDIA GeForce RTX 3090 GPUs. As objective we use the conventional multi-class cross-entropy loss function, where the number of classes is equal to the size of the vocabulary, defined as
$$\mathcal{L} = -\sum_{b=1}^{B} \sum_{i=1}^{|V|} y_i^{b} \log \hat{y}_i^{b}$$
where the superscript $b$ represents the batch and $B$ is the batch size, $|V|$ is the vocabulary size, $y$ represents the true token to be predicted, and $\hat{y}$ is the output probability distribution over the vocabulary at each time-step.
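As a small worked example of this objective, the following Python sketch computes the token-level cross-entropy for a toy vocabulary at a single decoding step. Because the target is a one-hot vector, the inner sum over the vocabulary reduces to the negative log-probability of the true token. The function names and the `reduction` argument are assumptions of this sketch; whether the loss is summed or averaged over the batch depends on the training framework configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, batch_targets, reduction="sum"):
    """Cross-entropy over a batch, one decoding step per item.

    batch_logits[b] holds |V| scores; batch_targets[b] is the index of
    the true token. With one-hot targets only the true-token term survives.
    """
    losses = []
    for logits, target in zip(batch_logits, batch_targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total


# Toy vocabulary of 4 tokens, batch of 2 items (illustrative numbers).
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
batch_loss = cross_entropy([[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [0, 1])
```

A model that spreads its probability uniformly over the vocabulary pays a loss of log |V| per token, while a confident and correct prediction contributes a value close to zero.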
To perform inference, we generate the output text using beam search, producing the output sequence token by token: the encoded input is fed to the decoder via cross-attention layers, and the decoder output is generated auto-regressively. We set the early stopping parameter to true so that the beam generation is finished when all beam hypotheses have reached the EOS (End-of-Sequence) token. Experimentally, we found that our fine-tuned model generates substance names for each beam, so there was no need to implement a constrained beam search to force the model to output only correct strings (i.e., only one of the 10 substance names or WW). Since we can augment the training set by sampling multiple timestamps for each acquisition, we aggregate the final predictions of the model using majority voting (i.e., the mode function) to have a single prediction for each $t_a$.
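The aggregation step can be sketched in a few lines of Python. The function name `majority_vote` is illustrative, and the tie-breaking rule (the label generated first wins) is an assumption of this sketch, since the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Predictions generated for the same t_a from different warm-up samples.
votes = ["ammonia", "sodium hypochlorite", "ammonia", "ammonia", "sodium hypochlorite"]
final = majority_vote(votes)
```

Voting over several generated strings lets occasional confusions between similar substances be outvoted, at the cost of one additional generation per sampled warm-up timestamp.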

VI. RESULTS
A. EFFECTIVENESS
Table 5 shows the effectiveness metrics computed considering the average effectiveness score over each test set for the baselines (upper part of the table) and for the proposed approach (lower part of the table). Given that the proposed methodology can be provided in input with multiple $t_b$ timestamps (see Section V-D), the lower part of Table 5 shows the different effectiveness scores computed when sampling 1, 2, 5, 10, and 100 $t_b$ timestamps and aggregating the predictions of the model.
As we can see by inspecting the table as a whole, the proposed methodology outperforms the whole set of baselines for all the considered metrics, even in the most restrictive case where only one $t_b$ timestamp is provided (i.e., T5 (1-sample)). This behavior is also visually shown in Figure 6, which shows the value of the $F1_W$ metric on the test sets: the different models are arranged along the x-axis, while the y-axis shows the metric value, and the horizontal dashed line represents the performance of the best baseline method. The bars on top of each value represent the variance of the metric over the different folds. The plots for the other metrics are similar and thus not reported.
[...] Table 5. The confusion matrix reports the distribution of test sample predictions, displaying the real substances as rows and the predicted substances as columns. As we can see from the matrix, the model classifies almost all substances correctly (the values on the diagonal are close to 1000), with the exception of ammonia and phosphoric acid, where the model reaches an accuracy of about 0.8. In particular, we see that ammonia is often mistaken for sodium hypochlorite (171 times out of 1000) and phosphoric acid is mistaken for acetic acid (208 times out of 1000). By investigating the dataset, we assume that this is probably caused by the fact that the signals acquired by the sensors show similar trends within each of the two pairs of substances. In particular, the [...]
By focusing on the lower part of Table 5, we see that there is a correlation between the number of timestamps measured before the injection and the model effectiveness. In more detail, we can see that, overall, the more timestamps $t_b$ we provide to the model, the higher the effectiveness scores. This behavior is also shown in Figure 8, which displays on the x-axis the number of timestamps fed into the model, and on the y-axis the value of the different effectiveness metrics.
This result suggests that practically, we should provide the model with multiple samples from the warm-up phase to get a more accurate prediction. This is due to the fact that seeing more timestamps during warm-up allows the model to better capture the signal variance that might happen before the substance is injected into the wastewater.

B. EFFICIENCY
The inference time has been measured over repeated experiments on an NVIDIA GeForce RTX 3090 GPU. With batching disabled, it takes the system on average 76 milliseconds to perform one prediction (i.e., one full text-generation performed with beam search and the early stopping parameter enabled, see Section V for more details on the inference process) for a timestamp, or 76 seconds for 1000 timestamps. The system can therefore output approximately 13 predictions per second. We can further reduce the inference time by performing batched inference (the GPU used for the experiments allows for batch sizes greater than 512 samples).
We also measured the inference time on the available CPUs (16x Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz). It takes the system on average 450 milliseconds to process one sample.
Such results provide evidence that we can use the trained model and deploy it for real-time predictions, both in an environment equipped with a GPU as well as in a machine that is only powered by CPUs.
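The arithmetic behind this claim can be checked with a short Python sketch that compares the measured latencies with the sampling period of about 1.6 seconds reported for the acquisition protocol. It assumes one unbatched prediction per incoming sensor sample; the constant and variable names are illustrative.

```python
GPU_LATENCY_S = 0.076      # mean time per prediction on the GPU (from the text)
CPU_LATENCY_S = 0.450      # mean time per prediction on the CPU (from the text)
SAMPLING_PERIOD_S = 1.6    # approximate time between two sensor samples

gpu_throughput = 1.0 / GPU_LATENCY_S            # predictions per second
time_for_1000 = 1000 * GPU_LATENCY_S            # seconds for 1000 timestamps

# How many unbatched predictions fit between two consecutive samples.
gpu_headroom = SAMPLING_PERIOD_S / GPU_LATENCY_S
cpu_headroom = SAMPLING_PERIOD_S / CPU_LATENCY_S
```

Under this assumption both devices finish a prediction well before the next sample arrives. When many warm-up timestamps are aggregated per prediction, the unbatched cost grows proportionally with their number, so batched inference becomes the relevant option.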
In addition to the inference efficiency, which is crucial for deploying a model in real-world scenarios, it is important to consider the computational cost and complexity of the training phase in the proposed approach. T5 and other causal models have millions or even billions of parameters, which necessitate a large amount of data and computational power to optimize. These models are typically trained using the standard transformer architecture, which has a complexity of $O(N^2 d)$, where $N$ is the sequence length and $d$ is the hidden dimension of the model [59]. This means that the number of operations required to train the model grows quadratically with the input sequence length and linearly with the model hidden dimension.
Despite the high computational cost, pretrained models can be repurposed for various tasks through finetuning, which involves adapting a pretrained model to a specific task by training it on a small amount of task-specific data. This process is considerably less expensive than training the model from scratch because most pretrained models are already optimized for the underlying language modeling task. In practice, finetuning a pretrained model usually entails training it for just a few epochs, typically 3, which can take anywhere from a few minutes to a few hours, depending on the size of the task-specific dataset. For this work, we obtained the model's weights from the HuggingFace library and conducted only the finetuning phase of the proposed approach, which took approximately one hour on the GPU architecture described above. Once finetuned, the model can be used indefinitely for the specific task discussed in this paper.

VII. DISCUSSION AND CONCLUSION
In this paper we studied the capabilities of natural language processing models, especially generative causal models and, more specifically, T5, for the task of detecting the presence of polluting substances in wastewater. To this end, unlike state-of-the-art machine learning models, we applied a transformation of the input features called textification in order to translate them into a textual form and be able to feed them into a generative natural language model. The latter is trained to classify each sample based on whether or not it contains a polluting substance, and to identify the substance if present. We experimentally evaluated the proposed methodology by testing its effectiveness against a set of state-of-the-art baselines, and we measured its efficiency. Experimental results show that the proposed methodology outperforms the baseline methods, and that its efficiency and effectiveness allow for its deployment and practical use.
Given that the proposed approach is unconventional and might seem strange or counter-intuitive at first sight, we now discuss why such an approach makes sense and works in practice. Recent work has demonstrated the vast ability of transformers and attention-based models to generalize to a large variety of tasks, including those on which the model has not been trained [60], [61], [62], [63], and even tasks not directly related to, or not naturally expressed as, natural language processing, such as those involving images [26], [27], videos [29], reinforcement learning [39], and graphs [40].
The generalization ability of transformer-based models comes from the attention mechanism and from the almost task-agnostic training procedure. In its base form, this procedure consists of reconstructing part of the input item, which has been masked or perturbed using domain-specific techniques, or of predicting the continuation of the input (if the masked part is the last part of the input). Combined, these techniques allow the model to learn meaningful and, most importantly, general latent relationships in input sequences, and to relate those to the network's output. For example, networks applied to text show the ability to reconstruct missing text or generate it from a prompt; for images and videos, the ability to reconstruct corrupted or missing images and frames; for graphs, the ability to learn complex graph substructures (i.e., arrangements of sets of nodes and edges); and so on. Besides those specific abilities, networks based on transformers and trained with masking or causal objectives (i.e., predicting masked parts or predicting the continuation of the input) show high generalization abilities across tasks and domains. For the same reason, we believe that the textual description gathered from the sensors, which we use to train our neural network, allows for accurate predictions of the possible polluting substances present in wastewater.
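The masking and continuation objectives described above can be illustrated with a minimal sketch. The sentinel names follow the convention of T5's vocabulary (<extra_id_0>, <extra_id_1>); the whitespace tokenization, the single masked span, and the example sentence are simplifying assumptions made here for illustration.

```python
from typing import List, Tuple


def mask_span(tokens: List[str], start: int, length: int) -> Tuple[str, str]:
    """Replace tokens[start:start+length] with a sentinel.

    Returns the (input, target) text pair for a reconstruction objective.
    """
    sentinel = "<extra_id_0>"
    end_sentinel = "<extra_id_1>"
    masked = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length] + [end_sentinel]
    return " ".join(masked), " ".join(target)


# Hypothetical sentence, split on whitespace for simplicity.
tokens = "the sample contains a polluting substance".split()

# Masking a span in the middle: the model must reconstruct it.
model_input, model_target = mask_span(tokens, start=2, length=2)
print(model_input)   # the sample <extra_id_0> polluting substance
print(model_target)  # <extra_id_0> contains a <extra_id_1>

# Masking the final span: the objective becomes predicting the continuation.
prefix, continuation = mask_span(tokens, start=4, length=2)
print(prefix)        # the sample contains a <extra_id_0>
print(continuation)  # <extra_id_0> polluting substance <extra_id_1>
```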
Despite the promising results obtained with our approach, some limitations need to be reported. One of the main limitations is that the proposed approach relies on the knowledge of the injection time of the polluting substances. This means that if the injection time is not known, the system may not be able to accurately classify the contaminants in wastewater. In this paper, we solved this issue by relying on a finite state machine that is able to accurately identify the injection time. Nevertheless, in future research we will focus on developing integrated methods to overcome this limitation and deploy a single integrated system.
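To give an intuition of how a finite state machine can locate an injection instant, the following is a minimal sketch under stated assumptions. It is not the state machine developed in this work: the states, the threshold rule, the persistence parameter `hold`, and the example signal are all hypothetical.

```python
from typing import Optional, Sequence


def detect_injection(signal: Sequence[float], threshold: float, hold: int) -> Optional[int]:
    """Return the index at which the signal first stays above `threshold`
    for `hold` consecutive samples, or None if that never happens."""
    state = "IDLE"
    candidate = 0   # index at which the current excursion started
    count = 0       # consecutive samples above the threshold
    for i, value in enumerate(signal):
        if state == "IDLE":
            if value > threshold:
                state, candidate, count = "CANDIDATE", i, 1
        elif state == "CANDIDATE":
            if value > threshold:
                count += 1
            else:
                state = "IDLE"   # excursion too short: treat it as noise
        if state == "CANDIDATE" and count >= hold:
            return candidate     # excursion confirmed: injection detected
    return None


# Hypothetical signal: one isolated spike, then a sustained rise from index 4.
readings = [0.1, 0.9, 0.1, 0.2, 0.8, 0.9, 1.0, 1.1]
injection_index = detect_injection(readings, threshold=0.5, hold=3)
print(injection_index)  # 4
```

Requiring several consecutive samples above the threshold is one simple way to avoid reacting to isolated spikes; the appropriate rule in practice depends on the sensors and on the noise level.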
Another limitation of the proposed approach is related to the availability of data, since a certain amount of labeled data is needed to train the deep learning model. Obtaining such data requires access to polluting substances or contaminated wastewater, and this can be difficult in practical situations. Future work will investigate alternative ways to generate synthetic data or explore transfer learning techniques to mitigate the data scarcity issue.
The results of this work open a new research direction that will make it possible to tackle environmental tasks, such as the analysis and detection of polluting substances, by means of language models. Future work aims precisely at pursuing a broad adoption of natural-language-based models in a variety of domains and tasks related to the identification of substances. Furthermore, it will also focus on studying the generalization and explanation abilities of the model by leveraging zero- and few-shot learning techniques as well as interpretability frameworks.
KEVIN ROITERO is currently an Assistant Professor (RTD) with the University of Udine, Italy. His research interests include information retrieval, crowdsourcing, data mining and analysis, and machine learning and artificial intelligence. He has visited and collaborated with multiple top universities across the globe and leading industry partners, publishing papers in top-ranked conferences and top-tier journals. As a result of his work, he has received multiple grants and awards, including participation in the seventh edition of the prestigious Heidelberg Laureate Forum (top 200 young researchers in mathematics and computer science), the "con.Scienze2020" prize for the best Ph.D. thesis in computer science discussed in Italy in 2020, the Best Short Paper Award at "ECIR2020," and the Best Paper Award at "NL4AI 2022."