Low-Cost CO Sensor Calibration Using One Dimensional Convolutional Neural Network

The advent of cost-effective sensors and the rise of the Internet of Things (IoT) present an opportunity to monitor urban pollution at high spatio-temporal resolution. However, these sensors suffer from poor accuracy, which can be improved through calibration. In this paper, we propose One Dimensional Convolutional Neural Network (1DCNN)-based calibration for low-cost carbon monoxide sensors and benchmark its performance against several Machine Learning (ML) based calibration techniques. We make use of three large datasets collected by research groups around the world from field-deployed low-cost sensors co-located with accurate reference sensors. Our investigation shows that 1DCNN performs consistently across all datasets. Gradient Boosting Regression, another ML technique that has not been widely explored for gas sensor calibration, also performs reasonably well. For all datasets, the introduction of temperature and relative humidity data improves the calibration accuracy. Cross-sensitivity to other pollutants can be exploited to improve the accuracy further. This suggests that low-cost sensors should be deployed as a suite or an array to measure covariate factors.


Introduction
Urban air pollution has been linked to adverse effects on the environment, public health, and quality of life [1]. Therefore, there is a concerted effort to alleviate the effects of air pollution [2]. Monitoring air pollution can raise awareness among the general public and subsequently lead to a sustainable urban environment [3]. Conventional air quality monitoring systems typically involve the deployment of a small number of expensive stationary stations [4]. While the data from these stations are accurate, their poor spatial resolution hinders the generation of robust, city-wide air quality data. Low-cost sensors have been identified as an option to supplement the information captured by conventional air quality monitoring systems [4,5]. Many countries around the world [6-8] have started to adopt this approach to monitor urban pollutants at high spatial resolution.
Multilayer Perceptron (MLP): a classical neural network trained with backpropagation to develop a model that uses several explanatory variables (e.g., temperature, relative humidity) to compute the calibrated sensor output [25,27,28,37-39,43,44]. Recurrent Neural Network (RNN): a neural network that extracts the sequential characteristics of the data and uses the backpropagation-through-time algorithm to develop a model that uses several explanatory variables (e.g., temperature, relative humidity) to compute the calibrated sensor output [37-40,45,46].

Our literature review shows that among the NN-based techniques, the One Dimensional Convolutional Neural Network (1DCNN) has not been well investigated for low-cost gas sensor calibration. 1DCNN has demonstrated excellent performance in a variety of applications (e.g., indoor localization [47], human activity recognition [48], and time series forecasting [49]). However, there are only two reports [50,51] of 1DCNN being utilized for the calibration of air pollution monitors. Kureshi et al. [50] employed it for the calibration of Particulate Matter (PM) sensors. In a recent publication that investigated the impact of the pandemic on air quality, Vajs et al. [51] employed 1DCNN to calibrate low-cost NO2 sensors. However, they did not benchmark its performance against any other ML techniques; therefore, its (comparative) efficacy cannot be ascertained. Similarly, Gradient Boosting Regression (GBR), an ensemble learning technique, has also not been widely utilized for gas sensor calibration, although it has shown good performance in other applications (e.g., PM sensor calibration [52] and prediction and forecasting [53,54]). Bagkis et al. [41] report the only work that employed GBR for gas sensor calibration.
However, their work mainly focused on temporal drift correction, and the performance of GBR was not benchmarked against sophisticated techniques such as NNs.
Contribution Statement:
• This paper proposes applying 1DCNN and GBR for calibrating low-cost CO sensors. As far as we know, this is the first work to benchmark these algorithms against NN-based algorithms.
• Furthermore, this work, in contrast to most studies reported in the literature, evaluates the calibration models across multiple datasets, enabling us to draw more robust conclusions.
• We show that 1DCNN-based calibration is consistently accurate compared to several Machine Learning (ML) based techniques across three large CO datasets.
• We also highlight that GBR, an ML technique that has not been investigated widely for low-cost gas sensor calibration, performs quite accurately for all three datasets.

Datasets
Table 2 provides a summary of the three datasets collected from two locations in Italy and one in China. These datasets are multi-sensor, but we have focused on the calibration of the CO sensor, as this gas is an essential component of the Air Quality Index (AQI) [55], and both the raw (from low-cost sensors) and reference CO data are available for all three deployments. We found that the data collected by the cost-effective multi-sensor devices and reference sensors have missing samples. Previous research has found evidence of cross-sensitivity in these gas measurements (e.g., see [21]). Therefore, for any given instant, all pollutant (and temperature and relative humidity) data need to be available from the cost-effective sensor alongside the reference CO data for multivariate calibration. As a result, we removed the readings of select time instants from each dataset if any pollutant data from the cost-effective sensors or the CO ground truth data were missing. Please see Figure 1 for the CO ground truth distributions and temperature/relative humidity data for all three datasets. The World Health Organization (WHO) recommended limits for CO exposure are no more than 9-10 ppm over 8 h, 25-35 ppm over 1 h, and 90-100 ppm over 15 min [56]. As can be seen, the CO concentrations at all three monitoring sites are lower than these thresholds.

Dataset 1
The dataset was recorded by a multi-sensor device [37] containing an array of five low-cost MOX sensors that measured CO, NO2, O3, Non-Methane Hydrocarbons (NMHC), and NOx, along with temperature (T) and relative humidity (RH). It includes 9357 samples of hourly averaged responses recorded between 10 March 2004 and 4 April 2005 in the Lombardy Region, Italy. Ground truth was provided by a co-located certified reference analyzer, a conventional monitoring station with a spectrometer [37] that supplied hourly averaged CO concentrations. After removing missing data points, we were left with 6941 samples for each pollutant, T, RH, and the CO ground truth. More details of the dataset can be found in [17,21].

Dataset 2
This dataset includes the responses of a MONICA multi-sensor device [44] deployed in the Italian city of Naples. The gas sensing hardware consists of an array of electrochemical (EC) gas sensors measuring CO, NO2, and O3, along with T and RH. Hourly averaged responses were recorded along with reference CO concentrations from a certified analyzer (Teledyne 300, manufactured by Teledyne API). After discarding the missing data, a total of 13,595 samples collected over 31 months (5 April 2018 to 24 November 2020) are available. More details of the dataset can be found in [28]. It should be noted that the auxiliary electrode data of the CO sensor is available for the MONICA device and has been utilized during calibration.

Dataset 3
This dataset was recorded by a Sniffer4D multi-sensor device [29] deployed in the Chinese city of Guangzhou. The array of EC gas sensors measured CO, NO2, and O3, along with T and RH. A total of 3450 samples of hourly averaged data collected over a span of six months between 1 October 2018 and 1 March 2019 are utilized, along with reference CO concentrations collected from a certified analyzer (Thermo Scientific 48i-TLE). More details of the dataset can be found in [29]. Please note that this dataset is also available at a higher, per-minute sampling rate.
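The missing-data removal applied to all three datasets (dropping any time instant where a pollutant channel, T, RH, or the CO ground truth is absent) can be sketched in a few lines. The column layout and values below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical layout: rows are hourly samples, columns are
# [CO_raw, NO2_raw, O3_raw, T, RH, CO_reference]; NaN marks a missing reading.
samples = np.array([
    [1.2, 40.0, 25.0, 18.5, 60.0, 1.1],
    [np.nan, 42.0, 26.0, 18.0, 61.0, 1.3],   # CO_raw missing -> drop
    [1.4, 41.0, np.nan, 17.5, 63.0, 1.2],    # O3_raw missing -> drop
    [1.5, 43.0, 27.0, 17.0, 64.0, np.nan],   # ground truth missing -> drop
    [1.6, 44.0, 28.0, 16.5, 65.0, 1.5],
])

def drop_incomplete(rows: np.ndarray) -> np.ndarray:
    """Keep only time instants where every sensor channel and the
    reference reading are present (no NaN anywhere in the row)."""
    complete = ~np.isnan(rows).any(axis=1)
    return rows[complete]

clean = drop_incomplete(samples)
print(clean.shape[0])   # 2 complete samples remain
```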

Methodology
The calibration was framed as a supervised regression problem such that

CO_calibrated = Φ(CO_raw, X)

Here, CO_calibrated is the calibrated CO reading computed from the raw CO reading of the sensor (CO_raw) and X, which comprises covariate factors such as T, RH, and other pollutant readings from the sensor array (e.g., uncalibrated NO2 and O3 readings from the low-cost sensor array). For Dataset 2, CO_raw includes both the working electrode and auxiliary electrode data. Φ is the regression model whose parameters are derived from the training data to minimize the Mean Square Error (MSE) between the calibrated output and the ground truth from the reference CO sensor.

The training set is a subset of the dataset. We have considered two different Train Test Splits (TTS) for this study. In TTS1, each of the three datasets is split so that 90% of the data is used to train (and validate, as discussed later) the calibration model, while the remaining 10% is used to evaluate the performance of the trained model. This 90/10 split represents the scenario where a co-located low-cost sensor is used as a backup in case the reference-grade monitor is out of commission for a short period due to a fault or maintenance. In TTS2, the train/test split is 20/80. This emulates a scenario where a low-cost sensor is co-located with a reference sensor for a set period for calibration and afterward deployed in the field to monitor pollutants at locations where no reference AQM station is available. It should be noted that for both train test splits, we have used consecutive samples: the first 90 or 20 percent of samples were used for training, and the remaining data were used for testing. This imitates a practical scenario where the sensor is co-located with the reference for a set period of time for calibration and then taken for field deployment. It also allows the calibration algorithms to exploit the temporal correlation between contiguous samples.
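The consecutive (time-ordered) train/test splits described above can be sketched as follows; the 100-sample array is a stand-in for any of the three datasets:

```python
import numpy as np

def consecutive_split(data: np.ndarray, train_fraction: float):
    """Split time-ordered samples into a leading training block and a
    trailing test block, preserving temporal order (no shuffling)."""
    n_train = int(len(data) * train_fraction)
    return data[:n_train], data[n_train:]

samples = np.arange(100)                          # stand-in for 100 hourly readings
train1, test1 = consecutive_split(samples, 0.9)   # TTS1: 90/10
train2, test2 = consecutive_split(samples, 0.2)   # TTS2: 20/80

print(len(train1), len(test1))   # 90 10
print(len(train2), len(test2))   # 20 80
```

Because the split is by position rather than by random shuffling, the test block always lies strictly after the training block in time, mimicking the co-locate-then-deploy workflow.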
Three different regression cases were considered for each of the ML algorithms.
Scenario 1 (SC1)
This involves deriving a regressor or calibration model so that

CO_calibrated^SC1 = Φ_SC1(CO_raw)

The regressor, Φ_SC1, is derived solely from the raw CO sensor input to minimize the MSE between CO_calibrated^SC1 and the ground truth. For Datasets 1 and 3, CO_raw represents the working electrode data of the low-cost CO sensor. For Dataset 2, CO_raw comprises both the working and auxiliary electrode data.

Scenario 2 (SC2)
The second case introduces temperature and relative humidity readings as part of the input so that

CO_calibrated^SC2 = Φ_SC2(CO_raw, T, RH)

The regressor, Φ_SC2, is now derived from three input variables (raw CO sensor data, temperature, and relative humidity) to minimize the MSE between CO_calibrated^SC2 and the ground truth. Accurate T and RH sensors are inexpensive, and it is reasonable to expect these readings to be available for any deployment. As mentioned before, the literature suggests that low-cost gas sensor operation is affected by T and RH; introducing them into a multivariate calibration strategy is therefore the next logical step.

Scenario 3 (SC3)
Cross-sensitivity is a known issue with low-cost gas sensors. However, this dependency can also be exploited to improve the calibration if the covariate pollutant data is available. In fact, it is quite common to construct and deploy a sensor array consisting of multiple pollutant sensors (as was the case in the three deployments that produced the datasets used in this research). Therefore, the last case further introduces the other pollutant readings from the sensor array as part of the input, which leads to

CO_calibrated^SC3 = Φ_SC3(CO_raw, T, RH, P_raw)

where P_raw denotes the raw readings of the other pollutants in the array (e.g., NO2 and O3).
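The three scenarios differ only in the input matrix handed to the regressor. A minimal sketch of the feature assembly (the arrays below are random stand-ins for real sensor channels):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                 # toy number of hourly samples
co_raw = rng.normal(size=(n, 1))      # raw CO working-electrode signal
t_rh   = rng.normal(size=(n, 2))      # temperature and relative humidity
others = rng.normal(size=(n, 2))      # e.g., raw NO2 and O3 channels

X_sc1 = co_raw                               # SC1: CO only
X_sc2 = np.hstack([co_raw, t_rh])            # SC2: CO + T + RH
X_sc3 = np.hstack([co_raw, t_rh, others])    # SC3: all covariate inputs

print(X_sc1.shape, X_sc2.shape, X_sc3.shape)   # (8, 1) (8, 3) (8, 5)
```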

Machine Learning Algorithms
Convolutional Neural Networks (CNNs) have become a popular machine learning technique during the last decade [57]. Conventional CNNs are mainly designed to process two-dimensional (2D) data, e.g., videos and images [58]. This structure can be modified as a 1DCNN to deal with one-dimensional signals [59-61]. 1DCNN algorithms have lower computational complexity, a compact structure (1-2 hidden CNN layers), and shorter training times, making them suitable for low-cost, real-time applications compared to their 2D counterparts [58]. In this paper, we propose to use a 1DCNN-based regressor for the calibration of low-cost CO sensors, as 1DCNN is well suited to time series. Figure 2 shows an example of the 1DCNN structure used in our work. Please note that many parameters are determined through grid search and tuning; therefore, they vary depending on the dataset, scenario (input variables used), and train test split.
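The core operation of a 1DCNN layer is sliding a learned kernel over consecutive samples and applying a nonlinearity. A minimal NumPy sketch of one filter (the window and filter values below are illustrative, not learned weights from our models):

```python
import numpy as np

def conv1d_valid(signal: np.ndarray, kernel: np.ndarray, bias: float = 0.0):
    """'Valid' 1D convolution (cross-correlation, as implemented in deep
    learning frameworks): slide the kernel over consecutive time steps,
    then apply a ReLU nonlinearity (a common, illustrative choice)."""
    k = len(kernel)
    out = np.array([signal[i:i + k] @ kernel
                    for i in range(len(signal) - k + 1)])
    return np.maximum(out + bias, 0.0)

x = np.array([0.5, 1.0, 1.5, 1.0, 0.5, 0.0])   # toy window of raw-sensor samples
w = np.array([-1.0, 0.0, 1.0])                  # one filter: local trend detector
print(conv1d_valid(x, w))                       # [1. 0. 0. 0.]
```

Stacking several such filters (plus pooling and dense layers) yields the compact 1DCNN regressors described above.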

As mentioned in Section 1, we also develop calibration models using GBR, an ensemble learning technique that has not been widely utilized for gas sensor calibration.

ML Algorithms for Benchmarking
We trained and evaluated/benchmarked the calibration performance of 1DCNN and GBR alongside three other ML-based techniques commonly reported in the literature, as discussed in Section 1. These are:
• MLP, which has been employed in many reported works on gas sensor calibration. Please note that in the literature it is sometimes referred to as an Artificial Neural Network (ANN), Feedforward Neural Network (FNN), Back Propagation Neural Network (BPNN), or simply a Neural Network.
• Recurrent Neural Networks (RNNs), which recent literature suggests are well suited for sensor calibration due to their ability to exploit temporal correlation in the data. After some preliminary investigation, we selected Long Short-Term Memory (LSTM) as the RNN-based technique for our benchmark work.
• Random Forest Regressor (RFR), an ensemble learning technique that has shown good performance in several works on low-cost gas sensor calibration and was, therefore, also selected as a benchmark.
• Linear regression, the most commonly employed technique for calibrating low-cost gas sensors, which is therefore also utilized for benchmarking purposes.
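To illustrate the boosting principle behind GBR (stage-wise fitting of weak learners to the current residuals) here is a toy NumPy sketch using depth-1 regression stumps on a 1-D input. It is a conceptual illustration only, not the tuned GBR models evaluated in this work:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D input (illustrative)."""
    best = None
    for s in np.unique(x):
        left, right = residual[x <= s], residual[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= s, left.mean(), right.mean())
        err = ((residual - pred) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda q: np.where(q <= s, lv, rv)

def gbr_fit_predict(x, y, n_stages=50, lr=0.3):
    """Gradient boosting for squared loss: start from the mean and let
    each stage fit (a shrunken copy of) the remaining residual."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_stages):
        stump = fit_stump(x, y - pred)
        pred = pred + lr * stump(x)
    return pred

x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x)            # toy nonlinear target
pred = gbr_fit_predict(x, y)
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
print(rmse)   # training error far below the baseline std (~0.71)
```

Production implementations (e.g., tree-ensemble libraries) use deeper trees, column subsampling, and regularization, but the residual-fitting loop above is the essence of the technique.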
A rigorous training, validation, and testing approach has been followed in this work. All the regressors have hyperparameters, which were tuned on the relevant training sets and tested on the corresponding testing sets. Validation helps ensure that the chosen hyperparameters generalize; a k-fold (k = 10) cross-validation has been implemented in this work. Multiple models with various hyperparameters are trained on the training data, the trained models are evaluated on the validation folds, and the best-performing configuration is selected. The best-performing model is then trained using both the training and validation data and finally evaluated on the testing dataset. The following steps and Figure 3 provide a detailed description of this process:
Step 1: The dataset is split into training and testing datasets.
Step 2: The training is conducted using ten-fold cross-validation, where the training dataset is divided into ten equal-sized parts. Each time, nine of the ten parts are used to perform a grid search for hyperparameter tuning, and the result is evaluated against the remaining tenth part (validation). This process is repeated ten times, and the best hyperparameter combination across all ten evaluations is selected.
Step 3: The best-performing model is further trained using the entirety of the training dataset. This training is repeated ten times, and the average of the predicted outputs is calculated.
Step 4: The final output is evaluated on the (unseen) testing dataset by computing the performance metrics.
Table 3 lists the hyperparameters that were tuned for all the ML algorithms. The final hyperparameters for every calibration model can be found online (https://github.com/Sharafat-Ali/AirQualityResults, accessed on 19 December 2022). Given the extensive and differing number of hyperparameters used across the various ML algorithms, it is difficult to define them exhaustively in the manuscript; we have added a reference for interested readers [62]. Table 3. List of hyperparameters that were tuned for each ML-based algorithm.

Instead of letting training run for a set number of epochs, an early stopping method was used during the training-validation stage. This method allowed training to end once the model performance stopped improving on the validation set. The validation set's mean squared error (MSE) was monitored for each epoch. Training stopped when the MSE ceased to decrease by a certain tolerance threshold for a select number of epochs (the patience). The model weights with the minimum MSE within that patience window were taken as the final weights. Figure 4 shows an example of the training and validation losses for 1DCNN.
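The patience-based early stopping described above can be sketched as a small helper; the tolerance, patience, and loss values below are illustrative:

```python
def early_stop_epoch(val_losses, tolerance=1e-4, patience=5):
    """Return (stop_epoch, best_epoch): training halts once the validation
    MSE has not improved by more than `tolerance` for `patience` epochs;
    the weights from the best epoch within that window are kept."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - tolerance:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Toy validation-MSE trace: improvement stalls after epoch 3.
losses = [1.0, 0.6, 0.4, 0.35, 0.34, 0.345, 0.344, 0.346, 0.345, 0.347]
print(early_stop_epoch(losses, tolerance=0.01, patience=3))   # (6, 3)
```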

Performance Metrics
Several performance metrics have been used in this study to benchmark and evaluate the different calibration models. These metrics, in various ways, measure the residuals or errors, i.e., the deviations of the calibrated output of the low-cost sensors (CO_calibrated) from the ground truth (CO_reference) for the test data (10% or 80% of every dataset, depending on the split), which was never used for training.

The Root Mean Square Error (RMSE), which is the standard deviation of the residuals and is commonly used as a performance metric for sensor calibration [29,63-66], is computed as

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (CO_calibrated(i) − CO_reference(i))² )

where N is the number of samples in the relevant test set. Another metric utilized is the Coefficient of Determination (R²), the goodness of fit in regression analysis [67-69]. It is computed as

R² = 1 − Σ_{i=1}^{N} (CO_reference(i) − CO_calibrated(i))² / Σ_{i=1}^{N} (CO_reference(i) − mean(CO_reference))²

While the appropriateness of R² for determining the fit of nonlinear regressors has been questioned, it is still commonly used within the discipline of air pollutant measurement (e.g., see [14,15,40,66,68]).
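The metrics above, together with the bias/centered-error quantities used for the target diagrams, can be computed directly; the synthetic reference and calibrated series below are illustrative:

```python
import numpy as np

def rmse(cal, ref):
    return float(np.sqrt(np.mean((cal - ref) ** 2)))

def r2(cal, ref):
    ss_res = np.sum((ref - cal) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mbe(cal, ref):
    """Mean Bias Error."""
    return float(cal.mean() - ref.mean())

def crmse(cal, ref):
    """Centered RMSE: RMSE after removing the mean bias from both series."""
    return float(np.sqrt(np.mean(((cal - cal.mean()) - (ref - ref.mean())) ** 2)))

rng = np.random.default_rng(1)
ref = rng.normal(1.0, 0.5, size=500)           # synthetic "ground truth"
cal = ref + rng.normal(0.05, 0.1, size=500)    # calibrated output, small bias

# A useful consistency check: RMSE^2 = MBE^2 + CRMSE^2.
lhs = rmse(cal, ref) ** 2
rhs = mbe(cal, ref) ** 2 + crmse(cal, ref) ** 2
print(abs(lhs - rhs) < 1e-9)   # True
```

The decomposition in the final check is what makes target diagrams readable: the distance of a point from the origin recovers the normalized RMSE.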
In some instances, we have also plotted the Cumulative Distribution Function (CDF) of the absolute errors, |CO_calibrated − CO_reference|, for a more detailed investigation. Target diagrams [15,40] were constructed to visualize the calibration models. The y-axis represents the Mean Bias Error (MBE) normalized by the standard deviation of the ground truth, so that

MBE = mean(CO_calibrated) − mean(CO_reference)
Normalized MBE = MBE / σ_reference

Here, σ_reference is the standard deviation of the ground truth for the relevant test set. The x-axis of the diagram represents the normalized, unbiased estimate of the RMSE, the Centered RMSE (CRMSE), given as

CRMSE = sqrt( (1/N) Σ_{i=1}^{N} [ (CO_calibrated(i) − mean(CO_calibrated)) − (CO_reference(i) − mean(CO_reference)) ]² )

The normalized CRMSE (CRMSE/σ_reference) is multiplied by sign(σ_calibrated − σ_reference) to produce the target diagrams, with σ_calibrated being the standard deviation of the calibrated data for the relevant test set.

Results
Table 4 shows the performance of the calibration algorithms for all datasets in terms of RMSE (in ppm) and R². We can make the following observations:
1. The accuracy of every calibration algorithm improves (lower RMSE, higher R²) as we go from SC1 (CO only) to SC2 (CO with T and RH) to SC3 (all inputs). The accuracy improves when temperature and humidity are included (SC2) alongside the raw CO data, and improves further when the other pollutant data are introduced (SC3) to exploit the dependencies arising from cross-sensitivity. This clearly emphasizes the importance of deploying low-cost sensors as multi-sensor platforms: not only does a single unit monitor multiple pollutants, but the accuracy of the measured data also improves.
2. Every ensemble and neural network-based algorithm outperforms linear regression-based calibration for all scenarios. In almost every instance, 1DCNN is the best-performing algorithm. This shows that 1DCNN could potentially improve the relative accuracy of low-cost multi-sensor air pollutant monitors. GBR and LSTM are the next most accurate algorithms. While LSTM has gained much attention for gas sensor calibration, GBR-based calibration appears to have received far less attention and warrants further investigation.
3. The accuracy of any given algorithm is better for the 90/10 split (TTS1) than for the 20/80 split (TTS2).
4. The accuracy of the models derived and evaluated with TTS2 (20/80) is not significantly worse than those for TTS1 (90/10). This suggests that with sophisticated calibration models, such as the ones presented in this work, low-cost sensor platforms could not only be utilized as a backup for a reference-grade monitor (TTS1) but could also be deployed for reasonably accurate CO monitoring over a long duration after a short co-location (TTS2). It should be noted that the accuracy of the calibration models could be further improved by periodic co-location and recalibration (please see [44]).
5. The accuracy of the algorithms appears to be best for Dataset 3. This could be due to the comparatively small number of low-concentration CO readings in Dataset 3; low-cost sensors typically struggle to register low gas concentrations, as corroborated later in this section with box plots of residuals. Dataset 3 also covers a considerably shorter time; therefore, the sensor may have experienced less drift and degradation than the sensors of the other two measurement campaigns. It should be noted that the platforms used to collect Datasets 2 and 3 are constructed from the same CO sensors, which presents an opportunity for a reasonably objective evaluation of this effect.

Figure 5 shows the CDF of the absolute errors for the 1DCNN-based calibration (error CDF plots for all scenarios can be found at https://github.com/Sharafat-Ali/AirQualityResults, accessed on 19 December 2022). The impact of T and RH on accuracy improvement appears more prominent for Datasets 2 and 3, whereas the impact of cross-sensitivity to other pollutants appears more significant for Dataset 1. This could be because Datasets 2 and 3 were collected using EC sensors, whereas those used for Dataset 1 are MOX-based. It should be noted that these observations are consistent across all calibration techniques.

Target diagrams for the 1DCNN-based calibration models are shown in Figure 6 (target diagrams for all algorithms can be found at https://github.com/Sharafat-Ali/AirQualityResults, accessed on 19 December 2022). The following observations can be made:

1. All points lie within the unit circle (radius = 1); therefore, the variance of the residuals is smaller than the variance of the reference measurements. This is an essential characteristic of a functional calibration model [15], indicating that the variability of the dependent variable (calibrated output) is explained by the independent variable (the reference data) and not by the residuals [27]. It should be noted that all calibration algorithms presented in this work fulfill this criterion.

2. The distance from the origin, which measures the normalized RMSE (RMSE/σ_reference), clearly shows that the SC3 (all covariate inputs) regressors are more accurate than the SC2 (CO with T and RH) regressors, which in turn are more accurate than the SC1 (CO only) regressors. This once again demonstrates the importance of the availability of covariate factors, such as temperature, relative humidity, and other pollutants.

3. The majority of the points lie in the left half-plane, indicating that the standard deviation of the calibrated sensor data for most models is smaller than that of the ground truth.

4. For TTS1 (90/10), the points lie above the x-axis, indicating that the models, on average, slightly overestimate the CO concentration. For TTS2 (20/80), a few models also slightly underestimate the CO concentration.

Low-cost sensors have been reported to suffer from variability in accuracy across different dynamic ranges of pollutants [27,40]. Therefore, we performed a quantitative investigation by constructing box plots of the residuals as a percentage of the ground truth at each decile of the ground truth (please see Figure 7 for the performance of the 1DCNN-based calibration). The figures show that accuracy is worst at the lower concentration range of CO (median values further from zero) and exhibits more variability there (larger boxes). The models underestimate at lower concentrations and slightly overestimate at higher concentrations. The variability appears more prominent for Datasets 1 and 2. The likely reason for this increased variability and degraded performance is the difference in CO levels experienced during the measurement campaigns: for Datasets 1 and 2, the first deciles of the CO data start at 0.0873 ppm and 0.066 ppm, respectively, compared to Dataset 3 (starting at 0.296 ppm). Since low-cost sensors are likely to struggle with sensitivity at low pollutant concentrations [22,70], this hardware limitation probably causes the performance degradation at the lowest deciles for Datasets 1 and 2.
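The decile-wise residual analysis behind the box plots can be reproduced with a short helper; the synthetic data below are illustrative only:

```python
import numpy as np

def residual_pct_by_decile(cal, ref):
    """Group residuals (as a percentage of the ground truth) by decile of
    the reference concentration, mirroring the box-plot construction."""
    order = np.argsort(ref)
    cal, ref = cal[order], ref[order]
    pct = 100.0 * (cal - ref) / ref
    return np.array_split(pct, 10)   # ten near-equal decile groups

rng = np.random.default_rng(2)
ref = rng.uniform(0.1, 3.0, size=200)           # synthetic CO ground truth (ppm)
cal = ref + rng.normal(0.0, 0.05, size=200)     # synthetic calibrated output

groups = residual_pct_by_decile(cal, ref)
print(len(groups))   # 10 decile groups
# With a fixed absolute error, the *relative* spread is widest where
# concentrations are lowest, as observed for Datasets 1 and 2:
print(np.std(groups[0]) > np.std(groups[-1]))
```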

Computational Cost
The computational cost of the ML algorithms can mainly be attributed to training and hyperparameter tuning. However, it should be noted that these activities are not performed on the sensor devices, which are constrained in terms of computational capability and energy storage. The low-cost sensors are deployed to collect and send the data to a backend server or a cloud-based infrastructure that performs the calibration/training offline. By investing in this infrastructure, we can deploy a city-wide low-cost sensor network at very high spatial resolution. This is far more cost-effective than deploying a large number of expensive reference-grade sensors. We trained the ML algorithms on a workstation equipped with a 16-core AMD Ryzen processor, 128 GB RAM, and two NVIDIA A40 GPUs. The cost of this workstation (less than $20,000) is far lower than that of a single reference-grade gas sensor.
Once an algorithm has been trained, it can produce a calibrated output almost instantaneously. For example, even for the largest dataset (dataset 2), the trained 1DCNN algorithm takes less than 10 s to produce the calibrated output for the entirety of 80% of the data (20/80 split) for SC3, even on a simple PC (Intel Core i7-8700 3.20GHz CPU, 16 GB RAM) with no additional GPU. Therefore, it is possible to have a real-time operation with ML-based algorithms, where the data (electrode reading, T, RH, other pollutant readings) for one time instant will be sent to the cloud for the algorithm to produce the calibrated output only for that instant.
The computational complexity of ML algorithms can be benchmarked by comparing the total number of learnable parameters. Table 5 shows the number of learnable parameters for the three most consistent ML algorithms (1DCNN, GBR, and LSTM) and the linear regression for scenario 3 for the 90/10 splits (TTS1). The performance improvement of the ML algorithms comes at the cost of increased computational complexity, and GBR requires the highest number of parameters to be learned.
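Learnable-parameter counts like those compared in Table 5 follow from standard closed-form expressions per layer. The small architectures below are hypothetical examples over five SC3 input channels, not the tuned models of Table 5:

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of one 1-D convolutional layer."""
    return kernel_size * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, n_hidden):
    """Four gates, each with input, recurrent, and bias terms."""
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

# Hypothetical small models over 5 input channels (SC3-style input):
print(conv1d_params(5, 16, 3) + dense_params(16, 1))   # tiny 1DCNN head: 273
print(lstm_params(5, 16) + dense_params(16, 1))        # tiny LSTM head: 1425
print(dense_params(5, 1))                              # linear regression: 6
```

Even at this toy scale, the ordering (linear regression far below the neural models) matches the general trend that accuracy gains come at the cost of more learnable parameters.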

Computational Cost
The computational cost of the ML algorithms is mainly attributable to training and hyperparameter tuning. However, it should be noted that these activities are not performed on the sensor devices themselves, which are constrained in computational capability and energy storage. The low-cost sensors are deployed to collect data and send it to a backend server or cloud-based infrastructure that performs the calibration/training offline. By investing in this infrastructure, we can deploy a city-wide low-cost sensor network at very high spatial resolution, which is far more cost-effective than deploying a large number of expensive reference-grade sensors. We trained the ML algorithms on a workstation equipped with a 16-core AMD Ryzen processor, 128 GB RAM, and two NVIDIA A40 GPUs. The cost of this workstation (less than $20,000) is far lower than that of a single reference-grade gas sensor.
Once an algorithm has been trained, it can produce a calibrated output almost instantaneously. For example, even for the largest dataset (dataset 2), the trained 1DCNN algorithm takes less than 10 s to produce calibrated output for the entire 80% test portion of the data (20/80 train/test split) for SC3, even on a simple PC (Intel Core i7-8700 3.20 GHz CPU, 16 GB RAM) with no additional GPU. Therefore, real-time operation is possible with ML-based algorithms: the data for one time instant (electrode reading, T, RH, other pollutant readings) can be sent to the cloud, and the algorithm produces the calibrated output for that instant alone.
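The per-instant workflow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the trained 1DCNN is stood in for by a simple linear map, and the feature values, weights, and bias are invented for demonstration.

```python
import numpy as np

def calibrate_instant(sample, weights, bias):
    """Calibrate one time instant of sensor data.

    sample: 1-D array [electrode reading, T, RH, NO2 reading] for a
    single timestamp, as sent by the deployed low-cost sensor.
    weights/bias: stand-in for the trained model hosted on the backend;
    in practice this call would invoke the trained 1DCNN.
    """
    return float(np.dot(sample, weights) + bias)

# One instant of (hypothetical) raw data from the field sensor.
sample = np.array([0.62, 21.5, 48.0, 0.031])
# Illustrative model parameters -- not values from the paper.
weights = np.array([1.8, -0.004, -0.002, 0.5])
co_calibrated = calibrate_instant(sample, weights, bias=0.05)
```

Because inference reduces to a handful of arithmetic operations per instant, the backend can serve many deployed sensors concurrently with negligible latency.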
The computational complexity of ML algorithms can be benchmarked by comparing the total number of learnable parameters. Table 5 shows the number of learnable parameters for the three most consistent ML algorithms (1DCNN, GBR, and LSTM) and the linear regression for scenario 3 for the 90/10 splits (TTS1). The performance improvement of the ML algorithms comes at the cost of increased computational complexity, and GBR requires the highest number of parameters to be learned.
The number of learnable parameters increases as more covariate factors are included. Table 6 shows the total number of learnable parameters for 1DCNN for all three scenarios for the 90/10 splits. We can observe that the number of learnable parameters increases as we go from SC1 to SC2 and then from SC2 to SC3. Thus, the increased accuracy comes at the cost of learning more parameters.
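The growth in learnable parameters from SC1 to SC3 follows directly from how a 1-D convolutional layer counts its weights: each filter learns kernel_size weights per input channel plus one bias, so adding covariate channels (T, RH, other pollutants) enlarges every filter. The sketch below applies the standard formulas with illustrative layer sizes and channel counts; these are not the architectures or counts reported in Tables 5 and 6.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # Each filter: kernel_size weights per input channel, plus one bias.
    return (kernel_size * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # Fully connected layer: one weight per input per output, plus biases.
    return (in_units + 1) * out_units

# Hypothetical channel counts: SC1 = electrode readings only,
# SC2 adds T and RH, SC3 adds other pollutant readings.
for name, channels in [("SC1", 2), ("SC2", 4), ("SC3", 6)]:
    total = conv1d_params(channels, filters=32, kernel_size=3) \
            + dense_params(32, 1)
    print(name, total)
```

The same accounting extends layer by layer to the full network, which is how the totals in such comparisons are obtained.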

Conclusions and Future Work
We proposed 1DCNN-based multivariate regressors for the calibration of low-cost CO sensors. The 1DCNN-based calibration algorithms were benchmarked against several machine learning-based calibration techniques and linear regression, and were found to be the most accurate on several performance metrics. GBR-based calibration models also generally performed better than many popular ML techniques across all datasets. Therefore, both 1DCNN and GBR should be further explored for low-cost gas sensor calibration.
We can also conclude that the accuracy of CO sensors can be significantly improved by using data from the other sensors of a multi-sensor platform. Therefore, low-cost sensors should be deployed as multi-sensor arrays. It also appears that low-cost CO sensors can be calibrated reasonably accurately through a short co-location with a reference sensor and then deployed for a significantly longer monitoring period.
Data augmentation can improve the performance of ML algorithms and can be explored to further improve calibration accuracy. It should be noted that the sensors were tested on data collected from the same location. It is not apparent how the calibration models would perform for the same sensors deployed in an area with significantly different pollutant concentrations and meteorological conditions; indeed, the accuracy of ML algorithms may deteriorate on data outside the calibration range. One way to deal with this issue would be to train multiple models at various concentration levels or distributions, with a heuristic-based algorithm switching between them depending on the pollutant concentration. Developing such models would require co-locating the sensors at multiple locations (e.g., central city, industrial area, suburb). A model calibrated for one CO concentration range could also seed the development of calibration models for other ranges. Future work can investigate the efficacy of such a strategy, and of transfer calibration, if data from the same sensor platform are available from multiple locations.
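The heuristic model-switching idea above can be sketched as a simple range lookup. The concentration breakpoints and the per-range models (reduced here to affine corrections) are hypothetical placeholders; in practice each entry would be a model trained on a co-location at the corresponding concentration regime.

```python
def select_model(co_estimate, models):
    """Pick the calibration model whose training range covers a rough
    CO estimate (heuristic switching between per-range models)."""
    for (lo, hi), model in models:
        if lo <= co_estimate < hi:
            return model
    return models[-1][1]  # fall back to the last (widest) range

# Hypothetical per-range models: (gain, offset) affine corrections.
models = [((0.0, 1.0), (1.9, 0.02)),            # low-concentration model
          ((1.0, 5.0), (1.6, 0.10)),            # mid-range model
          ((5.0, float("inf")), (1.4, 0.30))]   # high-concentration model

gain, offset = select_model(2.3, models)  # selects the mid-range model
```

A practical variant would hysteresis-filter the estimate before switching, so that noise near a breakpoint does not cause the output to oscillate between two models.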
In this work, we have only calibrated CO data. All three datasets also contain the raw and reference data for NO2. In the future, we will investigate and benchmark the performance of 1DCNN- and GBR-based calibration for NO2.
Since low-cost sensor platforms are unlikely to have identical responses, a slightly different calibration model may be needed for each sensor. However, the development of such models can be expedited through transfer calibration, where a base model, developed through extensive co-located deployments, is fine-tuned with a short co-location to account for the slightly dissimilar hardware response. This can be explored in a future study.
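One lightweight form of the transfer calibration described above keeps the expensively trained base model frozen and fits only a per-sensor affine correction on the short co-location. The sketch below illustrates this with a stand-in linear base model and invented co-location readings; the actual base model would be the trained 1DCNN.

```python
import numpy as np

def transfer_calibrate(base_predict, raw_short, ref_short):
    """Fit y = a * base(x) + b on a short co-location via least squares,
    leaving the base model's own parameters frozen."""
    base_out = np.array([base_predict(x) for x in raw_short])
    A = np.column_stack([base_out, np.ones_like(base_out)])
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(ref_short), rcond=None)
    return lambda x: a * base_predict(x) + b

# Stand-in base model and hypothetical short co-location data for a
# new sensor whose response is scaled/shifted relative to the base.
base = lambda x: 2.0 * x
raw_short = [0.1, 0.2, 0.3, 0.4]          # new sensor's raw readings
ref_short = [0.25, 0.45, 0.65, 0.85]      # reference-sensor readings

tuned = transfer_calibrate(base, raw_short, ref_short)
```

In a neural-network setting, the analogous step is to freeze the convolutional layers and retrain only the final dense layer on the short co-location, which requires far less data than training from scratch.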