Traffic noise prediction using machine learning and Monte Carlo data augmentation: a case study on the Patiala city in India

Traffic noise pollution is a serious problem in modern urban areas, and it must be considered when designing smart cities, highways, hospitals and schools for an efficient and healthy environment. To analyse this problem, we propose a machine learning based prediction of sound pressure level using an original dataset collected in Patiala city in India. Vehicular traffic and sound pressure level data were collected at different sites in the city, yielding a total of 502 data samples. These data were then augmented to roughly 10 times their initial size using Monte Carlo simulation, and Artificial Neural Networks (ANN) were trained and compared with other machine learning methods for vehicular traffic noise prediction. The input parameters of the model are the traffic volume Q, the percentage of heavy vehicles P and the average speed of vehicles V; the output parameter is the equivalent continuous sound pressure level, Leq dB(A). The experimental results show that the ANN trained on the Monte Carlo augmented data outperforms the other methods considered, making it an effective tool for vehicular traffic noise prediction in support of a healthy environment with less noise pollution.


Introduction
Increased vehicular traffic on the roads is a cause of serious concern, as it leads to environmental noise pollution, annoyance, sleep disturbance [1], [2], anxiety among drivers and passengers, and various other related health problems [3]. The scientific community and regulatory agencies the world over are working to address this issue. One aspect of the problem is the assessment of present noise levels using different measurement techniques and equipment, together with the prediction of future traffic noise levels so that adequate preventive measures can be taken. Figure 1 depicts the proportion of annoyance caused by various noise sources; according to [4], road traffic noise contributes about 80% of the overall noise annoyance.
Existing traffic noise prediction models used to assess traffic noise and plan mitigation measures include FHWA (USA), RLS-90 (Germany), CoRTN (UK) and ASJ (Japan) [5]. These models are country specific and may not be applicable everywhere. Other methods are based on machine learning techniques such as ANN and GA [6]. The parameters generally considered include traffic volume, percentage of heavy vehicles and average speed of vehicles [4].
Singh et al. [7]-[9] compared some of these methods on the criteria of MSE, correlation coefficient and accuracy, but the number of methods considered was limited. Other aspects, such as augmentation of the datasets and the effect of dataset size, also need to be studied. Some of these aspects are addressed in the present study. Monte Carlo simulation is used to augment the dataset and to examine the effect on the results, i.e. the mean square error and the prediction accuracy, obtained using different machine learning techniques.
The materials and methods are presented in section 2, the results and discussion in section 3, and the conclusions drawn from the study in section 4.

Materials and Methods
The dataset collected for the study, the measurement locations, equipment and methods used are presented in this section.

Dataset for the study
The study locations were selected on the highway roads that bring traffic into Patiala city. The traffic at these locations is in the free-flow condition, as there are no congestion, traffic lights or roundabouts. The input parameters are the traffic volume (Q), the percentage of heavy vehicles (P) and the average speed of vehicles (V). The values of these traffic parameters were measured manually using videography. The output parameter is the equivalent continuous sound pressure level, Leq. The values of Leq were measured using a class 1 integrating sound level meter (SLM) conforming to the IEC specifications [10]. The SLM was mounted on a tripod to avoid human impedance effects, and a windscreen was used to cover the microphone. The SLM was kept at a distance of 1 m from the edge of the road. A total of 502 data samples were obtained at the three measurement locations at different times of the day and in different months throughout the year to account for variation in the traffic conditions. A sample dataset is presented in table 1, with five sample values for each parameter along with the mean and the standard deviation of the entire dataset of 5000 values. To predict the output parameter Leq accurately from the set of input parameters, machine learning was combined with Monte Carlo simulation, as presented in the next section.
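To make the output parameter concrete, the sketch below applies the standard definition of the equivalent continuous level for equally spaced readings, Leq = 10 log10((1/N) Σ 10^(Li/10)), i.e. an energy average rather than an arithmetic mean of the decibel values. The function name and the sample readings are hypothetical and are not taken from the Patiala dataset; the integrating SLM used in the study performs this calculation internally.

```python
import math

def leq(levels_db):
    """Energy-average equally spaced sound levels in dB(A) into a single Leq."""
    if not levels_db:
        raise ValueError("at least one level is required")
    # Convert each level to relative sound energy, average, then convert back to dB.
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)

# Hypothetical short-interval readings in dB(A), for illustration only.
readings = [68.0, 72.5, 70.1, 75.3, 69.4]
print(round(leq(readings), 1))
```

Because louder intervals dominate the energy average, Leq is never lower than the arithmetic mean of the same readings.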

Monte Carlo simulation
Monte Carlo simulation is used to enlarge the dataset with random samples rather than relying on a fixed set of input values for the prediction [11]. The number of Monte Carlo experiments can be increased thousands of times to explore more possible outcomes and thereby increase the prediction accuracy, as shown in figure 2. Figure 2 also illustrates, through the bell curve, the behaviour described by the Central Limit Theorem (CLT): as the number of samples increases, the distribution of the sample mean approaches a normal distribution. The main steps are to study the probability distribution of the independent (input) parameters, to generate a range of values for these parameters by random sampling, and to run the prediction model (machine learning based in the present case) hundreds or thousands of times to obtain the output results.
Monte Carlo simulation is a computational algorithm that randomly draws samples from a given data distribution, for example one defined by a specified mean and standard deviation. Monte Carlo methods use random sampling to address problems that may be deterministic in principle, and they are useful when the original dataset is not sufficiently representative of the population. Repeated random sampling of this kind is often used for obtaining a probability distribution, numerical integration and optimization [11]. The simulations are useful where the degrees of freedom of the input variables are coupled in a complex fashion to generate the output, as in the present problem of predicting the noise level Leq from the traffic volume Q, the percentage of heavy vehicles P and their average speed V. Monte Carlo simulation is widely used in financial modeling, risk management and problems involving complex decision boundary conditions. The CLT states that when the sample size is larger than about 30, the distribution of the sample mean is approximately normal, regardless of the population distribution [12].
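The CLT behaviour mentioned above can be checked numerically. The following is a minimal sketch, assuming a deliberately skewed exponential population with mean 1 and standard deviation 1; the population, the seed, the sample sizes and the helper name `sample_means` are all illustrative choices rather than anything used in the study.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Return `trials` means, each computed from n draws of an Exponential(1) population."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

rng = random.Random(0)
for n in (1, 30, 200):
    means = sample_means(n, 2000, rng)
    # The CLT predicts that the means centre on 1 with a spread close to
    # 1 / sqrt(n), even though the population itself is strongly skewed.
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Note that the theorem describes the distribution of sample means; it does not by itself make the individual observations normally distributed.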
Therefore, the original dataset of 502 samples was assumed to be approximately normally distributed. This relatively small dataset was augmented to 5,000 samples using Monte Carlo simulation. The simulation was implemented in Excel using NORMINV(.), which takes three inputs: (a) a random number generator function within a specified range, (b) mean and (c)
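As an illustration of this kind of sampling, the sketch below mirrors the Excel idiom of passing a uniform random number to the inverse normal distribution function, using `statistics.NormalDist.inv_cdf` from the Python standard library. It is a sketch under stated assumptions, not the authors' spreadsheet: a single column is sampled from a normal distribution fitted to its own mean and standard deviation, the observed values are hypothetical, and the source does not state how dependence between Q, P, V and Leq was handled.

```python
import random
import statistics

def augment_column(observed, n_new, rng):
    """Draw n_new synthetic values from a normal distribution fitted to one column."""
    fitted = statistics.NormalDist(statistics.fmean(observed),
                                   statistics.stdev(observed))
    draws = []
    while len(draws) < n_new:
        p = rng.random()       # uniform number, the role RAND() plays in Excel
        if 0.0 < p < 1.0:      # inv_cdf is only defined strictly inside (0, 1)
            draws.append(fitted.inv_cdf(p))
    return draws

# Hypothetical traffic volumes (vehicles per hour), not values from the study.
observed_q = [1180.0, 1325.0, 1410.0, 1266.0, 1502.0, 1377.0]
synthetic_q = augment_column(observed_q, 5000, random.Random(42))
print(len(synthetic_q), round(statistics.fmean(synthetic_q), 1))
```

Sampling each column independently reproduces the column means and spreads but discards any correlation between variables, so a practical augmentation would need a way to keep inputs and outputs consistent.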

Results and Discussion
The results obtained by applying the methodology discussed in the previous sections are presented below. The Neural Network Fitting application in MATLAB R2020a was used to train a 2-layer feedforward network with sigmoid hidden neurons and linear output neurons. The network was trained using the Bayesian Regularization algorithm option with a train:validate:test ratio of 70:15:15. Figure 3 shows the neural network training results over 118 epochs, including the convergence, validation and other parameters reported by the MATLAB fitting application.
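The architecture and data split described above can be sketched in a few lines. This is an untrained illustration only: the hidden layer size of 10, the logistic form of the sigmoid, the random weight initialisation, the scaled input values and all function names are assumptions, and the Bayesian Regularization training performed in MATLAB is not reproduced.

```python
import math
import random

def split_indices(n, rng, ratios=(0.70, 0.15, 0.15)):
    """Shuffle 0..n-1 and cut it into train/validation/test index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def init_network(n_inputs, n_hidden, rng):
    """Random weights for one hidden layer and one linear output neuron.
    The last entry of every weight list is the bias."""
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(network, x):
    """Sigmoid hidden activations followed by a linear output."""
    hidden, output = network
    activations = []
    for weights in hidden:
        z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
        activations.append(1.0 / (1.0 + math.exp(-z)))
    return output[-1] + sum(w * a for w, a in zip(output, activations))

rng = random.Random(0)
train, val, test = split_indices(5000, rng)
print(len(train), len(val), len(test))
net = init_network(3, 10, rng)
# Inputs are hypothetical Q, P and V values scaled to the range [0, 1].
print(forward(net, [0.4, 0.2, 0.6]))
```

Scaling the inputs before the forward pass keeps the sigmoid away from saturation, which is why the example uses values between 0 and 1 rather than raw traffic counts.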
Mu is the adaptation parameter of the underlying optimisation used by the ANN during the parameter update calculations; in the Levenberg-Marquardt algorithm it acts as a damping term rather than a momentum term. It is a control parameter that affects convergence: it is decreased after a step that lowers the error and increased when a step would raise it. The details of mu are given in the Levenberg-Marquardt algorithm. 'Gradient' is the gradient of the performance function; training stops when it falls below the minimum performance gradient, which has a default value of 1e-7. 'gamk' shows the effective number of parameters used for training, and 'ssX' is the sum of squares of the parameters. The validation checks ('val fail') graph at the bottom tracks the number of successive epochs in which the validation performance failed to improve. Figure 4 shows the regression plots for the proposed method (Monte Carlo simulation with neural network regression), for which the value of R is almost 1. Figure 5 shows the frequency histogram of the errors for the training and testing target variable (Leq). Figure 6 shows that the mean square error was at its minimum at 108 epochs.
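The two quantities reported in these figures, the mean square error and the regression value R, can be computed as in the sketch below. R is taken here to be the Pearson correlation between targets and predictions, which is an assumption about how the regression plot defines it; the target and predicted Leq values are hypothetical.

```python
import math

def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def correlation_r(targets, predictions):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical measured and predicted Leq values in dB(A).
measured = [70.2, 72.8, 68.5, 74.1, 71.0]
predicted = [70.0, 73.1, 68.9, 73.6, 71.4]
print(round(mean_squared_error(measured, predicted), 3))
print(round(correlation_r(measured, predicted), 3))
```

A value of R close to 1 indicates that predictions track the targets linearly, while the MSE reports the size of the remaining errors in squared decibels.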
Several state-of-the-art machine learning algorithms were considered for the performance comparison: Generalised Linear Model (GLM), Decision Tree (DT), Random Forest (RF) and Artificial Neural Network (ANN). Finally, the proposed regression model using Monte Carlo data augmentation was considered, and it showed superior performance. Figure 7

Conclusions
In the modern world, especially in urban areas, noise pollution is a serious problem that causes health and environmental issues, particularly around schools, hospitals, highways and industries. To analyze this problem, we studied an original dataset collected in Patiala city comprising traffic volume, percentage of heavy vehicles, average speed of vehicles and equivalent sound pressure level in dB(A). In this paper, vehicular traffic noise was predicted using three input variables: (a) the traffic volume Q, (b) the percentage of heavy vehicles P and (c) the average speed of vehicles V. The predicted output parameter is the equivalent continuous sound pressure level, Leq dB(A). The original data of 502 samples were collected from Patiala city in India, and Monte Carlo simulation was then used to augment the data to about 10 times its size, i.e. 5,000 samples. It was observed that the machine learning algorithms perform better on the larger, augmented dataset than on the smaller original dataset, which demonstrates the advantage of Monte Carlo simulation for data augmentation. Machine learning algorithms perform better on a larger training dataset because numerous internal parameters (such as the neuron weights in the case of ANN) need to be set accurately for better generalization and accuracy. In our study, this dataset enlargement is provided by Monte Carlo simulation. Machine learning algorithms rest on assumptions about parameter tuning and data distribution, which must be analyzed carefully. Moreover, because machine learning depends on the data distribution, a range of techniques should be explored for every data analysis problem. Our future plan is to collect data specific to places and vehicle types and to include more input variables for more accurate prediction.