Acoustical Classification of the Urban Road Traffic with Large Arrays of Microphones



Introduction
Quality of life is a major challenge in urban areas because of the large number of environmental factors influencing it. In projects such as Mouvie [1], systemic approaches are needed to address both air pollution and noise pollution and their impact on the comfort of city dwellers.
Notably, noise has stress-related impacts on human health such as sleep disturbance or cardiovascular diseases [2,3,4]. The World Health Organisation (WHO) indicates [5] that between 1 and 1.6 million years of healthy life (disability-adjusted life years, or DALYs) are lost every year in western Europe because of transportation noise. Sleep disturbance is the major effect, with 903,000 years lost every year, and long-term noise annoyance (due to passive effects of noise) is the second (654,000 years lost).
In order to unify the national initiatives and provide an efficient tool to diagnose the urban sonic environment, the European parliament voted the 2002/49/CE directive in 2002 [6]. It requires the large cities of the member countries of the European Union to produce noise maps based on the L_den index (which stands for Day-Evening-Night level). This index aims to take into account the influence of the time period of the day on noise annoyance by increasing the evening levels by 5 dB(A) and the night levels by 10 dB(A). After this diagnosis step, the large cities (over 100,000 inhabitants) have to provide action plans showing their ambition to reduce the proportion of persons highly annoyed by noise through practical actions.
These regulations, together with WHO's threshold values [7,8], provide an efficient way to inform city dwellers of their average exposure to noise. But these maps have some limits. First, they are provided for each type of source separately (road, aircraft, railway and industrial), whereas in urban areas city dwellers are usually exposed to several sound sources at the same time, with possible interactions and not just a summation, so the overall exposure is not estimated. Second, whereas the variation of noise exposure over the day may be of interest for city dwellers and urban planners, it is averaged out by the use of the L_den. Third, the noise maps are obtained by simulations, rarely confronted with measurements, and consequently seldom validated experimentally. It is therefore interesting to see the emergence of monophonic acoustic sensor networks (such as CENSE or DYNAMAP [9]) providing additional real-time information about the sonic environment.
In this work, we focus on the noise induced by road traffic. In this specific context, noise maps have additional weaknesses. Thus, although they are often cited as the most annoying sources [10,11], powered two-wheelers were considered as light vehicles [11]. Indeed, in many cities the traffic flow estimation is based on counting the number of axles of the passing-by vehicles, so it cannot distinguish a powered two-wheeler from a car.
Note that the European Union voted a directive in 2015 [12] redefining the categories to be used for noise mapping, including the powered two-wheelers. This directive had to be applied before 31/12/2018.
In addition, Marquis-Favre et al. [13] reported that the perceived noise annoyance induced by road traffic (here the short-term annoyance, estimated with active listening) is related to the mode of transportation (road, rail or air traffic) but also to the vehicle type. The driving conditions are also pointed out as a decisive factor in this annoyance. It thus appears that the road traffic estimation should be improved to detect all types of vehicles and their driving conditions as well.
To this end, we propose to capture the audio signal of each road vehicle and extract from it the vehicle type and its driving condition, in order to provide a more detailed description of the road traffic noise and pave the way for short-term noise annoyance estimation in urban areas.
If the audio signal is known, it can be used to identify the vehicle. Indeed, sound source classification has been investigated with machine learning based on monophonic signals in the past decade (see e.g. [14,15,16,17]). But in major streets the spatial and temporal masking effects of the different sound sources prevent classifying each vehicle properly.
Multiple studies have been conducted to separate sound sources in monophonic signals (see for example Gloaguen et al. [18] for recent developments in Non-negative Matrix Factorisation applied to urban sound scenes), but we assume that a spatial filtering technique will provide better results and extract each vehicle's audio signal with more accuracy.
Weinstein et al. [19] showed that microphone arrays can be used for sound source separation using inverse techniques. The application to low-speed moving source signal extraction has been done by Hafizovic et al. [20] on a basketball court with a 300-microphone array.
In this article, we propose a road traffic monitoring system. It aims at detecting each vehicle type, identifying its driving conditions and extracting its specific sound signal. This makes it possible to compute indices that could be used to better assess noise annoyance, such as loudness. By identifying the vehicles and isolating their audio features, the proposed system provides more detailed information than standard urban noise observatories.
Part 2 presents the tools implemented in the study. First, a video tracking method provides the trajectory of each vehicle (Section 2.1). From this trajectory, the system uses large microphone arrays (Section 2.2) together with a dedicated beamforming technique to extract the signal of each vehicle embedded in the traffic (Section 2.3). The last step of the process consists in classifying these signals into clusters combining both vehicle type and driving condition (Section 2.4). Note that this study only focuses on internal combustion vehicles.
Part 3 presents the applications of the method. In a first step, it is used in a controlled set-up to characterise isolated vehicles on a test track (Section 3.1) and constitute a learning database. In a second step, the system is implemented in a real urban context to evaluate its classification performance according to objective features (Section 3.2). Finally, the system is modified to perform classification according to perceptual indices in Section 3.3.
Finally, Part 4 presents the global outcome of this work by showing the evolution of the sound level over a whole day with respect to the estimated perceptual clusters.

Materials and Methods
The method for an acoustical classification of the urban road traffic starts with a tracking step of each vehicle in the traffic flow (Section 2.1). This provides the trajectory of the sound sources to be measured and classified. The individual signal extraction relies on the implementation of large microphone arrays, presented in Section 2.2. The method uses a beamforming technique dedicated to moving sources, presented in Section 2.3. The last step aims at classifying the extracted signals; the method, based on a supervised machine learning process, is presented in Section 2.4.

Moving vehicle tracking method
In order to obtain the trajectory of each vehicle embedded in the traffic, we developed an in-house tracking method.
We first perform a contour detection on the video file recorded by a camera located at the centre of the microphone array. It is based on background subtraction for each video frame using the OpenCV library¹. The background is created by averaging the 500 preceding frames. Each moving object is reduced to a rectangle including it. The connection of rectangles between two consecutive frames is simply computed by finding the minimum distance between two rectangle centroids. Finally, the vehicle trajectory is obtained by gathering the connected centroids. Figure 1a shows an example of vehicle tracking for a pass-by measurement. But the trajectory detection also has to be robust to the presence of obstacles (such as trees) between the camera and the vehicle, as in the configuration presented in Figure 1b: a camera over a multi-lane street. To do so, the current frame, from which the background is subtracted, is blurred in the direction of the vehicles.
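As an illustration, the following Python sketch outlines this tracking step with OpenCV. The video file name, the binarisation threshold and the trajectory bookkeeping are assumptions made for the example; only the background averaging, the bounding rectangles and the nearest-centroid connection follow the description above.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("passby.avi")   # hypothetical video file
history, prev_centroids, segments = [], [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    history = (history + [gray])[-500:]            # keep the 500 preceding frames
    if len(history) < 2:
        prev_centroids = []
        continue
    background = np.mean(history[:-1], axis=0)     # background = frame average
    diff = cv2.absdiff(gray, background).astype(np.uint8)
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    centroids = [(x + w / 2, y + h / 2)            # moving object -> rectangle
                 for x, y, w, h in (cv2.boundingRect(c) for c in contours)]
    for cx, cy in centroids:                       # connect to nearest previous
        if prev_centroids:
            nearest = min(prev_centroids,
                          key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
            segments.append((nearest, (cx, cy)))   # one trajectory segment
    prev_centroids = centroids
cap.release()
```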

The Megamicros acquisition system
The Megamicros project, introduced by Vanwynsberghe et al. [21] with a 128-microphone array, aims at providing digital acquisition systems able to capture up to 1024 synchronised acoustic signals. These systems are dedicated to applications such as acoustic imaging, room acoustics or source directivity measurements. Based on digital MEMS microphones, these systems are very versatile and easy to deploy. The MEMS microphones (ADMP441, Analog Devices) are omnidirectional and have a rather flat frequency response between 60 and 15,000 Hz. These systems allow building arrays of arbitrary geometries, possibly with extensions of a few tens of meters. Two such arrays were implemented in this study; they are presented in Sections 2.2.2 and 2.2.3. In both experiments, the signals are sampled at 50 kHz.

Isolated vehicle pass-by measurements
The microphone array used for the test-track experiment was designed to provide the best possible acoustic image of the noise sources of passing-by vehicles [22]. Therefore, the microphone array is large enough to offer a sufficient resolution at low frequencies and dense enough to avoid grating lobes at high frequencies. The array was built according to the geometry presented in Figure 2: 256 microphones were distributed over a 20 m long and 2.25 m high area, thanks to 32 vertical uprights supporting 8 microphones each. The microphone array was located 7.5 m away from the vehicle path, following the ISO 362 recommendations for pass-by noise measurements. In detail, the inter-microphone distances range from 10 cm to 1.53 m horizontally and from 17 cm to 39.9 cm vertically.
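As a rough illustration of these two design constraints, the sketch below applies the textbook half-wavelength criterion for grating lobes and a Rayleigh-type resolution estimate to the stated dimensions. These rules are assumptions made for illustration, not the authors' actual design procedure (the irregular spacing of the array further mitigates grating lobes).

```python
# Rough sanity checks on the array design (textbook criteria, for illustration)
c0 = 343.0                    # speed of sound (m/s)
d_min = 0.10                  # densest horizontal spacing (m)
L = 20.0                      # array length (m)
r = 7.5                       # distance to the vehicle path (m)

f_max = c0 / (2 * d_min)      # grating lobes appear above ~ c0 / (2 d)
lam = c0 / 100.0              # wavelength at an example low frequency, 100 Hz
resolution = lam * r / L      # lateral resolution ~ lambda * r / L

print(f"grating-lobe-free up to ~{f_max:.0f} Hz for the densest spacing")
print(f"lateral resolution at 100 Hz: ~{resolution:.1f} m")
```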
We aimed at being representative of the urban road traffic in terms of vehicle types and driving conditions. Table I lists the characteristics of the vehicles involved in the track tests. Note that the vehicle named hv is the one considered as a heavy vehicle, although it is a large utility vehicle, not a proper bus or truck. The study only focuses on internal combustion vehicles.
To simulate all possible urban driving configurations, each vehicle under test performed the following scenarios in both back and forth directions:
• 25 km/h constant speed in second gear;
• 50 km/h constant speed in third gear;
• traffic light vicinity:
  - deceleration from 30 to 0 km/h;
  - 2 s stop, engine idling;
  - acceleration from 0 to 30 km/h with gear change;
• full throttle acceleration over 20 m.

Urban experiment
Our goal is to monitor the road traffic in urban environments. Therefore, in situ measurements have been carried out on a multi-lane urban street (St-Bernard Quay in Paris, France) with a specific microphone array. It is 21.6 m long and is located 9.5 m above a 3×1 lane street, at a 13.5 m distance from its centre (see Figure 3). It is composed of 128 MEMS microphones regularly spaced with a pitch of 17 cm. This configuration allows segregating the vehicles in the same lane over a wide frequency range, even at low frequencies thanks to the array length. The overhanging position of the array provides a phase difference between the lanes, which makes it possible to separate the sources located in different lanes. Due to its lightness and simplicity, the installation of the antenna only requires 30 minutes. As illustrated in Figure 3, this experiment takes place in a street with an "L" shape, meaning that there is no building on the opposite side of the street. It is a busy street, and the microphone array is set up close to a traffic light, so that all the driving conditions described in the previous section can be expected.
This experiment took place during a day in winter, which prevented the video tracking process from being disrupted by the leaves on the trees. Six sequences of ten minutes each have been recorded during daytime.

Moving source signal extraction
Knowing the source position at all times, the microphone array recordings can undergo a beamforming (BF) process to extract the audio signal of each passing-by vehicle from a multi-source sound scene.
In acoustics, BF is used as a reference method, since it is robust for source localisation over a discretised plane or in a volume including static sound sources. The method can also be considered as a way of spatial filtering (see e.g. [19,20]).
In this study, we propose to use this property on moving vehicles. Therefore, the standard free-field propagation model used in classical delay and sum (DAS) applications has been modified to take into account the kinematics of the vehicles. Figure 4 presents a classical scenario with a linear microphone array recording the sound field propagated by the i-th monopolar sound source moving along a straight line.
Morse and Ingard [23], considering a homogeneous medium and a free field, write the pressure at time $t_r$ at microphone $m$ emitted by source $i$ at time $t_e$ as

$$p_m(t_r) = \frac{s_i(t_e)}{r_{mi,e}\left(1 - M_a(t_e)\cos\theta_{mi,e}\right)^2}, \quad (1)$$

with $t_r = t_e + r_{mi,e}/c_0$, $r_{mi,e}$ the distance between the microphone $m$ and the source $i$ at emission time $t_e$, $\theta_{mi,e}$ the angle between the source velocity and the source-to-microphone direction, $s_i(t) = \dot q(t)/4\pi$, where $\dot q(t)$ is the derivative of the source mass flow, and $M_a(t_e) = V(t_e)/c_0$ the Mach number of the source at emission time. The reconstructed source signal $\hat s_i(t_e)$ is estimated by Cousson et al. [24], for instance, as

$$\hat s_i(t_e) = \frac{1}{N_m} \sum_{m=1}^{N_m} r_{mi,e}\left(1 - M_a(t_e)\cos\theta_{mi,e}\right)^2 p_m\!\left(t_e + \frac{r_{mi,e}}{c_0}\right), \quad (2)$$

with $N_m$ the number of microphones. The central term bears the only modification of the classical DAS expression. However, in the rest of this paper the energy compensation is discarded (by removing the $r_{mi,e}$ factor) so that the output signal level is the one recorded by the microphones. Thus, the extracted signal writes

$$\hat s_i(t_e) = \frac{1}{N_m} \sum_{m=1}^{N_m} \left(1 - M_a(t_e)\cos\theta_{mi,e}\right)^2 p_m\!\left(t_e + \frac{r_{mi,e}}{c_0}\right). \quad (3)$$

The signal $\hat s_i(t)$ is an estimator of what would have been recorded by one microphone if only the source $s_i$ were in the sound scene. By doing so, only a spatial filtering is operated and the results are more comparable to the noise maps (which provide $L_{den}$ at the facade) and to the city dwellers' experience.
The simulations and the experimental tests presented in Appendix A1 show that the BF method of equation (3) accurately extracts each vehicle signal.
Note that in both experiments presented in this paper, the free-field model can be considered valid, as there are no major reflectors except for the road, which is too close to the sources to have a real influence in these configurations.
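For illustration, a minimal time-domain implementation of equation (3) could look as follows. This is a NumPy sketch; the function name and interface are hypothetical, and the nearest-sample rounding follows the choice made in Appendix A1.

```python
import numpy as np

def extract_moving_source(p, fs, mic_pos, src_traj, c0=343.0):
    """Sketch of the de-Dopplerised delay-and-sum of equation (3).

    p        : (N_m, N_t) microphone pressure signals
    fs       : sample rate (Hz)
    mic_pos  : (N_m, 3) microphone positions (m)
    src_traj : (N_t, 3) source position at each emission-time sample (m)
    """
    n_mics, n_t = p.shape
    t_e = np.arange(n_t) / fs
    v = np.gradient(src_traj, 1.0 / fs, axis=0)            # source velocity V(t_e)
    s_hat = np.zeros(n_t)
    for m in range(n_mics):
        d = mic_pos[m] - src_traj                          # source-to-mic vectors
        r = np.linalg.norm(d, axis=1)                      # r_{mi,e}
        mach_r = np.einsum("ij,ij->i", v, d) / (r * c0)    # M_a cos(theta_{mi,e})
        t_r = t_e + r / c0                                 # reception times
        idx = np.clip(np.rint(t_r * fs).astype(int), 0, n_t - 1)
        s_hat += (1.0 - mach_r) ** 2 * p[m, idx]           # no r: no energy comp.
    return s_hat / n_mics
```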

Moving vehicle classification
In classification tasks, the challenge lies not only in selecting the best algorithm but also in finding the best data to use: here, the audio descriptors. Valero et al. [15] provide a comparison of the classification accuracy of environmental noise obtained with 13 signal features and 4 different methods. They point out that the MFCCs (Mel Frequency Cepstral Coefficients) or the MPEG-7 descriptors give good accuracy in noise source classification, especially with Gaussian Mixture Models (GMM), K-Nearest Neighbours (KNN) or Neural Networks (NN).
MFCCs are widely used in noise source classification. As pointed out by Giannoulis et al. [25], a major part of the research teams uses MFCCs as audio descriptors for classifying different sound scene signals. MFCCs, as presented by Davis and Mermelstein [26], result from several calculations over each signal frame, typically 25 ms long. First, the energy summation of the spectrum filtered by a triangular filter bank aligned along the mel scale (mimicking the cochlea) is taken. Then, the discrete cosine transform of the log of each energy is computed. For our purposes, MFCCs are calculated using the python_speech_features library².
During the DCASE 2013 challenge [25], the best results were obtained by combining MFCCs with a Support Vector Machine (SVM). Introduced by Cortes and Vapnik [27], this method is widely used for supervised classification tasks. It aims at finding so-called hyperplanes that separate the samples of different classes with the maximum margin. A hyperplane is defined by its normal vector $w$ and the margin width is equal to $2/\|w\|$, so that minimising $\|w\|$ maximises the margin. The vector $w$ has as many components as the number of features used for classification (here the MFCCs). In order to allow errors and approximations when the clusters cannot be linearly separated, slack variables $\zeta_i$, counting the number of errors, are added to relax the constraints on the learning vectors. Finding the hyperplanes then reduces to solving

$$\min_{w,\,b,\,\zeta} \; \frac{1}{2}\|w\|^2 + C \sum_i \zeta_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0,$$

with C the parameter that determines the trade-off between increasing the margin size and ensuring that the samples lie on the correct side of the margin. C can be a vector or a scalar, providing respectively one value per class or the same value for all. Note that convolutional neural networks have also been used in environmental sound classification in the past years. This type of algorithm seems interesting but, so far, it gives the same type of performance as SVM with a much higher computational cost (see e.g. [28,29]).
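A minimal sketch of this classification stage is given below, assuming the scikit-learn implementation of the soft-margin SVM (the paper does not name its SVM library) and placeholder data in place of the real feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the 70 pass-by feature vectors of
# Section 3.1 (52 bag-of-frames MFCC features) and their 9 cluster labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(70, 52))
y = rng.integers(1, 10, size=70)

# Linear soft-margin SVM with a scalar relaxation parameter C
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.12, random_state=0)
clf = SVC(kernel="linear", C=5.0)
clf.fit(X_tr, y_tr)
print(f"accuracy: {clf.score(X_te, y_te):.2f}")
```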
In our case, the task consists in classifying road vehicles in terms of type (two-wheeler, light vehicle and heavy vehicle) and driving condition (constant speed, acceleration and deceleration). Nine clusters are used; they are detailed in Table II.

Implementation
The monitoring method proposed in the previous part is first tested and validated with isolated vehicles on a test track. Then, its application in a real urban scenario is presented.

Controlled set of vehicles
This extraction method has been applied to the pass-by measurements listed in Table I for the various driving conditions. Note that the "traffic light simulation" recordings are split into 3 driving conditions: deceleration, idle (not classified) and acceleration. Finally, 70 pass-by signals constitute the classification database (called the test-track database in the rest of the article). They are distributed in the different clusters as presented in Figure 5.
Signal feature selection is important for obtaining the best classification results. MFCCs are computed every 100 ms with 52 filters (ranging from 0 to 25 kHz, half the sample rate) and 26 MFCCs are obtained. A simplified bag-of-frames approach [30] is used for representing the evolution of the MFCCs over time: the time evolution is reduced to its mean and standard deviation, so that for each pass-by the extracted signal is represented by 52 audio features.
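A sketch of this feature extraction with the python_speech_features library could read as follows. The random signal is a placeholder for an extracted pass-by signal, and the FFT size is an assumption (it only needs to exceed the 1250-sample frame length).

```python
import numpy as np
from python_speech_features import mfcc

# 26 coefficients over 52 mel filters spanning 0-25 kHz,
# one 25 ms frame every 100 ms, at the 50 kHz sample rate
fs = 50000
x = np.random.randn(3 * fs)
coeffs = mfcc(x, samplerate=fs, winlen=0.025, winstep=0.1,
              numcep=26, nfilt=52, nfft=2048, lowfreq=0, highfreq=fs // 2)

# simplified bag-of-frames: mean and std over time -> 52 features per pass-by
features = np.concatenate([coeffs.mean(axis=0), coeffs.std(axis=0)])
```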
The classification is trained by SVM on 88% of the signals and tested on the remaining 12% (8 signals). The results are averaged over 35 different learning and testing signal combinations.
Figure 6 presents the confusion matrix for this classifier. It shows some small confusions, mostly within the classes of a given vehicle type. Some confusions across vehicle types are also present. For example, the light vehicles in acceleration (category 5) are classified 10 percent of the time as heavy vehicles at constant speed (category 7) and 3.5 percent as two-wheelers at constant speed (category 1). Using the MFCCs as descriptors and a relaxation parameter C of 5 and above, the score is 88% of correct classification, which is good with regard to the literature [15,29,28].
This result can be improved by adding more information to the dataset used by the SVM: the driving conditions. Indeed, in this controlled experiment they are well known. The learning and testing datasets are thus composed of 55 elements: 52 audio features and 3 binary values, one for each driving condition. Note that each element of the dataset is normalised by its maximum among the 70 pass-by data. Figure 7 presents the confusion matrix for this classifier; it represents the percentage of predicted vehicle categories with reference to the real (expected) ones. The global classification accuracy rises to 99%. Only the heavy vehicle in deceleration (cluster 9) is misclassified, being considered as a light vehicle in acceleration 67% of the time or in deceleration 33% of the time. This could be explained by the nature of the vehicle: a large utility vehicle, powered by a car-like engine, not a proper truck one. Note that the tuning of the C parameter allows some samples to be misclassified. The counterpart of having a good fit of the SVM on the overall data is that the heavy vehicle in deceleration is misclassified as in acceleration 67% of the time.
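The augmentation and normalisation of the dataset can be sketched as follows; the placeholder arrays stand in for the real features and labels.

```python
import numpy as np

# Placeholders: 70 pass-bys, 52 audio features, one driving condition each
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(70, 52))
conditions = rng.choice(["constant", "acceleration", "deceleration"], size=70)

# three binary driving-condition flags appended to the audio features
flags = np.stack([(conditions == c).astype(float)
                  for c in ("constant", "acceleration", "deceleration")], axis=1)
X = np.hstack([audio_feats, flags])      # 55 elements per pass-by
X = X / np.abs(X).max(axis=0)            # normalise by the maximum over the 70 pass-bys
```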

Urban sound scene
From the in situ experiment presented in Section 2.2.3, six recordings are available, distributed from 11:50 to 15:15. Three of them (11:50, 15:00 and 15:15) have been manually tagged in order to quantify the classification accuracy. This provides respectively 210, 246 and 83 vehicle trajectories (only the first 200 seconds have been tagged for the 15:15 recording) and forms what will be called the in situ database in the following. The first classification tests showed that the training dataset had to be enhanced by adding data from the in situ database to the test-track database. The distribution between the training and testing datasets is presented in Table III. Note that for some clusters the number of samples is very low. This is mainly due to the tracking method, which is not robust enough, but also to the low number of heavy vehicles passing during the measurements. For the third category, the two-wheeler trajectories were usually lost when they were overtaking another vehicle at idle. Note that the 52 audio features are normalised by the same values (the maxima of the test-track database) and that the driving condition is deduced from the vehicle trajectory.
As discussed in Section 2.4, the C parameter can be a scalar (same value for all clusters) or a vector (one value per cluster). After a parametric study, the best classification results give 82% of accurate classification. They are obtained for C = [6, 0.6, 0.6, 1.5, 1.5, 0.3, 6.6, 2.4, 3.3]. For more details, the confusion matrix is given in Figure 8. The good results are mainly due to the good classification of the light vehicles: from 70 to 93% of accuracy. The other clusters are often confused with the equivalent driving-condition clusters of light vehicles. We can also see that the vehicles of clusters 2, 3 and 9 are always misclassified. This can be explained by the lack of data for both learning and testing in those clusters. We can finally note that heavy vehicles in acceleration (cluster 8) are confused half of the time with light vehicles in acceleration (cluster 5) and 25% of the time with light vehicles at constant speed.
This in situ classification test is promising and the performance appears consistent with the literature [15,16,28,29]. This work confirms that the audio signal includes the cues to identify the vehicle types and driving conditions, just as humans can.

Urban sound scene - Perceptual categories
A study carried out by Morel et al. [31] proposes a multi-criterion typology of road traffic pass-by noises. During a free clustering task, they found that the subjects were grouping pass-by noises into clusters that can mainly be explained using two criteria: the vehicle type and the vehicle driving condition. They were also interested in the influence of a third criterion, the road morphology, but it turned out not to be significant in the clustering process.
During the free clustering task, the subjects gathered the light and heavy vehicles in both constant speed and deceleration. As a result, their proposed perceptual clustering of the road traffic is explained by seven clusters, detailed in Table IV.
We propose here to modify the previous classification method to use Morel et al.'s perceptual clusters on the same training and testing sets as those used in Section 3.2. After a parametric study, the best success rate is found to be 84% with the relaxation parameters C = [2.38, 1.68, 0.42, 0.9, 0.14, 0.14, 0.53].
The confusion matrix is detailed in Figure 9. It shows some good results, especially for clusters 1, 3 and 6: the classification rates for these clusters are equivalent to those obtained in Section 3.2 (previously labelled categories 1, 4 and 5 respectively). We can see that the two-wheelers in deceleration (category 4) are still classified as light (or heavy) vehicles at constant speed (category 3). We can also notice improvements: two-wheelers in acceleration are now well classified 40% of the time versus 0% previously, and the classification accuracy of heavy vehicles in acceleration (cluster 7) rises from 25 to 50%. Nevertheless, the classification rate decreases from 70 to 64% for cluster 5, because it includes the decelerating heavy vehicle that is still classified as passing by at constant speed.
It is interesting to notice that the classification accuracy rises when these perceptual categories are used. This is not only because of the category merging (which can decrease the error) but also because the classification of some categories, such as the powered two-wheelers, improves.

Monitoring over daytime
The classification method is applied to all the ten-minute recordings acquired in one day at the Saint-Bernard Quay. Figure 10 shows the number of detected vehicles during the six available measurements, with the cluster estimated by the method. The data presented in this section are labelled by their starting acquisition time. The tracking method detected 539 vehicles during the acquisition time.
First, we can see that the number of detected vehicles decreases from midday to 15:00 and then rises at 15:15; we can suppose that it continues rising later in the afternoon. Also, as in Section 3.3, the two-wheelers in deceleration (cluster 4) are never detected, because of the camera position. We can also see that there is no major evolution of the distribution over the clusters, except at 15:00, when clusters 5 and 7 (light and heavy vehicles in deceleration) are less detected. We can finally notice that the road traffic is highly dominated by clusters 3 (light and heavy vehicles at constant speed) and 6 (light vehicles in acceleration), with few vehicles in deceleration (categories 4 and 5). This can be explained by the street configuration (see Figure 3): three lanes located after a traffic light and only one before.
Figure 11 shows, with boxplots, the evolution of the distribution of the equivalent sound level by perceptual cluster. The sound level is calculated over the pass-by duration: one second for the fastest vehicles and five seconds for the accelerations and decelerations. Note that cluster 4 is never detected and therefore not represented. We can notice a tendency: the lowest noise levels are emitted by the light and heavy vehicles decelerating (category 5). This is not the case for the 15:00 ten-minute recording. When analysing the distribution precisely, it appears bi-modal, with two sets of values centred on 61 dB and 78 dB. This bi-modality seems to be due only to the integration time and thus to the type of deceleration: short and stopping quickly (high equivalent level) or long and idling in front of the array (low equivalent level).
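For reference, the equivalent level used here can be sketched as follows; this is a minimal helper whose name and interface are hypothetical.

```python
import numpy as np

def leq_db(p, fs, duration, p_ref=20e-6):
    """Equivalent sound level (dB) of an extracted pass-by signal.

    `duration` is the integration time in seconds: 1 s for the fastest
    pass-bys, 5 s for accelerations and decelerations, as stated above.
    """
    seg = p[: int(duration * fs)]
    return 10.0 * np.log10(np.mean(seg ** 2) / p_ref ** 2)
```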
It also appears that the highest noise levels can be attributed to the two-wheelers passing by at constant speed (cluster 1) and to the heavy vehicles accelerating (cluster 7). For this last cluster, at 15:15, we have a large variation range due to two sets of values centred on 64 dB and 87 dB. It is interesting to note that this cluster has a fairly constant number of vehicles during the day (between 6 and 7, except for the 15:00 experiment) but with an important variation of noise levels.
Moreover, we can notice that the variation range for each cluster is small until 13:25, reflecting a homogeneous traffic; then, large variations in the noise levels and an important number of outliers appear. This can reflect the increasing diversity of the road traffic for those periods (different types of heavy vehicles or different numbers of cylinders for the two-wheelers).
With these results, we can see that the analysis by perceptual category does not directly imply a reduced variability of the noise level. Thus, both the traffic segmentation provided by the classification task and the associated noise level are complementary in an urban noise monitoring system.

Conclusion
In this study, we have been interested in proving the feasibility of monitoring the urban road traffic from the sound field radiated by the vehicles (internal combustion vehicles only). To do so, large arrays of microphones have been implemented to spatially filter a sound scene. This was achieved thanks to a dedicated beamforming algorithm coupled with a video tracking method, able to extract the audio signal of each passing-by vehicle. Appendix A1 presents the method validation with simulations and an isolated vehicle experiment. The in situ spatial filtering gain is also investigated.
Once the signal is extracted, a classification step has been designed with Support Vector Machines using MFCCs as audio features. As learning samples, the MFCCs were complemented by the driving-condition information based on the video tracking algorithm. This led to 99% of accurate classification on isolated vehicles.
Then, the application to a real urban sound scene has been presented, with 539 detected vehicles. A maximum of 82% of accurate classification has been reached; the remaining errors were mainly due to the lack of data for the two-wheelers in acceleration and deceleration and for the heavy vehicles in deceleration. To reach this performance, the learning database is based on the isolated vehicle database but also on manually tagged pass-bys from the urban in situ measurements. An adaptation is finally proposed to classify over perceptual clusters; it increased the accurate classification rate to 84%. Finally, an application to the six available ten-minute recordings has been presented. It allows analysing the noise level for each perceptual category over the day. It mainly pointed out that the noisiest vehicles, for this measurement place (St-Bernard Quay in Paris city centre), were the two-wheelers at constant speed and the heavy vehicles in acceleration. The least noisy category was found to be almost always the light and heavy vehicles in deceleration.
This method appears to give a good knowledge of the road traffic composition. Nevertheless, it could be improved by adding samples to the training and testing datasets, especially signals of two-wheelers and heavy vehicles decelerating, but also of two-wheelers in acceleration. Note that the microphone array is easy to use and can be adapted to most urban situations (e.g. by attaching it to balconies). Even though the current one, at St-Bernard Quay, is very challenging, the results are already satisfying.
Subsequently, some improvements could be investigated. The video tracking step could be improved to raise the number of detected vehicles. This could be done using a remote camera with a better field of view, or by coupling the video tracking with a tracking system based on an acoustic image. In addition, in the presence of leaves, the method could be modified by doing the tracking step on an acoustic image rather than on the video. The performance should not be noticeably degraded, as the effect of the leaves should be limited to very high frequencies.
The array geometry of the in situ experiment could also be improved in order to allow a better source separation between the traffic lanes. Furthermore, since the audio signal of each vehicle is extracted, different kinds of metrics could be computed to better assess short-term noise annoyance in urban environments, such as loudness or annoyance itself, thanks to different models [32,33]. There is no clear consensus in the literature on the link between the audio signal and long-term annoyance (as used by WHO in the estimation of DALYs), but we assume that short-term annoyance would probably explain a larger part of the variance of long-term annoyance than any metric derived from the instantaneous or equivalent sound level.

Appendix A1. Beamforming validations
We propose here some validation cases for the beamforming formalism on moving sources. The propagation and beamforming model is first tested on simulated data. An adaptation is then proposed to extract the audio signal without the energy compensation, so that the output signal sound level is the one recorded by the microphones. This beamforming model is then validated on a similar experiment. Finally, the extraction performance is investigated.

A1.1. Model accuracy
Based on the propagation and beamforming models presented in Section 2.3, a simulation with a source moving at 50 km/h is proposed. It is done with a monopolar source emitting a 2 kHz pure tone with an amplitude of 1 Pa (90.9 dB SPL). The simulation lasts 3 s, allowing the source to travel 41.6 m. The signals recorded by a linear microphone array in the configuration presented in Figure 4 (with h = 16.5 m) are simulated. The array is the same as the one used for the Saint-Bernard Quay experiment (presented in Section 2.2.3): it is 21.6 m long with 128 microphones regularly spaced 17 cm apart. The simulation is done at a sample rate of F_s = 50 kHz and computed in the time domain, rounding the reception time t_r to the closest sample. Figure A1a shows the spectrogram of the simulated signal at the central microphone, where the Doppler effect is visible. Moreover, the amplitude varies from 62 to 66 dB as the source approaches the microphone.
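The simulated central-microphone signal can be sketched as follows; this is a NumPy illustration in which the nearest-sample assignment is a crude resampling used only to mimic the rounding described above.

```python
import numpy as np

# Monopole at 50 km/h emitting a 2 kHz tone of 1 Pa amplitude,
# observed by the central microphone at h = 16.5 m
fs, c0 = 50000, 343.0
t_e = np.arange(0, 3.0, 1.0 / fs)                 # 3 s of emission time
v = 50.0 / 3.6                                    # 50 km/h in m/s
src_x = v * t_e - 20.8                            # source travels 41.6 m along x
r = np.hypot(src_x, 16.5)                         # distance r_{mi,e} to the mic
mach_r = (-src_x) * v / (r * c0)                  # M_a cos(theta) towards the mic

s = np.sin(2 * np.pi * 2000.0 * t_e)              # source signal s_i(t_e), 1 Pa
idx = np.rint((t_e + r / c0) * fs).astype(int)    # rounded reception samples
p = np.zeros(idx.max() + 1)
p[idx] = s / (r * (1.0 - mach_r) ** 2)            # pressure from equation (1)
```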
Figure A1b shows the spectrogram of the beamformed signal. We can see that the beamforming recovers the initially emitted sound field: the amplitude remains constant at 90.9 dB and the frequency shift is compensated.
This simulation shows a good accuracy of the beamforming in estimating the sound field radiated by a single moving source. However, we are interested in removing the energy compensation, so that the output signal sound level is the one recorded by the microphones. Indeed, by doing so, our results are more comparable to the noise maps (which provide L_den at the facade) and to the city dwellers' experience. This is done by removing the distance factor r_mi,e in equation (2), so that the extracted signal is obtained with equation (3).
Figure A2 shows the spectrogram of the resulting beamformed signal for the simulation presented above. By comparing it with the spectrogram of the initial signal (Figure A1a), we can see that the evolution of the energy is conserved (from 63 to 66 dB as the vehicle approaches).

A1.2. Validation on experimental data
The set-up and the beamforming method are validated on a pass-by measurement carried out during the test-track experiment presented in Section 2.2.2. To do so, a loudspeaker emitting a 2 kHz pure tone was set up on a car passing by at 20 km/h. Figure 1a showed the tracking step for this experiment. As can be seen in this figure, the size of the moving object is over-estimated at the edge of the frame because it takes the shadows into account.
Figure A3 shows the spectrograms of the recorded and beamformed signals, referenced to their maximum. In Figure A3a, we can notice both the frequency shift of the loudspeaker signal and the broadband noise produced by the tyre/road contact.
The beamformed signal spectrogram is presented in Figure A3b. The de-Dopplerisation seems not to be perfect in the first and last seconds of the signal. This is due to the mis-positioning of the source when it enters and leaves the camera frame. Indeed, the beamforming is done on the centroid of the rectangle including the moving object, so that we focus on the shadows and the engine when the car enters the frame, and on the exhaust pipe and the shadows when it leaves it. This involves a wrong speed estimation when the vehicle is not entirely in the camera frame. But when it is, we can see that the signal is well de-Dopplerised and the background noise is reduced, proving the good estimation of the vehicle position with respect to the microphone array.

A1.3. Extraction performances
The performance of this technique in filtering a sound scene is now investigated. For this purpose, during the in situ experiment (with the 128-microphone array), a loudspeaker emitting a 1 kHz pure tone was set up at different places. Figure A4 shows the power spectral density of the central microphone signal and that of the beamformed signal when the source is placed at 26.7 m from the array centre. We can see that initially the loudspeaker is barely audible because of the energy of the other sound sources (road traffic). But thanks to the array dimension (21.6 m long) and to beamforming, we can see (orange curve) that the background noise is reduced by 20 dB around 1 kHz. This gain reduces as the frequency decreases, reflecting the fact that the array resolution is frequency dependent, such that at very low frequencies (under 50 Hz) the technique seems not to provide any filtering gain.
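Such a gain curve can be sketched by comparing the power spectral densities of the two signals, e.g. with Welch's method (an assumption: the paper does not state its estimator); the placeholder arrays below stand in for the recorded and beamformed signals.

```python
import numpy as np
from scipy.signal import welch

fs = 50000
rng = np.random.default_rng(0)
p_center = rng.normal(size=10 * fs)   # placeholder for the central microphone
s_hat = 0.1 * rng.normal(size=10 * fs)  # placeholder for the beamformed output

f, psd_mic = welch(p_center, fs=fs, nperseg=8192)
f, psd_bf = welch(s_hat, fs=fs, nperseg=8192)
gain_db = 10.0 * np.log10(psd_mic / psd_bf)   # background-noise reduction (dB)
```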

Figure 1. (Colour online) Detected moving vehicles (green rectangles) and trajectory extraction (red lines) for isolated vehicles or a real urban situation. (a) Isolated vehicle on a test track (test-track experiment), (b) vehicles in a Parisian street (in situ experiment).

Figure 2. Geometry of the test-track microphone array.

Figure 3. (Colour online) Scheme of the experiment configuration. The linear microphone array (red dot) is set up over a 3×1 lane street in Paris.

Figure 4. Configuration example with an N_m-microphone linear array and the i-th sound source moving rectilinearly at speed V at t = t_e.

Figure 5. Number of pass-by measurements in each cluster.

Figure 6. Confusion matrix in percentage of vehicles per category. Global classification accuracy: 88%.


Figure 7. Confusion matrix in percentage of vehicles per category, driving conditions added to the dataset. Global classification accuracy: 99%.

Figure 8. Confusion matrix in percentage of vehicles per cluster. Global classification accuracy: 82%.

Figure 11. Boxplots of noise levels over the different ten-minute acquisitions (labelled by their starting acquisition time) by perceptual cluster. Note that cluster 4 is never detected and is not represented. The limits of the rectangles represent the first and third quartiles, so that 50% of the data is included in this range. The red triangle symbolises the mean value, the black vertical bar the median value and the black circles the outliers.

Figure A4. Power spectral density of the central microphone signal (in blue) and of the signal beamformed on the 1 kHz pure-tone loudspeaker. Distance between source and central microphone: 26.7 m, V = 0 km/h.

Table I. Characteristics of the vehicles for the track tests. hv stands for heavy vehicle, lv for light vehicle and twv for two-wheeler vehicle.

Table II. Classification clusters.

Table III. Number of pass-by measurements by category for the training and testing datasets. The training dataset is based on test-track and in situ measurements.

Table IV. Morel et al. perceptual clusters.