Robust Time Series Denoising with Learnable Wavelet Packet Transform

Signal denoising is a key preprocessing step for many applications, as the performance of a learning task is closely related to the quality of the input data. In this paper, we apply a signal-processing-based deep neural network architecture, a learnable extension of the wavelet packet transform. The main advantages of this model are its few parameters, its intuitive initialization and its strong learning capabilities. Moreover, we show that the parameters of the model can easily be modified after the training step to tailor it to different noise intensities. Two case studies are conducted to compare this model with state-of-the-art and commonly used denoising procedures. The first experiment uses standard signals to study the denoising properties of the algorithms. The second experiment is a real application with the objective of removing audio background noises. We show that the learnable wavelet packet transform has the learning capabilities of deep learning methods while maintaining the robustness of standard signal processing approaches. More specifically, we demonstrate that our approach maintains excellent denoising performance on signal classes separate from those used during the training step. Moreover, the learnable wavelet packet transform was found to be robust when different noise intensities, noise varieties and artifacts are considered.


Introduction
Real-world signals are often corrupted by noise which needs to be removed before any further analysis or processing step. Several approaches for signal denoising have been proposed, including dictionary learning [1], empirical mode decomposition [2], singular and higher-order singular value decomposition [3], [4] and canonical polyadic decomposition [5]. Wavelet-based methods in particular are considered an essential tool for multi-resolution and time-frequency analysis [6]. They often provide relevant features to monitor industrial systems with time signals [7], [8], or can be used for data augmentation [9]. The wavelet shrinkage operation, theoretically investigated in [10], is still considered one of the most powerful tools to perform signal denoising in many fields. Thus, wavelet packet transform (WPT) denoising has been used recently [11], [12] due to its ability to denoise regular frequency bands of desired size and to remove backgrounds with a specific frequency content.
To perform wavelet denoising, several hyperparameters need to be set. These include the threshold used for wavelet shrinkage, the thresholding function and the wavelet family considered for the decomposition. Several heuristics have been proposed to address threshold selection [13], such as the universal threshold [10], Stein's unbiased risk estimation or the Bayesian shrink method [14]. However, selecting the wavelet family and the correct heuristic requires specialized knowledge to remain robust to the complexity of real data. Recent work instead learns or automates the best hyperparameter configuration by supervised learning from a training dataset. In [13], a genetic algorithm is used to find the best wavelet denoising strategy for EEG denoising. These methods benefit from the recent growth of storage capacities and computing power, which allows ever larger amounts of data to be collected to form the training dataset.
For supervised denoising, deep learning methods have recently made significant progress, particularly in application areas with large amounts of data, such as image denoising [15] or speech enhancement [16]. The use of deep neural networks (NNs) for denoising in domains where data is more specific and difficult to collect, such as biological signals, remains an open challenge [17], [18]. The main supervised denoising architectures based on deep learning include the convolutional neural network (CNN) [19], the convolutional auto-encoder (AE) [20] and the U-Net [21]. However, these deep models contain a large number of parameters to be trained, and may lose efficiency when applied to an industrial dataset with a limited number of examples [22].
Recently, deep architectures inspired by signal processing approaches have been proposed [23]. The main advantages of these approaches are finding more meaningful CNN filters, gaining interpretability and reducing the number of parameters. In this work, we combine two of the main signal denoising methodologies, namely WPT denoising with wavelet shrinkage and deep auto-encoder denoising. Thus, we use a WPT-based deep learning architecture with learnable activation functions mimicking wavelet shrinkage, referred to as Learnable WPT (L-WPT). The method used here is a relaxed version of the L-WPT architecture of [24] with improved learning capabilities. This is the first application of the L-WPT to a denoising task. The advantage of this architecture is threefold: a) It is based on a very powerful signal processing approach to obtain a time-frequency representation with optimal resolution. This provides our L-WPT algorithm with considerable learning capabilities using only a few parameters compared to standard deep learning methods [25], [24]. b) The L-WPT contains only interpretable parameters that can be adapted manually if the operating conditions change. c) We propose an intuitive initialization of the parameters to make the behavior of the L-WPT as close as possible to the standard WPT.
As a second contribution, we demonstrate in this work how our L-WPT relates to the universality of signal processing methods and the learning capabilities of deep learning approaches. This highlights the advantage of combining signal processing and deep learning methods. Specifically, we evaluate on the one hand how well the L-WPT can specialize and learn the particularities of the training dataset, and on the other hand how well it is able to generalize to information and artifacts that are different from those contained in the training data.
After presenting the related work in Section 2, we provide the necessary background on WPT in Section 3. The L-WPT for signal denoising is introduced in Section 4. A comparative study between the proposed L-WPT and several deep NNs is made using a standard model for signal denoising in Section 5. Finally, the performance of the L-WPT is highlighted in the real case of background removal in Section 6 before concluding.

Related work
Wavelet denoising: Wavelet shrinkage consists, after the application of a wavelet transform, in removing the low-amplitude coefficients associated with noise. Different types of wavelet transforms have been proposed: any orthogonal wavelet transform can be used [6], as well as iterative methods such as the discrete wavelet transform (DWT). The latter provides a representation of a signal in frequency bands of different temporal resolution. The DWT has been applied, among others, to biomedical signals [13], [12] and partial discharge applications [11]. Unlike the DWT, the Wavelet Packet Transform (WPT) has the advantage of denoising on frequency bands of the same width. WPT denoising has been applied in various fields including speech enhancement [26], [27], noise detection in bio-signals [28] and atomic force microscopy image denoising [29]. The WPT has proven to be particularly suitable for removing backgrounds with a specific frequency content. However, setting the thresholds for each frequency band is a challenging task [30], [31].
Deep-learning-based supervised denoising: The multilayer perceptron can be considered the most basic deep neural network architecture. In the context of speech enhancement, this architecture has proven to be less robust and more difficult to train, due to its high number of parameters, than other methods such as convolutional neural networks (CNNs) [21], [32]. The CNN uses the convolution operation: only local and sparse connections between the input and output of each layer are considered. This reduces the number of parameters considerably and facilitates the learning process. CNN-based denoising methods have been used in several applications such as ECG denoising [18], speech enhancement [19] or image denoising [15]. In [33], a CNN is used to separate sound events from the background in the time-frequency domain to improve sound classification. The auto-encoder version of the CNN, the convolutional auto-encoder (AE), has been one of the most widely used architectures for denoising. The AE aims to produce an input-like representation using an encoder and a decoder. The objective of the encoder is to find an embedding of the input data that contains the important information, eliminating noise. A noise-free signal based on the embedding is then estimated with the decoder [34]. It was used for speech recognition in [35]. Several AE-based denoising methods were successfully applied in the context of Prognostics and Health Management [36], [37]. In [20], an auto-encoder is used to denoise vibration signals from a bearing dataset. The denoising step helps in monitoring the condition of the bearing by improving fault diagnosis. In [38], a denoising AE is used to improve the prediction of the remaining useful life of aircraft engines, and in [39] it was used for defective wafer detection in a semiconductor manufacturing process. One challenge encountered by AEs is the potential loss of important information with the increasing number of layers.
Adding skip connections to create a U-Net architecture [40] can potentially tackle this problem. As shown in the comparative study in [21], the U-Net architecture proposed in [41] has the best overall performance compared to several other architectures without skip connections. Several recent works on image denoising also demonstrate the good performance of U-Net denoising [42].
Deep NN architectures inspired by signal processing: Several approaches that include signal processing elements in NNs have already been proposed. In [23], a CNN layer with kernels constrained to match bandpass filters is used for speaker recognition. In [43] and [44], wavelet transforms and NNs are combined for ECG signal enhancement. The first learnable extensions of the WPT focused on learning a single best filter to use throughout the entire architecture, as in [45], [46], [47], [48]. The first model to generalize filter learning to each layer was proposed in the context of the Discrete Wavelet Transform (DWT) [49]. The authors in [25] proposed the DeSpaWN model, which additionally adds activation functions that perform automatic denoising. DeSpaWN has been successfully applied to classification and anomaly detection tasks for audio signals. An extension of the DeSpaWN architecture to the WPT, called L-WPT, was proposed in [24]. The L-WPT shows a better performance than DeSpaWN on the same anomaly detection task.

Wavelet Packet Transform (WPT)
The discrete WPT, introduced in [50], projects the signal on uniform frequency bands of desired size. The WPT has a multi-level structure and can be considered a multi-resolution analysis since the output of the current level is recursively used as input to the next level. The basic block of a WPT applied to an input signal y is:

$$y_{lp} = \downarrow_2 (y * h_{lp}), \qquad y_{hp} = \downarrow_2 (y * h_{hp}), \qquad (1)$$

where * is the convolution operation, $\downarrow_2$ denotes the sub-sampling by two, and $y_{lp}$ (or $y_{hp}$) corresponds to the low- (or high-) pass filtered input data with a cut-off frequency of π/2 followed by a sub-sampling by two. This transformation doubles the frequency resolution (the frequency content of each wavelet coefficient spans half the input data frequency) to the detriment of a halved time resolution ($y_{lp}$ and $y_{hp}$ each contain half the number of samples in y). By applying the same block Eq. (1) on $y_{lp}$ and $y_{hp}$, we then obtain four outputs that divide the frequency content of the input signal into four even bands. The underlying algorithm behind the WPT has a tree structure characterised by L layers, corresponding to the number of times we apply the block Eq. (1) to the outputs of the previous layer. We refer to the nodes as the succession of a filtering and a sub-sampling operation. The outputs from the $2^L$ nodes at layer L form the time-frequency representation of our signal y. A perfect reconstruction of the WPT of a signal from any layer L is possible. This operation is called inverse WPT (iWPT) and is possible only if the filters $h_{lp}$, $h_{hp}$ and the transposed filters follow the conjugate mirror conditions [51], [52].
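As an illustration, the basic block Eq. (1) can be sketched in a few lines of numpy. The circular boundary handling and the Haar filter pair are assumptions made here for simplicity; any conjugate-mirror pair could be substituted.

```python
import numpy as np

def circ_conv(x, h):
    """Circular convolution of signal x with kernel h (an assumed boundary rule)."""
    T, K = len(x), len(h)
    return np.array([sum(h[k] * x[(t - k) % T] for k in range(K)) for t in range(T)])

def wpt_block(y, h_lp, h_hp):
    """One WPT block: low- and high-pass filtering followed by sub-sampling by two."""
    y_lp = circ_conv(y, h_lp)[::2]
    y_hp = circ_conv(y, h_hp)[::2]
    return y_lp, y_hp

# Haar conjugate-mirror pair, used here as a simple example
h_lp = np.array([1.0, 1.0]) / np.sqrt(2.0)
h_hp = np.array([1.0, -1.0]) / np.sqrt(2.0)

y = np.array([1.0, 1.0, 2.0, 2.0])
y_lp, y_hp = wpt_block(y, h_lp, h_hp)
```

Each output branch contains half the samples of the input, and for an orthogonal pair the energy of the input is preserved across the two branches, consistent with the halved time resolution described above.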

Denoising with WPT
Signal denoising is one of the major applications of wavelet analysis [6]. It has been shown that a wavelet transform leads to a sparse decomposition of regular and structured signals [51]. We can then assume that the noise will correspond to wavelet coefficients of small amplitude. Several procedures eliminating the small coefficients of a WPT already exist [6], [10]. The two most commonly applied approaches use the soft- and hard-thresholding operators [10]. Soft-thresholding appears to be more adequate for image denoising with small signal-to-noise ratios (SNR). We propose in this paper to study only the hard-thresholding (HT) operation to eliminate the low coefficients of the WPT. Considering a WPT with L layers, and $y_L^i(t)$ one of the obtained coefficients at node i, the HT operation with threshold value λ corresponds to:

$$HT_\lambda\big(y_L^i(t)\big) = \begin{cases} y_L^i(t) & \text{if } |y_L^i(t)| > \lambda, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

The operation has to be performed for all coefficients with index t, and all nodes i of the layer L. An estimation of the denoised input signal can then be computed by applying the iWPT to the thresholded coefficients.
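A minimal sketch of the hard-thresholding step; in the full procedure this rule is applied coefficient-wise to every node of layer L before the iWPT.

```python
import numpy as np

def hard_threshold(coeffs, lam):
    """Keep a coefficient only if its magnitude exceeds the threshold lam."""
    return np.where(np.abs(coeffs) > lam, coeffs, 0.0)

# example: coefficients of one node, with an illustrative threshold
c = np.array([0.05, -0.3, 1.2, -0.08, 0.5])
c_ht = hard_threshold(c, 0.1)
```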

Adapted denoising
One of the major problems of the HT denoising method presented in Section 3.2 is that it does not adapt to the frequency of the input signal. This can be problematic if the background noise we want to remove from the pure signal has a structured frequency content. Some methodologies adapting the thresholding value according to the frequency have been previously proposed [30], [31], [13]. We propose to go further by applying an adapted activation function with learnable biases to each node of the entire tree structure of the WPT algorithm. The proposed activation function will perform a learnable thresholding eliminating the coefficients related to noise.
In order to have a differentiable thresholding function, we adopt the double sharp sigmoid activation function proposed in [25], denoted as $\eta_\gamma(x)$:

$$\eta_\gamma(x) = x\,\big(\sigma(\alpha(x-\gamma)) + \sigma(-\alpha(x+\gamma))\big), \qquad (3)$$

with σ the sigmoid function, α a sharpness constant, and γ the learned bias acting as a threshold on both sides of the origin.
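A possible numpy sketch of this activation; the value of the sharpness constant `alpha` is an assumption of this sketch (a large value makes the smooth gate approach hard thresholding at ±γ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eta(x, gamma, alpha=50.0):
    """Double sharp sigmoid: passes |x| well above gamma, suppresses |x| below it."""
    return x * (sigmoid(alpha * (x - gamma)) + sigmoid(-alpha * (x + gamma)))
```

Note that for γ = 0 the two sigmoids sum to one and the activation reduces to the identity, which is exactly the property exploited by the initialization of Section 4.4.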

L-WPT: an autoencoder model inspired by WPT
The proposed L-WPT methodology is an instance of auto-encoders, where the encoding part mimics the tree-structured algorithm of the WPT and the decoding part mimics the inverse tree structure of the iWPT. Since the WPT provides a sparse representation only for structured signals respecting specific properties [51], which do not always hold in real applications, we propose to learn the WPT filters. The idea is to find an adapted sparse representation of the signal of interest which, combined with the proposed denoising activation function Eq. (3), will be able to better convert a potentially complex noise into low coefficients that can then be eliminated.
Thus, considering the encoding part, the filter used in each node is replaced by a convolutional layer with stride two followed by the denoising activation function Eq. (3). To obtain the coefficients at layer l and node i (denoted $y_l^i$), we convolve the coefficients at the previous layer with the kernel of the current node, denoted by $\theta_l^i$, and apply the activation function Eq. (3). It can be written as:

$$y_l^i(t) = \eta_{\gamma_l^i}\Big(\sum_k \theta_l^i(k)\, y_{l-1}^{\lfloor i/2 \rfloor}(2t - k)\Big), \qquad (4)$$

where $\lfloor \cdot \rfloor$ corresponds to the floor function. The activation function is applied to each coefficient of $y_l^i$ with the learnable bias value $\gamma_l^i$. An illustration comparing the operations performed in a WPT node and an L-WPT node is shown in Figure 1. For the decoding part, we only replace the filters by a transposed convolutional layer with stride two. A denoised estimation of the coefficients at layer l and node i (denoted $\hat{y}_l^i$) can be computed from the denoised estimations of the higher layers and two kernels denoted $\beta_{l+1}^{2i}$ and $\beta_{l+1}^{2i+1}$. It can be written as:

$$\hat{y}_l^i = \beta_{l+1}^{2i} *^\top \hat{y}_{l+1}^{2i} + \beta_{l+1}^{2i+1} *^\top \hat{y}_{l+1}^{2i+1}, \qquad (5)$$

where $*^\top$ denotes the transposed convolution with stride two. Considering these notations, the input signal can be denoted as $y_0^0$, the output signal as $\hat{y}_0^0$, and we have $y_L^i = \hat{y}_L^i\ \forall i \in \{0, ..., 2^L - 1\}$.
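One encoder node can be sketched as a strided convolution followed by the denoising activation. The circular boundary handling, the sharpness constant `alpha` and all function names are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eta(x, gamma, alpha=50.0):
    # learnable denoising activation (sketch with a fixed sharpness alpha)
    return x * (sigmoid(alpha * (x - gamma)) + sigmoid(-alpha * (x + gamma)))

def lwpt_encoder_node(y_prev, theta, gamma):
    """Convolve the parent coefficients with kernel theta at stride two,
    then apply the denoising activation with bias gamma."""
    T, K = len(y_prev), len(theta)
    out = np.array([sum(theta[k] * y_prev[(2 * t - k) % T] for k in range(K))
                    for t in range(T // 2)])
    return eta(out, gamma)

# with a Haar kernel and a zero bias, the node reduces to a plain WPT step
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
y = np.array([1.0, 1.0, 2.0, 2.0])
```

With `gamma = 0` the node behaves like the corresponding WPT filtering step, while a large bias drives all coefficients to zero, which is the mechanism the training exploits to suppress noise-dominated nodes.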

Adaptable weights
Since our architecture is inspired by a signal processing methodology, we have a good understanding of the purpose of each part of the network, and how a modification of the parameters impacts the output signal. The bias of the activation function is meaningful and is used to threshold low amplitude coefficients associated with noise. After learning the kernels and bias, the L-WPT can be applied to denoise signals under different operating conditions and noise levels. Thus, it is always possible to modify the biases afterwards if, for example, the noise level increases or decreases.
We propose a simple modification of the biases, called the δ modification. A trained L-WPT with the δ modification is denoted as L-WPT-(δ). This modification consists simply in multiplying each bias value by δ:

$$\gamma_l^i \leftarrow \delta\, \gamma_l^i, \qquad \forall i \in \{0, ..., 2^l - 1\} \text{ and all layers } l. \qquad (6)$$

δ is chosen according to the variation of the noise level with respect to the training data (see the background suppression application, Section 6.2).

Intuitive weight initialisation
One of the advantages of the proposed method is that there are intuitive initializations of the filters and biases such that the initial L-WPT behaves in a similar way to a standard WPT. Consider a kernel of size K + 1, with K an odd number and k ∈ {0, ..., K}, denoted as $h_{PR}$, which satisfies the conjugate mirror property (examples of such kernels include wavelet families like Daubechies, Haar or Coiflets). The kernels of the encoding and decoding parts, for all layers l and nodes i, are initialised with $h_{PR}$ for the low-pass (even-indexed) nodes and with its alternating flip, $(-1)^k h_{PR}(K-k)$, for the high-pass (odd-indexed) nodes [52], [51]. Finally, the denoising activation function has to be replaced by a linear function, which is achieved by initialising all biases $\gamma_l^i$ to 0.
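The classical alternating-flip construction behind this initialisation can be sketched as follows; Haar is used as the example $h_{PR}$ for brevity, whereas the experiments in this paper use db4.

```python
import numpy as np

def mirror_filter(h_pr):
    """High-pass kernel derived from a conjugate-mirror low-pass kernel via
    the classical alternating-flip rule g[k] = (-1)^k h_pr[K - k]."""
    K = len(h_pr) - 1
    return np.array([(-1) ** k * h_pr[K - k] for k in range(K + 1)])

# Haar example of a conjugate-mirror pair
h_lp = np.array([1.0, 1.0]) / np.sqrt(2.0)
h_hp = mirror_filter(h_lp)
```

The resulting pair is orthogonal and unit-norm, which is what guarantees the perfect-reconstruction behavior of the initialised network before any training.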

Objective function and training
We denote by s a pure signal, and $\tilde{s} = s + b$ the same signal corrupted by a background noise b. Thus, assuming that we have a set of pure signals and background noises, we look for the kernels of the encoding part $[\theta]$, the kernels of the decoding part $[\beta]$ and the biases $[\gamma]$ minimising the following loss function:

$$\mathcal{L} = \frac{1}{T}\sum_{t}\big(\hat{s}(t) - s(t)\big)^2,$$

where $\hat{s} = \hat{y}_0^0$ is the reconstructed signal from the input data $\tilde{s} = y_0^0$. We use the Adam optimiser [53] with a learning rate of 0.0005 and a batch size of 8 to train the L-WPT. The number of epochs is set to 500 and the learning rate is divided by 10 after epochs 350 and 450 for a better convergence. We initialise the filters and biases as presented in Section 4.4, using for $h_{PR}$ the Daubechies wavelet with 8 coefficients (called db4).
Referring to $n_p$ as the number of trainable parameters and considering L layers and K + 1 coefficients per filter, the L-WPT has

$$n_p = 2\sum_{l=1}^{L} 2^{l}(K+1) + \sum_{l=1}^{L} 2^{l}$$

trainable parameters, where the first term counts the filter parameters of the encoding and decoding parts and the second the bias parameters.
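This count can be checked numerically; under the assumption that every node carries one encoder kernel, one decoder kernel and one bias, L = 5 layers with db4 kernels (K + 1 = 8 coefficients) match the 1054 parameters reported later for the 5-layer L-WPT.

```python
def lwpt_n_params(L, filter_len):
    """Trainable parameters of an L-WPT sketch: each of the 2^l nodes at
    layer l carries one encoder kernel, one decoder kernel and one bias."""
    nodes = sum(2 ** l for l in range(1, L + 1))
    return 2 * nodes * filter_len + nodes
```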

Denoising performances on standard functions
We compare the L-WPT performance to other deep NN architectures and wavelet shrinkage for denoising purposes. More details about these methods are given in sub-section 5.1. For a benchmark study, we consider standard function classes commonly applied in the denoising literature to evaluate the performance of denoising algorithms [10], [54], [2]. They mimic spatially variable functions arising in imaging, spectroscopy and other applications, and are presented in sub-section 5.2. We quantify in sub-section 5.3 how the L-WPT improves the denoising of signals from the training class, but also of signals of a different nature from other classes. This will demonstrate how the L-WPT relates to the learning capabilities of deep learning approaches if it outperforms the standard WPT denoising on the training class. Likewise, it will show how the L-WPT relates to the universality of signal processing if it generalises better than deep NNs. The robustness analysis of our method is extended by considering different noise levels in sub-section 5.4.

Comparison to other approaches
We compare our method with a signal processing approach and several deep NNs. This will allow us to position our method with respect to the learning capabilities of deep models and the robustness of signal processing approaches.
We compare our framework to the hard-thresholding wavelet shrinkage presented in Section 3.2. We call this method "Baseline-HT". We consider the deep NNs presented in Section 2: a standard CNN, a convolutional AE based on [20] and a U-Net model based on [55]. We provide in Appendix A.1 a methodology to select the best AE architectures from a set of pre-selected ones. In this case, the number of trainable parameters is large (n_p = 554954) compared to our L-WPT with 5 layers (n_p = 1054). To ensure that the results obtained are not mainly due to the difference in the number of parameters, causing the AE to overfit compared to the L-WPT, we also consider a similar AE architecture with the same number of parameters as the L-WPT. We refer to these two architectures as "AE-large" and "AE-small".
We derive the architectures of the U-Net and CNN models from the two obtained AE architectures. We refer to them as "U-Net-large", "U-Net-small", "CNN-large" and "CNN-small". The deep NNs are trained using exactly the same objective function optimisation parameters as the L-WPT (see Section 4.5).
An overview of the key parameters of the six different architectures is provided in Appendix A.2.

Model functions and noise
We use the following benchmark functions [10], named Block, Bumps, HeaviSine and Doppler. We randomly generate signal classes s inspired by these four function cases. The number of samples in each signal s is set to T = 2^13. More details about the generation of these functions are given in Appendix A.3.
The pure signals are corrupted by adding white Gaussian noise, with $\tilde{s}$ the corrupted signal. The corruption is performed as follows:

$$\tilde{s}(t) = s(t) + \frac{\sigma}{3}\, b(t),$$

where b(t) is a realisation of a standard normal distribution and σ is the noise level. The factor three is chosen to ease the interpretation of the noise level, i.e., if σ = 1 almost all noise realizations will be in the same amplitude range as the pure signal (≈ 99.7% chance of being below 3 in absolute value before scaling). In Figure 2, we display different realisations of pure signals and their corrupted counterparts with σ = 0.2.
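The corruption step can be sketched as follows; the generator seed and function name are illustrative.

```python
import numpy as np

def corrupt(s, sigma, rng):
    """Add white Gaussian noise scaled by sigma / 3, so that for sigma = 1
    roughly 99.7% of the noise samples stay within the amplitude range of
    the pure signal."""
    b = rng.standard_normal(len(s))
    return s + sigma * b / 3.0

rng = np.random.default_rng(0)
# corrupting a zero signal isolates the noise: its standard deviation is sigma / 3
s_tilde = corrupt(np.zeros(200000), 0.3, rng)
```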

Robust denoising
We compare the denoising properties of the L-WPT and the methods presented in Section 5.1. Each method is trained on each class separately. The training was done on 16000 realisations $\tilde{s}$ from one class with noise level σ = 0.2. For the L-WPT, we set the number of layers to L = 5. We refer to the class used for the training as $C_T$. Then $N_{test}$ = 500 test signals are generated for each class. We evaluate the performance with respect to three scores: 1) the specialisation score, denoted $S_p$, which shows how efficient a method is at denoising signals from $C_T$; 2) the robustness score $S_r$, which demonstrates how the denoising performances generalise to the other classes; and 3) the mean score over all test signals, noted $\bar{S} = (S_p + 3S_r)/4$, which captures a trade-off between good specialisation and robustness. With $\hat{s}$ the estimation of the pure signal s, the computation of $S_p$ and $S_r$ is as follows:

$$S_p = \frac{1}{N_{test}} \sum_{s \in C_T} \frac{1}{T}\|\hat{s} - s\|_2^2, \qquad S_r = \frac{1}{3}\sum_{C \neq C_T} \frac{1}{N_{test}} \sum_{s \in C} \frac{1}{T}\|\hat{s} - s\|_2^2.$$

In Table 1, the $S_p$, $S_r$ and $\bar{S}$ scores for the cases where each model is trained using the four classes separately are displayed. The AE-large model has the best specialization performance. However, the method does not generalise well to the other classes. This is reflected in its $S_r$ score for the Bumps and Doppler classes. Overall, the L-WPT has the best robustness score, even better than the Baseline-HT method, which is particularly adapted to Gaussian denoising. We can state that, for this experiment, the L-WPT keeps the robustness of a general, non-trainable denoising procedure like the Baseline-HT, but also learns a relevant denoising for the signals of the learned class. Figure 3 can be seen as a table of figures, where the columns provide a realisation of each class, and the rows provide the output of the L-WPT and AE-large when they are trained with one of the four classes. For each figure, the absolute error between the estimated and the pure signal through time is provided.
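The three scores can be sketched as follows, assuming the per-class mean squared errors have already been computed (the class names are those of the benchmark functions; the example values are illustrative):

```python
import numpy as np

def scores(per_class_mse, train_class):
    """S_p: mean MSE on the training class; S_r: mean MSE over the three
    other classes; S_bar averages over all four classes."""
    s_p = per_class_mse[train_class]
    others = [v for c, v in per_class_mse.items() if c != train_class]
    s_r = float(np.mean(others))
    return s_p, s_r, (s_p + 3.0 * s_r) / 4.0

mse_per_class = {"Block": 1.0, "Bumps": 2.0, "HeaviSine": 3.0, "Doppler": 4.0}
s_p, s_r, s_bar = scores(mse_per_class, "Block")
```

Note that with four classes, the weighting $(S_p + 3S_r)/4$ makes $\bar{S}$ equal to the plain mean over all classes.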
AE-large performs best when applied to the training class (the figures on the diagonal). However, it performs poorly when it is trained using another class. On the contrary, the L-WPT is more consistent regardless of whether it is applied on the training class or not.

Impact of the noise level
In practical applications, noise levels can change over time. This can for example occur under new operating conditions or if the training was done combining signals and background with specified SNR [21]. In our setup, the training noise level is fixed to σ = 0.2. We then quantify the performance of each method for different values of the noise level.
The $S_p$, $S_r$ and $\bar{S}$ scores are computed again when the noise level takes the values {0.1, 0.4, 0.6, 0.8, 1}. Since the thresholds learned for the L-WPT are fixed to perform well for σ = 0.2, there is no particular reason that it will continue to provide a relevant denoising for other noise levels. To adjust the weights, we perform the L-WPT-(δ) transformation introduced in Section 4.1 with δ = σ/σ_train = 5σ. Here, we assume we have a good estimation of the noise level for the new operating condition. In order to highlight the denoising performance of our L-WPT-(δ) method, we recompute the best threshold of the Baseline-HT method for each noise level. Figure 4 shows the specialisation and the robustness scores for each method, for the different noise levels and for each training class. We display the decimal logarithm of the scores in order to ease the reading of the graphs. The unmodified L-WPT performs poorly. However, the modified version with the weight adjustment outperforms all other methods. In the case of σ = 1, the L-WPT-(5σ) method can provide an up to 10-times better denoising capability compared to the deep NN models. The L-WPT-(δ) also outperforms the Baseline-HT method whose threshold was optimised for the new noise level. This demonstrates that the filters learned by the L-WPT are robust to higher noise levels and that only the biases need to be adjusted.
Figure 5 illustrates the denoising performance of L-WPT-(5σ) compared to the AE-large method for a test signal with σ = 1 (both methods are trained on the class on which they are also tested: specialization regime).

Application on audio background denoising
We evaluate the proposed L-WPT on denoising real signals. For this case study, we compare to the same methods as presented in subsection 5.1: the Baseline-HT, AE-large, AE-small, CNN-large, CNN-small, UNet-large and UNet-small. For the deep NN models, we use the same architectures as reported in Table B.5. The justification for this choice is provided in Appendix B.1. After presenting the dataset in subsection 6.1, the denoising performance of each method is reported in subsection 6.2 with respect to the robustness and specialisation scores. For this case, the SNR is known; an application under real conditions with unknown SNR is provided in subsection 6.3, where we also present a method to estimate the δ value of the δ transform. Finally, an analysis of the trained L-WPT for the airport background removal is provided in subsection 6.4.

Dataset
We consider the datasets from tasks 1 and 2 of the DCASE 2018 challenge [56], [57]. The first dataset provides acoustic scenes that are used as noisy backgrounds. We only consider the scenes of the airports of Barcelona, Helsinki, London, Paris and Stockholm. The second task provides a variety of 41 different foreground sound events such as "Cello", "Bus" or "Bark". We eliminated the "Trumpet" class in order to keep 40 different classes and ease the division of the dataset into folds. We consider only signals whose class label has been manually verified.
The background training signals are based on 102 ten-second recordings of the Barcelona airport scene only; the test backgrounds use 26 different recordings of the Barcelona airport or recordings of the other airport scenes. Moreover, the foreground training signals are based on 3610 recordings, different from the 1600 recordings used for the test classes.
For the audio signal generation of the foreground and background signals, we apply a similar strategy as in [21]. The recordings are downsampled to 8 kHz, cropped randomly in order to have signals with T = 2^13 samples, and normalized. For the foreground sounds, we apply padding and make sure that the random cropping does not select a null signal. We mix the foreground and background sounds by adding them; the signal-to-noise ratio in this case is 0 dB. For the training, new signals are continuously generated from the training recordings. 1600 test signals are generated from the test recordings. The methods are not trained on all the classes directly; the data is cut into 8 folds of 5 classes each. This aims to mimic real-world applications: a limited number of classes is collected and used to train the model. The trained model is then applied in a more general environment where we aim to eliminate the background also for classes of signals different from the training data.
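The mixing step can be sketched as follows. The generic `snr_db` parameter is an addition of this sketch; the paper mixes the normalised sounds at 0 dB, i.e. with foreground and background at equal power.

```python
import numpy as np

def mix_at_snr(foreground, background, snr_db=0.0):
    """Rescale the background so that 10*log10(P_fg / P_bg) equals snr_db,
    then add it to the foreground."""
    p_fg = np.mean(foreground ** 2)
    p_bg = np.mean(background ** 2)
    scale = np.sqrt(p_fg / (p_bg * 10 ** (snr_db / 10.0)))
    return foreground + scale * background

fg = np.array([1.0, -1.0, 1.0, -1.0])   # unit-power foreground
bg = np.array([2.0, 2.0, 2.0, 2.0])     # power-4 background
mixed = mix_at_snr(fg, bg)              # background rescaled to unit power
```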

Robust background removal
We consider the L-WPT with eight layers. This corresponds to the number of layers minimising the entropy of the WPT when applied to the pure signals of the first fold; entropy minimisation is a standard method to select the best number of layers [58].
The $S_p$, $S_r$ and $\bar{S}$ scores introduced in Section 5.3 are also applied in this case study, where the training classes C_train are the 5 classes of the current fold. We also consider the mean squared error between the estimated and pure signals obtained for each class separately. Figure 6 shows the normalised mean squared error obtained for each class when the L-WPT and the AE-large were trained with folds 1 and 2. The classes belonging to C_train are displayed in red; the difference between the scores obtained with the L-WPT and the AE-large is highlighted in green when the L-WPT performs best and in purple when the AE-large performs best. For this experiment, the L-WPT almost always outperforms the AE-large. The gap between the L-WPT and the AE-large MSE is reduced for C_train. For example, in fold 2, the MSE is reduced by 2.6 on average over all classes, whereas it is reduced only by 0.6 if we consider only the training classes. This shows how our method is able to generalize well to structured signals from classes different from the training dataset. Figure 7 shows an example of denoising with the L-WPT and the AE-large when they were trained with fold 1. The first two cases, the "Glockenspiel" and "Harmonica" signals, are cases where the L-WPT performs particularly well compared to the AE-large. The last case, "Drawer open and close", is a case where both methods perform similarly. In this last case, the AE-large performs better at eliminating the noise alone; however, the L-WPT is slightly more accurate in reconstructing the sound of interest (as indicated by the absolute error).
In Table 2, the mean $S_p$, $S_r$ and $\bar{S}$ scores over the 8 folds and for each method are provided. The scores are obtained for the Barcelona airport scene background sounds and for the other airport sounds. We recall that the training data use only background sounds from the Barcelona airport. Because the background noise contains specific frequency content, the Baseline-HT method is no longer adapted, since it denoises each frequency band in the same way; it therefore gives poor results. Overall, the L-WPT outperforms all other methods. For the airport case, the gap between the specialisation score and the robustness score is 0.5, which is low. In contrast, the AE-large has a gap of 2.4 between the specialisation score and the robustness score, which is comparably large. This demonstrates again that the L-WPT has the learning capabilities of deep NNs while keeping the universal properties of signal processing. Considering the backgrounds of the other airports, the trend is similar, with the L-WPT outperforming the other methods in terms of specialisation and robustness scores. It also shows that there are not many differences between the background sounds of different airports.

Table 2: Specialisation score ($S_p$), robustness score ($S_r$) and mean score ($\bar{S}$) over the 8 folds when each method is trained with the Barcelona noise only.

Figure 6: Performances of the AE-large and the L-WPT for each class when they are trained using three different folds. The classes belonging to C_train are in red; the difference between the scores obtained via the L-WPT and the AE-large is highlighted in green when the L-WPT works best, in purple when the AE-large works best.

Real conditions application with unknown SNR
Each method was trained on corrupted signals with a fixed SNR. However, the denoising performance can degrade if the methods are applied, in a real application, to sounds with a different SNR. Indeed, depending on the location of the sensor in the airport or on the recording time, the SNR can differ significantly. Thus, we now consider different SNRs for the test signals. For this, we use the non-normalised airport background noises. Since their raw value ranges are too low compared to the signals of interest, we multiplied them by 200. This value is chosen so that the majority of signals have a lower SNR than in the training case. This corresponds to a situation where the denoising can be impaired.
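To make the effect of this scaling concrete, the SNR can be measured as the energy ratio between the clean signal and the background in decibels. The following sketch (assuming 1-D NumPy arrays; the function name is ours, not from the paper) shows that multiplying the noise by a factor k lowers the SNR by 20·log10(k) dB:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10*log10(E_signal / E_noise)."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(noise**2))

# Scaling the background by k shifts the SNR by -20*log10(k) dB,
# e.g. k = 200 lowers it by about 46 dB.
```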
We impose that the first 2000 samples of each test signal contain only background noise. Thereby, it is possible to evaluate the δ value for the L-WPT-δ transformation. The δ value is defined as the ratio between the norm of the first 2000 samples and the average norm of 2000 randomly selected samples from the background of the training dataset. The idea is to assess whether the background energy of the current recording is higher or lower than that of the training dataset. Figure 8 shows the histogram of the δ values obtained over the 1600 recordings of the test dataset. On average, the value of δ is larger than one. This means that the SNR for the test dataset is negative. Table 3 shows the S_p, S_r and S results for each of the applied deep NN architectures. The extended L-WPT-δ outperforms the other methods with respect to each of the scores. This shows again that the L-WPT learns kernels that are robust to different noise levels. Thus, the L-WPT can be easily adapted to different operating conditions by modifying the biases only.
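The δ estimation described above can be sketched as follows. This is a minimal illustration, assuming the recordings are 1-D NumPy arrays, that "norm" means the Euclidean norm, and that the average over the training background is approximated by random draws; the function and parameter names are ours:

```python
import numpy as np

def estimate_delta(test_signal, train_backgrounds, n=2000, n_draws=100, seed=0):
    """Estimate the noise-level ratio delta for the L-WPT-delta modification.

    test_signal: 1-D array whose first n samples contain only background noise.
    train_backgrounds: list of 1-D arrays of training background noise.
    """
    rng = np.random.default_rng(seed)
    # Norm of the noise-only prefix of the current recording.
    test_norm = np.linalg.norm(test_signal[:n])
    # Average norm of randomly selected n-sample segments of the training background.
    norms = []
    for _ in range(n_draws):
        bg = train_backgrounds[rng.integers(len(train_backgrounds))]
        start = rng.integers(len(bg) - n + 1)
        norms.append(np.linalg.norm(bg[start:start + n]))
    return test_norm / np.mean(norms)
```

A value of δ > 1 indicates that the current background is stronger than during training, so the shrinkage biases can be scaled accordingly.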

Details of the L-WPT training
In this section, we compare the filtering provided by the L-WPT and the baseline-HT in the context of airport background noise removal. The impact of the baseline-HT denoising strategy on cosines of different frequencies and amplitudes is presented in the left panel of Figure 9. It shows the gain score, defined as the ratio of the norm of the output signal to the norm of the input cosine, as a function of the amplitude and the frequency of the input. The frequency range goes from 0 Hz to the maximum frequency (2^12 Hz), and the amplitude range goes from 0 to 1.5. When cosines have too low an amplitude, they are interpreted as noise and are not reconstructed, which corresponds to a gain score of 0. On the contrary, cosines with a high amplitude are perfectly reconstructed and have a score of 1. Some imperfections are present since the applied filters are not ideal.

Table 3: Specialisation score (S_p), robustness score (S_r) and mean score (S) over the 8 folds when each method is trained with the Barcelona noise at 0 dB; the test dataset has varying noise SNRs.

The two middle panels of Figure 9 visualise the gain scores for the first two folds. We can see that, even though each L-WPT was trained using signals from different classes, the gain score images appear relatively similar. In comparison to the baseline-HT method, the denoising is adapted to the signal frequency. The right panel of Figure 9 shows the average spectrum (in absolute values and in decibels (dB)) of the training background noise. We can see that it mostly contains low-frequency content, from 0 to 800 Hz. It is interesting to note that the gain score for cosines with a low frequency content, from 0 to 800 Hz, remains null up to higher amplitudes than for cosines with a higher frequency. This shows how the L-WPT learned to suppress the background content. For the highest frequencies (>3500 Hz), the background content is almost null, and it turns out that the gain scores are more heterogeneous from one fold to another. This implies that, for non-corrupted frequency bands, the L-WPT was better able to specialise to the training fold.
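The gain-score probe described above can be sketched as follows. This is an illustration under our own assumptions: `denoise` stands for any trained denoiser (L-WPT, AE, or baseline-HT) applied as a function on a 1-D array, and the sampling rate is taken as 2^13 Hz so that the maximum frequency is 2^12 Hz:

```python
import numpy as np

def gain_score(denoise, amplitude, freq, fs=2**13, n=2**13):
    """Gain score of a denoiser on a pure cosine: ||output|| / ||input||.

    A score of 0 means the cosine was interpreted as noise and suppressed;
    a score of 1 means it was perfectly reconstructed.
    """
    t = np.arange(n) / fs
    x = amplitude * np.cos(2 * np.pi * freq * t)
    y = denoise(x)
    return np.linalg.norm(y) / np.linalg.norm(x)

# Sweeping amplitude and frequency yields 2-D gain images like those of
# Figure 9, e.g.:
# scores = [[gain_score(model, a, f) for f in freqs] for a in amps]
```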

Conclusion
In this paper, we propose to combine two of the main signal denoising tools: wavelet shrinkage using the wavelet packet transform and supervised denoising with a convolutional autoencoder. Our proposed learnable WPT (L-WPT) is interpretable, relying on the signal processing properties of the WPT while being able to learn the specifics of the training dataset. Moreover, it is able to generalize to classes different from those of the training dataset. It has an intuitive parameter initialization that allows it to start as a standard wavelet packet transform. In addition, we propose a powerful post-training modification of the weights, called the δ-modification. This modification is only possible because the meaning of each parameter of the architecture is known. It makes it possible to adapt the denoising to different noise levels resulting from different operating conditions. The L-WPT is compared with deep supervised models and with WPT denoising in two experiments. It was first applied to test functions often used in the denoising literature. The L-WPT was able to learn a denoising specific to the signals of the training class. Furthermore, we demonstrated that it retains the robustness of a universal signal processing procedure by testing it on noisy signals outside the training class. We also showed that our method is robust to different types and levels of noise thanks to the δ-modification. Finally, the L-WPT was applied to a background suppression task and performed better than the other methods. We also provide a recommendation for using the δ-modification in a real application, which was shown to be effective for background denoising under variable SNR.
This work opens several doors for future directions. First, further research on the δ-modification or related modifications should be conducted. For example, learning the L-WPT on different noise levels would show whether the kernels remain similar; this would tell us how close to optimal the δ-modification is. Another future direction would be to use the time-frequency representation of the L-WPT as a feature for a supervised task instead of the WPT features. On the application side, it would be interesting to apply our approach in the context of speech enhancement with a fixed background. Finally, generalizing our approach to multi-dimensional signals would allow its application to image denoising.

Table: Specialisation (S_p), robustness (S_r) and mean (S) scores for the Block function ("Function") and for the first fold of the background denoising case ("Background").

Acknowledgments
• The HeaviSine class: For this case, we fix N_b = 4; the frequency variables f_i and the phase variables φ_i are both different realisations of a normal distribution.
• The Doppler class: where pad is a zero-padding function adding t_p zeros at the beginning of the signal; the function is then inverted for half of the realisations and cropped so that it contains exactly T samples. The padding variable t_p is generated by selecting a random number from 0 to T/2, and the power variable z is generated by selecting a random number from 0 to 10. In order to keep the signal values in the same range for each class, we normalize each realisation between 0 and 1 by performing the transformation s_norm = (s − s_min)/(s_max − s_min), with s_min and s_max respectively the minimum and the maximum value of the current realisation s.

Table B.5: The configuration of the six implemented neural network architectures. We use the following abbreviations: convolutional layer (Conv), transposed convolutional layer (T-Conv), linear activation function (Lin act), stride value set to 1 (No stride), skip connection of the output to the related T-Conv layer (SC).
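The min-max normalization applied to each realisation can be written as a one-line helper (a minimal sketch, assuming the realisations are 1-D NumPy arrays with non-constant values):

```python
import numpy as np

def minmax_normalize(s):
    """Rescale a realisation to the [0, 1] range using its minimum and maximum."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())
```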