Data augmentation for speech separation

Deep learning models have advanced the state of the art in monaural speech separation. However, the performance of a separation model decreases considerably when it is tested on unseen speakers and noisy conditions. Separation models trained with data augmentation generalize better to unseen conditions. In this paper, we conduct a comprehensive survey of data augmentation techniques and apply them to improve the generalization of time-domain speech separation models. The augmentation techniques include seven source-preserving approaches (Gaussian noise, Gain, Time masking, Frequency masking, Short noise, Time stretch, and Pitch shift) and three non-source-preserving approaches (Dynamic mixing, Mixup, and CutMix). After a hyperparameter search for each augmentation method, we test the generalization of the augmented models by cross-corpus testing on three datasets (LibriMix, TIMIT, and VCTK), and identify the best augmentation combination that enhances generalization. Experimental results indicate that a combination of several non-source-preserving strategies (CutMix, Mixup, and Dynamic mixing) yields the best generalization performance. Finally, the augmentation combinations also improved the performance of the speech separation model even when fewer training data are available.


Introduction
Speech separation is the task of isolating two or more overlapping speech utterances from a mixed speech signal with multiple speakers talking simultaneously. The mixed speech signal can be further corrupted by environmental noise which can make the separation task more difficult. A robust separation model would benefit applications such as automatic speech recognition, hearing aids and voice assistants.
In comparison to multi-channel approaches that exploit the spatial information of sound sources (Wang, 2014; Wang and Cavallaro, 2018), single-channel speech separation is a more challenging task (Wang and Chen, 2018). Deep neural networks (DNNs) are at the forefront of speech separation and can be broadly categorized into time-frequency domain approaches (Kolbaek et al., 2017; Hershey et al., 2016; Wang et al., 2018b; Williamson et al., 2015; Wang et al., 2022) and time-domain approaches (Luo and Mesgarani, 2018; Luo and Nima Mesgarani, 2019; Chen et al., 2020; Nachmani et al., 2020). Time-frequency approaches convert the mixture waveform into the time-frequency domain using the short-time Fourier transform (STFT), separate time-frequency features for each source, and then reconstruct the source waveforms by inverse STFT. They usually use the original phase of the mixture to synthesize the estimated source waveforms, which therefore retain the phase of the noisy mixture (Kolbaek et al., 2017). Data augmentation has been extensively used to improve generalization (Ko et al., 2015; Wei et al., 2018) and to prevent overfitting (Hendrycks et al., 2020; Wei et al., 2020). Performance improvements can be obtained when a model is trained with multiple augmentations with optimal hyperparameters (Lim et al., 2019; Zhang et al., 2020; Cubuk et al., 2020). AutoAugment and its extensions (Lim et al., 2019; Zhang et al., 2020; Cubuk et al., 2020; Ho et al., 2019) primarily focused on training a model that learns a combination of augmentations and their hyperparameters. However, training such a model (e.g. AutoAugment) requires extensive computational resources as opposed to using a predefined set of augmentations.
In this paper, we conduct a comprehensive survey of data augmentation strategies and empirically evaluate the ability of ten methods to improve the generalization performance of time-domain separation models. The contributions and novelty of the paper are summarized below.
• First, we apply variants of Mixup (Alex et al., 2021) and CutMix (Yun et al., 2019), which were originally proposed in the computer vision domain, to the speech separation problem, and achieve the top two performances among all the individual augmentation methods.
• Second, we conduct an ablation study to identify the best hyperparameters for each individual augmentation method.
• Third, we improve the generalization performance by empirically searching for the best strategy for combining various augmentation methods. We identify that the combination of Mixup, CutMix and Dynamic mixing augmentation gives the best generalization result.
We apply the augmentation strategies to the training of two popular time-domain speech separation models: DPRNN and ConvTasNet (Luo and Nima Mesgarani, 2019), and evaluate the performance of the augmented speech separation model via intra-corpus and cross-corpus testing on three speech datasets (LibriMix (Cosentino et al., 2020), TIMIT (Garofolo, 1993) and VCTK (Veaux et al., 2016)).
The paper is organized as follows: We discuss related works on data augmentation in Section 2, and formulate the problem in Section 3. We adapt the various augmentation methods to speech separation in Section 4, and evaluate the generalization performance in Section 5. Finally, conclusions are presented in Section 6.

Speech separation in noisy environments
In recent years, the introduction of the WHAM! (Wichern et al., 2019) and LibriMix (Cosentino et al., 2020) datasets has accelerated research on speech separation in noisy environments (Gao et al., 2017). The WHAM! dataset is an extension of the WSJ0-2mix (clean) dataset with ambient noise samples from bars, restaurants and coffee shops. However, Cosentino et al. (2020) highlighted the performance drop when models trained on WSJ0-2mix and WHAM! were evaluated on other datasets, such as their proposed LibriMix dataset. Models trained on LibriMix were reported to have lower generalization error than models trained on WHAM! and WSJ0-2mix (Cosentino et al., 2020). This improvement was attributed to the larger size of the dataset, the variability in recording conditions and the presence of a larger and more diverse range of unique speakers in the LibriMix corpus. Despite the improvement in generalization when using LibriMix (Cosentino et al., 2020) for training, there is still a lack of generalization when evaluating outside its test corpus, especially in unseen noisy conditions. Additionally, WHAM! (Wichern et al., 2019), VCTK (Veaux et al., 2016) and LibriMix (Cosentino et al., 2020) all use the same noise corpus (WHAM! (Wichern et al., 2019)) in their test subsets, and thus they do not provide a thorough test of generalization.
A summary of the deep learning models tested on the WHAM! and LibriMix datasets is presented in Table 1. Fig. 1 presents the trade-off between the performance and the number of parameters for speech separation models in clean (WSJ0-2mix (Hershey et al., 2016)) and noisy (WHAM! (Wichern et al., 2019)) conditions. For both clean and noisy conditions, time-domain models largely tend to outperform frequency-domain models. Wavesplit (Zeghidour and Grangier, 2021) was reported to have the best separation performance in both clean and noisy environments on the WHAM! and LibriMix datasets. However, Wavesplit uses speaker IDs as additional information during training and also has a significantly higher number of parameters (29M) compared to other time-domain models (Luo and Nima Mesgarani, 2019; Chen et al., 2020). On the other hand, DPTNet has the best performance vs. model parameters trade-off. However, DPTNet has a very high training time compared to other state-of-the-art time-domain separation models, e.g. DPRNN, which makes it impractical for research works with extensive ablation experiments.
Furthermore, from Table 1 we can infer that using cascaded variants of the TasNet and Deep CASA models results in only 0.6 and 1 dB SI-SNRi performance improvement while doubling the number of parameters, which is further highlighted in Fig. 1 by an ''L'' shape.

Augmentations
Data augmentations have been extensively used in varying machine learning domains (e.g. vision (Shorten and Khoshgoftaar, 2019) and audio (Wei et al., 2020)). Data augmentation can encode additional priors beyond the ones introduced by the choice of model architecture by altering/enhancing the available training data, which can enhance the robustness of a model to unseen conditions. Data augmentation for speech separation can be divided into source-preserving and non-source-preserving augmentations. Most separation augmentation approaches are source-preserving in nature, i.e. the augmentation is only applied to the input mixtures and the ground-truth sources are maintained after an augmentation operation is applied (e.g. SpecAugment (Park et al., 2019)). Non-source-preserving augmentation modifies both the input mixture and its ground-truth sources. An example of non-source-preserving augmentation is Mixup (Zhang et al., 2018a), which linearly mixes both the input and its ground truth. Table 2 provides a summary of various augmentations.
A. Alex et al.
A mini-survey of data augmentation techniques for audio (Wei et al., 2020) reported source-preserving augmentations to have a very limited impact on improving audio classification accuracy. The tested source-preserving augmentations included Gaussian noise, Time stretch, Pitch shift, and Time and Frequency masking (SpecAugment (Park et al., 2019)). In contrast, the non-source-preserving augmentations SpecMix (Kim et al., 2021), Mixup and their proposed Mixed Frequency Masking yielded the largest improvements: 1.14 and 1.28 percentage points of mean average precision over the baseline, respectively, which is still not very substantial.
Some augmentation techniques used in the audio domain have been borrowed from the vision domain. For example, SpecAugment (Park et al., 2019), where random bands of time and frequency bins are masked, is very similar to Cutout (DeVries and Taylor, 2017) and Random erasing augmentation. Yun et al. (2019) combined Cutout (DeVries and Taylor, 2017) and Mixup (Zhang et al., 2018a) and proposed CutMix (Yun et al., 2019), where instead of removing data points and replacing them with zeros (Cutout (DeVries and Taylor, 2017)) or Gaussian noise (Summers and Dinneen, 2019), they are replaced with data points from other training examples. Drawing inspiration from CutMix, Kim et al. (2021) proposed the SpecMix augmentation for audio classification and enhancement, where the augmentation is applied to spectral features. They reported both SpecAugment and Mixup to perform worse than un-augmented models for the speech enhancement task, while SpecMix slightly improves enhancement performance. This is interesting, as speech enhancement is closely related to speech separation: both refer to the task of separating the signal of interest from a mixture that can be corrupted with another speech signal (speech separation), noise (speech enhancement), or both (Wang and Cavallaro, 2020; Mukhutdinov et al., 2023). However, one cannot assume that an augmentation operation that does well for enhancement tasks will do well for separation tasks and vice versa. For example, mixed sample data augmentation (Harris et al., 2021) approaches such as Mixup, which involves adding other speech utterances to the training sample, could have a different impact on a model that is primarily trained to remove noise (speech enhancement) than on a separation model whose primary task is to separate speakers from the mixture.
Similar to Mixup, Between-class learning (Tokozume et al., 2018a), originally proposed for sound recognition (Tokozume et al., 2018b), involves mixing two samples belonging to different classes with a random ratio; the mixed sample is then input to a model that is trained to predict the mixing ratio. Additionally, Between-class learning was reported to constrict the shape of the feature distribution, which helps improve the generalization of the model (Tokozume et al., 2018a). Along similar lines, in SamplePairing (Inoue, 2018) new training samples are generated by overlaying the target image with another image from the training dataset. However, unlike Mixup, SamplePairing uses the label of the target image, therefore not mixing labels (Inoue, 2018). Also, the weight with which the target image is mixed with the other image is fixed in SamplePairing, unlike Mixup, where the weights are randomly drawn from a beta distribution (Zhang et al., 2018a). Similar to the aforementioned works, Smart augmentation (Lemley et al., 2017) takes multiple training samples from the same class as input to a generative model that outputs new training data chosen to reduce the validation loss of the model designed for the underlying task. Although this strategy is employable for classification tasks, it is not feasible for the source separation task, which is a regression problem.
Problem formulation

The model processes the input signal x(t) in short segments. Let a time-domain segment be x̄ = [x(1), …, x(L)]ᵀ, where L is the length of the segment and (⋅)ᵀ represents the transpose. The separation model is trained in a mini-batch style. Each mini-batch contains B segments of speech mixture, i.e. X = [x̄_1, …, x̄_B], where x̄_i = [x_i(1), …, x_i(L)]ᵀ. The corresponding ground truth is represented as S = [S̄_1, …, S̄_B], where S̄_i contains the C source waveforms of the i-th mixture. We apply augmentation to the training data to improve the generalization of the model. Data augmentation on a mini-batch is shown in Fig. 2(a). The speech mixture and the ground truth post-augmentation can be represented as X_aug and S_aug, respectively. Using the augmented mixture X_aug as input, the separation network generates a mini-batch of predicted waveforms Ŝ, which can be represented similarly as Eq. (4). The model is trained to minimize the loss between the ground truth S_aug and the prediction Ŝ; the loss function is the negative SI-SNR, where, for a ground-truth source s and prediction ŝ, the scale-invariant signal-to-noise ratio (SI-SNR) is defined as (Luo and Nima Mesgarani, 2019)

SI-SNR(ŝ, s) = 10 log₁₀ ( ‖s̃‖² / ‖ŝ − s̃‖² ),  where s̃ = (⟨ŝ, s⟩ / ‖s‖²) s

and ⟨ŝ, s⟩ denotes the inner product. The whole procedure is illustrated in Fig. 2(a). We aim to find the best augmentation strategy that improves the generalization of the speech separation model (e.g. DPRNN in Fig. 2(b)). All augmentations are applied with a probability of 0.5 unless stated otherwise (see Section 5.1).
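The SI-SNR computation above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code; the `si_snr` name, the `eps` stabilizer and the zero-mean normalization are our own assumptions (the latter is common practice in SI-SNR implementations):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    # Zero-mean both signals so the measure ignores DC offsets.
    est = est - est.mean()
    ref = ref - ref.mean()
    # s_tilde: projection of the estimate onto the reference (optimal scaling).
    s_tilde = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_tilde
    return 10 * np.log10(np.dot(s_tilde, s_tilde) / (np.dot(e_noise, e_noise) + eps))
```

Because the projection s̃ scales linearly with the estimate, multiplying ŝ by any nonzero constant leaves the value unchanged; this is the scale invariance that SI-SNRi exploits in Section 5.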

Source-preserving augmentations
In this subsection, we present seven traditional source-preserving augmentations: Gaussian noise, Gain, Time masking, Frequency masking, Short noise, Time stretch, and Pitch shift. Fig. 3 depicts example results obtained from these augmentation methods.

Gaussian noise
Gaussian noise augmentation adds Gaussian noise to the mixture. Adding Gaussian noise to the mixture during training can reduce the model's performance sensitivity to mixtures with Gaussian noise. Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = x̄_i + a n̄, where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; n̄ is Gaussian noise drawn from the standard normal distribution and a is the amplitude, which is randomly sampled from a uniform distribution as a ∼ U(a_min, a_max). The ablation of the hyperparameters (a_min, a_max) will be given in Table 4.
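A minimal NumPy sketch of this augmentation follows. The function name and the default amplitude range are illustrative (the paper ablates the range in Table 4, with a_max = 0.015 performing best); the sources are left unchanged, as this is a source-preserving augmentation:

```python
import numpy as np

def gaussian_noise_aug(mix, a_min=0.001, a_max=0.015, rng=None):
    """Add zero-mean Gaussian noise with a randomly sampled amplitude."""
    rng = rng or np.random.default_rng()
    a = rng.uniform(a_min, a_max)           # amplitude a ~ U(a_min, a_max)
    noise = rng.standard_normal(mix.shape)  # n ~ N(0, 1)
    return mix + a * noise
```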

Gain
Gain augmentation varies the amplitude of the mixture randomly with a scaling factor. This augmentation aims to increase the robustness of the model to loudness variations of the mixture. Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = g x̄_i, where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; g is the scaling factor randomly drawn from a uniform distribution as g ∼ U(g_min, g_max). The ablation of the hyperparameters (g_min, g_max) for Gain augmentation can be seen in Table 4.
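As a sketch, the operation is a single random scaling. The function name and the (g_min, g_max) defaults are illustrative, not the paper's settings:

```python
import numpy as np

def gain_aug(mix, g_min=0.5, g_max=1.5, rng=None):
    """Scale the mixture by a random gain g ~ U(g_min, g_max)."""
    rng = rng or np.random.default_rng()
    g = rng.uniform(g_min, g_max)
    return g * mix
```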

Time masking
Time masking masks out consecutive time steps from the waveform (Park et al., 2019). Time masking augmentation can make the separation model robust in scenarios where a small segment of the audio signal is dropped while recording. Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = x̄_i ∘ m̄, where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; m̄ is a binary mask of the same shape as x̄_i, and ∘ denotes the element-wise product.
The L-element vector m̄ is defined as m(t) = 0 for t₀ ≤ t < t₀ + t₁ and m(t) = 1 otherwise, where t ∈ [0, L − 1]; t₁ is the duration of the value-0 segment and is set as t₁ ∼ U(0, qL), and t₀ is the start index of the value-0 segment and is set as t₀ ∼ U(0, L − t₁). Here q ∈ [0, 1] is the maximum length of the value-0 segment as a fraction of the whole length of the segment. Ablation of the hyperparameter q is given in Table 4.
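A minimal NumPy sketch of this masking, assuming q = 0.20 (the best value in Table 4) and one mask per call; the function name is ours:

```python
import numpy as np

def time_mask_aug(mix, q=0.20, rng=None):
    """Zero out one random contiguous span of at most q*L samples."""
    rng = rng or np.random.default_rng()
    L = len(mix)
    t = rng.integers(0, int(q * L) + 1)  # mask length t1 ~ U(0, q*L)
    t0 = rng.integers(0, L - t + 1)      # mask start t0 ~ U(0, L - t1)
    mask = np.ones(L)
    mask[t0:t0 + t] = 0.0
    return mix * mask
```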

Frequency masking
Frequency masking augmentation masks out consecutive frequency bins from the mixture (Park et al., 2019). Using frequency masks can increase the robustness of the separation model when separating mixtures recorded with a cheap microphone or with the speaker at a distance, which can cause the audio signal to have little bass. Conversely, when the microphone recording the mixture is very close, the low frequencies will dominate the spectrum, which is similar to the high-frequency components being masked. Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = BS(x̄_i), where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; BS(⋅) denotes band-stop filtering. The stop band [f₀, f₀ + f₁] lies in [0, f_s/2], with f_s being the sampling rate; f₁ is the width of the stop band and is set as f₁ ∼ U(0, q f_s/2), and f₀ is the start frequency of the stop band and is set as f₀ ∼ U(0, f_s/2 − f₁). The band-stop filter is implemented as a Butterworth filter of order 6 (a default value in the Python SciPy library).
Here q ∈ [0, 1] is the maximum width of the stop band as a fraction of the whole frequency range. Ablation of the hyperparameter q will be given in Table 4.
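A SciPy sketch of this band-stop masking follows. It is an illustration under stated assumptions, not the paper's implementation: the function name, the 1 Hz corner-frequency clamp and the degenerate-band no-op are ours, while the order-6 Butterworth filter and q = 0.77 follow the text:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def freq_mask_aug(mix, sr=8000, q=0.77, rng=None):
    """Band-stop a random band in [0, sr/2] (a time-domain 'frequency mask')."""
    rng = rng or np.random.default_rng()
    nyq = sr / 2
    width = rng.uniform(0, q) * nyq       # stop-band width f1 in Hz
    f0 = rng.uniform(0, nyq - width)      # stop-band start frequency in Hz
    lo = max(f0, 1.0)                     # keep corner frequencies valid
    hi = min(f0 + width, nyq - 1.0)
    if hi <= lo:                          # degenerate band: no-op
        return mix
    sos = butter(6, [lo, hi], btype="bandstop", fs=sr, output="sos")
    return sosfiltfilt(sos, mix)          # zero-phase filtering
```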

Short noise
The Short noise augmentation adds a short burst of noise samples from the noise set of Xu et al. (2015) to the mixture. This can help in situations where a short burst of noise occurs within a mixture. Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = x̄_i + n̄″, where x̄_i and x̄_aug,i are the original and augmented mixture, respectively, and n̄″ is the additive short noise. The additive short noise is generated as follows. Suppose we have a short noise segment n̄. We first add fade-in and fade-out effects to the noise signal, n̄′ = n̄ ∘ ḡ, where ḡ is a gain vector whose first K_in elements ramp linearly from 0 to 1 and whose last K_out elements ramp linearly from 1 to 0. The fade lengths K_in and K_out are sampled from uniform distributions with ranges (40, 640) and (80, 800) samples, respectively, which are the default values in the Audiomentations library. We then scale n̄′ so that it is added to x̄_i at a target SNR, with the SNR sampled uniformly from [SNR_min, SNR_max]. Finally, the faded and scaled noise sample n̄″ is added to the mixture at a start time t_s, which is randomly sampled from a uniform distribution.

Time stretch

Time stretch augmentation speeds up or slows down the mixture without changing the pitch, which can make the separation model robust to varying speed perturbations in the audio signal (Arakawa et al., 2019). Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = TS(x̄_i, r), where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; TS(⋅) is a phase vocoder (Laroche and Dolson, 1999) that changes the playing speed of the signal with a ratio r, i.e. speeding up for r > 1 and slowing down for r < 1.
The rate parameter r is drawn randomly from a uniform distribution as r ∼ U(r_min, r_max). It is important to note that we choose the rate parameter such that the mixture is not heavily sped up or slowed down, as that would destroy the semantics of the mixture while training the network against the ground-truth speakers present in the mixture. Ablation of the hyperparameters (r_min, r_max) has been given in Table 4.
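The Short noise procedure of Section 4.1.5 (fade ramps, SNR scaling, random placement) can be sketched in NumPy as follows. The function names are ours, and computing the reference power over the whole segment is an assumption of this sketch:

```python
import numpy as np

def fade_noise(noise, n_fade_in, n_fade_out):
    """Apply linear fade-in/fade-out ramps to a short noise burst."""
    gain = np.ones(len(noise))
    gain[:n_fade_in] = np.linspace(0.0, 1.0, n_fade_in)
    gain[-n_fade_out:] = np.linspace(1.0, 0.0, n_fade_out)
    return noise * gain

def short_noise_aug(mix, noise, snr_db, rng=None):
    """Add a (pre-faded) noise burst at a random position at a target SNR."""
    rng = rng or np.random.default_rng()
    # Scale the burst so that 10*log10(P_mix / P_noise) == snr_db.
    p_mix = np.mean(mix ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_mix / (p_noise * 10 ** (snr_db / 10)))
    t0 = rng.integers(0, len(mix) - len(noise) + 1)  # random start time
    out = mix.copy()
    out[t0:t0 + len(noise)] += noise
    return out
```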

Pitch shift
Pitch shift augmentation varies the pitch of the mixture. Varying the pitch of the mixture during training can make the model robust to speakers whose voices have varying pitch characteristics (Arakawa et al., 2019). Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = PS(x̄_i, s), where x̄_i and x̄_aug,i are the original and augmented mixture, respectively; PS(⋅) is the Pitch shift operation given the semitone parameter s.
Pitch shift has a parameter called bins per octave, which is set to 12 so that one step equals one semitone. We shift the waveform by a number of steps s, which is randomly drawn from a uniform distribution as s ∼ U(s_min, s_max), where s_min and s_max are the minimum and maximum semitones.
The Pitch shift operation is conducted as follows. We first compute a rate parameter as r = 2^(s/12). The rate parameter is used to get a time-stretched signal as described in Section 4.1.6, i.e. x̄′ = TS(x̄_i, r).
Finally, we resample x̄′ with a sampling rate of f_s′ = f_s/r to obtain the pitch-shifted signal.
Ablation of the hyperparameters (s_min, s_max) will be given in Table 4.
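The semitone-to-rate relation above follows the standard equal-temperament mapping and can be written as a one-line helper (the function name is ours):

```python
def semitones_to_rate(n_steps, bins_per_octave=12):
    """Pitch-shift ratio: shifting by n_steps semitones corresponds to
    time-stretching by rate = 2**(n_steps / bins_per_octave) and then
    resampling back to the original sampling rate."""
    return 2.0 ** (n_steps / bins_per_octave)
```

For example, a shift of +12 semitones (one octave up) gives a rate of 2, so the signal is stretched to twice the speed and then resampled to restore its duration.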

Non-source preserving augmentations
In this section, we present three non-source-preserving augmentation strategies: one existing method (Dynamic mixing) and our extensions of two methods (Mixup and CutMix).

Dynamic mixing
Dynamic mixing augmentation has been used as an augmentation technique in audio speech separation (Subakan et al., 2021). This augmentation strategy attempts to expose the model to a wider range of mixtures. Instead of using fixed training data, Dynamic mixing creates new mixtures from available training data on the fly for each epoch.
We select C unique source segments ŝ_1, …, ŝ_C from the speech corpus to create the augmented mixture as their sum. The sources are selected by randomly sampling C distinct indices from [1, N], where N is the total number of mixtures in the speech separation dataset; here N = 13,900, the number of mixtures in the train-100 split of the LibriMix (Cosentino et al., 2020) dataset. Thus, during each epoch, the model sees new training data instead of having fixed training data for each epoch. Dynamic mixing augmentation is applied with a probability p for each sample in the mini-batch. Ablation of the impact of p has been presented in Table 4.
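The on-the-fly remixing can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's pipeline: the function name is ours, the corpus is simplified to equal-length source signals, and gain normalization of the summed sources is omitted:

```python
import numpy as np

def dynamic_mix(source_corpus, n_spk=2, rng=None):
    """Create a fresh mixture by summing n_spk distinct source utterances
    drawn at random from the corpus (array/list of equal-length signals)."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(source_corpus), size=n_spk, replace=False)
    sources = [source_corpus[i] for i in idx]
    mixture = np.sum(sources, axis=0)
    return mixture, sources  # the new ground truth travels with the mixture
```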

Mixup
Mixup enhances the available training distribution by creating augmented examples from the training mini-batch. Mixup is a domain-agnostic augmentation technique that can increase the model's robustness to mixtures with babble-like noises, as it involves convex combinations of pairs of mixtures and their sources (Zhang et al., 2018a). We propose two variations of Mixup augmentation: Complete Mixup (CP), which generates an augmented mixture and ground truth, and Data-only Mixup (DO), which generates an augmented mixture only (Alex et al., 2021).
For Complete Mixup, let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = λ x̄_j + (1 − λ) x̄_k and S̄_aug,i = λ S̄_j + (1 − λ) S̄_k, where j and k are randomly sampled indices from [1, B] used to generate the i-th augmented segment. The scalar λ is drawn from a beta distribution B(α, β) and controls the weights of the two components. Fig. 4(a) illustrates how Mixup augmentation is applied on a mini-batch. Fig. 4(b) illustrates the relationship between (α, β) and λ.
Data-only Mixup (DO) is similar to Complete Mixup, but operates on the mixture only, keeping the ground truth of the dominant component. This is essentially close to adding babble noise in the form of mixtures from other samples in the mini-batch. We expect this augmentation to increase the model's robustness in the presence of speech-like noises. The augmented data can be represented as x̄_aug,i = λ x̄_j + (1 − λ) x̄_k with S̄_aug,i = S̄_j. Ablation of the hyperparameters (α, β) will be given in Table 4.
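Complete Mixup can be sketched over a whole mini-batch as below. This is a simplified illustration, not the paper's implementation: each example is mixed with one random partner (a permutation) rather than two freely sampled indices, and the function name is ours. With α = 8, β = 1 (the paper's best setting), λ concentrates near 1, so each augmented example stays dominated by one component:

```python
import numpy as np

def complete_mixup(mixes, srcs, alpha=8.0, beta=1.0, rng=None):
    """Convex-combine pairs of mixtures AND their ground-truth sources.

    mixes: (B, L) mixture waveforms; srcs: (B, C, L) source waveforms.
    """
    rng = rng or np.random.default_rng()
    B = len(mixes)
    perm = rng.permutation(B)              # random partner for each example
    lam = rng.beta(alpha, beta, size=(B, 1))
    mixes_aug = lam * mixes + (1 - lam) * mixes[perm]
    srcs_aug = lam[:, :, None] * srcs + (1 - lam)[:, :, None] * srcs[perm]
    return mixes_aug, srcs_aug
```

By linearity, each augmented mixture remains exactly the sum of its augmented sources, so the training target stays consistent.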

CutMix
CutMix augmentation was initially proposed as an augmentation technique for the image domain (Yun et al., 2019). CutMix is a combination of Cutout and Mixup and involves masking out certain regions of the image followed by replacing the masked-out portion with a patch from another image from the same mini-batch. CutMix augmentation can make the model robust against corruption in the input mixture (Yun et al., 2019). Let us use the i-th segment in the mini-batch (5) as an example. The augmented data is represented as x̄_aug,i = m̄ ∘ x̄_j + (1 − m̄) ∘ x̄_k and S̄_aug,i = m̄ ∘ S̄_j + (1 − m̄) ∘ S̄_k, where j and k are randomly sampled indices from [1, B] used to generate the i-th augmented segment, and m̄ is a binary mask of the same length as x̄_j.
The binary mask can be computed as follows. We first sample a span length t ∼ U(0, T_max), where T_max is a scalar that determines the maximum portion, in samples, of x̄_k that will be mixed into x̄_j. Following this, we determine the start point t₀ ∼ U(0, L − t); the mask m̄ is set to 0 on [t₀, t₀ + t) and 1 elsewhere, so that the segment from x̄_k replaces the corresponding span of x̄_j. Augmented examples from CutMix tend to be more perceptually recognizable than those from Mixup (Yun et al., 2019), although it has been argued that, when training a model, the machine's perception takes precedence over that of humans (Summers and Dinneen, 2019). The ablation of the hyperparameter T_max has been given in Table 4.
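A mini-batch sketch of CutMix follows. As with the Mixup sketch above, this is an illustration rather than the paper's code: each example swaps a span with one random partner, the function name is ours, and the default T_max = 2000 samples (0.25 s at 8 kHz) follows the paper's best setting:

```python
import numpy as np

def cutmix(mixes, srcs, t_max=2000, rng=None):
    """Replace a random span of each mixture (and its sources) with the
    corresponding span of a random partner from the same mini-batch.

    mixes: (B, L) mixture waveforms; srcs: (B, C, L) source waveforms.
    """
    rng = rng or np.random.default_rng()
    B, L = mixes.shape
    perm = rng.permutation(B)
    mixes_aug, srcs_aug = mixes.copy(), srcs.copy()
    for i, j in enumerate(perm):
        t = rng.integers(0, min(t_max, L) + 1)  # span length t ~ U(0, T_max)
        t0 = rng.integers(0, L - t + 1)         # span start t0 ~ U(0, L - t)
        mixes_aug[i, t0:t0 + t] = mixes[j, t0:t0 + t]
        srcs_aug[i, :, t0:t0 + t] = srcs[j, :, t0:t0 + t]
    return mixes_aug, srcs_aug
```

Because the same span is swapped in both the mixture and its sources, each augmented mixture remains the sum of its augmented sources.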

Experimental setup
We compare the various augmentations discussed in Section 4 to the Un-Augmented model, which refers to a model where the input mixture has not been altered before being passed to the network for training. We start by doing a hyperparameter search for all the augmentations to determine the best-performing hyperparameters for each augmentation. All augmentations are randomly applied to 50% of the mini-batches during training, except for Dynamic mixing, for which we do an ablation with varying probability values, as it does not have any other hyperparameter associated with the augmentation. We train our separation models for two-speaker separation (C = 2) on segments of length 3 s with a sample rate of f_s = 8000 Hz, totaling L = 24,000 samples. All models are trained for 300 epochs unless specifically stated.
We use three datasets (LibriMix (Cosentino et al., 2020), TIMIT (Garofolo, 1993) and VCTK (Veaux et al., 2016)) and consider two types of evaluation: intra-corpus and inter-corpus. The former uses LibriMix for both training and testing, while the latter uses LibriMix for training and uses TIMIT and VCTK for testing. For training, all the models are trained on the train-100-noisy (58 h) subset of the Libri2Mix dataset (Cosentino et al., 2020). Libri2Mix (train-100-noisy) consists of artificially generated mixtures from the Librispeech corpus with the addition of ambient noise samples from the WHAM! (Wichern et al., 2019) test set. The resulting noisy mixtures have a mean SNR of −2 dB with a standard deviation of 3.6 dB (Cosentino et al., 2020). We generate Libri2Mix (11 h), VCTK-2mix (9 h) and TIMIT-2mix (10 h) for testing. The mixtures in the VCTK test data were created in the same way as the LibriMix noisy samples. The mixtures in the TIMIT test data were generated by first randomly mixing utterances from different speakers, followed by adding environmental noises from the evaluation set of Xu et al. (2015) at SNRs ranging from −5 to 20 dB with a step size of 5 dB. All utterances were sampled at 8 kHz, because speech intelligibility mainly requires the information below 4 kHz. For this work, intra-corpus training was restricted to models trained on LibriMix, as both the TIMIT and VCTK test subsets were too small to train a separation network fairly. A summary of the three datasets used has been presented in Table 3. It is important to note that the noise in the test sets of the LibriMix and VCTK datasets is drawn from the same noise corpus (WHAM! (Wichern et al., 2019)), whereas the noise in the test set of the TIMIT dataset is drawn from a different noise corpus (Environmental Noise Corpus (Xu et al., 2015)).
We use the DPRNN model, a state-of-the-art time-domain speech separation model, for all our experiments, with the Asteroid framework's (Pariente et al., 2020a) implementation. As shown in Fig. 2(b), the DPRNN model has an encoder-masker-decoder architecture. The 1-D convolutional encoder learns a 2-D representation from the given mixture. This 2-D feature representation is divided into 3-D chunks, which are then processed by the masker network in a dual-path manner: the masker performs intra- and inter-chunk processing for local and global modeling of chunks, respectively, to output individual masks for each source in the mixture. Finally, the 1-D transpose convolutional decoder outputs the individual waveforms for each source in the mixture. Dual-path models (Chen et al., 2020) have been reported to have better separation performance with a lower number of parameters compared to the vanilla time-domain models (Luo and Mesgarani, 2018; Luo and Nima Mesgarani, 2019). It should be noted, however, that this performance improvement is only substantial when the stride and kernel size in the 1-D convolutional encoder are low. Using lower kernel and stride sizes (e.g. kernel size 2 and stride 1) increases the training time by 1 to 1.5 days on an Nvidia Tesla V100 GPU as opposed to using a kernel size of 16 and a stride of 8. We therefore chose a kernel size of 16 and a stride of 8 for the 1-D convolutional encoder in DPRNN to accommodate this trade-off.
For performance evaluation, we use the SI-SNR improvement measure (Luo and Nima Mesgarani, 2019) as the primary evaluation metric, which is defined as the difference between the input and output SI-SNR (cf. Eq. (7)) of one segment. Unlike other performance measures such as signal-to-distortion ratio (SDR), SI-SNRi is scale-invariant and thus suitable for speech applications where proper scaling of the speech signal is not ensured (Wang and Chen, 2018). Additionally, we evaluate the models with best-performing hyperparameters using perceptual evaluation of speech quality (PESQ) and Short-time objective intelligibility (STOI) which measure the quality and intelligibility of speech respectively. PESQ [−0.5, 4.5] and STOI [0, 1] are widely used in speech enhancement research works and some speech separation research. Both PESQ and STOI have been reported to be closely related to how humans perceive speech (Zhang et al., 2018b) with higher values indicative of better quality and intelligibility respectively.

Hyperparameter selection
We start by doing a hyperparameter search on each augmentation to identify the best set of hyperparameters for speech separation. Based on this search, we retain the hyperparameters of the best-performing augmentations for our later experiments. Additionally, we compare the performance of augmented models with the Un-Augmented model to get an intuition of which augmentations improve separation performance. The Un-Augmented DPRNN model has an SI-SNRi of 12.00 dB on the test set of the LibriMix dataset. It is important to note that all results presented in this section are from models trained on the train-100 split of the LibriMix dataset and tested on the test set of the LibriMix dataset. Results of this hyperparameter search are presented in Table 4.
Gaussian noise: For the ablation of hyperparameters for Gaussian noise augmentation, we vary the maximum amplitude (a_max) with which Gaussian noise is added. The value of a_max is progressively decreased from 0.120 to 0.015. Results indicate that increasing a_max decreases the separation performance, which is expected as adding Gaussian noise of higher amplitude severely distorts the resulting augmented mixture. We get the best result with a_max = 0.015.
Gain: Results indicate that varying the scale with which gain is applied does not lead to much variation in separation performance. This is because the DPRNN model internally uses batch normalization (Ioffe and Szegedy, 2015), which makes the model invariant to the loudness of the mixture.
Dynamic mixing: We test various probabilities with which Dynamic mixing can be applied to the mini-batch. Results indicate that increasing the probability of augmentation beyond 0.5 decreases the separation performance. This is because the higher the probability of the augmentation, the lower the chance that the model sees the same training data across training epochs. The empirical results indicate that Dynamic mixing, when applied with a probability of 0.25 or 0.5, improves performance over the Un-Augmented model; the best performance is obtained with a probability of 0.5.
Time masking: For the ablation of Time masking, we vary the maximum fraction of the segment length that can be masked, which is determined by the parameter in Eq. (12). When there is a higher chance for a larger amount of the input mixture to be masked (e.g. a value of 0.40), the performance of the augmented model starts to deteriorate. In our experiments, we get the best separation performance with a value of 0.20.
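A possible waveform-domain sketch of Time masking is given below, assuming (as is common in SpecAugment-style masking) that the mask length is drawn uniformly up to the maximum fraction and the masked region is zeroed; the exact sampling scheme in Eq. (12) may differ.

```python
import numpy as np

def time_mask(mixture, max_mask_fraction=0.20, rng=None):
    """Zero out one contiguous segment whose length is drawn uniformly
    between 0 and max_mask_fraction of the signal length."""
    rng = rng or np.random.default_rng()
    out = mixture.copy()
    n = len(out)
    mask_len = int(rng.uniform(0.0, max_mask_fraction) * n)
    start = int(rng.integers(0, n - mask_len + 1))
    out[start:start + mask_len] = 0.0
    return out
```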
Frequency masking: Similar to Time masking, in the case of Frequency masking, increasing the masking parameter decreases the performance of the separation model. Performance degradation with excessive masking is expected, as the model has less of the mixture from which to predict the sources. In our experiments, we get the best separation performance with a value of 0.77.
Short noise: Short noise augmentation adds environmental noises (Env noise corpus (Xu et al., 2015)) to the training mixtures. The SNR range of the added short noises may not match that of the test set of the LibriMix dataset. The best results on the test set of LibriMix are obtained when using SNR = 0 dB and SNR = 24 dB.
Fig. 6. Results from evaluating various augmentations with best-performing hyperparameters in intra-corpus tests, using the DPRNN model trained and tested on the noisy LibriMix dataset.
Pitch shift: From the ablation of hyperparameters of Pitch shift augmentation, we can infer that Pitch shift severely deteriorates the separation performance compared to the Un-augmented model. Therefore, we forgo this augmentation in our later experiments.
Mixup: Complete Mixup is less sensitive to the variation of the distribution parameters than Data-only Mixup. Also, as the value drawn from the distribution gets closer to 1, separation performance improves, as the loss function in Eq. (35) is conditioned to optimize the most dominant source in the mixture. Our results indicate that the best results are achieved with parameter values of 8 and 1 for both Complete and Data-only Mixup.
CutMix: We ablate the maximum number of samples that are added to the mixture to be augmented from a randomly selected mixture in the mini-batch. Results indicate that the performance of the CutMix-augmented models is, on average, largely invariant to this value and improves over the Un-augmented model. We suppose this is because the mixtures used to perform CutMix augmentation come from the same mini-batch, so there is not much variation in the training data even when the value is higher. We get the best results using CutMix augmentation with a value of 2000 samples (0.25 s).
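The two non-source-preserving operations above can be sketched as follows. These are illustrative implementations under our own assumptions: we use a Beta(8, 1) draw for the interpolation weight (matching the best-performing parameter values), and we assume CutMix adds, rather than replaces, a short segment from another mixture in the mini-batch, as the ablation description suggests.

```python
import numpy as np

def data_only_mixup(mix_a, mix_b, alpha=8.0, beta=1.0, rng=None):
    """Interpolate two mixtures with lam ~ Beta(alpha, beta).
    With alpha=8, beta=1 the draw is usually close to 1, keeping
    mix_a dominant in the augmented mixture."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, beta)
    return lam * mix_a + (1.0 - lam) * mix_b, lam

def cutmix(mix_a, mix_b, max_samples=2000, rng=None):
    """Add a short random segment of mix_b into mix_a; max_samples=2000
    corresponds to 0.25 s at an 8 kHz sampling rate."""
    rng = rng or np.random.default_rng()
    seg_len = int(rng.integers(1, max_samples + 1))
    src = int(rng.integers(0, len(mix_b) - seg_len + 1))
    dst = int(rng.integers(0, len(mix_a) - seg_len + 1))
    out = mix_a.copy()
    out[dst:dst + seg_len] += mix_b[src:src + seg_len]
    return out
```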
From the hyperparameter ablation experiments presented in Section 5.2, we report the results for the best-performing hyperparameters of all augmentations on the test set of the LibriMix dataset in Fig. 6. None of the augmentations significantly outperform the Un-augmented model. More specifically, Time stretch and Pitch shift were the worst-performing augmentations. Complete and Data-only Mixup, Time masking, Time-Frequency masking, and Gaussian noise had performance comparable to the Un-augmented model. On the other hand, CutMix, Dynamic mixing, and Short noise showed very minor improvements over the Un-augmented model.

Combining augmentations
We test adding various individual augmentation operations to the combination of Dynamic mixing and CutMix augmentation.
We choose the best-performing hyperparameters (Table 4) for each augmentation to test whether combining augmentations leads to improved separation performance. Based on the results presented in Table 5, we can conclude that separating mixtures from the TIMIT dataset is the hardest, as both the speech and noise corpora differ from those used in the LibriMix dataset. The Un-augmented model has an SI-SNRi of 7.64 dB on the TIMIT dataset, compared to 12.00 dB on LibriMix. Additionally, the Un-augmented model on TIMIT has approximately 0.16 lower PESQ than on the LibriMix and VCTK datasets, which is indicative of the lower perceptual quality of speech separated from the TIMIT test set. On the other hand, since the VCTK dataset shares the noise corpus of the LibriMix dataset, with only the speech corpus being different, the separation performance on the VCTK dataset (10.95 dB SI-SNRi, 2.24 PESQ, 0.77 STOI) is much closer to that obtained on the LibriMix dataset.
From investigating the PESQ and STOI values in Table 5, we can conclude that the minor quantitative improvements for the separation task are hard to interpret. Additionally, Zhang et al. (2018b) reported very minor PESQ and STOI improvements with their separation models despite directly using both metrics as loss functions during training. Also, most recent research works primarily report SI-SNRi as the main evaluation metric (Luo and Nima Mesgarani, 2019; Zeghidour and Grangier, 2021; Subakan et al., 2021; Michelsanti et al., 2021). To this end, we use SI-SNRi as our primary evaluation metric for the rest of the paper.
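Since SI-SNRi serves as the primary metric, a reference sketch may be helpful. The function names are ours, but the computation follows the standard scale-invariant SNR definition: project the (zero-mean) estimate onto the target to obtain the scaled target component, and compare its energy to the residual.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Projection of the estimate onto the target direction.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(estimate, target, mixture):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

The projection step is what makes the metric scale-invariant: multiplying the estimate by any nonzero constant leaves the ratio unchanged, so loudness differences do not affect the score.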
Among the source-preserving augmentations listed in Fig. 6, Short noise augmentation showed the best performance improvement on the TIMIT dataset (0.39 dB), with performance comparable to the Un-augmented model on the LibriMix and VCTK datasets. We can attribute this singular improvement on the TIMIT dataset to the similar noise types (environmental noises) used to augment mixtures in Short noise augmentation. To further leverage Short noise augmentation, we conducted experiments combining it with other augmentations, but we only observe performance improvements on the TIMIT dataset when Short noise is combined with Time and Frequency masking. Specifically, Short noise combined with Frequency masking gives the largest SI-SNRi improvement (1.26 dB) on the TIMIT dataset. However, it must be noted that none of the combinations that include Short noise augmentation lead to substantial performance improvement on the LibriMix or VCTK datasets.
Along similar lines, combining SpecAugment augmentations with one or more other augmentations on average leads to deteriorated separation performance on the three datasets. Results on the combination of SpecAugment with CmixDmix are in line with the results obtained with models augmented with SpecAugment alone, where Frequency masking tends to perform the worst.
In summary, when using a single augmentation for training, there does not seem to be any relationship between the use of source-preserving versus non-source-preserving augmentations and the performance of the speech separation model. Also, intra-corpus experiments indicated that Dynamic mixing and CutMix augmentation have the joint-best separation performance. Both Dynamic mixing and CutMix introduce novel mixtures to the model in each epoch, in contrast to augmentations such as SpecAugment that apply signal transformations to the original mixture, yielding only a slightly different variation of it. Dynamic mixing and CutMix, when combined with Data-only Mixup, gave the best generalization performance across the three tested datasets, which can be attributed to having distinctly different mixtures during training while the augmented mixtures remain semantically meaningful, as compared to other augmentations such as Complete Mixup.

Training with a different model
To verify the transferability of the results obtained with the DPRNN model to a separation model with a different architecture, we apply the augmentations to ConvTasNet (Luo and Nima Mesgarani, 2019), a time-domain speech separation model that uses a convolutional architecture, as opposed to the dual-path recurrent architecture of DPRNN. The results of this comparison are presented in Table 6 and Fig. 7. We see consistent performance improvement from the best-performing augmentations (CmixDmix, CmixDoDmix) in both models across multiple datasets.
However, we also observe inconsistent performance among multiple individual augmentations (Time/Frequency masking, Gain), which is in line with the lack of transferability of augmentations reported by Longpre et al. (2020) when comparing LSTM-based and Transformer-based networks for classification and natural language processing tasks. As an anomaly, however, Short noise augmentation, despite being a largely domain-specific augmentation, improves separation performance (SI-SNRi) on the TIMIT dataset with both the ConvTasNet (0.39 dB) and DPRNN (1.31 dB) models. This can be attributed to the presence of noise types in the short noise subset similar to those used in the TIMIT dataset.
Another observed pattern is that combinations of task-agnostic augmentations (e.g. Data-only Mixup, CmixDmix, CmixDoDmix) seem to lead to the best out-of-domain generalization. Here, task-agnostic augmentations refer to augmentation operations that can be applied to a training sample regardless of its domain (e.g. image/audio/language). Similar results were obtained by Longpre et al. (2020), who reported task-agnostic augmentations to have maximal impact when the unseen test conditions are not well represented in the training subset. Accordingly, we observe the maximum SI-SNRi improvement of 1.64 dB (DPRNN) and 1.40 dB (ConvTasNet) over the Un-augmented model using CmixDoDmix when testing on samples from the TIMIT dataset.

Training with fewer training data
To test whether the performance improvement brought about by augmentations depends on the large amount of training data available, we perform an ablation by training the DPRNN model with varying amounts of training data (5, 10, 20, 25, 35, 40, 50, 60, 80, and 100%) using CmixDmix augmentation, our best-performing augmentation on the LibriMix dataset. The results in Fig. 8 indicate that the model trained using CmixDmix augmentation can, on average, outperform the Un-augmented model. The amount of gain is quite stochastic, with an average improvement of 0.61 dB and a standard deviation of 0.27 dB. It can also be noted that the performance of the separation model starts to drop sharply when less than 25% of the training data is used.

Discussion
Data-only Mixup performs the best among individual augmentations in intra-corpus scenarios. Along similar lines, Data-only Mixup combined with CutMix and Dynamic mixing (CmixDoDmix) leads to the best overall performance for both inter- and intra-corpus testing. Mixup-based methods have been reported to increase the robustness of a model by showing it adversarial examples during training (Harris et al., 2021); adversarial examples in our case are mixtures further distorted by other mixtures in the same mini-batch via linear interpolation. Additionally, Mixup-based methods have also been reported to improve generalization by limiting the memorization ability of the model (Liang et al., 2018). On the other hand, CutMix, unlike Data-only Mixup, does not semantically distort the audio. CutMix prevents the model from overfitting on the training dataset by increasing the number of observable data points in the training sample (Harris et al., 2021); a higher number of observable data points in our case is analogous to more speakers in a sample mixture in a mini-batch. This assessment is supported by the quantitative results obtained with CutMix and Data-only Mixup: CutMix does not improve separation performance in testing on the TIMIT dataset, which contains extreme noise conditions quite dissimilar to those in the WHAM noise subset used in the LibriMix and VCTK datasets. However, Data-only Mixup seems robust against these noise types, owing to being trained on mixtures with artifacts from other mixtures in the mini-batch.
Furthermore, we analyze spectrograms (Fig. 9) of a speaker waveform separated from a TIMIT mixture corrupted by guitar noise. Almost all the augmentations presented in Fig. 9 come close to the spectral representation of the ground-truth speaker waveform, but the waveforms separated by the CutMix and Un-augmented models seem to bring in minor artifacts from the other waveform at the beginning of the separated waveform. It is also evident that none of the augmentations presented in Fig. 9 is fully able to remove the noise from the separated waveform, perhaps because this noise resembles speech formants. Furthermore, the experimental results presented in Section 5.4 indicate that combining the mutually exclusive operations of interpolation and masking from Data-only Mixup and CutMix (CmixDoDmix) not only improved inter- and intra-corpus separation performance but also increased the robustness of the performance across the two tested models (DPRNN, ConvTasNet (Luo and Nima Mesgarani, 2019)).
Future work would involve developing an augmentation search policy-based method for predicting the schedule, magnitude, and probability of augmentation combination for each epoch than having fixed augmentation hyperparameters for all the epochs of training. This research can help to narrow down the search space for such augmentation policy search-based methods for speech separation and therefore bring down the training and computational resources required to identify the best augmentation strategy for training speech separation models.
In this paper we employ a simple experimental search-based strategy to determine the best augmentation combination for increased generalization in noisy speech separation. Several more advanced research works have attempted to use combinations of data augmentation operations to improve the performance of machine learning models (Lim et al., 2019; Cubuk et al., 2020; Hendrycks et al., 2020; Wang et al., 2021). For instance, AutoAugment trains a child network with a sequence of augmentation operations generated by a recurrent neural network (RNN); the validation accuracy of the child network is used as a reward signal to optimize the RNN to produce better sequences of augmentations over time.
Fast AutoAugment (Lim et al., 2019) improves the speed of the search for effective augmentation policies by using a density-matching method, splitting the training data and training each split in a distributed manner. Top-performing augmentation policies are selected from each split, and a combination of the best-performing policies is used to re-train the model on the entire dataset. Adversarial AutoAugment (Zhang et al., 2020) extended AutoAugment by training a network to generate adversarial augmentation policies that make the target model more robust. It optimizes the policies directly on the target dataset instead of using smaller subsets of datasets and models, thus reducing the computational cost of training auxiliary models. Population-Based Augmentation (Ho et al., 2019) predicts a schedule of augmentation policies applied over the training epochs, rather than a fixed augmentation policy. AugMix (Hendrycks et al., 2020) and RandAugment further reduce the search space by stochastically applying a sequence of augmentations. The above-mentioned methods typically require large computational resources to automatically search for the most efficient way to combine multiple augmentations. The design of automatic augmentation for speech separation would be an interesting research direction in the future.

Conclusion
We conducted a comparative study of various augmentation techniques to improve the cross-corpus generalization of two speech separation models (DPRNN, ConvTasNet (Luo and Nima Mesgarani, 2019)). The study evaluated ten individual augmentation methods and various combination strategies with intra-corpus and inter-corpus testing on three speech datasets (LibriMix, TIMIT, VCTK). It was shown that while augmentation cannot significantly improve the separation performance in intra-corpus testing, it can effectively improve the separation performance in inter-corpus testing, which is the main objective of model generalization. Among individual augmentations, Data-only Mixup achieves the best inter-corpus generalization performance. Among the augmentation combinations, CmixDoDmix, a combination of CutMix, Data-only Mixup, and Dynamic mixing, achieves the best inter-corpus generalization performance. Both Dynamic mixing and CutMix expose the model to completely new mixtures in each training iteration, rather than a transformed version of the same mixture as in most augmentation operations. This stochastic introduction of new training samples seems to be the key to better generalization across multiple datasets. It was also shown that combined augmentation (e.g. CmixDmix, the combination of Dynamic mixing and CutMix) can improve the separation performance when the speech separation model is trained with fewer data.
In future work, we will investigate the performance of the augmentation methods on more state-of-the-art speech separation models, e.g. the time-domain DPTNet model and the time-frequency domain TF-GridNet model (Wang et al., 2022). We will also investigate their performance on models that separate more than two overlapping speakers and operate over a wider frequency range (e.g. a sampling rate of 16 kHz). In this paper, we empirically proposed a simple strategy for combining different augmentation methods; future work will develop an automatic and smarter combination strategy to further improve the generalization of the model.

CRediT authorship contribution statement
Ashish Alex: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Lin Wang: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Paolo Gastaldo: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Andrea Cavallaro: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Audio samples of separated mixtures from models trained using various augmentations can be found at http://www.eecs.qmul.ac.uk/linwang/demo/augmentation.html.