The whole is greater than the sum of its parts: improving music source separation by bridging networks

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL enables taking advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i


Introduction
There is a huge amount of music in our lives, e.g., from radio and TV, as background music in stores, or provided by online streaming services [1][2][3][4][5][6]. To adapt music for diverse purposes, it is sometimes necessary to remix it, e.g., by making the vocal tracks louder, suppressing undesired instruments, or upmixing to more audio channels. Such operations are easy to implement when we have independent access to each audio source that was used to mix the music. However, if we only have access to the final recording, which is often the case, this is much more challenging. In such cases, it is necessary to separate the music into its individual instruments, which is called music source separation (MSS), to achieve the above operations.
MSS has a long history and is known to be a very challenging problem [7]; therefore, many approaches have been investigated, e.g., local Gaussian modeling [8,9], non-negative matrix factorization [10][11][12], kernel additive modeling [13], and combinations of these approaches [14,15]. Data-driven machine learning approaches for MSS have also been of great interest to researchers. Many methods that use deep neural networks (DNNs) have been investigated to improve MSS performance. Specifically, multi-layer perceptrons (MLPs) [16], convolutional neural networks (CNNs) [17], and recurrent neural networks (RNNs) [18], which are the three basic DNN architectures, have been used for MSS. An MLP was used to separate the input spectra and obtain the separated results [19,20]. CNNs and RNNs were used to achieve source separation with better quality [21][22][23] than previous MLP-based methods since the convolutional and recurrent layers of CNNs and RNNs can effectively capture temporal context.
Although the above studies drastically improved MSS performance, there are two problems with respect to the training of music separation networks: (P1) Most DNN-based MSS methods tend to handle only the time or the frequency domain but not both. (P2) They do not handle the mutual effect among output sources since the network architectures are independent and the loss functions are computed independently for each estimated source and the corresponding ground truth.
For example, a well-known open-source MSS method called Open-Unmix (UMX) [24] executes MSS only in the frequency domain. It also applies the conventional mean squared error (MSE) loss function to individual pairs of estimated and corresponding ground-truth magnitude spectrograms for each instrument. In other words, UMX trains a network individually for each instrument and achieves MSS by applying these independent networks one by one. In the field of speech enhancement (SE), which can be regarded as a special case of audio source separation, there are methods for solving the above problems. For solving (P1), Kim et al. [25] showed the effectiveness of multi-domain processing via hybrid denoising networks, and Su et al. [26] reported that building two discriminators responsible for the time and frequency domains can enable effective denoising and dereverberation in their scheme based on generative adversarial networks (GANs). For solving (P2), from classical SE methods such as the Wiener filter [27] to current SE methods, e.g., noise-aware training [28] and the noise-aware variational autoencoder [29], there are many situations in which knowing and using information about the noise, such as its type, level, and time variation, is generally beneficial for the subsequent extraction of the target speech.
In MSS, the non-target sources can similarly be regarded as "noise," and information about them may be beneficial for the subsequent target source separation. There is also research on using such information for MSS via a Wiener filter [19], but it is used only as post-processing; thus, the information of the non-target sources is not used to train a DNN. Since our first work in [30], more models such as Hybrid Demucs (HDemucs) and Hybrid Transformer Demucs (HTDemucs) [31,32] as well as Band-split RNN (BSRNN) [33] have appeared that show the benefit of working jointly in both domains.
Inspired by these discussions, we first append an additional differentiable short-time Fourier transform (STFT) or inverse STFT (ISTFT) layer during training only. To consider the characteristics of both the time and frequency domains, some existing methods such as [31] adopt an architecture with two separate branches, one for time features and one for frequency features. However, such an architecture is specific to each method and is thus difficult to reuse in other methods. On the other hand, applying loss functions in both the time and frequency domains, i.e., applying multi-domain loss (MDL), is feasible for almost all existing methods since it is merely a loss function built on multiple domains. Intuitively, the two domains also give complementary views of the separation performance. A time-domain (TD) loss may fail to notice a periodic noise pattern since it only computes an instantaneous error. In the frequency domain, however, such a periodic noise pattern becomes visible and will be reduced by the frequency-domain (FD) loss. Conversely, the FD loss does not account for phase information since it only deals with magnitude spectrograms, whereas the TD loss includes the phase in its error.
Furthermore, to consider the relationship among output sources, we bridge the instrument networks by adding averaging operations if the original source separation is achieved by applying independent instrument networks to the input mixture. This is called bridging operation. For the bridged network to better determine the relationship among output instruments, we produce output spectrograms for instrument combinations and apply MDL to them. We call this loss computation combination loss (CL). The combination of bridging operation and CL helps the separation network determine the cause of an estimation error, i.e., which sources are leaking into the target instrument.
In summary, MDL solves (P1) since it helps the separation network determine the estimation error in both the time and frequency domains. Bridging operation and CL solve (P2) since they enable the separation network to handle the mutual relationships among the separated sources. We collectively call this "X-scheme," which crosses the information among all sources with MDL. It is important to note that X-scheme can improve the performance of DNN-based MSS systems while almost maintaining the original calculation cost. This is because MDL and CL only affect the training step and thus do not change the original inference step. Moreover, bridging operation requires only a slight network modification, which does not increase the number of parameters to be learned and increases the computational cost only slightly. More specifically, the relative increase in computational cost caused by X-scheme depends on the original size of the target network. However, as X-scheme merely adds averaging operators to merge sub-networks, these additional costs can often be neglected. For instance, only 4 additional averaging operators are needed in the case of a 4-instrument dataset like MUSDB18. However small a DNN is, we believe its computational cost should be much larger than that of a few averaging operators. Hence, X-scheme causes almost no increase in computational cost.
Although we confirmed the validity of X-scheme in our previously proposed DNN-based MSS method, i.e., extended UMX (X-UMX) [30], realized by applying X-scheme to UMX, there remain three questions: (i) its generality to other types of network architectures, (ii) the effective positions at which we should bridge the paths of the target networks, and (iii) its scalability to a large-scale data regime. Hence, in this paper, we address these questions. Specifically, we validate the effectiveness of X-scheme by applying it to different types of DNN-based MSS methods: well-known CNN-based and RNN-based ones, i.e., densely connected dilated DenseNet (D3Net) [34,35] and Open-Unmix (UMX) [24]. Furthermore, not only these frequency-domain networks (i.e., UMX and D3Net) but also a well-known time-domain one, the convolutional time-domain audio separation network (Conv-TasNet) [1], is extended by X-scheme in this paper. We also present a detailed study regarding the bridging positions and the potential of using a large dataset for training X-UMX.
The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. In Section 3, we present X-scheme. In Section 4, we show the effectiveness of X-scheme on the MSS task by applying it to UMX, D3Net, and Conv-TasNet, resulting in X-UMX, X-D3Net, and X-Conv-TasNet. Finally, we conclude this paper in Section 5.

Related work
DNN-based MSS methods can be roughly categorized into time- and frequency-domain methods. UMX [24] receives the input spectrogram of a mixture song and extracts the target instrument by using fully connected and bi-directional long short-term memory (BLSTM) layers on the spectrogram, i.e., it works in the frequency domain. Similarly, D3Net [34,35] extracts the target instrument from the input spectrogram by using convolutional layers in the frequency domain. Note that they use a multichannel Wiener filter (MWF) [19] [37,38]. However, the MSS performances of such methods were inferior to those of frequency-domain methods. Specifically, the overall signal-to-distortion ratio (SDR) was reported to be only around 3.2 dB, which was almost 2 dB behind that of frequency-domain methods. Note that the experiments in the above studies were conducted on the same public dataset, i.e., MUSDB18; thus, we can compare their results. Défossez et al. then investigated a new time-domain method, Demucs [39], which is based on Wave-U-Net [38]. Demucs improves the modeling capability by incorporating gated linear unit layers [40], BLSTM, and faster strided convolutions; thus, it demonstrated results competitive with frequency-domain methods on MUSDB18.
Although both time- and frequency-domain methods have recorded good MSS performance, there are still concerns. Almost all DNN-based frequency-domain methods tend to use only a magnitude spectrogram without phase information since it is difficult for DNNs to work with complex data, so the phase information is often ignored. Therefore, the phase of the input mixture is often used with the output magnitude spectrogram to compute the ISTFT, although this might yield a mismatch with the target source's spectrogram. In addition, the Fourier basis, which is used to calculate the spectrogram, is not always optimal for DNN-based MSS methods. Time-domain methods, however, can optimize their networks end-to-end, i.e., including the phase information, but tend to make the training more difficult. Inspired by this insight, we previously proposed X-UMX, which can use time-domain information via MDL [30], and confirmed that it performed better than UMX. Methods using time and frequency information in a hybrid manner have also been proposed for DNN-based MSS. For example, KUIELAB-MDX-Net [41] and Danna-Sep [42] are hybrid methods using time and frequency features. Specifically, they combine heterogeneous time- and frequency-based MSS networks on the basis of the blending scheme [22], resulting in high-performing hybrid MSS.
The number of methods using complex-valued features, i.e., the spectrogram magnitude and the corresponding phase via STFT, for MSS has recently been increasing [43][44][45]. Specifically, latent source attentive frequency transformation (LaSAFT) [43] and its light version, Light-SAFT [44], use complex-as-channels (CaC) [46] built on U-Net [47], enabling MSS in the complex-valued domain. Défossez et al. also improved upon the original Demucs by using CaC, resulting in HDemucs [31], to use time as well as complex-valued frequency information. Its architecture consists of two branches, one handling the time input and the other the complex-valued frequency input. Liu et al. proposed the channel-wise subband phase-aware ResUNet (CWS-PResUNet) [45], which includes phase estimation by using a loss function based on the complex ideal ratio mask (cIRM) [48]. Their motivations, which involve phase information as well as the spectrogram magnitude, are similar to those of the hybrid methods that compensate for the missing phase information by adding the time-domain signal. Therefore, the above complex-domain methods can also be regarded as hybrid methods.
There have been several attempts to directly estimate the phase of the target source [49,50]. PhaseNet [49] successfully predicts the phase information by defining the phase-estimation problem as a classification of discretized phase values. DiffPhase [51] generates as well as predicts a phase that is suitable for the given spectrogram magnitude through the framework of a diffusion-based generative model. The authors reported that the perceptual scores of reconstructed time signals were high even when their phases were partially generated.
From this literature review, we can see that using the time domain or similar features as well as the frequency domain is important to achieve good MSS performance. However, changing the network architecture such that time- and frequency-domain input features can be jointly used and optimizing this new architecture may be a laborious task. X-scheme, which includes MDL, is simple and easy to use; thus, it enables many methods to handle both time- and frequency-domain features in a hybrid manner.
Furthermore, there are some studies that attempted to integrate sub-networks, each of which is dedicated to extracting one specific instrument [3][4][5]. Specifically, Meseguer-Brocal et al. proposed a DNN-based MSS method that uses just a single conditioned network [3]. By applying Feature-wise Linear Modulation (FiLM) [6] to the target network as conditioning, separating an arbitrary desired instrument through a single network becomes feasible. Selecting which instrument should be separated is achieved by FiLM-based conditioning without instrument-wise training. Furthermore, Slizovskaia et al. proposed a conditioned network-based MSS method that accepts visual features as well as audio ones, i.e., audio-visual features [4]. Besides using a conditioned network, Kadandale et al. proposed a multi-task model-based MSS method that uses a unified single network outputting all instruments simultaneously [5]. While a conditioning scheme might require longer network training to ensure enough training steps for each conditioning signal, multi-task model-based networks need fewer iterations since all instruments are learned simultaneously. They increase the number of output instruments by changing the number of output kernels of U-Net and thereby easily change each instrument's dedicated network into a unified multi-task one. However, their method was only tried on a CNN-based method, i.e., U-Net, and it might be difficult to apply to other types of DNNs.
Our X-scheme can be regarded as a modification to change the target network to a multi-task one by bridging, and it can further be applied to not only CNN-based but also other types of networks as shown in the following sections.

X-scheme for DNN-based MSS
In this section, we describe X-scheme, which consists of three components, i.e., MDL, bridging operation, and CL. As mentioned in Section 1, MDL should solve (P1), and bridging operation and CL should solve (P2).
Throughout the paper, we use the following notations. We first assume that the time-domain mixture signal $x$ consists of $J$ sources, i.e.,

$$x = \sum_{j=1}^{J} y_j, \tag{1}$$

where $y_j$ denotes the time-domain signal of the $j$th source. Note that $x$ and $y_j$ are column vectors containing the samples of monaural signals. In general, the audio signal of music consists of two channels, i.e., it is a stereo signal. However, some metrics used in our method, such as MSE and SDR, do not have operators specialized for multi-channel signals; thus, we calculated the following loss values for each channel one by one and summed them to obtain the final loss. To the best of our knowledge, although there is a multichannel version of classical SDR (e.g., https://github.com/sigsep/ where $\mathcal{S}$ and $\mathcal{S}^{-1}$ represent the operators of STFT and inverse STFT (ISTFT), respectively. A variable with the hat symbol, e.g., $\hat{\bullet}$, denotes a result estimated with the DNN. Therefore, $\hat{y}_j$ and $\hat{Y}_j$ are respectively the time- and frequency-domain predictions of the ground truths $y_j$ and $Y_j$ obtained via the DNN.

Multi-domain loss
For MDL, we first append an additional differentiable and fixed STFT or ISTFT layer after the final layer of the target DNN, as shown in Fig. 1. STFT and ISTFT consist only of product-sum operations, implemented via the so-called butterfly computation [2]; in other words, they consist of matrix-vector products, each of which is differentiable. Hence, we can easily add STFT and ISTFT as differentiable operators, resulting in STFT and ISTFT layers. It is then possible to calculate the loss functions in both the time and frequency domains before and after the appended layer; for an appended ISTFT layer, $\hat{y}_j = \mathcal{S}^{-1}\{\hat{Y}_j\}$ (Eq. (2)). Since this appended layer is only used during training for computing MDL, it does not affect the inference step. In X-scheme, we use the MSE and the weighted signal-to-distortion ratio (wSDR) [52] as the frequency- and time-domain loss functions, respectively, and combine them as in Eq. (3), where $\alpha$ is a scaling parameter for mixing the losses of multiple domains. $\mathcal{L}^{J}_{\mathrm{MSE}}$ and $\mathcal{L}^{J}_{\mathrm{wSDR}}$ are calculated as in Eqs. (4a) and (4b), respectively, where $t$ and $f$ denote the indexes of the time frame and frequency bin of the spectrogram $Y_j(t, f)$, respectively. In addition, $\rho_j$ is the energy ratio between the $j$th source $y_j$ and the mixture $x$ in the time domain, i.e., $\rho_j = \|y_j\|^2 / (\|y_j\|^2 + \|x - y_j\|^2)$. Note that the output range of the wSDR in Eq. (4b) is bounded to $[-1, 1]$. Therefore, $\mathcal{L}^{J}_{\mathrm{wSDR}} + 1.0$ in Eq. (3) is bounded to $[0, 2.0]$, which makes it convenient to mix with another type of loss, i.e., MSE in our case. Although the SDR is traditionally calculated with a logarithm, we keep the logarithm-free form and use Eq. (4b) for MDL for the above reason.
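For reference, a plausible per-source form of the losses in Eqs. (3), (4a), and (4b), written from the definitions above, is sketched below. The wSDR term follows the standard definition in [52]; the exact form of Eq. (3), the normalization of the MSE, and the way the per-source terms are aggregated over the $J$ sources (sum or mean) are assumptions.

```latex
\begin{align}
\mathcal{L}_{\mathrm{MDL}} &= \mathcal{L}_{\mathrm{MSE}} + \alpha\,\bigl(\mathcal{L}_{\mathrm{wSDR}} + 1.0\bigr), \\
\mathcal{L}_{\mathrm{MSE}}(Y_j, \hat{Y}_j) &= \sum_{t,f} \bigl| Y_j(t,f) - \hat{Y}_j(t,f) \bigr|^{2}, \\
\mathcal{L}_{\mathrm{wSDR}}(y_j, \hat{y}_j; x) &=
  -\rho_j \frac{y_j^{\top}\hat{y}_j}{\|y_j\|\,\|\hat{y}_j\|}
  -(1-\rho_j) \frac{(x-y_j)^{\top}(x-\hat{y}_j)}{\|x-y_j\|\,\|x-\hat{y}_j\|}.
\end{align}
```

Each of the two fractions in the wSDR term is a cosine similarity, and the weights $\rho_j$ and $1-\rho_j$ sum to one, so the per-source value stays within $[-1, 1]$.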
By using MDL, the target DNN can leverage the advantage of both domains even if the original network operates in either one of them.MDL can also be applied to many conventional DNN-based MSS methods by simply replacing the loss function; thus, no additional calculation is required during the inference.
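As a minimal sketch of how the time-domain part of MDL behaves, the following pure-Python code computes a per-source wSDR in the standard form of [52] and mixes it with a given MSE value. The function names (`wsdr`, `mdl`), the small constant `eps`, and the toy signals are hypothetical and only for illustration; an actual implementation would operate on batched tensors and compute the MSE on spectrograms.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def neg_cosine(a, b, eps=1e-8):
    """Negative cosine similarity; lies in [-1, 1] (eps avoids division by zero)."""
    return -dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def wsdr(x, y, y_hat, eps=1e-8):
    """Weighted SDR loss for one source: mixture x, target y, estimate y_hat."""
    z = [xi - yi for xi, yi in zip(x, y)]          # true residual (all other sources)
    z_hat = [xi - yi for xi, yi in zip(x, y_hat)]  # estimated residual
    e_y, e_z = dot(y, y), dot(z, z)
    rho = e_y / (e_y + e_z + eps)                  # energy ratio rho_j
    return rho * neg_cosine(y, y_hat, eps) + (1.0 - rho) * neg_cosine(z, z_hat, eps)

def mdl(mse_value, wsdr_value, alpha=10.0):
    """Mix a frequency-domain MSE with the shifted wSDR (assumed form of Eq. (3))."""
    return mse_value + alpha * (wsdr_value + 1.0)

# Toy example: a target source plus everything else forms the mixture.
y = [1.0, -2.0, 0.5, 3.0]
others = [0.5, 0.5, -1.0, 2.0]
x = [a + b for a, b in zip(y, others)]

perfect = wsdr(x, y, y)                    # close to -1 for a perfect estimate
flipped = wsdr(x, y, [-v for v in y])      # a sign-flipped estimate is penalized
```

Because the shifted wSDR is non-negative and close to zero for a perfect estimate, adding it to an MSE term with a scaling factor keeps both parts of the loss on a comparable footing.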

Combination schemes
In this subsection, we explain bridging operation for DNN-based MSS (Section 3.2.1) and CL (Section 3.2.2), which help independent extraction networks support each other.

Bridging operation
As shown in the blue rectangle of Fig. 2a, if DNN-based MSS is achieved using independent instrument networks, it is difficult for each network to take into account their mutual effect. Thus, we argue that it is effective to cross the network graphs to help independent sub-networks support each other. This is the reason X-scheme includes bridging operation. Note that we adopt just a simple averaging layer as bridging operation. There are other possible ways to join sub-networks, such as cross-attention [53], squeeze-and-excitation [54], and transform-average-concatenate [55]. However, they may increase the computational cost and the number of parameters to be learned. One of our motivations is to enhance existing DNN-based MSS methods while keeping their calculation cost and original simplicity as much as possible; thus, we focus on adding a simple averaging layer as bridging operation. Note that a larger CPU/GPU memory tends to be necessary since X-scheme requires all sub-networks, each of which is used to separate one instrument, to be placed on the CPU/GPU in parallel during training. However, this is a bottleneck only during training and might require adjusting the batch size. During separation, i.e., inference, this is in general no longer a problem since the batch size is one.
We previously did not investigate the detailed settings of bridging operation, such as the position and number of bridges. As shown in Fig. 2b, it is possible to place a bridge between layers #$l$ and #$(l + 1)$. We can place multiple bridges depending on the number of target network layers $L$; namely, we can place up to $(L - 1)$ bridges. That is, we connect the paths to cross each source's networks by adding one or more averaging operators to the original network. Note that bridging operation does not have any learnable parameters; thus, the calculation cost increases only slightly compared with the original network since only a few averaging operations are added. We can then regard the parts before and after the last added bridge as the interaction part and the per-source extraction part, respectively; thus, their capacities, which depend on the bridging position, affect the final MSS performance. Motivated by the above discussion, we conduct experiments on X-UMX (Section 4.3) to confirm the effect of the number and position of bridges on MSS performance.
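The bridging operation can be illustrated with a minimal sketch, assuming that a bridge takes the hidden activations of the $J$ sub-networks at the same layer, averages them element-wise, and feeds the same average to the next layer of every sub-network. The function name `bridge` and the list-based representation are hypothetical; a real network would apply the same averaging to tensors.

```python
def bridge(hidden_per_source):
    """Average the hidden vectors of all sub-networks element-wise.

    hidden_per_source: list of J equal-length lists, one per instrument network.
    Returns J copies of the averaged vector, one for each sub-network's next layer.
    The operation has no learnable parameters.
    """
    n_src = len(hidden_per_source)
    dim = len(hidden_per_source[0])
    avg = [sum(h[k] for h in hidden_per_source) / n_src for k in range(dim)]
    return [list(avg) for _ in range(n_src)]

# Two sub-networks with two hidden units each.
bridged = bridge([[1.0, 2.0], [3.0, 6.0]])
```

Since the averaged vector depends on every sub-network, the gradient of any single source's loss flows back into all sub-networks located before the bridge.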

Combination loss
As mentioned above, the purpose of applying bridging operation is to enable each source-extraction network to handle the relationship among output sources via the built bridges. In other words, it is necessary for each source-extraction network to learn these mutual relationships during training. However, using only bridging operation is insufficient for the networks to work together if the loss function is computed independently for each instrument. Thus, it is effective to cross the loss function as well as the network paths via CL to boost the benefit of the built bridges. For CL, we consider combinations of output spectrograms to enable the DNN-based source extractors to interact with each other. Specifically, we combine two or more estimated spectrograms into new ones, each of which corresponds to extracting two or more sources from the mixture. Using the newly obtained combination spectrograms enables us to compute more loss functions than when we use only the individual instrument spectrograms independently, i.e., the loss is computed over $N$ combinations rather than over $J$ individual sources, where $N > J$ is the total number of possible combinations except for mixing all $J$ sources, i.e., $N = \sum_{i=1}^{J-1} \binom{J}{i}$, and $n$ denotes the index of the $n$th combination. For instance, when separating $J = 4$ sources, as is the case with MUSDB18, we can consider 14 combinations in total, as shown in Fig. 3, whereas conventional methods handle only each source independently, i.e., 4 source spectrograms.
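The count of combinations can be checked with a short enumeration. The sketch below lists every non-empty subset of the sources except the full mixture; the helper name `source_combinations` is hypothetical, and the source names are those of MUSDB18.

```python
from itertools import combinations
from math import comb

def source_combinations(sources):
    """All subsets of size 1 .. J-1, i.e., every combination except the full mix."""
    n_src = len(sources)
    return [c for size in range(1, n_src) for c in combinations(sources, size)]

sources = ["bass", "drums", "other", "vocals"]
combos = source_combinations(sources)

# Closed-form count: sum of binomial coefficients for i = 1 .. J-1.
n_expected = sum(comb(len(sources), i) for i in range(1, len(sources)))
```

For $J = 4$, this gives 4 single sources, 6 pairs, and 4 triplets, i.e., the 14 combinations mentioned above.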
To explain the advantage of CL, let us consider the following example. Assume that we have a system with leakage of vocals into drums and bass, so that both instruments exhibit similar errors. By considering the combination drums + bass, we notice that the two errors are correlated, resulting in an even larger leakage of vocals, which we try to mitigate using CL. More formally, let $\epsilon_j$ denote the prediction error of the $j$th source, i.e., $\hat{y}_j = y_j + \epsilon_j$. We can then consider the MSE of the combination $u = y_1 + y_2$ and its estimate $\hat{u} = \hat{y}_1 + \hat{y}_2$:

$$\mathrm{MSE}(u, \hat{u}) = \mathbb{E}\left[(\epsilon_1 + \epsilon_2)^2\right] = \mathbb{E}\left[\epsilon_1^2 + \epsilon_2^2 + 2\epsilon_1\epsilon_2\right].$$

When we consider $y_1$ and $y_2$ separately without the combination, the term "$2\epsilon_1\epsilon_2$" does not appear in the MSE: $\mathrm{MSE}(y_1, \hat{y}_1) + \mathrm{MSE}(y_2, \hat{y}_2) = \mathbb{E}[\epsilon_1^2 + \epsilon_2^2]$. Therefore, by using CL, we can monitor the error correlation term "$\mathbb{E}[2\epsilon_1\epsilon_2]$," which helps the source-extraction networks train when their errors are correlated. Specifically, we expect this term to be able to detect errors leaking into the wrong track. To reduce this term efficiently, we use bridging operations, which allow each sub-network to be aware of the others and, hence, to reduce potential leakage into a wrong source. Specifically, tying the networks together helps the training since gradient information is now also exchanged, which can help in learning to keep the "$2\epsilon_1\epsilon_2$" term small. Furthermore, the sub-networks also benefit from joint feature extraction. Therefore, we bridged the network by just adding simple averaging operators, as shown in Fig. 2b, which turned out to be beneficial since the results improved although the configurations were the same except for applying X-scheme.
We can also analyze CL from a geometrical viewpoint. Focusing on Eq. (4b), since the wSDR consists of two cosine similarity functions, it monitors the angle between the ground truth $y_j$ and the corresponding prediction $\hat{y}_j$. However, there is a critical case in which the prediction error cannot be detected in terms of the cosine similarity. As shown in Fig. 4, when the predicted $\hat{y}_1$ and $\hat{y}_2$ are respectively orthogonal to the corresponding ground truths $y_1$ and $y_2$, it is difficult to detect the prediction error since $\cos(y_1, \hat{y}_1)$ and $\cos(y_2, \hat{y}_2)$ are both zero. However, CL can detect this prediction error via the combined signals $u$ and $\hat{u}$ since the score $\cos(u, \hat{u}) = -1$, when substituted into the wSDR-based loss function, penalizes the target network. There is possibly a case in which $\cos(u, \hat{u})$, $\cos(y_1, \hat{y}_1)$, and $\cos(y_2, \hat{y}_2)$ all become zero simultaneously. However, in such a case, in which all vectors (including their sums) are orthogonal, CL simply brings no benefit but also causes no degradation. Namely, there is no harm; it is just not effective. Furthermore, if we include the multichannel Wiener filter (MWF), as in UMX and X-UMX, we can expect that this case cannot appear since the MWF redistributes the residual to all sources and thereby always yields a non-orthogonal sum, which results in an error. Note that we need to apply X-scheme after the MWF in that case.
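Both arguments can be checked on a two-dimensional toy example. The vectors below are hypothetical values chosen so that each prediction is orthogonal to its ground truth while the combined prediction points in the opposite direction of the combined ground truth; the squared errors are unnormalized sums rather than expectations.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

# Ground truths and predictions; each prediction is orthogonal to its target.
y1, y2 = [1.0, 0.0], [0.0, 1.0]
y1_hat, y2_hat = [0.0, -1.0], [-1.0, 0.0]
u, u_hat = add(y1, y2), add(y1_hat, y2_hat)

cos_1 = cosine(y1, y1_hat)   # zero: the error is invisible to the cosine term
cos_2 = cosine(y2, y2_hat)   # zero as well
cos_u = cosine(u, u_hat)     # approximately -1: the combination exposes the error

# Squared errors: separate sources versus their combination.
e1, e2 = sub(y1_hat, y1), sub(y2_hat, y2)
separate = dot(e1, e1) + dot(e2, e2)
combined = dot(add(e1, e2), add(e1, e2))
cross_term = 2.0 * dot(e1, e2)           # the "2 * eps1 * eps2" contribution
```

In this example, the combined squared error equals the sum of the separate squared errors plus the cross term, so correlated errors are penalized more strongly once the combination is included in the loss.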
Independent sub-networks can detect each other via the added bridges and CL. The DNN-based MSS network extended with X-scheme can handle multiple sources together, i.e., separate two or more sources, rather than each source independently. From a different viewpoint, CL can be considered to provide a benefit similar to multi-task learning [56] since it handles multiple objectives jointly by computing combinational loss functions.
Fig. 3 CL when mixture consists of four sources
We can apply X-scheme to many DNN-based MSS methods to improve their performance while maintaining almost the same computational cost as the original method since MDL and CL are merely loss functions and bridging operation is achieved with simple averaging operations without increasing the number of learnable parameters. As shown in Section 4, X-scheme improves DNN-based MSS performance.

Experiments
In this section, we present our experiments on X-scheme for MSS. We first explore the effect of the bridging position using X-UMX [30] to provide insights into the optimal position and its sensitivity. Next, we confirm the scalability of X-scheme in a large-scale data regime. Finally, we demonstrate the generality of X-scheme by applying it to other types of network architectures, i.e., D3Net and Conv-TasNet.
We used the following datasets and STFT/ISTFT settings for the experiments.

MUSDB18 [57]
The MUSDB18 dataset comprises 150 songs, each of which was recorded at a 44.1-kHz sampling rate. It consists of two subsets ("train" and "test"), where we further split the train set into "train" and "valid" as defined in the official "musdb" package. For each song, the mixture and its four sources, i.e., bass, drums, other, and vocals, are available.

STFT/ISTFT
We used a Hann window with a length of 4096 samples and 75% overlap. We used STFT magnitudes obtained from the mixture signal as input and trained the networks to estimate the target masks $M_j(t, f)$ or spectrograms $Y_j(t, f)$, where $f$ is the frequency bin and $t$ the frame index. To use STFT and ISTFT as differentiable layers for MDL, we used "torch.stft" and "torch.istft" from PyTorch, which are readily available and provide a differentiable implementation of the STFT/ISTFT. Please see also https://github.com/asteroid-team/asteroid for our actual implementation.
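To make these settings concrete, the following sketch derives the hop size and spectrogram shape from the window length and overlap. The helper name `stft_shape` is hypothetical, and the frame count assumes either centered framing with padding (one frame per hop plus one) or no padding at all; the actual count depends on the padding convention of the STFT implementation.

```python
def stft_shape(n_samples, n_fft=4096, overlap=0.75, centered=True):
    """Return (hop size, number of frequency bins, number of frames)."""
    hop = int(n_fft * (1.0 - overlap))   # 75% overlap -> hop of a quarter window
    n_bins = n_fft // 2 + 1              # one-sided spectrum of a real signal
    if centered:
        n_frames = 1 + n_samples // hop              # signal padded on both sides
    else:
        n_frames = 1 + (n_samples - n_fft) // hop    # no padding
    return hop, n_bins, n_frames

# A 6.0-s crop at 44.1 kHz, as used for training X-UMX.
n_samples = int(6.0 * 44100)
hop, n_bins, n_frames = stft_shape(n_samples)
```

With a 4096-sample window, the hop size is 1024 samples and each frame has 2049 frequency bins; a 6.0-s crop (264,600 samples) yields 259 frames under the centered-framing assumption.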

X-UMX
The network architecture of UMX is illustrated in Fig. 5a. The network was trained to estimate all the sources' masks with the Adam [58] optimizer for 1000 epochs. The learning rate was set to 0.001 with a weight decay of 0.00001. The batch size was set to 14, and each input was a random crop of 6.0 s from the dataset. The scaling parameter $\alpha$ introduced in Eq. (3) for MDL was set to 10.0 to approximately equalize the ranges of $\mathcal{L}^{J}_{\mathrm{MSE}}$ and $\mathcal{L}^{J}_{\mathrm{wSDR}}$, as judged from the learning curves of the two loss functions. Note that the details of other settings are given in our code and previous paper [30].

Bridging positions
As shown in Fig. 2, bridging operation can be applied at arbitrary positions between the layers. The number of bridges can also be increased depending on the number of gaps between adjacent layers. Therefore, in this section, we present the results regarding the position of bridging operation on X-UMX trained under the same configurations, e.g., the number of epochs, regularization parameters, and type of optimizer, as mentioned in the previous subsection. We show the simplified network architecture and possible bridging candidates of UMX in Fig. 5b. UMX roughly consists of three affine blocks and a BLSTM block. Each affine block has a fully connected layer, a batch normalization layer, and an activation function. The BLSTM block has three consecutive BLSTM layers with dropout. In this experiment, we considered three positions as candidates for inserting a bridge and examined the performance for all combinations of bridging positions.
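With three candidate positions, "all combinations" amounts to seven bridged configurations, which can be enumerated as follows. The labels "A", "B", and "C" are hypothetical placeholders for the three candidates, and the enumeration order is not meant to reproduce the BP indexes defined in Fig. 5b.

```python
from itertools import combinations

def bridging_configs(candidates):
    """All non-empty subsets of the candidate positions (at least one bridge)."""
    return [c for k in range(1, len(candidates) + 1)
            for c in combinations(candidates, k)]

configs = bridging_configs(["A", "B", "C"])
# Three single bridges, three pairs, and one configuration using all positions.
```

Together with the unbridged baseline, this gives eight networks to compare.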
The results are shown in Fig. 6. The performances of almost all bridged versions of UMX, i.e., bridging positions (BPs) 1 to 7 (BP1-BP7), were superior to the baseline from the perspective of source-to-interference ratio (SIR) and source image-to-spatial distortion ratio (ISR) (see "Avg." of Fig. 6b and c). Only the SIR result of BP1 did not outperform that of the baseline but was comparable. Hence, we argue that bridging operation can improve the suppression of the interfering instruments without increasing linear distortions since ISR becomes low when the output signal contains more linear distortion. Focusing on the SDR results, which were computed by summing up the weighted SIR, ISR, and SAR, we argue that X-UMX outperformed UMX because the SDR results of BP1-BP7 improved in most cases compared with that of the baseline (see Fig. 6a). In particular, BP4, which bridged the paths between "Affine Block" and "BLSTMs Block," performed the best in terms of the SDR. Hence, bridging paths at the gaps between different types of blocks or layers is probably effective in terms of sharing each sub-extraction network's information.
Fig. 5 Original network architecture of UMX and its bridging candidates

Effectiveness of CL
First of all, we confirmed the validity of the term "$2\epsilon_1\epsilon_2$" mentioned in Section 3.2.2, which is ignored when each instrument's sub-network is trained separately. Not only this term but also bridging the networks brings a benefit. Tying the networks together helps the training since gradient information is now also exchanged, which can help in learning to keep the "$2\epsilon_1\epsilon_2$" term small. Furthermore, the sub-networks also benefit from joint feature extraction. In this way, by computing this term through X-scheme, we take the mutual effect among sources into account when training the DNN. Specifically, by considering this term in the loss function, we expect errors that are correlated between instruments to be penalized when an instrument is wrongly separated into the wrong track. To confirm this, we compared the actual separated results of UMX and X-UMX. As shown in Fig. 7, X-UMX succeeded in suppressing errors in the vocals track leaked from drums, as expected. In particular, the improvement was obvious in the regions highlighted by colored rectangles, and it was also audible. We consider that this power, i.e., the energy from drums that leaked into the wrong track, was penalized through the loss function by X-scheme, as discussed in Section 3.2.2, resulting in the performance improvement shown in Fig. 7a.
Fig. 6 Results of bridging position(s) for X-UMX. Note that the correspondence between the bridging indexes written in a and the positions is shown in Fig. 5b
To confirm the validity of CL in more detail, we monitored how the performance changes with the number of combinations. As discussed in Section 3.2.2, the combined vector may bring little benefit, especially when the number of target sources is small. Therefore, we fixed the target source to "vocals" and trained two X-UMXs, which used 3 and 4 instruments, respectively, for CL and the bridging operation. Note that both models always received the mixture signal consisting of 4 instruments as input, but the number of output instruments, i.e., sub-networks, was different. Namely, "X-UMX on 3 sources" separated 3 instruments from the 4-source mixture, whereas "X-UMX on 4 sources" separated all 4 instruments. Consequently, the number of combinations used for CL also differed. The results are summarized in Table 1.
As shown in the table, all results of "X-UMX on 3 sources" were inferior to those of the model with all four sources. Intuitively, the more closely the excluded source was related to the vocals, the worse the performance tended to be. The power of "Bass" is concentrated in lower frequency bands, and thus "Bass" has a lower correlation with "Vocals" than "Drums" does.
An example comparing the synchronized spectrograms of vocals obtained by applying X-UMX and UMX to a musical piece of the MUSDB18 test set. Rectangles of the same color denote corresponding time-frequency regions. For the sake of simplicity, only a single channel is shown here, although the actual results were stereo
Table 1 Performance comparison of separating the "Vocals" track of the MUSDB18 test set with X-UMXs having different numbers of output instruments, in SDR [dB]. Since our X-scheme uses combinations of the separated output instruments, the number of combinations changes and affects performance when the number of output instruments differs. Note that we removed the post-processing, i.e., the multichannel Wiener filter (MWF) [19], which was originally used in X-UMX [30], in this experiment since it requires all separated instruments
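To illustrate why the number of combinations depends on the number of output instruments, the following sketch enumerates source combinations. It assumes, purely for illustration, that combinations contain at least two and fewer than all output sources; the exact set used for CL is the one defined in Section 3.2.2. The function name `source_combinations` and the choice of "bass" as the excluded source are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def source_combinations(sources: List[str], min_size: int = 2) -> List[Tuple[str, ...]]:
    """List all combinations of `min_size` up to len(sources) - 1 sources.

    Assumption: the full set of sources is excluded because it equals the
    mixture; the lower bound `min_size` is an illustrative choice.
    """
    result: List[Tuple[str, ...]] = []
    for size in range(min_size, len(sources)):
        result.extend(combinations(sources, size))
    return result


# Four output instruments versus three (here "bass" is left out).
four = source_combinations(["vocals", "drums", "bass", "other"])
three = source_combinations(["vocals", "drums", "other"])
print(len(four), len(three))
```

Under this assumption, removing one output instrument sharply reduces the number of combined targets available to the loss, which matches the direction of the results in Table 1.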

Scalability with large training datasets
In this section, we discuss the potential of X-UMX with a large-scale training dataset, i.e., X-UMXL, which was not assessed in our previous study [30]. DNNs can generalize well if enough training data is available, and some regularization methods might become ineffective in such a case. Thus, it is important to investigate the scalability of X-scheme. Specifically, we trained UMX and X-UMX on an internal dataset consisting of 1505 songs with a total duration of approximately 100 h, which is 10 times larger than MUSDB18. The dataset is linguistically diverse: 63% of the songs are in English, 20% in French, and 6% in German, and the remaining 11% are in various other languages such as Italian, Spanish, and Dutch. Regarding musical genres, the collection predominantly features pop and rock music. It also includes country songs and movie soundtracks, though these are less prevalent. We denote this dataset as "INTERNAL" and note that it shares no songs with MUSDB18. Each song of INTERNAL consists of four instruments, as in MUSDB18.
The results are summarized in Table 2. X-UMX and X-UMXL outperformed the corresponding UMX and UMXL trained on the same dataset, i.e., MUSDB18 or INTERNAL, for all instruments (see the boldface in Table 2). This shows that X-scheme is effective even when more training data is available.
It is worth noting that X-UMXL greatly outperformed not only our self-implemented UMXL trained on INTERNAL but also the "public UMXL," which was provided by the authors of UMX, although our dataset is one fifth the size of theirs (see the yellow highlighted cells in Table 2). From these results, we argue that X-scheme can use a given training dataset more effectively and can even outperform a conventional setup with more training data.
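The SDR values in Table 2 are aggregated as "median of frames, median of tracks." The following sketch shows one way to read this aggregation: the median over frames is taken per track, and then the median over tracks is reported. The function name `aggregate_sdr`, the toy values, and the skipping of undefined (NaN) frames are assumptions for illustration, not a description of the evaluation toolkit actually used.

```python
import math
from statistics import median
from typing import List


def aggregate_sdr(frame_sdrs_per_track: List[List[float]]) -> float:
    """Return the median over tracks of the per-track median over frames (dB)."""
    track_scores = []
    for frames in frame_sdrs_per_track:
        # Assumption: frames with an undefined SDR (NaN) are skipped.
        valid = [value for value in frames if not math.isnan(value)]
        if valid:
            track_scores.append(median(valid))
    if not track_scores:
        raise ValueError("no valid SDR values")
    return median(track_scores)


# Three toy tracks with frame-wise SDR values in dB (hypothetical numbers).
toy = [[4.0, 6.0, 5.0], [7.0, float("nan"), 9.0], [1.0, 2.0, 3.0]]
score = aggregate_sdr(toy)
```

Using medians at both levels makes the reported score less sensitive to a few frames or tracks with extreme values than a mean would be.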

X-D3Net and X-Conv-TasNet
Next, we integrated X-scheme into D3Net, resulting in X-D3Net. The original D3Net, C1, uses band-wise MDenseNets [21] and integrates their outputs by applying a dense block, but they are independent of each other, i.e., there is no path for sharing information among them. Hence, as in the experiment in Section 4.3.1, the bridging path is added at the end of the band-wise D3 blocks, resulting in X-D3Net; that experiment suggests that a semantic boundary can be a good position for inserting the bridging operation. The differences in network architecture between D3Net and X-D3Net are shown in Fig. 8. Each network of X-D3Net was trained to estimate all the sources' spectrograms with the Adam [58] optimizer for 70 epochs. The initial learning rate was set to 0.001 with a weight decay of 0.00001, and the learning rate was dropped to 0.0003 and 0.0001 after 40 and 60 epochs, respectively. The batch size was set to 6, and each input was a randomly cropped music spectrogram with 352 frames. The scaling parameter α was again set to 10, as for X-UMX.
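The training settings described above can be collected in a short sketch. The names `TRAINING_CONFIG` and `learning_rate` are hypothetical, and the sketch assumes that epochs are counted from 0 and that "after 40 and 60 epochs" means the lower rates apply from epoch index 40 and 60 onward.

```python
# Hyperparameters as stated in the text; the dictionary layout is illustrative.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "epochs": 70,
    "weight_decay": 1e-5,
    "batch_size": 6,
    "input_frames": 352,  # length of each randomly cropped spectrogram
    "alpha": 10,          # scaling parameter, same value as for X-UMX
}


def learning_rate(epoch: int) -> float:
    """Return the step-wise learning rate for a 0-based epoch index."""
    if epoch < 0 or epoch >= TRAINING_CONFIG["epochs"]:
        raise ValueError("epoch out of range")
    if epoch < 40:
        return 1e-3  # initial learning rate
    if epoch < 60:
        return 3e-4  # first drop, after 40 epochs
    return 1e-4      # second drop, after 60 epochs


schedule = [learning_rate(epoch) for epoch in range(TRAINING_CONFIG["epochs"])]
```

In a training loop, `learning_rate(epoch)` would be queried at the start of each epoch and written into the optimizer.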
The results are shown in Fig. 9. Note that "P" denotes the proposed method that includes all components of X-scheme, i.e., MDL, bridging operation, and CL, while "C1-C7" denote the comparative methods, which lack some of these components so that their effectiveness can be confirmed one by one. In terms of SDR, the methods using at least one component of X-scheme, i.e., C2-C7 and P, were superior to D3Net, i.e., C1 (see the average performances denoted as "Avg." in Fig. 9a). Therefore, the validity of each component of X-scheme was confirmed on a CNN-based MSS method (D3Net) as well as an RNN-based MSS method (UMX). Overall, we could improve MSS performance by 0.3 dB.
In particular, the positive effect of MDL was notable compared with our previous corresponding results on X-UMX [30] (see the results of the methods including MDL, i.e., C2, C5, C6, and P). Therefore, regardless of whether the target network is an originally integrated one or one formed by tying independent source sub-networks together, the loss-function-related core components of X-scheme, i.e., MDL and CL, can improve MSS performance.
Table 2 Comparison of X-UMX with UMX in terms of SDR ("median of frames, median of tracks"). Note that all results were evaluated on the MUSDB18 test set
Finally, to show the effectiveness of applying our X-scheme to DNN-based MSS methods, we summarize the performance before and after applying X-scheme in Table 3. To confirm the effectiveness not only for frequency-domain networks, i.e., UMX and D3Net used above, but also for a time-domain network, we also applied X-scheme to Conv-TasNet [1], resulting in X-Conv-TasNet. Note that time-domain networks tend to require much more memory and training time than frequency-domain ones; thus, we did not run an ablation study but instead applied our X-scheme to Conv-TasNet using the lessons learned from X-UMX and X-D3Net. As shown in Table 3, the results for all instruments improved compared with the vanilla networks. In addition to the quantitative results, we also studied the spectrograms shown in Fig. 10. As shown in the figure, all vanilla methods tend to miss the power of the "Other" track, but all of them became able to detect and extract it after applying our X-scheme. This is because the missed power that leaked into the wrong tracks was penalized through the X-scheme loss function, as discussed in Section 3.2.2. Thus, we argue that our X-scheme works well not only for frequency-domain networks but also for time-domain ones, e.g., Conv-TasNet.
From the aforementioned results, we conclude that our X-scheme can be applied to diverse types of networks, such as CNN-based time- and frequency-domain models as well as RNN-based time- and frequency-domain ones. However, please note that the detailed effects of our method, e.g., where the most effective bridging position is, how many combinations and bridges should be used, and which types of time-domain and frequency-domain loss functions are effective, may differ depending on the characteristics of the target network. Therefore, it is important to insert each core part of X-scheme, (i) MDL, (ii) bridging operation, and (iii) CL, one by one and to adapt them so that the optimal configuration for the target network is found.

Conclusion
We revisited our previous proposal and summarized its core component, a versatile scheme called X-scheme. X-scheme consists of three parts: (i) MDL, (ii) bridging operation, and (iii) CL, which improve the performance of DNN-based MSS with almost no increase in calculation cost. Specifically, as MDL and CL are merely loss functions used during training, they do not affect the computational cost at inference. As shown in Fig. 2, the bridging operation adds almost no calculation cost since it involves only a few averaging computations. To verify X-scheme on a type of network different from the recurrent type, i.e., UMX, we derived X-scheme-based convolutional networks in this paper. The frequency-domain and time-domain convolutional networks extended by X-scheme are X-D3Net and X-Conv-TasNet, respectively. We confirmed their validity compared with the original networks through experiments. We also examined the detailed effectiveness of
Fig. 9 Experimental results of X-D3Net. Note that the conventional D3Net, which is denoted as "C1," was re-trained for this experiment to equalize the number of optimizers with that of X-D3Net. Namely, each version of X-D3Net was trained with a single optimizer, while the original paper trained four D3Nets, one for each instrument, using four optimizers separately. Thus, the results of C1 are different from those of the original paper
Table 3 Comparison of our X-scheme applied to 3 different MSS methods in terms of SDR ("median of frames, median of tracks"). Note that all models were trained in our local environment by using each author's official code
bsseval), almost all other existing methods and their implementations also handled each channel of a stereo signal one by one. Thus, for the sake of simplicity, the vectors used in the following equations are denoted as single, i.e., monaural, signals. The DNN then predicts the spectrogram of the jth target source from the input mixture spectrogram X = S{x}: