Non-parallel dictionary learning for voice conversion using non-negative Tucker decomposition

Voice conversion (VC) is a technique for converting only the speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained on parallel data, which requires elaborate pre-processing of the speech data to generate the parallel data. NMF-VC also tends to be a large model, as this method uses several parallel exemplars for the dictionary matrix, leading to a high computational cost. In this study, an innovative parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.


Introduction
Voice conversion (VC) is a technique used to convert speaker-specific information in the speech of a source speaker into that of a target speaker while retaining linguistic information. Lately, VC techniques have been garnering particular attention [1], and various statistical approaches to VC have been studied [2,3], as these techniques can be applied to numerous tasks [4][5][6][7][8]. Of these approaches, the Gaussian mixture model (GMM)-based mapping method [9] is the most prevalent, and a number of enhancements have been proposed [10][11][12]. Other VC methods, such as approaches based on non-negative matrix factorization (NMF) [13][14][15], neural networks [16], deep learning [17,18], restricted Boltzmann machines [19][20][21], variational autoencoders [22], and generative adversarial networks [23], have also been proposed. Notably, in recent years, NMF has outperformed GMM under parallel data conditions. Exemplar-based NMF-VC retains the high naturalness of the converted speech, and many of its variants have been proposed [24,25]. Whereas more recent deep learning methods require significantly large training data, NMF-VC requires comparatively little training data. Therefore, this study focuses on NMF-VC.
*Correspondence: takashima@stu.kobe-u.ac.jp, Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada-ku, Kobe 657-8501, Japan
NMF [26] is one of the most popular sparse representation methods. The goal of NMF is to decompose the input observation into two matrices: the basis matrix and the weight matrix. In this study, the basis matrix is referred to as the "dictionary," and the weight matrix as the "activity." NMF-based methods can be classified into two approaches: the dictionary-learning approach [14] and the exemplar-based approach [27]. In the dictionary-learning approach, the dictionary and activity are estimated simultaneously during training, and the estimated dictionary is used in conversion. In contrast, in the exemplar-based approach, the training data are used directly as exemplars in the conversion step. By using a learned dictionary instead of exemplars, VC is executed in less computation time. However, both NMF-based approaches require parallel data (aligned speech data from the source and the target speakers, so that each frame of the source speaker's data corresponds to that of the target speaker's data) for training the models, which leads to several problems. First, the data are limited to predefined statements (both speakers must utter the same statements). Second, the training data (the parallel data) are no longer the original speech data, as the speech data are stretched and modified along the time axis when aligned, and there is no certainty that each frame is aligned perfectly. As the dictionary is assembled from parallel data, alignment errors in the parallel data might adversely affect VC performance. Several other approaches have been proposed that do not use (or minimally use) parallel data of the source and the target speakers [28][29][30].
For example, in [28], the spectral relationships between two arbitrary speakers (reference speakers) are modeled using GMMs, and the source speaker's speech is converted using a matrix that projects the feature space of the source speaker into that of the target speaker through that of the reference speakers. Another previous study [30] proposed using phone segmentation results from automatic speech recognition to construct a sub-dictionary for each phone for exemplar-based NMF voice conversion; this technique was applied to non-parallel VC. In this study, the conventional NMF-based VC method is extended into a non-parallel VC method.
To realize the non-parallel approach, a non-negative Tucker decomposition (NTD) [31][32][33]-based dictionary-learning method is proposed. NTD is a non-negative extension of the Tucker decomposition that decomposes the input observation into a set of matrices and one core tensor. Tucker decomposition is generally introduced to deal with a high-order tensor. In recent studies, Tucker decomposition has been widely applied in visual question-answering systems [34] and speech recognition [35]. As spectral features are used for the input observation, the set of matrices consists of two mode matrices, for frequency and time, and the core tensor corresponds to a core matrix. It is assumed that these matrices correspond to the frequency basis matrix, the phonemic information, and a codebook between the frequency bases and each phone, respectively. In the proposed approach, the activity matrix in NMF is decomposed into the codebook and the phonemic information. When learning the dictionaries, the activity matrix is shared between speakers using parallel data in the conventional NMF-VC, whereas in the proposed method, the codebook is shared between speakers and the phonemic information depends on the speaker. Hence, the time-varying phonemic information can be captured for each speaker. During conversion, only the phonemic information matrix is estimated as the activity matrix. As the proposed method allows time-dependent factors for each speaker, there is no necessity for parallel data. To the best of the authors' knowledge, NTD-based VC has not been attempted, except in [36], where Tucker decomposition was used to represent the speaker space and the conversion mechanism was based on GMM. The present VC is based on NMF, and this approach is fundamentally different from that presented in [36].
Several methods have been proposed for tensor decomposition [37][38][39]. In [37], NMF is applied to variational Bayesian matrix factorization, where each observed entry is assumed to follow a beta distribution. Shi et al. [38] proposed tensor decomposition with variance maximization for feature extraction. In [39], pairwise similarity information is incorporated into Tucker tensor decomposition. While these methods have useful properties, it is difficult to adapt them directly to VC. NTD, in contrast, can be readily integrated with NMF-based VC because NMF is the second-order case of the Tucker decomposition with the non-negative constraint.
The rest of this paper is organized as follows. In Section 2, conventional NMF-based VC is described. Section 3 describes the proposed method. Section 4 details the experimental evaluation, and Section 5 details the experiments on VCC 2018. Finally, in Section 6, the conclusions are presented.

NMF-based voice conversion
NMF is a matrix decomposition method under non-negative constraints. The basic idea behind decomposing a matrix X ∈ R^{F×T} is to find two matrices W ∈ R^{F×K} and H ∈ R^{K×T} that minimize the distance between X and WH under non-negative constraints, where F and T represent the number of dimensions and frames, respectively. In NMF, W is called the basis matrix and contains K bases in its columns; H is called the activity matrix and indicates the activity of each basis along the time index.
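As a concrete illustration, this decomposition can be sketched with the standard multiplicative updates for the Kullback-Leibler divergence (a minimal NumPy sketch; the function name, iteration count, and random initialization are illustrative assumptions, not the implementation used in this study):

```python
import numpy as np

def nmf_kl(X, K, n_iter=200, eps=1e-12, seed=0):
    """Decompose a non-negative matrix X (F x T) into a basis W (F x K,
    the "dictionary") and an activity H (K x T) by iteratively applying
    multiplicative updates for the KL divergence d_KL(X, WH)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    ones = np.ones((F, T))
    for _ in range(n_iter):
        # Update W: scale each entry by the ratio (X ./ WH) H^T / (1 H^T).
        R = X / (W @ H + eps)
        W *= (R @ H.T) / (ones @ H.T + eps)
        # Update H with the same rule transposed.
        R = X / (W @ H + eps)
        H *= (W.T @ R) / (W.T @ ones + eps)
    return W, H
```

Because every update multiplies by a non-negative factor, W and H stay non-negative throughout, which is the defining constraint of NMF.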
VC approaches using NMF are divided into two categories: supervised and unsupervised. The supervised approach, known as exemplar-based VC, estimates only the activity from the observation, and the dictionary must be provided. In contrast, the unsupervised approach, i.e., dictionary-learning VC, estimates both the dictionary and the activity from the observation. The proposed method is based on the latter, i.e., the dictionary-learning approach.

Dictionary learning using NMF
Figure 1 shows the basic approach of dictionary-learning NMF-based VC [14], where F, T, and K represent the number of dimensions, frames, and bases, respectively. This VC method needs two dictionaries that are phonemically parallel. W^s ∈ R^{F×K} represents a source dictionary, and W^t ∈ R^{F×K} represents a target dictionary. In exemplar-based VC, these two dictionaries consist of the same words or sentences and are aligned with dynamic time warping (DTW), which is comparable with conventional GMM-based VC. In dictionary-learning VC, these two dictionaries are estimated simultaneously and, as a result, have the same number of bases. For the training source speaker data X^s ∈ R^{F×T} and the training target speaker data X^t ∈ R^{F×T}, the two dictionaries W^s and W^t and the activity H ∈ R^{K×T} are estimated simultaneously. The cost function of this joint NMF is defined as follows:

F = d_KL(X^s, W^s H) + d_KL(X^t, W^t H) + λ ||H||_1,   (1)

where X^s and X^t represent parallel data. In Eq. (1), d_KL(A, B) denotes the Kullback-Leibler divergence between the two matrices A and B, and the last term is a sparsity constraint with L1-norm regularization that causes the activity matrix to be sparse; λ represents the weight of the sparsity constraint. This function is minimized by iteratively updating the parameters, as in traditional NMF. This method assumes that when the source and target spectra (which are from the same words but spoken by different speakers) are expressed as sparse representations of the source dictionary and the target dictionary, respectively, the obtained activity matrices are approximately equivalent to each other. In the conversion process, for the input source spectrogram X^s, only the activity H^s is estimated while the source dictionary W^s is kept fixed.
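The joint estimation with a shared activity can be sketched as follows (a NumPy sketch under the same multiplicative-update assumptions as above; `lam` plays the role of λ in Eq. (1), and all names are hypothetical rather than the authors' code):

```python
import numpy as np

def joint_nmf(Xs, Xt, K, lam=0.2, n_iter=200, eps=1e-12, seed=0):
    """Jointly estimate source/target dictionaries Ws, Wt and a single
    shared activity H from time-aligned (parallel) spectrograms Xs, Xt
    by minimizing d_KL(Xs, Ws H) + d_KL(Xt, Wt H) + lam * ||H||_1."""
    rng = np.random.default_rng(seed)
    F, T = Xs.shape
    Ws = rng.random((F, K)) + eps
    Wt = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    ones = np.ones((F, T))
    for _ in range(n_iter):
        # Each dictionary is updated against its own spectrogram.
        Ws *= ((Xs / (Ws @ H + eps)) @ H.T) / (ones @ H.T + eps)
        Wt *= ((Xt / (Wt @ H + eps)) @ H.T) / (ones @ H.T + eps)
        # The shared H receives gradients from both divergence terms,
        # with the L1 weight lam added to the denominator for sparsity.
        num = Ws.T @ (Xs / (Ws @ H + eps)) + Wt.T @ (Xt / (Wt @ H + eps))
        den = Ws.T @ ones + Wt.T @ ones + lam
        H *= num / (den + eps)
    return Ws, Wt, H
```

The key point this sketch makes explicit is that H is a single matrix serving both spectrograms, which is exactly why the frames of Xs and Xt must be aligned beforehand.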
The estimated source activity H^s is multiplied with the target dictionary W^t, and the converted target spectrogram X̂^t is constructed as follows:

X̂^t = W^t H^s.   (2)
Problems
NMF-based VC has several problems. First, because the source and target utterances are aligned using DTW in advance, the estimated parameters are affected by the quality of the alignment, and some alignment mismatch inevitably remains. Aihara et al. [24] have shown that this mismatch degrades the performance of exemplar-based VC. Second, the activity matrix contains information other than the phonetic information. Aihara et al. [25,27] assumed that the activity matrix contains both phonetic and speaker information and accordingly proposed frameworks to overcome this effect, thereby improving the performance of NMF-based VC. In this study, an alternative approach is proposed: the activity matrix is decomposed into a speaker-shared matrix and a speaker-dependent phonetic information matrix. This decomposition makes parallel data unnecessary. Moreover, during conversion, estimating only the phonetic information matrix as the activity matrix is expected to improve the accuracy of activity estimation.

NTD
Given a non-negative N-way tensor, NTD [40] decomposes the input tensor into a core tensor and a set of mode matrices that are restricted to have only non-negative elements. In this study, as spectral features are used as the input observation, the core tensor reduces to a matrix, and there are two mode matrices. Under these conditions, NTD is simply defined as follows:

X ≈ U G V^⊤,   (3)

where X ∈ R^{F×T}, U ∈ R^{F×M}, V ∈ R^{T×L}, and G ∈ R^{M×L} represent an input spectrogram, a mode matrix along the frequency axis, a mode matrix along the time axis, and a core matrix, respectively. F, T, M, and L indicate the number of frequency bins, frames, frequency bases, and time bases, respectively. The cost function of NTD is defined as follows:

F = d_KL(X, U G V^⊤).   (4)

NTD provides a general form of non-negative tensor factorization that includes NMF as a special case; updating algorithms have been proposed in [40]. These updating algorithms are based on those of NMF.
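For the two-mode (matrix) case used here, NTD can be sketched with multiplicative KL updates in the same style as NMF (an illustrative NumPy sketch, not the authors' algorithm; names and defaults are assumptions):

```python
import numpy as np

def ntd2(X, M, L, n_iter=200, eps=1e-12, seed=0):
    """Two-mode NTD sketch: factor a non-negative X (F x T) as
    U @ G @ V.T, with U (F x M) frequency bases, V (T x L) time bases,
    and a core G (M x L), minimizing d_KL(X, U G V^T)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    U = rng.random((F, M)) + eps
    V = rng.random((T, L)) + eps
    G = rng.random((M, L)) + eps
    ones = np.ones((F, T))
    for _ in range(n_iter):
        # Each factor gets an NMF-style update against the ratio
        # R = X ./ (U G V^T), holding the other two factors fixed.
        R = X / (U @ G @ V.T + eps)
        U *= (R @ V @ G.T) / (ones @ V @ G.T + eps)
        R = X / (U @ G @ V.T + eps)
        V *= (R.T @ U @ G) / (ones.T @ U @ G + eps)
        R = X / (U @ G @ V.T + eps)
        G *= (U.T @ R @ V) / (U.T @ ones @ V + eps)
    return U, V, G
```

Setting G to the identity (M = L) recovers plain NMF with W = U and H = V^T, which is the sense in which NMF is a special case of NTD.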

Dictionary learning using NTD
This section describes the method of estimating a parallel dictionary between the source and target speakers by NTD. The objective function is represented as follows:

F = α d_KL(X^s, U^s G V^{s⊤}) + β d_KL(X^t, U^t G V^{t⊤}) + λ (||V^s||_1 + ||V^t||_1),   (5)

where X^s ∈ R^{F×T_s} and X^t ∈ R^{F×T_t}, U^s and U^t ∈ R^{F×M}, V^s ∈ R^{T_s×L} and V^t ∈ R^{T_t×L}, and G ∈ R^{M×L} represent the source and target spectrograms, the source and target frequency basis matrices, the source and target time basis matrices, and a core matrix, respectively. α and β represent the weight of each term. F, T_s, T_t, M, and L indicate the number of frequency bins, source and target frames, frequency bases, and time bases, respectively. This function is minimized by iteratively updating the following equations in the same manner as the NTD:

U^s ← U^s .* ((X^s ./ (U^s G V^{s⊤})) V^s G^⊤) ./ (1_{F×T_s} V^s G^⊤),   (6)

U^t ← U^t .* ((X^t ./ (U^t G V^{t⊤})) V^t G^⊤) ./ (1_{F×T_t} V^t G^⊤),   (7)

V^s ← V^s .* (α (X^s ./ (U^s G V^{s⊤}))^⊤ U^s G) ./ (α 1_{T_s×F} U^s G + λ),   (8)

V^t ← V^t .* (β (X^t ./ (U^t G V^{t⊤}))^⊤ U^t G) ./ (β 1_{T_t×F} U^t G + λ),   (9)

G ← G .* (α U^{s⊤} (X^s ./ (U^s G V^{s⊤})) V^s + β U^{t⊤} (X^t ./ (U^t G V^{t⊤})) V^t) ./ (α U^{s⊤} 1_{F×T_s} V^s + β U^{t⊤} 1_{F×T_t} V^t),   (10)

where .* and ./ denote element-wise multiplication and division, respectively, and 1_{F×T} denotes the F×T all-ones matrix. In this framework, only the core matrix G is shared, and the time-varying matrices V^s and V^t are dependent on each speaker, as shown in Fig. 2. Therefore, there is no necessity for parallel data.
After each matrix in the model is estimated, the source and target parallel dictionaries are calculated as U^s G and U^t G, respectively. During conversion, for a given source spectrogram X^s, only V^s is estimated such that X^s ≈ U^s G V^{s⊤}. Then, the converted target spectrogram X̂^t is obtained as X̂^t = U^t G V^{s⊤}. It is assumed that U^s and U^t represent the frequency basis matrices and that V^s and V^t represent the phonemic information. As the core matrix depends on neither the frequency nor the time, this matrix represents the codebook between the frequency bases and the phones. Based on this assumption, the core matrix establishes a correspondence between frequency bases and phones: there are L phones, and the spectrum of each phone is constructed from M frequency bases. In conventional NMF-based approaches, the activity matrix is assumed to contain only the phonemic information, although it actually contains other information as well; consequently, the estimated activity is degraded. In contrast, the proposed NTD-based approach explicitly decomposes the activity matrix into the speaker-shared information and the speaker-dependent phonemic information. Therefore, the performance of activity estimation during conversion is expected to improve.
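This conversion step can be sketched as follows (a NumPy sketch under assumed names: the trained factors U^s, U^t, and G are fixed, only the activity V^s is estimated with a multiplicative KL update, and the target dictionary is then swapped in; the sparsity term is omitted for brevity):

```python
import numpy as np

def convert(Xs, Us, Ut, G, n_iter=200, eps=1e-12, seed=0):
    """Given trained factors Us, Ut (F x M) and core G (M x L),
    estimate the phonemic activity Vs (T x L) for an input source
    spectrogram Xs (F x T) so that Xs ~= (Us G) Vs^T, then build the
    converted spectrogram by swapping dictionaries: Xt_hat = (Ut G) Vs^T."""
    rng = np.random.default_rng(seed)
    T = Xs.shape[1]
    L = G.shape[1]
    D = Us @ G                        # fixed source dictionary (F x L)
    Vs = rng.random((T, L)) + eps
    ones = np.ones(Xs.shape)
    for _ in range(n_iter):
        # Activity-only multiplicative KL update with D held fixed.
        R = Xs / (D @ Vs.T + eps)
        Vs *= (R.T @ D) / (ones.T @ D + eps)
    return (Ut @ G) @ Vs.T            # converted spectrogram (F x T)
```

Because only Vs is estimated at conversion time, this step has the same cost as activity estimation in conventional dictionary-learning NMF-VC.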

Conditions
The proposed VC technique was evaluated in a speaker-conversion task using clean speech data by comparing its results with those of the conventional GMM-based method [10], the conventional NMF-based dictionary-learning method [14], and an adaptive restricted Boltzmann machine (ARBM)-based method [20] that does not use parallel data. For the evaluation, speech data of three male and three female speakers stored in the ATR Japanese speech database [41] were used. The sampling rate was 16 kHz. A total of 45 sentences were used for training, and another 50 sentences were used for testing. Parallel data aligned using dynamic programming matching (DPM) were used to train the GMM-based and NMF-based methods. The proposed method and the ARBM-based method do not require parallel data. As training data, the same utterances were used for the source and the target speaker in the parallel setting, and completely different utterances for each speaker were used in the non-parallel setting.
Parameter initialization has a significant impact on the conversion performance. In this study, V^s and V^t are initialized randomly. Table 1 shows the initialization algorithm for U^s, U^t, and G. In the parallel setting, the initialization is based on the NMF framework using parallel data calculated from the source and target training data. In the non-parallel setting, the initialization is based on the NMF and NTD frameworks; this initialization method uses an adaptive matrix [42]. Finally, the initialized parameters are optimized by Eqs. (6) to (10). In the conventional NMF-based method and the proposed method, a 513-dimensional WORLD spectrum [43] is used for the spectral features. The hyperparameters α and β compensate for the lengths of the training data of the source and the target speaker, respectively, and were set as α = 1/T_s and β = 1/T_t, where T_s and T_t represent the number of frames of the source and target training data, respectively. The sparsity constraint λ was set to 0.2. The parameters are updated until the convergence condition |F_t − F_{t−1}| < ε|F_t| is fulfilled, where F_t indicates the value of the objective function at iteration t, and ε was set to exp(−9). The GMM experiments were implemented using sprocket [44]. In the conventional NMF-based dictionary-learning method, the number of bases is 1000. In the ARBM-based method, a 32-dimensional Mel-cepstrum calculated from the 513-dimensional WORLD spectra was used as the input vector, and softmax constraints were imposed on the hidden units. In this study, a conventional linear regression based on the mean and standard deviation [10] was used to convert F0 information. Other information, such as aperiodic components, was synthesized without any conversion.
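The length compensation and stopping rule described above can be sketched as follows (the reciprocal-frame-count form of α and β is a reading of the description here, not a quoted formula, and the function names are hypothetical):

```python
import numpy as np

def length_weights(Ts, Tt):
    """Length-compensating weights: each divergence term is scaled by
    the reciprocal of that speaker's frame count, so a longer recording
    does not dominate the joint objective (an assumed reading of the
    paper's setup)."""
    return 1.0 / Ts, 1.0 / Tt

def converged(f_curr, f_prev, eps=np.exp(-9)):
    """Stop when the change of the objective value between iterations
    falls below eps relative to the current objective value."""
    return abs(f_curr - f_prev) < eps * abs(f_curr)
```

In an update loop, `converged` would be checked after each sweep of Eqs. (6) to (10), with `f_curr` being the current value of the objective in Eq. (5).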
The proposed method was evaluated both objectively and subjectively. The subjective evaluation was based on "speech quality" and "similarity to the target speaker" (individuality). In the subjective evaluation, 25 sentences were evaluated by 10 native Japanese speakers. To evaluate the speech quality, a mean opinion score (MOS) test was performed. The opinion score was set on a 5-point scale (5, excellent; 4, good; 3, fair; 2, poor; 1, bad). For the similarity evaluation, an XAB test was conducted, in which each participant listened to the voice of the target speaker and then to the voices converted using the two methods. The participant was then asked to judge which sample sounded more similar to the target speaker's voice.

Parameters
The performance of each method was evaluated using different parameters. One male speaker and one female speaker were selected, and male-to-female and female-to-male conversions were evaluated. First, the performance of the conventional GMM-based VC was evaluated using different numbers of mixtures. The results obtained when using 4, 8, 16, 32, 64, and 128 mixtures are shown in Fig. 3; a lower value indicates a better result. As shown in Fig. 3, the optimal number was close to 8. Therefore, eight mixtures were used in the subsequent experiments.
Next, the performance of the conventional ARBM-based VC was evaluated using different numbers of hidden units; the results are shown in Fig. 4. Finally, the performance of the proposed method was evaluated using different numbers of frequency bases. The results obtained when the number of frequency bases M was 100, 200, 300, 400, and 500 are shown in Fig. 5. The optimal number was around 200. Therefore, 200 was used as the number of frequency bases in the subsequent experiments. In the experiments, to control the number of dictionary bases during conversion, the number of time bases L was fixed at 1000.

Results
In this section, the proposed method is compared with the conventional GMM-, NMF-, and ARBM-based methods.
First, the proposed method is compared with the parallel methods in a parallel setting. Table 2 shows the average MCD values for male-to-female, female-to-male, male-to-male, and female-to-female conversions. In this table, "Mi" and "Fj" indicate the ith male speaker and the jth female speaker, respectively, and src → tar denotes the src-to-tar conversion. The rightmost column in the table indicates the mean value for each method with a 95% confidence interval; a lower value indicates a better result. In these experiments, the models were trained using parallel utterances. The GMM and NMF frameworks require parallel data; for these, the parallel utterances were used to calculate the parallel data. Table 2 clearly demonstrates that the proposed NTD-based dictionary learning is not affected by the alignment errors of DTW and hence yields 10.1% and 1.8% relative improvements compared with the conventional GMM-based method and the conventional NMF-based dictionary learning, respectively. Moreover, the proposed method achieved a significantly better score than both comparative methods under a significance test at p < 0.05.
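The MCD values reported here follow the commonly used mel-cepstral distortion definition, which can be computed as follows (a sketch; the cepstral order and the frame alignment used in the experiments are assumptions, and the 0th coefficient is assumed to be excluded):

```python
import numpy as np

def mcd(mc_conv, mc_tgt):
    """Mel-cepstral distortion in dB between two time-aligned
    mel-cepstral sequences of shape (T, D), averaged over frames:
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)."""
    diff = mc_conv - mc_tgt
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

A lower MCD indicates that the converted cepstra are closer to the target speaker's cepstra, which is why lower values are better in Tables 2 to 4.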
Next, the method was compared with the non-parallel method in a non-parallel setting. Table 3 shows the average MCD values for male-to-female, female-to-male, male-to-male, and female-to-female conversions. These results show that the proposed method performs comparably to the conventional non-parallel method, ARBM, although it achieved a notably worse score than the ARBM-based method under a significance test at p < 0.05. This difference is explained in the next section. Figure 6 shows the results of the MOS test on speech quality. The error bar shows a 95% confidence interval; a higher value indicates a better result. M-to-F, F-to-M, M-to-M, and F-to-F denote male-to-female, female-to-male, male-to-male, and female-to-female conversions, respectively. "NTD (para)" and "NTD (non-para)" denote the proposed method trained with parallel and non-parallel utterances, respectively. The proposed method achieved a significantly better score than the conventional methods. Specifically, NTD in the non-parallel setting showed the best results across all conversions. Figures 7 and 8 show the results of the XAB test. The error bar shows a 95% confidence interval; here too, a higher value indicates a better result. In Fig. 7, the results of the proposed method and the conventional NMF-based dictionary-learning method are compared. In the male-to-female and female-to-female conversions, the proposed method achieved a better score than NMF-based dictionary learning, whereas in the male-to-male and female-to-male conversions, it achieved a lower score. However, the difference between the two methods is not statistically significant, because p > 0.3 in the significance test.
The proposed NTD-based dictionary learning, which does not require the calculation of parallel data, thus showed performance comparable to the conventional NMF-based dictionary learning, which requires parallel data. In Fig. 8, the results of the proposed method and the ARBM-based VC are compared. In conversions to male speakers, the proposed method achieved a better score than the ARBM-based VC, whereas in conversions to female speakers, it achieved a lower score. Only in the male-to-female conversion was the difference significant (p < 0.05); in the other conversions, the difference was not statistically significant. These tests show that the proposed non-parallel VC approach effectively converts the individuality of the source speaker's voice to that of the target speaker while preserving high speech quality.

Discussion
In the objective evaluations, the proposed method achieved a better MCD value than the conventional VC methods that use parallel data. This is because the proposed method is not affected by the mismatch of DPM. Moreover, the proposed NTD-based method yielded better performance even though its number of learned parameters decreased by approximately 60% compared with that of the conventional NMF-based method. This result indicates that the proposed dictionary learning provides a better spectral representation while keeping the number of dictionary bases constant during conversion. In addition, the average difference in MCD between the proposed method and the ARBM-based method was approximately 0.08 dB, which is relatively small. It is assumed that the ARBM-based method has an advantage in terms of MCD because it uses the Mel-cepstrum as its input feature, whereas the NTD-based method uses the WORLD spectrum. In the speech quality test, the proposed method using non-parallel training data achieved a better MOS score than that using parallel utterances. This is attributed to the model's ability to learn more diverse phonemic information from non-parallel data than from parallel utterances. For example, suppose n sentences are used for each speaker as training data. With parallel utterances, which consist of the same contexts for both speakers, the frequency basis matrices U^s and U^t and the codebook G are learned from n context patterns. In the non-parallel setting, where different contexts are used for the source and target speakers, the frequency basis matrices and the codebook are learned from n and 2n context patterns, respectively. The codebook is thus learned effectively with improved generalization ability. Therefore, the method using non-parallel data outperformed that using parallel utterances.

Experiments on voice conversion challenge 2018
The proposed method was also evaluated on the Voice Conversion Challenge (VCC) 2018 [45], which includes both parallel and non-parallel recordings of native English speakers from the USA. VCC 2018 comprises a total of 12 speakers, each with sets of 81 and 35 sentences for training and evaluation, respectively. The recordings were downsampled to 16 kHz. The systems were evaluated on the 16 combinations of source-target pairs. The results of this objective evaluation are shown in Table 4. The proposed method did not outperform the GMM-based VC in the parallel setting, whereas the NTD-based method achieved a 3.89% relative improvement over the ARBM-based method in the non-parallel setting. These results demonstrate that the proposed method is especially effective in non-parallel settings.

Conclusion
An innovative dictionary-learning method for NMF-based voice conversion was proposed that makes non-parallel training of NMF-VC possible. While exemplar-based VC retains the naturalness of the converted speech to a high degree, its source and target dictionaries become very large. Although dictionary-learning VC achieves a compact dictionary representation, the parallel dictionaries of the source and target speakers are difficult to learn. These conventional NMF-VC methods require parallel utterances of the source and target speakers to construct the source and target dictionaries. In this study, a parallel dictionary-learning method for NMF-VC based on NTD was proposed that does not require parallel data during training. NTD decomposes an input observation into a set of mode matrices and one core tensor. In the proposed framework, it is assumed that NTD decomposes the spectrogram into a frequency basis matrix, a phonemic information matrix, and a codebook matrix. Recently, several studies have been conducted on NMF-VC, and the scope of possible applications is widening; the proposed method is expected to assist these applications with non-parallel training. It was confirmed that the proposed method achieved an almost identical MCD to the conventional NMF-based dictionary learning that uses parallel data. Furthermore, the performance of the proposed method was comparable to that of the conventional ARBM-based method in a non-parallel setting.
In future work, we plan to apply the method to assistive technology for speakers with articulation disorders. The speech of such speakers differs considerably from that of unimpaired persons and is difficult to align correctly. The proposed method requires neither the same texts of speech data for the source and target speakers nor frame-wise matching between the acoustic features of both speakers. Furthermore, NTD-based dictionary learning is a natural extension of the NMF-based method and can use both parallel and non-parallel data to learn the dictionary. Therefore, we also aim to investigate a semi-supervised dictionary-learning method that improves the performance of a model trained with a small set of parallel data by using a large set of non-parallel data.
In the real world, background noise deteriorates conversion performance; however, the proposed model has not been designed with noise robustness in mind. In order to retain the quality of converted voices in a noisy environment, noise robustness is required. In our previous study [46], a noise-robust NMF-based VC was proposed, whose performance was improved by 25% compared with the GMM-based method. As the currently proposed method is based on NMF-based VC, the noise-robust conversion can be readily applied to it. The evaluation of the proposed method in a noisy environment will be a topic of future work.