Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder

There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of the fundamental frequency ($f_\mathrm{o}$) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extended functions into the HiFi-GAN generator. The first extension is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to $f_\mathrm{o}$. The second extension is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which can flexibly change its receptive fields depending on the input $f_\mathrm{o}$ to handle large fluctuations in $f_\mathrm{o}$ for the upsampling-based HiFi-GAN generator. The proposed explicit input of excitation signals and LW-PDCNNs corresponding to $f_\mathrm{o}$ are expected to realize high-quality synthesis for the normal and $f_\mathrm{o}$-conversion conditions and for the SR-conversion condition. The results of experiments for unseen speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to $f_\mathrm{o}$ can achieve higher synthesis quality than conventional methods in all (i.e., normal, $f_\mathrm{o}$-conversion, and SR-conversion) conditions.

To control the pitch of a speech waveform, acoustic features including $f_\mathrm{o}$ are extracted from the original speech waveform, and the $f_\mathrm{o}$-converted speech waveform is generated by scaling the $f_\mathrm{o}$ values during the inference process. In the case of neural vocoders, the synthesis quality deteriorates when the input $f_\mathrm{o}$ is not included in the range of the training data. Several approaches have been proposed to solve this problem [33], [34], [35], [36], [37], [38], [39]. In contrast to AR models [33], [34], [35], non-AR models [36], [37], [38], [39] can realize real-time inference. The neural source filter [36] introduces nonlinear filtering and dilated convolutional layers for parametrically generated source excitation signals corresponding to $f_\mathrm{o}$ based on source-filter modeling [40]. A method in [41], HiNet [42], and the neural homomorphic vocoder [43] introduce trainable linear-time-variant filters for impulse trains corresponding to $f_\mathrm{o}$ and noise; these are based on mean-squared-error-based training in the time and frequency domains, GANs [22], and differentiable DSP [44]. Quasi-Periodic WaveNet (QPNet) [33] and Quasi-Periodic Parallel WaveGAN (QPPWG) [37] introduce pitch-dependent dilated convolutional neural networks (PDCNNs), which flexibly change the dilation size of the dilated convolution kernel in response to $f_\mathrm{o}$ fluctuations. The unified source-filter GAN (uSFGAN) [38] improves QPPWG by introducing source-filter modeling that explicitly separates the generation of excitation signals from the filtering. The explicit input of the excitation signal into a generator based on source-filter modeling is highly effective for improving the control accuracy of $f_\mathrm{o}$. PeriodNet [39] uses sinusoidal waves and white noise as excitation signals to explicitly separate the generation of periodic waveforms from that of aperiodic waveforms. This approach is particularly effective for singing voice synthesis because the large $f_\mathrm{o}$ fluctuations contained in singing voices can be efficiently captured through the input of the excitation signals; however, it cannot realize high-fidelity synthesis of normal speech [45]. Although these approaches have achieved high control accuracy of $f_\mathrm{o}$, their synthesis quality tends to be lower than that of purely data-driven vocoders, such as HiFi-GAN. Additionally, most of them consist of a large number of convolution layers and require a high-end GPU for real-time synthesis.
SR conversion, which can expand or compress speech waveforms while preserving the pitch of the sound, is traditionally realized by signal-processing-based approaches, such as waveform similarity overlap-add (WSOLA) [46], time-domain pitch-synchronous overlap-add (TDPSOLA) [47], and source-filter vocoders [6], [7], [8], [9]. However, the synthesis quality of these models is not high. To improve the synthesis quality of SR conversion, a neural-network-based approach using the multi-speaker AR WaveNet vocoder [48], driven by time-compressed or -stretched acoustic features obtained by sinc-interpolation-based resampling [49], outperforms conventional signal-processing-based models [50]. However, the AR WaveNet vocoder cannot realize real-time synthesis, even on a GPU. Additionally, its synthesis quality in the slow-SR condition is particularly low compared with that in the normal- and fast-SR conditions because speech waveforms for slow speech are rarely included in training data. ScalerGAN has recently been proposed to perform real-time neural SR conversion with HiFi-GAN [51]. In ScalerGAN, input mel-spectrograms are non-uniformly compressed or stretched by a GAN, and SR-converted speech waveforms are synthesized by a multi-speaker HiFi-GAN generator with the non-uniformly compressed or stretched features. In contrast to conventional neural vocoders, ScalerGAN cannot control $f_\mathrm{o}$ because mel-spectrograms are used as acoustic features.
In this article, we propose Harmonic-Net and Harmonic-Net+, which are real-time multi-speaker neural speech waveform generative models based on HiFi-GAN. They realize fast and high-quality speech synthesis (SS) on CPUs while preserving the controllability of $f_\mathrm{o}$ and SR. In the design of these models, we introduce two main extensions to the HiFi-GAN generator. First, we propose an excitation signal network with downsampling layers, which hierarchically receives multi-channel excitation signals for harmonic waves corresponding to $f_\mathrm{o}$, whereas conventional methods receive only single-channel sine waves or pulse trains. This explicit input of excitation signals is expected to improve synthesis quality when using scaled $f_\mathrm{o}$ input, similarly to PeriodNet. Second, we propose layerwise PDCNNs (LW-PDCNNs) for the upsampling-based HiFi-GAN generator, whereas the conventional PDCNNs were developed for CNN-based models and cannot be directly applied to the upsampling-based HiFi-GAN generator. As noted above, PDCNNs can incorporate $f_\mathrm{o}$ fluctuations into their model structure, and we expect that the proposed method can further improve the synthesis quality when using scaled $f_\mathrm{o}$ input. Furthermore, the proposed models are expected to be usable for high-quality and real-time neural SR conversion. This is because, although the $f_\mathrm{o}$ contour itself is resampled along the time axis, the input excitation signals corresponding to $f_\mathrm{o}$ are not resampled; instead, the number of repetitions of the input excitation signals is changed, and the direct input of the excitation signals can assist in the synthesis of SR-converted speech waveforms. The proposed models are expected to be particularly effective for slow-SR conversion, which corresponds to increasing the number of repetitions of the input excitation signals. The results of experiments on unseen speaker synthesis, full-band singing voice synthesis, and single-speaker TTS demonstrate that the proposed models with multi-channel harmonic waves can realize higher synthesis quality than conventional methods in all (normal, $f_\mathrm{o}$-conversion, and SR-conversion) conditions. The contributions of this article are as follows:
- An excitation signal network with downsampling layers, which hierarchically receives multi-channel excitation signals for harmonic waves corresponding to $f_\mathrm{o}$, is proposed to improve the $f_\mathrm{o}$ controllability of a HiFi-GAN-based real-time neural vocoder on CPUs.
- LW-PDCNNs for the upsampling-based HiFi-GAN generator are additionally proposed to further improve the $f_\mathrm{o}$ controllability.
- We show that the proposed methods with explicit input of excitation signals are also effective for SR conversion.

The rest of this article is organized as follows. HiFi-GAN [21] is briefly introduced in Section II. Harmonic-Net and Harmonic-Net+ are then proposed in Section III. Section IV describes experiments to compare Harmonic-Net and Harmonic-Net+ with several conventional methods: WORLD [8], WaveNet [50], HiFi-GAN [21], uSFGAN [38], and PeriodNet [39]. Finally, conclusions are presented in Section V.

II. RELATED WORK: HIFI-GAN
HiFi-GAN is a GAN-based neural vocoder that consists of a high-speed generator with transposed convolution layers and two sophisticated discriminators. Similarly to Tacotron 2 [2], band-limited mel-spectrograms are used as the input acoustic features. In contrast to typical neural vocoders with white noise input [13], [14], [15], [17], [18], HiFi-GAN directly upsamples the input acoustic features and synthesizes speech waveforms without white noise input, similarly to MelGAN [16], [19]. The main component of the generator is an upsampling network with a few transposed convolution layers. The generator upsamples the input acoustic features through transposed convolutions until the length of the output sequence matches the temporal resolution of the speech waveforms. After each transposed convolution, multi-receptive field fusion (MRF) is performed. MRF is the aggregation of convolution layers with various receptive fields for efficiently capturing the various frequency components in speech waveforms. In contrast to typical neural vocoders, which have a large number of convolution layers, the HiFi-GAN generator achieves high-speed synthesis with only a few convolution layers, similarly to MelGAN [16], [19]. The two discriminators are a multi-scale discriminator (MSD) and a multi-period discriminator (MPD). The architecture of the MSD, which was also used in MelGAN [16], [19], is a mixture of several sub-discriminators operating at different sampling frequencies.
In the MPD, input audio samples of length T are sampled with each period p and reshaped into two-dimensional features of shape (T/p) × p. Multiple values of p are used, and a sub-discriminator is prepared and trained for each one. The MSD and MPD thus efficiently capture the various frequency components. In addition to adversarial training, a mel-spectrogram loss [21] and a feature matching loss [16], [52] (defined as the distance between the intermediate features of the discriminators) are used with the MSD and MPD to train the generator effectively.
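As an illustration of this period-based reshaping, the following sketch (a simplified stand-in rather than the official implementation; the tensor layout and padding mode are assumptions) shows how a waveform of length T can be viewed as a (T/p) × p feature map for one MPD sub-discriminator:

import torch
import torch.nn.functional as F

def reshape_for_period(audio: torch.Tensor, p: int) -> torch.Tensor:
    # Reshape a waveform of shape (batch, 1, T) into (batch, 1, T//p, p),
    # padding the tail so that T becomes divisible by the period p.
    b, c, t = audio.shape
    if t % p != 0:
        pad = p - (t % p)
        audio = F.pad(audio, (0, pad), mode="reflect")  # padding mode is an assumption
        t = t + pad
    return audio.view(b, c, t // p, p)

# Example: a 1-second 24-kHz waveform viewed with period p = 3.
x = torch.randn(1, 1, 24000)
print(reshape_for_period(x, 3).shape)  # torch.Size([1, 1, 8000, 3])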
By using these sophisticated neural network models, HiFi-GAN has achieved high-quality and real-time SS, even for unseen speaker synthesis, using only a single CPU [21], and the synthesis speed can be further increased by using multiple CPU cores [28], [53]. HiFi-GAN can also be driven by acoustic features for source-filter vocoders instead of band-limited mel-spectrograms [53], such as features based on LPCNet [11].

III. PROPOSED METHOD

In the proposed models, WORLD-based acoustic features, i.e., mel-cepstral coefficients (melcep) and band aperiodicity (BAP), together with glottal closure instants (GCI) or $f_\mathrm{o}$, are used for controlling $f_\mathrm{o}$ instead of mel-spectrograms. The upsampling network receives only melcep and BAP, and upsamples them using transposed convolution blocks until the temporal resolution of the output sequence matches that of the audio waveform. (Although $f_\mathrm{o}$ values were also input to the upsampling network in preliminary experiments, they degraded the controllability of $f_\mathrm{o}$; therefore, only melcep and BAP are input to the upsampling network in the final design.) The transposed convolution blocks (T.Conv blocks) consist of a transposed convolution layer and an MRF module, as used in HiFi-GAN.

A. Harmonic-Net With Excitation Signal Network
In training, the excitation generator receives GCI and generates the sinusoidal waves corresponding to the locations of the GCI, similarly to PeriodNet [39]. These sinusoidal waves are then input to the downsampling layers. The purpose of using GCI is to input excitation signals that are in phase with the target speech waveforms, similarly to PeriodNet [39]. In inference, we use $f_\mathrm{o}$ instead of GCI, as proposed in [39]. The explicit input of $f_\mathrm{o}$ features as time-domain waveform signals is expected to reduce the burden of modeling vocal fold vibration and improve the controllability of $f_\mathrm{o}$. The downsampling network consists of five convolution layers with the same kernel size and stride as the transposed convolution layer of each T.Conv block (Fig. 1). In each layer, the excitation signals are converted to an intermediate feature whose temporal resolution corresponds to that of the output of the T.Conv blocks, and these features are added together. (A downsampling network with one or two convolutional layers was initially investigated, but the synthesis quality could not reach that achieved with five convolutional layers.) This process was inspired by Fre-GAN [26], which uses a hierarchical output structure in the generator to maintain the consistency of the output audio at multiple resolutions. By introducing this process, we expect efficient training to be performed so that the generated speech waveforms maintain consistency with the excitation signals at various temporal resolutions.
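The following minimal sketch illustrates the downsampling side of the excitation signal network; the channel widths, the padding, and the exact way the downsampled features are added to the T.Conv block outputs are assumptions for illustration, not the authors' implementation:

import torch
import torch.nn as nn

class ExcitationDownsampler(nn.Module):
    # Each Conv1d mirrors the kernel size and stride of one transposed
    # convolution in the generator, so every downsampled feature matches
    # the temporal resolution of one T.Conv block output.
    def __init__(self, in_channels, hidden_channels, kernel_sizes, strides):
        super().__init__()
        layers, ch = [], in_channels
        for h, k, s in zip(hidden_channels, kernel_sizes, strides):
            layers.append(nn.Conv1d(ch, h, kernel_size=k, stride=s, padding=(k - s) // 2))
            ch = h
        self.layers = nn.ModuleList(layers)

    def forward(self, excitation):
        # Returns one feature map per resolution (finest first), each of which
        # would be added to the T.Conv block output at the same resolution.
        feats, x = [], excitation
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats

# Example with the 48-kHz T.Conv kernels and strides applied in reverse order
# (channel widths are arbitrary here).
net = ExcitationDownsampler(5, (16, 32, 64, 128, 256), (4, 4, 4, 12, 20), (2, 2, 2, 6, 10))
feats = net(torch.randn(1, 5, 48000))
print([f.shape[-1] for f in feats])  # [24000, 12000, 6000, 1000, 100]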

B. Generation of Multi-Channel Excitation Signals Including Harmonic Wave Components
In most previous studies that introduced source-filter modeling to neural vocoders, sinusoidal waves or summational signals of harmonic components with fixed weighting values were used as excitation signals [36], [38], [39]. In contrast, we propose multi-channel excitation signals up to the $I$th harmonic wave, which consist of sinusoidal waves corresponding to GCI or to $f_\mathrm{o}$ ($i = 1$) and their harmonic components ($i = 2, 3, \ldots, I$), where $i$ is the magnification rate of the harmonic signal. The number of input channels of the proposed excitation signal network is thus $I$, and the $i$th harmonic wave is input to the $i$th channel of the five trainable convolutional layers, so that the channel weighting is learned in a data-driven manner instead of using fixed weighting values [36]. Therefore, the proposed excitation signal network with multi-channel harmonic waves is expected to synthesize and control the harmonic components of output speech waveforms more accurately than the previous methods using simple sinusoidal waves [38], [39] or summational signals of harmonic components with fixed weighting values [36].
GCI can be defined as a sequence that collects the time points at which the most basic phases of the speech waveform match. Let $\boldsymbol{g} = [g_1, \ldots, g_q, \ldots, g_Q]$ be the sequence obtained by multiplying each value of the GCI sequence by the sampling frequency $f_s$; $\boldsymbol{g}$ is then the index sequence of the time points at which the phases of natural speech match.
A one-hot vector sequence indicates voiced/unvoiced speech at time step $t$. In training, the $i$th harmonic excitation wave sequence $e_{t,i}$ is generated as a sinusoidal wave whose phase is aligned with $\boldsymbol{g}$, where $\varphi$ denotes the initial phase of the excitation signal at $t$. In inference, the excitation signals are instead generated from the $f_\mathrm{o}$ sequence $[f_{\mathrm{o},1}, \ldots, f_{\mathrm{o},t}, \ldots, f_{\mathrm{o},T}]$ as in (3). HiFi-GAN-based $f_\mathrm{o}$-controllable SS can be realized by controlling $f_{\mathrm{o},l}$ in (3). Additionally, arbitrary $I$-channel waveform signals can be input to the excitation signal network with input channel $I$. For example, single-channel excitation pulse sequences used in source-filter vocoders or single-channel waveforms synthesized by DSP-based source-filter vocoders (e.g., WORLD) can be input with $I = 1$. These input waveforms were investigated in the experiments reported in Section IV.
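As a concrete illustration of the inference-time generation from an $f_\mathrm{o}$ contour, the sketch below builds an $I$-channel harmonic excitation by accumulating the instantaneous phase of the upsampled $f_\mathrm{o}$; the handling of unvoiced frames and the initial phase are simplifying assumptions, and the in-phase, GCI-aligned generation used in training is not reproduced:

import numpy as np

def harmonic_excitation(f0_frames, hop_length, fs, num_harmonics=5):
    # Generate a (num_harmonics, T) array of harmonic sine waves from a
    # frame-level f_o contour. Unvoiced frames (f_o == 0) are left as zeros
    # here, which is an assumption for illustration.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_length)
    voiced = f0 > 0.0
    phase = 2.0 * np.pi * np.cumsum(f0 / fs)  # instantaneous phase
    e = np.zeros((num_harmonics, len(f0)))
    for i in range(1, num_harmonics + 1):
        e[i - 1] = np.where(voiced, np.sin(i * phase), 0.0)
    return e

# Example: a 100-frame contour at 200 Hz, 10-ms hop, 24-kHz sampling.
exc = harmonic_excitation(np.full(100, 200.0), hop_length=240, fs=24000)
print(exc.shape)  # (5, 24000)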

C. Pitch-Dependent Dilated Convolution Network
As shown by the results of the experiments conducted in Section IV (Fig. 5), the Harmonic-Net generator with only an excitation signal network could realize high-fidelity SS while preserving the controllability of $f_\mathrm{o}$ for male speakers. However, the synthesis quality for female speakers under higher-$f_\mathrm{o}$ conditions was degraded because the controlled $f_\mathrm{o}$ value was outside the range of the training data. Collecting higher-$f_\mathrm{o}$ speech data to extend the $f_\mathrm{o}$ range of the training data is costly and impractical because of the burden on the speakers compared with collecting normal-$f_\mathrm{o}$ speech data. Therefore, investigating neural speech waveform generative models that can extrapolate $f_\mathrm{o}$ components outside the range of the training data is important. To further improve the synthesis quality and controllability of $f_\mathrm{o}$ for female speaker synthesis, whose $f_\mathrm{o}$ range is quite large, we introduce PDCNNs, which were previously used in QPPWG and uSFGAN. The causal PDCNN was initially proposed for use in the AR model QPNet [33], as a sophisticated network to directly reflect fluctuations of $f_\mathrm{o}$ in the model structure. The non-causal PDCNN was subsequently proposed for use in non-AR models, such as QPPWG [37], and the synthesis quality was improved by combining it with source-filter modeling in uSFGAN [38]. Because of its higher synthesis quality, we incorporate the non-causal PDCNN into the Harmonic-Net generator. Fig. 2 shows the architectures of the non-causal dilated convolutional neural network (DCNN) and the non-causal PDCNN. The DCNN has gaps between input samples, and the length of each gap is a predefined hyperparameter called the dilation size. In the PDCNN, the dilation size $d$ at time step $t$ is instead determined from $f_{\mathrm{o},t}$, the $f_\mathrm{o}$ at time step $t$, by applying the floor function to the number of samples per $f_\mathrm{o}$ cycle divided by the hyperparameter $a$ (named the dense factor), which specifies the number of samples (in one cycle) that are taken as the inputs of a PDCNN. The model parameters are kept unchanged because only the dilation size $d$ changes according to $f_{\mathrm{o},t}$, and the same filter $W^{(k)}$ is used, as shown in Fig. 2.
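A minimal sketch of the pitch-dependent dilation in the spirit of QPNet/QPPWG is given below; the exact rounding convention, the treatment of unvoiced frames, and the combination with a per-layer base dilation are assumptions rather than the paper's formulation:

import math

def pitch_dependent_dilation(f0_t: float, fs: int, dense_factor: float, base_dilation: int = 1) -> int:
    # One f_o cycle contains fs / f0_t samples; the dense factor a specifies
    # how many of them are taken as inputs, so neighbouring taps are spaced
    # roughly fs / (a * f0_t) samples apart.
    if f0_t <= 0.0:          # unvoiced frame: fall back to the base dilation (assumption)
        return base_dilation
    return max(1, math.floor(fs / (dense_factor * f0_t)) * base_dilation)

# Example: at fs = 24 kHz, a = 4 and f_o = 200 Hz, the dilation is 30 samples.
print(pitch_dependent_dilation(200.0, 24000, 4.0))  # 30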

D. Harmonic-Net+ With Excitation Signal Network and Layerwise Pitch-Dependent Dilated Convolution Neural Networks
The PDCNN has been proposed for CNN-based neural vocoders, such as Parallel WaveGAN [15], in which the temporal resolution is the sampling frequency of the speech waveforms. However, the PDCNN cannot be directly applied to the HiFi-GAN generator because, in HiFi-GAN, the temporal resolution gradually increases as the number of transposed convolutions increases. Therefore, the PDCNN needs to be designed so that the density of the convolution networks increases in the same manner and all of them have the same receptive field. To introduce PDCNNs into the upsampling-based HiFi-GAN generator, we propose LW-PDCNNs. Fig. 3 shows the architecture of the proposed Harmonic-Net+ generator with LW-PDCNNs. Specifically, it is designed such that the PDCNN of the first T.Conv block has only one layer, and the number of layers of each PDCNN increases as the number of stages of the T.Conv block increases; that is, the $n$th T.Conv block has a PDCNN that consists of $n$ layers. The temporal resolution of the $n$th T.Conv block $F_{s,n}$, the kernel size of the $j$th layer of the PDCNN $k_{n,j}$, and the dilation size $d_{n,j}$ are defined from $f_{s,0}$, the temporal resolution of the acoustic features, and $A_n$, the upsampling factor of the $n$th T.Conv block, such that $F_{s,n} = f_{s,0} \prod_{m=1}^{n} A_m$. We set $d_{n,2} = 2d_{n,1}$ in accordance with the results of preliminary experiments. The kernel size and dilation size of the first layer of each PDCNN are designed in the same manner as those of the conventional PDCNN; in the second and subsequent layers, these values are defined according to the upsampling factor.
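The following sketch only enumerates the layerwise layout described above (an $n$-layer PDCNN in the $n$th T.Conv block, with a temporal resolution obtained by multiplying the upsampling factors applied so far); the per-layer kernel and dilation rules beyond $d_{n,2} = 2d_{n,1}$ are not reproduced here:

def lw_pdcnn_plan(frame_rate_hz, upsample_factors):
    # For each T.Conv block n, report its temporal resolution F_{s,n} and the
    # number of PDCNN layers (n) it contains.
    plan, rate = [], frame_rate_hz
    for n, a_n in enumerate(upsample_factors, start=1):
        rate *= a_n              # F_{s,n}: resolution after the n-th block
        plan.append({"block": n, "resolution_hz": rate, "pdcnn_layers": n})
    return plan

# Example: 100-Hz frame rate (10-ms shift) with the 24-kHz upsampling factors.
for row in lw_pdcnn_plan(100, (5, 4, 3, 4)):
    print(row)
# {'block': 1, 'resolution_hz': 500, 'pdcnn_layers': 1} ... up to 24000 Hz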
The proposed LW-PDCNNs enable the upsampling-based HiFi-GAN generator to incorporate $f_\mathrm{o}$ fluctuations into its model structure, and the synthesis quality when using scaled $f_\mathrm{o}$ input is expected to be further improved. HiFi-GAN with only LW-PDCNNs but without an excitation signal network was also investigated in preliminary experiments; however, it could not outperform either Harmonic-Net or Harmonic-Net+ in the $f_\mathrm{o}$-conversion conditions. Therefore, the proposed excitation signal network is important for $f_\mathrm{o}$ conversion.

E. Speech-Rate Conversion
To control SR, the acoustic features (melcep, BAP, and $f_\mathrm{o}$ including the voiced/unvoiced (VUV) flags) extracted from target speech waveforms are resampled with a speech rate $r$ along the time direction, as proposed in [50]. The excitation signals are then generated from the resampled sequence $\boldsymbol{f}_\mathrm{o,resampled} = [f_{\mathrm{o},1}, \ldots, f_{\mathrm{o},t}, \ldots, f_{\mathrm{o},rT}]$ by the excitation generator (3) and input to the Harmonic-Net and Harmonic-Net+ generators. Although melcep and BAP are smoothed by resampling, the excitation signals are not resampled; rather, the number of repetitions of the input excitation signal is changed. Therefore, the direct input of the excitation signals is also expected to improve the synthesis quality for SR conversion, compared with the quality of the conventional method that uses resampled mel-spectrograms.
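A minimal sketch of this procedure is given below; linear interpolation is used as a stand-in for the sinc-interpolation-based resampling of [49] (an assumption), and the excitation is regenerated from the stretched $f_\mathrm{o}$ contour rather than resampled:

import numpy as np

def resample_features(features: np.ndarray, rate: float) -> np.ndarray:
    # Time-stretch frame-level acoustic features of shape (T, D) by a speech
    # rate r, using linear interpolation along the time axis.
    T, D = features.shape
    new_T = int(round(rate * T))
    src = np.linspace(0, T - 1, new_T)
    out = np.empty((new_T, D))
    for d in range(D):
        out[:, d] = np.interp(src, np.arange(T), features[:, d])
    return out

# Slow-SR example (r = 1.5): the f_o contour is stretched along time and the
# excitation is regenerated from it (see the harmonic_excitation sketch above),
# so the excitation itself is never resampled, only repeated for longer.
f0 = np.full((100, 1), 200.0)
f0_slow = resample_features(f0, 1.5)
print(f0_slow.shape)  # (150, 1)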

IV. EXPERIMENTS

A. Experimental Setup
We conducted three experiments to evaluate Harmonic-Net and Harmonic-Net+ in comparison with several conventional methods: WORLD [8] as a reference, HiFi-GAN [21], PeriodNet [39], WaveNet [50], and uSFGAN [38]. These experiments were conducted using a multi-speaker normal speech dataset, a single-speaker full-band singing voice dataset, and a single-speaker normal speech dataset for TTS. For $f_\mathrm{o}$ conversion with low and high $f_\mathrm{o}$, 0.5 × $f_\mathrm{o}$ and 1.5 × $f_\mathrm{o}$ were used, as in [37]. For SR conversion with fast and slow SRs, 0.8 × T and 1.5 × T were used, as in [50]. All the neural network models were implemented in PyTorch [54] and trained on an NVIDIA Tesla V100 GPU. Some of the speech samples used in the experiments are available online.

1) Dataset:
The following open-source corpora were used in all the experiments to ensure reproducibility. For unseen speaker synthesis with multi-speaker models, we used the JVS corpus [55], a Japanese multi-speaker corpus with $f_s$ = 24 kHz. For training, we used 12,447 utterances by 96 speakers (jvs005 to jvs100). For evaluation, we used 120 non-parallel utterances by four speakers (jvs001 to jvs004), which were not included in the training. For full-band singing voice synthesis, we used 50 a cappella songs (about 1 h) by a Japanese female singer from the Tohoku Kiritan corpus [56] with $f_s$ = 96 kHz. We downsampled the audio to 48 kHz and clipped it into segments of appropriate length. We separated all 50 songs into phrases by using the provided labels and used two songs (05.wav and 30.wav), each of which includes 10 phrases, for evaluation; the remaining 48 songs, comprising 376 phrases, were used for training, as in [45]. For single-speaker TTS, we used 7,497 utterances from the JSUT corpus [55], a Japanese single-speaker corpus, downsampled to $f_s$ = 24 kHz, to train the neural vocoders, and used the remaining 50 utterances (Basic5000-0001 to Basic5000-0050) and 150 utterances (Basic5000-0051 to Basic5000-0200) as the evaluation and validation sets, respectively. To train a neural TTS model, we used 4,800 sentences (Basic5000-0201 to Basic5000-5000) from JSUT for which HTS-style context labels (based on manual annotation) were available, as in [45].
2) Neural Vocoders: The network architecture of our implementation of HiFi-GAN was the same as that of the official implementation [21]; we used the V1 model, in which the number of initial channels is 512 [21]. As input features, we used 50-dimensional melcep coefficients with warping coefficient α = 0.455, three-dimensional BAP, and log-scaled continuous $f_\mathrm{o}$ for unseen speaker synthesis and single-speaker TTS with $f_s$ = 24 kHz. For full-band singing voice synthesis with $f_s$ = 48 kHz, we used 50-dimensional melcep coefficients with warping coefficient α = 0.55, five-dimensional BAP, and log-scaled continuous $f_\mathrm{o}$, as in [45]. These features were extracted by CheapTrick [57], D4C [58], and Harvest [59] (based on WORLD [8]), respectively. The 50-dimensional melcep coefficients, which are not affected by $f_\mathrm{o}$, can be extracted from the smooth vocal tract spectra analyzed by CheapTrick, whereas features based on the short-time Fourier transform are affected by $f_\mathrm{o}$. The window and shift lengths were set to 42.7 ms and 10 ms, respectively. Additionally, we used HiFi-GAN models with 80-dimensional log-mel spectrograms, HiFi-GAN (melspc), as used in [21], to compare the input features. The window and shift lengths were also set to 42.7 ms and 10 ms: the same as those of the WORLD features. Although the original HiFi-GAN used 256-fold upsampling [21], we applied 240- or 480-fold upsampling to obtain a resolution of 24 kHz or 48 kHz from input features with a frame shift of 10 ms. Therefore, we set the upsampling rates of the transposed convolution layers to (5, 4, 3, 4) and the kernel sizes to (11, 8, 7, 8) for 24-kHz synthesis, as in [53], and set the upsampling rates to (10, 6, 2, 2, 2) and the kernel sizes to (20, 12, 4, 4, 4) for 48-kHz synthesis.
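As a quick consistency check of these settings (a sketch using the values above), the products of the upsampling rates must equal the number of samples per 10-ms frame shift:

import numpy as np

# 10 ms corresponds to 240 samples at 24 kHz and 480 samples at 48 kHz,
# so the transposed-convolution rates must multiply to those values.
for fs, rates in [(24000, (5, 4, 3, 4)), (48000, (10, 6, 2, 2, 2))]:
    hop = int(fs * 0.010)
    assert np.prod(rates) == hop, (fs, rates, hop)
    print(fs, rates, "->", hop, "samples per frame")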
The network structure of PeriodNet was the same as that of the non-AR series model in [39]. Its implementation was based on that of Parallel WaveGAN [15] (https://github.com/kan-bayashi/ParallelWaveGAN), as in [45], and we added two generators (to generate periodic and aperiodic signals) and discriminators that operate at multiple sampling frequencies. As input features, we used the same WORLD features as used in HiFi-GAN. The network architecture of uSFGAN was the same as that of the official implementation [38] (https://github.com/chomeyama/UnifiedSourceFilterGAN). As input acoustic features, we again used the same WORLD features as used in HiFi-GAN. The network structure of the WaveNet vocoder was based on [60], with an additional GRU unit for multi-speaker training, as in [61]. As input features for WaveNet, we also used the WORLD features used in HiFi-GAN. We also applied time-invariant noise shaping [60] to suppress the perceptual noise components caused by the prediction error; 35-dimensional melcep coefficients were used, and the parameter that controls the noise energy in the formant regions was set to 0.5 for noise shaping, as in [60].
The network architecture of the proposed Harmonic-Net generator was based on the official implementation of HiFi-GAN [21], with the addition of the proposed excitation signal network. For the Harmonic-Net+ generator, the LW-PDCNNs were based on the official PyTorch implementation of QPPWG [37] (https://github.com/bigpon/QPPWG). The dense factor $a$ in (9) was set to 4.0 for the Harmonic-Net+ generator. The configuration of the other modules, such as the discriminators, was the same as that of the corresponding components of HiFi-GAN. As input features, we used 50-dimensional melcep coefficients, three-dimensional BAP, and GCI extracted by REAPER (https://github.com/google/REAPER). In inference, we used linear $f_\mathrm{o}$ extracted by Harvest [59] instead of GCI. We investigated four types of excitation signals: a single-channel sine wave (sine) with $I = 1$, a pulse sequence (pulse) with $I = 1$, a speech waveform synthesized by the WORLD vocoder (world) with $I = 1$ as a reference, and harmonic waves up to the fifth harmonic (harm) with $I = 5$, as explained in Section III-B; the number of harmonic waves was decided in accordance with the results of preliminary experiments. An excitation signal network using one-channel convolutional layers for summational signals of five harmonic components with fixed weighting values, as in [36], combined with LW-PDCNNs was also investigated initially; however, it was not included in the experiments because preliminary results indicated that it underperformed the proposed excitation signal network using five-channel convolutional layers for five-channel harmonic waves combined with LW-PDCNNs, especially in the high-$f_\mathrm{o}$-conversion condition.
3) Text-to-Speech: As the acoustic model for TTS, we used a FastSpeech-based acoustic model [62] with full-context label input for Japanese, implemented with ESPnet-TTS [63]. We used simple 47-dimensional input vectors constructed from 38-dimensional phoneme one-hot vectors and nine-dimensional accentual label vectors, as in [45]. For TTS, we fine-tuned the HiFi-GAN-based neural vocoders using acoustic features (80-dimensional mel-spectrograms or 55-dimensional WORLD features) estimated by the trained acoustic models, as in [21]. The HiFi-GAN-based models were trained for 1,000,000 steps and fine-tuned for 200,000 steps.

A mean opinion score (MOS) test with a five-point scale (5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad) [64] was conducted to evaluate the subjective perceptual quality of the synthesized speech waveforms. First, Harmonic-Net and Harmonic-Net+ with sine, harm, and pulse excitation signals were directly compared in the 1.0 × $f_\mathrm{o}$ (normal) condition and the 0.5 × $f_\mathrm{o}$ and 1.5 × $f_\mathrm{o}$ ($f_\mathrm{o}$-conversion) conditions to compare the differences among the excitation signals. Twenty adult native Japanese speakers without hearing loss listened to the synthesized speech samples using headphones. Fig. 5 shows the results of the MOS test. Both Harmonic-Net and Harmonic-Net+ with pulse outperformed those with sine in all the $f_\mathrm{o}$ conditions. Harmonic-Net+ with harm was comparable to both Harmonic-Net and Harmonic-Net+ with pulse except in the 1.5 × $f_\mathrm{o}$ condition for female speech, in which Harmonic-Net+ with harm significantly outperformed the other models. These results indicate that the proposed excitation signal network using trainable five-channel convolutional layers for five-channel harmonic waves combined with LW-PDCNNs is more suitable for the high-$f_\mathrm{o}$-conversion condition than one using one-channel convolutional layers for pulse trains (which can be regarded as summational signals of infinitely many harmonic components with fixed weighting values) combined with LW-PDCNNs. In other words, only LW-PDCNNs combined with harmonic waves can extrapolate $f_\mathrm{o}$ components outside the range of the training data while maintaining the synthesis quality in the high-$f_\mathrm{o}$-conversion condition. Therefore, Harmonic-Net+ with harm was used in the following MOS tests. Although the results in Fig. 5 indicate that Harmonic-Net with harm scored slightly lower than Harmonic-Net with pulse except in the 1.5 × $f_\mathrm{o}$ condition for female speech, the results of preliminary experiments for full-band singing voice synthesis suggested that Harmonic-Net with harm significantly outperformed Harmonic-Net with pulse in the 1.5 × $f_\mathrm{o}$ condition. This indicates that the proposed excitation signal network using trainable five-channel convolutional layers for five-channel harmonic waves without LW-PDCNNs is also more suitable for the high-$f_\mathrm{o}$-conversion condition in full-band singing voice synthesis than one using one-channel convolutional layers for pulse trains without LW-PDCNNs. (As described in Section IV-C, Harmonic-Net+ with LW-PDCNNs could not outperform Harmonic-Net without LW-PDCNNs for full-band singing voice synthesis because of the lack of training data.) Therefore, Harmonic-Net with harm was also used in the following MOS tests to match the type of excitation signal to Harmonic-Net+.

B. Evaluation of Unseen Speaker Synthesis With Multi-Speaker Models
For unseen speaker synthesis with multi-speaker models, 1.0 × $f_\mathrm{o}$ for the normal condition, 0.5 × $f_\mathrm{o}$ and 1.5 × $f_\mathrm{o}$ for the $f_\mathrm{o}$-conversion condition, and 0.8 × T and 1.5 × T for the SR-conversion condition were evaluated. (WaveNet was not evaluated in the 0.5 × $f_\mathrm{o}$ and 1.5 × $f_\mathrm{o}$ conditions because the $f_\mathrm{o}$ controllability of AR WaveNet is lower than that of QPNet [33].) Twenty adult native Japanese speakers without hearing loss again listened to the synthesized speech samples using headphones. There were 20 utterances for each model and each condition, where five sentences were randomly selected from each speaker of the evaluation set (jvs001, jvs002, jvs003, and jvs004). The total number of sentences evaluated by each listening subject was therefore 600 (= 20 utterances × (8 + 5 + 5 + 6 + 6) models). Fig. 6 shows the results of the MOS test for unseen speaker synthesis in the normal and $f_\mathrm{o}$-conversion conditions. According to the results for the normal condition (Fig. 6(a) and (b)), Harmonic-Net+ achieved the highest synthesis quality for male speakers, although it could not outperform WaveNet, which cannot realize real-time inference, for female speakers. Comparing the HiFi-GAN and Harmonic-Net models, both the excitation signal network and the LW-PDCNNs contributed to the improvement of the synthesis quality for unseen speaker synthesis in the normal condition.
In the $f_\mathrm{o}$-conversion condition (Fig. 6(c) to (f)), although the conventional HiFi-GAN and uSFGAN could not always outperform WORLD, Harmonic-Net with harm excitation signals improved the synthesis quality compared with the conventional methods, except in the case of 1.5 × $f_\mathrm{o}$ with female speech. This means that the excitation signal network worked effectively for low-$f_\mathrm{o}$ synthesis and for interpolation of high $f_\mathrm{o}$, but it was unable to improve the extrapolation of high $f_\mathrm{o}$. Conversely, Harmonic-Net+ with harm excitation signals further improved the synthesis quality and achieved the best score even in the case of 1.5 × $f_\mathrm{o}$ with female speech.

Fig. 7 shows the results of the MOS test for unseen speaker synthesis in the SR-conversion condition. In the 0.8 × T condition, although the conventional WaveNet and HiFi-GAN outperformed WORLD, Harmonic-Net+ achieved a significantly higher synthesis quality than the conventional methods. In the 1.5 × T condition, the conventional HiFi-GAN models could not outperform WORLD, and WaveNet could not achieve high-quality synthesis for female speakers. Conversely, Harmonic-Net+ achieved the best performance of all the methods.

C. Evaluation of Full-Band Singing Voice Synthesis

Table III shows the results of the objective evaluations for full-band singing voice synthesis in the normal condition (1.0 × $f_\mathrm{o}$ and 1.0 × T). With respect to SNR, SD, and MCD, PeriodNet achieved the best scores, but its inference speed was insufficient for real-time synthesis on a CPU. With respect to $f_\mathrm{o}$-RMSE, Harmonic-Net with harm excitation signals achieved a better score than the other methods except for Harmonic-Net+ with sine excitation, and it maintained real-time speed even for 48-kHz synthesis. Harmonic-Net+ could not realize real-time synthesis because of the large number of parameters associated with 48-kHz synthesis. Additionally, Harmonic-Net+ suffered deterioration in SNR, SD, and MCD; we found that the speech waveforms synthesized by Harmonic-Net+ were buzzy throughout. Fig. 8 shows the spectrograms up to 12 kHz of an original speech waveform included in the speech samples and of those synthesized by Harmonic-Net with harm and Harmonic-Net+ with harm for full-band singing voice synthesis. Compared with the spectrograms of the original (Fig. 8(a)) and Harmonic-Net (Fig. 8(b)), that of Harmonic-Net+ (Fig. 8(c)) includes horizontal stripes, especially in the aperiodic components surrounded by blue squares. These components sound buzzy and degrade the synthesized speech quality of Harmonic-Net+. Table IV shows the results of the objective evaluations for full-band singing voice synthesis in the $f_\mathrm{o}$-conversion condition.

With respect to $f_\mathrm{o}$-RMSE, the Harmonic-Net+ models achieved higher $f_\mathrm{o}$-conversion accuracy than the other models. However, with respect to MCD, the Harmonic-Net+ models were inferior to the other models, and the speech waveforms synthesized by Harmonic-Net+ were again buzzy throughout. Compared with the multi-speaker model trained using the JVS corpus, the Tohoku Kiritan corpus contains only about 1 h of data, although its $f_\mathrm{o}$ range (58 to 793 Hz) is wider than that of the JVS corpus (Fig. 4). The LW-PDCNNs might therefore not have been trained well because of the lack of training data. Further investigation of Harmonic-Net+ with a larger amount of training data for full-band singing voice synthesis is thus required as future work.
We also conducted a MOS test as a subjective evaluation. The evaluation conditions were the same as those for unseen speaker synthesis with multi-speaker models. On the basis of the objective evaluations and a preliminary MOS test, Harmonic-Net+ was not included in this MOS test. Additionally, although uSFGAN achieved high scores in the objective evaluations, it was not included in the MOS test because it could not outperform PeriodNet in preliminary experiments. Harmonic-Net with harm excitation signals was therefore compared with WORLD, HiFi-GAN, and PeriodNet. Twenty subjects listened to all 20 phrases in the evaluation set for each model and each condition. Thus, the total number of phrases evaluated by each listening subject was 480 (= 20 phrases × (6 + 4 + 4 + 5 + 5) models). Fig. 9 shows the results of the MOS test for full-band singing voice synthesis. According to Fig. 9(a), Harmonic-Net achieved almost the same score as PeriodNet. However, because PeriodNet has the problem of low inference speed, Harmonic-Net has the advantage of realizing fast and high-quality full-band singing voice synthesis. In the $f_\mathrm{o}$-conversion condition, Harmonic-Net also achieved a higher synthesis quality than HiFi-GAN, which could not synthesize speech waveforms with $f_\mathrm{o}$-scaled features, and there was a significant difference between Harmonic-Net and PeriodNet in the 0.5 × $f_\mathrm{o}$ condition. In the SR-conversion condition, Harmonic-Net achieved the best synthesis quality, with significant differences, in both the 0.8 × T and 1.5 × T conditions. Therefore, the effectiveness of Harmonic-Net with harm excitation signals was validated for full-band singing voice synthesis.

D. Evaluation of Text-to-Speech
Finally, we subjectively evaluated models using the FastSpeech-based TTS acoustic model. In these experiments, we compared Harmonic-Net+ with HiFi-GAN models that use mel-spectrograms or WORLD features. In SR conversion, the phoneme durations predicted by the duration predictor in FastSpeech were changed for the 0.8 × T and 1.5 × T conditions. In preliminary experiments, there was no significant difference between Harmonic-Net+ and HiFi-GAN because the FastSpeech decoder could synthesize SR-converted acoustic features accurately, similarly to ScalerGAN [51] but differently from simple uniform resampling. Therefore, TTS was investigated only for 1.0 × $f_\mathrm{o}$ in the normal condition and 0.5 × $f_\mathrm{o}$ and 1.5 × $f_\mathrm{o}$ in the $f_\mathrm{o}$-conversion condition. The evaluation conditions were the same as those for unseen speaker synthesis with multi-speaker models and full-band singing voice synthesis. Twenty subjects listened to ten randomly selected sentences from the evaluation set for each model and each condition. Therefore, the total number of sentences evaluated by each listening subject was 80 (= 10 utterances × (4 + 2 + 2) models). Fig. 10 shows the results of the MOS test for single-speaker TTS. According to Fig. 10(a), HiFi-GAN and Harmonic-Net+ achieved high performance, and there was no significant difference between them. In the $f_\mathrm{o}$-conversion condition, Harmonic-Net+ achieved a significantly higher synthesis quality even when the scaled $f_\mathrm{o}$ input was outside the range of the training data, as in unseen speaker synthesis with multi-speaker models and full-band singing voice synthesis. Therefore, the effectiveness of Harmonic-Net+ was confirmed for single-speaker TTS. Future work includes the integration of Harmonic-Net+ into an entire end-to-end TTS system, in a similar manner to [3], [31].

E. Discussion
The results of the MOS test shown in Fig. 5 indicate that the proposed excitation signal network with multi-channel harmonic waves corresponding to $f_\mathrm{o}$, combined with the proposed LW-PDCNNs, can realize high-quality synthesis while preserving $f_\mathrm{o}$ controllability, whereas the conventional methods use only sine waves or pulse trains. In particular, in the high-$f_\mathrm{o}$-conversion condition, only the proposed method with harmonic waves and LW-PDCNNs realized high synthesis quality. The effectiveness of the proposed Harmonic-Net+ with multi-channel harmonic waves and LW-PDCNNs was validated for unseen speaker synthesis and TTS by the results of the MOS tests depicted in Figs. 5 to 7 and 10. Although the effectiveness of the proposed LW-PDCNNs could not be validated for full-band singing voice synthesis, owing to the lack of training data, according to the results of the objective evaluations (Table IV) and the preliminary MOS test, the effectiveness of the proposed excitation signal network with multi-channel harmonic waves was validated by the results of the MOS test shown in Fig. 9. Further investigation of LW-PDCNNs with a larger amount of training data for full-band singing voice synthesis is required as future work. Additionally, the effectiveness of the proposed excitation signal network with multi-channel harmonic waves for SR conversion was also validated by the results of the MOS tests shown in Figs. 7 and 9.

V. CONCLUSION
To realize fast and high-quality neural speech waveform generation while preserving the controllability of $f_\mathrm{o}$ and SR, we proposed Harmonic-Net and Harmonic-Net+, which introduce an excitation signal network and non-AR LW-PDCNNs into the HiFi-GAN generator. The excitation signal network uses multi-channel harmonic waves corresponding to $f_\mathrm{o}$ as excitation signals, and we introduced a downsampling network that receives these excitation signals. The LW-PDCNNs can flexibly change their receptive fields according to the input $f_\mathrm{o}$, and we adjusted the network architecture to fit the structure of the HiFi-GAN generator. By introducing the proposed architectures, the controllability of $f_\mathrm{o}$ is expected to be improved. Additionally, the direct input of the excitation signals is expected to improve the synthesis quality of SR conversion because the excitation signals are not resampled; instead, the number of repetitions of the input excitation signals is changed. We conducted experiments for unseen speaker synthesis with multi-speaker models, full-band singing voice synthesis, and single-speaker TTS. The results confirmed that the proposed excitation signal network and LW-PDCNNs effectively improved the synthesis quality compared with conventional models while realizing real-time inference on a CPU in all (normal, $f_\mathrm{o}$-conversion, and SR-conversion) conditions.