Modification of pitch using DCT in the source domain
Introduction
Machine synthesis of speech (Roe and Wilpon, 1994; Syrdal et al., 1995) facilitates convenient information transmission in a number of applications, including voice delivery of text messages and email, voice response to database enquiries, reading aids for the blind and mobile communications. The key challenge in speech synthesis is improving quality (Liberman, 1994), which is assessed by the attributes of intelligibility and naturalness. Of the various approaches to speech synthesis, concatenative synthesis has produced speech of the highest quality to date. Concatenative synthesis involves selecting a class of basic acoustic units, creating an inventory of stored units by recording them from natural voice, and then generating utterances by concatenating appropriately modified segments from this inventory. A critical task in concatenative speech synthesis is modifying the prosody (pitch, amplitude and durations) of the voiced sections of the stored units and concatenating the units so that the result sounds seamless. Methods have been proposed in the literature for both time scale and pitch scale modification of speech (George and Smith, 1992; McAulay and Quatieri, 1986; Portnoff, 1981; Quatieri and McAulay, 1986; Vergin et al., 1997). Pitch scale modification, or simply pitch modification, has applications such as adjusting the pitch of a singer’s voice to get a desired effect, helping the hearing impaired to understand speech better and modifying speech so that it can be coded more efficiently (Anssi, 1999). The objective of pitch modification is to alter the fundamental frequency of speech without affecting the time-varying spectral envelope. Techniques exist in the literature that accomplish this in the time or frequency domain.
Time-domain pitch-synchronous overlap-add (TD-PSOLA; Moulines and Charpentier, 1990) is arguably the simplest method for good-quality pitch modification of speech signals. In practice, TD-PSOLA requires knowledge of the pitch pulse locations. Exact pitch pulse locations are not essential, but it is crucial to maintain exact pitch synchronicity between successive pitch marks. The signal is windowed pitch synchronously using a Hamming window of length 2–4 pitch periods, centered on the current pitch pulse. A length of 2 periods usually suits pure time-domain modification, and a longer window (>2 periods) suits frequency-domain PSOLA (FD-PSOLA). Because the intervals between the pitch pulses are altered, the total length of the signal changes, so time scale modification is usually also needed to maintain the original length of the signal. This is implemented simply: if the pitch is raised, some frames are used twice; if it is lowered, some frames of the original signal are omitted from the synthesized signal.
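The frame repetition and omission described above can be illustrated with a minimal sketch. It only computes which analysis frame each synthesis pitch mark would reuse; windowing and overlap-add are left out. The function name `map_synthesis_marks`, its arguments, and the nearest-mark selection rule are assumptions made for illustration and are not taken from the cited PSOLA papers.

```python
def map_synthesis_marks(analysis_marks, pitch_factor):
    """For each synthesis pitch mark, return the index of the nearest analysis mark.

    analysis_marks: increasing sample positions of pitch marks (illustrative input).
    pitch_factor:   > 1 raises the pitch (marks closer together), < 1 lowers it.
    The synthesis marks span the same interval as the analysis marks, so the
    duration is preserved and frames are repeated or skipped as needed.
    """
    if len(analysis_marks) < 2 or pitch_factor <= 0:
        raise ValueError("need at least two marks and a positive pitch factor")
    end = analysis_marks[-1]
    mapping = []
    t = float(analysis_marks[0])  # position of the current synthesis mark
    idx = 0                       # index of the analysis mark nearest to t
    while t <= end + 1e-9:
        # Advance to the analysis mark closest to the synthesis position t.
        while (idx + 1 < len(analysis_marks)
               and abs(analysis_marks[idx + 1] - t) <= abs(analysis_marks[idx] - t)):
            idx += 1
        mapping.append(idx)
        # Local analysis period around this mark, scaled by the pitch factor.
        nxt = min(idx + 1, len(analysis_marks) - 1)
        period = analysis_marks[nxt] - analysis_marks[nxt - 1]
        t += period / pitch_factor
    return mapping


marks = [0, 100, 200, 300, 400]            # constant 100-sample pitch period
raised = map_synthesis_marks(marks, 2.0)   # pitch doubled: frames are reused
lowered = map_synthesis_marks(marks, 0.5)  # pitch halved: frames are skipped
```

With a pitch factor of 2, interior frames appear twice in the mapping; with a factor of 0.5, every other frame is left out, which matches the behaviour described in the text.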
Historically, the FD-PSOLA was the first pitch synchronous time scale and pitch scale modification technique proposed in the literature (Charpentier and Stella, 1986). FD-PSOLA and residual domain PSOLA (LP-PSOLA) are two methods that can be adapted almost directly from the TD-PSOLA paradigm. These two methods are more flexible than the TD-PSOLA technique because they provide a direct control over the spectral envelope at both the analysis and the synthesis stages. In FD-PSOLA, prior to overlap add synthesis, each short-time analysis signal is modified; the modification is carried out in the frequency domain on the short-time Fourier transform signal. The algorithm used is basically a frequency domain resampling, which leads to some complex problems in the synthesis stage. It can be said that, if features such as speaker identity hiding are not needed, TD-PSOLA leads to the same results with a much simpler implementation. In practice, FD-PSOLA differs from TD-PSOLA only in the definition of the short-time synthesis signals for pitch scale modifications.
In LP-PSOLA, prior to PSOLA processing, the signal is split into an excitation component e(n) and the spectral envelope A(z). Pitch scale modification is then carried out on the source (residual) signal. The output is obtained by recombining the modified source signal with the time-varying spectral envelope, usually using linear prediction. Synthesis is again complex, and the details can be found in the literature (Kleijn and Paliwal, 1995).
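The split into a residual e(n) and an envelope A(z), followed by recombination, can be sketched as below. This is a minimal illustration assuming the autocorrelation method with the Levinson-Durbin recursion and the sign convention A(z) = 1 + sum_k a[k] z^-k; the function names, the model order and the toy test signal are illustrative choices, not details taken from the paper.

```python
import math


def lpc(signal, order):
    """LP coefficients [1, a1, ..., ap] via autocorrelation + Levinson-Durbin.

    Convention (an assumption of this sketch): A(z) = 1 + sum_k a[k] z^-k.
    """
    n = len(signal)
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable signal: stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a


def inverse_filter(signal, a):
    """Residual e(n) = sum_k a[k] s(n-k): the FIR filter A(z)."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


def synthesis_filter(residual, a):
    """Reconstruct s(n) = e(n) - sum_{k>=1} a[k] s(n-k): the all-pole filter 1/A(z)."""
    out = []
    for n, e in enumerate(residual):
        out.append(e - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out


# Toy signal (two sinusoids, one damped); purely illustrative, not speech data.
s = [math.exp(-0.01 * n) * math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n)
     for n in range(200)]
a = lpc(s, order=4)
e = inverse_filter(s, a)        # excitation (residual) estimate
s_rec = synthesis_filter(e, a)  # recombination with the same envelope
```

If the residual is left unmodified, filtering it through 1/A(z) returns the original signal; pitch modification schemes of this family alter `e` before the synthesis step.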
In this paper, we present a new method of modifying the residual obtained after inverse filtering with linear prediction coefficients. Gimenez de los Galanes et al. (1995) modified the pitch by interpolating the residual signal, realized by either upsampling or downsampling. Both upsampling and downsampling remap the 0 to π scale to the new residual length corresponding to the given pitch modification factor. Once the residual is modified, the spectral envelope responsible for the formant structure will be superimposed by forward filtering with the same LP coefficients. Our approach is similar to the one above, but differs in the interpolation of the residual signal. Interpolation is carried out using forward and inverse orthogonal transformation of the residual signal (Rao and Yip, 1990). Traditionally, low-pass filters are used in sampling rate conversions for upsampling as well as downsampling to avoid spectral repetitions and aliasing. With the help of fast transforms, computational complexity involved in sampling rate conversion can be significantly reduced. Depending on the pitch modification factor, truncation or zero padding is performed on the forward transformed residual and the modified forward transformed residual is inverse transformed.
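The truncate-or-zero-pad idea can be sketched as follows. This is a minimal illustration, assuming an orthonormal DCT-II applied to one segment of the residual and a gain of sqrt(M/N) to keep the amplitude unchanged when going from N to M samples; the segment length, the gain convention and all function names are assumptions of the sketch, since the introduction does not specify them. The transforms are written out directly for clarity rather than with a fast algorithm.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a list of samples (direct O(N^2) form)."""
    n = len(x)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n)
            * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
            for k in range(n)]


def idct_ii(coeffs):
    """Inverse of dct_ii (the orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
                * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]


def dct_resample(segment, new_length):
    """Resample a residual segment to new_length samples in the DCT domain.

    Shortening truncates the high-order coefficients; lengthening zero-pads
    them. No separate low-pass filter is designed in either case.
    """
    n = len(segment)
    if n == 0 or new_length <= 0:
        raise ValueError("segment and new_length must be non-empty/positive")
    coeffs = dct_ii(segment)
    if new_length <= n:
        coeffs = coeffs[:new_length]                 # truncation
    else:
        coeffs = coeffs + [0.0] * (new_length - n)   # zero padding
    gain = math.sqrt(new_length / n)                 # amplitude-preserving scale
    return [gain * v for v in idct_ii(coeffs)]


def new_period_length(period, pitch_factor):
    """Hypothetical helper: target length for a pitch factor (> 1 raises pitch)."""
    return max(1, round(period / pitch_factor))


# Usage: shorten an 80-sample residual period to raise the pitch by 25%.
period = [math.cos(math.pi * (2 * i + 1) * 3 / 160) for i in range(80)]
target = new_period_length(len(period), 1.25)
shorter = dct_resample(period, target)
longer = dct_resample(period, 100)
```

Because the cutoff is set implicitly by how many coefficients are kept, the same routine serves a constant or a time-varying modification factor; only `new_length` changes from segment to segment.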
For time-varying pitch modification using upsampling or downsampling, the low-pass filter must be redesigned every time, because the cutoff frequency varies with the pitch modification factor. This can be avoided by using an orthogonal transform, irrespective of whether the pitch modification factor is constant or time varying. We have also made some modifications to the above algorithm for handling female speech. In this method, the filter parameters are modified to produce a magnitude response that is significantly less peaky than that of the original linear predictive model used for inverse filtering. This reduces the filter's sensitivity to pitch modification (Ansari, 1997). The discrete cosine transform (DCT) (Ahmed and Rao, 1975) has been used in our algorithm for resampling the residual. Energy loss in the resampling process is minimal because the DCT has high energy compaction.
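One common way to make an all-pole response less peaky is bandwidth expansion, in which coefficient a[k] is scaled by gamma^k so that the poles of 1/A(z) move toward the origin. The sketch below uses that technique purely as an illustration; it is an assumption, and the specific modification used by the authors or by Ansari (1997) is not described in this section and may differ. The names `bandwidth_expand`, `magnitude_response` and the value gamma = 0.9 are illustrative.

```python
import cmath
import math


def bandwidth_expand(a, gamma):
    """Scale a[k] by gamma**k, giving A(z / gamma); poles shrink by the factor gamma."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return [coef * gamma ** k for k, coef in enumerate(a)]


def magnitude_response(a, omega):
    """|1 / A(e^{j omega})| for A(z) = sum_k a[k] z^-k with a[0] = 1."""
    denom = sum(coef * cmath.exp(-1j * omega * k) for k, coef in enumerate(a))
    return 1.0 / abs(denom)


# Illustrative second-order resonance: pole pair at radius 0.98, angle 0.5 rad.
radius, angle = 0.98, 0.5
a_sharp = [1.0, -2.0 * radius * math.cos(angle), radius ** 2]
a_flat = bandwidth_expand(a_sharp, gamma=0.9)

peak_sharp = magnitude_response(a_sharp, angle)  # tall, narrow peak
peak_flat = magnitude_response(a_flat, angle)    # lower, broader peak
```

A flatter envelope of this kind changes less when the harmonics of the excitation shift, which is the sensitivity argument made in the text.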
Method
As an alternative to strictly time domain techniques, the ubiquitous source-filter model of speech can be invoked (Rabiner and Schafer, 1975). Prosody modification then becomes a task of separating the excitation and vocal tract components from speech, modifying the excitation, and then recombining with the vocal tract component. In principle, this allows retaining the vocal tract response without any modifications. Ideally, the analysis would separate the excitation signal, which could be
Results and discussion
To demonstrate the effectiveness of this technique, individual phonemes, words and sentences spoken by both male and female volunteers were recorded in the laboratory using a Shure SM58 microphone, with some ambient noise resulting in an SNR of about 25 dB. These utterances were analyzed and re-synthesized for different pitch change factors. Fig. 4(a) shows a segment of a phoneme. Fig. 4(d) gives the corresponding segment of the residual signal extracted by inverse filtering the phoneme using
Conclusion
The proposed algorithm is simple and elegant. It follows directly from the basic source-filter model of speech. Perceptual evaluation shows that it performs well for the range of pitch change factors sufficient for a TTS system. The algorithm uses DCT-IDCT, and thus is not computationally intensive. The proposed scheme maintains the relative pitch contour of the original signal without any additional processing or precautions. The same basic scheme is valid for both constant and
Acknowledgements
We thank the Ministry of Communication and Information Technology, Government of India, for funding part of this research under the project titled “Algorithms for Kannada speech synthesis”. The authors are also grateful for the contributions of the reviewers in significantly enhancing the presentation of our results and the discussion.
References (20)
- Abe, M., 1996. Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System, Progress in Speech...
- et al., 1975. Orthogonal Transforms for Digital Signal Processing.
- Ansari, R., 1997. Inverse filter approach to pitch modification: application to concatenative synthesis of female...
- Anssi, R., 1999. Pitch modification and quantization for offline speech coding, M.S. Thesis, Tampere University of...
- et al., 1995. Speech Coding and Synthesis.
- Charpentier, F., Stella, M., 1986. Diphone synthesis using an overlap-add technique for speech waveforms concatenation....
- Edgington, M., Lowry, A., 1996. Residual-based speech modification algorithms for TTS synthesis. ICSLP 96,...
- et al., 1992. Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones. J. Audio Eng. Soc.
- Gimenez de los Galanes, F.M., Savoji, M., Pardo, J.M., 1995. Speech synthesis system based on a variable...
- Liberman, M., 1994. Computer speech synthesis: its status and prospects, Voice Communication between Humans and...