Elsevier

Speech Communication

Volume 42, Issue 2, February 2004, Pages 143-154
Modification of pitch using DCT in the source domain

https://doi.org/10.1016/j.specom.2003.05.001

Abstract

In this paper, we propose a novel algorithm for pitch modification. The linear prediction (LP) residual is obtained from pitch synchronous frames by inverse filtering the speech signal. The discrete cosine transform (DCT) of each residual frame is then taken. Based on the desired pitch modification factor, the number of DCT coefficients of the residual is changed by truncation or zero padding, and the inverse discrete cosine transform is then computed. This period-modified residual signal is then forward filtered to obtain the pitch-modified speech. The mismatch between the positions of the harmonics of the pitch-modified signal and the LP spectrum of the original signal introduces gain variations, which are more pronounced in the case of female speech [Proc. Int. Conf. on Acoust. Speech and Signal Process. (1997) 1623]. These variations are minimised by modifying the radii of the poles of the filter to broaden the otherwise peaky LP spectrum. The modified LP coefficients are used for both inverse and forward filtering. This pitch modification scheme is used in our concatenative speech synthesis system for Kannada. The technique has also been successfully applied to creating interrogative sentences from affirmative sentences. The modified speech has been evaluated in terms of intelligibility, distortion and speaker identity. The results indicate that our scheme yields acceptable speech in terms of all these parameters for the pitch change factors required for our speech synthesis work.

Introduction

Machine synthesis of speech (Roe and Wilpon, 1994; Syrdal et al., 1995) facilitates convenient information transmission in a number of applications, including voice delivery of text messages and email, voice response to database enquiries, reading aids for the blind and mobile communications. A key challenge in speech synthesis is improving quality (Liberman, 1994), which is assessed by the attributes of intelligibility and naturalness. Of the various approaches to speech synthesis, concatenative synthesis has produced speech of the highest quality to date. Concatenative synthesis involves selecting a class of basic acoustic units, creating an inventory of stored units by recording them from natural voice, and then generating utterances by concatenating appropriately modified segments from this inventory. A critical task in concatenative speech synthesis is modifying the prosody (pitch, amplitude and durations) of the voiced sections of the stored units and creating a concatenation of units that sounds seamless. Methods have been proposed in the literature for both time and pitch scale speech modification (George and Smith, 1992; McAulay and Quatieri, 1986; Portnoff, 1981; Quatieri and McAulay, 1986; Vergin et al., 1997). Pitch scale modification, or pitch modification, has applications such as adjusting the pitch of a singer’s voice to get the desired effect, helping the hearing impaired to understand speech better and modifying speech so that it is easier to code efficiently (Anssi, 1999). The objective of pitch modification is to alter the fundamental frequency of speech without affecting the time-varying spectral envelope. Techniques exist in the literature that accomplish this in the time or frequency domain.

Time domain pitch synchronous overlap-add (PSOLA; Moulines and Charpentier, 1990) is probably the simplest method for good-quality pitch modification of speech signals. In practice, implementing pitch modification in the time domain (TD-PSOLA) requires knowledge of the pitch pulse locations. Exact pitch pulse locations are not essential, but it is crucial to maintain exact pitch synchronicity between successive pitch marks. The signal is windowed pitch synchronously using a Hamming window of length 2–4 pitch periods, centered on the current pitch pulse. A length of 2 periods is usually suitable for pure time-domain modification, and a longer window (>2 periods) is suitable for frequency domain PSOLA (FD-PSOLA). Because the intervals between the pitch pulses are altered, the total length of the signal changes, so time scale modification is usually also needed to maintain the original length of the signal. This is implemented in a simple way: if the pitch is increased, some frames are used twice, and if it is lowered, some frames from the original signal are left out of the synthesized signal.
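The frame reuse and frame dropping described above can be illustrated with a minimal sketch. It covers only the mapping from synthesis pitch marks to analysis frames for a constant pitch factor; the windowing and overlap-add steps are omitted. The function name `map_synthesis_marks` and the nearest-mark selection rule are assumptions made for illustration, not details taken from the cited PSOLA papers.

```python
def map_synthesis_marks(analysis_marks, pitch_factor):
    """Place synthesis pitch marks and pick an analysis frame for each one.

    analysis_marks: increasing sample positions of pitch pulses (at least two).
    pitch_factor:   > 1 raises the pitch, < 1 lowers it (hypothetical convention).
    The overall duration is kept, so frames are repeated or skipped.
    """
    synthesis_marks = [float(analysis_marks[0])]
    frame_indices = [0]
    j = 0
    last = len(analysis_marks) - 1
    while True:
        # Local pitch period around the analysis frame currently in use.
        if j < last:
            period = analysis_marks[j + 1] - analysis_marks[j]
        else:
            period = analysis_marks[j] - analysis_marks[j - 1]
        t = synthesis_marks[-1] + period / pitch_factor
        if t > analysis_marks[-1]:
            break
        # Use the analysis frame whose pitch mark is closest to the new instant.
        j = min(range(len(analysis_marks)),
                key=lambda i: abs(analysis_marks[i] - t))
        synthesis_marks.append(t)
        frame_indices.append(j)
    return synthesis_marks, frame_indices
```

With evenly spaced marks, a factor of 2 selects every analysis frame roughly twice, while a factor of 0.5 selects every second frame and leaves the others out.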

Historically, FD-PSOLA was the first pitch synchronous time scale and pitch scale modification technique proposed in the literature (Charpentier and Stella, 1986). FD-PSOLA and residual domain PSOLA (LP-PSOLA) are two methods that can be adapted almost directly from the TD-PSOLA paradigm. These two methods are more flexible than TD-PSOLA because they provide direct control over the spectral envelope at both the analysis and the synthesis stages. In FD-PSOLA, prior to overlap-add synthesis, each short-time analysis signal is modified; the modification is carried out in the frequency domain on the short-time Fourier transform signal. The algorithm used is basically a frequency domain resampling, which leads to some complex problems in the synthesis stage. If features such as speaker identity hiding are not needed, TD-PSOLA leads to the same results with a much simpler implementation. In practice, FD-PSOLA differs from TD-PSOLA only in the definition of the short-time synthesis signals for pitch scale modifications.

In LP-PSOLA, prior to PSOLA processing, the signal is split into an excitation component e(n) and the spectral envelope A(z). Pitch scale modification is then carried out on the source (residual) signal. The output is obtained by combining the modified source signal with the time-varying spectral envelope, usually using linear prediction. Synthesis is again complex, and the details can be found in the literature (Kleijn and Paliwal, 1995).
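The split into excitation and envelope, and their recombination, can be sketched as a pair of filters. This is a minimal illustration under stated assumptions: the LP coefficients are fixed over the whole signal (in practice they vary frame by frame), the filter state starts at zero, and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p is chosen here for illustration. The function names are hypothetical.

```python
def inverse_filter(speech, lp_coeffs):
    """Residual e(n) = sum_k a_k s(n-k), i.e. FIR filtering with A(z).

    lp_coeffs is [1, a1, ..., ap] (assumed convention, a0 must be 1).
    """
    residual = []
    for n in range(len(speech)):
        acc = 0.0
        for k, a in enumerate(lp_coeffs):
            if n - k >= 0:
                acc += a * speech[n - k]
        residual.append(acc)
    return residual


def forward_filter(residual, lp_coeffs):
    """Speech s(n) = e(n) - sum_{k>=1} a_k s(n-k), i.e. all-pole filtering with 1/A(z)."""
    speech = []
    for n in range(len(residual)):
        acc = residual[n]
        for k in range(1, len(lp_coeffs)):
            if n - k >= 0:
                acc -= lp_coeffs[k] * speech[n - k]
        speech.append(acc)
    return speech
```

If the residual is left unmodified, forward filtering with the same coefficients returns the original samples, which is why any pitch change has to be introduced between the two steps.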

In this paper, we present a new method of modifying the residual obtained after inverse filtering with linear prediction coefficients. Gimenez de los Galanes et al. (1995) modified the pitch by interpolating the residual signal, realized by either upsampling or downsampling. Both upsampling and downsampling remap the 0 to π frequency scale onto the new residual length corresponding to the given pitch modification factor. Once the residual is modified, the spectral envelope responsible for the formant structure is superimposed by forward filtering with the same LP coefficients. Our approach is similar, but differs in how the residual signal is interpolated: interpolation is carried out using forward and inverse orthogonal transformation of the residual signal (Rao and Yip, 1990). Traditionally, low-pass filters are used in sampling rate conversion, for both upsampling and downsampling, to avoid spectral repetitions and aliasing. With fast transforms, the computational complexity of sampling rate conversion can be significantly reduced. Depending on the pitch modification factor, the forward-transformed residual is truncated or zero padded, and the result is inverse transformed.
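The truncate-or-zero-pad step can be sketched for a single pitch-period residual frame as follows. This is an illustrative sketch, not the authors' implementation: it uses an orthonormal DCT-II evaluated directly in O(N^2) so that it needs only the standard library (a fast transform would be used in practice), and the amplitude-preserving gain sqrt(M/N) is an assumption introduced here. All function names are hypothetical.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a sequence (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ortho(c):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(math.sqrt(2.0 / n) * c[k]
                 * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                 for k in range(1, n))
        out.append(s)
    return out


def resample_period(residual_frame, new_length):
    """Change the length of one residual frame in the DCT domain."""
    n = len(residual_frame)
    coeffs = dct_ortho(residual_frame)
    if new_length <= n:
        coeffs = coeffs[:new_length]                 # truncate: shorter period
    else:
        coeffs = coeffs + [0.0] * (new_length - n)   # zero pad: longer period
    gain = math.sqrt(new_length / n)  # assumption: keep the sample amplitude
    return [gain * v for v in idct_ortho(coeffs)]
```

For a pitch factor f applied to a frame of N samples, one plausible choice is `new_length = round(N / f)`, so that raising the pitch shortens the period; this mapping is likewise an assumption of the sketch.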

For time-varying pitch modification using upsampling or downsampling, the low-pass filter must be redesigned every time the factor changes, because the cutoff frequency depends on the pitch modification factor. This is avoided by using an orthogonal transform, irrespective of whether the pitch modification factor is constant or time varying. We have also modified the above algorithm to handle female speech. In this modification, the filter parameters are altered to produce a magnitude response that is significantly less peaky than that of the original linear predictive model used for inverse filtering. This reduces the sensitivity of the filter to pitch modification (Ansari, 1997). The discrete cosine transform (DCT) (Ahmed and Rao, 1975) is used in our algorithm for resampling the residual. Energy loss in the resampling process is minimal because the DCT has high energy compaction.
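One common way to make an LP spectrum less peaky is to shrink all pole radii by a common factor, which amounts to replacing each coefficient a_k with a_k γ^k. The sketch below shows that standard bandwidth-expansion step; whether the paper scales the radii in exactly this way is not stated in the text above, so the rule, the value of γ and the function names are all assumptions.

```python
import cmath
import math


def scale_pole_radii(lp_coeffs, gamma):
    """Return the coefficients of A(z / gamma).

    lp_coeffs is [1, a1, ..., ap] for A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    Every root of A(z) is multiplied by gamma, so 0 < gamma < 1 moves the
    poles of 1/A(z) towards the origin and broadens the spectral peaks.
    """
    return [a * gamma ** k for k, a in enumerate(lp_coeffs)]


def lp_gain(lp_coeffs, omega):
    """Magnitude of the all-pole response 1/A(e^{j omega})."""
    a = sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(lp_coeffs))
    return 1.0 / abs(a)


def resonator(radius, theta):
    """Second-order A(z) with a conjugate pole pair at radius * e^{+/- j theta}."""
    return [1.0, -2.0 * radius * math.cos(theta), radius ** 2]
```

For a single resonance built with `resonator(0.98, math.pi / 4)`, applying `scale_pole_radii` with γ = 0.9 moves the poles to radius 0.882 and lowers the gain at the resonance frequency, which is the flattening effect described above.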

Method

As an alternative to strictly time domain techniques, the ubiquitous source-filter model of speech can be invoked (Rabiner and Schafer, 1975). Prosody modification then becomes a task of separating the excitation and vocal tract components from speech, modifying the excitation, and then recombining with the vocal tract component. In principle, this allows retaining the vocal tract response without any modifications. Ideally, the analysis would separate the excitation signal, which could be

Results and discussion

To demonstrate the effectiveness of this technique, individual phonemes, words and sentences spoken by both male and female volunteers were recorded in the laboratory using a Shure SM58 microphone, with some ambient noise resulting in an SNR of about 25 dB. These utterances were analyzed and re-synthesized for different pitch change factors. Fig. 4(a) shows a segment of a phoneme. Fig. 4(d) gives the corresponding segment of the residual signal extracted by inverse filtering the phoneme using

Conclusion

The proposed algorithm is simple and elegant, and follows directly from the basic source-filter model of speech. Perceptual evaluation shows that it performs well for the range of pitch change factors sufficient for a TTS system. The algorithm uses the DCT and IDCT, and is thus not computationally intensive. The proposed scheme maintains the relative pitch contour of the original signal without any additional processing or precautions. The same basic scheme is valid for both constant and

Acknowledgements

We thank the Ministry of Communication and Information Technology, Government of India, for funding part of this research under the project titled “Algorithms for Kannada speech synthesis”. The authors are also grateful to the reviewers for their contributions, which significantly enhanced the presentation of our results and the discussion.

References (20)

  • Abe, M., 1996. Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System, Progress in Speech...
  • Ahmed, N., et al., 1975. Orthogonal Transforms for Digital Signal Processing.
  • Ansari, R., 1997. Inverse filter approach to pitch modification: application to concatenative synthesis of female...
  • Anssi, R., 1999. Pitch modification and quantization for offline speech coding, M.S. Thesis, Tampere University of...
  • Kleijn, W.B., et al., 1995. Speech Coding and Synthesis.
  • Charpentier, F., Stella, M., 1986. Diphone synthesis using an overlap-add technique for speech waveforms concatenation....
  • Edgington, M., Lowry, A., 1996. Residual-based speech modification algorithms for TTS synthesis. ICSLP 96,...
  • George, E.B., et al., 1992. Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones. J. Audio Eng. Soc.
  • Gimenez de los Galanes, F.M., Savoji, M., Pardo, J.M., 1995. Speech synthesis system based on a variable...
  • Liberman, M., 1994. Computer speech synthesis: its status and prospects, Voice Communication between Humans and...
There are more references available in the full text version of this article.
