Analysis of Instantaneous Frequency Components of Speech Signals for Epoch Extraction

The major impulse-like excitation in the speech signal is due to abrupt closure of the vocal folds, which takes place at the glottal closure instant (GCI) or epoch in each cycle. GCIs are used in many areas of speech science and technology, such as in prosody modification, voice source analysis, formant extraction and speech synthesis. It is difficult to observe these discontinuities (corresponding to GCIs) in the speech signal because of the superimposed time-varying response of the vocal tract system. This paper examines the phase part of different frequency components of the speech signal to extract epochs. Three analysis methods to decompose the speech signal into different frequency components are considered. These methods are the short-time Fourier transform (STFT), narrow bandpass filtering (NBPF), and single frequency filtering (SFF). The locations of the discontinuities in the speech signal are obtained from the instantaneous frequency (IF) (i.e., the time derivative of the phase) of each of the frequency components. A method for automatic detection of epochs using the amplitude weighted IF is proposed. Performance of the proposed epoch detection method is compared with four state-of-the-art methods in clean and telephone quality speech. The performance of the proposed method is comparable with the performance of the existing epoch detection methods for clean speech but better for telephone quality speech.


Introduction
Analysis of speech signals using the short-time Fourier transform (STFT) or bandpass filtering focuses mostly on the magnitude part to represent speech information related to the excitation and the vocal tract system. This paper examines the phase part in these analysis methods, and shows that glottal closure instants (GCIs) or epochs in voiced speech can be extracted from the phase part as well. GCIs are used in many areas of speech science and technology: in prosody modification (Rao and Yegnanarayana, 2006), voice source analysis (Alku et al., 2009; d'Alessandro and Sturmel, 2011), glottal activity detection (Murty et al., 2009), estimation of fundamental frequency (Yegnanarayana and Murty, 2009; Kadiri and Yegnanarayana, 2018), estimation of formant frequencies (Joseph et al., 2006; Gowda et al., 2020), time delay estimation (Yegnanarayana et al., 2005; Murthy et al., 2020), estimation of the number of speakers from multi-speaker data (Swamy et al., 2007), speech enhancement (Deepak and Prasanna, 2016) and parametric speech synthesis (Airaksinen et al., 2018; Drugman and Dutoit, 2012). The importance of features derived around GCIs has been studied in the analysis and classification of voice qualities (Kadiri et al., 2020b; Kadiri and Alku, 2021), in the analysis, detection and classification of emotions (Gangamohan et al., 2013; Kadiri et al., 2015, 2020c; Kadiri and Alku, 2020), and in pathological speech analysis and detection (Kadiri and Alku, 2020). More details on GCI detection methods and on GCI-based analysis in speech processing can be found in Yegnanarayana and Gangashetty (2011), Drugman et al. (2014), Kadiri et al. (2020a) and Kadiri et al. (2021).
Analysis of natural signals, such as signals generated by the human speech production mechanism, helps in understanding the underlying phenomenon that is responsible for generating these signals. In general, however, analysis of natural signals is difficult because the underlying signal production system is time varying. Analysis of natural signals is typically conducted by decomposing them into mathematically tractable basis functions. The selection of the basis functions depends on the type of the signals and on the type of information to be extracted from them. The most widely used time-frequency analysis, the STFT, decomposes the signal as a linear combination of complex sinusoids. The coefficients associated with the complex sinusoids at different frequencies represent the frequency spectrum, which consists of magnitude and phase components. Both of these components are needed to reconstruct the signal. The relative importance of these components varies depending on the purpose of analysis of the signal. For example, the Fourier transform phase component carries more useful information in images than the Fourier transform magnitude component (Yegnanarayana et al., 1984; Oppenheim and Lim, 1981). In the case of speech, the Fourier transform magnitude is known to carry useful information (Yegnanarayana et al., 1984; Oppenheim and Lim, 1981). Therefore, most speech processing methods focus on analyzing and modeling the magnitude spectrum to represent the speech production characteristics.
In speech processing, the phase spectrum has received less attention because of the phase wrapping problem (Mowlaee et al., 2016; Vijayan and Murty, 2015). It is sometimes argued that the phase spectrum has a less significant role in auditory perception (Wang and Lim, 1982; Mathes and Miller, 1947; Schroeder, 1975; Mowlaee et al., 2016). Recent studies have examined the significance of phase in human speech perception (Mowlaee et al., 2016; Gerkmann et al., 2015). The importance of phase in the perception of intervocalic stop consonants was analyzed in Liu et al. (1997) by using stimuli that were reconstructed as the magnitude-only and phase-only versions of the consonants. The experiments in Liu et al. (1997) indicated that if the analysis segment length is longer than 50 ms, the phase spectrum takes precedence over the magnitude spectrum. Similar analyses were made in Alsteris and Paliwal (2006), Paliwal and Alsteris (2005) and Paliwal and Wójcicki (2008) to study the effects of the size and shape of the analysis window on the phase of the STFT in speech perception. In Gerkmann et al. (2015), Mowlaee and Saeidi (2013), Krawczyk and Gerkmann (2014) and Paliwal et al. (2011), it was shown that enhancement of speech corrupted by noise can be achieved by suitably modifying the phase spectrum. In Quatieri and Oppenheim (1981), Oppenheim and Lim (1981) and Yegnanarayana et al. (1984), iterative algorithms were studied in phase-only recovery of minimum phase and maximum phase signals. In Alsteris and Paliwal (2004) and Schluter and Ney (2001), it was shown that phase spectrum-based feature extraction improved the performance of automatic speech recognition. In addition, several studies have reported that features representing the phase spectrum are useful in speaker recognition and detection of synthetic speech (Nakagawa et al., 2012; Wang et al., 2010; Bastys et al., 2010; Rajan et al., 2013; Saratxaga et al., 2016). More details of the application of the phase spectrum in speech processing can be found in recent review articles (Mowlaee et al., 2016; Gerkmann et al., 2015).
Reliable estimation of the phase spectrum is problematic due to wrapping of the phase values into the interval $(-\pi, \pi]$. The value of $\theta + 2\pi k$ (where $k$ is an integer) is indistinguishable from $\theta$. There may also exist discontinuities of $2\pi$ at some frequencies. With addition or subtraction of integer multiples of $2\pi$ at these discontinuities, the unwrapped phase can be retrieved in a few cases to preserve the continuity of the phase function (Mowlaee et al., 2016; Gerkmann et al., 2015). An alternative representation of the phase information is to compute the derivative of phase with respect to frequency ($d\theta/d\omega$) and with respect to time ($d\theta/dt$) (Yegnanarayana, 1978; Cohen, 1995). The frequency derivative of the phase spectrum is called the group delay function (GDF) (Quatieri, 2004; Yegnanarayana, 1978). The GDF can be computed using the Fourier transform relations, and it does not require the explicit computation of the phase spectrum. The computation of the GDF involves division by the squared magnitude spectrum. Therefore, the GDF shows high-amplitude peaks at the spectral nulls, which are due to zeros of the transfer function close to the unit circle in the z-plane (Bozkurt et al., 2007). To reduce the effects of these high-amplitude peaks, the GDF can be conditioned using the cepstrally smoothed magnitude spectrum (Murthy and Yegnanarayana, 2011), or the GDF can be multiplied with the squared magnitude spectrum to cancel its denominator (Zhu and Paliwal, 2004). It is worth emphasizing that many of the GD-based methods utilize information from minimum phase equivalent signals (Murthy and Yegnanarayana, 2011).
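The remark that the GDF can be computed from Fourier transform relations, without explicit phase computation, can be illustrated with a short sketch. It uses the standard identity in which the group delay equals $(X_R Y_R + X_I Y_I)/|X|^2$, where $X$ is the DFT of $x[n]$ and $Y$ is the DFT of $n\,x[n]$. The function name `group_delay`, the FFT length and the small power floor are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def group_delay(x, nfft=512, floor=1e-12):
    """Group delay (in samples) on the rfft grid, without phase unwrapping.

    X is the DFT of x[n], Y is the DFT of n*x[n]. The division by |X|^2 is
    the step that produces large peaks near spectral nulls; `floor` is an
    assumed guard against division by zero.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)
    power = np.maximum(np.abs(X) ** 2, floor)
    return (X.real * Y.real + X.imag * Y.imag) / power

# A unit impulse delayed by 5 samples has a flat group delay of 5 samples.
x = np.zeros(32)
x[5] = 1.0
gd = group_delay(x, nfft=64)
```

Dropping the division by `power` gives the numerator alone, which corresponds to multiplying the GDF by the squared magnitude spectrum as mentioned above.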
Another representation is the time derivative of the phase, called instantaneous frequency (IF). IF carries information about the local frequency behavior as a function of time (Boashash, 1992; Cohen, 1995). Most of the methods for the computation of IF use the derivative of the phase of the analytic signal directly, and are therefore subject to the phase wrapping problem (Murty and Yegnanarayana, 2008; Vijayan et al., 2016). IF has been computed using STFT or a filter bank to extract useful information about speech production, such as formant contours and formant bandwidths (Costas, 1981; Ramalingam et al., 1994; Kumaresan et al., 1994; McCowan et al., 2011; Stark and Paliwal, 2008; Vijayan et al., 2019; Tsiakoulis et al., 2013). In most of these studies, the random effects of phase wrapping and the effects of discontinuities due to excitation are smoothed out by averaging IFs over time and frequency.
It is to be noted that the alternative representations of phase, i.e., GD and IF, have been used mainly in basic speech analysis, such as pitch estimation (Kawahara et al., 2016), formant extraction (Murthy and Yegnanarayana, 1991; Kumaresan and Rao, 1999) and spectrum estimation (Yegnanarayana and Murthy, 1992; Stark and Paliwal, 2008). However, there is little previous effort in using the phase information to extract the excitation characteristics of voiced speech, the most important of which are the discontinuities in the speech signal caused by abrupt closure of the vocal folds. This lack of previous work may be due to computational issues and also due to effects of the size and shape of the analysis window used for processing speech signals (Smits and Yegnanarayana, 1995).
In Vijayan and Murty (2016), the authors exploited the phase information to detect the discontinuities at the epochs. The error associated with the all-pass modeling, referred to as the all-pass residual, was used to characterize the excitation. The all-pass residual exhibits prominent peaks at the epochs. As the strength of the peaks in the error signal varies with time, a peak-to-neighborhood-energy-ratio measure was used to identify the peaks corresponding to the epochs. The epochs obtained from the error signal were refined using a dynamic programming algorithm. It is to be noted that this method uses modeling in order to estimate the source and system information. In Murty and Yegnanarayana (2008), an attempt was made to obtain the locations of the discontinuities by computing the IF of the filtered output. This approach depends on the choice of the center frequency of the filter.
In the present study, the IFs computed by three different speech analysis methods are explored to highlight the discontinuities in the signal. The discontinuities in the IF are identified as the epochs. The highlights of the present study are as follows:
• A systematic investigation of the IF computed using STFT, narrow bandpass filtering (NBPF) and single frequency filtering (SFF) is carried out to highlight the discontinuities in the signal.
• As information about discontinuities is present at all frequencies, the IFs computed at all frequencies are combined to highlight epochs.
• To further enhance the detection of epochs, a method based on the amplitude-weighted IFs is proposed.
• The proposed epoch extraction method is compared with existing methods using three databases, which include both clean and telephone quality speech. Performance of the proposed epoch extraction method is shown to be comparable with the performance of the existing methods for clean speech, and is better for telephone quality speech.
The organization of the paper is as follows. A general description of the IF is presented first in Section 2. This section discusses the computation of the IF both for the analytic signal and for multicomponent signals using STFT, NBPF and SFF. In Section 3, the IF is derived for synthetic signals (aperiodic impulse sequences and synthetic speech signals) and for natural speech signals. Section 4 deals with extraction of impulse-like discontinuities in speech signals. A method for epoch extraction using the IF is described in Section 5. The proposed epoch extraction method is compared with existing methods for clean speech and telephone quality speech in Section 6. Finally, a summary of the studies made in this paper is given in Section 7.

Instantaneous frequency (IF)
In this section, the basic definition of IF applicable to a monocomponent signal is given. The IF is the derivative of the phase of an analytic signal. The concept of IF is extended to multicomponent signals by defining the IF for each component. The multicomponent signal is decomposed into several monocomponents using three different methods, namely short-time Fourier transform (STFT), narrow bandpass filtering (NBPF) and single frequency filtering (SFF). The discontinuities in the IF of each component are caused by the discontinuities in the signal due to impulse-like excitation. By combining the IFs due to all components, the impulse characteristics are highlighted at the epoch locations. The three different methods of decomposition of a multicomponent signal yield different phase characteristics. The impulse characteristics in the combined IFs, and their effectiveness in the identification of the epoch locations from these three methods, are examined in the subsequent sections.

IF of analytic signal
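The IF of a real signal $s(t)$ is defined as the time derivative of the unwrapped phase of the complex analytic signal $s_a(t)$ of $s(t)$ (Cohen, 1995; Kadiri, 2018). The analytic signal $s_a(t)$ is given by

$$s_a(t) = s(t) + j\,s_h(t), \tag{1}$$

where $s_h(t)$ is the Hilbert transform of $s(t)$ (Cohen, 1995). Writing $s_a(t)$ in polar form, we get

$$s_a(t) = a(t)\,e^{j\phi(t)}, \tag{2}$$

where

$$a(t) = \sqrt{s^2(t) + s_h^2(t)}, \tag{3}$$

$$\phi(t) = \arctan\!\left(\frac{s_h(t)}{s(t)}\right) \tag{4}$$

are the instantaneous amplitude and phase, respectively. Taking the logarithm on both sides of Eq. (2), we get

$$\ln s_a(t) = \ln a(t) + j\,\phi(t). \tag{5}$$

Taking the derivative with respect to time $t$, we get

$$\frac{s_a'(t)}{s_a(t)} = \frac{a'(t)}{a(t)} + j\,\phi'(t), \tag{6}$$

where the superscript $'$ denotes the derivative operator. The $\phi'(t)$ is the IF, and is given by

$$\phi'(t) = \Im\!\left(\frac{s_a'(t)}{s_a(t)}\right), \tag{7}$$

where $\Im(\cdot)$ denotes the imaginary part.

The IF $\phi'(t)$ can be obtained using the Fourier transform $S_a(\omega)$ of $s_a(t)$ as follows. The analytic signal $s_a(t)$ in terms of $S_a(\omega)$ is given by

$$s_a(t) = \frac{1}{2\pi}\int_{0}^{\infty} S_a(\omega)\,e^{j\omega t}\,d\omega. \tag{8}$$

Note that $S_a(\omega)$ exists only for the positive frequencies. Taking the derivative of $s_a(t)$ with respect to time, we get

$$s_a'(t) = \frac{1}{2\pi}\int_{0}^{\infty} j\omega\,S_a(\omega)\,e^{j\omega t}\,d\omega = j\,\mathrm{IFT}\big(\omega\,S_a(\omega)\big), \tag{9}$$

where IFT is the inverse Fourier transform.

The IF can be obtained from Eqs. (7) and (9) as

$$\phi'(t) = \Re\!\left(\frac{\mathrm{IFT}\big(\omega\,S_a(\omega)\big)}{s_a(t)}\right), \tag{10}$$

where $\Re(\cdot)$ denotes the real part.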
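The Fourier-domain route to the IF described in this subsection can be sketched in discrete time as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `analytic_signal` and `instantaneous_frequency`, the division guard `eps`, and the 500 Hz test tone are all introduced here for illustration. The IF is obtained as the real part of the ratio between the inverse transform of the frequency-weighted spectrum and the analytic signal, so no phase unwrapping is needed.

```python
import numpy as np

def analytic_signal(s):
    """Analytic signal of a real sequence, built by keeping only the
    non-negative-frequency DFT bins (doubling the strictly positive ones)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[1:n // 2] = 2.0
        gain[n // 2] = 1.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(s) * gain)

def instantaneous_frequency(s, eps=1e-12):
    """IF in radians/sample as Re{ IFT(w * S_a(w)) / s_a }, without unwrapping."""
    sa = analytic_signal(s)
    omega = 2.0 * np.pi * np.fft.fftfreq(len(sa))   # bin frequencies, rad/sample
    numerator = np.fft.ifft(omega * np.fft.fft(sa))
    safe_sa = np.where(np.abs(sa) > eps, sa, eps)    # assumed guard against ~0
    return np.real(numerator / safe_sa)

# A 500 Hz tone sampled at 8 kHz (exactly 50 cycles in 800 samples).
fs = 8000.0
t = np.arange(800) / fs
tone = np.cos(2.0 * np.pi * 500.0 * t)
if_hz = instantaneous_frequency(tone) * fs / (2.0 * np.pi)
```

For this periodic test tone the estimate is constant at the tone frequency; for signals with impulse-like events the same ratio departs sharply from the component frequency around the events.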
The IF can be interpreted as the frequency of a sinusoid that fits the signal under analysis. The IF as a function of time shows the deviation of the frequency of the signal from the monotone at every instant of time. For a multicomponent signal, i.e., for a signal consisting of multiple sinusoids, the IF is defined for each frequency component. The IFs vary depending on how the multicomponent signal is decomposed into individual frequency components. Three decomposition methods (STFT, NBPF and SFF) of multicomponent signals are considered in this study.

Decomposition by short-time Fourier transform (STFT)
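The discrete-time STFT of the signal $s[n]$ at time $n$ is given as (Rabiner and Schafer, 2010)

$$S(n, \omega) = \sum_{m=-\infty}^{\infty} s[m]\,w[n-m]\,e^{-j\omega m}, \tag{11}$$

where $w[n]$ is the analysis window in the interval 0 to $(N_w - 1)$. The value of $S(n, \omega)$ at any frequency $\omega_k$ is given by

$$S(n, \omega_k) = \sum_{m=-\infty}^{\infty} s[m]\,w[n-m]\,e^{-j\omega_k m}. \tag{12}$$

The term $h_k[n] = w[n]\,e^{j\omega_k n}$ is the frequency-shifted window function. The $k$th frequency component $s_k[n]$ of $s[n]$ from the STFT decomposition is given by

$$s_k[n] = \sum_{m=-\infty}^{\infty} s[m]\,h_k[n-m]. \tag{13}$$

The $k$th frequency component of the signal can be written (using Eq. (12)) as (Quatieri, 2004)

$$s_k[n] = e^{j\omega_k n}\,S(n, \omega_k). \tag{14}$$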
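The filter-bank view of the STFT used above (a component obtained by convolving the signal with a frequency-shifted window) can be sketched as follows. The function name `stft_component` and its arguments are illustrative assumptions; the 30 ms Hann window follows the setting reported later in the paper, and evaluating the convolution at every sample corresponds to a 1-sample shift.

```python
import numpy as np

def stft_component(s, f_k, fs, win_ms=30.0):
    """k-th STFT frequency component: s[n] convolved with h_k[n] = w[n] e^{j w_k n}."""
    s = np.asarray(s, dtype=float)
    n_w = int(round(win_ms * 1e-3 * fs))        # window length N_w in samples
    w = np.hanning(n_w)                         # analysis window on 0 .. N_w - 1
    omega_k = 2.0 * np.pi * f_k / fs            # normalized frequency, rad/sample
    h_k = w * np.exp(1j * omega_k * np.arange(n_w))
    return np.convolve(s, h_k)[:len(s)]         # causal output, one value per sample

# Example: 500 Hz component of a noise segment sampled at 8 kHz.
fs = 8000.0
rng = np.random.default_rng(0)
s = rng.standard_normal(600)
s_k = stft_component(s, 500.0, fs)
```

The output at time $n$ equals the STFT value at $\omega_k$ multiplied by $e^{j\omega_k n}$, which is why the IF of this component is expected to stay near $\omega_k$.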

Decomposition by narrow bandpass filtering (NBPF)
A multicomponent signal can be decomposed into monocomponent signals using a resonator at each frequency. The resonator is a second-order IIR filter with a pair of complex conjugate poles ($r e^{\pm j\omega_k T}$) in the z-plane. The system function for the $k$th resonator is given by

$$H_k(z) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2}}, \tag{15}$$

where $a_1 = -2r\cos(\omega_k T)$, $a_2 = r^2$, and $T = 1/f_s$ is the sampling interval, i.e., the inverse of the sampling frequency ($f_s$). The value of $r$ defines the bandwidth of the resonator, with smaller values of $r$ ($r \ll 1$) corresponding to larger bandwidths. In this study $r = 0.995$ (corresponding to a bandwidth of ≈12 Hz) is used. The output $s_k[n]$ of the $k$th resonator for the input $s[n]$ is given by

$$s_k[n] = -a_1\,s_k[n-1] - a_2\,s_k[n-2] + s[n]. \tag{16}$$

The $s_k[n]$ corresponds to the $k$th frequency component of the NBPF output.
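A minimal sketch of the second-order resonator described above is given below. The function names are illustrative assumptions. The bandwidth helper uses the common approximation $B \approx -\ln(r)\,f_s/\pi$ for a pole pair of radius $r$, which is an assumption added here (the paper only quotes the resulting value of about 12 Hz).

```python
import numpy as np

def resonator_output(s, f_k, fs, r=0.995):
    """Output of a two-pole resonator with poles at r*exp(+-j*2*pi*f_k/fs)."""
    s = np.asarray(s, dtype=float)
    a1 = -2.0 * r * np.cos(2.0 * np.pi * f_k / fs)
    a2 = r * r
    y = np.zeros(len(s))
    for n in range(len(s)):
        y[n] = s[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def approx_bandwidth_hz(r, fs):
    """Approximate 3 dB bandwidth of the resonator (assumed textbook formula)."""
    return -np.log(r) * fs / np.pi

# Impulse response at 500 Hz for fs = 8 kHz, and the bandwidth for r = 0.995.
fs = 8000.0
impulse = np.zeros(50)
impulse[0] = 1.0
h = resonator_output(impulse, 500.0, fs)
bw = approx_bandwidth_hz(0.995, fs)
```

With `fs = 8000` and `r = 0.995` the approximation gives a bandwidth between 12 and 13 Hz, in line with the value quoted in the text.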

Decomposition by single frequency filtering (SFF)
In SFF, the component signals are obtained by passing the frequency-shifted signal $x_k[n] = s[n]\,e^{-j\bar{\omega}_k n}$ through a single pole resonator, with the pole located on the negative real axis in the $z$-plane at $z = -r$ ($r$ defines the bandwidth) (Aneeja and Yegnanarayana, 2015; Kadiri and Yegnanarayana, 2017). Here $\bar{\omega}_k = \pi - \omega_k$, where $\omega_k$ is the desired frequency. The system function of the single pole resonator is given by

$$H(z) = \frac{1}{1 + r z^{-1}}. \tag{17}$$

The $k$th frequency component is given by

$$s_k[n] = -r\,s_k[n-1] + x_k[n]. \tag{18}$$

It should be noted that $s_k[n]$ is a complex signal with real part $s_{kr}[n]$ and imaginary part $s_{ki}[n]$.

By writing $s_k[n]$ in polar form, we get

$$s_k[n] = v_k[n]\,e^{j\theta_k[n]}, \tag{19}$$

with

$$v_k[n] = \sqrt{s_{kr}^2[n] + s_{ki}^2[n]}, \tag{20}$$

$$\theta_k[n] = \arctan\!\left(\frac{s_{ki}[n]}{s_{kr}[n]}\right), \tag{21}$$

where $v_k[n]$ is the temporal amplitude and $\theta_k[n]$ is the temporal phase. Note that SFF differs from NBPF, as the filter is fixed in SFF, whereas in NBPF the filter varies with the frequency of the component. In addition, the filtering at the highest frequency of $f_s/2$ helps to capture the magnitude and phase information at each frequency, as in the case of modulation of a carrier frequency. Both in NBPF and SFF, the effects of windowing are avoided due to the filtering operation.
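The SFF decomposition can be sketched as below: the signal is frequency-shifted so that the frequency of interest moves to half the sampling frequency, and the shifted signal is passed through a fixed single-pole filter with its pole at $z = -r$. The function name `sff_components`, the vectorization over frequencies, and the choice of test frequencies are illustrative assumptions; the sign of the shifting exponent follows the text above.

```python
import numpy as np

def sff_components(s, freqs_hz, fs, r=0.995):
    """Complex SFF outputs, one row per requested frequency.

    Each row is the recursion s_k[n] = -r * s_k[n-1] + x_k[n] applied to the
    frequency-shifted signal x_k[n] = s[n] * exp(-j * (pi - w_k) * n).
    """
    s = np.asarray(s, dtype=float)
    n = np.arange(len(s))
    omega_bar = np.pi - 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    x = s[None, :] * np.exp(-1j * omega_bar[:, None] * n[None, :])
    y = np.zeros_like(x)
    prev = np.zeros(len(omega_bar), dtype=complex)
    for i in range(len(s)):
        prev = -r * prev + x[:, i]
        y[:, i] = prev
    return y

# Temporal amplitude and phase of three components for a single impulse.
fs = 8000.0
impulse = np.zeros(100)
impulse[20] = 1.0
y = sff_components(impulse, [500.0, 1000.0, 2000.0], fs)
v = np.abs(y)          # temporal amplitude
theta = np.angle(y)    # temporal phase (wrapped)
```

For a single impulse, every component shows the same amplitude decay $r^{n-n_0}$ after the impulse, which reflects the fact that the same filter is used at all frequencies.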

IFs for multicomponent signals
The phase functions of the component signals in Eqs. (14), (16) and (18) are different due to the manner in which the signal is decomposed by STFT, NBPF and SFF, respectively. The characteristics of IFs derived from these phase functions will be different too, although all of them are expected to show discontinuities at the impulse locations of the signal.
The IF for the $k$th frequency component is obtained using the inverse discrete Fourier transform (IDFT) as follows (Murty and Yegnanarayana, 2008):

$$\phi_k'[n] = \frac{2\pi}{N}\,\Re\!\left(\frac{\mathrm{IDFT}\big(l\,S_{a_k}[l]\big)}{s_{a_k}[n]}\right), \tag{22}$$

where $S_{a_k}[l]$ is the discrete Fourier transform of the analytic signal ($s_{a_k}[n]$), and $N$ is the total number of samples in the signal. The complex analytic signal $s_{a_k}[n]$ of $s_k[n]$ is given by

$$s_{a_k}[n] = s_k[n] + j\,s_{k_h}[n], \tag{23}$$

where $s_{k_h}[n]$ is the Hilbert transform of $s_k[n]$. The IF $\phi_k'[n]$ of each component shows deviation from the corresponding frequency $\omega_k$, except in the case of SFF, where all the IFs show deviation from $\pi$, corresponding to half the sampling frequency. The combined IF $\phi'[n]$ is obtained by summing the deviations of all $K$ IFs from the corresponding frequencies. Thus, the combined IF is the sum over $k = 1, \ldots, K$ of the deviation of $\phi_k'[n]$ from $\omega_k$ for the STFT and NBPF methods (Eq. (24)), and of the deviation of $\phi_k'[n]$ from $\pi$ for the SFF method (Eq. (25)).
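The combination step described above reduces to a sum of per-component deviations. The sketch below assumes that the per-component IFs are already available as a `(K, N)` array in radians per sample. The function name `combined_if` is hypothetical, and the sign convention (IF minus reference frequency) is an assumption; the opposite sign is an equally plausible reading of the text.

```python
import numpy as np

def combined_if(ifs, omega_ref):
    """Sum over components of (IF_k[n] - reference frequency).

    ifs       : array of shape (K, N), IF of each component in rad/sample
    omega_ref : length-K array of center frequencies (STFT, NBPF), or a
                scalar such as np.pi (SFF)
    """
    ifs = np.atleast_2d(np.asarray(ifs, dtype=float))
    ref = np.asarray(omega_ref, dtype=float)
    if ref.ndim == 0:
        ref = np.full(ifs.shape[0], float(ref))
    return np.sum(ifs - ref[:, None], axis=0)

# Two components, two time samples.
ifs_nbpf = np.array([[0.4, 0.5], [0.8, 0.9]])
combined_nbpf = combined_if(ifs_nbpf, np.array([0.4, 0.8]))

ifs_sff = np.array([[np.pi, np.pi + 0.1], [np.pi, np.pi - 0.3]])
combined_sff = combined_if(ifs_sff, np.pi)
```

Because each component is referenced to its own frequency (or to $\pi$ for SFF), the combined signal stays near zero wherever the components behave like steady sinusoids.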

IF for synthetic and natural signals
In this section, the IF is computed for synthetic signals and natural voiced speech signals using the three methods of decomposition described in the previous section. For the aperiodic impulse sequence of Fig. 1, the STFT uses a 1024-point FFT with a 30 ms Hann window and 1-sample shift. It is to be noted that in the case of STFT and NBPF, the IF deviates from the normalized center frequency $\omega_k = 2\pi f_k / f_s = \frac{2\pi}{8000}(500) = 0.3927$ radians, and in the case of SFF, the IF deviates from $\pi$ radians. In all the cases, it can be observed that the IF shows discontinuities at the instants of impulse locations, although the discontinuities do not manifest themselves equally at all the time instants, especially for STFT. Since the impulse information is present at all frequencies, the IFs at all frequencies are combined as indicated in Eqs. (24) and (25). Figs. 1(e), 1(f) and 1(g) show the combined IF plots obtained for STFT, NBPF and SFF, respectively. Since the IF is subtracted from the center frequency for all the frequencies, the combined IF deviates from zero. From the figures, it can be seen that the combined IF plots highlight the discontinuities clearly. The combined IF plots obtained from NBPF and SFF show sharp discontinuities at the instants of the impulse locations. It is worth noting that the relative amplitude patterns of the discontinuities in the combined IFs of NBPF and SFF are similar to the amplitude patterns of the aperiodic impulse sequence shown in Fig. 1(a).
Fig. 2 illustrates IFs obtained for a synthetic voiced speech segment (shown in Fig. 2(a)) generated by exciting a 10th order all-pole filter using the Liljencrants-Fant (LF) model (Fant, 1995) (shown in Fig. 2(b)). Figs. 2(c), 2(d) and 2(e) show the IFs computed at 500 Hz, using STFT, NBPF and SFF, respectively. In all these cases, it is difficult to distinguish discontinuities caused by the LF excitation, although they can be seen to some extent in the IF plots computed using NBPF. Figs. 2(f), 2(g) and 2(h) show the combined IF plots obtained using STFT, NBPF and SFF, respectively. For natural voiced speech, Figs. 3(a) and 3(b) show the segment of voiced speech and the corresponding differentiated electroglottography (dEGG) signal. The negative peaks in the dEGG signal correspond to the glottal closure instants (GCIs) or epochs. Figs. 3(c), 3(d) and 3(e) show the IFs computed at 500 Hz using STFT, NBPF and SFF, respectively. In all these cases, it is difficult to see discontinuities that are associated with the negative peaks of the dEGG signal. Figs. 3(f), 3(g) and 3(h) show the combined IF plots obtained using STFT, NBPF and SFF, respectively. The combined IF plots in Figs. 3(f) and 3(g) do not show the discontinuities due to impulse-like excitations consistently. The combined IFs obtained using SFF (shown in Fig. 3(h)) show the discontinuities clearly at all the GCIs, and they match well with the GCIs in the dEGG signal (shown in Fig. 3(b)).

Extraction of impulse-like discontinuities from speech signals
In Section 3, it was observed that the combined IF obtained using the SFF method highlights the impulse-like discontinuity information. In this section, the SFF method is used for extraction of the impulse-like discontinuities from speech signals. In Fig. 3(h), it can be observed that even though the combined IF obtained using the SFF method highlights the impulse-like behavior, it contains fluctuations in other parts of the signal. To reduce such fluctuations, amplitude weighting of the IF is proposed. The amplitude-weighted IF of the $k$th frequency component (Eq. (26)) is obtained by weighting the IF of that component with its temporal amplitude $v_k[n]$, which is given in Eq. (20). By combining the amplitude-weighted IFs for all frequencies, the combined amplitude weighted IF, called AIF, is defined (Eq. (27)). Fig. 3(i) shows the AIF computed for the voiced speech segment of Fig. 3(a). From Fig. 3(i), it can be clearly seen that the fluctuations are reduced in the AIF compared to the unweighted IF shown in Fig. 3(h). The discontinuities in the AIF match well with the discontinuities in the dEGG signal shown in Fig. 3(i). In the next section, a method is proposed to detect epochs based on AIF.
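The idea of amplitude weighting can be sketched as follows, under several loudly stated assumptions. First, the per-component IF deviation from $\pi$ is estimated here with a simple first-difference phase estimator, which is a stand-in chosen for brevity and is not the DFT-based computation of Eq. (22). Second, the weighting is taken to be a plain product of the temporal amplitude and the deviation from $\pi$, summed over frequencies; the exact form and sign used in Eqs. (26) and (27) may differ. All function names are hypothetical.

```python
import numpy as np

def sff_components(s, freqs_hz, fs, r=0.995):
    """Complex SFF outputs (one row per frequency), pole at z = -r."""
    s = np.asarray(s, dtype=float)
    n = np.arange(len(s))
    omega_bar = np.pi - 2.0 * np.pi * np.asarray(freqs_hz, dtype=float) / fs
    x = s[None, :] * np.exp(-1j * omega_bar[:, None] * n[None, :])
    y = np.zeros_like(x)
    prev = np.zeros(len(omega_bar), dtype=complex)
    for i in range(len(s)):
        prev = -r * prev + x[:, i]
        y[:, i] = prev
    return y

def deviation_from_pi(y, tiny=1e-20):
    """First-difference estimate of (IF - pi) per component, in (-pi, pi].

    Stand-in estimator (assumption). Samples whose predecessor is zero get
    a deviation of 0, since the phase step is undefined there.
    """
    step = y[:, 1:] * np.conj(y[:, :-1])
    dev = np.where(np.abs(step) > tiny, np.angle(-step), 0.0)
    return np.concatenate([np.zeros((y.shape[0], 1)), dev], axis=1)

def amplitude_weighted_if(s, freqs_hz, fs, r=0.995):
    """Sum over frequencies of temporal amplitude times IF deviation (assumed form)."""
    y = sff_components(s, freqs_hz, fs, r)
    return np.sum(np.abs(y) * deviation_from_pi(y), axis=0)

# Two impulses; the weighted sum is ~0 wherever no new impulse arrives.
fs = 8000.0
s = np.zeros(200)
s[50] = 1.0
s[120] = 0.7
aif = amplitude_weighted_if(s, np.arange(100.0, 4000.0, 100.0), fs)
```

The weighting suppresses contributions from low-amplitude regions, where the phase (and hence the IF) is poorly defined, which is the motivation given in the text for the reduced fluctuations.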

Epoch extraction
Using the AIF defined in Eq. (27), an epoch extraction method is proposed, which is motivated by previous studies in epoch extraction, particularly zero frequency filtering (ZFF) (Murty and Yegnanarayana, 2008) and speech event detection using the linear prediction (LP) residual and mean based signal (SEDREAMS) (Drugman and Dutoit, 2009). In both of these epoch extraction methods, the impulse-like excitation characteristics of speech signals are captured by using a filtering operation, which yields an oscillatory signal that varies with the local pitch period. Both the ZFF and SEDREAMS methods have been shown to work well for clean speech (Drugman et al., 2012). However, these methods failed to detect epochs accurately from telephone quality speech because of the attenuated level of low-frequency components in telephone quality speech (Kadiri, 2019; Kadiri and Yegnanarayana, 2020). We examine the effectiveness of the AIF for epoch extraction, especially for telephone quality speech. An oscillatory signal that varies with the local pitch period is derived using the ZFF-based approach (Murty and Yegnanarayana, 2008). The AIF signal is passed through a cascade of two zero frequency resonators as given in Eq. (28). The precise location of the epoch is obtained by searching for the minimum of the AIF in the region between the minimum and maximum of the filtered signal, i.e., around each negative-to-positive zero crossing.
The proposed epoch extraction method based on the AIF, called the EAIF method, consists of the following steps:

1. The AIF signal in Eq. (27) is passed through a cascade of two ideal zero frequency resonators. The output of this filtering operation is given by (Murty and Yegnanarayana, 2008)

$$y_0[n] = -\sum_{k=1}^{4} a_k\,y_0[n-k] + x[n], \tag{28}$$

where $x[n]$ denotes the AIF signal, and $a_1 = -4$, $a_2 = 6$, $a_3 = -4$ and $a_4 = 1$. This filtering is equivalent to four successive cumulative sums in the discrete-time domain. The resulting signal $y_0[n]$ grows or decays approximately as a polynomial function of time.

2. The trend in $y_0[n]$ is removed by subtracting the local mean computed over the average pitch period (Murty and Yegnanarayana, 2008; Drugman and Alwan, 2011) at each sample as follows:

$$y[n] = y_0[n] - \frac{1}{2N+1}\sum_{m=-N}^{N} y_0[n+m]. \tag{29}$$

Here, $2N + 1$ corresponds to the number of samples in the window used for trend removal.
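Steps 1 and 2 can be sketched as below. Since the denominator polynomial with coefficients $(1, -4, 6, -4, 1)$ equals $(1 - z^{-1})^4$, the cascade of two zero frequency resonators is implemented here as four cumulative sums. The function names are illustrative assumptions, and the zero-padded moving average at the signal edges is a simplification that biases the first and last `half_win` samples.

```python
import numpy as np

def zero_frequency_filter(x):
    """Cascade of two ideal zero frequency resonators.

    Equivalent to y0[n] = x[n] + 4*y0[n-1] - 6*y0[n-2] + 4*y0[n-3] - y0[n-4],
    realized as four successive cumulative sums.
    """
    y0 = np.asarray(x, dtype=float)
    for _ in range(4):
        y0 = np.cumsum(y0)
    return y0

def remove_trend(y0, half_win):
    """Subtract the local mean over 2*half_win + 1 samples (edges zero-padded)."""
    y0 = np.asarray(y0, dtype=float)
    length = 2 * half_win + 1
    kernel = np.ones(length) / length
    return y0 - np.convolve(y0, kernel, mode="same")

# Example on a short integer-valued input.
rng = np.random.default_rng(1)
x = rng.integers(-5, 6, 200).astype(float)
y0 = zero_frequency_filter(x)
y = remove_trend(y0, half_win=20)
```

In practice `half_win` would be chosen so that the window spans roughly the average pitch period, as stated in step 2; the value 20 above is arbitrary.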

Performance evaluation
Performance of the proposed epoch extraction method was evaluated using data from three databases, which contain both speech signals and the simultaneously recorded EGG waveforms (Kominek and Black, 2004). The data in these databases was collected in clean laboratory conditions. In order to test the epoch extraction method for telephone quality speech, the clean speech data were encoded using a low bit-rate speech codec (ITU-T, Recommendation, 2005). Epoch extraction was then carried out on both the clean signal and its coded version. The dEGG signals were used to obtain the ground truth. The results of the proposed method were compared with four state-of-the-art epoch extraction methods using the evaluation metrics defined in Naylor et al. (2007).

Experimental protocol
This section describes the databases, epoch extraction methods for comparison and evaluation metrics.

Databases
The epoch extraction methods were evaluated using speech and simultaneously recorded EGG waveforms of five speakers obtained from three databases. The data of the first three speakers were taken from the CMU ARCTIC database (Kominek and Black, 2004). The speech data of each speaker (BDL (male), JMK (male) and SLT (female)) consists of around 1132 phonetically balanced English sentences. The data of the fourth speaker (KED (male)) was taken from the KED TIMIT database. This data consists of 453 sentences. The data of the fifth speaker (RAB (male)) was taken from the RAB database. The data consists of a set of 1946 nonsense words containing phone-phone transitions in English. All these databases are available on the Festvox webpage http://festvox.org/cmu_arctic/index.html. The speech and EGG signals were aligned by compensating the larynx-to-microphone delay (around 0.9 ms for BDL, JMK, SLT, 0.6 ms for KED, and 2.3 ms for RAB). Reference epoch locations were identified as the locations of the negative peaks in the dEGG signal. Telephone quality speech was simulated by processing the speech signals taken from the three databases with a narrowband codec, which was implemented using the G.191 software (ITU-T, Recommendation, 2005).

Epoch extraction methods for comparison
The following four epoch extraction methods were used for comparison with the proposed EAIF approach: zero frequency filtering (ZFF) (Murty and Yegnanarayana, 2008), speech event detection using LP residual and mean based signal (SEDREAMS) (Drugman et al., 2012), most singular manifold (MSM) (Khanagha et al., 2014) and yet another GCI algorithm (YAGA) (Thomas et al., 2012). It is to be noted that in the proposed EAIF method, the AIF signal is passed through a cascade of two zero frequency resonators, as opposed to passing the speech signal as in the original ZFF method (Murty and Yegnanarayana, 2008).

Evaluation metrics
Performance of the epoch extraction methods was evaluated using five widely used measures described in Naylor et al. (2007). These five evaluation metrics are the identification rate (IDR1), miss rate (MR), false alarm rate (FAR), identification accuracy (IDA) and identification rate within ±0.25 ms (IDR2). The first three measures are called reliability measures, and the other two are called accuracy measures.
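The five measures can be sketched as follows, using the usual larynx-cycle definitions from Naylor et al. (2007): each reference epoch defines a cycle bounded by the midpoints to its neighbors, a cycle with exactly one detection is an identification, a cycle with none is a miss, and a cycle with more than one is a false alarm. The function name, the dictionary keys, the exclusion of the first and last reference epochs, and the use of identified cycles as the denominator for IDR2 are assumptions made for this illustration.

```python
import numpy as np

def epoch_metrics(ref, est, fs, tol_ms=0.25):
    """Reliability and accuracy measures for epoch locations given in samples."""
    ref = np.sort(np.asarray(ref, dtype=float))
    est = np.sort(np.asarray(est, dtype=float))
    mid = 0.5 * (ref[:-1] + ref[1:])          # boundaries between larynx cycles
    hits, misses, false_alarms, errors_ms = 0, 0, 0, []
    for i in range(1, len(ref) - 1):          # interior reference epochs only
        inside = est[(est >= mid[i - 1]) & (est < mid[i])]
        if inside.size == 1:
            hits += 1
            errors_ms.append((inside[0] - ref[i]) / fs * 1e3)
        elif inside.size == 0:
            misses += 1
        else:
            false_alarms += 1
    cycles = max(len(ref) - 2, 1)
    err = np.asarray(errors_ms)
    return {
        "IDR1": 100.0 * hits / cycles,
        "MR": 100.0 * misses / cycles,
        "FAR": 100.0 * false_alarms / cycles,
        "IDA": float(np.std(err)) if err.size else float("nan"),
        "IDR2": 100.0 * float(np.mean(np.abs(err) <= tol_ms)) if err.size else float("nan"),
    }

# Three interior cycles: one identification, one false alarm, one miss.
metrics = epoch_metrics([100, 200, 300, 400, 500], [201, 290, 310, 505], fs=8000.0)
```

In this toy example the single identified epoch is off by one sample at 8 kHz (0.125 ms), so it counts as accurate within the ±0.25 ms tolerance.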

Results and discussion
All the epoch extraction methods were evaluated against the ground truth epoch information provided by the EGG. The average performance of epoch extraction across all the five speakers (BDL, JMK, SLT, KED, and RAB) is shown for clean and telephone quality speech in Table 1. Tables 2 and 3 show the results for each speaker for clean and telephone quality speech, respectively.
For clean speech (Tables 1 and 2), the results show that the performance of the proposed EAIF method is comparable to or better than that of the existing methods in both reliability (i.e., in IDR1, MR, and FAR) and accuracy (i.e., in IDA and IDR2). Among the reference methods, ZFF and SEDREAMS are the most reliable, and YAGA is the most accurate for clean speech.
For telephone quality speech (Table 3), the results indicate that the performance of the existing methods is severely affected for all the speakers compared to clean speech (Table 2).It can be observed that even though the performance of the MSM method is lower compared to the other methods (ZFF, SEDREAMS, and YAGA) in clean speech, the performance of MSM is higher for telephone quality speech for all the speakers.On the other hand, the performance of the EAIF method exceeds the performance of all the other methods for all the speakers in both reliability and accuracy.Overall, the results of the proposed EAIF method are comparable with those obtained using the state-of-the-art epoch extraction methods for clean speech, and better for telephone quality speech.
The key findings of this study are as follows:
• The IF derived using SFF analysis for decomposition of the speech signal into individual frequency components highlights the discontinuities corresponding to the GCIs/epochs in the speech signal.
• The IF derived using the SFF-based decomposition is shown to highlight the discontinuities better than the STFT and NBPF based decomposition methods.
• The amplitude weighted IF (AIF) is used to develop a method for epoch extraction from clean and telephone quality speech.
• The proposed EAIF epoch extraction method is shown to perform better than four known state-of-the-art methods, in terms of reliability and accuracy, especially for telephone quality speech.

Summary and conclusion
In this paper, the excitation information of speech was extracted using the phase component, by highlighting the discontinuities at epochs. The impulse-like characteristics in the speech signal were derived from the IF of the filtered signal at each frequency, obtained by STFT, NBPF and SFF. The sum of the IFs of all the filtered signals shows discontinuities at the locations of the impulses in the excitation. It was observed that the combined IF obtained from the SFF method highlighted the impulse-like discontinuity in the speech signals better compared to the STFT and NBPF methods. The impulse-like discontinuity was highlighted further by amplitude weighting the IF, resulting in the combined amplitude weighted IF, the AIF. An epoch extraction method, called EAIF, was proposed based on the AIF. The performance of the proposed EAIF method was compared with four state-of-the-art epoch extraction methods for clean and telephone quality speech. The performance of EAIF was comparable to the existing methods for clean speech, and better for telephone quality speech, both in terms of reliability and accuracy. The performance improvement is due to the following two features. First, the EAIF method exploits the impulse-like discontinuity in IF, which improves accuracy. Second, the filtered signals derived from the AIF oscillate with the local pitch period, yielding improved reliability. The impact of noise on IF computation is a topic for future studies.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data is publicly available: http://festvox.org/cmu_arctic/index.html

Fig. 1(a) shows a multicomponent synthetic signal, which is an aperiodic sequence of impulses with arbitrary amplitude values. Figs. 1(b), 1(c) and 1(d) show the IFs computed at 500 Hz using STFT, NBPF and SFF, respectively. The parameters used are: f_s = 8 kHz and r = 0.995. The filtered signals are obtained for every 5 Hz for both NBPF and

Fig. 1. Illustration of the IF for an aperiodic sequence of impulses. (a) An aperiodic sequence of impulses with arbitrary amplitude values. (b) IF of the STFT output at 500 Hz. (c) IF of the NBPF output at 500 Hz. (d) IF of the SFF output at 500 Hz. (e) Combined IF of the STFT output. (f) Combined IF of the NBPF output. (g) Combined IF of the SFF output.
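
The IF of a single SFF component, as in panel (d), can be illustrated with the following Python sketch. It assumes the commonly used SFF formulation, in which the signal is frequency shifted so that the analysis frequency falls at half the sampling frequency and is then passed through a single-pole filter with its root at z = -r. The function names, the sign convention of the shift, and the IF estimator (the phase increment between successive samples after removing the carrier) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np
from scipy.signal import lfilter


def sff_component(x, fs, fk, r=0.995):
    """Complex SFF output of x at the analysis frequency fk (Hz)."""
    n = np.arange(len(x))
    # Shift the component at fk up to fs/2 (assumed sign convention).
    wk_bar = np.pi - 2.0 * np.pi * fk / fs
    shifted = x * np.exp(1j * wk_bar * n)
    # Single-pole filter y[n] = -r * y[n-1] + shifted[n], pole at z = -r.
    return lfilter([1.0], [1.0, r], shifted)


def sff_if(x, fs, fk, r=0.995):
    """IF estimate (Hz) of the SFF component at fk, same length as x."""
    y = sff_component(x, fs, fk, r)
    n = np.arange(len(x))
    # Remove the fs/2 carrier so that phase increments measure deviation from fk.
    baseband = y * np.exp(-1j * np.pi * n)
    prod = baseband[1:] * np.conj(baseband[:-1])
    # The phase increment is undefined where the output is exactly zero; use 0 there.
    incr = np.where(np.abs(prod) > 0, np.angle(prod), 0.0)
    incr = np.concatenate(([0.0], incr))
    return fk + incr * fs / (2.0 * np.pi)


if __name__ == "__main__":
    fs = 8000.0
    x = np.zeros(400)
    x[[100, 237, 310]] = [1.0, 0.6, 0.9]  # aperiodic impulses, arbitrary amplitudes
    inst_freq = sff_if(x, fs, fk=500.0)
    # Samples where the IF departs from the analysis frequency.
    print(np.flatnonzero(np.abs(inst_freq - 500.0) > 1.0))
```

In this sketch the IF stays at the analysis frequency between impulses and departs from it only where a new impulse arrives, which is the kind of discontinuity the figure illustrates.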

Fig. 2. Illustration of the IF for synthetic voiced speech. (a) A segment of synthetic voiced speech generated using the LF model for excitation. (b) LF excitation. (c) IF of the STFT output at 500 Hz. (d) IF of the NBPF output at 500 Hz. (e) IF of the SFF output at 500 Hz. (f) Combined IF of the STFT output. (g) Combined IF of the NBPF output. (h) Combined IF of the SFF output.
combined IF plots obtained by STFT, NBPF and SFF. From the figures, it can be observed that the combined IFs highlight the discontinuity information clearly. An illustration of IFs for a natural voiced speech signal is shown in Fig. 3. Figs. 3(a) and 3(b)

Fig. 3. Illustration of the IF for a segment of natural voiced speech. (a) A segment of a voiced speech signal. (b) Differentiated EGG (dEGG) signal. (c) IF of the STFT output at 500 Hz. (d) IF of the NBPF output at 500 Hz. (e) IF of the SFF output at 500 Hz. (f) Combined IF of the STFT output. (g) Combined IF of the NBPF output. (h) Combined IF of the SFF output. (i) Combined weighted IF of the SFF output, with the dEGG of Fig. 3(b) superimposed.
3. The region (interval) from the minimum to the maximum of the filtered signal around each negative-to-positive zero crossing is hypothesized as the region of the epoch.
4. The location of the minimum of the AIF signal in the hypothesized region/interval is marked as the epoch or GCI.

The steps involved in the proposed EAIF method to extract epochs are illustrated in Fig. 4. Fig. 4(a) shows a segment of voiced speech and Fig. 4(b) shows the corresponding AIF. Fig. 4(c) shows the filtered signal obtained from the AIF signal in Fig. 4(b). Fig. 4(d) shows the intervals derived from the filtered signal where epochs are present (regions from the minimum to the maximum around each negative-to-positive zero crossing of the filtered signal). The precise epoch locations are estimated by finding the minimum of the AIF signal shown in Fig. 4(b) within the intervals of epoch presence shown in Fig. 4(d). Fig. 4(e) shows the estimated epoch locations, which are marked by downward arrows (in red), along with the reference epochs shown by the dEGG signal. It can be clearly seen that the estimated epochs match well with the reference epochs indicated by the negative peaks of the dEGG signal.
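
Steps 3 and 4 can be sketched in Python as follows. The sketch takes the AIF and the filtered signal as precomputed input arrays of equal length. The function name `pick_epochs` and the sample-by-sample search for the minimum and maximum surrounding each zero crossing are assumptions of this illustration, not the authors' implementation.

```python
import numpy as np


def pick_epochs(aif, filt):
    """Return epoch sample indices from the AIF and its filtered signal."""
    aif = np.asarray(aif, dtype=float)
    filt = np.asarray(filt, dtype=float)
    # Negative-to-positive zero crossings of the filtered signal.
    crossings = np.flatnonzero((filt[:-1] < 0) & (filt[1:] >= 0))
    epochs = []
    for zc in crossings:
        # Step 3: walk back to the preceding minimum of the filtered signal ...
        lo = zc
        while lo > 0 and filt[lo - 1] < filt[lo]:
            lo -= 1
        # ... and forward to the following maximum.
        hi = zc + 1
        while hi < len(filt) - 1 and filt[hi + 1] > filt[hi]:
            hi += 1
        # Step 4: the minimum of the AIF inside [lo, hi] is marked as the epoch.
        epochs.append(lo + int(np.argmin(aif[lo:hi + 1])))
    return np.array(epochs, dtype=int)


if __name__ == "__main__":
    n = np.arange(240)
    # Toy filtered signal with an 80-sample period (stand-in for the pitch period).
    filt = np.sin(2 * np.pi * (n - 40.5) / 80.0)
    aif = np.zeros(240)
    aif[[45, 125, 205]] = -1.0  # sharp negative peaks near the zero crossings
    aif[90] = -5.0              # deeper peak outside every hypothesized interval
    print(pick_epochs(aif, filt))
```

Restricting the search to the hypothesized intervals is what keeps the deeper spurious minimum at sample 90 in the toy example from being selected.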

Fig. 4. Illustration of epoch extraction using the proposed EAIF method for a segment of voiced speech. (a) A segment of voiced speech. (b) AIF signal. (c) Filtered signal derived from (b). (d) Intervals derived from the filtered signal. (e) Differentiated EGG (dEGG) signal along with the estimated epoch locations marked by downward arrows (in red).

Table 1
Performance comparison of the epoch extraction methods averaged across all speakers for clean speech and telephone quality speech. IDR1: identification rate, MR: miss rate, FAR: false alarm rate, IDA: identification accuracy in ms, IDR2: identification rate within ±0.25 ms.

Table 2
Performance comparison of the epoch extraction methods for clean speech. IDR1: identification rate, MR: miss rate, FAR: false alarm rate, IDA: identification accuracy in ms, IDR2: identification rate within ±0.25 ms. The better the performance of an epoch extraction method, the higher are the values of IDR1 and IDR2, and the lower are the values of MR, FAR and IDA.
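
The measures listed in the caption can be computed along the lines of the following Python sketch. It assumes the widely used larynx-cycle definitions: each reference epoch owns the interval reaching halfway to its neighbouring reference epochs; a cycle with exactly one detection counts as identified, a cycle with none as a miss, and a cycle with more than one as a false alarm. IDA is taken as the standard deviation of the timing error over identified cycles, and IDR2 as the percentage of those errors within ±0.25 ms. The function name, the handling of the first and last cycles, and the denominator used for IDR2 are assumptions of this example and may differ from the evaluation used in the paper.

```python
import numpy as np


def epoch_metrics(ref, est, tol=0.25e-3):
    """Epoch detection measures from reference and estimated epochs (seconds)."""
    ref = np.sort(np.asarray(ref, dtype=float))
    est = np.sort(np.asarray(est, dtype=float))
    # Cycle boundaries: midpoints between reference epochs; the first and
    # last cycles are extended symmetrically (an assumption of this sketch).
    mid = (ref[:-1] + ref[1:]) / 2.0
    edges = np.concatenate(([2 * ref[0] - mid[0]], mid, [2 * ref[-1] - mid[-1]]))
    # Index of the larynx cycle that each estimated epoch falls into.
    idx = np.searchsorted(edges, est, side="right") - 1
    valid = (idx >= 0) & (idx < len(ref))
    counts = np.bincount(idx[valid], minlength=len(ref))
    identified = counts == 1
    # Timing errors for estimates that are the only detection in their cycle.
    sel = valid & identified[np.clip(idx, 0, len(ref) - 1)]
    err = est[sel] - ref[idx[sel]]
    return {
        "IDR1": 100.0 * identified.mean(),
        "MR": 100.0 * (counts == 0).mean(),
        "FAR": 100.0 * (counts > 1).mean(),
        "IDA": 1e3 * err.std() if err.size else float("nan"),  # in ms
        "IDR2": 100.0 * (np.abs(err) <= tol).mean() if err.size else float("nan"),
    }


if __name__ == "__main__":
    ref = [0.010, 0.020, 0.030, 0.040]
    est = [0.0101, 0.0205, 0.033, 0.034]  # one miss and one double detection
    print(epoch_metrics(ref, est))
```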

Table 3
Performance comparison of the epoch extraction methods for telephone quality speech. IDR1: identification rate, MR: miss rate, FAR: false alarm rate, IDA: identification accuracy in ms, IDR2: identification rate within ±0.25 ms.