Detection of extra pulses in synthesized glottal area waveforms of dysphonic voices

Background and objectives: The description of production kinematics of dysphonic voices plays an important role in the clinical care of voice disorders. However, high-speed videolaryngoscopy is not routinely used in clinical practice, partly because diagnostic markers that can be obtained automatically from high-speed videos are lacking. The aim of the study is to propose and test a procedure that automatically detects extra pulses, which may occur in voiced source signals of pathological voices in addition to cyclic pulses.
Material and methods: Glottal area waveforms (GAWs) are synthesized and used to test a detector for extra pulses. Regarding synthesis, for each GAW a cyclic pulse train is mixed with an extra pulse train and additive noise. The cyclic pulse trains are varied across GAWs in terms of fundamental frequency, pulse shape, and modulation noise, i.e., jitter and shimmer. The extra pulse trains are varied across GAWs in terms of the height of the extra pulses and their rates of occurrence. The energy level of the additive noise is also varied. Regarding detection, first, the fundamental frequency is estimated jointly with the cyclic pulse train waveform; second, the modulation noise is estimated; and finally, the extra pulse train waveform is estimated. Two versions of the detector are compared, i.e., one that parameterizes the shapes of the cyclic pulses, and one that uses unparameterized pulse shape estimates. Two corpora are used for testing, i.e., one with 100 GAWs containing random extra pulses, and one with 25 GAWs containing an extra pulse in the closed phase of each glottal cycle, representing subharmonic voices.
Results and discussion: With pulse shape parameterization (PSP), a maximum mean accuracy of 88.3% is achieved when detecting random extra pulses. Without PSP, the maximum mean accuracy reduces to 82.9%. Detection performance decreases if the energy level of the additive noise is higher than −25 dB with respect to the energy of the cyclic pulse train, and if the irregularity strength exceeds 0.1. For bicyclic, i.e., subharmonic voices, the approach fails without PSP, whereas with PSP a mean sensitivity of 87.4% is achieved.
Conclusion: A synthesizer for GAWs containing extra pulses and a detector for extra pulses are proposed. With PSP, favorable detector performance is observed for moderate levels of additive noise and moderate irregularity strengths. In signals with high noise levels, the detector without PSP outperforms the detector with PSP. Detection of extra pulses fails if the irregularity strength is large. For subharmonic voices, PSP must be used.


Introduction
The description of voice production kinematics plays an important role in the clinical care of dysphonic voices, because it aids the indication, selection, evaluation, and optimization of clinical treatment techniques. In clinical routine, voice production kinematics are primarily assessed by means of stroboscopic imaging of the vocal fold vibration [1,2]. However, due to the limitation of the stroboscopic method, many abnormal phenomena in vocal fold vibration may be disguised. For example, one needs to assume in stroboscopy that intercyclic variation of phonation pulses is small, because the behaviour of stroboscopy with large inter-cyclic variation depends on many unexplored factors and is thus hardly predictable. In other words, a sequence of phonation pulses with similar shapes is required to produce a smooth stroboscopic video. This limitation relates to the well-established Nyquist-Shannon sampling theorem that requires a sampling frequency higher than twice the highest frequency of the signal [3]. Currently, stroboscopy is often used beyond this limitation, although high-speed videolaryngoscopy and kymographic imaging are capable of imaging subsequent pulses with different shapes.
The pathophysiological process of extra pulsing is explained as follows. Extra pulsing may be caused by (slight) desynchronization of the anterior and posterior parts of the vocal folds. This vibration mode can be understood as an intermediate stage between modal phonation and biphonation / diplophonia. In extra pulsing, one cyclic oscillator is dominant in terms of amplitude, while the other one intermittently fires between the pulses of the dominant oscillator without (yet) being strong enough to generate a distinct second vibration frequency. In the extreme case of extra pulsing, known as double pulsing or alternating pulses, an extra pulse occurs in every quasi-closed phase of the cyclic pulses.
The occurrence of extra pulses in dysphonic voices is interesting from several viewpoints. First, the prevalence of such extra pulses in dysphonic voices is unknown, most likely because (1) stroboscopic imaging does not suffice to find extra pulses, and (2) it is labour intensive to manually search out extra pulses in high-speed videos or kymograms if lots of data needs to be analysed. Thus, extra pulses may often be overlooked in clinical practice.
With regard to the representation of extra pulses in kymographic imaging, it is necessary to distinguish between videokymography (VKG) and digital kymography (DKG) [4]. In VKG, kymographic images are created in real time during endoscopic examination. Kymographic images of a chosen length are shown and updated at a rate reciprocal to their length. Usually, a length of 40 ms is chosen, which results in an update rate of 25 Hz. If random extra pulses occur, they are visible for 40 ms only and are thus hardly detectable visually. If extra pulses occur in a structured way, e.g., as approximately equally shaped extra pulses in every cycle, they can be seen easily. In DKG, kymographic images are created after recording. Given a vocal frequency of, e.g., 100 or 200 Hz, a 2 s phonatory segment includes 200 or 400 cycles. One needs to search visually for extra pulses that may occur randomly in any of these cycles, which is a tedious endeavour.
Second, extra pulses substantially disturb the harmonic spectrum of the voice sound, so a significant auditory impact is expected from adding extra pulses to the cyclic pulse train of normal phonation. However, not much is known regarding the auditory attributes that a listener may assign to a voice sample containing extra pulses. Thus, extra pulses may often go unnoticed by ear in clinical practice. In a past case study, the concept of "tonal raspiness" was proposed, which accounts for the pitch / tonality that is provoked by the cyclic pulse train, and the raspy component that is provoked by the extra pulse train [5]. This perspective agrees with Bregman's well-established theory of auditory stream segregation [6]. We hypothesize that auditory raspiness is provoked by frequently occurring extra pulses, while infrequently occurring extra pulses provoke auditory crackling [7]. Once a synthesizer for voices with extra pulses is available, the auditory impact of extra pulses on the voice sound can be investigated.
Third, the occurrence of extra pulses is likely to be triggered by mechanical and aerodynamic properties of the vocal folds and the phonatory process. These properties may be subject to clinical treatment (logopedic or surgical), so it appears plausible that treatment can be better targeted in cases in which extra pulses have been identified. Finally, from a signal processing perspective, the proposals that we make may also be applicable in the future to other types of signals in which a cyclic pulse train is mixed with a random extra pulse train.
To the best of our knowledge, we present the first attempt towards automatic detection of random extra glottal pulses that may occur during quasi-closed phases of the normally occurring cyclic pulse train. A limitation of laryngeal high-speed videoendoscopy and kymography is that a lot of manual post-processing is required before a diagnostic marker can be displayed to a medical doctor, which impedes clinical acceptability of the approach. Thus, we explore here an approach towards the automatic appraisal of glottal area waveforms (GAWs), which is intended to decrease the amount of manual labour required to obtain an underexplored diagnostic marker, i.e., a marker indicating the presence of extra pulses. We propose and test a method for the detection of extra pulses that occur during quasi-closed phases of random glottal cycles. The aim of this study is to further test and improve the detector that was proposed in the past [5].
The remainder of this article is structured as follows. In Section 2, we present related work. In Sections 3.1 and 3.2, the synthesis of the GAWs is explained. In Section 3.3, the detector architecture is explained. A simple version of the detector is compared to an advanced version that uses PSP for the estimation of the cyclic pulse train component. In Section 4, results regarding the detector performance are presented for different levels of additive noise and irregularity strengths, i.e., the detector is tested for robustness. Favorable performance is observed with PSP for low levels of additive noise and small irregularity. In signals with high noise levels, the detector without PSP outperforms the detector with PSP. For bicyclic signals / bigeminism / subharmonics, PSP must be used. Detection of extra pulses fails for strongly irregular signals. In Section 5, conclusions are drawn and advice for the practical use of the detector is given.

Related work
In [8] several criteria to visually judge kymographic images of vocal fold vibration are presented. Examples for "cycle aberrations" are depicted in Fig. 7 of [8]. So-called "ripples" and "doubled medial peaks" are depicted in kymograms B and D. These descriptive attributes correspond to extra pulses that occur during the open phase of the phonatory cycle. In the depicted examples, ripples and double medial peaks occur regularly in each phonatory cycle. Also, a concept "large cycle-to-cycle variability" was used. Both "cycle aberrations" and "large cycle-to-cycle variability" are superordinate concepts to the extra pulses that we are investigating.
Fraj et al. [9] developed a synthesizer for pathological voices that uses a nonlinear waveshaping model of the glottal area. The Klatt concatenated-curve model is used as a glottal area template [10], and modulation noise is simulated via polynomial distortion. The instantaneous frequency and a harmonic driving function are control parameters of the synthesizer. These parameters enable control of the pitch, amplitude, harmonic richness, open quotient, and irregularity by means of modulation noise. Regarding cycle length modulation noise, jitter and tremor are distinguished. Jitter is simulated as a two-point stochastic process added to the instantaneous phase on a sample-by-sample basis. Tremor is simulated as band-pass filtered white Gaussian noise that is further added to the instantaneous phase. Amplitude modulation noise, i.e., shimmer, is only contained in the speech signal and not in the GAW. It results from vocal tract filtering of the source signal that contains jitter and tremor. It was shown that this synthesizer is capable of producing natural-sounding samples of dysphonic voices.
As a complement to the work by Fraj et al., we propose to control and estimate cycle length and amplitude modulation noise via the modulation of individual pulses' timings and heights at cycle-synchronous supporting points. This enables control and estimation of the modulation noise on a cycle-by-cycle basis instead of on a sample-by-sample basis. The advantages of our approach compared to Fraj et al. are the following. First, our jitter is not a two-point process. Instead, the pulses of the cyclic pulse train may be anticipated or delayed by an arbitrary amount, and pulse shapes are time-warped accordingly to retain a smooth instantaneous phase. Second, our approach enables the estimation of modulation noise time series from observed signals. Finally, the bandwidth of our jitter does not depend on the sampling frequency.
Chen et al. [11] proposed a voice source model that models pulses of GAWs observed in three male and three female healthy subjects with high-speed videolaryngoscopy. We use this pulse shape model in our work for the synthesis of GAWs, and also for PSP in the estimation of the cyclic pulse train. The model has five parameters, i.e., the cycle length, the open quotient, the asymmetry coefficient, accounting for differences of the opening and closing phases' durations, and two additional shape parameters for the opening and closing phases, i.e., one steepness parameter for each of the phases. The steepness parameters can be understood as the speed of the opening and closing phases.
Ikuma et al. [12,13] proposed a model for GAWs of pathological vocal fold vibration which is similar to ours. They model GAWs as a sum of a harmonic signal, a deterministic nonharmonic signal, and a random nonharmonic signal. Their harmonic signal is obtained from a Fourier synthesizer, their deterministic nonharmonic signal is a sum of sinusoids whose frequencies are not harmonically related, and their random nonharmonic signal is zero-mean white Gaussian noise. It would be inefficient to model extra pulses with Ikuma et al.'s model, because extra pulses are neither synthesizable with a reasonably small number of nonharmonic sinusoids, nor are they zero-mean white Gaussian.
Randomly triggered extra pulses during quasi-closed phases of cyclic glottal pulses were observed in the past in a clinical case study of a dysphonic voice that sounded tonal and raspy [5]. A prototype for the detector was proposed, which identified correctly six observed extra pulses, and only one false alarm occurred. In this work, we further improve and test the detector that was proposed in the past.

Materials and methods
This section explains the synthesis of the GAWs, the detection of the extra pulses, as well as the performance measures and statistical analysis.

Synthesis of glottal area waveforms with random extra pulses
One hundred GAWs are synthesized at a sampling frequency $f_s = 48$ kHz with a length of 0.3 s. The synthesis of the GAWs involves the synthesis of the cyclic pulse train $d_1(n)$ and the synthesis of the extra pulse train $d_2(n)$, where $n$ is the discrete time index. The synthesized GAW is $d'(n) = d_1(n) + d_2(n) + \eta(n)$, where $\eta(n)$ is zero-mean white Gaussian noise. This signal model is adapted from [5]; in particular, control parameters are made explicit here. Fig. 1 shows the overview block diagram of the synthesizer. The fundamental frequency $f_0$, the irregularity strength $Irr$, and the pulse shape parameters $\Psi$ are input to the cyclic pulse train generator, which outputs the cyclic pulse train $d_1(n)$, the instantaneous phase $\Theta(n)$, and the pulse shape $r(l)$, where $l$ is the cycle-relative discrete time index. The instantaneous phase $\Theta(n)$, the pulse shape $r(l)$, the extra pulse rate $\rho$, and the extra pulse height $h$ are input to the extra pulse train generator. The root mean square (RMS) energy level of the zero-mean white Gaussian noise $\eta(n)$ is $H = 20 \cdot \log_{10}\left(\sqrt{\overline{\eta^2(n)}} \,/\, \sqrt{\overline{d_1^2(n)}}\right)$. It is relative to the RMS energy level of the cyclic pulse train $d_1(n)$ and given in dB.
Fig. 2 shows the block diagram of the cyclic pulse train generator. The cyclic pulse train $d_1(n)$ is obtained as follows. First, the instantaneous phase $\Theta(n)$ is obtained. To this end, the pulse times are $n_p(\mu) = \mu \cdot N_0 + j(\mu)$, where $\mu \in \mathbb{Z}$ is the pulse index, $N_0 = f_s / f_0$ is the cycle length in samples, and $j(\mu)$ is the time shift of the $\mu$th pulse. The cycle length modulation noise, i.e., jitter, is drawn from a Gaussian distribution, i.e., $j(\mu) \sim \mathcal{N}(0, Irr \cdot N_0)$, where $Irr$ is the irregularity strength and $\mathcal{N}(\mu, \sigma)$ denotes a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ (here $\mu$ denotes the mean, not the pulse index). The instantaneous phase at pulse locations is $\Theta(n = n_p(\mu)) = \pi \cdot \sum_{\mu \in \mathbb{Z}} [2 \cdot \mu + 1]$, and is obtained between pulse locations via spline interpolation.
Second, the amplitude modulation function $A(n)$ is obtained at pulse locations as $A(n = n_p(\mu)) = s(\mu)$, where $s(\mu)$ is the amplitude modulation noise, i.e., shimmer, which is drawn from a Gaussian distribution $\mathcal{N}(1, Irr)$. Between pulse locations, $A(n)$ is obtained by shape-preserving cubic interpolation. Third, a pulse shape $r(l)$ is obtained with a Chen pulse generator [11]. Fig. 3 shows an example of a pulse shape. The real-part and imaginary-part Fourier coefficients $a_p$ and $b_p$ are obtained by discrete Fourier transformation (DFT) of the pulse shape $r(l)$, where $p$ is the partial index. Fourth, the cyclic pulse train $d_1(n)$ is obtained via Fourier synthesis taking $a_p$, $b_p$, and $\Theta(n)$ as inputs; the output $d'_1(n)$ of the Fourier synthesizer is multiplied by $A(n)$ to obtain $d_1(n)$ (Fig. 2).
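The four synthesis steps can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a raised-cosine pulse stands in for the Chen pulse model [11], piecewise-linear interpolation is swapped in for both the spline and the shape-preserving cubic interpolation, the phase at the pulse with index μ is taken to be π(2μ + 1), and the sign and scaling conventions of the Fourier coefficients are chosen so that a linear phase reproduces the pulse shape. All function names are hypothetical.

```python
import bisect
import math
import random


def stand_in_pulse(n, open_quotient=0.6):
    """Raised-cosine pulse centred in the cycle (a placeholder, NOT the Chen model)."""
    half = 0.5 * open_quotient * n
    centre = n / 2.0
    return [0.5 * (1.0 + math.cos(math.pi * (l - centre) / half))
            if abs(l - centre) < half else 0.0
            for l in range(n)]


def fourier_coefficients(r):
    """DC term and cosine/sine coefficients a_p, b_p of one cycle r(l)."""
    n = len(r)
    partials = range(1, (n - 1) // 2 + 1)
    a0 = sum(r) / n
    a = [2.0 / n * sum(v * math.cos(2 * math.pi * p * l / n) for l, v in enumerate(r))
         for p in partials]
    b = [2.0 / n * sum(v * math.sin(2 * math.pi * p * l / n) for l, v in enumerate(r))
         for p in partials]
    return a0, a, b


def fourier_synth(theta, a0, a, b):
    """Evaluate the pulse shape at instantaneous phase theta (radians)."""
    return a0 + sum(ap * math.cos(p * theta) + bp * math.sin(p * theta)
                    for p, (ap, bp) in enumerate(zip(a, b), start=1))


def interp(x, xs, ys):
    """Piecewise-linear interpolation; assumes xs is sorted (small jitter)."""
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), len(xs) - 2)
    w = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + w * (ys[i + 1] - ys[i])


def cyclic_pulse_train(r, f0, fs, n_samples, irr, rng):
    """d1(n) = A(n) * FourierSynth(Theta(n)) with Gaussian jitter and shimmer."""
    n0 = fs / f0                                   # cycle length in samples
    mus = list(range(-1, int(n_samples / n0) + 2))  # pulses covering the signal
    times = [mu * n0 + rng.gauss(0.0, irr * n0) for mu in mus]   # jittered pulse times
    phases = [math.pi * (2 * mu + 1) for mu in mus]              # assumed phase values
    shimmer = [rng.gauss(1.0, irr) for _ in mus]                 # amplitude per pulse
    a0, a, b = fourier_coefficients(r)
    return [interp(n, times, shimmer) * fourier_synth(interp(n, times, phases), a0, a, b)
            for n in range(n_samples)]


pulse = stand_in_pulse(41)
d1 = cyclic_pulse_train(pulse, f0=200.0, fs=48000.0, n_samples=960, irr=0.02,
                        rng=random.Random(0))
```

Because the pulse shape is evaluated as a function of phase rather than of time, anticipating or delaying a pulse automatically time-warps the neighbouring pulse shapes, which is the property the cycle-synchronous formulation relies on.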

Synthesis of bicyclic glottal area waveforms
In an additional experiment, twenty-five bicyclic GAWs are synthesized. The synthesizer described in the previous section is used with a fixed extra pulse rate $\rho = 1$. Setting the extra pulse rate to one triggers one extra pulse during the closed phase of each glottal cycle, and thus yields alternating patterns in the time domain (bigeminism). This signal type relates to a frequently occurring type of voice, i.e., subharmonic voice, which is characterized by alternating magnitudes of partials in the frequency domain. Only signals of class I are synthesized, i.e., signals with irregularity strength $Irr \in [0, 0.1]$ and energy level of the additive noise $H \in [-50, -25]$ dB.
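The roles of the extra pulse rate and of the noise level can be sketched as follows. The sketch assumes that the extra pulse rate acts as a per-cycle trigger probability (one Bernoulli draw per cycle), which is an interpretation and not stated explicitly in the text, and that the noise is rescaled so that its realized RMS level relative to the cyclic pulse train equals the requested value in dB exactly. The function names and the placeholder cyclic signal are hypothetical.

```python
import math
import random


def draw_triggers(n_cycles, rho, rng):
    """One Bernoulli draw per cycle: 1 means an extra pulse is triggered."""
    return [1 if rng.random() < rho else 0 for _ in range(n_cycles)]


def rms(x):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def noise_at_level(reference, h_db, rng):
    """White Gaussian noise whose RMS level is h_db (dB) relative to reference."""
    raw = [rng.gauss(0.0, 1.0) for _ in reference]
    gain = rms(reference) * 10.0 ** (h_db / 20.0) / rms(raw)
    return [gain * v for v in raw]


rng = random.Random(1)
# rho = 1: every one of the 60 cycles (0.3 s at 200 Hz) receives an extra pulse.
bicyclic_triggers = draw_triggers(60, 1.0, rng)
# Placeholder for a cyclic pulse train d1(n); any non-silent signal works here.
d1 = [math.sin(2 * math.pi * n / 240) ** 2 for n in range(2400)]
eta = noise_at_level(d1, -25.0, rng)
level_db = 20.0 * math.log10(rms(eta) / rms(d1))   # realized noise level H in dB
```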

Detection of extra pulses
A detector for extra pulses is proposed in the following. It is based on parameter estimation and resynthesis of the GAWs under test. It is a composition of joint estimation of the fundamental frequency and the cyclic pulse train, estimation of the modulation noise, and modelling of the extra pulse train. Parts of the detector were proposed in the past [5]. The method is here improved by (1) the use of a parametric pulse shape model, i.e., the Chen pulse model [11], (2) the use of a new candidate selection procedure in the fundamental frequency extraction, and (3) a peak-picking free extra pulse train estimator. The method is described as follows. Viterbi algorithm six times, as in the "fast" setup described in [14]. The candidate index γ = 1, 2, …, Γ, and Γ is the number of candidates. No high-pass filtering is used, as was for the analysis of audio signals in [14]. Candidate cyclic unit pulse trains u 1 γ (n) are created for each For further details the interested reader is referred to [14].
We propose "ultra fast" candidate selection that replaces the candidate selection approach described in [14]. The estimate of the cyclic pulse train $\hat{d}_1(n)$ is formed from the candidates selected by the binary candidate selection vector $S = [s_\gamma]$, $s_\gamma \in \{0, 1\}$, $\gamma = 1, 2, \dots, \Gamma$, where $\Gamma$ is the number of candidates. The optimal candidate selection vector $S_{opt}$ is chosen so as to minimize the RMS error level $E_1 = 20 \cdot \log_{10}\left(\sqrt{\overline{e_1^2(n)}} \,/\, \sqrt{\overline{d'^2(n)}}\right)$ of the error waveform $e_1(n) = d'(n) - \hat{d}_1(n)$.

Modulation noise estimation
Second, the modulation noise is estimated as shown in Fig. 6. The method is adapted from [5]; in particular, we add here the option of PSP. A quasi-unit pulse train $u_1(n)$ is cross-correlated with the GAW $d'(n)$ to obtain the pulse shape estimate $\hat{r}(l)$. Via a pulse shape parameterization (PSP) switch, either $\hat{r}(l)$ or a parameterized version $\tilde{r}(l)$ is used. The parameterized pulse shape $\tilde{r}(l)$ is obtained from a Chen pulse generator, the control parameters $\Psi$ of which are obtained via minimization of the parameterization error $e_r(l) = r'(l) - \tilde{r}(l)$, where $r'(l)$ is a normalized version of $\hat{r}(l)$. The modulated cyclic pulse train estimate $\hat{d}_1(n)$ is obtained with a Fourier synthesizer, taking the pulse shape's Fourier coefficients $a_p$ and $b_p$ as well as the instantaneous phase estimate $\hat{\Theta}(n)$ as inputs. Its output is multiplied by the amplitude modulation function estimate $\hat{A}(n)$. The modulation noise vector estimates $\hat{j}(\mu)$ and $\hat{s}(\mu)$ perturb the quasi-unit pulse train $u_1(n)$, and are obtained by minimizing the error $e_1(n) = d'(n) - \hat{d}_1(n)$.
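Cross-correlating a unit pulse train with the GAW amounts to summing the GAW segments that are aligned to the pulse locations. The following minimal sketch illustrates this under stated assumptions: pulse locations are given as integer sample indices, each segment spans exactly one cycle centred on its pulse, and the sum is divided by the number of pulses used (a normalization chosen here for readability). The function names and the toy signal are hypothetical. The second call also shows how a single extra pulse leaks into the cyclic pulse shape estimate, i.e., the cross-talk discussed later in the article.

```python
def pulse_shape_estimate(gaw, pulse_indices, cycle_len):
    """Average the cycle-length GAW segments centred on the given pulse indices."""
    half = cycle_len // 2
    acc = [0.0] * cycle_len
    used = 0
    for n_p in pulse_indices:
        start = n_p - half
        if start < 0 or start + cycle_len > len(gaw):
            continue  # skip pulses whose window leaves the signal
        for l in range(cycle_len):
            acc[l] += gaw[start + l]
        used += 1
    if used == 0:
        raise ValueError("no complete cycle available")
    return [v / used for v in acc]


def normalize(r):
    """Rescale a pulse shape to the range [0, 1]; assumes r is not constant."""
    lo, hi = min(r), max(r)
    return [(v - lo) / (hi - lo) for v in r]


# Toy GAW: five identical cycles of 20 samples each, pulse centres at 10, 30, ...
pulse = [0.0] * 5 + [0.1, 0.4, 0.8, 1.0, 0.9, 0.6, 0.3, 0.1] + [0.0] * 7
gaw = pulse * 5
centres = [10, 30, 50, 70, 90]
clean = pulse_shape_estimate(gaw, centres, 20)

# One extra pulse sample of height 0.5 in the closed phase of the third cycle.
gaw_extra = list(gaw)
gaw_extra[2 * 20 + 2] += 0.5
leaky = pulse_shape_estimate(gaw_extra, centres, 20)
cross_talk = leaky[2] - pulse[2]   # extra pulse height divided by number of cycles
```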
In more detail, the fundamental frequency estimate $\hat{f}_0$ drives a quasi-unit pulse oscillator. Subsequently, $r'(l)$ is further normalized such that $\min r'(l) = 0$ and $\max r'(l) = 1$. The parameterized pulse shape $\tilde{r}(l)$ is shifted in time such that its maximum coincides with the maximum of $r'(l)$. The mean square model error is obtained as $E_r = \overline{e_r^2(l)}$. The parameters $OQ$, $\alpha$, $S_{op}$, and $S_{cp}$, i.e., the open quotient, the asymmetry coefficient, and the steepness parameters of the opening and closing phases [11], are iteratively optimized one by one by golden section search and parabolic interpolation to minimize $E_r$ [15,16]. Each parameter is constrained to the interval [0.1, 0.9]. Each iteration step includes optimization of each parameter in the order $OQ$, $\alpha$, $S_{op}$, and $S_{cp}$. Estimation is stopped as soon as the improvement of $E_r$ in the last iteration step falls below $10^{-5}$.
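The one-parameter-at-a-time optimization can be sketched as follows. This is a simplified illustration and not the authors' code: it uses plain golden section search without the parabolic interpolation step, and a separable quadratic serves as a placeholder error surface instead of the mean square error between the normalized pulse shape estimate and a Chen pulse. The function names, the starting values, and the target values are hypothetical; the bounds [0.1, 0.9] and the stopping threshold of 10⁻⁵ follow the text.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                     # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                           # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def fit_parameters(error, start, lo=0.1, hi=0.9, stop=1e-5, max_iter=100):
    """Optimize the parameters one by one until the error improves by less than stop."""
    params = list(start)
    best = error(params)
    for _ in range(max_iter):
        for i in range(len(params)):
            def along_axis(x, i=i):
                trial = list(params)
                trial[i] = x
                return error(trial)
            params[i] = golden_section_min(along_axis, lo, hi)
        current = error(params)
        if best - current < stop:
            break
        best = current
    return params, error(params)


# Placeholder error surface with a known minimum (order: OQ, alpha, S_op, S_cp).
target = [0.6, 0.4, 0.25, 0.75]


def toy_error(p):
    return sum((x - t) ** 2 for x, t in zip(p, target)) / len(p)


fitted, final_error = fit_parameters(toy_error, start=[0.5, 0.5, 0.5, 0.5])
```

With a real pulse model the error surface is generally not separable, so several sweeps over the parameters are needed before the stopping criterion is met.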
PSP can be switched on or off. Accordingly, either the cross-correlation vector $\hat{r}(l)$ or its parameterized version $\tilde{r}(l)$ is used.
The instantaneous phase estimate $\hat{\Theta}(n)$ and the amplitude modulation function estimate $\hat{A}(n)$ are obtained from the pulse train estimate $u_1(n)$. In particular, $\hat{\Theta}(n) = \pi \cdot \sum_{\mu \in \mathbb{Z}} [2 \cdot \mu + 1]$ at pulse locations of $u_1(n)$, i.e., at $n = \mu \cdot N_0 + \hat{j}(\mu) + \Delta\phi$, and is spline interpolated in between. Likewise, $\hat{A}(n) = \hat{s}(\mu)$ at pulse locations of $u_1(n)$, and is obtained by shape-preserving cubic interpolation in between. The jitter and shimmer vector estimates $\hat{j}(\mu)$ and $\hat{s}(\mu)$ are obtained by minimizing the RMS error level $E_1 = 20 \cdot \log_{10}\left(\sqrt{\overline{e_1^2(n)}} \,/\, \sqrt{\overline{d_1^2(n)}}\right)$, i.e., $(\hat{j}(\mu), \hat{s}(\mu)) = \arg\min_{j(\mu), s(\mu)} E_1(j(\mu), s(\mu))$. The interior-point algorithm is used for each pulse individually [17,18]. After the last pulse, the procedure iteratively refines the estimate until convergence, i.e., until the model error improvement cumulated from the first to the last pulse falls below 0.01 dB.

Extra pulse train waveform estimation
The optimal candidate selection vector $\Xi_{opt}$ is obtained as follows, where $E_2$ denotes the error level of the model error $e_2(n) = e_1(n) - \hat{d}_2(n)$ (Fig. 7). $\Xi$ is first initialized as a zero vector. Starting with the first pulse, $\xi_\mu$ is switched to 1 if its current state is 0, and vice versa. The switch is reverted if the error level $E_2$ does not decrease. After the last pulse is processed, the procedure is restarted. This is repeated until no single new switch yields a decrease of $E_2$. In a second run, $\Xi$ is initialized as a vector of ones. $\Xi_{opt}$ is the $\Xi$ that minimizes $E_2$. As a result, $\xi_\mu$ is 1 at cycle indices $\mu$ for which extra pulses are detected, and 0 elsewhere.
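The switching search over the binary trigger vector Ξ can be sketched as below. The sketch makes simplifying assumptions that are not taken from the article: the extra pulse shape is fixed and known instead of being re-estimated by cross-correlation, cycles have a constant integer length, the extra pulse sits at a fixed offset within each cycle, and the summed squared error is used as the cost in place of the error level in dB (both rank trigger vectors identically). All names and the toy data are hypothetical.

```python
def render_extra_train(xi, shape, cycle_len, offset):
    """Place the extra pulse shape in every cycle whose trigger is 1."""
    out = [0.0] * (len(xi) * cycle_len)
    for mu, on in enumerate(xi):
        if on:
            for l, v in enumerate(shape):
                out[mu * cycle_len + offset + l] += v
    return out


def flip_search(cost, n_cycles, init):
    """Flip one trigger at a time; keep a flip only if it lowers the cost."""
    xi = [init] * n_cycles
    best = cost(xi)
    improved = True
    while improved:                      # restart until no single flip helps
        improved = False
        for mu in range(n_cycles):
            xi[mu] = 1 - xi[mu]
            trial = cost(xi)
            if trial < best:
                best = trial
                improved = True
            else:
                xi[mu] = 1 - xi[mu]      # revert the switch
    return xi, best


def detect_triggers(cost, n_cycles):
    """Run the search from all-zeros and all-ones and keep the better result."""
    runs = [flip_search(cost, n_cycles, init) for init in (0, 1)]
    return min(runs, key=lambda run: run[1])[0]


# Toy residual e1(n): eight cycles of 20 samples with extra pulses in four of them.
truth = [0, 1, 0, 0, 1, 1, 0, 1]
shape = [0.2, 0.5, 0.2]
e1 = render_extra_train(truth, shape, cycle_len=20, offset=9)


def squared_error(xi):
    d2 = render_extra_train(xi, shape, cycle_len=20, offset=9)
    return sum((e - d) ** 2 for e, d in zip(e1, d2))


detected = detect_triggers(squared_error, len(truth))
```

Starting once from all zeros and once from all ones guards against the greedy search settling in a poor local minimum from a single starting point.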
The proposed approach for estimating the extra pulse train d 2 (n) has the advantage over our past peak-picking based approach that no thresholds regarding minimal peak height and minimal peak prominence are necessary. Another advantage of estimating the extra pulse shape via cross-correlation instead of using the shape of the cyclic pulse train is that the height h of the extra pulses is estimated implicitly, because r 2 (l) is automatically scaled accordingly.  [19]. One model is fit for the detector with PSP, and one without. In addition, means and standard deviations of Acc and Se + Sp are obtained, and compared for high and low levels of additive noise as well as high and low irregularity strengths.
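The performance measures can be computed per cycle from the true and the detected trigger vectors. The sketch below uses the standard definitions of sensitivity (Se), specificity (Sp), and accuracy (Acc); that the article evaluates them cycle by cycle is an assumption, and the function name and example vectors are hypothetical. A measure is returned as `None` when its denominator is empty, as is the case for Sp when every cycle contains an extra pulse.

```python
def detection_scores(truth, detected):
    """Return (Se, Sp, Acc) for binary per-cycle trigger vectors."""
    pairs = list(zip(truth, detected))
    tp = sum(1 for t, d in pairs if t == 1 and d == 1)
    tn = sum(1 for t, d in pairs if t == 0 and d == 0)
    fp = sum(1 for t, d in pairs if t == 0 and d == 1)
    fn = sum(1 for t, d in pairs if t == 1 and d == 0)
    se = tp / (tp + fn) if tp + fn else None   # undefined without positives
    sp = tn / (tn + fp) if tn + fp else None   # undefined without negatives
    acc = (tp + tn) / len(pairs)
    return se, sp, acc


# Random extra pulses: one miss and one false alarm in eight cycles.
se, sp, acc = detection_scores([1, 0, 1, 0, 0, 1, 0, 0],
                               [1, 0, 0, 0, 1, 1, 0, 0])
combined = se + sp                      # the Se + Sp measure, between 0 and 2

# Bicyclic signal: every cycle contains an extra pulse, so Sp is undefined.
se_bi, sp_bi, acc_bi = detection_scores([1, 1, 1, 1], [1, 1, 0, 1])
```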
For the experiment involving twenty-five bicyclic GAWs, only the mean and the standard deviation of Se are reported, because Sp is not defined in the absence of cycles without extra pulses.

Results and discussion
Table 2 shows the results of the robustness analysis in terms of linear modelling of the detector performance Se + Sp. The two detection options, i.e., with and without PSP, are compared. Regarding detection without PSP, negative coefficient estimates reflect that detector performance is adversely affected by increases of the irregularity strength $Irr$, the noise level $H$, and the extra pulse rate $\rho$. This appears plausible because irregularity and additive noise limit the detection by decreasing the signal-to-noise ratio, and the more frequently extra pulses occur, the larger the cross-talk of $d_2(n)$ towards $d_1(n)$ is. In contrast, increases of the extra pulse height $h$ affect detector performance advantageously, which is reflected by a positive sign of the coefficient estimate. This appears plausible because larger extra pulses are associated with larger signal-to-noise ratios. The same trends are observed when PSP is used, except for the $\rho$ parameter (-0.099 versus 0.106). The advantageous effect of $\rho$ on the performance of the detector using PSP may be interpreted as a sign that PSP suppresses cross-talk of $d_2(n)$ towards $d_1(n)$.
The detector using the PSP option is more robust with respect to two parameters, i.e., the irregularity strength $Irr$ and the extra pulse height $h$. In particular, the effects of $Irr$ and $h$ on the performance when using PSP are half the effects that are observed when no PSP is used. The effect of $h$ is non-significant when PSP is used, whereas it is significant without PSP. In other words, small extra pulses are detected as well as large extra pulses only when PSP is used. However, the detector with PSP is less robust against additive noise than the detector without PSP, which is reflected by a larger magnitude of the coefficient estimate for $H$ (−0.0147 versus −0.00433). Table 3 summarizes the means and standard deviations of Se + Sp and Acc for the four signal classes.
Mean sensitivities for detecting extra pulses in subharmonic voices, i.e., with the extra pulse rate set to 1, are 29.7% without PSP and 87.4% with PSP. This observation is plausible because, without PSP, the estimated pulse shapes of the cyclic train may themselves be bicyclic, so that extra pulses are cancelled out when the estimate of the cyclic pulse train is subtracted from the GAW. This adverse effect is avoided when PSP is used, because PSP ensures that the estimated pulse shapes of the cyclic train are single pulsed.
The assumptions that need to be made, the limitations of our approach, and the differences between the currently presented detector and its previous version are discussed as follows. First, the signal model needs to be valid for the signal under test. It is likely that our detector is able to distinguish between phonation with extra pulses and normal voice, but it is not clear how the detector would behave if applied to voice samples with other types of abnormalities, e.g., diplophonic voice or chaotic phonation. Further testing (and probably training) of the detector will be needed to establish detection that is specific to extra pulses even if other abnormalities occur in the signal. Second, it is assumed that the extra pulses are unjittered and unshimmered, i.e., they occur at fixed times relative to the cyclic pulse train's instantaneous phase, and with fixed heights. These assumptions were relaxed in the past by using a peak-picking based approach [5] to estimate the times of extra pulses. However, the current approach has fewer degrees of freedom and appears to be more elegant. Also, we expect that our approach can handle small amounts of extra pulse jitter and shimmer. If large amounts of extra pulse jitter and shimmer occur, it may become necessary to adapt the detection approach. Third, cross-correlation based segregation of the cyclic pulse train and the extra pulse train relies on the assumption that these trains are uncorrelated. However, we observed cross-talk in the cyclic pulse train waveform estimate. This cross-talk manifests in the cyclic pulse train estimate as extra pulses, the heights of which depend on the heights of the actual extra pulses and their rate of occurrence. The higher the extra pulses and the higher their rate of occurrence, the larger the cross-talk. This limitation is addressed in the current approach by introducing PSP to the estimation of the cyclic pulse train, which suppresses extra pulses in the cyclic pulse train estimate.

Conclusion
We propose a synthesizer for GAWs that is capable of adding extra pulses to the cyclic pulse train, and a detector for extra pulses. The detector is tested on 100 synthesized GAWs with random extra pulses, and on 25 GAWs with extra pulses occurring in each quasi-closed phase of the cyclic pulse train, a pattern known as bicyclicity, bigeminism, subharmonics, double pulsing, or alternate pulsing. Using signals containing random extra pulses, tests are conducted with different energy levels of additive noise, different strengths of modulation noise, i.e., jitter and shimmer, as well as different extra pulse rates and heights. Two variants of the detector are tested. One variant parameterizes the estimated pulse shapes of the cyclic pulse train using a Chen pulse model, whereas the simpler variant does not.
Significant steps towards the improvement of our detection approach were made. (i) Our past experience has shown that extra pulses disturb the estimation of the cyclic pulse train, which we successfully tackle with PSP. In particular, a cross-talk had been observed that biased the estimation of the cyclic pulse shape in such a way that it appeared to be double pulsed. We hypothesized that it is possible to suppress cross-talk and thus increase detection performance by using a single-pulse parametric model for the pulses of the cyclic pulse train. Indeed, it is shown experimentally that the detector that uses PSP outperforms the simpler approach if the signals are not corrupted with high energy levels of additive noise. The PSP for cross-talk suppression appears to be particularly relevant for subharmonic voices, because frequent extra pulses result in strong cross-talk without PSP. (ii) Faster candidate selection is proposed for fundamental frequency extraction. (iii) A peak-picking free extra pulse estimator is proposed.
We conclude from the robustness analysis that a user should measure the energy level of the additive noise and the irregularity strength before applying the proposed detector for extra pulses. Normally, the PSP option should be used, especially if extra pulses occur frequently, as, e.g., in subharmonic voices. If high energy levels of additive noise are observed, the detector should be used without PSP. In cases of high irregularity strengths, the user is advised not to use the detector with either of the two options.

Fig. 2.
Block diagram of the cyclic pulse train generator. A pulse times vector $n_p(\mu)$ is obtained from the fundamental frequency $f_0$ and a cycle length modulation noise vector $j(\mu)$, controlled by the irregularity strength $Irr$. An amplitude modulation noise vector $s(\mu)$ is also obtained. The instantaneous phase $\Theta(n)$ and the amplitude modulation function $A(n)$ are obtained by interpolation. The pulse shape $r(l)$ is a Chen pulse [11], controlled by the parameters $\Psi$. The Fourier coefficients $a_p$ and $b_p$ of $r(l)$, and $\Theta(n)$, are input to a Fourier synthesizer (FS). Its output $d'_1(n)$ is multiplied by $A(n)$ to obtain the cyclic pulse train $d_1(n)$.

subtracted from $d'(n)$ to obtain $e_1(n)$, which is minimized with respect to $S$.

Fig. 6.
Block diagram regarding the estimation of the modulation noise. A quasi-unit pulse train $u_1(n)$ is cross-correlated with the GAW $d'(n)$ to obtain the pulse shape estimate $\hat{r}(l)$. Via a pulse shape parameterization (PSP) switch, either $\hat{r}(l)$ or a parameterized version $\tilde{r}(l)$ is used. The parameterized pulse shape $\tilde{r}(l)$ is obtained from a Chen pulse generator, the control parameters $\Psi$ of which are obtained via minimization of the parameterization error $e_r(l) = r'(l) - \tilde{r}(l)$. The modulated cyclic pulse train estimate $\hat{d}_1(n)$ is obtained with a Fourier synthesizer, taking the pulse shape's Fourier coefficients $a_p$ and $b_p$ as well as the instantaneous phase estimate $\hat{\Theta}(n)$ as inputs. Its output is multiplied by the amplitude modulation function estimate $\hat{A}(n)$. The modulation noise vector estimates $\hat{j}(\mu)$ and $\hat{s}(\mu)$ perturb the quasi-unit pulse train $u_1(n)$, and are obtained by minimizing the error $e_1(n) = d'(n) - \hat{d}_1(n)$.

Fig. 7.
Block diagram regarding the estimation of the extra pulse train. The constant phase shift $\pi$ is added to the instantaneous phase estimate $\hat{\Theta}(n)$ to obtain an extra unit pulse train estimate $\hat{u}_2(n)$. $\hat{u}_2(n)$ is cross-correlated with the error $e_1(n)$ of the cyclic model to obtain the extra pulse shape $r_2(l)$. The extra pulse train estimate $\hat{d}_2(n)$ is obtained by convolving $\hat{u}_2(n)$ with $r_2(l)$. The extra pulse trigger estimate $\hat{\xi}(\mu) \in \{0, 1\}$ is obtained via minimizing the model error $e_2(n) = e_1(n) - \hat{d}_2(n)$.

Table 3
Summary of the means and standard deviations of the performance measures Se + Sp, i.e., the sum of the sensitivity and specificity, and Acc, i.e., the accuracy. The measures are shown for GAWs with a small level $H$ of additive noise and a small irregularity strength $Irr$ (class I signals), GAWs with an increased level $H$ of additive noise and a small irregularity strength $Irr$ (class II signals), GAWs with a small level $H$ of additive noise and a larger irregularity strength $Irr$ (class III signals), and finally GAWs with a large level $H$ of additive noise and a large irregularity strength $Irr$ (class IV signals). The detector using PSP achieved the best performance with class I signals (Se + Sp = 1.722 and Acc = 0.883).