Assessment of vocal cord nodules: a case study in speech processing by using Hilbert-Huang Transform

Vocal cord nodules are a pathological condition in which unnatural masses grow on the vocal folds. Among other effects, changes in the overall mass and stiffness of the vocal cords alter their vibratory behaviour, and thus the vocal emission they generate. This causes dysphonia, i.e. abnormalities in the patient's voice, which can be analysed and inspected via audio signals. However, evaluating voice condition through speech processing is not a trivial task, as standard methods based on the Fourier Transform fail to fit the non-stationary nature of vocal signals. In this study, four audio tracks, provided by a volunteer patient whose vocal fold nodules were surgically removed, were analysed using a relatively new technique: the Hilbert-Huang Transform (HHT) via Empirical Mode Decomposition (EMD), specifically using the CEEMDAN (Complete Ensemble EMD with Adaptive Noise) algorithm. The method was applied to speech signals recorded before the removal surgery and during convalescence, in order to investigate specific trends. The possibilities offered by the HHT are presented, and some limitations of decomposing the signals into so-called intrinsic mode functions (IMFs) are highlighted. The results of these preliminary studies are intended as a basis for developing viable alternatives to the software currently used for the analysis and evaluation of pathological voice.


Causes, Early Signs and Symptoms
Growth of pathological masses in vocal folds is generally caused by excessive and repeated mechanical stress. In fact, vocal cords are subject to collision forces at each vibratory cycle. Moreover, the air forced through the small gap between the folds during voice modulation also causes drying. Nodules therefore arise from trauma to the vocal cord tissue, which in turn is due to chronic vocal overuse or misuse. Over time, this vocal abuse first produces soft, swollen spots, which then evolve into nodules and become bigger and stiffer if the incorrect vocal use persists.
Ordinarily, the first symptom noticed by affected people is difficulty in producing sounds in the upper vocal pitch range [15] [16]. Nevertheless, a "healthy" vocal range is still not well defined, and its definition changes according to the field of interest. Conventionally, the limits of "regular" vocal emissions reach as low as 50 Hz (at least) and up to 20 kHz [17]; differences in vocal fold size, due to gender, genetics, age and other causes, make the definition of a "physiological" vocal range quite vague.
However, since nodules interfere more or less markedly, depending on their size, with the vibrational behaviour of the vocal cords, differences between the frequency responses of healthy and pathological conditions are known and documented in the literature [18], even though the vibrational behaviour of the vocal folds is not at all easy to understand or to reproduce, mainly because of its known nonlinearity [5] [19]. Generally, nodules reduce the frequency and intensity ranges, but fundamental frequency and intensity often do not undergo any dramatic change when these masses develop [15] [16]. Other prominent signs of vocal fold nodules are breathiness and hoarseness. The latter is due to aperiodic vibrations of the vocal folds, while the former results from the incomplete closure of the folds upon phonation [15]. The patient's voice may also be perceived as harsher and rougher than usual.
Another common symptom that may arise is a sensation of pain or soreness in the neck, lateral to the larynx. This generally happens because of the increased effort required to produce the voice [15] [16]. As will be discussed in more detail in Sections 4 and 5, this point also represents a major limitation to the usefulness of audio recordings, as they are currently performed, for the assessment of vocal cord nodules. Indeed, patients, more or less involuntarily, adjust their vocal emission in order to provide the requested tone; speech production is a closed-loop process, so patients will adapt in order to produce the correct sound if possible. This means that some parameters, especially volume (and so the energy content of the signal), become meaningless when compared between different audio tracks, as the input (i.e., the air pressure produced by the lungs) is not controlled.

Diagnostic Methods
In the current state of knowledge, a wide range of techniques is available for the diagnosis of vocal fold nodules. Put simply, they can be sorted into two main groups: methods that depend on the direct observation of the vocal cords and the so-called "non-invasive" ones, based on the analysis of vocal emissions. It should be remembered that "vocal emissions" include all kinds of vocal output: voiced, unvoiced and plosive sounds. Furthermore, non-invasive methods can be divided into those based on acoustic measurements (objective, quantitative approaches) and those that rely on perceptual evaluation (a subjective measure of voice, performed by specialists). These two approaches dominate the current clinical evaluation of voice quality, not only for nodules but for all kinds of voice disorders. Perceptual methods can be further divided into clinician-based ones (e.g., GRBAS/GIRBAS and CAPE-V) and patient-based ones, such as V-RQOL and IPVI. Attempts have been made to link objective and subjective estimations [20], while the effectiveness and reliability of subjective tests have been amply discussed in the literature [21] [22].
In this case, the pre-operative conditions and the follow-ups during convalescence were investigated with a computerised tool for acoustic voice analysis, the Multi-Dimensional Voice Program (MDVP™). First introduced in 1993, the MDVP™ software has been applied in several contexts [23]; its validity has been analysed and compared with other available software in recent years [24]. More than 33 acoustic parameters can currently be inspected quantitatively thanks to MDVP™ [25]. This has been made possible by the introduction in recent decades of new digital instruments, such as the first digital spectrograph (DSP Sonograph), introduced by Kay Elemetrics® in the late 1970s. A perceptual evaluation was also performed as a first assessment of the patient's condition, according to the GIRBAS scale. This method works simply by rating six parameters of the voice, each ranging from 0 (non-pathological) to 3 (strongly affected). In its first definition [26], the scale was known as GRBAS, where the five elements considered were the Grade, Roughness, Breathiness, Asthenia and Strain of vocal emissions. The "I" parameter stands for the Instability of the voice and was added at a later stage by [27], giving the scale its current form. The GRBAS/GIRBAS scale is currently accepted as a standard by the European Group on the Larynx and by the Japanese Society of Logopedics and Phoniatrics, and represents the most commonly used perceptual methodology.

Treatments
In some cases, surgery is neither needed nor recommended for the treatment of vocal cord nodules [28] [29]. Non-surgical techniques, such as behavioural voice therapy, ordinarily performed by speech-language pathologists, are generally able to reduce the size and severity of nodules, even if traumatic injuries are unlikely to heal completely without any after-effects [28] [30]. However, removal surgery may be necessary when behavioural interventions are not effective. In that case, it is possible that the vocal range will be permanently altered after surgery [31]. In the particular case reported here, the patient underwent surgical removal, as voice therapy alone proved to be insufficient.

Case Report.
The patient, an Italian male adult (one of the Authors), started to suffer from voice disorders in April 2013, probably due to the overuse of voice. The clinical picture showed an upper phlogosis and was mainly characterised by hoarseness worsening as a result of vocal strain; an endoscopic evaluation indicated a haemorrhagic vocal cord polyp and some signs of vocal stress.
Two months later, on 11th June 2013, the patient underwent the removal of the right vocal cord polyp, under analgosedation with remifentanil, according to the technique of Target Controlled Infusion (TCI), and local anaesthesia. The procedure was conducted in an awake setting and the patient was discharged on the same day.
The patient received one session of preoperative counselling, targeting vocal hygiene instruction and surgery preparation; he also followed a course of postoperative therapy.
On perceptual evaluation, GIRBAS decreased from 1-1-1-1-2-1 to all zeros 5 months postoperatively. A voice handicap index (VHI) was also administered before and 5 months after the operation to evaluate the patient's perceived satisfaction with his voice; it went from a mild deficit (score 12) to a normal result (score 1).
In the acoustic analysis carried out with MDVP™, an improvement of the parameters (jitter, shimmer, NHR, high pitch range) was also observed, as will be explained in more depth later.
Since treatment, the patient has not presented any dysphonia recurrence even if he continues to use the voice intensively, mainly due to his work, which is highly demanding in terms of voice use.
All the recordings were made at the ENT Unit, Santa Chiara Hospital, Trento, Italy. The microphone and instrumentation are produced by KAYPENTAX and are those provided by default for MDVP™ analysis.

Speech Records
Each speech record comes from the same patient, is 3.75 seconds long and consists of a single emission of a sustained vowel 'a' according to Italian pronunciation (/a/ as defined by the International Phonetic Alphabet). The sampling frequency (i.e., the number of samples per second used to represent the recorded speech digitally) is 44100 Hz, resulting in 165375 elements in each digital record. This  circa one year since surgery; 210 days since previous record). They can be seen in Figure 1. Comparing the first two records gives direct insight into the results of the surgical intervention and its immediate aftermath, while the evolution of specific trends between tracks #2, #3 and #4 provides knowledge about the follow-ups and the healing of the voice condition during convalescence.

MDVP Clinical Reports
Four medical reports were produced from the analysis of all tracks via MDVP™. Among all the parameters considered, the average fundamental frequency (F0) is the most commonly used for the evaluation of voice disorders; ordinarily, fundamental frequencies fall between 85 and 180 Hz for adult males, and between 165 and 255 Hz for adult females [32]. F0 can be tracked automatically by using peak-picking strategies, autocorrelation techniques or other equivalent methods [33]. It is the inverse of the fundamental period T0, that is to say, the time elapsed between two successive laryngeal pulses [10]. Nevertheless, as stated before, T0 is defined in a speech signal under the assumptions of linearity and local stationarity, which can be a problem when it is used to approximate a pathological voice. As will be discussed later in this paper, HHT-based frequency-related parameters bypass these limitations. These reports show that the pathological voice does not undergo any marked shift in fundamental frequency, as expected [15] [16].
In Figure 2, some results from the MDVP™ reports are reproduced. The 11 indicated parameters are (moving clockwise from the top) the Jitter percent (Jitt); the Fundamental Frequency variation (vF0); the Shimmer percent (Shim); the Peak-to-Peak Amplitude Variation (vAm); the Noise-to-Harmonic Ratio (NHR); the Voice Turbulence Index (VTI); the Soft Phonation Index (SPI); the F0-Tremor Intensity Index (FTRI); the Amplitude Tremor Intensity Index (ATRI); the Degree of Voice Breaks (DVB); and the Degree of Sub-Harmonics (DSH). The green circle encloses the threshold of healthy conditions (every parameter has a different scale).
These parameters are not the only ones produced by MDVP™ analysis, but represent those generally most taken into account, as they are considered the most important objective measures for assessment of several voice disorders [23].
In this particular case, the parameters of interest, i.e. the ones that exceed their respective thresholds, are Jitt, vF0, Shim and vAm. In more detail, vF0 exceeds its threshold (1.100 %) before removal, but decreases and stabilises afterwards; Jitt also falls drastically just after the operation (from 1.655 %, with a threshold of 1.040 %, to 0.379 %) and remains within the limits afterwards. On the other hand, Shim, which is just slightly over the limit before the intervention (3.851 % against a threshold of 3.810 %), increases in the first months of convalescence (to 5.139 %) and then starts to decrease (to 4.003 % and 2.467 %, for tracks #3 and #4 respectively). Also vAm, which was not above the limit for pathological conditions before the operation, went beyond the 8.200 % threshold afterwards, reaching a maximum of 10.427 % (track #2) and then declining to 6.341 % (track #3) and to 5.255 % (track #4). From these data, it is possible to deduce that a transient effect of the removal surgery was a temporary increase in the variability of the amplitude, which affected both the parameters linked with it, i.e. Shimmer and vAm (that is to say, the variation between the amplitudes of consecutive periods, divided by the average amplitude, and the variation of the peak-to-peak amplitude). Instead, Jitter, which represents the cycle-to-cycle variation of F0, and the other F0-related parameters were positively affected by the surgery and suffered no transient worsening. However, these reports do not provide any information about which frequencies were affected the most by the nodules' presence and removal. This information was provided by the EMD-based analysis described in the next Section.
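The two perturbation measures discussed above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the common textbook definition (mean absolute difference between consecutive cycles, divided by the mean value, in percent), which may differ in detail from what MDVP™ computes internally, and the function name `perturbation_percent` and the sample values are hypothetical, not taken from the patient's reports.

```python
from statistics import mean


def perturbation_percent(values):
    """Mean absolute difference between consecutive cycles,
    relative to the mean value, expressed in percent."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


# Hypothetical cycle-by-cycle measurements (illustrative only).
periods_ms = [7.50, 7.60, 7.40, 7.50, 7.55]        # pitch periods
peak_amplitudes = [1.00, 1.04, 0.97, 1.01, 0.99]   # peak-to-peak amplitudes

jitter = perturbation_percent(periods_ms)         # period perturbation
shimmer = perturbation_percent(peak_amplitudes)   # amplitude perturbation
print(f"Jitt = {jitter:.3f} %, Shim = {shimmer:.3f} %")
```

Applying the same function to periods and to amplitudes makes the parallel between the two measures explicit: Jitter describes cycle-to-cycle variation in time, Shimmer the same variation in amplitude.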

Hilbert-Huang Transform for Speech Processing.
The Hilbert-Huang Transform (HHT) was first proposed in its current form by Huang et al. [6]; it is a suitable option for representing data from nonlinear, non-stationary processes without losing any time-domain information. Essentially, the HHT is made up of two parts: Empirical Mode Decomposition (EMD) and Hilbert Spectral Analysis (HSA). Although the HHT is used more and more often in the signal-processing context, some background theory is provided here to make this paper more self-contained. Much deeper explanations can be found in Huang's original papers [34] [6] [35], as well as in his recent book [36]. For an arbitrary time series $g(t)$, the Hilbert Transform can be defined as

$$\tilde{g}(t) = \frac{1}{\pi}\, P \int_{-\infty}^{+\infty} \frac{g(\tau)}{t - \tau}\, d\tau$$

where $P$ denotes the Cauchy principal value. In this way, the generic time series $g(\tau)$ is convolved with the function $1/(\pi t)$, which emphasises the local properties of the analysed signal. The tilde is used here to denote the transformed function, which is still a function of time: unlike the DFT, the HT maps a function of time (or frequency) into the same domain. It should be remarked that the HT, if applied to an arbitrary dataset, can return physically meaningless results, that is to say, negative frequencies. To avoid these problems, two conditions must be imposed on the input data: (1) the function must be symmetric with respect to the local zero mean; (2) the function must have the same number of zero crossings and extrema, or these numbers must differ at most by one.
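To make the role of the Hilbert Transform more concrete, the sketch below builds the analytic signal of a pure tone and reads its instantaneous frequency from the phase increment between samples. It is a minimal illustration and not the code used in this study: it relies on a naive discrete Fourier transform written with the Python standard library only (in practice a routine such as `scipy.signal.hilbert` would normally be used), and the function names and the 100 Hz test tone are assumptions made for the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform, adequate for a short demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out


def analytic_signal(x):
    """Return x + j*H[x], obtained by zeroing the negative-frequency bins."""
    n = len(x)
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        upper = n // 2
    else:
        upper = (n + 1) // 2
    for k in range(1, upper):
        weights[k] = 2.0
    spectrum = dft(x)
    return dft([s * w for s, w in zip(spectrum, weights)], inverse=True)


def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the phase step between samples."""
    z = analytic_signal(x)
    return [cmath.phase(b * a.conjugate()) * fs / (2 * math.pi)
            for a, b in zip(z, z[1:])]


fs = 800.0  # hypothetical sampling rate for the demo
tone = [math.cos(2 * math.pi * 100.0 * i / fs) for i in range(64)]
inst_f = instantaneous_frequency(tone, fs)
print(min(inst_f), max(inst_f))
```

For a monocomponent signal such as this tone the instantaneous frequency is well behaved; for a multicomponent signal it is not, which is why the signal is first decomposed into modes that meet the two conditions above.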
Empirical Mode Decomposition provides "modes" that satisfy both restrictions required by the Hilbert Transform. These modes, the so-called Intrinsic Mode Functions (IMFs), can be regarded as the oscillations embedded in the original signal [37]; unlike the harmonics obtained through the Fourier Transform, their amplitude and frequency are not constant over time. The process by which they are extracted from the signal can be found in [12]; this step-by-step method is also known as the sifting process. The process is quite straightforward: (1) for $k = 0$, all the extrema (local maxima and minima) of the analysed data (i.e., of $d_k(t) = x(t)$) are identified; (2) the local maxima of $d_k(t)$ are connected by a cubic spline, defining $e_{max}(t)$; likewise, the local minima of the same function are linked by the same kind of spline interpolation, thus obtaining $e_{min}(t)$. These two lines form, respectively, the upper and lower envelopes of the given data, which all lie between them; (3) the local mean $m(t)$ is evaluated as the mean of $e_{max}(t)$ and $e_{min}(t)$; (4) the candidate mode $d_{k+1}(t) = d_k(t) - m(t)$ is computed; (5) if $d_{k+1}(t)$ is an IMF, it is stored, the residual $r_k(t)$ is obtained by subtracting it from the current data, the flag is increased by one ($k = k + 1$) and $r_k(t)$ is treated as input data for step (2). If $d_{k+1}(t)$ is not an IMF, $d_{k+1}(t)$ itself is treated as input data for step (2).
The iteration ends when the residual $r_n$ satisfies a predefined stopping criterion. In this work, the decomposition was stopped when, at the n-th iteration, the residual $r_n$ became a monotonic function, from which no more IMFs can be extracted, or when the set maximum number of iterations was reached, whichever came first. Once computed, the IMFs form a complete and nearly orthogonal basis for the original signal; each of them is formed by data for which the mean of the maxima and minima envelopes is zero at any point [38], as they are, by definition, monocomponent signals.
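Two of the checks mentioned above are easy to express directly: the condition that the numbers of extrema and zero crossings differ at most by one, and the monotonic-residual stopping criterion. The sketch below is a simplified illustration with hypothetical function names; it does not verify the envelope-mean condition, which would require the spline envelopes, and it is not taken from the 'ceemdan.m' script.

```python
def count_zero_crossings(x):
    """Number of sign changes, ignoring samples that are exactly zero."""
    signs = [v > 0 for v in x if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def count_extrema(x):
    """Number of strict interior local maxima and minima."""
    return sum(
        1
        for a, b, c in zip(x, x[1:], x[2:])
        if (b > a and b > c) or (b < a and b < c)
    )


def satisfies_count_condition(x):
    """True if extrema and zero crossings differ at most by one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1


def is_monotonic(x):
    """True if the residual never changes direction (stopping criterion)."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


oscillation = [0, 1, 0, -1, 0, 1, 0, -1, 0]
shifted = [v + 2 for v in oscillation]  # same shape, riding on an offset

print(satisfies_count_condition(oscillation))  # zero-mean oscillation
print(satisfies_count_condition(shifted))      # offset removes the crossings
print(is_monotonic([1, 2, 2, 5]))              # trend-like residual
```

The shifted copy fails the check because it never crosses zero while keeping all its extrema, which is the kind of signal the sifting process is designed to correct by subtracting the local mean.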
For this study, the MATLAB® script 'ceemdan.m' was used. This code was developed by Marcelo Colominas and was introduced in its current version in [12]. CEEMDAN, the Complete Ensemble EMD with Adaptive Noise algorithm, is an improvement of the basic EMD algorithm, which suffers from so-called "mode mixing": the overlap between modes whose differences are so small that their components can be assigned to the wrong mode. This causes aliasing in the time-frequency distribution, leading to a loss of physical meaning of the decomposed data. To solve this issue, Ensemble EMD (EEMD) was first proposed. In brief, the idea is to add white Gaussian noise at each stage of decomposition; then, the generic k-th IMF is computed as the mean, over an ensemble of $I$ trials ($x^{i}[n]$), of the corresponding $IMF_k^{i}[n]$ obtained via EMD.
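Step by step, the algorithm can be defined as follows: (1) $I$ noisy copies of the signal are generated as $x^{i}[n] = x[n] + w^{i}[n]$, where $w^{i}[n]$ ($i = 1, \dots, I$) are different realisations of white Gaussian noise; (2) each $x^{i}[n]$ is fully decomposed by EMD, obtaining its modes $IMF_k^{i}[n]$; (3) the k-th IMF is assigned as

$$\overline{IMF}_k[n] = \frac{1}{I} \sum_{i=1}^{I} IMF_k^{i}[n]$$

A much more detailed description of the EEMD can be found in [13] and [14]. CEEMDAN, instead, uses each mode for the computation of the next one, sequentially, in a deflationary scheme; basically, in CEEMDAN the modes are computed as the difference between the current residual and the average of its local means (considering also the noise added to the signal), while in EEMD each $x^{i}[n]$ is decomposed independently from the others, thus generating $I$ different residuals $r_k^{i}[n]$ for each mode. Again, the CEEMDAN algorithm can be described in successive steps: (1) $I$ realisations of white Gaussian noise ($w^{i}[n]$) are used to define $x^{i}[n] = x[n] + \epsilon_0 w^{i}[n]$, with $\epsilon_0$ representing the arbitrary value of the noise standard deviation for the first step. The first modes are then computed exactly as for EEMD:

$$\widetilde{IMF}_1[n] = \frac{1}{I} \sum_{i=1}^{I} IMF_1^{i}[n]$$

(2) for $k = 1$, the first residual is calculated as $r_1[n] = x[n] - \widetilde{IMF}_1[n]$; (3) the realisations $r_1[n] + \epsilon_1 E_1(w^{i}[n])$ are decomposed, where $w^{i}[n]$ are different realisations of white Gaussian noise for $i = (1, \dots, I)$ and the operator $E_j(\cdot)$ indicates the whole process that, given a signal, produces the j-th mode by EMD. Then, the second mode is defined as

$$\widetilde{IMF}_2[n] = \frac{1}{I} \sum_{i=1}^{I} E_1\big(r_1[n] + \epsilon_1 E_1(w^{i}[n])\big)$$

(4) for $k = 2$ onwards, the k-th residuals are computed as $r_k[n] = r_{k-1}[n] - \widetilde{IMF}_k[n]$; (5) the realisations $r_k[n] + \epsilon_k E_k(w^{i}[n])$ are decomposed. Then, the $(k + 1)$-th mode is defined as

$$\widetilde{IMF}_{k+1}[n] = \frac{1}{I} \sum_{i=1}^{I} E_1\big(r_k[n] + \epsilon_k E_k(w^{i}[n])\big)$$

(6) steps (4) and (5) are reiterated until $k = K$.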
The process is relatively time-consuming, as a large number of iterations is generally required, but it substantially reduces the risk of mode mixing. In this work, the initial noise standard deviation $\epsilon_0$ was set to 0.2, the number of realisations (NR) to 15, and the maximum number of sifting iterations allowed to 3000. The code was also set to automatically increase the SNR at every stage.
It is also important to state that all the data from the four audio tracks were filtered before being decomposed into IMFs. In more detail, a Butterworth low-pass filter of order 10 was applied in both the forward and reverse directions, to ensure zero phase distortion. Filtering was needed to reduce the effects of background noise in the audio recordings. A set of IMFs was obtained from each track. From these four sets, in order to speed up the comparison, four subsets were extracted, considering only IMFs 7 to 11 (Figure 3). The IMFs chosen for the subsets have Mean Frequencies, $\bar{f}_n$, close to the estimated fundamental frequencies of each track, as supplied by the MDVP™ clinical reports.
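The reason for filtering in both directions can be illustrated with a very small example. The sketch below is not the order-10 Butterworth filter used in this study: as a stand-in, it applies a hypothetical first-order low-pass filter (smoothing factor `alpha` chosen arbitrarily) forwards and then backwards, which is the same principle implemented by routines such as `scipy.signal.filtfilt`.

```python
def one_pole_lowpass(x, alpha):
    """Causal first-order low-pass: y[n] = alpha*x[n] + (1 - alpha)*y[n-1]."""
    y, prev = [], 0.0
    for v in x:
        prev = alpha * v + (1.0 - alpha) * prev
        y.append(prev)
    return y


def zero_phase(x, alpha):
    """Filter forwards, reverse, filter again, then reverse back."""
    forward = one_pole_lowpass(x, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]


# Symmetric test pulse: zero-phase filtering should keep the output
# symmetric about the pulse, while a single causal pass smears it forwards.
pulse = [0.0] * 50 + [1.0] + [0.0] * 50
causal = one_pole_lowpass(pulse, 0.2)
smoothed = zero_phase(pulse, 0.2)

print(causal[49], causal[51])      # asymmetric around the pulse
print(smoothed[49], smoothed[51])  # nearly identical on both sides
```

The second pass cancels the phase lag introduced by the first, so features of the signal are not shifted in time; the price is that the magnitude response is applied twice.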
It must be remarked that the four tracks did not provide the same number of IMFs when decomposed, and that this represents a major limitation of the analysed technique. In fact, EMD is a purely empirical decomposition, as the name itself states, and performs blind signal separation. Hence, no physical interpretation can be provided to justify the number of IMFs obtained. In detail, the decomposition of track #1 and track #3 produced 18 Intrinsic Mode Functions each, while 19 functions were extracted from track #2 and from track #4. In this particular case, the Authors were able to state that IMFs 7 to 11 include, with small differences, the same data for all the tracks; this was only possible by supervising the EMD operations at every iteration, keeping track of the whole process. However, this is a serious limit to any further attempt to generalise the method, as this correspondence may not always be evident enough or easy to verify.

Results.
From each of the subsets previously described, four parameters were considered and analysed, in order to test their capability as indices of pathological conditions and/or of improvement in healing: the IMF Mean Frequency ($\bar{f}_n$), the IMF Standard Deviation (SD), the Total Energy Content ($E_{tot}$) and the IMF Energy Content ($E_n$). The first two will be referred to hereinafter as time-frequency parameters, and the latter two as energetic parameters. These features can be defined as follows:

$$\bar{f}_n = \frac{1}{N} \sum_{i=1}^{N} f_n[i]$$

where $N$ stands for the number of elements in the investigated signal (here 165375, as the signal is 3.75 seconds long with a sampling rate of 44100 Hz) and $f_n[i]$ are the instantaneous frequencies of the n-th mode;

$$SD_n = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(f_n[i] - \bar{f}_n\right)^2}$$

where $SD_n$ simply represents the standard deviation between the instantaneous frequencies $f_n[i]$ and the mean frequency $\bar{f}_n$ of the n-th mode;

$$E_n = \sum_{i=1}^{N} a_n[i]^2$$

where $a_n[i]$ is the amplitude, and hence the energy, of the n-th mode, computed for every element, as it is not constant in time;

$$E_{tot} = \sum_{n=1}^{M} E_n$$

where $M$ is the total number of IMFs that compose the original signal (18 for tracks #1 and #3, 19 for tracks #2 and #4).
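The four parameters can be computed from the instantaneous frequency and amplitude of each mode with a few lines of code. The sketch below is illustrative only and rests on two assumptions that should be checked against the original definitions: the standard deviation is taken over all $N$ samples (population form), and the energy of a mode is taken as the sum of its squared instantaneous amplitudes. The function names and sample values are hypothetical.

```python
import math


def mean_frequency(inst_freq):
    """Average of the instantaneous frequencies of one mode."""
    return sum(inst_freq) / len(inst_freq)


def frequency_sd(inst_freq):
    """Population standard deviation of the instantaneous frequencies."""
    mu = mean_frequency(inst_freq)
    return math.sqrt(sum((f - mu) ** 2 for f in inst_freq) / len(inst_freq))


def imf_energy(amplitude):
    """Energy of one mode, assumed here to be the sum of squared amplitudes."""
    return sum(a * a for a in amplitude)


def total_energy(amplitudes_per_imf):
    """Sum of the energies of all modes of a record."""
    return sum(imf_energy(a) for a in amplitudes_per_imf)


# Hypothetical values for two short modes (illustrative only).
freqs = [100.0, 120.0, 140.0]
amps = [[1.0, 2.0, 2.0], [3.0]]

print(mean_frequency(freqs), frequency_sd(freqs))
print(imf_energy(amps[0]), total_energy(amps))
```

Keeping the time-frequency parameters separate from the energetic ones mirrors the distinction made in the text, and makes it easy to compare the same mode across the four tracks.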

The Hilbert Spectrum and Time-Frequency Analysis
The Hilbert Spectrum is the graphical representation of the instantaneous frequency over time, computed by applying the Hilbert Transform separately to each of the IMFs included in the four subsets defined previously. As can be seen in Figure 4, the signal recorded before the removal surgery shows high-frequency peaks much more often than the tracks recorded after the operation. By tracking the evolution of both $\bar{f}_n$ and SD over time, some distinct trends can be recognised. These results, described below, are plotted in Figure 5. The mean frequency $\bar{f}_n$ shows a peak at the post-operative record closest to the surgical intervention (25th September 2013). This reflects the trend of the average fundamental frequency F0 between track #1 and track #2, discussed in Section 3. However, F0 rebounded slightly after a local minimum at track #3 and, for track #4, returned to values close to the peak (almost the same, circa 134 Hz). Instead, $\bar{f}_n$ keeps decreasing for all IMFs, with the sole exception of IMF 8. Even the 8th mode, however, does not reach its peak value again; overall, all modes seem to tend towards stabilisation. The exact cause of this behaviour cannot be ascertained with certainty, but it could be explained by the fact that surgery is a traumatic event for the vocal fold tissues, which suddenly change their vibratory characteristics. A plausible explanation would be scarring and/or swelling. What is certain is that, since natural frequency increases with stiffness and decreases with mass, any explanation for the immediate post-operative peak will be related to (1) an obvious decrease in mass, due to the nodules' removal; (2) a sudden increase in stiffness, possibly linked to the immediate effects of scarring; (3) a combination of these two factors and/or other effects, possibly related not directly to the vocal folds but to other components of the voice-production mechanism.
Subsequently, the observed settling is most probably due to the healing process, with scar tissue reabsorbed over time and a reduction in overall stiffness.
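As a rough illustration of the mass and stiffness argument above, the vocal fold can be idealised as a single mass-spring oscillator. This lumped model, and the symbols $k$ (effective stiffness) and $m$ (effective vibrating mass), are assumptions introduced here for illustration only; they are not the model used in this study.

```latex
f_0 = \frac{1}{2\pi}\sqrt{\frac{k}{m}}
\qquad\Longrightarrow\qquad
\frac{\Delta f_0}{f_0} \approx \frac{1}{2}\left(\frac{\Delta k}{k} - \frac{\Delta m}{m}\right)
```

Under this idealisation, a small loss of mass ($\Delta m < 0$) and a small gain in stiffness ($\Delta k > 0$) both raise $f_0$, each by roughly half of its relative change, which is consistent with points (1) and (2).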
The standard deviation (SD) shows a trend that is strongly related to that of F0. Indeed, since in track #4 the mean frequencies of IMFs 7, 9, 10 and 11 (just to cite the ones reported here) tend to stabilise while F0 tends to increase, a higher variability is understandable. As mentioned several times earlier, non-pathological voice is expected to be more coherent and to span a larger frequency range; both factors contribute to increasing the variability of the frequencies contained in the IMFs. Noteworthy is the very high value of SD for IMF 7 in track #1. Most probably, this mode is much more sensitive to the pathological conditions of the pre-operative voice. A possible explanation is that the non-homogeneity induced by the nodules' presence has a greater effect on the modes related to higher frequencies, like IMF 7; after the surgical intervention, this disturbance is greatly attenuated and this mode starts to behave as the others do.

Marginal Hilbert Spectra and Energy Content Analysis
Given the Hilbert Spectrum $H(\omega, t)$, which is by definition a portrayal of instantaneous frequency over time, the Marginal Hilbert Spectrum can be written as

$$h(\omega) = \int_{0}^{T} H(\omega, t)\, dt$$

where $T$ is the duration of the record. Thus, Marginal Hilbert Spectra (MHS), as shown in Figure 6, permit an immediate visual inspection of the amplitude (i.e., energy) contribution of each frequency.
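In discrete form, the marginal spectrum amounts to accumulating, for each frequency bin, the amplitude found at that frequency over the whole record. The sketch below shows one possible way of doing this; the binning scheme, the function name `marginal_spectrum` and its parameters (`bin_width`, `f_max`) are assumptions made for illustration and do not reproduce the processing used for Figure 6.

```python
def marginal_spectrum(inst_freqs, inst_amps, fs, bin_width, f_max):
    """Accumulate instantaneous amplitude over time into frequency bins.

    inst_freqs, inst_amps: one list per IMF, sample by sample.
    Returns a list where entry b covers [b*bin_width, (b+1)*bin_width) Hz.
    """
    n_bins = int(f_max / bin_width)
    spectrum = [0.0] * n_bins
    dt = 1.0 / fs  # time step of the discrete integral
    for freqs, amps in zip(inst_freqs, inst_amps):
        for f, a in zip(freqs, amps):
            if 0.0 <= f < f_max:
                idx = min(int(f / bin_width), n_bins - 1)
                spectrum[idx] += a * dt
    return spectrum


# Hypothetical instantaneous values for two short modes (illustrative only).
freqs = [[105.0, 105.0, 115.0], [15.0, 15.0, 15.0]]
amps = [[1.0, 1.0, 2.0], [0.5, 0.5, 0.5]]
spec = marginal_spectrum(freqs, amps, fs=1.0, bin_width=10.0, f_max=200.0)

for b, value in enumerate(spec):
    if value > 0.0:
        print(f"{b * 10.0:.0f}-{(b + 1) * 10.0:.0f} Hz: {value}")
```

Because the time axis is integrated out, the result shows how much each frequency contributes overall, but not when it occurs; the full Hilbert Spectrum is still needed for that.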
As can also be seen in Figure 7, each IMF evolves over time in its own manner, but all of them can be related to the trend of the Total Energy $E_{tot}$.
The pre-intervention record (track #1) presents a spectrum generally composed of frequencies lower than those prominent in the post-operative tracks. This evidence agrees with two of the symptoms generally associated with vocal cord nodules: the shrinking of the voice range and the difficulty in producing the highest frequencies.
It can also be noticed in Table 1 that no comparison between the IMFs is possible in terms of Energy Content. The values of $E_n$ fluctuate from one record to the other in a fashion that shows no noticeable trend. Only two prominent results can be clearly seen here. Firstly, if IMF 11 is excluded, all the other modes seem to reproduce, broadly speaking, the behaviour of the total energy content, $E_{tot}$. As mentioned before, nodules seem to have a greater effect on the IMFs that contain higher frequencies. Thus, IMF 11 should be the least affected of them all; furthermore, its contribution to $E_{tot}$ is negligible (always less than 1%), both in general and when compared with the other modes included in the subset. Secondly, it is possible to observe that, before surgery, the Energy Content was very different between modes. These differences are strongly attenuated nine months after the operation; the vertical bands in Figure 10 show this plainly. This can be seen as evidence that, when the voice is healed and back to physiological condition, the energy is distributed between modes in a more uniform fashion. However, it was found that $E_n$, as well as $E_{tot}$ and any other possible energy-related index, is too strongly affected by variations of the voice volume to produce any reliable result when different tracks are compared; the mean frequencies and standard deviations of the instantaneous frequencies provided results that look much more reliable.

Conclusions.
The main aim of this research was to investigate the viability of the HHT and CEEMDAN algorithms for tracking the healing process of vocal cord tissues during convalescence, as well as for characterising the immediate aftermath of the removal surgery. A subset was extracted from the decomposition of the audio tracks into IMFs, specifically IMFs 7 to 11; the evolution of these five modes was studied across the four audio tracks provided. In particular, four parameters were investigated: the IMF Mean Frequency ($\bar{f}_n$), the IMF instantaneous frequency Standard Deviation (SD), the Total Energy Content ($E_{tot}$) and the IMF Energy Content ($E_n$). Some tentative conclusions have been drawn from the observed results.
Regarding the trend of $\bar{f}_n$ over time, all IMFs produced similar results: peak values just after surgery, followed by stabilisation at values not far from the pre-operative ones. This suggests that vocal cord nodules do not substantially affect the frequency content of the voice, as expected, while surgery does, even if for a limited period of time. Higher-numbered modes, which are related to lower frequencies, prove to be less affected by the surgery in the short run. Post-operative peaks can be related to a decrease in mass, due to the nodules' removal; to an increase in stiffness, due to tissue scarring; or to a combination of both these effects plus some other mechanisms. The settling of $\bar{f}_n$ in the long run, one year after the removal surgery, is most likely due to an overall reduction of stiffness, which can be related to the healing of scars.
IMFs 8, 9, 10 and 11 showed a higher Standard Deviation (SD) in healed conditions than in pathological ones. This increase is noticeably less marked for higher-numbered modes, which again prove to be less influenced by both the nodules' presence and the aftermath of surgery. This larger variability of the instantaneous frequencies could be a consequence of the broader frequency range of the healed voice. The counterintuitive behaviour of IMF 7 may result from the non-homogeneous effects of the nodules, as lower-numbered modes proved to be more affected and this "deviation" from the otherwise regular trend is limited to the first, pre-operative record. It was observed that the range of $E_n$ values among the five IMFs reduced noticeably after convalescence, suggesting that the energy is more uniformly distributed between modes in healthy vocal conditions. Moreover, the Marginal Hilbert Spectrum of track #1 was composed mostly of frequencies much lower than those prominent in the following records, thus highlighting the reduction of vocal range to lower-than-usual frequencies that ordinarily occurs with the onset of vocal fold nodules. However, apart from these two observations, energy-related features were not considered reliable. The overall amplitude of the four signals proved to be too heavily influenced by voice volume, a parameter that was not taken into account during the recording sessions. Thus, $E_{tot}$ and $E_n$ cannot be considered completely trustworthy parameters. By comparing SD (which rebounds slightly or stabilises for all the investigated IMFs between track #3 and track #4) with $E_{tot}$ and $E_n$, which instead decrease in all cases, it is also possible to state that removing the nodule masses from the vocal cords allows them to vibrate more freely, producing a wider range of frequencies while spending less energy.
To sum up, some points become clear from the investigation performed: (1) The Hilbert-Huang Transform works, as expected, for the analysis of non-stationary data originating from a nonlinear source, as in the case of the human voice; theoretically speaking, the tool is well suited to the non-invasive analysis of vocal fold conditions through speech processing. Unlike other current methodologies, the HHT performs time-frequency analysis without requiring any assumption of stationarity of the signal or linearity of the system, thus avoiding the risks inherent in the oversimplifications that constrain other techniques.
(2) The EMD-based approach allows the different frequencies to be analysed separately, even if the decomposition itself, being empirical, is not free from issues. Since there is no guarantee that the decomposition will produce the same number of IMFs, if the content of these Mode Functions is not clearly similar (as it happened to be in this particular case), the Hilbert-Huang Transform itself would still be viable, but the results would be much more difficult, or even impossible, to compare.
(3) Energy Content, both for the overall signal ($E_{tot}$) and for each IMF inspected ($E_n$), is too strongly influenced by the volume of the patient's voice. Some information can nonetheless be extracted from their trends, but these results must be handled with great care and not uncritically.
(4) IMF mean frequency ($\bar{f}_n$) and instantaneous frequency standard deviation (SD) showed clear trends, both during post-operative convalescence (long run, tracks #2, #3 and #4) and immediately after surgery (short run, tracks #1 and #2). These results are by far the most interesting ones and illustrate the possibilities offered by time-frequency analysis for monitoring the follow-ups after surgical intervention through speech processing.
Despite all the difficulties and limitations encountered, the authors find that the HHT proved to be technically feasible for the proposed aim. $\bar{f}_n$ and SD can easily be included as additional parameters in MDVP™ or in any other software for Acoustic Voice Analysis. Nevertheless, the technical obstacles posed by the need for comparable IMFs between different tracks may simply be too great to make the method advantageous in economic terms. Moreover, all the results should be confirmed in a large, statistically valid population before being definitively accepted. Comparing them with records of patients affected by other voice disorders will also help to determine whether the patterns encountered are typical of the pathology studied or are detectable in other kinds of disease, too. Even so, the results are encouraging, as the trends in the time-frequency analysis are evident, and they encourage the authors to test other options, especially Wavelets, which could probably overcome the technical problems that affect HHT and EMD and provide clearer answers.