Ambulatory assessment of phonotraumatic vocal hyperfunction using glottal airflow measures estimated from neck-surface acceleration

Phonotraumatic vocal hyperfunction (PVH) is associated with chronic misuse and/or abuse of voice that can result in lesions such as vocal fold nodules. The clinical aerodynamic assessment of vocal function has been recently shown to differentiate between patients with PVH and healthy controls to provide meaningful insight into pathophysiological mechanisms associated with these disorders. However, all current clinical assessment of PVH is incomplete because of its inability to objectively identify the type and extent of detrimental phonatory function that is associated with PVH during daily voice use. The current study sought to address this issue by incorporating, for the first time in a comprehensive ambulatory assessment, glottal airflow parameters estimated from a neck-mounted accelerometer and recorded to a smartphone-based voice monitor. We tested this approach on 48 patients with vocal fold nodules and 48 matched healthy-control subjects who each wore the voice monitor for a week. Seven glottal airflow features were estimated every 50 ms using an impedance-based inverse filtering scheme, and seven high-order summary statistics of each feature were computed every 5 minutes over voiced segments. Based on a univariate hypothesis testing, eight glottal airflow summary statistics were found to be statistically different between patient and healthy-control groups. L1-regularized logistic regression for a supervised classification task yielded a mean (standard deviation) area under the ROC curve of 0.82 (0.25) and an accuracy of 0.83 (0.14). These results outperform the state-of-the-art classification for the same classification task and provide a new avenue to improve the assessment and treatment of hyperfunctional voice disorders.

Introduction Voice disorders affect approximately 6.6% of the working population in the United States [1] and can have devastating psychological and social-economic consequences on those impacted. The most common voice disorders are chronic or recurring conditions that are believed to be caused by detrimental patterns of vocal behavior, referred to as vocal hyperfunction [2]. Such behaviors are often associated with trauma-induced lesions of the vocal folds (e.g., nodules, polyps), which we refer to as phonotraumatic vocal hyperfunction (PVH) [3]. Despite the significance prevalence of hyperfunctional voice problems, effective prevention and clinical management continues to be hampered by limited knowledge of the etiological and pathophysiological mechanisms related to these disorders. For example, even though daily voice use is often assumed to be a critical factor, the actual relationships between daily voice use and vocal hyperfunction is not well understood.
There have been some recent attempts to better characterize hyperfunctional voice disorders. In an expansion of previous work [2], it has been more definitively demonstrated that glottal aerodynamic measures of subglottal air pressure, and glottal airflow (normalized by sound pressure level) can be used to identify phonatory mechanisms associated with vocal hyperfunction that are distinctly different from normal vocal function [4]. These glottal airflow measures were obtained in the laboratory using a circumferentially vented (CV) pneumotachograph mask to capture oral airflow with a bandwidth of approximately 0 Hz to 1.2 kHz [5]. The oral airflow waveform was then inverse filtered (e.g., [2], [6], [7], [8]) to remove the influence of the vocal tract, and thus estimate clinically parameters of the glottal airflow waveform, such as peak-to-peak AC flow (ACFL), open quotient (OQ), and maximum flow declination rate (MFDR). In terms of clinically interpretability, the works of [2] and [4] provide a robust framework for which aerodynamic measures are useful to differentiate vocal hyperfunction from normal voice.
The aerodynamic-based differentiation between normal vocal function and pathophysiological mechanisms of PVH has been further supported and elucidated in recent investigations employing computer modeling. In particular, these studies have demonstrated that the elevation of ACFL and MFDR can be associated with the compensation that is necessary for individuals with PVH to maintain normal loudness [9], [10] in the presence of vocal fold pathology. This compensatory behavior contributes to what has been described clinically as a "vicious cycle" of continued concomitant increases/worsening of phonotrauma and PVH. Such compensation presents additional challenges in attempting to identify purely etiological factors. The present work focuses only on PVH subjects with the mentioned compensatory behavior, and not necessarily discriminates subjects with other characteristics of vocal fold pathologies, such as for example incomplete glottal closure. Work related to the analysis of aerodynamic measures for normal subjects and subjects with unilateral vocal fold paralysis can be found in [11] and [12] Ambulatory voice monitoring technology has been developed over several decades to investigate daily voice use. Our group has developed a smartphone-based ambulatory voice monitor (see Fig 1) that uses an application to capture and store the high-bandwidth signal from a light-weight accelerometer (ACC) attached to the front of the neck below the thyroid prominence and can be comfortably worn for multiple days at a time [3], [13]. Measures typically extracted from the voice monitor recordings are based on estimates of sound pressure, level, (SPL), fundamental frequency, and voicing duration, including cumulative vocal dose parameters such as phonation time, cycle dose, and distance dose [14]. Univariate statistical analysis of long-term data from individuals with PVH and matched healthy-controls have not shown the expected differences between overall average measures of voice use (i.e., PVH subjects did not, on average, talk more or louder than healthy controls), which suggests that such measures may not be directly useful clinically in helping to identify relevant aberrant vocal behaviors [15]. However, using features derived from these measures (mostly higherorder distribution-based statistics) in a supervised classification task demonstrated statistically significant differentiation between individuals with PVH and healthy controls, with an area under the ROC curve (AUC) of 0.705 and F-score of 0.630 for a small dataset [16]. Analysis of a larger dataset with 102 subjects (51 patient-control pairs) resulted in an AUC of 0.739 and F-score of 0.766 [3]. While promising, these findings may not be readily translated for clinical use because the level of performance may still not be adequate (marginal ability to differentiate between normal and disordered subjects), and because the resulting features do not provide direct insights into underlying pathophysiological mechanisms associated with vocal hyperfunction-i.e. the features are based on measures extracted from the voice acoustic output signal which cannot provide information about the specific physiologic parameters/mechanisms that produce voice (e.g. glottal volume velocity source characteristics). Similar limitations are observed in recent deep learning approaches [17], that still lack physiological and clinical relevance since they operate in sustained vowel scenarios and do not provide additional insights for voice therapy or biofeedback. Further efforts are need to advance ambulatory monitoring of voice with physiologically relevant features that can help to identify vocal hyperfunction.
In this study, we investigate whether ambulatory estimates of glottal airflow parameters can significantly differentiate between normal vocal activity and activity associated with PVH. This is the first analysis of ambulatory estimation and assessment of aerodynamic measures using a large group of PVH subjects. There is evidence in physical models [9] and real subjects [4] that PVH behavior manifests by compensation of SPL by producing higher levels of ACFL than normal voice function. Recognizing these features could improve clinical assessment of PVH by combining the advantages of glottal airflow measures and ambulatory monitoring. To accomplish this task, we used and extended an impedance-based inverse filtering (IBIF) scheme to estimate the high-bandwidth glottal airflow waveform from the neck-surface ACC signal [18]. This is the first effort to advance the IBIF algorithm into an ambulatory scenario, as the original study [18] only used sustained vowels and laboratory conditions. Thus, some additional considerations and details for the IBIF scheme are provided for this purpose. Note that Mehta et. al. [3] plotted the distribution of MFDR for a week for a single subject as proof of concept that IBIF could be potentially used to extract aerodynamic features. However, no quantitative analysis was performed for that case study.

Subglottal impedance based inverse filtering for ambulatory monitoring of voice
In this section, the IBIF algorithm [18] is summarized but also extended and optimized for ambulatory voice monitoring. The IBIF is a model-based scheme to estimate the glottal airflow from neck-surface acceleration [18]. The method uses a mechano-acoustic transmission line model to account for the acoustic propagation in the subglottal system and neck skin characteristics. The scheme is illustrated in Fig 2, where the electrical equivalent circuit shows the interconnection between the subglottal tracts above and below the location of the accelerometer (sub1 and sub2, respectively) and load impedance of the skin Z skin , that also includes the radiation load of the accelerometer sensor Z rad . The glottal airflow signal estimateû g ðtÞ to be obtained from the accelerometer signal _ u skin ðtÞ is calculated using Eq (1): where F À 1 ð�Þ is the inverse Fourier transform, H sub1 (ω) = U sub1 (ω)/U sub (ω) is the transfer function of subglottal section sub1 (see Fig 2), A acc the accelerometer area (cm 2 ), M acc the accelerometer mass (gr), and _ U skin ðoÞ is the acceleration signal in frequency domain. Z sub2 and H sub1 are calculated using an anatomically based, acoustic model of the subglottal system [18][19][20]. Z rad corresponds to the radiation impedance from the accelerometer. All frequency and time-domain expressions are sampled and processed appropriately [21].
In order to use IBIF as a signal processing tool, subject-specific parameters need to be estimated. These IBIF parameters are scaling factors that adjust default values of the mechanical impedance model of neck skin surface, length of the trachea, and accelerometer location. The parameters are represented in a set Q = {Q i } i=1,. . .,5 for neck skin resistance R m , mass M m , and stiffness K m , as well as length of the trachea L trachea and accelerometer placement L sub1 . Each of these Q parameters is bounded to maintain physiological plausibility [18]. The magnitude terms in Eq (3) are the default values for each parameter [22], which are scaled for normalized Q factor as, R m = 2320 � Q 1 in (g � s −1 � cm −2 ), M m = 2.4 � Q 2 in (g � cm −2 ), K m = 491000 � Q 3 in (dyn � cm −3 ), and for Z sub2 (ω), L trachea = 10 � Q 4 , and L sub1 = 5 � Q 5 are in (cm). Note that default model parameter are obtained for Q = [1, 1, 1, 1, 1] [18]. Using these subject-specific factors will allow to filter out neck-skin and subglottal resonances, making the estimated glottal airflow signals comparable between subjects.
To obtain subject-specific IBIF parameters, we compare the IBIF-derived glottal airflow waveform estimates with that from the current gold standard, namely an inverse filtered glottal airflow signal obtained from recordings using a CV pneumotachograph mask [5]. Inverse filtering in this case is a challenging task given the reduced bandwidth of the CV mask due to the airflow transducers (PT-2E, Glottal Enterprises) and the type of voices that will be analyzed (highpitched female voices exhibiting pathology). Inverse filtering of the oral airflow was performed using a semi-automatic approach, as recently described in [4]. This approach was particularly designed to inverse filter normal and pathological high-pitched voices from a CV mask signal.
Once we obtain an estimate of the glottal airflow from the CV mask, we run a Particle Swarm Optimization (PSO) scheme [23], which consists in the optimization of a non-linear continuous fitness function thorough the search of optimal "particles" (parameters) by searching its best set. For this case, PSO searches the optimal Q parameters that represent the subject's anatomical features. The fitness function in this optimization process needs to yield robust and consistent solutions. We minimize the following normalized weighted absolute error (NWAE) function, such that with and e i ðQÞ ¼ whereũ g is the CV mask-based inverse-filtered glottal airflow signal,û g is a time-aligned IBIFbased glottal airflow signal, Δ (i−1) the time-derivative operator of order (i − 1), and i represents the index of the corresponding error function e i and its weight w i . Each weighting w i was set to 0: � 3. The increased order of the time-derivative operator is used to balance the energy of higher harmonics in NWAE to avoid over-fitting in the low frequency range. Therefore, the optimization problem is stated as: where D = {D i } i=1,. . .,5 , is a set of restrictions for each parameter within the Q set that is designed to maintain physiological plausibility [18]. To reduce the computational load of PSO, several configurations of subglottal systems were pre-calculated (i.e., before the PSO algorithm started) for a set of equally spaced values of tracheal length and accelerometer position. Each pre-calculated (Z sub and H sub1 ) transfer function was indexed and retrieved inside the PSO algorithm. This approach substantially reduces the computational time of the optimization process.
The time-alignment of the oral airflow and acceleration signals is as follows. A first approximation is to align using the sample cross-correlation function [21] and find the maximum peak shifted in the neighborhood of mid-lag position [24]. To improve this initial approximation, a delay parameter d is added in the PSO algorithm by shifting the indices of signal vectors (oral airflow and neck acceleration). Since the shifted signal (oral airflow) is delayed for only a few samples, the search space is limited to Then, given N(� d 0 ) samples of data,ũ g andû g are replaced in (7) bŷ Note thatû gt ðnTÞ is a trimmed version ofû g ðnTÞ andũ gtd ðnTÞ is a trimmed, delayed version ofũ g ðnTÞ both with N − 2d 0 samples, where T is the sampling period. An initial value for d 0 was half the average glottal cycle duration.
In the case of incomplete glottal closure, coupling between the subglottal tract and vocal tract is embedded in the resulting dipole source [25]. Therefore, the glottal flow with all the source-filter interactions can be estimated without the need to model glottal coupling.

Experimental setup and participants
The human studies protocol used to collect the data for this study (Ambulatory monitoring of vocal function to improve voice disorder assessment: #2011P002376) was approved by the Institutional Review of the Partners Healthcare System-the Massachusetts General Hospital is a founding member of this organization. Dr. Robert E. Hillman is the PI on this protocol. Study participants were 48 pairs of adult females (total of 96 subjects) with each pair comprised of one patient with PVH (diagnosed with vocal nodules) and one normal control subject matched to the patient by age and occupation (see Table 1 for more details). Diagnoses were based on a complete team evaluation by laryngologists and speech-language pathologists at the Massachusetts General Hospital Voice Center that included (a) a complete case history, (b) endoscopic imaging of the larynx, (c) aerodynamic and acoustic assessment of vocal function based on Mehta et. al. [26], (d) a patient-reported Voice-Related Quality of Life questionnaire, and (e) a clinician-administered Consensus Auditory-Perceptual Evaluation of Voice assessment (CAPE-V). All patients were enrolled prior to the administration of any voice treatment. Written informed consent was obtained from all subjects. All subjects were 18 years of age or older. Due to the higher incidence of female patients with PVH than men in the overall population [27], only women were subjects for this study. Zhukhovitskaya et. al. [28] have shown significant differences (p < 0.0001) in the number of bilateral midfold lesions between males and women. Moreover, the inclusion of men would create confounding variables due to sex-specific characteristics. The matching is done to normalize for general vocal behavior differences. For example, males and females have anatomical differences, there are voice changes with age (for example, presbyphonia usually occurs when people gets older), and the type of occupation is related to how much voicing is used during a typical day at work. On the other hand, the subject-specific parameters from IBIF are normalized for each individual, so signals can be comparable, due to differences in neck-skin and subglottal anatomy. Therefore, these are not matched on healthy-patient pairs. Each subject was recorded as they engaged in normal daily activities during one week using the smartphone-based ambulatory voice monitor [3,13]. The system employs an accelerometer attached to the front of the neck below the larynx as the phonation sensor (see Fig 1). The sampling frequency was 11,025 Hz and the average total recording time for a subject was approximately 80 hours, as in [15] [3].
Each subject underwent a session in the laboratory to obtain a subject-specific calibration for the IBIF algorithm. The session involved simultaneous and synchronous recordings of CV mask-based oral airflow and neck skin acceleration in an acoustically treated room. Each subject performed a series of sustained vowels gestures (/a/ and /i/) with a constant pitch using comfortable and loud (approximately 6 dB increase) voice. For each gesture, a bandpass filter (60 − 1100 Hz) oral airflow vowel segment was used to perform inverse filtering with a single notch filter constrained to unitary gain at DC [29].
Once a glottal airflow approximation is obtained from the CV mask, Q parameters are estimated using the optimization scheme described in the Subglottal Impedance Based Inverse Filtering for Ambulatory Monitoring of Voice section. The whole process, from estimation of parameters to classification and statistical analysis was done with MATLAB (The MathWorks, Inc.).

Ambulatory glottal airflow assessment
Estimates of individual Q parameters, which were assumed to be time-invariant for each subject, were applied in Eq (1). The assumption of time-invariance is due to the properties of the neck skin, which should not change over time. Preliminary studies of the use of IBIF calibrated for a single vowel [30] and on the variability of these calibrated parameters [31], have shown that using a sustained vowel works well on running speech (e.g., the rainbow passage). Current research aims to explain in more detail the estimation and variability of these parameters under different speech conditions. _ u skin ðkÞ the discrete time-domain equivalent of the acceleration signal _ U skin ðoÞ, is convolved with t skin (k) the inverse transfer function of the skin in time domain, where its frequency domain expression is represented by Eq 2.
By taking the inverse fast Fourier transform (IFFT) with 1102 coefficients, we obtain t skin (k), a FIR filter. We take every consecutive hour of the acceleration signal _ u skin ðkÞ and convolved it with t skin (k) to obtain the estimated glottal flow signalû g ðkÞ. This signal was segmented into 50 ms non-overlapping windows. Voiced frames from the ACC signal were identified based on the same voice activity detection algorithm used in [3], where a combination of periodicity and spectral metrics whether a frame is voiced or unvoiced. In addition, we discarded frames in which the absolute ratio of the RMS values of the first half divided by the second half of the frame was greater than a threshold (1.5); thus, frames exhibiting onsets or offsets were removed since they typically result in incorrect inverse filtering estimates due to cycle-by-cycle variations in the signal. As with many inverse filtering methods [32], IBIF has difficulty analyzing signal with high f 0 values due to the short closed phase during which vocal tract information must be estimated (females and singers, especially, produce high-pitched phonation). Performance of traditional glottal inverse methods could be accurate up to a f 0 of 400 Hz [33]. By visual inspection, the estimation of IBIF voiced frames deteriorated around a f 0 of 500 Hz. Thus, voiced frames with f 0 higher than 500 Hz were not processed by IBIF. Future research will analyze sensitivity tests to find the range of frequencies for which the IBIF method fails. Table 2 lists the 11 glottal airflow measures computed within each analyzed frame. Fig 3  shows an example of the estimated glottal airflow signal and its derivative for a single frame. Since the accelerometer is an AC signal, the glottal airflow does not have a DC component. As in previous studies [2,4,34], ACFL was obtained as the difference between the maximum and minimum amplitude (peak-to-peak) within each glottal cycle. We also included 4 additional measures derived from the time-domain measures: • Logarithmic versions of ACFL and MFDR squared: 10log 10 |ACFL| 2 (dB) and 10log 10 |MFDR| 2 (dB).
• SPL normalized by ACFL (dB) and MFDR (dB): SPL/(10log 10 |ACFL| 2 ) and SPL(/10log 10 |MFDR| 2 ). Estimates for SPL are calculated using a linear regression equation: y = mx + b, where m and b are the coefficients from the subject obtained from accelerometer amplitude (x) and corresponding acoustic SPL (y). The calibration is done daily in the morning with a handheld microphone yielding the reference SPL [13], [35]. These ratios have shown to be significantly different between PVH and control subjects [4]. Given that many of the glottal airflow features applied for vocal hyperfunction analysis are cycle-based [2], [4], [34] and multiple glottal cycles occur within each 50 ms frame, we computed average features across all glottal cycles in each frame. The idea was to provide a more consistent estimate of each measure, especially given the inherent fluctuations from continuous speech in the ambulatory signal. Fig 4 shows the spectrum of the estimated glottal airflow, from which spectral measures H1-H2 and harmonic richness factor (HRF) were computed. These measures have been correlated with voice quality [36], [37].

Week-long univariate statistics for paired hypothesis testing
The purpose of the following series of tests is to find the most differentiating statistics between the PVH group and controls. Within-subject univariate statistics were calculated for each week-long time series data from each subject: mean, median, 5th percentile (trimmed minimum), 95th percentile (trimmed maximum), standard deviation, skewness, and kurtosis. These statistics were used for paired t-tests with 48 data points (number of subject pairs). Normality was tested with a Chi-square goodness-of-fit test, and each statistic was not significantly different from a normal distribution with p < 0.05. The false discovery rate is described by Eq (11), where V is the percentage of false positives (type I error) and S is the percentage of true positives. Since the false discovery rate is an expectation, we have m possible outcomes from the hypothesis tests.
If we have H 1 , H 2 ,. . .H m independent hypotheses, Benjamini-Hochberg (BH) [38] showed that regardless of how many null hypotheses are true and regardless of the distribution of the Vocal hyperfunction and estimated ambulatory glottal airflow measures p-values, when the null hypothesis is false, we have the following property [39]: where U is the proportion of true negatives. By setting α = 0.1, the procedure sorts the m p-values and defines a threshold L: We reject all hypotheses H k for which p k � p (L) , the BH rejection threshold. This procedure will find those statistics with at most an α false discovery rate between PVH subjects and controls. It is important to remember that the false discovery rate is not the same as the type I error, but is the expected proportion of false positive features among the list of features that are significant according to the test. An example in reference [39] (page 687) uses a false discovery rate of 0.15, which is typical for analyses that are exploratory in nature [40]. In this case, we find the most differentiating statistics using this test, in contrast with a Bonferroni-corrected ttest, which yields a conservative comparison for which there is no statistically significant difference between any statistic.

Supervised classification task
Following the same procedure as Ghassemi et al. [16], each subject's weeklong ambulatory recording was subdivided into 5-minute windows (6000 frames, nonoverlapping). Only windows exhibiting voicing were only included in the classification task; voiced windows were defined as containing at least 0.5% voicing (30 voiced frames). We then calculated the following univariate statistics over the voiced frames within each window for each measure in Table I: mean, median, 5th percentile, 95th percentile, standard deviation, skewness, and Vocal hyperfunction and estimated ambulatory glottal airflow measures kurtosis. Windows with less than 0.5% voicing were discarded due to data sparsity. Each window-based statistic was z-normalized (subtracting by the mean and dividing the result by the standard deviation) in two ways: a) by week, across voiced windows from all subjects (PVH and controls) and b) by day, across voiced windows within their respective days.
The full feature vector is composed of 154 features: 77 weekly and 77 daily z-score normalization the features derived from the 7 window-based univariate statistics for each of the 11 frame-based glottal airflow measures in Table 2. Since we only have a small amount of training data, we reduce feature dimensionality before training. As a first pass, forward feature selection (FFS) [41] is applied to the full feature matrix. The procedure is a greedy search algorithm that starts with an empty set I and iteratively selects a new feature x from the set of features not in I that minimizes a cost function J (a quadratic discriminant analysis classifier). The feature x is added to I, and the procedure is repeated until a threshold (10 −6 in this case) of consecutive results is achieved. E is the quadratic discriminant analysis classification error using 5-fold cross-validation. The final reduced feature vector is composed of 55 features. It is worth mentioning that this subset is suboptimal since further reduction can be achieved through LASSO selection, which is applied later on. We use these features to build both logistic regression and support vector machine (SVM) supervised classifiers.
Logistic regression is a type of discriminative classifier that models the class-conditional probability as: where x 2 R n is the feature vector, y = 1 2 R l is the class labeled as y i = 1 (PVH) or y i = 0 (control), and β is the vector of coefficient weights. In order to find the coefficients β, we maximize the following penalized log-likelihood using N data points of the training set with p feature vectors: where x i is the data point for instance i and λ is the regularization parameter for the LASSO constraint. The L 1 penalty reduces the number of features used in the model. SVMs are commonly used machine learning tools for classification [42]. The weight vectors w 2 R n are optimized to create a linear L 1 SVM classifier: where C is a regularization parameter similar to λ for logistic regression. The goal is to create a sparse w that solves the L 2 -loss support vector classifier [43]. Fig 5 shows a flowchart of the feature extraction and classification process. We first divided data using leave-one-out cross-validation to generate 48 datasets, each consisting of 47 training pairs and one test pair. All windows from the 47 training pairs (94 subjects total) were then subdivided using 5 cross-validation (1/5th validation and 4/5ths training in each fold). The validation sets are used to find the best set of parameters with respect to the area under the ROC curve (AUC) and these are selected for the model to be used in the test set. The following metrics are used to check the performance of the logistic model on the test pair: AUC, F-score, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). From this procedure, we test two scenarios: Classification with all the features after selection and using subsets of those features. The latter is done by sorting the absolute Beta values and running L 1 logistic regression again by starting with all selected features. Then we took out the feature with lowest Beta value in magnitude and ran the classification again, and so on. The positive Beta weights are associated with subjects with PVH, whereas the negative weights are associated with control subjects.  different at the 95% confidence level. Minimum and median ACFL were the most discriminative statistics, with medium effect sizes (Cohen's d [44]) of 0.59 and 0.55, respectively. In general, statistics of the ACFL measure had the best differentiating power among all the weeklong paired t-tests. In contrast, average values for estimated SPL for subjects from the same database were not significantly different between subjects with PVH and control subjects [15] [3]. This result suggests that high ACFL values are potentially good indicators of subjects with PVH, if the SPL distributions of both groups are statistically similar. Table 4 shows a summary of the classification results for both implemented classifiers using the multiple performance metrics. Fig 6 displays performance of the L 1 logistic regression classifier for each of the 48 pairs for a subset of the performance metrics. There is a large spread of AUC scores across the subjects with an average of 0.82. AUC scores less than 0.5 indicate that the model places weight on positive examples versus negative ones and vice versa. The large AUC variance, including values less than 0.5, could be explained from the labels; e.g., subjects with PVH do not always exhibit vocal behavior typical for the pathology, whereas control subjects might exhibit some vocal behavior that differs substantially from healthy vocal behavior.

Supervised classification task
Logistic regression and SVM have similar good results on all performance metrics. Since L1 regularization was used in both cases, it could be that the removal of redundant features in every training case helped the performance. The mean (standard deviation) of the performance metrics for both classifiers improved when compared with previous results on 51 matched-paired subjects: 0.74 (0.27) for AUC, 0.77 (0.20) for F-score, 0.74 (0.30) for sensitivity, and 0.77 (0.29) for specificity [3]. Fig 7 shows the proportion of labels classified as positive Table 3. Top 11 week-long summary statistics (from a total of 77) sorted by p-value from the 48 paired t-tests. Statistically significant differences ( � ) were found by applying the Benjamini-Hochberg method using a false discovery rate of 0.1.  Table 4. Classification performance of L1 logistic regression (L1-LR) and support vector machine (SVM) approaches for 96 subjects using IBIF features. Mean (standard deviation) is reported for the performance metrics. Previous results using 51 pairs [3] and 20 pairs [16] are also shown. It is worth noting that the distribution of metrics such as AUC, across all models, may be non-normal and may benefit from other summary statistics such as median (IQR). Vocal hyperfunction and estimated ambulatory glottal airflow measures (VH) for all subjects. 79 subjects from 96 were classified correctly by using a threshold of 0.57. This corresponds to 82.3% of accuracy. Feature selection is important for identifying the most relevant features that can help to further understand the underlying process, as well as reducing the complexity for future biofeedback applications. Table 5 shows the total number of features (26) that were present in all 48 models after using LASSO with the resulting 55 features after FFS. Table 6 shows the results for all 26 models and the subset of features by sorting beta values. The mean F-score is stable in the 0.7 region until the number of features is 9. After that, the performance degrades moderately, where the AUC is 0.68 and the accuracy is 0.71 with only 7 features. Fig 8 shows boxplots of the same models versus F-score, where we can see the same trend: classification performance is more or less similar if we left in 9 features or more in the classifier. Fig 9 shows the association counts of features with PVH subjects as odds ratios. Odds ratios represent the association with a one-unit increase in the features. These features represent a combination of time and frequency-domain features that were consistently present in all 48 logistic regression models with p < 0.05 [3]. The 95th percentile of H1-H2 (daily normalized) had a large association with PVH labels, which is a voice measure correlated with voice quality [36]. However, the large confidence interval for this feature represents low level of precision of the odds ratio. The 95th percentile ratio of SPL and MFDR (MFDR' 95%ile in Fig 9) has a Vocal hyperfunction and estimated ambulatory glottal airflow measures moderate association compared to the rest of the features with a small confidence interval, representing a higher precision on the odds ratio. The current study sought to determine whether optimized IBIF-based estimates of glottal airflow measures extracted from ambulatory voice (accelerometer-based) recordings can be used to differentiate between normal vocal function and pathophysiological mechanisms associated with PVH. Results showed that this approach can be quite successful in classifying subjects as being normal or having PVH. Within-subject univariate analyses identified eight aerodynamic features that were statistically different between the patient and matched control groups. ACFL was the most significant measure with medium effect sizes exhibited. These findings are in agreement with previous laboratory studies that used measures extracted from the inverse filtered oral airflow [2], [4] and with computer modeling that suggests that increases in ACFL may reflect the type of increased compensatory effort (i.e., increased vocal hyperfunction) that is necessary for PVH patients to maintain adequate phonation in the presence of vocal fold fold trauma/lesions [9], [10]. Such increases in vocal effort are believed to reflect the "vicious cycle" of progressive concomitant increases in PVH and vocal fold trauma that contribute to perpetuating these disorders.

Method
From Table 4, use of the IBIF-based glottal airflow measures in the supervised classification task produced results that outperformed previous reports that used acoustic-based features extracted from ambulatory recordings of the acceleration signal to differentiate between Vocal hyperfunction and estimated ambulatory glottal airflow measures subjects with PVH and normal controls [16], [3]. The improvement in performance using IBIF-based features, in combination with the capability of such features to provide better insights into pathophysiological mechanisms, supports the potential that this approach has to improve the clinical assessment of hyperfunctional voice disorders. Future research could  Vocal hyperfunction and estimated ambulatory glottal airflow measures explore the performance of IBIF-based features with other pathologies, such as unilateral vocal fold paralysis [11]. There are several limitations to the current study which may serve to constrain any addition improvement in classification performance. First, even though the use of univariate statistics over 5-minute windows showed good performance, such an approach could smooth out fast variations in some features that may provide important information related to pathophysiology. Moreover, discarding silence periods from the analysis windows might be eliminating information that could further differentiate normal and pathological vocal function by indicating relative differences in non-vocal (non-phonatory) recovery times.
In addition, determination of the IBIF Q parameters is based on accurate estimates of the glottal volume velocity waveform obtained by inverse filtering the oral airflow recorded in the laboratory during sustained vowel production. However, the process of inverse filtering to estimate the glottal flow is still a topic of research and any method will have a degree of error (see [7], [8] for general discussions). The inverse filtering process is particularly challenging when applied to pathological female voices, as was done in this study. The process was made even more demanding by that fact that many subjects in this study were singers who regularly reached very high pitches (above 400 Hz) daily during practice that tend to cause the inverse filtering and IBIF methods to fail. In addition, every feature has an associated uncertainty from the accelerometer measurements, and the task becomes difficult when we combine multiple estimated features (e.g., for the SPL-normalized measures of ACFL' and MFDR') since those errors may propagate and increase the total uncertainty in an ambulatory setting.
Finally, the task of differentiating normal and pathological subjects was made more difficult because the patients with PVH in this study were classified as having only mild-to-moderate voice disorders. We know from clinical experience that such patients can display periods of seemingly normal vocal function, and, conversely, normal speakers can display transient episodes of VH that do not develop into chronic conditions. Future studies could attempt to address these issues by developing estimates of uncertainty for the extracted IBIF parameters and using other analysis methods such as unsupervised learning to better pinpoint specific segments of abnormal vocal function, as has been initially demonstrated in [45]. In addition, efforts to incorporate aerodynamic features in the framework of ambulatory biofeedback to improve voice therapy are currently underway [46].

Conclusion
An ambulatory approach that correctly identifies the instance, duration, and type of incorrect vocal behaviors during daily activities has the capability to provide transformative advancements for the assessment, monitoring, and treatment of vocal hyperfunction. In this study, we further develop prior ambulatory efforts, by improving the ability to discriminate pathological voices from healthy ones. Using an impedance-based inverse filtering scheme to estimate the unsteady glottal airflow component from a neck-surface accelerometer and a smartphone platform, we obtain and quantify, for the first time in an ambulatory assessment and a comprehensive framework, aerodynamic features that have been shown to be physiologically relevant for vocal hyperfunction in recent laboratory settings and computational studies. Prior efforts to obtain aerodynamic features from neck surface acceleration were limited to sustained vowels [18] and simple proof of concept examples [3]. The result of our comprehensive quantitative analysis show that these ambulatory glottal airflow measures can be successfully used to differentiate between normal vocal function and pathophysiological mechanisms associated with phonotraumatic vocal hyperunction, and outperform state-of-the-art reports using sound pressure level, fundamental frequency, and related vocal doses. Due to its physiological relevance, the proposed aerodynamic ambulatory approach has already potential to improve the clinical assessment of hyperfunctional voice disorders, including the evaluation of treatment outcomes. Thus, future efforts will be focused on further relating ambulatory aerodynamic features to vocal therapy and real-time biofeedback.