Combined Generative Adversarial Network and Fuzzy C-Means Clustering for Multi-Class Voice Disorder Detection with an Imbalanced Dataset

The world has witnessed the success of artificial intelligence deployment for smart healthcare applications. Various studies have suggested that the prevalence of voice disorders in the general population is greater than 10%. An automatic diagnosis for voice disorders via machine learning algorithms is desired to reduce the cost and time needed for examination by doctors and speech-language pathologists. In this paper, a conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm called CGAN-IFCM is proposed for the multi-class voice disorder detection of three common types of voice disorders. Existing benchmark datasets for voice disorders, the Saarbruecken Voice Database (SVD) and the Voice ICar fEDerico II Database (VOICED), use imbalanced classes. A generative adversarial network offers synthetic data to reduce bias in the detection model. Improved fuzzy c-means clustering considers the relationship between adjacent data points in the fuzzy membership function. To explain the necessity of CGAN and IFCM, a comparison is made between the algorithm with CGAN and that without CGAN. Moreover, the performance is compared between IFCM and traditional fuzzy c-means clustering. Lastly, the proposed CGAN-IFCM outperforms existing models in its true negative rate and true positive rate by 9.9–12.9% and 9.1–44.8%, respectively.


Introduction and Literature Review
Voices are crucial for human beings. We communicate with others and express our emotions and ideas through our voices. Throughout our lives, we may encounter people who have voice disorders, which involve abnormalities in quality, tone, volume, and pitch. Voice professionals, whose jobs are strongly related to vocal load and/or voice quality, are highly affected by voice disorders [1]. Examples include (i) teachers, who require a high vocal load and a medium quality; (ii) television presenters, who require a medium vocal load and a high quality; and (iii) actors, who require a high standard in both vocal load and quality. A healthy voice that fulfills one's professional needs is important to ensure the best performance.
Various epidemiological studies have suggested a high prevalence of voice disorders. In [2], a prevalence rate of 21% was observed in a sample size of about 4800 people. Another study [3] was conducted in Sweden, in which 16.9% of 114,538 people had voice disorders. Attention was likewise drawn to primary and secondary education teachers in Finland: 54% of 1198 teachers in Finland were found to suffer from voice disorders [4]. Voice disorder sufferers are expected to consult otorhinolaryngologists and speech therapists for diagnosis and medical treatment to support their rapid recovery. However, research has shown that only a small percentage of total sufferers seek professional advice. Only 5.9% (78 participants) sought professional advice when they suffered from voice disorders [5]. Another survey [6] showed that 22.5% (56 teachers) sought medical advice to solve their voice disorders. Many of the subjects claimed that they did not recognize they were suffering from voice disorders. Indeed, confirming a voice disorder requires medical knowledge.
Due to the high prevalence and low consultation rate of the population with voice disorders, it is desirable to have an automatic voice disorder detection algorithm with voice inputs that can give an instant diagnosis of a voice disorder, as well as its type. Such algorithms could be integrated into mobile health applications, as demonstrated in related works [7,8].
In Section 1.1, the performance of existing works on voice disorder detection is summarized. The limitations of existing works are explained in Section 1.2, which forms the rationale of the proposed work in addressing these limitations. This is followed by the key contributions of this paper in Section 1.3.

Literature Review
In recent years, various machine learning algorithms have been proposed and evaluated for the detection of voice disorders. It is worth noting that some previous works [9][10][11] tested their algorithms using the Massachusetts Eye and Ear Infirmary (MEEI) voice and speech lab database, which is not discussed in this paper because the MEEI database is commercial and not publicly available. Further, it is inconsistent, featuring varying recording conditions for pathological and healthy subjects.
Instead, this paper focuses on two publicly available databases: the Saarbruecken Voice Database (SVD) [12][13][14][15][16] and Voice ICar fEDerico II (VOICED) [16][17][18]. The following is a summary of existing approaches applied to the SVD. A hybrid support vector machine (SVM) and Gaussian mixture models (GMM) were proposed and evaluated in [12]. This method achieved an accuracy, sensitivity, and specificity of 0.965, 0.94, and 0.99, respectively. Six methods, including a sequential minimum optimization (SMO)-based SVM, decision tree, Bayesian classification, logistic model tree, k-nearest neighbor, and entropy-based method, were evaluated in [13]. The SMO-based SVM yielded the best performance in accuracy (0.858), sensitivity (0.876), and specificity (0.839). Guedes et al. [14] proposed two approaches, long short-term memory (LSTM) and convolutional neural network (CNN), for differentiation between healthy and dysphonic candidates, healthy and laryngitic candidates, and healthy and paralyzed candidates. The achieved precision values were 0.66, 0.67, and 0.78, respectively. SVM was applied with zero frequency filtering and quasi-closed phase glottal inverse filtering methods for the detection of voice disorders in [15]. Based on the performance evaluation, the applied method achieved an accuracy, sensitivity, and specificity of 0.76, 0.72, and 0.78, respectively. A threshold-based detection method with a newly defined dysphonia detection index was proposed in [16]. The accuracy, sensitivity, and specificity were 0.798, 0.706, and 0.902, respectively.
The work in [16] also applied VOICED, showing degraded performance in the accuracy, sensitivity, and specificity (of 0.5, 0.458, and 0.643, respectively). Researchers adopted k-nearest neighbor (KNN) as a model for the detection of voice disorders [17]. This method achieved an accuracy of 0.933 and outperformed the other algorithms (random forest (0.874) and extra trees (0.863)). In [18], boosted tree, SVM, decision tree, naïve Bayes classifier, and KNN were evaluated; the pairs of sensitivity and specificity were (0.829, 0.862), (0.79, 0.279), (0.779, 0.844), (0.857, 0.644), and (0.774, 0.334), respectively. As the VOICED database was published in 2018 (compared with SVD, which was published in 1997), the research publications that consider the VOICED dataset are fewer than those using SVD.

Research Gaps and Motivation
Existing works have applied various algorithms for voice disorder detection [12][13][14][15][16][17][18]. However, further research and exploration are necessary to address the following limitations in existing works.
i. Some existing works [12,16,17] did not apply cross-validation during their performance evaluation of the algorithms. The first concern is that one may pick a biased training dataset to train the detection model, which takes advantage of this bias by yielding high accuracy. Secondly, not all the data were evaluated, which may affect the fine-tuning of the model and the robustness of its applicability in real-world scenarios.
ii. There is room for improvement in the accuracy, sensitivity, and specificity of voice disorder detection models [13][14][15][16][18]. Particularly in smart healthcare applications, the performance of the machine learning model is held to high expectations, as such applications are related to the health status of humans.
iii. Current works [12][13][14][15][16][17][18] have formulated the voice disorder detection problem as binary detection that only outputs a healthy or pathological result. It is desirable for machine learning algorithms to suggest the actual type of voice disorder to minimize screening time by medical professionals.
iv. Current works [12][13][14][15][16][17][18] have not considered the issues of imbalanced datasets in both SVD and VOICED. This may widen the gap between the sensitivity and specificity of the detection model.
To solve these limitations, this paper adopts the following measures.
i. A 10-fold cross-validation is adopted for the performance evaluation of the voice disorder detection algorithm.
ii. The proposed algorithm incorporates a generative adversarial network and fuzzy c-means clustering to improve the performance of the detection model.
iii. Voice disorder classification is formulated as a multi-class detection problem, thereby allowing the actual type of voice disorder to be suggested.
iv. A generative adversarial network is proposed to generate new training data, which will reduce the influence of imbalanced datasets and improve the performance of the voice disorder detection model.

Research Contributions
The contributions of this paper are as follows.
i. A conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm named CGAN-IFCM is proposed to enable the multi-class detection of voice disorders.
ii. CGAN offers dual benefits, not only reducing the influence of imbalanced datasets but also generating new training data to improve the performance of the voice disorder detection model. In this way, the gap between sensitivity and specificity in the detection model can be reduced. The results indicate that the proposed CGAN-IFCM outperforms stand-alone IFCM by 10-12.6% and 5.8-16.2% for the true negative rate (TNR) and true positive rate (TPR), respectively.
iii. IFCM addresses the limitations of existing fuzzy c-means clustering (FCM). IFCM increases performance by introducing interactions between adjacent data points in the fuzzy membership function. A data point and its neighboring data points in feature space have a high probability of being grouped into the same cluster. The results reveal that the proposed CGAN-IFCM improves the TNR and TPR by 7.3-9% and 3.1-12%, respectively.

Materials and Methods
In Section 2, the two publicly available databases (SVD and VOICED) for voice disorder detection are outlined, followed by the proposed CGAN-IFCM algorithm that will be applied to each of the databases.

Voice Disorders Databases
The SVD was collected in collaboration with the Department of Phoniatrics and Ear, Nose, and Throat (ENT) at the Caritas clinic of St. Theresia in Saarbrücken [19,20]. The data contain recordings of sustained phonations of the vowels, as well as the sentence "Good morning, how are you?" in German. The database includes 869 healthy candidates and 1356 candidates with voice disorders. This can be formulated as a binary detection problem.
For the multi-class detection problem in voice disorders, there are 71 types of pathologies among the 1356 candidates. Since most pathology types have small sample sizes, and to align with the pathology types in VOICED, the multi-class detection problem is formulated with four classes: 213 candidates with hyperkinetic dysphonia, 16 with hypokinetic dysphonia, 140 with reflux laryngitis, and 869 healthy candidates.

Voice ICar fEDerico II (VOICED)
VOICED was collected at the faculty of Phoniatrics and Videolaryngoscopy at the Hospital University of Naples Federico II and at the medical room at the Institute of High Performance Computing and Networking [21]. The recordings contain voice signals of the vowel /a/ sustained for five seconds in a quiet room to minimize background noise. Further, VOICED contains information on other attributes, such as smoking status, alcohol consumption, hydration, eating habits, voice handicap index, and reflux symptom index. The database contains 58 healthy candidates and 150 candidates with voice disorders. Thus, the binary detection problem can be formulated.
On the other hand, the multi-class detection problem can be formulated by dividing the voice disorder group into 70 with hyperkinetic dysphonia, 41 with hypokinetic dysphonia, and 39 with reflux laryngitis. Table 1 summarizes the key information on SVD and VOICED, including the signal characteristics, the information included, and the classes of the binary detection model and multi-class detection model. Both have imbalanced datasets, so CGAN will be applied in this study to generate more training samples for classes with smaller sample sizes.

In the remainder of this section, the methodology of CGAN-IFCM is presented. Firstly, the reasons for using CGAN instead of other types of GAN are explained, along with the details of CGAN. This is followed by the rationale behind the selection of IFCM instead of traditional FCM as a voice disorder detection model.

Generation of Additional Training Data Using CGAN
Imbalanced classes were observed in the databases SVD and VOICED. This imbalance causes a significant deviation between the sensitivity and specificity of the voice disorder detection model, particularly in VOICED [16][17][18]. The goal is to increase the number of training samples to balance the classes. A recent state-of-the-art article presented the development and progress of the evolution of GAN [22]. The basic GAN proposal has a key limitation: its noise vector does not have restrictions and may lead to fatal theoretical issues. As a result, different kinds of solutions have been proposed. The following relevant articles have received a large number of citations: auxiliary classifier GAN (ACGAN) [23], CGAN [24], and information maximizing GAN (InfoGAN) [25]. In this paper, CGAN is selected to increase the number of training samples for classes with fewer samples. The first reason is that various types of information (the information included in Table 1) are included in SVD and VOICED. This information is conditional information that fits well with the theory of CGAN. Further, conditional information is introduced in both the generator and discriminator to reduce the aforesaid fatal theoretical issue. Conditional generation is then introduced, in which CGAN is restricted from generating new data for the class with the largest sample size. Therefore, in the binary detection model, the class of candidates with voice disorders is excluded from generation in both SVD and VOICED. For the multi-class detection model, the class of healthy candidates and the class of candidates with hyperkinetic dysphonia are excluded in SVD and VOICED, respectively. To avoid the excessive generation of training data, there is another restriction: the number of generated samples for a class is limited to, at most, the original sample size of the corresponding class. Figure 1 shows the system workflow of CGAN.
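As a minimal sketch of the two generation rules above (top up each minority class toward the largest class, but never generate more synthetic samples than a class originally had), the helper below is illustrative; the function and variable names are ours, not the paper's.

```python
# Hedged sketch: deciding how many synthetic samples CGAN may generate
# per class. Rule from the text: fill the deficit toward the largest
# class, capped at the class's original sample size.
def plan_generation(class_counts):
    """Return the number of synthetic samples to generate per class."""
    largest = max(class_counts.values())
    plan = {}
    for label, n in class_counts.items():
        deficit = largest - n          # samples needed to balance
        plan[label] = min(deficit, n)  # cap at the original class size
    return plan

# SVD multi-class counts from the paper: healthy, hyperkinetic dysphonia,
# hypokinetic dysphonia, reflux laryngitis.
svd = {"healthy": 869, "hyperkinetic": 213, "hypokinetic": 16, "reflux": 140}
print(plan_generation(svd))  # hypokinetic is capped at 16, not 853
```

Note how the cap keeps the tiny hypokinetic class (16 samples) from being swamped by synthetic data, at the cost of leaving the classes only partially balanced.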
Given noise vector n and conditional variable c, generator G captures the data distribution, and discriminator D determines whether a sample came from the original dataset or the generated dataset. Both G and D are conditioned on c: n and c serve as the inputs of G, which outputs the synthetic sample G(n|c), while a sample x (or G(n|c)) together with c serves as the input of D. D(x; θ_d) gives a single scalar representing the probability that x came from the training data p_data(x) rather than from the generator. To learn the generator distribution over the data x, the generator constructs a mapping function from the Gaussian noise distribution p_n(n) to the data space as G(n; θ_g). D and G are trained simultaneously: parameters θ_d of D are adjusted to maximize log D(x|c), while parameters θ_g of G are adjusted to minimize log(1 − D(G(n|c))) for the conditional variable c. Mathematically, the objective function is expressed as follows:

\[
\min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|c)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n|c)))]
\]

where V(D,G) is the value function.
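The value function above can be estimated by Monte Carlo sampling. The sketch below uses toy closed-form stand-ins for G and D (a real CGAN uses trained neural networks), so it only illustrates how the two expectation terms combine; all names and functional forms are assumptions for illustration.

```python
import numpy as np

# Hedged sketch: a Monte Carlo estimate of the conditional GAN value
# function V(D,G) = E_x[log D(x|c)] + E_n[log(1 - D(G(n|c)))].
rng = np.random.default_rng(0)

def D(x, c):
    # toy discriminator: a conditioned logistic score in (0, 1)
    return 1.0 / (1.0 + np.exp(-(x + 0.5 * c)))

def G(n, c):
    # toy generator: shifts Gaussian noise by the condition
    return n + c

def value_function(real_x, noise, c):
    real_term = np.mean(np.log(D(real_x, c)))
    fake_term = np.mean(np.log(1.0 - D(G(noise, c), c)))
    return real_term + fake_term

c = 1.0
real_x = rng.normal(loc=2.0, scale=1.0, size=10_000)  # "real" samples
noise = rng.normal(size=10_000)                       # noise vector n
v = value_function(real_x, noise, c)
# D is trained to maximize v; G is trained to minimize it (min-max game).
print(v)
```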

Voice Disorder Detection Model Using IFCM
As shown in Table 1, the voice disorder detection model can be formulated as a binary detection problem and a multi-class (4-class) detection problem. We selected a multi-class detection model for the illustration in this subsection because it represents the expected application through which the voice disorder detection model could diagnose actual types of voice disorders.
To design the feature vector, there are two typical approaches: (i) follow clinical and expert guidelines; and (ii) if the first approach is not available, carry out a thorough investigation of the proper features. For the voice disorder detection problem, we follow the first approach because there is a clinical acoustic analysis technique available for the formal assessment of voice disorders [26,27]. Four voice quality-based parameters are chosen: the harmonics-to-noise ratio (HNR), shimmer, jitter, and fundamental frequency (f_0). Since the estimation of these parameters is a standard approach, and feature extraction is not the focus of this paper, we follow de Krom's algorithm to measure HNR [28], jitter (as the cycle-to-cycle variation of fundamental frequency) [29], shimmer (as the peak-to-peak amplitude variation in decibels) [29], and f_0 (as the maximum of the autocorrelation function) [30].
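As a rough illustration of these standard acoustic measures (not the exact algorithms of [28-30]), the sketch below estimates f_0 from the autocorrelation maximum and computes simple jitter and shimmer statistics; the sampling rate, search band, and helper names are assumptions.

```python
import numpy as np

# Hedged sketch of the four feature families named in the text:
# f0 from the autocorrelation maximum, jitter as cycle-to-cycle period
# variation, shimmer as cycle-to-cycle amplitude variation in dB.
fs = 16_000                               # assumed sampling rate
t = np.arange(0, 0.5, 1 / fs)
signal = np.sin(2 * np.pi * 200 * t)      # synthetic 200 Hz "phonation"

def estimate_f0(x, fs, fmin=75, fmax=500):
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(fs / fmax), int(fs / fmin)            # plausible pitch lags
    lag = lo + np.argmax(ac[lo:hi])                    # autocorrelation max
    return fs / lag

def jitter(periods):
    # mean absolute cycle-to-cycle period variation, relative to the mean
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_db(peaks):
    # mean absolute cycle-to-cycle amplitude variation in decibels
    return np.mean(np.abs(20 * np.log10(peaks[1:] / peaks[:-1])))

f0 = estimate_f0(signal, fs)
print(round(f0, 1))  # close to 200 Hz for this clean synthetic tone
```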
In the implementation and performance evaluation, we adopt k-fold cross-validation, with k = 10 as the typical choice [31,32]. We define N_i as the total number of samples in class i, where N_1, N_2, N_3, and N_4 correspond to healthy candidates, those with hyperkinetic dysphonia, those with hypokinetic dysphonia, and those with reflux laryngitis, respectively. The feature vector of the pth sample is X_p = (HNR_p, shimmer_p, jitter_p, f_{0,p}).

In general, the FCM problem is defined to minimize the objective function [33]. Intuitively, in each cluster, the data points should be close to the cluster center. Therefore, the objective is defined as a minimization of the intra-cluster variance, which is equivalent to a maximization of intra-cluster similarities. The less the variance, the greater the similarities among the data points (a variance of zero results in maximal similarity):

\[
\min \; J = \sum_{p=1}^{N_1+N_2+N_3+N_4} \sum_{c=1}^{N_c} u_{pc}^{m} \, \| X_p - v_c \|^2
\]

where N_c is the optimally designed total number of clusters using the multiobjective genetic algorithm (MOGA), u_pc is the degree of membership of X_p in the cth cluster, v_c is the cluster center of the cth cluster, and m controls the fuzziness of the resulting partition. The value m = 2 is common and is selected according to [33].
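A direct evaluation of this objective can be sketched as follows; the toy data and function names are illustrative, not from the paper.

```python
import numpy as np

# Hedged sketch: evaluating the FCM objective
# J = sum_p sum_c u_pc^m * ||X_p - v_c||^2 (the fuzzy intra-cluster
# variance that the algorithm minimizes), with m = 2 as in the text.
def fcm_objective(X, V, U, m=2):
    # X: (N, d) feature vectors, V: (Nc, d) centers, U: (N, Nc) memberships
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared dists
    return float(((U ** m) * d2).sum())

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # toy feature vectors
V = np.array([[0.05, 0.0], [5.0, 5.0]])             # two cluster centers
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # crisp memberships
J = fcm_objective(X, V, U)
print(J)  # small, since each point sits near its own center
```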
Taking the partial derivative, we have two iterative solutions, v_c and u_pc, given by

\[
v_c = \frac{\sum_{p} u_{pc}^{m} X_p}{\sum_{p} u_{pc}^{m}}, \qquad
u_{pc} = \frac{1}{\sum_{j=1}^{N_c} \left( \dfrac{\| X_p - v_c \|}{\| X_p - v_j \|} \right)^{2/(m-1)}}
\]

Nonetheless, the existing FCM algorithm [33] does not include the relationship between X_p and its neighboring feature vectors, which limits its performance in clustering problems. A feature vector X_p is indeed strongly correlated (sharing similar characteristics) with its neighbors, due to the nature of characteristics in the feature space. Such feature vectors have a high probability of being grouped into the same cluster. As a result, the rationale of the proposed work is to introduce the interaction between X_p and its neighbors into the fuzzy membership function.
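The two alternating updates can be sketched as a short loop. This is the standard FCM iteration (centers from memberships, then memberships from distances), not the paper's exact implementation; initialization and stopping are simplified.

```python
import numpy as np

# Hedged sketch of the standard FCM alternating updates: v_c from the
# memberships, then u_pc from the distances, repeated for a fixed number
# of iterations.
def fcm(X, n_clusters, m=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)               # rows sum to 1
    for _ in range(iters):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]      # v_c update
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                    # avoid division by zero
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)    # u_pc update
    return V, U

# two tight groups of toy points, at (0, 0) and (10, 10)
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
V, U = fcm(X, n_clusters=2)
print(np.round(np.sort(V[:, 0])))  # centers land near 0 and 10
```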
A new degree of membership is proposed as follows:

\[
u_{neighbor,pc} = \frac{1}{N_n} \sum_{X_q \in SN(X_p)} C_{qc}
\]

where C_pc is a conditional variable related to the level of participation of X_p in the cth cluster, SN(X_p) is the square neighborhood centered at X_p, and N_n is the number of feature vectors in the neighborhood. The joined degree of membership between u_pc and u_neighbor,pc is defined as follows:

\[
u_{VDC,pc} = \frac{\omega_0 \, u_{pc} + \omega_n \, u_{neighbor,pc}}{\sum_{c'=1}^{N_c} \left( \omega_0 \, u_{pc'} + \omega_n \, u_{neighbor,pc'} \right)}
\]

where ω_n and ω_0 are the control weighting factors of the importance between u_neighbor,pc and u_pc. These are optimally designed via MOGA, which benefits the convergence of the model training.
There are numerous combinations of ω_n and ω_0 that may yield different u_VDC,pc and v_VDC,c values. Searching for the optimal ω_n and ω_0 requires considerable computing power. Alternatively, a tradeoff is sought between the convergence of model training and computing power while maintaining favorable performance.
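A hedged sketch of the joining step: the paper's exact functional form is not reproduced here, so the code assumes a renormalized weighted combination of u_pc with a neighborhood-averaged membership term, which matches the qualitative behavior discussed (a larger ω_n lets the neighborhood term dominate). All names are illustrative.

```python
import numpy as np

# Hedged sketch (assumed form, not the paper's exact equation): joining
# the pointwise membership u_pc with a neighborhood-averaged membership,
# then renormalizing so each row still sums to 1.
def joined_membership(U, neighbor_idx, omega_n, omega_0):
    # U: (N, Nc) memberships; neighbor_idx[p]: indices of X_p's neighbors
    U_nb = np.vstack([U[idx].mean(axis=0) for idx in neighbor_idx])
    J = omega_0 * U + omega_n * U_nb
    return J / J.sum(axis=1, keepdims=True)

U = np.array([[0.90, 0.10],
              [0.20, 0.80],   # point 1 leans to cluster 1 on its own...
              [0.85, 0.15]])
neighbors = [np.array([2]), np.array([0, 2]), np.array([0])]
U_vdc = joined_membership(U, neighbors, omega_n=2.2, omega_0=2.0)
print(np.round(U_vdc, 2))  # ...but its neighbors pull it toward cluster 0
```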
The multiobjective optimization problem consists of four objectives (Equation (10)), where CH_index is the Calinski-Harabasz (CH) index, which is important for determining the number of clusters [34]. CH_index evaluates cluster validity based on the sum of squares between clusters and the sum of squares within clusters. Varying the number of clusters N_c yields distinct values of CH_index; the higher the CH_index, the better the solution. In addition, U_VDC = (u_VDC,pc), of dimension (N_1+N_2+N_3+N_4) × N_c, is the fuzzy membership matrix, TNR is the true negative rate (equivalent to specificity), and TPR is the true positive rate (equivalent to sensitivity) of the voice disorder detection model. CH_index, TNR, and TPR are given by

\[
CH_{index} = \frac{SS_{BC}/(N_c - 1)}{SS_{WC}/(N_1 + N_2 + N_3 + N_4 - N_c)}
\]

\[
Specificity = TNR = \frac{TN}{TN + FP}, \qquad Sensitivity = TPR = \frac{TP}{TP + FN}
\]

where SS_BC is the sum of squares between clusters, N_c is the number of clusters, and SS_WC is the sum of squares within clusters. The larger the SS_BC, the higher the degree of dispersion between the clusters; the smaller the SS_WC, the closer the relationship within a cluster. In addition, TN is the true negative, FP is the false positive, TP is the true positive, and FN is the false negative count of the testing samples.

The proposed MOGA-IFCM algorithm is then applied to solve the multiobjective optimization problem in Equation (10) [35,36]. Generally, solving a multiobjective optimization problem has an overall run time complexity of O(GMN^2), where G, M, and N are the number of generations, number of objectives, and population size, respectively. A hyper-grid scheme [37,38] was adopted.

Evolutionary algorithms (for instance, the genetic algorithm) tend to converge to a single solution as the diversity of the population diminishes [39]. This phenomenon is called genetic drift. The technique for maintaining a stable sub-population of diverse individuals according to the distance between individuals is called the niching technique.
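The CH index, TNR, and TPR defined above can be computed directly; the toy data below are illustrative, and the function names are ours.

```python
import numpy as np

# Hedged sketch: the Calinski-Harabasz index (between-cluster vs
# within-cluster sum of squares) and TNR/TPR from confusion counts.
def ch_index(X, labels):
    n, ks = len(X), np.unique(labels)
    mean = X.mean(axis=0)
    ss_bc = sum((labels == k).sum()
                * np.sum((X[labels == k].mean(axis=0) - mean) ** 2)
                for k in ks)                        # between-cluster SS
    ss_wc = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                for k in ks)                        # within-cluster SS
    return (ss_bc / (len(ks) - 1)) / (ss_wc / (n - len(ks)))

def tnr_tpr(tn, fp, tp, fn):
    return tn / (tn + fp), tp / (tp + fn)

# two well-separated toy clusters score a high CH index
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(ch_index(X, labels))

tnr, tpr = tnr_tpr(tn=90, fp=10, tp=80, fn=20)
print(tnr, tpr)  # 0.9 0.8
```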
An individual's niche count is defined as the sum of the sharing function sf over the population (i.e., \(\sum_{j=1}^{n} sf(d(i,j))\)), whose value reflects how crowded the individual's neighborhood is. We define sf as a function of the distance d(i,j) between two population elements, as follows [40,41]:

\[
sf(d(i,j)) =
\begin{cases}
1 - \left( \dfrac{d(i,j)}{\sigma_{dis}} \right)^{\alpha}, & d(i,j) < \sigma_{dis} \\
0, & \text{otherwise}
\end{cases}
\]

where σ_dis is the threshold for dissimilarity, and α = 1 is a constant for regulating the shape of the sharing function.
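The sharing function and niche count above can be sketched as follows; the 1-D population and distance function are illustrative assumptions.

```python
# Hedged sketch of fitness sharing for niching:
# sf(d) = 1 - (d / sigma_dis)**alpha when d < sigma_dis, else 0,
# and the niche count as the sum of sf over the whole population.
def sf(d, sigma_dis, alpha=1):
    return 1 - (d / sigma_dis) ** alpha if d < sigma_dis else 0.0

def niche_count(i, population, distance, sigma_dis):
    return sum(sf(distance(i, j), sigma_dis) for j in population)

# toy 1-D population; absolute difference as the distance (illustrative)
pop = [0.0, 0.1, 0.2, 5.0]
dist = lambda a, b: abs(a - b)
print(niche_count(pop[0], pop, dist, sigma_dis=1.0))  # crowded near 0.0
```

Individuals in crowded regions accumulate high niche counts, so their shared fitness is penalized and diverse sub-populations are preserved, counteracting genetic drift.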
The optimal tradeoff solutions are thus found. Pseudo code of the algorithm can be found in Algorithm 1.

Analysis and Results of CGAN-IFCM
The effectiveness of the proposed CGAN-IFCM is analyzed in five parts: (i) the performance of the proposed CGAN-IFCM; (ii) the necessity of using CGAN, by comparing the performance of the detection models using CGAN-IFCM and IFCM; (iii) the necessity of using IFCM, by comparing the performance of the detection models using CGAN-IFCM and CGAN-FCM; (iv) a comparison of CGAN-IFCM with two typical data generation methods (the synthetic minority oversampling technique (SMOTE) and cost-sensitive learning (CSL)); and (v) a comparison of the performance between CGAN-IFCM and existing methods. As previously noted, 10-fold cross-validation was adopted for performance evaluation, for which the value of 10 has been widely adopted in the literature [31,32].

Performance Evaluation of CGAN-IFCM
As shown in Table 1, there are four cases of formulations: (i) the binary detection model using the SVD database; (ii) the binary detection model using the VOICED database; (iii) the multi-class detection model using the SVD database; (iv) the multi-class detection model using the VOICED database. Section 3.1 will present and analyze the results of CGAN-IFCM in all these cases.

Binary Detection Model using CGAN-IFCM
In the proposed CGAN-IFCM, the number of clusters N_c and the weighting factors ω_n and ω_0 highly influence the joined degree of membership u_VDC,pc and cluster center v_VDC,c, and thus correlate with the performance (TNR and TPR) of the detection model.
The evaluation is first performed on the binary detection model using CGAN-IFCM with the SVD database. Table 2 summarizes the TNR and TPR of the selected scenarios with N_c = [3, 10] and ω_n, ω_0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. Key observations can be drawn from this process:
i. (ω_n = [2.1, 2.4]): The highest TNR and TPR are obtained at ω_0 = 2, which can be explained by the condition ω_n > ω_0. Recall that X_p and its neighbors share similar characteristics and tend to group under the same v_VDC,c.
ii. However, when ω_n ≤ ω_0, both TNR and TPR decrease.
iii. The performance of CGAN-IFCM is better when N_c = [3, 5] compared to N_c = [6, 10] and deteriorates significantly when N_c > 5, as some clusters could be redundant and lead to errors in voice disorder detection.
As a result, the voice disorder detection model achieves higher accuracy when ω_n > ω_0, which reveals the effectiveness and necessity of the membership function u_neighbor,pc and weighting factor ω_n. On the other hand, when ω_n ≤ ω_0, u_VDC,pc tends to be dominated by ω_0, and the performance is close to that of existing FCM algorithms.
Next, we focus on the binary detection model with the VOICED database. Similarly, Table 3 summarizes the TNR and TPR of the selected scenarios with N_c = [3, 10] and ω_n, ω_0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. Besides the observations already made for Table 2, there are two extra observations:
i. The detection model is dominated by the class of candidates with voice disorders because the number of samples of voice disorders remains larger than the number of samples of healthy candidates after the adoption of CGAN. In other words, as shown in Table 3, there is a notable gap between TNR and TPR (on average, 3.94% versus 0.79% in Table 2), where TPR is higher than TNR.
ii. The overall accuracy of the binary detection model with the VOICED database is less than that with the SVD database. This could be explained in two ways: the degree of imbalance in the datasets and the number of samples in SVD, which is about 10 times that in VOICED.

Multi-Class Detection Model using CGAN-IFCM
For the construction of the multi-class detection model, the problem is extended to determining the exact type of voice disorder from among four possibilities: healthy, hyperkinetic dysphonia, hypokinetic dysphonia, and reflux laryngitis. First, the SVD database is considered. Table 4 presents the TNR and TPR of selected scenarios with N_c = [3, 10] and ω_n, ω_0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. In addition to the first two points stated for Table 2, there are three other observations:
i. The performance of CGAN-IFCM is better when N_c = [6, 8] compared to N_c = [3, 5] and N_c = [9, 10]. It achieves lower performance when N_c < 6 or N_c > 8, as either an insufficient number of clusters or redundant clusters are considered.
ii. The detection model is dominated by the class of healthy candidates because the number of samples of healthy candidates remains far greater than the number of samples of candidates with hyperkinetic dysphonia, hypokinetic dysphonia, or reflux laryngitis after the adoption of CGAN. In Table 4, the difference between TNR and TPR (where TNR is higher than TPR) is further increased (on average, 4.56% versus 3.94% in Table 3).
iii. The overall accuracy of the multi-class detection model is lower than that of the binary detection model. Basically, multi-class detection is a more complicated problem than binary classification. Further, the issue of an imbalanced dataset is especially significant given the small sample size of patients with hypokinetic dysphonia (2% of the samples of healthy candidates).
Next, the multi-class detection model is implemented using the VOICED database. Table 5 presents the TNR and TPR of the selected scenarios with N_c = [3, 10] and ω_n, ω_0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. In addition to the first two points stated for Table 2, there are two other observations:
i. The performance of CGAN-IFCM is better when N_c = [6, 8] compared to N_c = [3, 5] and N_c = [9, 10]. It achieves lower performance when N_c < 6 or N_c > 8, as either an insufficient number of clusters or redundant clusters are considered.
ii. The multi-class detection model is formulated using an equal sample size in each class after the adoption of CGAN. The difference between TNR and TPR is not significant, and there is no systematic bias (neither TNR nor TPR is consistently higher than the other).

Comparison Between CGAN-IFCM and IFCM
IFCM includes the relationship between X_p and its neighbors, which enhances the performance of traditional FCM. As noted, CGAN generates new training data that not only lower the effect of imbalanced classes in SVD and VOICED but also enhance the performance of the detection model. Table 6 summarizes the performance of the four cases of detection models using CGAN-IFCM and IFCM. The results reveal that CGAN-IFCM lowers the difference between TNR and TPR by 58.4% (averaged over the four cases, 2.09% versus 5.02%). Further, the proposed CGAN-IFCM outperforms stand-alone IFCM by 10-12.6% and 5.8-16.2% in the TNR and TPR, respectively. This demonstrates the dual benefits of CGAN, which (i) reduces the effect of imbalanced classes (and thus lowers the difference between TNR and TPR) and (ii) generates more data to enhance detection performance.

Comparison Between CGAN-IFCM and CGAN-FCM
To demonstrate the effectiveness of IFCM, the performance of detection models using CGAN-IFCM and CGAN-FCM is compared and summarized in Table 7. The proposed CGAN-IFCM improves the TNR and TPR by 7.3-9% and 3.1-12%, respectively. Combining the results in Tables 6 and 7 therefore shows that both CGAN and IFCM help improve the detection accuracy of voice disorder detection models.
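The difference between FCM and IFCM lies in the membership function. One common way to inject neighbourhood information, sketched below, blends each point's distance to a cluster centre with the distance of its neighbourhood mean to that centre; the blending weight `beta` and this particular blended-distance form are assumptions for illustration (the paper's exact IFCM formulation may differ), and `beta = 0` recovers plain FCM:

```python
import numpy as np

def ifcm_memberships(X, centers, neighbors, m=2.0, beta=0.5):
    """Fuzzy memberships in which each point's distance to a cluster
    centre is blended with the distance of its neighbourhood mean to
    that centre. `beta` (assumed) weights the neighbour term."""
    x_bar = X[neighbors].mean(axis=1)                      # neighbourhood means
    d  = np.linalg.norm(X[:, None] - centers, axis=2)      # point-to-centre
    dn = np.linalg.norm(x_bar[:, None] - centers, axis=2)  # neighbour-to-centre
    dist = (d**2 + beta * dn**2) ** 0.5 + 1e-12
    inv = dist ** (-2.0 / (m - 1.0))                       # standard FCM exponent
    return inv / inv.sum(axis=1, keepdims=True)            # rows sum to 1

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
neighbors = np.array([[1], [0], [3], [2]])  # index of each point's neighbour
U = ifcm_memberships(X, centers, neighbors)
```

Because neighbouring points pull the membership in the same direction, isolated noisy points are assigned less extreme memberships than under plain FCM.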

Comparison Between CGAN, SMOTE, and CSL
To study the effectiveness of CGAN in addressing the issue of an imbalanced dataset, a comparison is made with two typical approaches: the synthetic minority oversampling technique (SMOTE) [42,43] and cost-sensitive learning (CSL) [44,45]. Table 8 presents the performance of CGAN-IFCM, SMOTE-IFCM, and CSL-IFCM in binary and multi-class voice disorder detection. The results show that the proposed CGAN-IFCM model outperforms SMOTE-IFCM and CSL-IFCM by 4-6% in terms of the TNR and TPR. This result can be explained by the following factors: (i) the SVD and VOICED datasets contain conditional information that fits well with the theory of CGAN; (ii) SMOTE does not consider the possibility that neighboring points belong to other classes, which can increase the overlap between classes and introduce additional noise; and (iii) CSL is prone to overfitting.
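The SMOTE limitation in factor (ii) is visible in a minimal implementation: each synthetic point interpolates between a minority sample and one of its k nearest minority neighbours, with no check on nearby majority points. The sketch below is a bare-bones version for illustration (production code would use a library such as imbalanced-learn):

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: each synthetic point is a random
    interpolation between a minority sample and one of its k nearest
    minority neighbours. Neighbours are chosen *within* the minority
    class only, ignoring nearby majority points, which can deepen
    class overlap (the limitation noted above)."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        lam = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote(X_min, n_new=10)
```

In contrast, a CGAN discriminator sees real samples from all classes during training, so the generator is penalised for producing samples that straddle class boundaries.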

Comparison Between CGAN-IFCM and Existing Works

Notably, existing works [12-18] did not formulate the voice disorder detection problem as one of multi-class detection, thereby precluding a direct comparison. Thus, in this paper, we extend the formulation to the multi-class detection problem for detecting three common types of voice disorders: hyperkinetic dysphonia, hypokinetic dysphonia, and reflux laryngitis. Table 10 summarizes the TNR and TPR of the proposed CGAN-IFCM and two typical algorithms: RF and SVM (radial basis kernel function). The same feature vector is utilized to ensure a fair comparison. CGAN-IFCM achieves a higher TNR and TPR than RF and SVM for three key reasons: (i) CGAN reduces the issue of imbalanced datasets; (ii) CGAN increases the amount of training data; and (iii) IFCM includes the relationship between a data point X_p and its neighboring points. Satisfactory performance (only a small deterioration of the TNR and TPR compared with the binary detection model) is achieved for CGAN-IFCM.
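The RF and SVM baselines in Table 10 can be reproduced in outline with scikit-learn. In the sketch below, synthetic Gaussian features stand in for the paper's acoustic feature vector, and the class means and sizes are arbitrary choices made only so that the example runs end to end:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the acoustic feature vectors: three
# well-separated classes representing the three disorder types.
X = np.concatenate([rng.normal(c, 0.3, size=(50, 8)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

rf  = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
svm = SVC(kernel="rbf").fit(X, y)  # radial basis kernel, as in Table 10

rf_acc  = (rf.predict(X) == y).mean()
svm_acc = (svm.predict(X) == y).mean()
```

A fair comparison, as in the paper, trains all models on the same feature vector and reports one-vs-rest TNR/TPR on held-out data rather than training accuracy.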

Comparison Between CGAN-IFCM with Other Approaches using Wilcoxon Signed-Rank Test
We analyzed the performance of the proposed CGAN-IFCM against the various approaches in Sections 3.1-3.5. To determine whether CGAN-IFCM statistically outperforms the other approaches, a t-test is not suitable because both TNR and TPR are bounded between 0% and 100% and therefore do not follow a normal distribution [46,47]. The non-parametric Wilcoxon signed-rank test [48,49] was chosen instead to confirm whether the improvement of CGAN-IFCM over the other approaches is statistically significant. The significance level was set at 0.05, with H_0 and H_a denoting the null and alternative hypotheses, respectively. Equivalent accuracy (EQA) was defined as the weighted sum of the TNR and TPR of the model. Table 11 summarizes the results of the Wilcoxon signed-rank test, where the subscript of EQA denotes the method of the voice disorder detection model. The results reveal that the p-values of all cases are less than 0.05, so the proposed CGAN-IFCM statistically outperforms the aforementioned approaches presented in Sections 3.1-3.5.

Table 11. Results of the Wilcoxon signed-rank test on the proposed CGAN-IFCM and various approaches.
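The test itself is available in SciPy. The sketch below uses hypothetical paired EQA scores (the paper's actual per-run values underlie Table 11) to show how the p-value against the 0.05 significance level is obtained:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired EQA scores (%) over ten evaluation runs;
# these numbers are illustrative, not the paper's data.
eqa_cgan_ifcm = np.array([92.1, 91.4, 93.0, 92.5, 91.8,
                          92.9, 93.3, 92.0, 91.6, 92.7])
eqa_baseline  = np.array([88.3, 87.9, 89.1, 88.5, 88.0,
                          89.4, 89.0, 88.2, 87.7, 88.9])

# H0: the paired differences are symmetric about zero (no improvement).
stat, p = wilcoxon(eqa_cgan_ifcm, eqa_baseline)
reject_h0 = p < 0.05  # significance level used in the paper
```

Because the test ranks the paired differences rather than assuming normality, it remains valid for bounded metrics such as TNR, TPR, and EQA.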

Conclusions
In this paper, a conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm named CGAN-IFCM was proposed for multi-class detection of three common types of voice disorders. CGAN demonstrated its effectiveness in generating new training data, reducing the effect of the imbalanced classes and thus the deviation between the true positive rate and true negative rate of the detection model; it improved the TNR and TPR by 10-12.6% and 5.8-16.2%, respectively. IFCM addresses a limitation of the traditional FCM by introducing the interaction between a data point (in the feature space) and its neighboring data into the fuzzy membership function; the results show that IFCM improves the TNR and TPR by 7.3-9% and 3.1-12%, respectively. We also discussed the advantages of CGAN for managing an imbalanced dataset with conditional information, where the model increased performance by 4-6% compared to the traditional SMOTE and CSL methods. In addition, the proposed CGAN-IFCM was compared with the results of existing works, demonstrating TNR and TPR enhancements of 12.9-43.5% and 9.1-44.8%, respectively.
Future research directions are suggested as follows: (i) merging the SVD and VOICED databases to enlarge the number of samples in each class, which requires data heterogeneity to be properly handled; and (ii) extending the multi-class detection model to incorporate more types of voice disorders. The performance of the detection model is expected to decrease as the number of classes (and thus the complexity of the model) increases; therefore, further studies on feature extraction and model construction should be carried out.