Diagnosis and Knowledge Discovery of Turner Syndrome Based on Facial Images Using Machine Learning Methods

Turner syndrome (TS) is a chromosomal disorder that affects only the development of females. Based on TS facial images, we propose a new TS diagnosis model, which preserves the original ratio of face width to face height, extracts reliable features and conducts feature analysis based on support vectors (SVs) and principal components (PCs). The proposed TS model is composed of Image Preprocessing, Feature Extraction, Classifier Construction and Knowledge Discovery. For Image Preprocessing, the techniques of face alignment, facial area intercept and brightness normalization transform the original facial images into the desired gray aligned facial area images, while the original ratio of face width to face height remains unchanged. For Feature Extraction, by employing the energy features of facial organ blocks and ratio features, both roughly and more finely, five reliable feature sets are extracted, i.e., Rough Energy Features, Finer Energy Features (FEF), Rough Ratio Features, Finer Ratio Features (FRF) and FRF2. For Classifier Construction, support vector machine (SVM), principal component analysis (PCA), kernel PCA (KPCA) and ensemble learning methods are used to establish 15 single classifiers (i.e., 5 SVM classifiers, 5 PCA+SVM classifiers and 5 KPCA+SVM classifiers) and 2 ensemble classifiers. The classifier established by the weighted voting method achieves the highest accuracy of 0.9127, and FEF outperforms the other four feature sets. For Knowledge Discovery, feature analysis based on SVs and PCs is carried out to discover important features. Analyzing the SVs shows that less energy in the external canthus areas and a lower ratio of forehead height to forehead width often occur in TS patients, while analyzing the PCs shows that the energy and ratio features of the left zygoma area are important in identifying TS.


I. INTRODUCTION
Turner Syndrome (TS) is a chromosomal disorder that affects the development of females and should be diagnosed early because of its specific symptoms. It occurs in approximately 1 in 2000-4000 female live births [1].

(The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.)

A female diagnosed with TS may exhibit one or more clinical characteristics, such as short stature, primary amenorrhea, ovarian failure, characteristic facial appearance, ear infections, renal anomalies, cardiovascular malformations and liver anomalies, which are caused by the absence or structural abnormality of one X chromosome in some or all of the cells [2]. Moreover, short stature and primary amenorrhea are the symptoms most likely to occur in TS patients and require early treatment [1], [3]. For short stature, growth hormone treatment is applied to help TS patients achieve an adult height within the lower range of female population standards; it is advised to start the treatment early, at about 4-6 years of age, and preferably before 12-13 years. For primary amenorrhea, estrogen replacement therapy is advised to induce pubertal development and maintain female secondary sex characteristics, and should start between 11 and 12 years of age. Therefore, early diagnosis of TS plays an important role in planning treatment, which is conducive to a high-quality life for TS patients.
The diagnosis of TS must be confirmed by a standard 30-cell karyotype analysis [4]-[6], while TS screening should adopt other effective methods because karyotyping is expensive and time-consuming. Many researchers turn to typical clinical characteristics at different life stages for the preliminary diagnosis of TS. For instance, female fetuses with diffuse oedema, cystic hygroma and growth retardation [7]; newborns with a webbed neck and congenital lymphedema of the hands and feet at birth [8]; children with short stature and high follicle-stimulating hormone levels [9]; adolescents with primary amenorrhea and/or short stature; and adults with ovarian failure [10] should all be regarded as suspected TS patients. However, the above-mentioned TS screening manners require physicians with rich clinical experience, who are lacking in underdeveloped cities and the countryside, and these procedures can be lengthy and repetitive. Therefore, it is essential and urgent to explore a fast, cheap and automatic TS screening approach to assist physicians in achieving early diagnosis and treatment of TS.
Machine Learning (ML) is a potentially effective tool for screening TS and addressing the issues mentioned above; it has been applied by several researchers to different datasets to construct TS diagnosis models. Leite [11] applied Bayesian networks to records of symptoms to detect TS. Dumancic et al. [12] built a logistic regression model utilizing characteristics of the craniofacial complex to identify TS patients. Catic et al. [13] established neural networks based on first-trimester maternal serum screening data, ultrasonographic findings and patient demographics to classify TS cases. However, the acquisition of such data is not an easy task. Song et al. [14] and Chen et al. [15] proposed using facial images to diagnose TS with the help of various ML techniques, and similar applications are found in the works of Gao et al. [16] and Yao [17]. The facial image is an important diagnostic basis for TS and is easy to acquire. Many potential characteristics present in facial images could be applied to identify TS, such as a wide forehead, melanocytic nevus, epicanthus, a high-rooted nasal bridge and a wide distance between the eyes. However, these facial characteristics might be damaged or rendered inaccurate in their works [14]-[17], because they adjusted the detected facial part to the same size of 640*640 or 128*128, which may change the original ratio of face width to face height for most people. As a result, the features extracted from the adjusted facial images could not exactly reflect the real facial characteristics of TS patients. Moreover, they did not analyze which of the numerous extracted features are more important in identifying TS.
Hence, an appropriate TS diagnosis model should preserve the original ratio of face width to face height, extract reliable features representing the real facial characteristics, and carry out feature analysis on a large number of features to discover knowledge of the potentially crucial features in recognizing TS.
To solve the above problems, we propose a new TS diagnosis model based on facial images, which maintains the original ratio of face width to face height, gives a corresponding feature extraction method, and shows how feature analysis identifies the crucial facial features among the large number extracted for recognizing TS. The proposed TS diagnosis model is composed of four components, as shown in Fig.1: Image Preprocessing, Feature Extraction, Classifier Construction and Knowledge Discovery. For Image Preprocessing, face alignment, facial area intercept and facial brightness normalization are applied to the original facial images to derive the desired gray aligned facial area images, and the ratio of face width to face height remains unchanged. For Feature Extraction, because different people have different ratios of face width to face height, reasonable features should be carefully extracted. By dividing the processed facial images into several facial organ blocks, both roughly and more finely, energy features and ratio features are extracted to form five feature sets, which better reflect the real facial characteristics than the energy features of average blocks and the distance features utilized by Song et al. [14] and Chen et al. [15]. For Classifier Construction, Support Vector Machine (SVM) and Ensemble Learning are employed; in addition, before establishing classifiers, Principal Component Analysis (PCA) and Kernel PCA (KPCA) are used to reduce dimensionality. For Knowledge Discovery, feature analyses based on support vectors (SVs) resulting from the SVM classifiers and on principal components (PCs) acquired from PCA are carried out, respectively, to discover potentially important features in identifying TS.
The main contributions of the paper are listed below: (1) We propose a new TS diagnosis model based on facial images, which preserves the original ratio of face width to face height, extracts reliable features and carries out feature analysis based on SVs and PCs; (2) We find, by analyzing SVs, that less energy in the external canthus areas and a lower ratio of forehead height to forehead width occur more frequently in TS patients, and, by analyzing PCs, that the energy and ratio features of the left zygoma area are important in identifying TS; (3) Compared with recently proposed methods [14]-[17], the performance of our model is competitive, achieving an accuracy of 0.9127.
The rest of the paper is organized as follows. Section II gives related works about utilizing facial images to diagnose TS. Section III presents our proposed TS diagnosis model, which consists of Image Preprocessing, Feature Extraction, Classifier Construction and Knowledge Discovery.  Section IV describes the employed image data, experiment design and evaluation metrics, compares the prediction performance of different classifiers and conducts knowledge discovery based on SVs and PCs. The conclusion and future work are given in the last section.

II. RELATED WORKS
Considering that the facial image is easy to acquire and ML techniques have served as an effective tool to diagnose disease, some researchers have studied applying various ML algorithms to facial images to establish automatic TS diagnosis models, which could assist physicians in realizing early TS diagnosis and intervention to improve patients' lives.
Song et al. [14] and Chen et al. [15] proposed establishing TS diagnosis models by applying ML techniques to facial images. In their works, the facial images were intercepted and zoomed out to the size of 640*640; global geometrical features (GGF), global texture features (GTF) and multiple local features (LF) were extracted from these preprocessed images; and classifiers based on the three feature sets were constructed (i.e., GGF and GTF were sent to PCA and SVM to build classifiers, while LF was used by AdaBoost to construct a classifier). Table 1 summarizes their works. As shown in Table 1, Gao et al. [16] and Yao et al. [17] also exploited facial images and ML techniques to build TS diagnosis models, in which the intercepted facial image size was likewise set to 640*640 or 128*128. However, adjusting all the intercepted facial images to the same size of 640*640 (or 128*128) may damage the inherent information of the original intercepted facial images; for instance, the ratio of face width to face height may be changed in the adjusted images for most people. As a result, the features extracted from such morphed facial images cannot accurately represent the original facial characteristics, and may even change some facial characteristics. To reduce damage to the original information of the facial image, we preserve the original ratio of face width to face height for each individual in our work.
In addition, none of them carried out feature analysis on their large number of extracted features to discover potentially important features. However, identifying a smaller set of latent crucial features among such a large number of features can give physicians more possibilities to discover underlying knowledge for diagnosing TS. Therefore, in this paper, we carry out feature analysis (e.g., analyzing features based on PCs or SVs) on our numerous extracted features to discover the potentially significant facial features in identifying TS.
Poorly defined features may be one reason the process of discovering knowledge is difficult. For instance, the GGF and GTF features in [14] should be improved. The GGF was obtained by calculating the Euclidean distance between pairs of feature points in the 68 facial points model, as shown in Fig.2. However, the distance features may be invalid in the processed facial images, because the distances of different individuals measured in real life may have been scaled to different degrees in the processed digital images. For example, the distances between points 0 and 16 of different individuals, whether longer or shorter in reality, were changed to a similar length when the intercepted facial images were zoomed out to the same size of 640*640. Nevertheless, the ratio between two distances is not changed, under the assumption that the original ratio of face width to face height of the processed images has not been changed. In our work, we extract such ratio features while ensuring that the original ratio of face width to face height remains unchanged, so as to provide a reliable basis for discovering which ratio features are more important in recognizing TS.
The other global feature set used in [14] is GTF, which may increase the difficulty of identifying key information for detecting TS. The GTF was acquired by evenly dividing the intercepted facial image into 64 blocks and computing the energy of each block. However, the evenly divided blocks hinder knowledge discovery, because blocks at the same position in different people do not always represent the same facial areas. For example, a block of person A may represent the left eyebrow, but due to different facial characteristics, the block of person B at the same position may represent a part of the forehead near her left eyebrow. Consequently, the information of the same organ across different people is not easy to analyze when identifying TS. Hence, in our work, we divide the intercepted facial image into several organ blocks rather than evenly divided blocks, so that each block represents the corresponding facial organ for each person, and compute their energy features. This furnishes a relatively easy way to discover which organ's energy feature plays a more significant role in recognizing TS.

III. THE PROPOSED TURNER SYNDROME DIAGNOSIS MODEL
In this paper, we propose a new TS diagnosis model based on facial images, which preserves the original ratio of face width to face height, extracts the corresponding reliable features and analyzes the potential important features from a large number of features in identifying TS. As shown in Fig.1, the proposed TS diagnosis model consists of four components, i.e., Image Preprocessing, Feature Extraction, Classifier Construction and Knowledge Discovery. For each component, the utilized techniques are described below.

A. IMAGE PREPROCESSING
Image preprocessing is applied to the original facial images because the acquired images are not always uniform, and directly using such images to construct a TS diagnosis model would lead to inappropriate results with low classification accuracy. The original images are acquired by the same camera with a fixed camera position, which guarantees the same image quality, photo distance and illumination intensity, but by different photographers with different shooting angles, which results in non-uniform facial images. As shown in Fig.1, the original images contain some unwanted non-facial areas, lower brightness and skewed faces. Such a low-quality image dataset cannot be directly employed to construct a TS diagnosis model. Multiple image preprocessing techniques are therefore utilized to handle these issues, including Face Alignment, Facial Area Intercept and Facial Brightness Normalization.

1) FACE ALIGNMENT
To guarantee the consistency of the direction of the faces, face alignment is carried out. The faces are aligned by placing both eyes on the same horizontal line. To acquire the coordinates of the eyes, the 68 feature-points face model [18] is used, as displayed in Fig.2. The left eye and right eye are detected by points 36 to 41 and points 42 to 47, respectively, as summarized in Table 2. The eye centre, located by calculating the mean of each eye's points, is used to indicate the position of each eye. The rotation angle for aligning the two eyes horizontally is then calculated from the coordinates of both eye centres. Finally, all the faces are aligned automatically by rotating the original facial image around the left eye centre by the calculated angle.
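The angle computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: `landmarks` is assumed to be the 68 detected (x, y) points, and the actual rotation would be applied with an image library such as OpenCV (`cv2.getRotationMatrix2D` around the left eye centre).

```python
import math

def eye_centers(landmarks):
    """Mean of points 36-41 (left eye) and 42-47 (right eye) in the
    68-point model; `landmarks` is a sequence of (x, y) tuples."""
    def centre(pts):
        return (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))
    return centre(landmarks[36:42]), centre(landmarks[42:48])

def alignment_angle(landmarks):
    """Rotation angle (degrees) that places both eye centres on the
    same horizontal line."""
    (lx, ly), (rx, ry) = eye_centers(landmarks)
    return math.degrees(math.atan2(ry - ly, rx - lx))
```

Rotating the image around the left eye centre by this angle yields the aligned face.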

2) FACIAL AREA INTERCEPT
Multi-task Cascaded Convolutional Networks (MTCNN) are used to intercept the facial areas from the aligned facial images. MTCNN is an effective tool that can detect and intercept a desirable face area from an image. As a result, the intercepted facial image contains the entire face while non-facial areas are removed. In addition, the width of all intercepted facial images is set to 640 pixels with a locked aspect ratio, which guarantees that the original ratio of face width to face height remains unchanged. This reduces damage to the inherent information of the original images, which benefits the subsequent reliable feature extraction.
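The locked-aspect-ratio resizing can be sketched in a few lines; the fixed 640-pixel width follows the text, while the function name is ours:

```python
def target_size(width, height, new_width=640):
    """Scale an intercepted face region to a fixed width while keeping
    the original width-to-height ratio (locked aspect ratio)."""
    scale = new_width / width
    return new_width, round(height * scale)
```

For example, an 800*1000 face region becomes 640*800, so the width-to-height ratio 0.8 is preserved, unlike a forced 640*640 resize.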

3) FACIAL BRIGHTNESS NORMALIZATION
Grayscale conversion is used to normalize the brightness of an image. The following equation converts the RGB value of a pixel into its grayscale value:

gray = 0.2989 × red + 0.5870 × green + 0.1140 × blue (1)

where red, green and blue are the colour components of a pixel and gray is its grayscale value. These weights are the ones used by the classical MATLAB function rgb2gray [19] and produce excellent results over images in general. The weights in Equation 1 are obtained empirically and are proportional to the sensitivity of the human eye to each of the trichromatic colours: the human eye is most sensitive to green and least sensitive to blue. Finally, all the RGB images are converted to their corresponding grayscale images.
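The luminance weighting above can be sketched directly in code (the weights are those of MATLAB's rgb2gray; the function name is illustrative):

```python
def rgb_to_gray(red, green, blue):
    """Grayscale value of a pixel using the rgb2gray luminance weights:
    green contributes most, blue least, matching eye sensitivity."""
    return 0.2989 * red + 0.5870 * green + 0.1140 * blue
```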

B. FEATURE EXTRACTION
Reasonable features should be carefully extracted from the processed facial images, which retain the original ratio of face width to face height. In order to extract reliable features and benefit the subsequent knowledge discovery, energy features of facial organ blocks and ratio features are considered, as analyzed in the last paragraph of Section II. The energy features of facial organ blocks are acquired in two ways: the rough energy features (REF), based on a rough area segmentation simply using five-row split and six-column split, and the finer energy features (FEF), based on a carefully defined finer area segmentation. The ratio features are obtained in three manners: the rough ratio features (RRF), calculated between pairs of distances acquired from the 68 feature points; the finer ratio features (FRF), defined carefully; and FRF2, calculated between pairs of the defined distances. The five feature sets are extracted as follows.

1) ROUGH ENERGY FEATURES
In order to acquire REF, we carry out rough area segmentation of the facial images to obtain facial organ blocks and compute the energy of each organ block. Specifically: (1) Blocks: Rough area segmentation divides the facial images into several facial organ blocks using row split, column split and entire split, as shown in Fig.3. The row split is utilized to get the row parts of forehead, eyebrows, eyes, nose, cheeks, mouth and chin, as shown in Fig.3(a). The column split is employed to get the column parts of the left cheek, right cheek, left eye, right eye and nose, as exhibited in Fig.3(b). The entire split is a combination of row split and column split, which segments the facial images into several organ blocks, as displayed in Fig.3(c). As a result, 6, 5 and 30 blocks are acquired, respectively. The specific split lines used in rough area segmentation are listed in Table 3. (2) Energy: For the facial organ blocks acquired by the rough area segmentation, the energy of each block is calculated by:

e(B) = avg(p(B)^2) = (1 / (width × height)) Σ_ij p(B)_ij^2

where B stands for a block, a matrix of pixels with values in (0, 255); e(B) denotes block B's energy; p(B) normalizes each pixel value in B to (0, 1); width and height are block B's dimensions; and avg(·) represents the average value.
The joint energy of a facial image is then:

JE(F) = [E1(F), E2(F), E3(F)]

where JE(F) denotes facial image F's joint energy; e(·) is the energy of a block; and E1(·), E2(·) and E3(·) are the energy vectors of the blocks segmented by row split, column split and entire split, as shown in Fig.3(a)-(c), respectively. Hence, REF is the joint energy of all 41 blocks acquired by the rough area segmentation.
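The block energy and joint energy can be sketched as follows; the assumption that p(·) divides each pixel value by 255 is ours, inferred from the "normalize to (0, 1)" description:

```python
def block_energy(block):
    """Energy of a block: pixel values (0-255) are normalized to (0, 1)
    by dividing by 255, squared, and averaged over the block."""
    n = sum(len(row) for row in block)  # width * height of the block
    return sum((v / 255.0) ** 2 for row in block for v in row) / n

def joint_energy(blocks):
    """REF/FEF sketch: the joint energy vector of all segmented blocks."""
    return [block_energy(b) for b in blocks]
```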

2) FINER ENERGY FEATURES
Though the rough area segmentation obtains the main centre part of each facial organ, the information of more precise and detailed facial areas is also significant for TS diagnosis and analysis. Accordingly, a finer area segmentation is carried out, as shown in Fig.4. The resulting 9 segmented parts are displayed in Fig.4(b)-(j), and Fig.4(a) displays all 9 parts in one facial image. The detected areas belonging to each segmented part are shown beside them, i.e., in the second and fourth columns of Fig.4. As a result, a total of 34 areas are detected. Just like REF, FEF is the joint energy of the detected 34 areas, which may reflect the facial characteristics more precisely.

3) ROUGH RATIO FEATURES
For each facial image, reliable ratio features are extracted rather than distance features, as analyzed in the last paragraph of Section II. To acquire RRF, the Euclidean distances between each pair of points in the 68 facial feature-points model displayed in Fig.2 are calculated first. As a result, a total of 2278 (2278 = 68 × 67/2) distances are obtained.
If all ratio features between pairs of different distances were calculated directly, there would be a ''curse of dimensionality'', as the number of ratio features would exceed 2.5 million. To avoid this high dimensionality, the face width of each image is used to normalize the distances, i.e., only the ratios between the distances and the face width are calculated. The face width is the distance between points 0 and 16. Finally, a total of 2277 RRF are obtained, not 2278, because the trivial feature distance(0,16)/distance(0,16) is removed.
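The RRF computation can be sketched as follows (a minimal illustration; the landmark layout as a list of 68 (x, y) tuples is an assumption):

```python
import itertools
import math

def rough_ratio_features(points):
    """RRF sketch: pairwise Euclidean distances between the 68 landmark
    points, each normalized by the face width (distance between points
    0 and 16); the trivial ratio d(0,16)/d(0,16) is dropped, leaving
    2277 features."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    face_width = dist(points[0], points[16])
    feats = []
    for i, j in itertools.combinations(range(len(points)), 2):
        if (i, j) == (0, 16):
            continue  # skip the trivial face-width / face-width ratio
        feats.append(dist(points[i], points[j]) / face_width)
    return feats
```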

4) FINER RATIO FEATURES
Although RRF includes a large number of ratio features calculated between Euclidean distances, the ratios between some vertical or horizontal distances are not involved, such as the ratio of forehead_width to forehead_height shown in Fig.4(b_1). Thereby, we define 46 distances for the organ blocks acquired by the finer area segmentation, which are marked in blue in Fig.4. Based on these distances, two kinds of ratio feature sets are obtained, i.e., FRF and FRF2. FRF, with 52 features, is defined in Table 4. From this table, we can find that the calculation of FRF features utilizes not only the distances listed in Fig.4 but also the height and width of the face. The face_width is the distance between points 0 and 16 shown in Fig.2, and the face_height is the height of the intercepted facial image.
To make full use of the defined distances displayed in Fig.4, FRF2 is also calculated between all the defined distances. As a result, a total of 1035 FRF2 features are obtained, which include some but not all of the FRF features, because face_width and face_height are not involved.

C. CLASSIFIER CONSTRUCTION
In this component, classifiers based on the five extracted feature sets are constructed. To establish an effective TS diagnosis model, SVM is considered due to its stability, as it has been widely used in medical diagnosis [20]. Moreover, ensemble learning is investigated to improve the classification performance. Accordingly, single classifiers for each feature set and ensemble classifiers combining all single classifiers are built.

1) SINGLE CLASSIFIERS
For the five feature sets, SVM is employed to establish single classifiers respectively. Considering that some redundant and irrelevant features may affect the classifiers' performance, especially for classifiers established from a large number of features, e.g., RRF (2277 features) and FRF2 (1035 features), PCA and KPCA are applied to the extracted feature sets to reduce dimensionality. Single classifiers, including a base SVM classifier, a PCA+SVM classifier and a KPCA+SVM classifier, are established for each feature set. As a result, a total of 15 single classifiers are constructed, i.e., 5 SVM classifiers, 5 PCA+SVM classifiers and 5 KPCA+SVM classifiers. Details of the adopted SVM, PCA and KPCA methods are described below.
(1) Support Vector Machine. SVM is a popular supervised classification and regression tool, originally motivated by the two-class pattern recognition problem. SVM attempts to minimize the classification error while maximizing the geometric margin (i.e., the distance to the nearest samples of the two classes). Thus, SVM is also called a maximum-margin classifier, and the nearest training samples form the SVs. Moreover, to handle non-linear data, SVM can perform non-linear classification through the kernel trick, which maps the input into a high-dimensional feature space where the data are linearly separable.
For TS recognition, given a set of labeled training data D, let (x_i, y_i) denote the training pairs, where x_i is the feature vector of a facial image and y_i is the corresponding label +1 or −1, i.e., TS or non-TS. The SVM algorithm seeks the maximum-margin hyperplane, and the solution is obtained through the method of Lagrange multipliers:

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0,

where α_i and α_j denote the Lagrange multipliers; C is the regularization parameter; and φ(·) is a non-linear mapping, which usually maps x_i into a higher-dimensional space. This problem can be solved based on the Karush-Kuhn-Tucker conditions. The solution is a linear combination of the SVs, which are the examples lying closest to the decision boundary. Notably, the kernel function is defined as K(x_i, x_j) = φ(x_i)^T φ(x_j), which is exploited to transform the original data space into a high-dimensional, linearly separable space. In our work, we investigate multiple kernels (i.e., Gaussian kernel, Laplace kernel, Polynomial kernel and Polyplus kernel) for the classification task.
The kernel with the optimal parameter is applied to construct the prediction model.
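The four investigated kernels can be sketched as plain functions. Note that the exact Laplace and "Polyplus" forms are not given in the paper; the versions below (L1-distance Laplace, inhomogeneous polynomial for Polyplus) are common conventions and should be read as assumptions:

```python
import math

def gaussian(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2 * sigma ** 2))

def laplace(x, z, sigma=1.0):
    """Laplace kernel, one common form: exp(-||x - z||_1 / sigma)."""
    d1 = sum(abs(a - b) for a, b in zip(x, z))
    return math.exp(-d1 / sigma)

def polynomial(x, z, degree=2):
    """Polynomial kernel: (x . z)^degree."""
    return sum(a * b for a, b in zip(x, z)) ** degree

def polyplus(x, z, degree=2):
    """'Polyplus' kernel, assumed here to mean (x . z + 1)^degree."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree
```

In practice, the kernel and its parameter (sigma or degree) would be chosen by cross-validation, as the text describes.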
(2) Principal Component Analysis and Kernel PCA. PCA is applied to reduce dimensionality and expose the internal structure of the data. It is one of the most widely used statistical techniques for dimensionality reduction [21]-[23]. PCA uses all of the original variables to acquire a smaller set of new variables (i.e., PCs), which can be utilized to approximately reconstruct the original variables. It is an orthogonal linear transformation that converts the original data into a new coordinate system such that the greatest variance under some projection of the data lies on the first coordinate, i.e., the first PC, the second greatest variance on the second coordinate, and so on. The more correlated the original variables are, the fewer new variables are required. PCs are ranked by the magnitude of their corresponding eigenvalues, so the first few PCs retain most of the variation present in the original sets.
Given a data matrix M of n samples, each described by q variables denoted f_1, ..., f_q, PCA computes a new variable z_1 as the first PC, which accounts for the greatest variation in the q variables. The first PC is a linear combination of the q variables, z_1 = γ_11 f_1 + γ_12 f_2 + ... + γ_1q f_q, whose coefficient vector γ_1 = (γ_11, γ_12, ..., γ_1q) maximizes the sample variance subject to γ_1^T γ_1 = 1. Similarly, the second PC is z_2 = γ_2^T f with f = (f_1, ..., f_q); to achieve the greatest remaining variance, γ_2 must satisfy γ_2^T γ_2 = 1 and γ_2^T γ_1 = 0. In general, the v-th PC is z_v = γ_v^T f, which attains the greatest variance subject to γ_v^T γ_v = 1 and γ_v^T γ_j = 0 (j < v). To find the coefficients γ_v of the v-th PC, we solve the optimization problem of maximizing Var(z_v) = γ_v^T H γ_v, where H is the covariance matrix of the original variables. The solution γ_v is the eigenvector of H corresponding to the v-th largest eigenvalue. The eigenvalues are obtained from the equation |H − λE| = 0, where E is the identity matrix and λ an unknown eigenvalue of H. The eigenvectors are then ordered by their eigenvalues; eigenvectors with larger eigenvalues correspond to PCs that capture more variance of the original data. In addition, before applying PCA, the original data should be scaled to unit variance to make the analysis meaningful.
The kernel trick can also be introduced to PCA, yielding Kernel PCA (KPCA) [24]. The idea of KPCA is to first map the original input vectors x_t into a high-dimensional feature space φ(x_t), whose dimension is assumed to be larger than the number of training samples, and then perform linear PCA on φ(x_t); in practice this is done through the kernel matrix, without computing φ(x_t) explicitly.
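The linear PCA described above can be sketched with NumPy via eigendecomposition of the covariance matrix (an illustrative implementation, not the authors' code):

```python
import numpy as np

def pca(X, k):
    """Linear PCA via eigendecomposition of the covariance matrix H.
    X is (n_samples, q_features); columns are scaled to unit variance
    first, as the text advises.  Returns the top-k PC scores and the
    loading vectors (the coefficients gamma_v)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # scale to unit variance
    H = np.cov(Xs, rowvar=False)               # covariance matrix H
    eigvals, eigvecs = np.linalg.eigh(H)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # largest variance first
    loadings = eigvecs[:, order[:k]]           # gamma_1, ..., gamma_k
    return Xs @ loadings, loadings
```

The loading vectors are orthonormal, and each successive score column captures no more variance than the previous one, matching the constraints above.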

2) ENSEMBLE CLASSIFIERS
In order to improve the accuracy of TS diagnosis, we use ensemble learning to construct effective TS prediction models. An ensemble method combines multiple weak classifiers into a stronger model based on a certain strategy; the more diverse the weak classifiers, the better the results the ensemble generates. In our study, the 15 single classifiers, trained on the five well-defined feature sets described in the previous section, are combined to form the final, stronger TS classifier.
Two classical ensemble methods, i.e., the non-weighted voting method and the weighted voting method, are employed to build powerful TS models. The non-weighted voting method combines the weak classifiers by the conventional majority voting principle, deciding the final class of a sample according to the label predicted by the majority of the weak classifiers. The weighted voting method assigns a weight to each weak classifier, and the label that receives the largest weighted vote is taken as the final prediction.
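Both voting schemes can be sketched as follows. Labels are +1 (TS) and −1 (non-TS) as in the paper; the choice of weights (e.g., each classifier's validation accuracy) is an assumption, since the paper does not specify it here:

```python
def weighted_vote(predictions, weights):
    """Weighted voting over binary labels (+1 = TS, -1 = non-TS):
    each classifier's vote is scaled by its weight and the sign of
    the weighted sum is the ensemble's decision."""
    s = sum(w * p for p, w in zip(predictions, weights))
    return 1 if s > 0 else -1

def majority_vote(predictions):
    """Non-weighted voting: plain majority of the predicted labels,
    i.e., weighted voting with equal weights."""
    return weighted_vote(predictions, [1.0] * len(predictions))
```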

D. KNOWLEDGE DISCOVERY
To discover potentially important features among the numerous extracted features for identifying TS, feature analyses based on SVs and on PCs are carried out respectively. The process of knowledge discovery is described below.

1) FEATURE ANALYSIS BASED ON SUPPORT VECTORS
The SVs acquired from the SVM classifiers can be utilized for feature analysis to dig out latent significant features, because the SVs are the subset of the training data containing the most informative patterns. For each feature set, the corresponding SVM classifiers are constructed. The maximum margins of SVM classifiers trained with different kernels and parameters provide valuable information for the feature analysis of TS. By ranking the maximum margins of the different SVM classifiers, we select the best SVM model, i.e., the one with the largest maximum margin. Accordingly, for each feature set, the SVs derived from the best SVM model are used for feature analysis.
To identify the crucial features based on the SVs, three widely used feature ranking methods are employed: the chi-square test, information gain (IG) and the analysis of variance (ANOVA) F-test. For each ranking method, the top 10 features with the highest scores are chosen, and the features selected by at least two of the methods are taken as the important features in identifying TS.
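The "selected at least twice" rule can be sketched as follows (the method names and data layout are illustrative):

```python
from collections import Counter

def consensus_features(rankings, top=10, min_votes=2):
    """Features appearing in the top-`top` list of at least `min_votes`
    of the ranking methods (e.g., chi-square, IG, ANOVA F-test).
    `rankings` maps a method name to its feature names, best first."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(ranked[:top])
    return sorted(f for f, c in votes.items() if c >= min_votes)
```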

2) FEATURE ANALYSIS BASED ON PRINCIPAL COMPONENTS
The PCs derived from PCA are employed to analyze which features are more important in recognizing TS. Each PC is a linear combination of all of the original variables. PCs are ranked by their corresponding eigenvalues, i.e., the larger the eigenvalue, the more important the PC, and the first few PCs retain most of the variance of the original data. To discover the crucial features conveniently, the number of PCs that represents about 93% of the variance is chosen, owing to its small size. Since there are 41 (34, 2277, 52, 1035) variables in REF (FEF, RRF, FRF, FRF2) to be considered for each PC, only the variable with the maximum absolute coefficient value in each PC is reported. In summary, the variable with the maximum absolute coefficient value in the first PC is regarded as the most important feature for identifying TS; other significant PCs are also presented.
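The loading analysis described above can be sketched with plain numpy: rank PCs by eigenvalue, keep enough of them to reach a cumulative-variance threshold, and report the variable with the largest absolute coefficient in each retained PC. The data here is synthetic, with one variable deliberately given dominant variance.

```python
import numpy as np

# Sketch of the PC loading analysis: eigendecompose the covariance matrix,
# keep PCs up to a cumulative-variance threshold, and report the index of
# the variable with the maximum absolute coefficient in each retained PC.

def pc_top_variables(X, threshold=0.93):
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
    return [int(np.argmax(np.abs(eigvecs[:, i]))) for i in range(k)]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 2] += 4.0 * rng.normal(size=100)  # variable 2 dominates the variance
print(pc_top_variables(X)[0])  # 2: variable 2 drives the first PC
```

As expected, the high-variance variable carries the maximum absolute coefficient of the first PC, which is exactly the reporting rule used for the 93%-variance subset of PCs.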

A. IMAGE DATA
The facial images used in this paper were acquired from the Peking Union Medical College Hospital. Written informed consent was obtained from all participants or their guardians, and the research was approved by our institutional ethical review board. The facial images were collected from TS females ranging from 3 to 17 years of age; 90% of them are aged 9-14 years, the age range in which short stature and/or delayed puberty are usually suspected. The controls were selected from a large number of non-Turner individuals in the hospital within the same age group. Their frontal images are employed to automatically identify TS with the help of various ML techniques. From July 2016 to December 2017, a total of 98 cases and 530 controls were obtained after excluding unqualified images, such as blurry images and images with exaggerated facial expressions. As is conventional in ML, cases (i.e., TS individuals) and controls (i.e., non-TS individuals) are labelled '+1' and '−1', respectively.

B. EXPERIMENT SETTINGS
The original dataset is divided into a training set and a testing set, where the training set includes 80% of the cases and 80% of the controls, and the testing set consists of the rest of the dataset. On the training set, 3-fold cross validation is employed to decide the optimal parameters of the classifiers. With the acquired optimal parameters, the whole training set is reused to establish the final classifiers. The testing set is used to evaluate the performance of the final classifiers built by the different techniques.
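The stratified 80/20 split described above (80% of cases and 80% of controls into training) can be sketched on index lists as follows. The random seed and helper name are illustrative, not from the paper.

```python
import random

# Sketch of the stratified 80/20 split: 80% of cases (+1) and 80% of
# controls (-1) go to the training set, the rest to the testing set.

def stratified_split(labels, train_frac=0.8, seed=42):
    rng = random.Random(seed)
    train, test = [], []
    for cls in (+1, -1):  # cases (+1) and controls (-1)
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)

labels = [+1] * 98 + [-1] * 530  # 98 TS cases, 530 controls, as in the dataset
train, test = stratified_split(labels)
print(len(train), len(test))  # 502 126
```

With 98 cases and 530 controls, this yields 78 + 424 = 502 training samples and 126 testing samples, preserving the class imbalance in both partitions.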
The number of components is chosen so that 99.5% of the information in each feature set is retained, which yields good classifier performance. The amount of information is measured as the cumulative percentage of the eigenvalues over the sum of all eigenvalues.
For the kernels used in SVM and KPCA, the parameter t of the Gaussian kernel and of the Laplace kernel is varied from 10^−5 to 10^5 by multiplicative steps of 10, and the parameter d of the Polynomial kernel and of the Polyplus kernel (with r=1) is varied from 2 to 15 with a step size of 1. The grid search method is applied to select the optimal parameters for the KPCA+SVM classifiers.
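The parameter grids just described can be written out explicitly; the dictionary layout below is an illustrative way to organize them for a grid search, not the paper's actual code.

```python
# Sketch of the kernel-parameter grids described above: the Gaussian and
# Laplace kernels sweep t over 10^-5 .. 10^5 multiplicatively, and the
# Polynomial and Polyplus kernels sweep the degree d over 2 .. 15.

t_grid = [10.0 ** e for e in range(-5, 6)]  # 10^-5, 10^-4, ..., 10^5
d_grid = list(range(2, 16))                 # 2, 3, ..., 15

grids = {
    "gaussian":   {"t": t_grid},
    "laplace":    {"t": t_grid},
    "polynomial": {"d": d_grid},
    "polyplus":   {"d": d_grid, "r": [1]},  # r is fixed to 1
}

print(len(t_grid), len(d_grid))  # 11 14
```

Each kernel thus contributes 11 or 14 candidate settings, and grid search evaluates every candidate with 3-fold cross validation on the training set.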

C. EVALUATION METRICS
To assess the performance of the TS classifiers, three widely used evaluation metrics are considered, i.e., accuracy (acc), sensitivity (se) and specificity (sp). Accuracy indicates the overall ability of the TS prediction model to make the correct decision in the detection of TS. Sensitivity represents the ability of the model to classify the positive TS images correctly. Specificity refers to the ratio of the number of negative images with correctly predicted results to the number of all negative images. The three criteria are calculated according to Eq.(10), (11) and (12).

D. RESULTS AND DISCUSSION

1) COMPARISON OF PREDICTION PERFORMANCE OF SVM SINGLE CLASSIFIERS
The prediction results of the five SVM single classifiers are presented in Fig.5. There is little difference in accuracy between RRF and FRF. The feature set FRF2 obtains the lowest accuracy. By analyzing the features in FRF2, we find that the ratios between the face height or width and the other defined distances are not included, whereas some ratios of this kind are included in FRF. Therefore, it can be concluded that this kind of ratio feature may play a significant role in the prediction of TS. Meanwhile, all the classifiers have a high specificity with a low sensitivity. One possible reason is the imbalance between the TS cases and controls, in which the number of cases is nearly one-fifth of the number of controls. Another reason is that not all of the cases exhibit the special facial appearance, while some of the controls do, which significantly affects the performance of the classifiers.
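The three metrics defined in this subsection follow the standard confusion-matrix definitions: acc = (TP+TN)/(TP+TN+FP+FN), se = TP/(TP+FN) and sp = TN/(TN+FP). A minimal sketch, with purely illustrative counts:

```python
# The three evaluation metrics computed from confusion-matrix counts
# (TP, TN, FP, FN). The counts below are illustrative, not the paper's.

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
    se = tp / (tp + fn)                    # sensitivity: recall on TS cases
    sp = tn / (tn + fp)                    # specificity: recall on controls
    return acc, se, sp

acc, se, sp = metrics(tp=8, tn=90, fp=10, fn=2)
print(round(acc, 4), round(se, 4), round(sp, 4))  # 0.8909 0.8 0.9
```

Because the controls greatly outnumber the cases, a classifier can score a high acc and sp while se stays low, which is exactly the pattern discussed for the imbalanced TS dataset.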

2) COMPARISON OF PREDICTION PERFORMANCE OF PCA/KPCA+SVM SINGLE CLASSIFIERS
The results of the PCA+SVM and KPCA+SVM classifiers are presented in Fig.6 and Fig.7, respectively. The number of components is decided by retaining 99.5% of the variance in each feature set. Compared with the results shown in Fig.5, we observe that not all the classifiers improve in accuracy when the dimensionality reduction methods are used. In Fig.6, the PCA+SVM classifier built on FEF achieves the highest accuracy. There are apparent improvements in accuracy for RRF and FRF2, but the accuracies for FEF and FRF remain unchanged, and the accuracy even decreases for REF. Similar findings can be observed in Fig.7.

3) COMPARISON OF PREDICTION PERFORMANCE OF ENSEMBLE CLASSIFIERS
The prediction results of the two ensemble methods (i.e., the non-weighted voting and weighted voting methods) are presented in Fig.8. Both ensemble classifiers achieve better performance than the 15 single classifiers. Compared with the non-weighted voting method, the weighted approach obtains a higher accuracy of 0.9127. However, the se value is still low, which may be caused by the imbalanced data and by the TS cases that do not exhibit the special facial appearance.

4) PERFORMANCE COMPARISON BETWEEN THE PROPOSED TS MODEL AND OTHERS
Comparing the performance of our proposed method with the related works mentioned in Section II, the weighted ensemble classifier established by our method achieves the highest specificity of 0.9905 and the second highest accuracy of 0.9127, which is very close to the highest accuracy of 0.913 acquired by Gao et al. [16]. However, the sensitivity is lower than those acquired by Song et al. [14], Chen et al. [15] and Yao et al. [17]. We conjecture that the difference may be caused by differences in the quality of the datasets. In addition, the imbalance of our dataset may also have a negative impact on the proposed model. Moreover, Chen et al. [15] also reported the diagnosis result of clinical workers on their dataset, which included 32 cases and 96 controls. However, only sensitivity and specificity were presented, which were 57.4 ± 21.9% and 75.4 ± 17.3%, respectively. These results show that the diagnoses made by clinical workers from facial images fluctuate considerably. Therefore, compared with clinical workers, the proposed model is more stable and effective.

E. KNOWLEDGE DISCOVERY FROM SUPPORT VECTORS AND PRINCIPAL COMPONENTS 1) FEATURE ANALYSIS BASED ON SUPPORT VECTORS
Three feature ranking methods (i.e., the chi-square test, IG and the ANOVA F-test) are applied to the SVs to select the discriminative features. For each ranking method, the top 10 features are listed in Table 5. The features selected at least twice are regarded as significant features for identifying TS and are discussed below.
(1) For REF, the energy features of parts 10, 16, 23, 29, 30, 34, 35, 36 and 40 are selected at least twice. These important energy features should attract the attention of physicians in discovering the potential facial characteristics of TS patients. For example, for the cases and controls in the SVs, the mean values of the energy feature of part 23 (the nasal root area) are 0.2251 and 0.2370, respectively; the energy of part 23 in cases is thus smaller than that in controls. For part 35 (the lower part of the right cheek), the energy has a higher mean value in cases (0.1387) than in controls (0.1238). These findings may enable physicians to carry out further studies to find unknown facial characteristics.
(2) For RRF, the ranked features are distances normalized by the face width of each individual. The features selected at least twice correspond to the point pairs (6, 51), (6, 52), (6, 53) and (7, 66). Referring to the points in Fig. 2, we observe that these features are clustered in the mouth and chin areas. The feature 6_51 is the ratio distance(6, 51) / face_width. The mean value of feature 6_51 (6_52, 6_53, 7_66) is 0.3464 (0.3872, 0.4179, 0.2898) for cases and 0.3303 (0.3711, 0.4020, 0.2776) for controls in the SVs; the TS patients thus have larger ratios for all four features. These findings could be further analyzed by physicians to discover useful information.
(3) For FEF, the features selected at least twice are exactly those chosen by the chi-square test. It is found that the energies of right_externalcanthus and left_externalcanthus, corresponding to both ends of the rectangular areas shown in Fig. 4 (d), have potential value. For the cases and controls in the SVs, the mean energy of the left_externalcanthus area is 0.2738 and 0.2980, respectively, and the mean energy of the right_externalcanthus area is 0.1518 and 0.1529, respectively. Therefore, TS patients have less energy in their external canthus areas. The rlowercheek_area, which represents the right lower cheek area, has a higher mean energy for cases (0.1401) than for controls (0.1358). This finding is similar to the one for part 35 (the lower part of the right cheek) in REF.
(4) For FRF, the features forehead_hwratio (the forehead height to width ratio), left_palpebral_fissure_hwratio (the left palpebral fissure height to width ratio), down_lip_hratio (the lower lip height to face height ratio), lcheek_whratio and rcheek_whratio (the left and right cheek width to height ratios), lzygoma_whratio and rzygoma_whratio (the left and right zygoma width to height ratios), and llowercheek_whratio and rlowercheek_whratio (the left and right lower cheek width to height ratios) are selected at least twice. For cases and controls, the mean values of forehead_hwratio are 0.3883 and 0.3935, respectively. The smaller values in cases may indicate that smaller forehead heights and larger forehead widths occur more frequently in TS patients.
(5) For FRF2, none of the features is selected twice or more. We conjecture that a single feature may not have strong predictive power by itself, but that combinations of these features may contribute to the classification decision.

2) FEATURE ANALYSIS BASED ON PRINCIPAL COMPONENTS
To discover the potentially important features, the number of PCs representing more than 93% of the variance is chosen. Since there are 41 (2277, 34, 52, 1035) variables in REF (RRF, FEF, FRF, FRF2) to be considered, only the variable with the maximum absolute coefficient value in each PC is reported. Tables 6-10 summarize the PCs of the five feature sets.
As shown in Table 6, for the feature set REF, the first PC accounts for 61.16% of the information in all the resulting PCs. Variable 26 exhibited in Fig. 3 (c), i.e., the average energy of the left zygoma part, has the maximum absolute coefficient value and thus contributes the most important information to the first PC.
As displayed in Table 7, for RRF, the first three PCs account for 74.94% of the information in all the resulting PCs. The ratios of the distances between the point pairs (7,23), (11,30) and (16,30) exhibited in Fig.2 to the face width have the maximum absolute coefficient values for the first three PCs, respectively.
As presented in Table 8, for FEF, the first PC accounts for 70.65% of the information in all the resulting PCs. From this table, we find that the average energy of lzygoma_area (the left zygoma area) shown in Fig.4 (i) is of significant value, which is similar to the finding observed in Table 6.
As shown in Table 9, for FRF, the first two PCs account for 70.48% of the information in all the resulting PCs. The variables rzygoma_whratio and lzygoma_whratio (i.e., the width to height ratios of the right and left zygoma areas) are the most important for the first two PCs, respectively.
As listed in Table 10, for FRF2, the first PC accounts for 64.59% of the information in all the resulting PCs. The variable forehead_width/up_lip_height_mid (i.e., the ratio of the forehead width to the height of the upper lip) has the maximum absolute coefficient value for the first PC.
In summary, information about the left zygoma area (its energy or its ratios) is involved in three first PCs (i.e., the first PC of REF, FEF and FRF), which implies that the left zygoma is informative for TS diagnosis. Other informative variables are also listed in Tables 6-10, which may help physicians find special facial features of TS patients.

V. CONCLUSION
In this paper, we propose a new TS diagnosis model based on facial images that preserves the original ratio of face width to face height, extracts reliable features and carries out feature analysis based on SVs and PCs. The proposed TS diagnosis model consists of Image Preprocessing, Feature Extraction, Classifier Construction and Knowledge Discovery. In the Image Preprocessing, the facial area is intercepted and the original ratio of face width to face height is preserved. In the Feature Extraction, the reliable feature sets REF, FEF, RRF, FRF and FRF2 are extracted. For the Classifier Construction, 15 single classifiers and 2 ensemble classifiers are built; the weighted voting method achieves the highest accuracy of 0.9127. FEF outperforms the other four feature sets, while FRF is also a competitive feature set that consistently obtains relatively good performance on the TS prediction task. For Knowledge Discovery, by analyzing the SVs acquired from the SVM classifiers, multiple significant features are observed, such as less energy in the external canthus areas and a lower ratio of forehead height to forehead width in TS patients. By analyzing the PCs, we find that the energy and ratio features of the left zygoma area of TS patients are important, which could be further studied by physicians.
In the future, we will try to propose more effective and reliable feature extraction methods to benefit the feature analysis of TS, aiming to extract the deeper latent facial features involved in TS. Furthermore, deep learning techniques will also be applied to the TS dataset to construct more accurate and reliable TS prediction models.
JIJIANG YANG received the B.S. and M.S. degrees from Tsinghua University and the Ph.D. degree from the National University of Ireland, Galway. His research areas involve e-health, e-government/e-commerce, privacy preservation, information resource management, data mining, and cloud computing. He is currently an Associate Professor at Tsinghua University. He worked for the Computer Integrated Manufacturing System/Engineering Research Center (CIMS/ERC), Tsinghua University, from 1995 to 1999. He joined or led different projects funded by the State Hi-Tech program (863 programs), NSF (China), and the European Union. Since 2009, his main focus has been e-health and medical service. He has undertaken a few projects in the National Science & Technology Supporting Program on the Digital Medical Service model and key technologies. He is also collaborating with many medical institutions and hospitals. He is a member of the expert committee of the Internet of Things (IoT) in health at the Chinese Electronic Association, and of the Expert Committee of remote medicine and cloud computing at the Chinese Medicine Informatics Association. He has published more than 60 articles in professional journals and conferences.

QING WANG received the Ph.D. degree from Tsinghua University. He became a Researcher with the Research Institute of Information Technology, Tsinghua University. His research interests include web service technology, data mining, and machine learning, among other interests, especially in the healthcare field. Recently, his research has focused on big data technology applied in medical services.

VOLUME 8, 2020