Automated Tongue Feature Extraction for ZHENG Classification in Traditional Chinese Medicine

ZHENG, the Traditional Chinese Medicine syndrome, is an integral and essential part of Traditional Chinese Medicine theory. It defines the theoretical abstraction of the symptom profiles of individual patients and is thus used as a guideline in disease classification in Chinese medicine. For example, patients suffering from gastritis may be classified as Cold or Hot ZHENG, whereas patients with different diseases may be classified under the same ZHENG. Tongue appearance is a valuable diagnostic tool for determining ZHENG in patients. In this paper, we explore new modalities for the clinical characterization of ZHENG using various supervised machine-learning algorithms. We propose a novel color-space-based feature set, which can be extracted from tongue images of clinical patients to build an automated ZHENG classification system. Given that Chinese medical practitioners usually observe the tongue color and coating to determine a ZHENG type and to diagnose different stomach disorders including gastritis, we propose using machine-learning techniques to establish the relationship between the tongue image features and ZHENG by learning through examples. The experimental results obtained over a set of 263 gastritis patients, most of whom suffer from Cold ZHENG or Hot ZHENG, and a control group of 48 healthy volunteers demonstrate an excellent performance of our proposed system.


Introduction
Traditional Chinese Medicine (TCM) has a long history in the treatment of various diseases in East Asian countries and is also a complementary and alternative medical system in Western countries. TCM takes a holistic approach to medicine with emphasis on the integrity of the human body and the close relationship between a human and his or her social and natural environment [1]. TCM applies different therapeutic methods to enhance the body's resistance to disease and to aid prevention. TCM diagnosis is based on the information obtained from four diagnostic processes, that is, looking, listening and smelling, asking, and touching. The most common tasks are taking the pulse and inspecting the tongue [2]. For thousands of years, Chinese medical practitioners have diagnosed the health status of a patient's internal organs by inspecting the tongue, especially the patterns on the tongue's surface. The tongue mirrors the viscera. Changes in the tongue can objectively reflect the state of a disease, which can help differentiate syndromes, establish treatment methods, prescribe herbs, and determine the prognosis of a disease.
ZHENG (TCM syndrome) is an integral and essential part of TCM theory. It is a characteristic profile of all clinical manifestations that can be identified by a TCM practitioner. ZHENG is an outcome after analyzing all symptoms and signs (tongue appearance and pulse feeling included). All diagnostic and therapeutic methods in TCM are based on the differentiation of ZHENG, and this concept is as ancient as TCM in China [3]. ZHENG is not simply an assemblage of disease symptoms but rather can be viewed as the TCM theoretical abstraction of the symptom profiles of individual patients. As noted in the abstract, ZHENG is also used as a guideline in TCM disease classification. For example, patients suffering from the same disease may be grouped into different ZHENGs, whereas different diseases may be grouped as the same ZHENG. The Cold ZHENG (Cold syndrome) and the Hot ZHENG (Hot syndrome) are the two key statuses of ZHENG [3]. Other ZHENGs include Shen-Yang-Xu ZHENG (Kidney-Yang deficiency syndrome), Shen-Xu ZHENG (Kidney deficiency syndrome), and Xue-Yu ZHENG (Blood Stasis syndrome) [4].
In this paper, we explore new modalities for the clinical characterization of ZHENG using various supervised machine-learning algorithms. Using an automated tongue image diagnosis system, we extract objective features from tongue images of clinical patients and analyze the relationship with their corresponding ZHENG data and disease prognosis (specifically stomach disorders, i.e., gastritis) obtained from clinical practitioners. We propose a system that learns from the clinical practitioner's subjective data how to classify a patient's health status by extracting meaningful features from tongue images using a rich set of features based on color-space models. Our premise is that Chinese medical practitioners usually observe the tongue color and coating to determine ZHENG, such as Hot or Cold ZHENG, and to diagnose different stomach disorders including gastritis. Hence, we propose using machine-learning techniques to establish the relationship between the tongue image features and the ZHENG by learning through examples. We are also interested in the correlation between the Hot and Cold patterns observed in ZHENG gastritis patients and their corresponding symptom profiles.
Various types of features have been explored for tongue feature extraction and tongue analysis, including texture [5], color [6][7][8], shape [9], and spectrum [8], among others. A systematic tongue feature set, comprising a combination of geometric features (size, shape, etc.), cracks, and textures, was later proposed by Zhang et al. [10]. Computer-aided tongue analysis systems based on these types of features have also been developed [11,12]. Our goal is to provide a set of objective features that can be extracted from patients' tongue images, based on the knowledge of ZHENG, to improve the accuracy of an objective clinical diagnosis. Our proposed tongue feature set is based on an extensive color model. This paper is organized as follows: in Section 2, we provide a TCM descriptive view of the physiology of the tongue. An overview of the proposed feature extraction and learning framework, along with a complete description of the color-space model feature set, is presented in Section 3. Our experimental results and analysis on a tongue image dataset from gastritis patients with Cold ZHENG and Hot ZHENG are discussed in Section 4, before we draw our conclusions and propose plans for future work in Section 5.

Tongue Diagnosis in TCM
TCM holds that the tongue has many relationships and connections in the human body, both to the meridians and to the internal organs. It is, therefore, very useful and important during inspection for confirming a TCM diagnosis, as it can present strong visual indicators of a person's overall physical and mental harmony or disharmony. In TCM, the tongue is divided into the tongue tip, tongue margins, tongue center, and tongue root. Figure 1(a) shows each part of the tongue and its correspondence to different internal organs according to TCM, while Figure 1(b) illustrates how we geometrically obtain an approximation of these regions from the tongue image. The tongue tip reflects the pathological changes in the heart and lungs, while the bilateral sides of the tongue reflect those of the liver and gallbladder. The pathological changes in the spleen and stomach are mirrored by the center of the tongue, while changes in the kidneys, intestines, and bladder correspond to the tongue root.
In this paper, we focus on patients with a stomach disorder, namely gastritis. Hence, we are interested in extracting features not only from the entire tongue image but also specifically from the middle region, as this corresponds to the stomach according to TCM. We extract the middle rectangular region, illustrated in Figure 1(b), as our approximation of the tongue middle region.
The practitioner examines the general and local shape as well as the color of the tongue and its coating. According to TCM, the normal tongue is pale red with thin white coating. Some signs of imbalance or pathology are red body, yellow coating, or thick coating like mozzarella cheese, and so forth. Some characteristic changes occur in the tongue in some particular diseases. Most tongue attributes are on the tongue surface. A TCM doctor looks at several attributes of tongue body: color, moisture, size, shape, and coating. These signs not only reveal overall states of health but they also correlate to specific organ functions and disharmonies, especially in the digestive system.
The two main characteristics of the tongue in TCM ZHENG diagnosis are the color and the coating. The color of the patient's tongue provides information about his or her health status. For example [13], a dark red color can indicate inflammation or ulceration, while a white tongue indicates cold attack, mucus deposits, or a weakness in the blood leading to such conditions as anemia [12]. Moreover, a yellow tongue points to a disorder of the liver and gallbladder, and blue or purple implies stagnation of blood circulation and a serious weakening of the part of the digestive system that corresponds to the area of the tongue where the color appears.
The coating on the tongue is discriminated not only by its presence but also by its color. The color could be yellow, white, or another color. However, the color in the image is not the exact true color of the tongue. To properly identify the color of the tongue coating, we applied the specular component technique presented in our prior work on tongue detection and analysis [2]. Figure 2 illustrates different tongue images of patients and their corresponding ZHENG class.
[...] system so that we can predict not only the color and coating on the tongue, but also different ZHENGs of the gastritis patients. These features are designed to capture different color characteristics of the tongue. While a single feature may not be very discriminative, our premise is that the aggregation of these features will be discriminative. We leave it to the learning algorithm to determine the weight/contribution of each feature in the final classification.

Tongue Feature Extraction and Classification Framework
Most color spaces are represented as tuples of numbers, normally three or four color components. Color components determine the position of the color in the color space used. There are many color spaces defined for different purposes. We designed a set of 25 features that span the entire color-space model. They can be grouped under eight categories: RGB, HSV, YIQ, Y'CbCr, XYZ, L * a * b * , CIE Luv, and CMYK.
In this section, we first describe in detail how we compute each feature $f^i_n$ ($n = 1, \ldots, 25$) for the $i$th pixel in the image. Then, we explain how the per-pixel features are aggregated to obtain the feature vector $F_j = \{F_n\}$ for tongue image $j$.
3.1.1. RGB. RGB is an additive color system, based on trichromatic theory, in which red, green, and blue light components are added together to produce a specific color. The RGB model encodes the intensity of red, green, and blue, respectively. Each of $(R_i, G_i, B_i)$ for a pixel is an unsigned integer between 0 and 255. Each RGB feature $\{f^i_n \mid n = 1, \ldots, 3\}$ represents the normalized intensity value of the red, green, and blue component, respectively, of the $i$th pixel in the image. We denote the normalized value of each component by $r_i = R_i/255$, $g_i = G_i/255$, and $b_i = B_i/255$. All the remaining color-space model features described in our feature set derive their values from the RGB feature set.
3.1.2. HSV. The HSV color space represents color using a 3-tuple of hue, saturation, and value. It separates the luminance component of the color from the chrominance information. For each pixel $p_i$, let $M_i = \max\{r_i, g_i, b_i\}$ represent the maximum value of the pixel's RGB triple and $m_i = \min\{r_i, g_i, b_i\}$ the minimum value. We also denote the difference between the maximum and minimum values of each RGB tuple by $\Delta_i = M_i - m_i$. The HSV features $\{f^i_n \mid n = 4, \ldots, 6\}$ are then the hue, saturation, and value of the $i$th pixel, respectively.
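The exact HSV equations are not legible in this copy of the text, so the following is only a minimal sketch using the textbook RGB-to-HSV conversion expressed with the maximum, minimum, and their difference. The function name `hsv_features`, the scaling of hue to [0, 1], and the example pixel are assumptions for illustration, not the authors' implementation.

```python
import colorsys

def hsv_features(R, G, B):
    """Return (hue, saturation, value), each in [0, 1], for one 8-bit RGB pixel."""
    r, g, b = R / 255.0, G / 255.0, B / 255.0   # normalized RGB components
    M, m = max(r, g, b), min(r, g, b)           # maximum and minimum of the triple
    delta = M - m                               # difference between max and min
    value = M
    saturation = 0.0 if M == 0 else delta / M
    if delta == 0:
        hue = 0.0                               # gray pixel: hue undefined, use 0
    elif M == r:
        hue = ((g - b) / delta) % 6
    elif M == g:
        hue = (b - r) / delta + 2
    else:
        hue = (r - g) / delta + 4
    return hue / 6.0, saturation, value         # hue rescaled from sextants to [0, 1]

if __name__ == "__main__":
    pixel = (200, 50, 100)  # hypothetical reddish pixel, for illustration only
    print(hsv_features(*pixel))
    # Cross-check against the standard-library conversion.
    print(colorsys.rgb_to_hsv(*(c / 255.0 for c in pixel)))
```

The cross-check with `colorsys.rgb_to_hsv` is included because the standard library uses the same convention, which makes it easy to confirm the hand-written version pixel by pixel.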

YIQ.
The YIQ color model is the color space used for NTSC television transmission. The Y component represents the perceived luminance, while the I and Q components carry the color information. The letter I stands for "in-phase" and the letter Q stands for "quadrature." A color can be placed on a graph with I as the horizontal axis and Q as the vertical axis. The YIQ system takes advantage of the characteristics of human color perception [14,15].
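As a rough illustration of the RGB-to-YIQ mapping, the sketch below applies the commonly cited NTSC coefficients to a normalized pixel. The coefficients and the function name `yiq_features` are assumptions, since the transformation matrix used in this work is not legible in this copy.

```python
def yiq_features(R, G, B):
    """Return (Y, I, Q) for one 8-bit RGB pixel using commonly cited NTSC weights."""
    r, g, b = R / 255.0, G / 255.0, B / 255.0   # normalized RGB components
    y = 0.299 * r + 0.587 * g + 0.114 * b       # perceived luminance
    i = 0.596 * r - 0.274 * g - 0.322 * b       # in-phase chroma
    q = 0.211 * r - 0.523 * g + 0.312 * b       # quadrature chroma
    return y, i, q

if __name__ == "__main__":
    print(yiq_features(200, 50, 100))    # hypothetical reddish pixel
    print(yiq_features(128, 128, 128))   # gray pixel: both chroma terms vanish
```

Each chroma row sums to zero, so any gray pixel has I and Q equal to zero up to rounding, which is a quick sanity check on the coefficients.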
The gamma-corrected function is defined with a = 0.055. Thus, the XYZ model consisting [...]
3.1.6. L * a * b * . The CIE L * a * b * color space is a nonlinear transformation of the CIE XYZ color space [17]. CIE L * a * b * tries to imitate the logarithmic response of the human eye. The L * component is designed to match closely with human perception of lightness. The other two components describe the chroma. The forward transformation of the CIE XYZ color space to CIE L * a * b * is computed as follows: [...] where {δ} denotes the D65 white point.
Evidence-Based Complementary and Alternative Medicine
3.1.7. CIE Luv. CIE Luv, or L * u * v * , is a color space computed from a transformation of the CIE XYZ color space by the International Commission on Illumination (CIE) in order to achieve perceptual uniformity [17]. As with CIE L * a * b * , the D65 white point is denoted by {δ}. [...]
3.1.8. CMYK. The CMYK color space is a subtractive color system mainly used in the printing industry [16]. The components consist of cyan, magenta, yellow, and neutral black. It is common to translate RGB values displayed on monitors to CMYK values for printing. The CMYK components can be computed from the RGB model [...]. The CMYK features $\{f^i_n \mid n = 22, \ldots, 25\}$ are the cyan, magenta, yellow, and black values of the $i$th pixel, respectively.
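The RGB-to-CMYK equations are missing from this copy, so the sketch below uses the widely used naive conversion, in which black is the complement of the brightest channel. This formula and the name `cmyk_features` are assumptions and may differ from the conversion actually used in the paper.

```python
def cmyk_features(R, G, B):
    """Return (cyan, magenta, yellow, black) in [0, 1] for one 8-bit RGB pixel."""
    r, g, b = R / 255.0, G / 255.0, B / 255.0   # normalized RGB components
    k = 1.0 - max(r, g, b)                      # black: complement of brightest channel
    if k == 1.0:                                # pure black pixel: avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

if __name__ == "__main__":
    print(cmyk_features(200, 50, 100))  # hypothetical reddish pixel
```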

Aggregate Operators for the Feature Vectors.
To train our classification model using this set of features, we need to combine the features per pixel into one composite feature vector $F_j = \{F_n\}$ per tongue image (or region) $j$. We aggregate the pixel features using two different statistical averages (mean and median) and the standard deviation values. We derive five variations of feature vectors for our automated tongue ZHENG classification system using the following operators: mean (μ F), median (med F), standard deviation (σ F), "mean plus standard deviation" ({μ F, σ F}), and "median plus standard deviation" ({med F, σ F}).
Let $N$ denote the number of pixels in a given tongue image (or region) $j$. The mean feature vector is denoted by μ F j = {μF n }, where μF n is given by $\mu F_n = \frac{1}{N}\sum_{i=1}^{N} f^i_n$, $n = 1, \ldots, 25$. The median feature vector, denoted by med F j = {med F n }, is computed as med F n = mid{sort(F set )}, n = 1, . . . , 25, that is, the middle value of the sorted per-pixel values of feature n. The standard deviation describes the spread of a given feature value around its average value among all the pixels in the given region. Thus, the standard deviation feature vector is denoted by σ F j = {σF n }, where σF n is the standard deviation of $f^i_n$ about μF n over the $N$ pixels. The "mean plus standard deviation," denoted by {μ F, σ F}, is a concatenation of the mean feature vector and the standard deviation feature vector. Similarly, the "median plus standard deviation" feature vector, denoted by {med F, σ F}, is a concatenation of the median feature vector and the standard deviation feature vector. Thus, each of the two concatenated feature vectors contains 50 features.
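The five aggregate operators can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `aggregate`, the dictionary keys, and the use of the population standard deviation (dividing by N rather than N - 1) are assumptions, since the text does not specify the normalization.

```python
import statistics

def aggregate(pixel_features):
    """Aggregate per-pixel feature tuples into the five image-level feature vectors.

    pixel_features: list of N tuples, one per pixel, all of the same length
    (25 in the feature set described above).
    """
    columns = list(zip(*pixel_features))                  # one column per feature n
    mean = [statistics.mean(col) for col in columns]
    median = [statistics.median(col) for col in columns]
    std = [statistics.pstdev(col) for col in columns]     # population std (assumed)
    return {
        "mean": mean,
        "median": median,
        "std": std,
        "mean+std": mean + std,        # concatenation, twice the feature count
        "median+std": median + std,    # concatenation, twice the feature count
    }

if __name__ == "__main__":
    # Three hypothetical pixels with two features each, for illustration only.
    vectors = aggregate([(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)])
    for name, vec in vectors.items():
        print(name, vec)
```

With 25 per-pixel features, the two concatenated vectors have 50 entries each, matching the count stated above.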

Supervised Learning Algorithms for ZHENG Classification.
We apply three different supervised learning algorithms (AdaBoost, support vector machine, and multilayer perceptron network) to build classification models for training and evaluating the proposed automated tongue-based diagnosis system. Each model has its strengths and weaknesses, which we describe briefly below. We empirically evaluate their performance over our dataset.

3.2.1. AdaBoost. An ensemble of classifiers is a set of classifiers whose individual predictions are combined in some way (typically by voting) to classify new examples. Boosting is a type of ensemble classifier which generates a set of weak classifiers using instances drawn from an iteratively updated distribution of the data, where in each iteration the probability of incorrectly classified examples is increased and the probability of the correctly classified examples is decreased. The ensemble classifier is a weighted majority vote of the sequence of classifiers produced.
The AdaBoost algorithm [18] trains a weak or base learning algorithm repeatedly in a series of rounds t = 1, . . . , T. Given a training set $\{x_i, y_i\}_{i=1,\ldots,n}$, where $x_i$ belongs to some domain $X$ and $y_i \in Y = \{-1, +1\}$ (the corresponding binary class labels), we denote the weight of the $i$th example in round $t$ by $D_t(i)$. Initially, all weights are set equally, so $D_1(i) = 1/n$ for all $i$. In each round $t$, a weak learner is trained using the current distribution $D_t$. When we obtain a weak hypothesis $h_t$ with error
$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right],$$
we choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ and update the distribution as
$$D_{t+1}(i) = \frac{D_t(i)\exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t},$$
where $Z_t$ is a normalization factor. The final hypothesis is given by
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$
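To make the reweighting loop concrete, here is a minimal sketch of discrete AdaBoost on one-dimensional data. It assumes the standard Freund and Schapire weighting, in which a weak hypothesis with weighted error e receives the vote 0.5 * ln((1 - e) / e), and it uses threshold stumps as the weak learner purely for brevity. All function names and the toy data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak hypothesis: +polarity if x >= threshold, otherwise -polarity."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (error, threshold, polarity) of the stump with lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (+1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train an ensemble of (alpha, threshold, polarity) triples."""
    n = len(xs)
    dist = [1.0 / n] * n                                  # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, dist)
        error = min(max(error, 1e-10), 1.0 - 1e-10)       # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - error) / error)     # vote of this weak learner
        # Raise the weight of misclassified examples, lower the rest, renormalize.
        dist = [d * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                for d, x, y in zip(dist, xs, ys)]
        z = sum(dist)                                     # normalization factor
        dist = [d / z for d in dist]
        ensemble.append((alpha, threshold, polarity))
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the weak hypotheses."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]          # toy 1-D feature values
    ys = [1, 1, -1, -1, 1, 1]        # no single stump separates these labels
    model = adaboost(xs, ys, rounds=3)
    print([predict(model, x) for x in xs])
```

The toy labels are chosen so that no single stump is correct everywhere, which shows why combining several reweighted weak learners helps.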

Support Vector Machine.
The support vector machine (SVM) [19] is one of the best-known general-purpose learning algorithms. The goal of the SVM is to produce a model which predicts target values of data instances in the testing set given a vector of feature attributes. It attempts to maximize the margin of separation between the support vectors of each class and to minimize the error in case the data is not linearly separable. SVM classifiers usually perform well in high-dimensional spaces, avoid overfitting, and have good generalization capabilities. Given a training set $\{x_i, y_i\}_{i=1,\ldots,n}$, the SVM model for an instance $x$ can be written as [20]
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) + b,$$
where $k$ is the kernel function used (a polynomial kernel in this work), $\alpha_i$ is the Lagrange multiplier, and $b$ is a constant; the predicted class is given by the sign of $f(x)$. In our work, we utilize the sequential minimal optimization (SMO) algorithm [21], which gives an efficient way of solving the dual problem of the support vector machine optimization problem.
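The sketch below evaluates a kernel SVM decision value as a weighted sum of kernel evaluations between the support vectors and a new instance, plus a bias. It does not implement SMO training. The support vectors, multipliers, bias, and polynomial kernel parameters are made-up values for illustration, and the names `poly_kernel` and `svm_decision` are hypothetical.

```python
def poly_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel (u . v + coef0) ** degree; the parameters are assumed values."""
    return (sum(a * b for a, b in zip(u, v)) + coef0) ** degree

def svm_decision(x, support_vectors, labels, alphas, bias, kernel=poly_kernel):
    """Decision value: sum of alpha_i * y_i * k(x_i, x) over support vectors, plus bias."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

if __name__ == "__main__":
    # Hypothetical trained model with two support vectors.
    support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
    labels = [+1, -1]
    alphas = [0.5, 0.5]
    bias = 0.0
    for x in [(2.0, 3.0), (-2.0, 1.0)]:
        value = svm_decision(x, support_vectors, labels, alphas, bias)
        print(x, value, "class +1" if value >= 0 else "class -1")
```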

Multilayer Perceptron Networks.
The multilayer perceptron network (MLP) [22] is a feed-forward neural network with one or more layers that are hidden from the input and output nodes. Neural networks have the ability to learn complex data structures and approximate any continuous mapping [23]. The model of each neuron in the network includes a nonlinear activation function that is differentiable, such as the sigmoid. Each unit performs a biased weighted sum of its inputs and passes this activation level through the transfer function to produce its output, given by $y = \varphi(w^T x + \theta)$, where $\varphi$ is the transfer function, $w$ is the synaptic weight vector, $x$ is the input vector, $\theta$ is the bias constant, and $T$ is the transpose operator. For K-class classification, the MLP uses back propagation to implement nonlinear discriminants. There are K outputs with softmax as the output nonlinearity.
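A single unit and the softmax output layer can be illustrated as follows. This is a minimal sketch of the forward computation only (no back propagation), assuming a sigmoid transfer function; the function names and example weights are hypothetical.

```python
import math

def sigmoid(a):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron_output(w, x, theta):
    """Biased weighted sum of the inputs passed through the sigmoid."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return sigmoid(activation)

def softmax(scores):
    """Convert K output scores into class probabilities that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    print(neuron_output([0.4, -0.2], [1.0, 2.0], 0.1))  # hypothetical weights and inputs
    print(softmax([1.0, 2.0, 3.0]))
```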

Dataset Labeling and Preprocessing.
Our proposed system relies on a labeled dataset to effectively build an automated tongue-based ZHENG classification system. Our dataset comprises tongue images from 263 gastritis patients and a control group of 48 healthy volunteers. Most of the gastritis patients have been classified as Hot or Cold ZHENG and are identified with a color label (yellow or white) based on the color of the coating of their tongue, as determined by their Chinese doctors. The doctors also compile a detailed profile of the ZHENG symptoms for each patient based on clinical evaluations. The list of the main symptom profile terms is summarized in Table 1.
We are also interested in the relationship between TCM diagnosis and Western medicine diagnosis; hence, for a subset of the patients, we are provided with their corresponding Western medical gastritis pathology. They are grouped into two categories: superficial versus atrophic. In Western medicine, doctors are also interested in knowing whether the Helicobacter pylori (HP) bacterium found in the stomach is present (positive) or absent (negative) in patients with chronic gastritis. Thus, we are provided with that information for a subset of the patients. It was not feasible to obtain every type of information for every patient. Table 2 summarizes the population of each subset for four different labels (ZHENG, Coating, Pathology, and HP).

Experimental Setup.
In this section, we evaluate the performance of our proposed ZHENG classification system using the three classification models (AdaBoost, SVM, and MLP) described in Section 3.2. We compare the performance of training the classifier models using the set of features extracted from the entire tongue image versus the middle tongue region only. As mentioned in Section 2, in TCM, it is believed that the middle tongue region provides discriminant information for diagnosing stomach disorders. Hence, we extract features from the middle tongue region, as illustrated in Figure 1(b), to evaluate the performance compared to extracting features from the entire tongue region. In training and testing our classification models, we employ a 3-fold cross-validation strategy. This implies that the data is split into three sets; one set is used for testing and the remaining two sets are used for training. The experiment is repeated with each of the three sets used for testing. The average accuracy of the tests over the three sets is taken as the performance measure. For each classification model, we varied the parameters to optimize its performance. We also compare the results obtained using the five different variations of the feature vector (mean = μ F, median = med F, standard deviation = σ F, mean + standard deviation = {μ F, σ F}, and median + standard deviation = {med F, σ F}), as described in Section 3.1. We also apply Information Gain attribute evaluation on the feature vectors to quantify and rank the significance of individual features. Lastly, we apply the Best First feature selection algorithm to select the "significant" features before training the classifiers, in order to compare the performance of training the classifiers with the whole feature set against training with the selected features.
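The 3-fold protocol described above can be sketched as below. The text does not say whether the folds are shuffled or stratified, so this sketch simply deals instances into folds in order; that choice, the function names, and the `evaluate` callback are assumptions for illustration.

```python
def k_fold_indices(n, k=3):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]    # deal instances into k folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n, evaluate, k=3):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    for train, test in k_fold_indices(10, k=3):
        print("train:", train, "test:", test)
```

In practice `evaluate` would train a classifier on the training indices and return its accuracy or F-measure on the test indices.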
The performance metrics used are the classification accuracy (CA) and the average F-measure. CA is defined as the percentage of correctly classified instances over the entire set of instances classified. In our dataset, as described in Table 2, for each data label, the two classes (which we denote by $\{C_1, C_2\}$) are not equally populated. Hence, evaluating the performance of our classifiers using simply the classification accuracy does not paint an accurate picture of the discriminative power of the classifier. Since the dataset distribution is skewed, we can achieve a high accuracy but very poor performance in discriminating between both classes. Thus, we judge our classifiers using the average F-measure obtained for both binary classes. The F-measure combines precision and recall. It measures how well an algorithm can predict an instance belonging to a particular class. Let TP represent true positive, which we define as the number of instances that are correctly classified as $C_1$ for a given test set, while TN denotes true negative, the equivalent for $C_2$ instances. Let FP represent false positive, which we define as the number of instances that are incorrectly classified as $C_1$ for a given test set, while FN denotes false negative, the equivalent for $C_2$ instances. Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Thus, the F-measure is defined as
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
For both binary classes $\{C_1, C_2\}$, let $(|C_1|, |C_2|)$ denote the total number of instances belonging to class $C_1$ and $C_2$, respectively; then the average F-measure is defined as
$$F_{\text{avg}} = \frac{|C_1| \, F(C_1) + |C_2| \, F(C_2)}{|C_1| + |C_2|}.$$
In all the tables illustrating the different experimental results, we highlight the best F-measure obtained along with the corresponding classification accuracy of the classifier.
Table 1 (partial). Hot-ZHENG-related symptoms: fever (heat, hot), cold diet/drink preferred, desires cold environment, red flushing of face, thirsty, obvious bad mouth breath, acidic saliva, yellow urine, hard stool, constipation, and feeling hot at limbs.
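The two metrics can be computed from a list of true and predicted labels as sketched below. The class-size weighting of the per-class F-measures is an assumption inferred from the surrounding definitions (the averaging formula itself is not legible in this copy), and the function names and toy labels are hypothetical.

```python
def f_measure(y_true, y_pred, positive):
    """F-measure for one class, treating `positive` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0                               # precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f_measure(y_true, y_pred):
    """Per-class F-measures averaged with weights equal to the class sizes (assumed)."""
    classes = sorted(set(y_true))
    total = len(y_true)
    return sum(y_true.count(c) * f_measure(y_true, y_pred, c) for c in classes) / total

def classification_accuracy(y_true, y_pred):
    """Fraction of correctly classified instances."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 2, 2]   # toy, imbalanced labels
    y_pred = [1, 1, 1, 2, 2, 1]
    print(classification_accuracy(y_true, y_pred))
    print(average_f_measure(y_true, y_pred))
```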

Classification Results Based on Tongue Coating and ZHENG for Gastritis Patients.
The experimental results presented in this section analyze the discrimination among the gastritis patients based on their tongue coating color and ZHENG category. Table 3 summarizes the results obtained using our proposed color-space feature vector to train the classifiers to automatically classify the color of the coating of a gastritis patient's tongue as yellow or white. We can observe from Table 3 that the combination of the median and standard deviation feature values ({med F, σ F}) yields the best result for both the entire tongue region and the middle tongue region only. The results for both regions are also very comparable.
When using the entire tongue region, the top three significant features for the color coating classification, ranked by the information gain attribute, were {σF 9 , med F 12 , σF 2 }, which denote the standard deviation of Q chroma (YIQ model), the median of Cr component (YCbCr), and the standard deviation of Green Channel (RGB), respectively. For the middle tongue region only, the top three were {σF 9 , σF 20 , med F 4 }, which denote the standard deviation of Q chroma (YIQ model), the standard deviation of u component (L * u * v * ), and the median of the Hue (HSV). It is also interesting to observe that the top ten significant features for the entire region and for the middle tongue region have six features in common.
The result obtained on ZHENG classification between the Hot and Cold groups is shown in Table 4. For the ZHENG classification, using the standard deviation feature values (σ F) performs best when dealing with the entire tongue region while the {med F, σ F} feature vector is the top performer for the middle tongue region only.
For ZHENG classification between Hot and Cold syndromes for gastritis patients, when using the entire tongue region, only one feature was considered significant by the information gain attribute: σF 9 , that is, the standard deviation of Q chroma (YIQ model). For the middle tongue region, the most important feature is σF 20 , the standard deviation of u component (L * u * v * ). Even though the noteworthy feature in the entire tongue area and the middle tongue area is not the same, both the Q component in the YIQ color space and the u component in the L * u * v * color space show the difference from green to red in the chromaticity diagram.
Table 5 summarizes the results obtained when we train different classifiers to detect the presence of the HP bacteria in a gastritis patient using the color feature vector. The classification result obtained in learning the pathology groups of the patients (superficial versus atrophic) is shown in Table 6. Neither result is very strong, which illustrates a weak correlation between the Western medicine diagnosis and the tongue information utilized by Chinese medical practitioners. No feature was identified as significant in either case.
Tables 7-10 present the classification between the two pathology types of gastritis patients according to ZHENG category. Table 7 summarizes the results obtained using our proposed color-space feature vector to train the classifiers to automatically classify between the superficial group and the atrophic group for patients labeled as Cold ZHENG. The results obtained on classification between the superficial group and the atrophic group for Hot ZHENG patients are shown in Table 8. We can observe from Table 7 that the σ F feature vector performed best for the entire tongue region while the {med F, σ F} feature vector yielded the best result for the middle tongue region.
Similarly, from Table 8 we can observe that for the Hot ZHENG patients, for the middle tongue region, the {med F, σ F} feature vector also performed best. However, {μ F, σ F} feature vector performs best when dealing with the entire tongue region.
When using the entire tongue region, the top three significant features for the pathology classification between superficial and atrophic in Cold ZHENG, ranked by the information gain attribute, were {σF 9 , σF 6 , σF 1 } which denote the standard deviation of Q chroma (YIQ model), the standard deviation of value component (HSV), and the standard deviation of Red Channel (RGB), respectively.
In Table 8, when using the entire tongue region, the top three significant features for the pathology classification between superficial and atrophic in Hot syndrome, ranked by the information gain attribute, were {μF 22 [...]
The next set of experimental results focuses on training our classifier using our proposed color-space feature vector to discriminate Hot ZHENG from Cold ZHENG in each pathology group. Table 9 summarizes the results obtained to train the classifiers to automatically classify between Hot and Cold ZHENG for superficial gastritis patients. Table 10 reflects the results for atrophic gastritis patients. We can observe from Table 9 that both {μ F, σ F} and {med F, σ F} feature vectors perform the best for both the entire tongue region and the middle tongue region. From results in Table 10, using the standard deviation feature values ({μ F, σ F}) performs best when dealing with the entire tongue region while the {μ F, σ F} feature vector is the top performer for the middle tongue region.
When using the entire tongue region, the top three significant features for the ZHENG classification between Hot syndrome and Cold syndrome in the patients who are superficial, ranked by the information gain attribute, were {σF 9 , med F 3 , med F 18 }, which denote the standard deviation of Q chroma (YIQ model), the median of Blue Channel (RGB), and the median of the blue sensitivity Z component, respectively. For the middle tongue region only, the top three were med F 24 , σF 19 , and med F 5 , which denote the median of Yellow Ink (CMYK), the standard deviation of lightness component (Luv model), and the median of saturation (HSV). It is also interesting to observe that the set of the top five significant features using the entire region and the set from the middle tongue region have the Yellow Ink (CMYK) in common.
When using the entire tongue region, there is only one significant feature for the ZHENG classification between Hot syndrome and Cold syndrome in patients who are atrophic, ranked by the information gain attribute: σF 9 , which denotes the standard deviation of Q chroma (YIQ model). For the middle tongue region only, there were two significant features: {μF 19 , μF 3 }, which denote the mean of the blue sensitivity Z component (XYZ) and the mean of the Blue Channel (RGB).

Classification Results for Gastritis Patients versus Control
Group. The experimental results presented in this section analyze the discrimination between the gastritis patients and control group. Table 11 summarizes the results obtained using our proposed color-space feature vector to train the classifiers to automatically classify patients with coating on tongue versus healthy patients with normal tongue (without coating). We can observe from Table 11 that the {med F, σ F} feature vector yields the best result for the entire tongue region while for the middle tongue region, it was the σ F feature vector.
When using the entire tongue region, the top three significant features for distinguishing between normal tongue and tongue with coating, ranked by the information gain attribute, were {σF 1 , σF 6  The results obtained from the classification between the normal group and the entire set of patients with ZHENG syndrome is shown in Table 12. The {μ F, σ F} feature vector performs best when dealing with the entire tongue region while the {med F, σ F} feature vector is the top performer for the middle tongue region.
When using the entire tongue region, the top three significant features for the classification between the normal group and the gastritis group, ranked by the information gain attribute, were {σF 1 , σF 6 Tables 13 and 14 show the results of training our classifiers to discriminate between the normal group and the Hot ZHENG patients only, and then normal group versus Cold ZHENG patients only. Table 13 illustrates the results for normal versus hot ZHENG. We can observe that the σ F feature vector performs best both for the entire tongue region and the middle tongue region. From Table 14, when only the normal versus Cold ZHENG patients is considered, the same feature vector, {μ F, σ F}, performs best for both cases, however, considering only the middle tongue region outperforms using the entire tongue region. When using the entire tongue region, the top three significant features for the classification between the normal group and the gastritis patients with Hot syndrome, ranked by the information gain attribute, were {σF 1 , σF 6 , σF 25 } which denote the standard deviation of Red Channel (RGB), the standard deviation of value component (HSV), and the standard deviation of Black Ink (CMYK), respectively. For the middle tongue region only, there were only two significant features: {σF 13 , σF 14 } which denote the standard deviation of lightness component (L * a * b) and the standard deviation of a * component (L * a * b * ). When the set of the top ten significant features using the entire region versus the set from the middle tongue region are compared, they both have the lightness and a * component (L * a * b * ) in common.
When using the entire tongue region, the top three significant features for the classification between the normal group and the gastritis patients with Cold syndrome, ranked by the information gain attribute, were {σF 25, σF 22, σF 1}, which denote the standard deviation of Black Ink (CMYK), the standard deviation of Cyan Ink (CMYK), and the standard deviation of Red Channel (RGB), respectively. For the middle tongue region only, the top three were {σF 13, μF 22, σF 14}, which denote the standard deviation of the lightness component (L*a*b*), the mean of Cyan Ink (CMYK), and the standard deviation of the a* component (L*a*b*).
Table 15 shows the results of training our classifiers to discriminate between the normal group and the superficial patients, while Table 16 shows the results for the normal group versus the atrophic patients. When using the entire tongue region, the top three significant features for the classification between the normal group and the superficial group, ranked by the information gain attribute, were {σF 1, σF 6, …}.
When using the entire tongue region, the top three significant features for the classification between the normal group and the atrophic group, ranked by the information gain attribute, were {μF 25, μF 22, μF 1}, which denote the mean of Black Ink (CMYK), the mean of Cyan Ink (CMYK), and the mean of Red Channel (RGB), respectively. For the middle tongue region, the top three were {med F 16, σF 13, σF 23}, which denote the median of the red sensitivity X component (XYZ), the standard deviation of lightness (L*a*b*), and the standard deviation of Cyan Ink (CMYK).
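The information gain ranking referred to throughout this section can be illustrated with a minimal sketch. It assumes each numeric feature is discretized by a single split at its median, which may differ from the discretization used in the experiments; the feature names and toy values are hypothetical and are not data from this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain from splitting one numeric feature at its median.

    The median split is an assumption made for this sketch only.
    """
    threshold = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


def rank_features(table, labels):
    """Return feature names sorted from most to least informative."""
    gains = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical toy data: four "normal" and four "hot" samples.
labels = ["normal"] * 4 + ["hot"] * 4
table = {
    "sigma_red": [1, 2, 3, 4, 10, 11, 12, 13],    # separates the classes
    "mu_cyan": [1, 10, 2, 11, 3, 12, 4, 13],      # carries no class information
    "sigma_black": [1, 2, 3, 10, 4, 11, 12, 13],  # partially informative
}
ranking = rank_features(table, labels)
```

A feature that splits the two groups cleanly receives a gain of 1 bit, while a feature whose values are spread evenly across both groups receives a gain of 0.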

4.4. Analysis of Classification Results. From the experimental results presented in Sections 4.2 and 4.3, we can draw the following conclusions. Firstly, concerning the performance of the different classification models, we observe that the MLP and SVM models usually outperformed the AdaBoost model. The multilayer perceptron neural network appears best suited to learning the complex relationships between the color features of the tongue images and the ZHENG/coating classes. However, both the MLP and SVM models have many parameters to consider and optimize, while AdaBoost is a much simpler model. In the AdaBoost model, we use a decision tree as our base weak learner and vary the number of classifiers to optimize its performance.
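The idea of boosting a weak tree learner while varying the number of classifiers can be sketched as follows. This is an illustrative discrete AdaBoost, not the implementation used in the experiments: it assumes a depth-one decision tree (a decision stump) as the weak learner, class labels coded as +1 and -1, and a hypothetical one-feature toy data set.

```python
import math


def stump_predict(stump, row):
    """A stump is (feature index, threshold, polarity)."""
    feature, threshold, polarity = stump
    return polarity if row[feature] < threshold else -polarity


def candidate_stumps(rows):
    """Enumerate stumps at midpoints between sorted values of each feature."""
    stumps = []
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        thresholds = [values[0] - 1.0]  # below the minimum: constant prediction
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                stumps.append((feature, threshold, polarity))
    return stumps


def train_adaboost(rows, labels, n_rounds):
    """Return a list of (alpha, stump) pairs after n_rounds of boosting."""
    n = len(rows)
    weights = [1.0 / n] * n
    stumps = candidate_stumps(rows)
    ensemble = []
    for _ in range(n_rounds):
        def weighted_error(stump):
            return sum(w for w, row, y in zip(weights, rows, labels)
                       if stump_predict(stump, row) != y)

        best = min(stumps, key=weighted_error)
        # Clamp the error to avoid division by zero for a perfect stump.
        err = min(max(weighted_error(best), 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(best, row))
                   for w, row, y in zip(weights, rows, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, rows, labels):
    hits = sum(predict(ensemble, row) == y for row, y in zip(rows, labels))
    return hits / len(rows)


# Hypothetical toy data: one feature, the -1 class sits in the middle.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [1, 1, -1, -1, 1, 1]
acc_one_round = accuracy(train_adaboost(rows, labels, 1), rows, labels)
acc_three_rounds = accuracy(train_adaboost(rows, labels, 3), rows, labels)
```

On this toy set a single stump cannot separate the classes, whereas three boosted stumps can, which is why the number of classifiers is treated as the parameter to tune.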
Secondly, we observe that when discriminating within the gastritis patients group (Hot versus Cold ZHENG, yellow versus white coating, etc.), it was more effective to apply the feature vectors to the entire tongue image. When classifying the normal group versus the ZHENG groupings, applying the feature vectors to the middle tongue region only usually improved classifier performance.
Thirdly, from the evaluation of the variations of the feature vectors used, we observe that taking into account both the average and the standard deviation usually resulted in excellent performance. Overall, the mean slightly outperformed the median; that is, {μ F, σ F} slightly outperformed {med F, σ F}. In a few cases, considering only the spread of the values over the region ({σ F}) yielded the best performance. Thus, we can conclude that when deriving a feature vector for the tongue image, both a measure of central tendency (the mean or median) and the standard deviation (which captures the spread of values over the region) are important.
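As an illustration of how such a feature vector might be assembled, the sketch below computes the mean, median, and standard deviation of three color components over the pixels of a region. It is a simplified example under stated assumptions: only the Red channel (RGB), the value component (HSV), and Black Ink (CMYK) are included, Black Ink is computed with the naive formula K = 1 - max(R, G, B), the population standard deviation is used, and the function and key names are hypothetical.

```python
import colorsys
import statistics


def pixel_components(r, g, b):
    """Map one 8-bit RGB pixel to a few color-space components in [0, 1]."""
    rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
    _, _, value = colorsys.rgb_to_hsv(rf, gf, bf)  # HSV value component
    black = 1.0 - max(rf, gf, bf)                  # naive CMYK black ink (assumed)
    return {"red": rf, "value": value, "black": black}


def region_features(pixels):
    """Return mean, median, and standard deviation per component."""
    channels = {}
    for r, g, b in pixels:
        for name, x in pixel_components(r, g, b).items():
            channels.setdefault(name, []).append(x)
    features = {}
    for name, xs in channels.items():
        features["mean_" + name] = statistics.mean(xs)
        features["median_" + name] = statistics.median(xs)
        features["std_" + name] = statistics.pstdev(xs)  # population std (assumed)
    return features


# Hypothetical two-pixel region: one pure red pixel and one black pixel.
features = region_features([(255, 0, 0), (0, 0, 0)])
```

Selecting the `mean_*` and `std_*` entries corresponds to the {μ F, σ F} variant, and selecting the `median_*` and `std_*` entries corresponds to {med F, σ F}.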
Lastly, we observe that although we were not able to effectively discriminate between the pathology groups (superficial versus atrophic, and also the presence of the HP bacterium) using our color-space feature vectors, we were able to classify them much better when we took into account the ZHENG classes. This further strengthens the notion that our proposed color-space feature vectors are able to discriminate between the Hot and Cold ZHENG patients in addition to discerning a ZHENG patient from a non-ZHENG (healthy) patient.

4.5. Applying Feature Selection Algorithm. The classification results presented in Sections 4.2 and 4.3 were obtained using the whole feature set. For each experiment carried out on the entire tongue region, we also applied information gain attribute evaluation to rank the significance of the features. In this section, we apply a feature selection algorithm (Best First) to select only a subset of features, which are deemed significant, before training the classifiers. Our goal is to see if this would yield a better result than using the whole feature set. The Best First algorithm searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.
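The search strategy described here, greedy hill climbing with backtracking, can be sketched as a forward best-first search over feature subsets. The sketch is not the implementation used in the experiments: the subset scoring function `toy_score`, the `max_stale` stopping rule, and the feature names are all hypothetical stand-ins chosen for illustration.

```python
import heapq
from itertools import count


def best_first_search(features, score, max_stale=5):
    """Forward best-first subset search.

    Expands the best-scoring unexpanded subset by adding one feature at a
    time. Keeping earlier subsets on the heap allows backtracking. The search
    stops after max_stale consecutive expansions without improvement.
    """
    start = frozenset()
    tie_breaker = count()  # avoids comparing frozensets inside the heap
    best_subset, best_score = start, score(start)
    open_heap = [(-best_score, next(tie_breaker), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for feature in features:
            if feature in subset:
                continue
            child = subset | {feature}
            if child in visited:
                continue
            visited.add(child)
            child_score = score(child)
            heapq.heappush(open_heap, (-child_score, next(tie_breaker), child))
            if child_score > best_score:
                best_subset, best_score, improved = child, child_score, True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Hypothetical merit function: per-feature usefulness minus a size penalty.
USEFULNESS = {"sigma_red": 0.4, "sigma_value": 0.3, "mu_cyan": 0.0, "sigma_black": 0.2}


def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset) - 0.05 * len(subset)


selected, merit = best_first_search(list(USEFULNESS), toy_score)
```

In practice the scoring function would be a subset evaluator computed from the training data; the toy score only shows how an uninformative feature is left out of the selected subset.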
A summary of the results obtained is shown in Table 17. The normal group refers to the healthy (non-ZHENG) control group. We present the best classification result obtained for each experiment based on the five variations of the feature vectors (μ F, med F, σ F, {μ F, σ F}, {med F, σ F}) and the three different classification models (AdaBoost, SVM, and MLP). As we can observe from Table 17, using the whole feature set to train the classifiers yielded a better result in all cases except for the Atrophic Patients (Hot versus Cold ZHENG) experiment. Thus, we can conclude that, overall, the aggregate of the proposed feature sets is more discriminative, even though some features are more significant than others.

Conclusion and Future Work
In this paper, we propose a novel color-space-based feature set for use in the clinical characterization of ZHENG using various supervised machine-learning algorithms. Using an automated tongue-image diagnosis system, we extract these objective features from tongue images of clinical patients and analyze the relationship with their corresponding ZHENG data and disease prognosis (specifically gastritis) obtained from clinical practitioners. Given that TCM practitioners usually observe the tongue color and coating to determine ZHENG (such as Cold or Hot ZHENG) and to diagnose different stomach disorders including gastritis, we propose using machine-learning techniques to establish the relationship between the tongue image features and ZHENG by learning through examples.
The experimental results obtained demonstrate excellent performance of our proposed system. Our future work will focus on improving the performance of our system by exploring additional tongue image features that can be extracted to further strengthen our classification models. We plan to explore ways to improve our methodology to classify the ZHENGs more accurately, for example, by including a preprocessing step of coating separation prior to the feature extraction phase. Lastly, we plan to evaluate the classification of the other ZHENG types mentioned in Section 1.