Facial Anthropometric Measurements and Photographs — An Interdisciplinary Study

In recent years, automatic facial analysis has attracted much interest in the healthcare and computer vision fields, motivating computer science researchers to study facial anthropometric measurements using photographs. However, to date, there have been no healthcare or computer vision publications that use standardized photographs to differentiate features between sub-ethnic groups by leveraging the power of machine learning on two-dimensional computer vision benchmark data sets (2D CVBDs). The present work is an interdisciplinary study at the interface of the healthcare and computer vision fields that attempts to fill this literature gap: we explore the use of machine learning on 2,789 photographs from eleven 2D CVBDs to identify the k top discriminative features in major and sub-ethnic groups. These features are ranked based on information gain values and p-values. We also provide a comprehensive analysis of using information-gain-based and p-value-based features. Our machine learning model achieves an accuracy of 96-99%, and our findings reveal that information-gain-based features outperform p-value-based features. The top three information-gain-based features in sub-ethnic groups are: dn (distance from the tip of the nose to the center of the mouth), hf (face height) and wn (nose width), while the top three information-gain-based features in major ethnic groups are: de (distance between the inner corners of the eyelids), hf and dn. These results are then compared to the results obtained using standard deep learning techniques such as OxfordNet (VGG16), Residual Networks (ResNet50), and Inception-V3, where an accuracy of 90-94% was seen. We hope that these findings will lead to future collaboration between computer vision and healthcare researchers working on facial anthropometric measurements.


I. INTRODUCTION
In recent years, automatic facial analysis, which includes facial recognition and demographic classification (sex, age, and ethnicity estimation), has attracted much interest in the healthcare and computer vision fields and motivated computer science researchers to study facial anthropometric measurements. In this research, facial landmarks are annotated on standardized photographs (2D face images), 3D representations of human faces, or even the skin of living humans. These landmarks are used when calculating measurements using traditional feature-based approaches (linear and angular) that are handcrafted by researchers and have successfully been used in ethnicity and sex estimation studies. Linear measurement studies [1]-[4] compute horizontal and/or vertical distances between identified anthropometric landmarks using Euclidean distances, whereas angular measurement studies [5]-[7] generate facial angles from the landmarks instead. A recent trend, followed in [8] and [9], is to combine both types of measurement. Table 1 summarizes the data sets and the statistical tests of the discussed anthropometrical measurements-based classification studies. It is worth mentioning that, in contrast to the anthropometrical measurements-based classification scheme, appearance-based classification schemes that utilize machine learning [10]-[13] or deep learning [14]-[16] also exist and have likewise been successful in ethnicity, sex, and age estimation studies. Deep learning offers a radical alternative to traditional feature-based approaches as it performs automatic feature extraction on the facial images to obtain learned features.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
The purpose of doing anthropometrical measurements-based classification is to determine reference ranges of the average soft tissue profile of human faces. A reference range is a set of values that includes upper and lower limits based on a group of healthy individuals. For example, the researchers in [18] found that individuals with major thalassemia have wider heads and faces by comparing their findings to the reference ranges of healthy individuals. Moreover, these values are crucial in healthcare applications such as the measurements of dental arch dimensions [19], diagnosis of craniofacial anomalies [20], setting standards for the planning of facial construction surgeries [21], and establishment of aging patterns [22].
However, it is well established in the healthcare field that a single reference range value cannot be applied to different ethnic and sex groups [4]. Nevertheless, few studies to date have used photography to find the distinguishing features for different ethnic groups, which is therefore the focus of the present study. Most healthcare literature [2], [4], [8], [23], [24] uses anthropometrical measurements-based classification schemes, while most computer vision literature [10]-[12] uses appearance-based classification schemes instead. Although the healthcare and computer vision literature use different classification schemes, both focus on distinguishing features in one or two major ethnic groups, with almost no attention given to sub-ethnic groups, such as Chinese, Indian, and Malay in the Asian category. Reference [25] is a recent survey that provides a detailed review of the state-of-the-art advances in face-race perception, principles, algorithms, and applications.
Moreover, finding discriminative features in multiple major ethnic and sub-ethnic groups has already been carried out on human faces by practitioners, for example, in [26], a study conducted on 1,470 subjects drawn from five regions of the world. We conclude that, to the best of our knowledge, there have been no reports in the healthcare and computer vision fields that use standardized photographs to differentiate features between sub-ethnic groups by leveraging the power of machine learning on 2D CVBDs at a large scale. Thus, this paper attempts to fill this literature gap by exploring the use of machine learning on 2,789 photographs from eleven 2D CVBDs [27]-[37] to identify the k top discriminative features based on ranked information gain values and p-values. The contributions of this paper are summarized below:
• Provides the first evidence that two-dimensional benchmark data sets developed in the computer vision field can be used in facial anthropometric measurement studies of the healthcare field.
• Provides a comprehensive analysis of using information-gain-based and p-value-based features to find the k top discriminative features in major and sub-ethnic groups.
• Proposes the top three features that may be useful in differentiating populations in major and sub-ethnic groups. These results are then compared to the results obtained using standard deep learning techniques such as VGG16 [38], ResNet50 [39], and Inception-V3 [40].
The rest of this paper is organized as follows. Section II provides an overview of the 2D CVBDs that we were able to access, as well as the dilemma faced while categorizing the population data into their major ethnic groups. Section III presents our methodology, describes the experiments, and discusses the obtained results using handcrafted features. We compare these features with the features learned automatically by standard deep learning algorithms in Section IV. Finally, we draw some conclusions in Section V.
VOLUME 8, 2020

II. OVERVIEW OF TWO-DIMENSIONAL COMPUTER VISION BENCHMARK DATA SETS
This section describes the 2D CVBDs that we managed to gain access to and the challenges faced when categorizing the population data into the appropriate major ethnic groups.

A. TWO-DIMENSIONAL COMPUTER VISION DATA SETS
There exist numerous 2D face databases for various purposes. Facial anthropometry researchers build 2D face databases in the healthcare field [1], [2], [8], [9], [17] to evaluate facial feature differences between populations. On the other hand, computer vision researchers build 2D face databases [27]- [37], [41] with various poses, illuminations, expressions as well as different accessories (e.g., spectacles, beard, mustache) to evaluate face recognition and detection algorithms. These images are captured in controlled conditions and are often referred to as the 'gold standard data sets' for testing computer vision algorithms.
We now describe the 2D CVBDs that we managed to obtain access to (see Table 2 for details). We note that the list is not exhaustive; however, it is sufficient to demonstrate that benchmark data sets developed in the computer vision field can be used in the healthcare field to study facial feature differences. One of the main challenges we faced while working on the 2D CVBDs was categorizing the ground truth information (GTI) labels into their major ethnic groups, as the data sets were developed separately in different regions of the world. The study in [42] noted that the present European and American studies on race, ethnicity, and health use poorly defined labels for population studies. However, the search for an accurate definition is controversial for scientific and social reasons as well as due to the changing meaning of ethnicity in the United Kingdom and the United States. We do not argue that an accurate definition is unnecessary. Instead, in the next section we review the challenges we faced, to assist researchers in categorizing populations' GTI labels into the appropriate major ethnic groups.

B. ETHNIC GROUPS CATEGORIZATION
There seems to be a great deal of confusion surrounding the definition of the term ''ethnic'' or ''ethnicity''. One such example is found in the Webster dictionary [43], which defines ''ethnic'' as ''of or relating to large groups of people classed according to common racial, national, tribal, religious, linguistic, or cultural origin or background''. We categorize the GTI labels from the 2D CVBDs based on their ancestry. Here, ancestry refers to the origins of a population, i.e., place of birth of the person or the person's parents or ancestors prior to migrating to a new country.
The US Office of Management and Budget (OMB) defines a WHITE person as having origins from Europe, North Africa, or the Middle East. However, in Britain, Middle Easterners and North Africans are not considered WHITE. We categorize these two controversial populations into the MIDDLE EAST ethnicity as they share the same ancestry. Furthermore, the study in [27] showed there are distinct differences between North Africans and whites in France. Likewise, the ASIAN ethnic group refers to persons of Asian origins, and thus, Indian, Chinese and Japanese people are categorized as Asians. Furthermore, the Taiwanese population is categorized as Chinese, as the authors in [44] showed that more than 95% of Taiwanese people are from China.
Likewise, the BLACK ethnic group is used to categorize people of African origin. However, there have been disputes regarding whether to categorize the Brazilian population into the LATINO ethnicity. For example, OMB does not consider Brazilian Americans to be Latinos as it defines the term LATINO to be synonymous with the term Hispanic, implying a Spanish-speaking society, while Brazil is a Portuguese-speaking society. We have chosen to categorize Brazilians under the LATINO ethnicity based on their Latin American origins. We also categorize populations that have Hispanic labels in the benchmark data sets into the LATINO category, as these data sets were created in the US. As noted earlier, the Hispanic and Latino categories are used synonymously in the US. We summarize the ethnicity categorization used in the paper in Table 3. We indicate major ethnic groups using italicized uppercase letters. On the other hand, we refer to the GTI labels from the 2D CVBDs as sub-ethnic groups and indicate them using italicized lowercase letters. The importance of having sub-ethnic categorization cannot be ignored because, as pointed out by the researchers in [45], their results are weak as they selected subjects randomly from three sub-ethnic groups, and categorized them into a single major ethnic group (i.e., ASIAN).
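The categorization rules stated above can be sketched as a small lookup. This is only an illustrative sketch: the label spellings, the dictionary names, and the `categorize` helper are assumptions introduced here, and only the rules spelled out in the text are encoded (the complete mapping is given in Table 3).

```python
# Hypothetical label spellings; only rules stated in the text are encoded.
# The complete mapping used in the paper is its Table 3.
SUB_ETHNIC_ALIASES = {
    "taiwanese": "chinese",  # Taiwanese subjects are categorized as Chinese
}

MAJOR_GROUP = {
    "chinese": "ASIAN",
    "indian": "ASIAN",
    "japanese": "ASIAN",
    "brazilian": "LATINO",        # based on Latin American origins
    "hispanic": "LATINO",         # used synonymously with Latino in the US
    "middle eastern": "MIDDLE EAST",
    "north african": "MIDDLE EAST",
}


def categorize(gti_label):
    """Return (sub_ethnic_label, major_group) for a raw GTI label."""
    sub = gti_label.strip().lower()
    sub = SUB_ETHNIC_ALIASES.get(sub, sub)
    if sub not in MAJOR_GROUP:
        # Refuse to guess when the text states no rule for this label.
        raise KeyError(f"no rule stated for label: {gti_label!r}")
    return sub, MAJOR_GROUP[sub]
```

Raising an error for unlisted labels, rather than assigning a default group, keeps ambiguous cases visible so that they can be resolved by hand.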

III. EXPERIMENTAL ANALYSIS
This section describes the methodology and provides a comprehensive analysis of the results.

A. METHODOLOGY
In this section, we present our experimental framework comprising data acquisition, preprocessing, feature extraction methods, and classification.

1) DATA ACQUISITION
For this study, we used the face images from the 2D CVBDs described in Section II-A. As our results depend heavily on the accuracy of the feature extraction step, we only consider frontal and neutral face images taken under controlled lighting in indoor environments. Moreover, we manually removed images of subjects wearing accessories or having occluding facial hair, including spectacles and long beards, to prevent occlusions from affecting our final results. Table 4 summarizes the number of images used from each database.

2) PREPROCESSING
This phase consists of detecting and normalizing face images to eliminate noise or inconsistencies that may lead to misleading results.
Face detection: We begin by correcting the illumination of the images using the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm from the Open Source Computer Vision (OpenCV) library, and then locate the head's position to reduce the search space. For face detection, we chose the pre-trained detector from the Dlib library, which combines histogram of oriented gradients (HOG) features with a support vector machine (SVM), over the Haar Cascade face detector. The former works very well for frontal images and is considerably faster and more lightweight than the latter. The output of this step is a bounding box around the detected face in the input image.
Face normalization: Even though the images share similar properties such as controlled lighting in indoor environments, some differences will still be present, in particular, the sizes and positions of faces in each database. We believe this discrepancy may be attributed to the differences in the distance between the camera and the subjects. Therefore, face normalization is an essential step before feature extraction, where the original image is warped and transformed into the desired output coordinate space. To do so, we first feed the output from the previous step to the Dlib facial landmark predictor to retrieve the eye regions, which are used as reference points. We then compute the center of each eye and the angle of the line joining the eye centroids to allow for rotational correction. Next, we apply affine transformations so that both eyes are on the same horizontal line before scaling the faces to approximately the same size. If a failure is detected during this step, we exclude the image from our data set. Figure 1 shows the facial images before and after normalization. We note that a similar face normalization approach was used in [46].
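The rotational correction and scaling described above can be illustrated on landmark coordinates alone. The following is a minimal sketch under stated assumptions, not the implementation used in the study: it transforms points rather than warping the image, and the function names and the `target_eye_distance` parameter are hypothetical.

```python
import math


def eye_angle_and_center(left_eye, right_eye):
    """Angle (degrees) of the line joining two eye centroids, and its midpoint.

    Each eye is an (x, y) centroid in image coordinates.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    return angle, center


def align_points(points, left_eye, right_eye, target_eye_distance=100.0):
    """Rotate and scale points so the eyes are level and a fixed distance apart."""
    angle, center = eye_angle_and_center(left_eye, right_eye)
    theta = math.radians(-angle)  # rotate by the opposite angle to level the eyes
    scale = target_eye_distance / math.dist(left_eye, right_eye)
    aligned = []
    for x, y in points:
        # Translate to the eye midpoint, rotate, scale, then translate back.
        tx, ty = x - center[0], y - center[1]
        rx = tx * math.cos(theta) - ty * math.sin(theta)
        ry = tx * math.sin(theta) + ty * math.cos(theta)
        aligned.append((center[0] + scale * rx, center[1] + scale * ry))
    return aligned
```

Applied to the two eye centroids themselves, the transformed points share the same y-coordinate and are `target_eye_distance` pixels apart, which is the property the normalization step aims for.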

3) FEATURE EXTRACTION METHOD
This phase extracts features using the facial landmark detector of [47], which consists of an ensemble of regression trees pre-trained on the 68 landmark positions annotated on each image of the iBUG 300-W database to estimate landmark positions directly from pixel intensities. This phase's output is 68 (x, y)-coordinates that highlight the contours of seven facial regions: the chin, left eye, left eyebrow, right eye, right eyebrow, nose, and mouth. For example, Figure 2a shows how the 68 annotated landmark positions overlap on an image, whereas the contours indicated by the landmarks are shown in Figure 2b. Of note, we disregard the eyebrows as a prominent facial region, as subjects may shave their eyebrows, and focus on the following four regions: chin, left eye (as eyes are generally shaped identically), nose, and mouth. We then represent these regions using linear measurements, with Euclidean distances computed between identified landmark positions. We used five horizontal and five vertical features. More details are given in Table 5, and these features are used throughout the study. The descriptive data (mean, standard deviation, and maximum and minimum dimensions) of our features are presented in Appendix V-A for the sub-ethnic groups and in Appendix V-B for the major ethnic groups. As we have multiple features with varying ranges of values, feature scaling is necessary when using machine learning algorithms. Therefore, we consider two sets of features, described below, where the statistical significance is tested using Welch's t-test. The researchers in [48] only considered females, regardless of their ethnicity, with an upper lip length in the 18-22 mm range, where the minimum value is 18 mm and the maximum value is 22 mm. They obtained this range, which we refer to as a reference range, by studying 60 females aged between 18 and 35 years.
Their technique assumes that the upper lip length of an average female will be in the reference range and that it will be rare to find a female with an upper lip length of more than 22 mm. However, in our study, instead of normalizing features based on gender, we normalized them based on major and sub-ethnic groups to obtain the reference range. Consequently, our model will fail to classify an individual who is not within the reference range until it is retrained with the new reference range.
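To make the linear measurements and the reference-range idea concrete, the sketch below computes Euclidean distances between pairs of landmarks and rescales a value against a group's minimum and maximum. Several details are assumptions made for illustration: the landmark index pairs (0-based positions in the 68-point scheme) are guesses, since the actual definitions are those of Table 5, and min-max scaling is used as one plausible reading of normalization against a reference range.

```python
import math

# Assumed landmark pairs (0-based indices in the 68-point scheme).
# The definitions actually used in the study are those of Table 5.
FEATURE_PAIRS = {
    "de": (39, 42),  # inner corners of the eyes (assumed indices)
    "wn": (31, 35),  # outer edges of the nostrils (assumed indices)
}


def linear_features(landmarks, pairs=FEATURE_PAIRS):
    """Euclidean distance between each named pair of (x, y) landmarks."""
    return {name: math.dist(landmarks[i], landmarks[j])
            for name, (i, j) in pairs.items()}


def reference_range(values):
    """Lower and upper limits observed for one feature within one group."""
    return min(values), max(values)


def scale_to_range(value, low, high):
    """Min-max scale a value against a reference range (an assumed scheme)."""
    if high == low:
        return 0.0  # degenerate range: every subject has the same value
    return (value - low) / (high - low)
```

With the 18-22 mm upper lip range reported in [48], a length of 20 mm maps to 0.5, and any value outside the range falls outside [0, 1], which mirrors the limitation noted above for individuals outside the reference range.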

4) CLASSIFICATION
To test the effectiveness of features ranked with information gain values compared to the traditional p-values on the raw and normalized feature sets, we employed the extreme gradient boosting (XGBoost) and Select K Best (SKB) feature selection algorithms. The XGBoost library implements the gradient boosting decision tree algorithm for feature selection and classification. The feature selection algorithm ranks the importance of a feature using a value known as information gain; the more prominent the feature is, the higher the information gain value is. These ranked features are fed to the XGBoost classifier for classification. By contrast, the employed SKB feature selection algorithm implements the ANOVA F-value that generates the p-values to order the features. These ranked features are fed to the SVM for classification. Moreover, to find the k top discriminative features for major and sub-ethnic groups, another experiment that assesses the number of features is also designed, where k varies from 1 to 10. As we have imbalanced data sets, we employed stratified K-fold as the cross-validation technique for both the XGBoost and SVM classifiers. We set K to three as the lowest number of samples in the sub-ethnic groups is nine.
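The idea of ranking features by information gain can be illustrated with a small, self-contained example. Note the substitution: XGBoost derives its gain from the loss reduction of the splits in its boosted trees, whereas the sketch below uses the classical entropy-based information gain of a single threshold split as a simplified stand-in. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_gain(values, labels):
    """Largest gain over all midpoints between consecutive distinct values."""
    cuts = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    return max((information_gain(values, labels, t) for t in thresholds),
               default=0.0)


def rank_features(columns, labels):
    """Sort feature names by their best single-split gain, highest first."""
    gains = {name: best_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

Selecting the k top features then amounts to taking the first k entries of the ranked list. For the p-value-based branch, the description above appears to match scikit-learn's `SelectKBest` with the `f_classif` scoring function, evaluated with `StratifiedKFold(n_splits=3)`, although the exact configuration is not stated in the text.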

5) EVALUATION METRICS
We evaluated the described experiments using three machine learning evaluation metrics: accuracy, F1-Score, and confusion matrix. All values were between 0 and 1, and the standard deviations are given in brackets. A perfect 1 is the best score, while a perfect 0 is the worst score.
Accuracy is a metric tied to precision and recall. High precision and recall scores show that the classifier is returning accurate results (high precision) and is returning a majority of all positive results (high recall). F1-Score considers both precision and recall by taking their harmonic mean. We report micro- and macro-averages for F1-Score, each of which has a different interpretation. A micro-average considers each sample equally, whereas a macro-average considers each class equally. The former is preferable for imbalanced data sets, and the latter for balanced data sets. The micro- and macro-averages report similar scores if the data sets are balanced.
This paper reports the micro-average F1-Score due to the class imbalance problem, and the macro-average F1-Score to show the skewed class distribution. On the other hand, we used the normalized confusion matrix to summarize the performance of a classifier where each row represents an actual class, while each column represents a predicted class. A satisfactory confusion matrix will have most of its instances on its main diagonal.
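The difference between the two averages, and the row-normalized confusion matrix, can be seen in a short worked example. The sketch below is illustrative only; the function names are hypothetical and the computation follows the standard definitions for single-label multi-class classification.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so every actual class sums to 1."""
    return [[v / sum(row) if sum(row) else 0.0 for v in row] for row in matrix]


def micro_macro_f1(matrix):
    """Return (micro_f1, macro_f1) computed from a confusion matrix."""
    n = len(matrix)
    tp = [matrix[i][i] for i in range(n)]
    fp = [sum(matrix[r][i] for r in range(n)) - matrix[i][i] for i in range(n)]
    fn = [sum(matrix[i]) - matrix[i][i] for i in range(n)]

    def f1(t, p, q):
        denom = 2 * t + p + q
        return 2 * t / denom if denom else 0.0

    macro = sum(f1(tp[i], fp[i], fn[i]) for i in range(n)) / n  # classes weighted equally
    micro = f1(sum(tp), sum(fp), sum(fn))  # samples weighted equally
    return micro, macro
```

For an imbalanced toy set with eight samples of one class and two of another, where one minority sample is misclassified, the micro-average equals the accuracy (0.9) while the macro-average is lower, because the weaker minority-class score counts as much as the majority-class score.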

B. RESULTS
This section describes the experimental results obtained using handcrafted features on XGBoost and SVM classifiers based on major and sub-ethnic groups.

1) SUB-ETHNIC GROUPS
This section describes the experimental results of using raw and normalized feature sets on the sub-ethnic groups. For the raw feature set, we summarize the top ten information-gain-based features selected by the XGBoost feature selection algorithm in Table 6, and the top ten p-value-based features selected by the SKB feature selection algorithm in Table 7. Of note, both algorithms ranked hn in the same position. Tables 8 and 9 show the effect of varying the number of k top features on the raw feature set, where k varies from 1 to 10. The information-gain-based features achieved the best accuracy and micro-average F1-Score (i.e., 0.62) using the top nine features. Similarly, the p-value-based features obtained the best accuracy and micro-average F1-Score (i.e., 0.67) using all ten features. Of note, Figure 3 shows the confusion matrix of using the top nine information-gain-based features, while Figure 4 shows the confusion matrix of using the top ten p-value-based features. Hence, we can conclude that regardless of the order in which the feature selection algorithms ranked the features, they yielded similar prediction results.
Likewise, in Tables 10, 11, 12 and 13, we show the same statistics but on the normalized feature set. Table 10 shows the top nine normalized information-gain-based features, whereas Table 11 shows the top nine p-value-based features. Both the feature selection algorithms discarded de as a prominent feature as the value became zero after normalization. The algorithms ranked dn on the first position and wf on the fourth position. As indicated in Table 12, the information-gain-based features obtained the best accuracy and micro-average F1-Score (i.e., 0.96) using the top three features. By contrast, the p-value-based features achieved the best accuracy and micro-average F1-Score (i.e., 0.66) using all the nine features, as shown in Table 13. Of note, Figure 5 shows the confusion matrix of using the top three information-gain-based features, while Figure 6 shows the confusion matrix of using all the nine p-value-based features. Therefore, we can conclude that information-gain-based features selected by the XGBoost feature selection algorithm yielded better prediction results than the p-value-based features selected by the SKB feature selection algorithm on the normalized feature set.

TABLE 16. Effect of variation on the k top raw-information-gain-based features in the major ethnic groups. The standard deviation given in brackets.

TABLE 17. Effect of variation on the k top raw-p-value-based features in the major ethnic groups. The standard deviation given in brackets.

TABLE 18. The k top normalized-information-gain-based features in the major ethnic groups. Here k is ten, and the features are ordered based on the highest information gain values.

2) MAJOR ETHNIC GROUPS
This section describes the experimental results of using raw and normalized feature sets based on the major ethnic groups. For the raw feature set, Table 14 summarizes the top ten information-gain-based features and Table 15 summarizes the top ten p-value-based features. The XGBoost and SKB feature selection algorithms ranked we, wf and dn in the same order. As indicated in Table 16, the information-gain-based features obtained the best accuracy and micro F1-Score (i.e., 0.73) using the top nine features. By contrast, p-value-based features achieved the best accuracy and micro F1-Score (i.e., 0.75) using all the ten features (see Table 17). Of note, Figure 7 shows the confusion matrix of using the top nine information-gain-based features, while Figure 8 shows the confusion matrix of using the top ten p-value-based features. Thus, we can conclude that regardless of the order the feature selection algorithms ranked the features, they yielded almost similar prediction results.
Similarly, Tables 18, 19, 20 and 21 show the same comparisons, but on the normalized feature set. Table 18 summarizes the top ten information-gain-based features, whereas Table 19 summarizes the top ten p-value-based features. The feature selection algorithms ranked de and hf in similar positions. The p-value-based features required all the ten features to obtain the best accuracy and micro F1-Score (i.e., 0.84). By contrast, the information-gain-based features only required the top three features to achieve the best accuracy and micro F1-Score (i.e., 0.99). See Tables 20 and 21 for details. Furthermore, Figure 9 shows the confusion matrix of using the top three information-gain-based features, while Figure 10 shows the confusion matrix of using all the ten p-value-based features. Therefore, we can conclude that information-gain-based features selected by the XGBoost feature selection algorithm yielded better prediction results than the p-value-based features selected by the SKB feature selection algorithm in the normalized feature set.

TABLE 24. Descriptive statistics of the raw feature set in the sub-ethnic groups.

IV. AUTOMATICALLY-LEARNED FEATURES
This section describes the experimental results obtained using features learned automatically by standard deep learning techniques such as VGG16, ResNet50, and Inception-V3. For all three deep learning techniques, the number of epochs is 50, the batch size is 32, the target size (i.e., height and width) of the images is 128, the pooling is avg, and the optimizer is Adam. The default learning rate is used for ResNet50, while the other two techniques use a learning rate of 3E-4. These techniques were implemented using the Keras library. Table 22 shows the evaluation metrics on the sub-ethnic groups. Cross-entropy loss increases as the predicted labels diverge from the actual labels. As the F1-Score metric has been removed from Keras, we report our findings using the precision and recall metrics. Likewise, Table 23 shows the same statistics for the major ethnic groups. From these results, we can conclude that Inception-V3 gave the best accuracy score (i.e., 0.93-0.94) with the lowest cross-entropy loss (i.e., 0.32-0.39).
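The behaviour of the cross-entropy loss, and the relation between the reported precision and recall and an F1-Score, can be checked with a few lines of arithmetic. This is an illustrative sketch with hypothetical function names; it does not reproduce the Keras implementation.

```python
import math


def cross_entropy(true_index, probabilities, eps=1e-12):
    """Categorical cross-entropy for one sample: -log of the true-class probability."""
    return -math.log(max(probabilities[true_index], eps))


def f1_from(precision, recall):
    """Harmonic mean of a single precision and recall pair."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A prediction that assigns probability 0.9 to the correct class incurs a smaller loss than one that assigns 0.4, which is the sense in which the loss grows as predictions diverge from the actual labels. An F1-Score recovered this way from averaged precision and recall is only an approximation of a per-class averaged F1-Score.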
To understand the features used by deep learning techniques to classify major and sub-ethnic groups, we used Local Interpretable Model-Agnostic Explanations (LIME) [49]. LIME is a library that can determine the set of features used for the classification. For example, we show the k top features used by the Inception-V3 model to classify subject 69a from the FEI database in Figure 11. Here k is ten. Of note, the LIME library interpreted similar features for the VGG16 and ResNet50 models as well.

V. CONCLUDING REMARKS
We conclude from the above discussion that, whereas handcrafted features are Euclidean distance-based, automatically learned features are shape-based. The shapes of the learned features include the hair, forehead, and neck. These regions are disregarded by the handcrafted features because, during the preprocessing phase, only a bounding box around the detected face is fed to the feature extraction method.
Of note, both the XGBoost and SKB feature selection algorithms gave similar prediction results with the raw feature set in the major and sub-ethnic groups. However, using the normalized feature set, information-gain-based features yielded the best prediction results in both the groups. Hence, we have demonstrated that machine learning can be used to obtain ethnicity-based reference ranges for the features. The current healthcare literature [1], [6]- [9], [48] uses statistical tests, as indicated in Table 1, to obtain the reference ranges.
The top three information-gain-based features in the sub-ethnic groups are: dn (distance from the tip of the nose to the center of the mouth), hf (face height) and wn (nose width), while the top three information-gain-based features in the major ethnic groups are: de (distance between the inner corners of the eyelids), hf and dn. Of note, the researchers in [50] observed that Indian American females have a smaller de but a larger wn than North American White females. On the other hand, the study in [51] revealed that African American females have a longer lower face and wider nose than North American Caucasian females. However, to the best of our knowledge, no ethnicity studies use the distance from the nose's tip to the mouth's center (dn) as a differentiating feature. Nevertheless, such a measurement has been used by plastic surgeons to calculate the golden ratio of the human face [52].
In this paper, we have 1) provided the first evidence that two-dimensional benchmark data sets developed in the computer vision field can be used in facial anthropometric measurement studies of the healthcare field; 2) provided a comprehensive analysis of information-gain-based and p-value-based features to find the k top discriminative features in major and sub-ethnic groups; and 3) proposed the top three features that may be useful in differentiating populations in major and sub-ethnic groups and compared them to the features obtained automatically from standard deep learning techniques.