3D evaluation model of facial aesthetics based on multi‐input 3D convolution neural networks for orthognathic surgery

Quantitative evaluation of facial aesthetics is an important but also time‐consuming procedure in orthognathic surgery, while existing 2D beauty‐scoring models are mainly used for entertainment with less clinical impact.

Defining a person's facial aesthetics is a subjective notion and can vary according to different races, regions, and cultures. 1 Nevertheless, researchers have continued to propose different concepts and approaches attempting to objectively and quantitatively evaluate facial beauty since ancient times. 2 In eastern Asia, the 'Three Courts and Five Eyes' concept was used to describe how a beautiful face should be proportional to a preset ratio, 3,4 where the face was vertically divided into three equal sections from hairline to chin and horizontally divided into five sections, each with the length of an eye. In Western culture, facial aesthetics was geometrically described by Euclid using a mathematical proportion called the 'golden ratio', 5,6 which requires that the ratio of the whole length to the longer part equal the ratio of the longer part to the shorter part, with a fixed value of about 1.618. Although both these geometrical standards, together with other ancient concepts, are still frequently applied even in modern clinics, considering the complexity of human facial structures, a more comprehensive and objective tool is still highly desirable to better define the facial attractiveness of a person. 7

Orthognathic surgery is a discipline that corrects deformities or defects of the mandible, maxilla, and skull for therapeutic or aesthetic purposes. 8 In current orthognathic surgery, evaluating facial aesthetics has practical meaning: it quantifies a person's facial beauty and supports a satisfactory preoperative plan, which can be an important reference for both surgeon and patient. 9 From the perspective of surgeons, they must have a clear blueprint of what a normal, proportional, and symmetric face should be before conducting any operation to correct defects or improve existing beauty.
10 Evaluating the physical condition of a person is the first step for the surgeon to conduct a predictable and controllable surgery. From the perspective of patients, they usually have overly high expectations for surgical outcomes; 11 nevertheless, the postoperative results may be heavily affected by the patient's preoperative facial condition. Using a third-party tool to show quantitative evaluation results of facial aesthetics could unify surgical expectations and reduce the cognitive difference between patient and surgeon.
Conventional clinical methods usually use cephalometric analysis, 12 the golden ratio mask, 13 and aesthetic angle formulas 14 to evaluate the patient's facial condition before surgery. Cephalometric analysis is conducted by selecting several anatomical landmarks and calculating the distances or angles among them. The results can be compared with preset standards to quantify concepts such as facial symmetry, proportion, and movement. However, manually drawing landmarks in cephalometric analysis is still time-consuming and tedious work. Some custom algorithms have been proposed to partially automate the procedure, 15 whereas landmark-based aesthetic evaluation heavily relies on the surgeon's experience and requires remarkable preoperative labour. 16 The golden ratio mask is an extended clinical application of the golden ratio theory. Ricketts 17 and Marquardt 18 created 2D and 3D masks; comparing a patient's images with the mask indicates how much the surgeon should move the patient's facial organs to bring them closer to the golden ratio mask. However, this approach suffers from the same problems as cephalometric analysis because the surgeon has to repeat the evaluation for every new patient, which makes it an effective but cumbersome method.
The aesthetic angle formula is a derivation of cephalometric analysis and the golden ratio mask. Various concepts have been proposed by different researchers, such as Powell and Humphreys' aesthetic angles, 14 Peck and Peck's angles, 19 and the 'H angle'. 20 The population of patients undergoing such surgeries has grown fast in recent years, 21 which makes the three clinical evaluation methods mentioned above suffer from heavy manual work; an automated method is especially preferred for reducing surgeons' workload.
However, significant anatomical differences among individuals make it difficult for specifically developed algorithms to cover the facial variety; a more generalised approach is particularly suitable for mimicking surgeons' thinking mechanisms and replacing fixed algorithms.
Benefiting from the recent development of deep learning, some evaluation models of facial aesthetics have been proposed based on convolution neural networks (CNNs) using common 2D images. 22 These models either generated an average face based on ethnicity 23 or gave a score for a person's facial aesthetics. 24,25 From a clinical perspective, a 2D image does not convey enough anatomical information for evaluating the solid facial structure, making these models have less clinical impact, since the surgeon's evaluation and surgical plan are usually based on 3D CT images instead of common 2D images. 26 Training a large and deep 3D network also places high demands on hyperparameter tuning and on the computational capability of the hardware. 28 Raw CT images contain considerable redundant information, and much computational power can be wasted merely extracting the wanted information if the image is not preprocessed. Two compromised strategies have therefore been adopted in previous work: reducing 3D volumes to 2D inputs, or processing different types of medical information separately. [32] The former may not be preferable in a task where the whole 3D information should be considered, and the latter may fail to consider the correlation among different medical information. Therefore, if the raw medical images are preprocessed into standardised training materials and a direct 3D CNN is used without adopting compromised methods, it is possible to train an application-oriented model that is in accord with clinical reality without introducing the two problems mentioned above.
In other surgical fields such as knee replacement, surgeons have developed different scoring forms (indexes) to evaluate a patient's preoperative condition as a reference for the surgical plan. 33 However, quantitative evaluation tools for facial aesthetics are still lacking in orthognathic surgery, especially those using deep neural networks, which makes orthognathic surgeons still rely on conventional subjective methods and not benefit from the latest achievements in deep learning. Therefore, in this study, we introduced a 3D evaluation model (DeepBeauty3D) to quantitatively estimate facial aesthetics for orthognathic surgery by seamlessly integrating two successive modules (Figure 1). The first is an image-preprocessing module, which automatically transfers raw CT images into standard training material without relying on any commercial medical imaging software.
The second module is the predicting network module, which accepts the inference-ready data from the image-preprocessing module and uses a 3-input-2-output 3D CNN to output facial aesthetics values. To the best of our knowledge, this work is the first to adopt a multi-input 3D CNN to classify facial aesthetics from raw CT images for clinical purposes.
Considering the clinical practice and latest research progress in the literature, we made the following contributions to address current issues and meet the urgent demand in orthognathic surgery.
(1) We designed and trained a 3-input-2-output 3D CNN to provide surgeons with a quantitative and automated facial aesthetic evaluation tool for reducing their workload and unifying the surgeon-patient aesthetics perspective, which is particularly desirable since scoring models that meet the clinical needs of a fast-growing patient population are still lacking.
(2) We acquired 133 CT images from The University of Tokyo Hospital to train the 3D model and annotated the ground truth data using clinically accepted methods to ensure the work's practical usability, rather than using common 2D images to train a 2D beauty-predicting model for entertainment purposes only.
(3) This work provided an end-to-end solution from raw CT images to the final facial aesthetics scores by seamlessly integrating a self-developed image preprocessing module and a predicting network module, without relying on expensive commercial medical software to conduct image preprocessing or using compromised CNN training methods by sacrificing data integrity.
(4) This work considered three correlated anatomical factors simultaneously — soft tissue, skeleton, and personal physical information — to define a person's facial aesthetics rather than merely surface skin conditions, thus providing a more comprehensive perspective for assessing facial aesthetics.

CT images were discarded if the weight information of the patient was not recorded in the DICOM data and could not be found in other medical data. The statistics of patients' personal information are listed in Table 1.

| Manual annotation for preparing ground truth data

(a) Soft-tissue score annotation
The output values of both the soft-tissue score and the skeleton score needed to be manually annotated as ground truth. As mentioned in the introduction section, a surgeon might use a golden ratio mask to determine how much he/she should move a patient's bone to make the final result closer to a standard golden ratio mask. Different golden ratio mask models have been proposed by different researchers. In this study, we used the mask proposed by Dr. Marquardt as a reference to evaluate the patient's soft tissue. However, the 3D golden ratio mask is a set of serial bars, while the soft tissue is a solid instance; they cannot be directly compared since the two data have fundamental differences. Nevertheless, surgeons usually use a series of anatomical landmarks to evaluate the facial conditions of a person during cephalometric analysis. Inspired by this method, we also used landmarks as a medium to quantitatively measure the differences between a person's soft tissue data and the 3D golden ratio mask. We selected 12 landmarks to define the soft tissue conditions and used the same landmarks in the golden ratio mask as reference (Table 2, Figure 2). The coordinates of different CT images might vary because of the random head pose of the patient during CT scanning, and needed to be correlated to the golden ratio mask coordinate system; otherwise any comparison between the two data would be meaningless. Therefore, we used the nasion point as a reference point and let the two landmark sets roughly match at the nasion point (Figure 3). After the coarse match, an iterative closest point (ICP) algorithm was used to conduct the refined match, which unified the coordinates and calculated the distance error between the corresponding landmarks. Then the errors were summed up and interpreted as the aesthetic value of the soft tissue. As the ICP process is a standardised calculation commonly used in the field, we moved the detailed calculation process to the Supporting Information S2 to keep the
manuscript concise and neat. The following are the key steps for calculating the aesthetic score of soft tissue. Assume that the landmark set of the golden ratio mask P and the landmark set of a specific patient's soft tissue Q are:

P = {p_1, p_2, ..., p_12}, Q = {q_1, q_2, ..., q_12}.

After the ICP process, the new landmark set of the patient's soft tissue O can be obtained by using the following simple transfer:

O = R*Q + t*,

where R* and t* are the optimum solutions of the rotation matrix R and translation matrix t and can be obtained using SVD (Supporting Information S2).
Finally, the L1 norm between the landmark set of the golden ratio mask P and the new landmark set of the patient's soft tissue O can be calculated by summing up all landmark errors:

E = Σ_{i=1}^{12} ||p_i − o_i||_1.
As shown in Figure 3, the rough match using a reference point reduces the impact of the overly large original error range between the two landmark sets caused by the random head pose during CT scanning.
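This two-step alignment can be sketched compactly. The sketch below assumes the landmark sets are 12 × 3 numpy arrays with known one-to-one correspondences, in which case the ICP refinement reduces to a single SVD-based rigid fit (the Kabsch solution used inside each ICP iteration); `error_to_score` illustrates the "larger error, smaller value" mapping described later and is our own illustrative helper, not the paper's exact formula.

```python
import numpy as np

def align_and_score(P, Q, ref_idx=0):
    """Rough match at a reference landmark (e.g. the nasion point),
    then the optimal rigid fit via SVD, then the summed L1 error.
    P: golden-ratio-mask landmarks, Q: patient landmarks, both (N, 3)."""
    Q = Q + (P[ref_idx] - Q[ref_idx])       # coarse match at reference point
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)   # centre both sets
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)     # Kabsch/SVD rigid fit
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = P.mean(0) - R @ Q.mean(0)
    O = (R @ Q.T).T + t                     # refined patient landmarks
    return O, np.abs(P - O).sum()           # summed L1 landmark error

def error_to_score(err, err_max, full=50.0):
    """Illustrative mapping (not the paper's exact formula): zero error
    gives the full score; errors at or beyond err_max clamp to 0."""
    return full * max(0.0, 1.0 - err / err_max)
```

If the patient landmarks are an exact rigid transform of the mask landmarks, the recovered error is numerically zero, which is the sanity check one would expect of the alignment step.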
Then, the ICP process conducted the refined match and calculated the error between the landmarks of the soft tissue and the golden ratio mask. This two-step alignment and calculation process ensured that the error was concentrated in a limited range and was not affected by the original position of the CT images. After ICP matching, the larger the error, the smaller the aesthetic value.

Unlike the soft tissue, which had a standard golden ratio mask against which to evaluate the mutual difference, there are no widely accepted criteria to define the aesthetic level of the skeleton. In some machine-learning-based beauty scoring models, the researchers evaluate a person using a preset score based on their personal aesthetic perception. In addition, studies have also suggested that some subjective evaluation methods correlate with their quantitative counterparts, supporting their reasonability.
Therefore, we adopted a similar method and set five factors, each worth 10 points, to label a person's skeleton aesthetics, as listed in Table 3. Some of these factors were inspired by the definitions used in cephalometric analysis. Skeletal aesthetics were evaluated by a professional surgeon based on his own medical judgement. It took 10-15 min to finish the annotation of one patient.

| Image-preprocessing module
Then, a nearest-neighbour interpolation algorithm was applied to the CT image. Assume the original image is stored in the tensor I_o and the resampling factor is the ratio of the new size to the original size; the new image I_r after resampling can then be obtained by nearest-neighbour interpolation (the detailed expansion of the algorithm is omitted here to keep the equations neat). In this study, most of the images have a resolution of (0.351, 0.351, 0.5) mm/pixel, and we resampled each image to a new, unified resolution. The padding number for the Z axis was decided based on the remaining size after the resampling process. If the size was less than 200, half of the difference was added to each ending slice of the Z axis; if the size was larger than 200, only the first 200 slices starting from the head were used and the rest were discarded. Assume the Z-axis size after resampling is s_z, the image after padding is I_p, and its Z-axis branch is I_p^z; the padding number p_z under the two conditions can then be calculated using Equation (6):

p_z = (200 − s_z)/2 per ending side if s_z < 200; otherwise no padding is applied and the volume is cropped to the first 200 slices.

T A B L E 3 Factors and definitions of the skeleton score.
The value to evaluate the regularity of the facial form. For example, the skeleton should not be significantly larger or smaller than normal, making it inconsistent with the common human aesthetic perspective.

Symmetry 10
The value to evaluate the symmetry of the facial skeleton, mainly in the coronal plane. For example, the left part of the skull should not significantly lean towards the right part, and vice versa; the upper part of the skull should not significantly lean towards the lower part, and vice versa.

Defects 10
The value to evaluate the completeness of the facial skeleton. For example, there should be no damage or loss (holes or tooth loss) in the bone.

Condition of skeleton 10
The value to evaluate the consistency between skeleton maturity and the patient's age.
For example, the skeleton should not be incompletely grown due to genetic defects.

Facial prognathism 10
The value to evaluate the discrepancy between the mandible and maxilla, mainly in the sagittal plane.
For example, the mandible should not be significantly behind or in front of the maxilla.

(d) Segmentation

The after-padding data were further segmented based on the HU values of different organs. We used 226 to 3071 HU for the skeleton mask I_m and −718 to −177 HU for the soft tissue mask I'_m, based on the standard medical grading of CT scans. The skeleton mask and soft tissue mask were then obtained by keeping only the voxels inside the corresponding HU window:

I_m(x) = I_p(x) if 226 ≤ I_p(x) ≤ 3071, else 0; I'_m(x) = I_p(x) if −718 ≤ I_p(x) ≤ −177, else 0.

Since the value range after segmentation was still too large to feed directly into the network, we normalised both the skeleton data I_n and the soft tissue data I'_n into 0-1 for better training (Equation 8).
Part of the final preprocessed skeleton and soft tissue data are shown in Figure 4.
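Equation 8 is not reproduced above; a minimal numpy sketch of one plausible reading (threshold to the HU window, then min-max rescale the kept values to 0-1, with everything outside the window mapped to 0) is:

```python
import numpy as np

def segment_and_normalise(hu, lo, hi):
    """Keep voxels inside the [lo, hi] HU window and min-max scale them
    to 0-1; voxels outside the window become 0."""
    kept = np.where((hu >= lo) & (hu <= hi), hu, lo).astype(float)
    return (kept - lo) / (hi - lo)

# HU windows from the text: skeleton 226 to 3071, soft tissue -718 to -177.
```

Under this reading, air (−1000 HU) maps to 0 in both masks, the lower window bound maps to 0, and the upper bound maps to 1.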
For the metadata, gender was encoded as 0 for female and 1 for male, and age and weight were normalised to the range 0-1. Then, all data were saved as tensors to feed into the network for training the model.
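The metadata encoding above can be sketched as follows; the normalising upper bounds for age and weight are illustrative assumptions, since the paper does not state which min-max bounds were used.

```python
def encode_metadata(gender, age, weight,
                    age_max=100.0, weight_max=150.0):
    """Encode patient metadata for the network: gender as 0 (female) or
    1 (male), age and weight scaled to 0-1. The upper bounds here are
    illustrative assumptions, not values from the paper."""
    g = 1.0 if gender == "male" else 0.0
    return [g, age / age_max, weight / weight_max]
```

The resulting three-element vector is what the metadata branch of the network would consume alongside the two image tensors.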

| Predicting network module
As mentioned in the Introduction section, feeding internally correlated data independently into a deep neural network may ignore their anatomical interaction effects. Therefore, we designed a 3-input-2-output deep neural network using CNNs and a multi-layer perceptron to extract the feature maps of the skeleton, soft tissue, and metadata, and then concatenated them into one to consider all information together when inferring the facial aesthetics score (Figure 5).
We used three convolution blocks for both the skeleton and soft tissue data; each block had a standard Conv-ReLU-BN-POOL layer.
All convolution layers had the same kernel size of 3 × 3 × 3. The soft tissue branch used the standard cross-entropy loss between the output and the ground truth, where ŷ_i and y_i respectively correspond to the classification output on the i-th sample and its label (Equation 9):

L_st = −(1/N) Σ_{i=1}^{N} y_i log ŷ_i.
The skeleton branch also used the cross-entropy loss, calculated as below, where y'_i and ŷ'_i represent the label and the classification output of the skeleton branch on the i-th sample:

L_sk = −(1/N) Σ_{i=1}^{N} y'_i log ŷ'_i.
Finally, a scale factor α was applied to weigh the two losses for multi-task training. Based on a preliminary evaluation, we empirically set α to 0.5 and found this weight ratio showed the best results compared with others. In this case, the model was trained to minimise the following combined loss L:

L = L_st + α L_sk.
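In numpy, the two cross-entropy terms and their α-weighted combination might look like the sketch below; the additive weighting form is our reading of the text, not a formula confirmed by the paper.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over one-hot labels."""
    return -np.sum(y_true * np.log(y_pred + eps)) / len(y_true)

def combined_loss(y_st, p_st, y_sk, p_sk, alpha=0.5):
    """Multi-task loss sketch: soft-tissue loss plus
    alpha-weighted skeleton loss."""
    return cross_entropy(y_st, p_st) + alpha * cross_entropy(y_sk, p_sk)
```

A perfect prediction on both branches drives the combined loss to (numerically) zero, while a uniform prediction over two classes contributes log 2 per branch.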

| Graphic user interface
To lower the learning curve for the surgeon and give the patient a user-friendly view of the predicted results, we developed a graphical user interface (GUI) for this 3D evaluation model of facial aesthetics (Supporting Information S1, YouTube link: https://youtu.be/Oqne-HyVxe3s). As shown in Figure 6, both the image-preprocessing module and the predicting network module are seamlessly integrated into the GUI. Once started, it can automatically process raw CT images and give the aesthetic values of both the skeleton and soft tissue without relying on any commercial medical software or human involvement.
The image preprocessing module consisted of three different regions.In the file load region, the surgeon could select the target DICOM data of a patient.After clicking the start button, the original image size and resolution, together with the patient personal information, would be shown in the file load region and metadata region, respectively.The dynamic 3D model of both the skeleton and soft tissue would be shown in the 3D model region, which allowed the surgeon to rotate, zoom, and move the model to check the details.
The three types of data generated by the image-preprocessing module would directly flow to the predicting network module.Both the aesthetic value of the skeleton and soft tissue were calculated by using the trained network.The results are indicated by the exact value and a five-star rating graph.The more the stars were highlighted, the higher the aesthetic score of the patient.

| RESULTS
The proposed 3D model can be run in two working modes. For a single CT image, the GUI-based mode is preferred: the surgeon manually loads the patient's data and user-friendly results are shown to the patient. For a large number of CT images, the programme-based mode can be used to obtain the facial aesthetics scores of a batch of patients' images continuously, which is faster than the GUI mode. We first tested the GUI mode of this model. After selecting the target DICOM data and clicking the start button, the model was able to finish the process automatically. It took about 90 s from loading the raw CT image to obtaining the results in GUI mode (Supporting Information S1). The model worked successfully on all tested PCs using various CT volumes.
To test the model's average predicting accuracy, we used the programme-based mode. To evaluate the model-predicted results against the ground truth, we selected some extreme cases to analyse their differences; the results are shown in Figure 7. There were two volumes with the highest skeleton scores, and we chose the one with the higher soft tissue score for analysis. There were three volumes with the lowest skeleton score, one of which also had the lowest total score; we chose that one for analysis.

F I G U R E 7
Comparisons of six extreme cases between the ground truth score and the model-predicted score, with soft tissue images, skeleton images, and metadata listed respectively. Both the soft tissue scores and skeleton scores listed in this figure are original scores from the model-predicted results and ground truth data. The scores shown in GUI mode are further interpreted using the five-star rating method, where all scores are indicated in a one-integer-and-one-decimal format; that is, an original score of 49 becomes 4.9 stars when the model is in GUI-based mode.
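The conversion described in the caption is simply a divide-by-ten reformat of the 0-50 original score; as a sketch:

```python
def to_five_star(original_score):
    """Interpret an original 0-50 score as a one-decimal star rating,
    e.g. an original score of 49 becomes 4.9 stars."""
    return round(original_score / 10.0, 1)
```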
Orthognathic surgeons still largely rely on conventional subjective methods and do not benefit from the latest progress in machine learning. We believe that a scoring model could be a useful medium to bridge the gap between the surgeon's actual capability and the patient's overly high expectations for the surgery. Such a third-party, quantitative tool could be highly desirable in the clinic. On the one hand, it is a purely machine-learning-based automatic method that does not require a surgeon to put much effort into manual evaluations. On the other hand, it can show the patient's personal pre-surgery condition in a more acceptable way; the patient will not feel offended by a low score because the results are reported by a model, not by a human surgeon.
This study proposed an end-to-end solution to prepare the standard training materials and extract the key medical information.
Some professional medical software is expensive and uses GUI-based data preprocessing methods. Not all users can afford the high licence fees for such software. It is also inconvenient to preprocess a large number of medical images using GUI-based software. The preprocessing module proposed in this study can be a useful and free tool not only for the proposed surgeries but also for other purposes.
It can automatically preprocess a large number of medical data without human involvement.The predicting network modules can also be used for other purposes to extract the key information from the input data.For example, if the ground truth data were prepared to predict tumours, proper surgical treatments, and postoperative conditions, then the backbone network can also be used for predicting one of (single output) or all of (multi-output) these values.
Here, we used facial aesthetics (beauty) as our output because it is the key factor and a common concept in orthognathic surgery. If researchers are interested, they can use the proposed method to train a specific model for their own research purpose. In addition, more and more robot-assisted or computer-assisted technologies are being introduced into orthognathic surgery to reduce the high workload of surgeons during the intraoperative phase. 37 Nevertheless, there is still a lack of preoperative evaluation tools to automate the workflow for analysing the patient's condition before surgery. The proposed facial aesthetic model could assist surgeons with surgical plan design, thus further improving the autonomy of current robot-assisted surgery and reducing the surgeon's workload. 38,39 Defining a person's facial aesthetics is still a complex and difficult task. As an early attempt at this concept, our current model yields acceptable results, whereas it still has some limitations and can be improved in future studies.
However, few CNN models have been proposed for clinical 3D images, and the reasons can be attributed to two major challenges. The first is the difficulty of preparing training materials from raw 3D images. Unlike common 2D images, which can easily be obtained with a camera or downloaded from an open-access dataset, 27 3D medical images need expensive CT scanners to acquire and professional medical software to preprocess, allowing only limited access under strict ethical regulations. The differences in resolution, image size, and Hounsfield unit (HU) range are significant across medical volumes. Commercial graphic-user-interface-based medical software such as MIMICS can perform some preprocessing work, but it is expensive and cannot automatically prepare standardised training-ready images for a large number of data. The second challenge, introduced by the first, is the difficulty of training a 3D CNN model: training a large and deep 3D network places high demands on researchers' hyperparameter tuning skills.

Before proposing any type of evaluation model, we must first identify the key factors that define a person's facial aesthetics from a clinical perspective. In orthognathic surgery, surgeons usually operate on the patient's maxilla and mandible areas. They need to reshape, relocate, and rearrange the facial skeleton, which further changes the soft tissue (skin) condition and improves facial aesthetics. Changes in the skeleton can lead to changes in soft tissue; nevertheless, the soft tissue itself may also independently affect the aesthetic level of the patient. Even two persons with nearly the same skeleton structure might have different soft tissue conditions and thus different appearances. Furthermore, individual physical differences in gender, age, and weight may also significantly affect personal aesthetics. It should be mentioned that the above three factors are closely
correlated and should be considered simultaneously when evaluating a person's aesthetic condition. In the clinic, a patient's CT images are stored in DICOM format, which can be further preprocessed to obtain the intended medical information. A researcher can build different masks by using different Hounsfield unit (HU) ranges to segment out different organs, and use the metadata to extract the personal information of the patient. Therefore, in this study, we empirically selected the skeleton, soft tissue, and personal physical information as the three major factors for evaluating a person's facial aesthetics. The facial aesthetics score was then sub-grouped into the soft-tissue score and the skeleton score, respectively.

F I G U R E 1 General workflow of the proposed 3D evaluation model. The image-preprocessing module provides the normalised tensor data for the predicting network module, where the feature maps of all three input data are extracted and further concatenated for the final inference. Once initiated, the whole procedure works in an end-to-end manner from raw CT images to the final facial aesthetics score without relying on commercial medical software or human surgeon involvement.

In this study, we collected 133 CT images of patients who underwent orthognathic surgery at The University of Tokyo Hospital. Data acquisition and processing were approved by the Medical Ethics Committee of the Graduate School of Medicine of the University of Tokyo (No.
2553-(3)). All data were processed within the hospital and the personal information of patients was anonymised. Inclusion criteria were set to screen out eligible data for model training and aesthetics definition in this study. The selected CT images should cover at least the range from the nasion point to the menton point, and the patient's soft tissue should not have been covered by medical bands during CT scanning. Two of the authors conducted the manual landmarking of the 133 volumes in 3-Matics (Materialise NV) based on the proposed landmark list. It took around 20-40 min to annotate one volume.
In rare cases where a patient from a new dataset had a greater error than the largest value in the ground truth dataset, an extreme-outlier processing step assigned it the same value as the largest one. All the original errors were normalised into 0-50 as the final score of soft tissue.

(b) Skeleton score annotation

F I G U R E 2 Corresponding locations of the 12 landmarks in a soft-tissue model (A) and the 3D golden ratio mask (B). The red landmark is the reference landmark (nasion point). The numbers next to the landmarks indicate the annotating sequence.

F I G U R E 3 Original location (A), rough matched location at the reference point (B), and the final iterative-closest-point matched location (C) of the soft-tissue landmark set and the golden ratio mask landmark set. The final errors are accumulated as the soft tissue score.

We used MIMICS and 3-Matics to manually annotate the soft-tissue and skeleton images for preparing the ground truth data; these are two commercial software packages widely used by surgeons to conduct 3D medical image processing and evaluation. MIMICS is primarily GUI-based software, which is not suitable for automatically processing raw medical images to prepare the standard training material for a CNN model. In addition, MIMICS is relatively expensive software; integrating it with a machine learning model would not only increase the surgeon's financial burden but also affect the technological independence of the model. Therefore, in this study, we developed a custom image-preprocessing module to conduct the segmentation and preprocessing work without relying on commercial software such as MIMICS.

(a) Getting metadata

The image-preprocessing module first reads the metadata from the raw CT image and categorises the information into three groups. The first group has four parameters: name, age, gender, and weight. Only the name was used for labelling different data; the other three parameters were used as training material. The second group is the size of the volume, with three parameters: image width, image height, and image depth, which are used in the subsequent padding process. The third group is the resolution of the volume along the three axes, which is used as the reference for the resampling process.

(b) Resampling

The original DICOM data are stored as a measure of radiodensity, which cannot be directly used and needs to be transferred into pixel units. Therefore, the image-preprocessing module sets the unit values that are out of the scanning boundary to zero and converts the effective units into Hounsfield units (HU) by multiplying the original unit by the rescale slope and adding the intercept. As mentioned in the Introduction section, the original CT images were too large and had different resolutions along different axes. Therefore, we resampled the CT images to downscale the input data and set the resolution to a fixed value to unify the resolution of all three axes. Assume that the original resolution and original size of the image are stored in the vectors R_o and S_o; then the new size of the image S_n under the new resolution R_n can be calculated and rounded as:

S_n = round(S_o ⊙ R_o / R_n),

where the multiplication and division are element-wise.

(c) Padding

After the resampling, the sizes of the X and Y axes were reduced to 180, while the Z axis still had various sizes depending on the slice number of the original DICOM data. A padding process was further implemented to reshape the 3D image into a 200 × 200 × 200 tensor. The X and Y axes were given 10 extra voxels of 'air' (HU = −1000) on each ending side.
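The HU conversion, resampling-size arithmetic, and padding steps above can be sketched as follows. This is a minimal numpy sketch; the target resolution of 1.0 mm/voxel is an assumption consistent with the 512 → 180 size quoted in the text, not a value confirmed by the paper.

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """HU = raw * RescaleSlope + RescaleIntercept (DICOM convention)."""
    return np.asarray(raw, dtype=float) * slope + intercept

def resampled_size(orig_size, orig_res, new_res=(1.0, 1.0, 1.0)):
    """S_n = round(S_o * R_o / R_n), element-wise.
    The default new resolution here is an assumption."""
    return np.rint(np.asarray(orig_size, float) * np.asarray(orig_res, float)
                   / np.asarray(new_res, float)).astype(int)

def pad_to_cube(vol, target=200, air_hu=-1000.0):
    """Pad each axis symmetrically with 'air' up to `target` voxels;
    axes longer than `target` are cropped from the start (head side)."""
    vol = vol[:target, :target, :target]   # crop any oversize axis
    pads = []
    for s in vol.shape:
        d = target - s
        pads.append((d // 2, d - d // 2))
    return np.pad(vol, pads, constant_values=air_hu)
```

With the resolutions quoted in the text, a 512 × 512 slice at 0.351 mm/pixel resamples to 180 × 180, which the 10-voxel padding on each side brings up to the 200 × 200 × 200 input tensor.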

Each convolution layer used stride = 1 and was followed by a ReLU activation and batch normalisation. A max-pooling layer with a size of 2 × 2 × 2 and stride = 2 was used at the end of each block to reduce the spatial dimensions of the feature maps by half. Then, the data were flattened and processed by an FC-ReLU-BN-DP block. The fully connected layer had 128 neurons and was followed by a ReLU activation, batch normalisation, and a dropout layer with a ratio of 0.5. Another dense layer with 64 neurons and ReLU activation was used as the final layer of each branch. The skeleton mask usually had many more effective values in its tensor than the soft tissue mask; therefore, we used larger 3D convolution layers to extract the internal features of the skeleton mask. The skeleton channel has (16, 32, 64) filters, while the soft tissue channel has half that number, (8, 16, 32). The metadata channel was processed by a 2-layer multi-layer perceptron (MLP) with ReLU activation and 16 and 8 neurons, respectively. Then, all three channels were concatenated into one and further processed by a two-branch 3-layer MLP with sizes of (128, 64, 51), respectively.
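Under the assumption that the 3 × 3 × 3 convolutions use 'same' padding, so that only the pooling layers change the spatial size (the paper does not state the padding mode), the spatial dimensions through the three blocks can be traced as:

```python
def after_blocks(side, n_blocks=3, pool=2):
    """Spatial size after n Conv('same')-ReLU-BN-Pool(2, stride 2)
    blocks: each block halves the side length."""
    for _ in range(n_blocks):
        side //= pool
    return side

side = after_blocks(200)        # 200 -> 100 -> 50 -> 25
flat_skeleton = side ** 3 * 64  # 64 filters in the last skeleton block
flat_soft = side ** 3 * 32      # 32 filters in the last soft tissue block
```

This back-of-the-envelope check shows why the flatten step is followed by a 128-neuron dense layer: the flattened skeleton feature map is on the order of a million values under this padding assumption.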

FIGURE 4 Part of batch preprocessed skeleton (green) and soft tissue (yellow) data. Patients' metadata are stored as the file names and are mosaicked here to protect patients' privacy.

The network used an Adam optimiser with a learning rate of 1 × 10⁻⁶. Categorical cross entropy was selected as the loss function, with accuracy as the evaluation function. The data were fed into the network with a batch size of 4, and the network was trained for 1000 epochs. The network was implemented in the Keras framework with TensorFlow backend (V 1.9.0). The model was trained on a high-performance workstation (Deep Learning Box II, GDEP ADVANCE Inc.) equipped with an Intel Core i9 CPU (memory: 128 GB) and four NVIDIA Quadro GV100 GPUs. The trained model was then run in programme-based mode to conduct an experiment on 13 CT volumes that were independent of the training data.

The five-star rating method is frequently used to show facial aesthetic value in common 2D-image-based evaluation models. Although the present research used 3D CT images and a different research framework, given its user-friendliness and wide acceptance, we also adopted the five-star rating method to interpret the model-predicted results. The original results were further divided by 10 to form the one-integer-and-one-decimal format as the final five-star rating results.

FIGURE 5 Network structure with layer size indicated. Flatten and dropout layers are omitted. Convolution layers and the dense (fully connected) layers at different branches are indicated with different colours.

To avoid misunderstanding between the two result formats, we also calculated the percentage accuracy as a supplementary result, since it remains the same whether the original or the five-star rating value is used. The five-star rating and the corresponding percentage results showed that this model was able to predict the skeleton and soft tissue scores with 0.231 ± 0.218 (4.62%) and 0.100 ± 0.344 (2.00%) score errors when compared with the ground truth. The average time from loading the raw DICOM data
to obtaining the predicted results is 11.203 ± 2.824 s in programme-based mode. Our current model used a classification output, which had high accuracy on most of the tested data while showing relatively unstable errors in some extreme cases; these cases contributed most of the error and resulted in a relatively large deviation. This could be caused by the nature of classification tasks in a deep learning model, where the outputs are discrete values indicating the classification labels. In contrast, regression outputs are continuous values that usually yield more stable results than classification outputs. In our test, the model with a regression output reached score errors of 0.632 ± 0.163 (12.65%) and 0.641 ± 0.288 (12.81%) for the skeleton and soft-tissue scores, respectively. The deviation is much smaller, but the mean error is larger than that of the classification output. Therefore, in this study, we used the classification output as the model's final inference result.
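As a concrete sketch of the rating conversion (assuming, as the 51-way classification head suggests, that the model predicts an integer score from 0 to 50 that is divided by 10 to yield the 0.0–5.0 star rating; the function names are illustrative, not from the study's code):

```python
def to_five_star(class_index):
    """Map a predicted class index (0-50) to the one-integer-and-
    one-decimal five-star rating, e.g. 43 -> 4.3."""
    if not 0 <= class_index <= 50:
        raise ValueError("expected a class index between 0 and 50")
    return class_index / 10.0

def percentage_error(predicted, truth, max_score=50):
    """Score error as a percentage of the full scale; this value is the
    same whether computed on raw (0-50) or star (0.0-5.0) scores."""
    return abs(predicted - truth) / max_score * 100.0
```

The scale invariance of `percentage_error` is why the paper reports it as a supplementary result alongside the two score formats.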

As shown in Figure 8, there is a trend of linear correlation between the skeleton score and the soft tissue score, with a slope of 0.177 that is significantly different from zero at the 0.05 level. The Pearson's r is 0.294 and the p-value is 5.907 × 10⁻⁴ < 0.05. Although the correlation evaluation is independent of the performance of the proposed model, these statistics demonstrate the validity of our methodological design: combining the skeleton and soft tissue together with personal physical information is a sound approach to evaluating a person's facial aesthetics.
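Statistics of this kind can be reproduced with a plain Pearson correlation and least-squares slope; the following is a generic pure-Python sketch (run here on illustrative data, not the study's scores):

```python
import math

def pearson_r_and_slope(xs, ys):
    """Pearson correlation coefficient of x and y, and the
    least-squares slope of the fitted line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope
```

For perfectly linear data such as `xs = [1, 2, 3, 4]`, `ys = [2, 4, 6, 8]`, this returns r = 1.0 and slope = 2.0.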
For example, owing to the immature state of CNN interpretability techniques, we still cannot quantitatively evaluate which channel of metadata, soft tissue, or skeleton contributes the most to a person's facial aesthetics. More specifically, it would be valuable to identify which parts of the soft tissue and skeleton, and which metadata parameters (gender, age, weight), are important to the final results of the CNN model's prediction, which would benefit surgeons' and researchers' understanding of human aesthetics. As deep learning methods are still developing rapidly, we hope that in future studies we can use newly proposed analysis tools to quantitatively evaluate the contribution ratio of each data channel, thereby improving the interpretability of the model, gaining a better understanding of the CNN's internal processing workflow, and providing the surgeon with a more predictable and comprehensible deep-learning-based model for facial aesthetic analysis.

In addition, compared with common 2D pictures, 3D CT images contain much more solid information and medical data that are not available in 2D images. Nevertheless, 2D images can capture other important information that medical images do not include, such as hairstyles and surface skin conditions, which are also important in evaluating a person's facial aesthetics. We do not have access to patients' 2D pictures; otherwise, the model could be expanded from the current three-input network to a four-input network by adding the 2D pictures to the backbone architecture. By doing so, a more comprehensive model for facial aesthetics evaluation could be obtained.

In conclusion, this study proposed a facial aesthetics evaluation model using a 3-input, 2-output deep learning neural network based on 133 CT images. The experimental results showed that this model could predict the skeleton and soft tissue scores with 0.231 ± 0.218 (4.62%) and 0.100 ± 0.344 (2.00%) score errors in an average processing time of 11.203 ± 2.824 s from the raw CT image. This model could be a useful clinical tool to automate surgeons' current workflow and reduce the gap in aesthetic perspective between the patient and the surgeon for a better surgical plan. In future studies, we would like to expand the dataset and input more facial-aesthetics-related data to train a more holistic model based on a new network architecture.
TABLE 1 Statistics of patient information.