Craniofacial Reconstruction via Face Elevation Map Estimation Based on the Deep Convolutional Neural Network

School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
Jiangsu Provincial Key Laboratory of Computer Network Technology, Southeast University, Nanjing 211189, China
Institute of Stomatology, Nanjing Medical University, Nanjing 210029, China
Jiangsu Key Laboratory of Oral Diseases, Nanjing Medical University, Nanjing 210029, China
Affiliated Hospital of Stomatology, Nanjing Medical University, Nanjing 210029, China


Introduction
Craniofacial reconstruction is a technique for producing a reconstructed face from a human skull. Based on the relationship between the skull and the face established in forensic medicine, anthropology, and anatomy, the technique has been widely used in criminal investigation and archaeology. Traditional craniofacial reconstruction is mainly performed manually by experts: guided by the anatomical laws of the human head and face and by the relationship between the soft tissue and the morphological characteristics of the face and skull, the victim's facial appearance is gradually reproduced by adding rubber clay and other materials to a plaster model of the skull. This process is complicated, costly, and time-consuming. In addition, the result depends largely on the practitioner's experience, so its application in criminal investigations, where timeliness and truthfulness are essential, is greatly restricted.
With the development of computer visualization and virtual three-dimensional technology, computer-aided craniofacial reconstruction has greatly reduced the reconstruction time and difficulty and mitigated subjective bias, attracting widespread attention. Current reconstruction methods are based on either templates [1] or feature points [2,3]. Template-based methods require a face template set in advance. During reconstruction, the template is deformed according to the shape of the skull until the feature points on the face template match the feature points estimated from the skull. Reconstruction can be based on fixed templates [4][5][6][7] or dynamic templates [8][9][10][11][12][13][14]. Feature point-based methods first estimate the soft tissue thickness at facial key points and then restore the facial surface. Although feature point-based methods have been applied in forensic practice, they still have two main limitations. First, when recovering a complete face surface from sparse feature point information, the loss of facial detail is inevitable. Second, human interaction is often required to ensure accurate feature point positioning, which introduces an extra anthropic factor.
Craniofacial reconstruction is essentially a problem of sample generation based on reference data. With the rapid development of deep learning, data generation based on convolutional neural networks shows significant advantages; representative technologies are the variational autoencoder (VAE) [15,16] and the generative adversarial network (GAN) [17]. Both the VAE and the GAN attempt to learn a mapping from latent-space variables to the real data distribution through training samples. The difference is that the VAE computes the mean and variance of the samples through a neural network, constrains them to obey a standard normal distribution, and then samples latent variables for reconstruction [18], whereas the GAN adopts the idea of game theory and directly measures the distance between the real and generated distributions through a discriminator, forcing the generator to produce a more realistic distribution. In recent years, the GAN has received extensive attention, and many variants have been derived, such as the WGAN [19], CGAN [20], Pix2Pix [21], and BEGAN [22]. Convolutional neural networks have also been introduced into craniofacial reconstruction. Li et al. [23] proposed a convolutional neural network based on a codec structure, which can predict the distribution of skeletal soft tissue well; however, the method has a high computation cost and demanding hardware requirements, and the generated results are still not satisfying. Yuan et al. [24] used the GAN to reconstruct 3D face images; limited by the data amount and computing power, the authors used a sparse representation of the 3D data to reduce the computation cost and improve the recovery ability. Liu and Xin [25] proposed a prediction method based on the autoencoder and GAN: candidate faces are generated through the autoencoder, the face and skull are superimposed to determine the best face, and the GAN is used afterwards to optimize the results. Such a scheme is essentially a deep learning version of the template-based method. Although the reconstruction accuracy is relatively high, the common problem of template methods is inevitable: the generation process is cumbersome, and the network structure is complex.
Based on the above research, we propose an end-to-end facial morphology prediction method based on a deep convolutional neural network to automatically estimate face information from skull data. The proposed method, named the cylindrical facial projection residual net (CFPRN), needs neither a preset face template nor feature point detection. To avoid unnecessary calculations, we do not reconstruct the face data directly in 3D space but instead estimate the face elevation map in a 2D cylindrical projection space; a back-projection operation is then performed to obtain the 3D face surface. We use a U-shaped network structure to adapt to features of different scales. The CFPRN is easy to implement, and experiments have verified the robustness and accuracy of the proposed method.

Data Segmentation.
The objective of craniofacial reconstruction is to recover 3D face surface data from 3D skull data. Both are obtained from a 3D head CT scan. The face surface can be simply retrieved via threshold segmentation, as shown in Figure 1(a); however, due to the complex distribution of soft tissue and cartilage, global threshold segmentation is not suitable for the skull. To obtain a clean skull structure, we use adaptive threshold segmentation with a sliding window, whose size is set to 7 × 7 × 7. The comparison of global and adaptive thresholding is shown in Figures 1(b) and 1(c).
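The adaptive segmentation step described above can be sketched as follows: each voxel is compared against the mean intensity of its local 7 × 7 × 7 neighbourhood instead of a single global threshold. This is an illustrative sketch, not the authors' implementation; the function name, the `offset` parameter, and the use of `scipy.ndimage.uniform_filter` for the sliding-window mean are our assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_threshold_3d(volume, window=7, offset=0.0):
    """Binarize a CT volume against the local mean of a window^3 neighbourhood
    (adaptive thresholding with a 7x7x7 sliding window, as in the paper)."""
    local_mean = uniform_filter(volume.astype(np.float64), size=window, mode="nearest")
    return volume > (local_mean + offset)
```

A voxel is kept only when it is brighter than its surroundings, which suppresses the slowly varying soft-tissue background that a single global threshold cannot separate from bone.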

Projection and Back-Projection.
For a craniofacial reconstruction task based on convolutional neural networks, the 3D volume data obtained via head CT scans usually have an excessive data volume [26]. With existing hardware, it is difficult to construct a feature network directly on 3D data at the original resolution. In fact, during reconstruction, only the surfaces of the skull and the face need to be considered. Therefore, we use projection operations to map the 3D data into 2D space for calculation. Considering that the human head is close to a circle in the cross-section, and to avoid resolution inconsistency along the vertical axis, we use a cylindrical projection surface. Plane projection and sphere projection are not considered, because the former leads to inconsistent resolution in the vertical direction and the latter results in inconsistent resolution in different horizontal slices. As shown in Figure 2(a), the cross-section of the CT scan is the XOY plane, and the Z-axis is perpendicular to the cross-section. The coronal plane and the sagittal plane are the XOZ plane and the YOZ plane, respectively. Figure 2(b) shows the projected plane coordinate system. The coordinate transform between 3D space and the cylindrical projection plane is defined as follows. For projection,

u = (n/π) · atan2(y′, x′),  v = z′,  r = √(x′² + y′²);

for back-projection,

x′ = r cos(πu/n),  y′ = r sin(πu/n),  z′ = v,

where x′, y′, z′ are the coordinates in 3D space, and u, v are the coordinates on the projection plane. r is the pixel value of the 2D projected elevation map, representing the distance from the point to the projection axis in 3D space. Thus, the depth information of the 3D facial and skull surfaces is preserved through the projection and back-projection steps. 2n is the total number of samples along the U axis.
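The cylindrical mapping between 3D surface points and the 2D elevation map can be sketched in a few lines: a point's angle around the vertical axis gives u, its height gives v, and its distance to the axis is stored as the elevation value r. This is an illustrative sketch under our reading of the transform described in the text; the function names and the sampling constant `n` are assumptions, not the authors' code.

```python
import numpy as np

def cylinder_project(points, n=256):
    """Map 3D surface points (x', y', z') to cylindrical coordinates (u, v, r).
    r is the distance to the vertical projection axis; u spans 2n samples
    over one full turn around the cylinder."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2)
    u = (np.arctan2(y, x) % (2 * np.pi)) * n / np.pi  # angle -> [0, 2n)
    v = z
    return u, v, r

def cylinder_backproject(u, v, r, n=256):
    """Inverse mapping: recover (x', y', z') from plane coordinates (u, v)
    and the stored elevation value r."""
    theta = u * np.pi / n
    return np.stack([r * np.cos(theta), r * np.sin(theta), v], axis=-1)
```

Because r is carried along as the pixel value of the elevation map, the projection followed by back-projection is lossless up to the angular sampling resolution, which is what allows the network to operate entirely in 2D.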

Network Architecture
The network structure follows the encoder-decoder structure of U-net [27] and draws on the relevant ideas of the CGAN [20], Pix2Pix [21], and other networks to realize an end-to-end network.
In the encoder-decoder structure, the first half of the network acts as an encoder, which successively downsamples the input through pooling or strided convolution to extract deep features from the input image. The second half acts as a decoder, which successively upsamples through deconvolution or interpolation to map the features output by the encoder back to the size of the previous level. In the meantime, cross-layer connections are used: the high-level feature map upsampled by the decoder and the low-level feature map of the same scale in the encoder are concatenated along the channel dimension, merging feature information of different scales to make the prediction more accurate and stable. Figure 3 shows the specific structure of the proposed network.
The network is divided into two parts: an encoder module and a decoder module. The encoder module is mainly composed of a convolutional layer and five convBlocks; each convBlock, as shown in the bottom right of Figure 3, contains a leaky ReLU activation layer, a 3 × 3 convolutional layer, and a group normalization layer. The encoder module performs six downsampling steps in total, and the pooling operation is replaced by a convolution with a stride of 2 so as to retain more feature information.
The decoder module is composed of five deconvBlocks and a convolutional layer. Each deconvBlock, as shown in the bottom right of Figure 3, contains a leaky ReLU activation layer, an upsampling layer, a 3 × 3 convolutional layer, and a group normalization layer. The decoder module performs upsampling six times in total; bilinear interpolation is used for upsampling, doubling the height and width of the feature map each time. The feature map after each upsampling is concatenated along the channel dimension with the feature map of the corresponding scale in the encoder. Through such cross-layer connections, deep and shallow features can be effectively merged.
In the meantime, we use several tricks to improve the performance of the network: (i) replace deconvolution with bilinear-interpolation upsampling followed by convolution, which effectively avoids the checkerboard effect [28]; (ii) replace ReLU with leaky ReLU, which effectively reduces dead neurons, and replace pooling with convolution with a stride of 2 to retain more features; (iii) use group normalization [29] instead of batch normalization, which avoids the impact of batch size on the training results.
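The two building blocks described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code; the channel counts, leaky-ReLU slope, and group-normalization group count are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, groups=8):
    """Encoder convBlock: leaky ReLU -> strided 3x3 conv (replacing pooling)
    -> group normalization. The stride-2 conv halves height and width."""
    return nn.Sequential(
        nn.LeakyReLU(0.2),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.GroupNorm(groups, out_ch),
    )

def deconv_block(in_ch, out_ch, groups=8):
    """Decoder deconvBlock: leaky ReLU -> bilinear x2 upsampling -> 3x3 conv
    -> group normalization. Upsample+conv avoids checkerboard artifacts."""
    return nn.Sequential(
        nn.LeakyReLU(0.2),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.GroupNorm(groups, out_ch),
    )
```

In the full U-shaped network, the output of each `deconv_block` would be concatenated with the encoder feature map of the same spatial size before the next block, realizing the cross-layer connections.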
We use the normalized skull elevation map as the network input, with the data range limited to (−1, 1). Normalization speeds up the convergence of the network and increases the generalization ability of the model. For the supervised data, we have two options: use the face elevation map directly, or use the residual between the face and skull surfaces (referred to as "face" and "res," respectively, in the experiment section). The loss is defined as the distance between the predicted and real face elevation maps. We use the mean square error

Security and Communication Networks
(MSE) to define the loss function, which represents the average of the squared differences between the predicted and real elevation maps:

MSE = (1/m) Σᵢ (xᵢ − yᵢ)²,

where m denotes the number of pixels, and xᵢ and yᵢ denote the predicted and label values, respectively.
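As a quick reference, the MSE loss over an elevation map is a one-liner in numpy; the function name is ours, not the paper's.

```python
import numpy as np

def mse_loss(pred, target):
    """MSE = (1/m) * sum((x_i - y_i)^2) over the m pixels of the elevation map."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return np.mean(diff ** 2)
```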

Data Description.
The dataset used for the experiments was acquired from head cone-beam CT scans on a NewTom 5G. It contains CT data of 1447 participants from the Affiliated Hospital of Stomatology, Nanjing Medical University. Each sample has 540 CT slices; the resolution of each slice is 610 × 610, and the pixel size is 0.3 mm × 0.3 mm. 1310 samples were randomly selected as the training set, and the validation set is composed of the remaining 137 samples.

Evaluation Indices.
Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [30] are chosen as evaluation indices for the experiments. The PSNR measures the ratio between the energy of the peak signal and the average energy of the noise and is commonly used to assess signal recovery quality. It is defined as

PSNR = 10 · log₁₀(MAX² / MSE),

where MAX denotes the maximum pixel value in the data, and MSE is the mean square error. Besides the PSNR, the SSIM is used to measure the similarity between the ground truth and the predictions. The SSIM measures image similarity in terms of brightness, contrast, and structure; its value lies in (0, 1] for typical images, and a larger value indicates smaller image distortion.
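The PSNR index can be computed directly from its definition; this is a short illustrative helper, and the function signature (in particular passing `max_val` explicitly) is our choice.

```python
import numpy as np

def psnr(pred, target, max_val):
    """PSNR = 10 * log10(MAX^2 / MSE); higher values indicate better recovery."""
    err = np.mean((np.asarray(pred, dtype=np.float64)
                   - np.asarray(target, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / err)
```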

Result.
We visualized the experimental results intuitively. Figure 4 shows the input elevation maps of the skulls. Figure 5 shows the predicted face elevation maps corresponding to the skulls in Figure 4, together with the corresponding ground truth. The predicted results are very close to the ground truth. The pseudocolour maps in Figure 6 visualize the difference between the output and the ground truth (in percentage), from which we can see that the error mainly occurs in the eye, nose, and mouth areas. Because of the cavities in the skull, the eyes and nose cannot be predicted accurately. We use the predicted elevation map to generate 3D facial data through back-projection and compare the generated 3D face with the ground truth. The difference map is shown in Figure 7, from which we can see that, for the most part, the error is within 1 mm.

Comparison.
We repeated the experiments with different network architectures and different image sizes. The specific results are given in Tables 1 and 2, with our proposed configuration in bold. Table 1 indicates that the proposed CFPRN achieves high accuracy and shows the best performance among all candidates. The abbreviation "Res" means the network output is the residual between the face and skull, and "Face" means the network output is the face surface directly. Table 2 indicates that the CFPRN works well under different resolution settings.

Error Analysis
(1) To simplify the network and improve efficiency, we reduce the dimension of the input, which causes a partial loss of data accuracy. After the prediction is completed, back-projection is performed, which may also introduce additional error. (2) From the error maps, we can see that basically all samples have large errors around the nose and eyes. This is because the skull has cavities at the eyes and nose, which cannot be predicted accurately; this might be overcome by introducing many more samples.

Conclusion and Prospects
In this study, we propose an end-to-end deep learning method for craniofacial reconstruction. The main contributions of the proposed method can be summarized as follows: (1) We use projection and back-projection to transfer the 3D skull and face data into 2D elevation maps. Instead of performing craniofacial reconstruction in 3D space, the recovery runs in 2D space: the face elevation map is estimated from the skull elevation map. Such a design largely reduces the data size and computation cost, so that the proposed method can run on consumer graphics cards. (2) We design a U-shaped end-to-end network to fit features at different scales. The accuracy and robustness of the prediction are confirmed by the experimental results.
Based on our experimental results, we can also make further prospects: (1) The number of samples should be expanded, and samples should be divided according to gender and age to balance the distribution of the sample data. (2) The eye and nose regions of the skull should be hollowed out or filled. Because the specific shape of the face in these parts cannot be inferred from the skull, hollowing out or filling them helps reduce their impact on the experimental results. (3) We will try other network architectures, such as the conditional GAN. By introducing more conditions, we may provide subdivided predictions with higher accuracy.

Data Availability
All the experimental data were obtained from the Affiliated Hospital of Stomatology, Nanjing Medical University. Access to the original data is restricted due to patient privacy. However, the data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.