Role for 2D image generated 3D face models in the rehabilitation of facial palsy

The outcome for patients diagnosed with facial palsy has been shown to be linked to rehabilitation. Dense 3D morphable models have been shown within the computer vision field to create accurate representations of human faces, even from single 2D images. This has the potential to provide feedback to both the patient and the medical expert dealing with the rehabilitation plan. We propose a framework for the creation and measurement of patient facial movement, consisting of 3D model fitting and a hybrid 2D facial landmark fitting technique which shows better accuracy in testing than current methods.

Introduction:
Recent medical studies [1][2][3] have highlighted that patients diagnosed with and treated for specific types of facial paralysis, such as Bell's palsy, have outcomes that are directly linked to the rehabilitation provided. While various treatment and rehabilitation paths exist, dependent on the specifics of the facial palsy diagnosis, the aim is to restore a degree of facial muscle movement to the patient. Lindsay et al. [4] completed a comprehensive 5-year study of the rehabilitation process and outcomes for 303 facial paralysis patients; the key finding was the need for specialised therapy plans, tailored via feedback, for the best patient outcomes. Banks et al. [5] have shown that high-quality qualitative feedback to a clinician is required for the best development of rehabilitation plans.
Tracking and providing qualitative feedback on the progress of rehabilitation for a patient is an area where the application of computer vision and machine learning techniques could prove highly beneficial. Computer vision methods can capture accurate 3D models of the human face, and these in turn can be leveraged to analyse and measure changes in face shape and levels of motion [6].
Applying 3D face modelling techniques in an automated framework for tracking facial palsy rehabilitation progression has a number of potential benefits. 3D face models generated from a 2D face image can provide a detailed topography of an individual human face, which can be qualitatively measured for change over time by a computer system. Such an automated system would enable the clinician dealing with a patient's rehabilitation to gather regular objective feedback on the condition and tailor therapy without always needing to physically see the patient, and would provide continuity of care if, for instance, the clinician changes during the rehabilitation period. Patients would have visual evidence with which to see the progress that has been made. It has been indicated that patients suffering from facial palsy can also be affected by psychological and social problems, so the capacity to track rehabilitation privately within a comfortable setting, such as their own home, may be of benefit.
Some previous studies [7] have looked at aiding diagnosis through the application of computer vision techniques, but these have been limited to 2D imaging, which measures on a sparse set of landmarks. Our hypothesis is that 3D face modelling, consisting of thousands of landmarks, provides a far richer model of the face, which in turn can present a more accurate measurement system for facial motion.
In this Letter, we propose a framework for the accurate generation of 3D face models of facial palsy patients from 2D images by applying state-of-the-art methods, propose a method of using geometric features to track rehabilitation, and present our conclusions.

Proposed system overview:
The accuracy of the facial representation is a key component of any computer-based system which aims to measure facial motion. We suggest that the more complex the depiction of the individual patient's facial topography, the greater the potential for the desired level of accuracy. Developing such a system requires a framework of methods to build and measure such a model.
As camera systems which perceive depth within an image are not currently commonplace, or require specialist and expensive hardware, we initially require a method for face detection and 2D face alignment. Fig. 1 shows an example of 2D face alignment where 68 landmarks are fitted to the face. Many methods have been researched for this purpose, and in the limited previous work on facial palsy the methods adopted have been a variation of the active shape model [7]. More recently, other methods have shown state-of-the-art results, such as discriminative response map fitting (DRMF) [8], deformable part models (DPM) [9] and a deep learning variation which applies convolutional neural networks (CNNs) for pose-invariant 3D face alignment (PIFA) [10].
Following the 2D face alignment process, we propose the generation of a 3D facial model. 3D facial modelling provides a much richer representation of an individual's facial geometry, comprising a dense mesh that generally contains many thousands of vertices. The use of 3D face models theoretically provides us with a set of geometric features which can give a more accurate measure of facial movements. In our proposed system the 3D morphable model (3DMM) [6] is applied. 3DMMs have been shown to produce accurate models in research, and the recent 3DMM fitting approach in [11] has shown excellent results, as demonstrated by the fitted model shown in Fig. 2.
Once a 3D face model is generated, a set of features is required for measuring the facial motion. Geometric features have previously been shown to be effective in areas such as facial expression recognition [12], which shares some similarities with this problem domain. With a larger set of key-points (an example of which is shown in Fig. 3) that describe the face in rich detail, we believe that geometric features have the potential to measure facial movement ranges with a greater degree of accuracy. Extraction of a feature set based upon geometric features is also relatively computationally inexpensive.

Methods:
Our framework consists of three specific components: 2D face alignment, 3DMM fitting and geometric feature extraction. Within this section, the methods for each component are discussed in further detail.
To provide the most accurate detection of 2D facial landmarks, we propose a hybrid method based upon our experimental findings discussed later in this Letter. This hybrid method consists of applying two distinct methods, each fitting a subset of the 2D facial landmarks. The first 2D facial alignment method, used for fitting the majority of the landmarks required to construct a 3DMM, is DRMF, which is a form of part-based constrained local model (CLM). The model is set up as $M = \{S, D\}$, in which a set of detectors $D$ of the various facial landmarks corresponds to fiducial points of the shape model $S$. CLMs define a face as a 3D object, as given in (1). For brevity, we refer the reader to [8] for a full technical overview of the DRMF method.
The second method, applied for fitting the important mouth region landmarks, is PIFA [10]. PIFA applies a series of CNNs within a cascaded regression framework to estimate the shape parameter $p$. A mapping to predict $p$ is learnt from a set of $N_d$ training images. An estimated update to the shape parameter at the $k$th stage of the cascaded CNN is learnt as per (3), where the true shape update is the difference between the current shape parameter and the ground truth, $\Delta p_i^k = p_i^k - p_i^{k-1}$, $I_i$ is the training image, $U_i$ is the current estimated 2D landmarks and $v_i^{k-1}$ is the estimated landmark visibility. A six-stage cascaded CNN is used; at the initial input stage CNN  invariant feature patches, extracted from the current estimated 2D landmarks. We refer the reader to [10] for a more comprehensive overview of the method and the novel feature patches employed.
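As an illustration of the hybrid combination described above, the following minimal sketch merges two landmark sets so that mouth points come from one fit and all remaining points from the other. It assumes a 68-point iBUG-style markup in which 0-based indices 48 to 67 cover the mouth; that index range, the function name `merge_landmarks` and the toy data are assumptions made for this example and are not taken from the DRMF or PIFA implementations.

```python
# Assumption: 68-point iBUG-style markup, 0-based indices 48..67 are the mouth.
MOUTH_INDICES = frozenset(range(48, 68))


def merge_landmarks(drmf_points, pifa_points, mouth_indices=MOUTH_INDICES):
    """Return one landmark list: mouth points from PIFA, all others from DRMF.

    Both inputs are sequences of (x, y) tuples in the same landmark order.
    """
    if len(drmf_points) != len(pifa_points):
        raise ValueError("both fits must use the same landmark markup")
    return [
        pifa_points[i] if i in mouth_indices else drmf_points[i]
        for i in range(len(drmf_points))
    ]


# Toy data standing in for real detector output (pixel coordinates).
drmf = [(float(i), 0.0) for i in range(68)]
pifa = [(float(i), 1.0) for i in range(68)]
hybrid = merge_landmarks(drmf, pifa)
```

Keeping the landmark order identical in both inputs means the merged list can be passed on to the 3DMM fitting stage without re-indexing.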
We concatenate the 2D facial landmarks from the relevant points of the DRMF and PIFA outputs; these are passed to the second stage of the framework, in which a 3DMM is generated. The 3DMM is used to represent a dense 3D shape of an individual's face. In our framework we apply the fitting technique described in [13]:
$$S = \bar{S} + A_{id}\alpha_{id} + A_{exp}\alpha_{exp} \qquad (4)$$
where $S$ describes the 3D face, $\bar{S}$ is the mean shape, and $A_{id}$ and $A_{exp}$ are the principal axes trained on 3D face scans with neutral expression and on expression scans, respectively; $\alpha_{id}$ are the shape parameters and $\alpha_{exp}$ the expression parameters. $A_{id}$ and $A_{exp}$ are provided by the Basel Face Model [14] and FaceWarehouse [15], respectively. A weak perspective projection is used to project the face model onto the image plane for the fitting of the 3DMM to a face image:
$$s_{2d} = f P R (S + t_{3d}) \qquad (5)$$
where $s_{2d}$ are the 2D positions of the 3D points on the image plane, $f$ denotes the scaling factor, $P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$ is the orthographic projection matrix, $R = R(\alpha, \beta, \gamma)$ is the $3 \times 3$ rotation matrix constructed from pitch ($\alpha$), yaw ($\beta$) and roll ($\gamma$), and $t_{3d}$ is the translation vector.
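The shape model (4) and the weak perspective projection (5) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the order in which the pitch, yaw and roll rotations are composed, the representation of each basis as a list of basis vectors, and all function names are assumptions of the example rather than details taken from [13].

```python
import math


def compose_shape(mean_shape, a_id, alpha_id, a_exp, alpha_exp):
    """Eq. (4) on flattened vectors: S = mean + A_id*alpha_id + A_exp*alpha_exp.

    a_id and a_exp are lists of basis vectors, each as long as mean_shape.
    """
    shape = list(mean_shape)
    for basis, coeffs in ((a_id, alpha_id), (a_exp, alpha_exp)):
        for axis, weight in zip(basis, coeffs):
            for j, value in enumerate(axis):
                shape[j] += weight * value
    return shape


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rotation_matrix(pitch, yaw, roll):
    """R = Rz(roll) * Ry(yaw) * Rx(pitch), angles in radians.

    The composition order is an assumption of this sketch.
    """
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]
    return matmul(rz, matmul(ry, rx))


def project(vertices, f, pitch, yaw, roll, t3d):
    """Eq. (5): s_2d = f * P * R * (S + t_3d) for a list of (x, y, z) vertices."""
    r = rotation_matrix(pitch, yaw, roll)
    points = []
    for v in vertices:
        shifted = [v[k] + t3d[k] for k in range(3)]
        rotated = [sum(r[i][k] * shifted[k] for k in range(3)) for i in range(3)]
        # P keeps the x and y rows and discards depth.
        points.append((f * rotated[0], f * rotated[1]))
    return points
```

With zero rotation and translation, `project` reduces to scaling the x and y coordinates of each vertex by `f`, which is a convenient sanity check when experimenting with the parameters.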
The fitting of this model is defined by (6): the 2D landmarks identified in stage one of the framework, defined here as $s_{2dt}$, are associated with their corresponding 3D points, and the model parameters are estimated by minimising the distance between $s_{2d}$ and $s_{2dt}$:
$$\arg\min_{f, R, t_{3d}, \alpha_{id}, \alpha_{exp}} \left\| s_{2dt} - s_{2d} \right\| \qquad (6)$$
A fitted 3D face model $S$ is a dense mesh consisting of $m$ vertices, where $m = 53215$ in the example shown in Fig. 3.
We propose an initial technique for extracting $n$ relevant geometric feature sets that can be applied to measure and track the restoration of facial motion. In (7), a set of evaluations $E$ is defined by the clinician, which forms the basis for measuring the rehabilitation progress of the patient. $S_0$ defines the 3D face model at a neutral expression, while $S_1$ is the model at the end range of the prescribed evaluation movement. $d$ is an $N$-dimensional index vector indicating the indices of semantically meaningful 3D vertices. Facial palsy often affects facial movement asymmetrically between the left and right sides, and the range of affected musculature is not always equal between the upper (eye and brow) and lower (mouth) regions. The semantically meaningful 3D vertices will therefore differ between patients, though they are likely to be quadrant or region based. Equation (8) defines a basic rehabilitation measurement, where $E_1$ is the set of initial evaluations taken pre-rehabilitation and $E_{1+i}$ is the most recent set of evaluations. A more semantically meaningful metric could be provided by incorporating a mapping to one of the recognised medical grading systems for facial palsy, such as Yanagihara, House-Brackmann or Sunnybrook.
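To make the idea of a geometric feature concrete, the sketch below computes one plausible measure: the mean Euclidean displacement of the indexed vertices between the neutral model $S_0$ and the end-range model $S_1$, and the change in that value between two evaluations. This is an assumed, simplified reading for illustration only and should not be taken as the exact form of (7) or (8); the function names and toy vertices are hypothetical.

```python
import math


def region_displacement(s0, s1, d):
    """Mean Euclidean displacement of the vertices indexed by d.

    s0: list of (x, y, z) vertices at neutral expression.
    s1: list of (x, y, z) vertices at the end range of the movement.
    d:  indices of the semantically meaningful vertices for this evaluation.
    """
    if not d:
        raise ValueError("index vector d must not be empty")
    return sum(math.dist(s0[i], s1[i]) for i in d) / len(d)


def progress(initial_value, latest_value):
    """Assumed progress measure: gain in movement range since the first evaluation."""
    return latest_value - initial_value


# Toy meshes with two vertices; only vertex 0 moves.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end_range = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
movement = region_displacement(neutral, end_range, [0])
```

Restricting `d` to a quadrant or region, as the text suggests, would allow the same function to be evaluated separately for the affected and unaffected sides of the face.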

Results:
A private dataset of six individuals who have a confirmed diagnosis of facial palsy is used to conduct some initial tests on the capability of the proposed hybrid method for fitting 2D facial landmarks. Each image is a cropped, full frontal facial image which has been manually marked with 68 facial landmarks to be used as the ground truth landmark positions. For testing purposes, a subset of landmarks is applied that is identically marked for each of the methods tested. The methods tested are DRMF [8], PIFA [10] and DPM [9], for which we apply both a fully independent part model and a shared part model using 99 parts, as released by the authors. Finally, we show results for the hybrid approach mentioned above, where PIFA is used to fit the landmarks relating to the mouth and DRMF the other landmarks. To measure the accuracy of the techniques, we apply the root mean square error (RMSE) with ocular scaling, as used in [8], to deal with the images having faces of different sizes. Fig. 4 shows that, across the dataset, the accuracy of the methods can vary by a fairly large margin from subject to subject. The DPM shared model is especially volatile, giving the best fit for subject 3 but by far the worst for subject 4. While not performing the best for any single subject, the proposed hybrid method is the most consistent in terms of accuracy across the dataset. When examining the mean RMSE, as shown in Table 1, we can see that the hybrid method outperforms all other methods, with DRMF also showing a distinct advantage over the remaining methods, including the state-of-the-art CNN-based PIFA method.
When we further examine the results based on the accuracy for certain key facial landmarks, as shown in Fig. 5, DRMF has very high accuracy for all landmarks except those in the mouth area, whereas the PIFA method fits mouth landmarks better but struggles on this dataset with accuracy in other areas. The hybrid method provides the best accuracy, though there are still some issues when fitting the corners of the mouth. This is likely due to the asymmetrical nature of the mouth location in the test set, and to the fact that none of the models tested have been trained on any data specifically relating to this condition.
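The error measure described above can be sketched as follows: the RMSE of the point-to-point landmark errors, divided by the distance between two eye landmarks in the ground truth so that faces of different sizes are comparable. The default eye indices (36 and 45, the outer eye corners in a 0-based 68-point iBUG-style markup) and the function name are assumptions for this example; the exact normalisation used in [8] may differ in detail.

```python
import math


def normalised_rmse(predicted, ground_truth, left_eye=36, right_eye=45):
    """RMSE of landmark errors divided by the ground-truth ocular distance.

    predicted, ground_truth: lists of (x, y) tuples in the same order.
    left_eye, right_eye: indices of the two eye landmarks used for scaling
    (assumed defaults, see the lead-in above).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("landmark lists must have the same length")
    squared_errors = [math.dist(p, g) ** 2 for p, g in zip(predicted, ground_truth)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    ocular_distance = math.dist(ground_truth[left_eye], ground_truth[right_eye])
    return rmse / ocular_distance


# Toy example with two landmarks that double as the eye points.
truth = [(0.0, 0.0), (10.0, 0.0)]
fitted = [(0.0, 1.0), (10.0, 1.0)]
error = normalised_rmse(fitted, truth, left_eye=0, right_eye=1)
```

Because the result is a fraction of the ocular distance, values from images of different resolution can be averaged across subjects, as is done for the mean RMSE comparison.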

Conclusions:
In this Letter, we have proposed a potential framework for measuring the progress of rehabilitation for patients with facial palsy by automatically building a 3D face model from basic 2D images of the patient. We have investigated landmark fitting using state-of-the-art techniques and proposed a hybrid 2D landmark fitting method incorporating these, which provides better accuracy when measured against the ground truth 2D landmarks. To realise the potential of an application for facial palsy rehabilitation measurement, there are two key areas of further work. The first is that, although the proposed hybrid method provides a high degree of accuracy on landmark fitting, a significant level of error remains in fitting mouth landmarks, specifically in facial palsy patients when there is a large range of asynchronous movement. This level of error could negatively impact the accuracy of the rehabilitation tracking, and therefore further study is needed of how asymmetrical motion in the face can be captured with 2D landmark fitting systems. The second is to develop available datasets of 3D models graded specifically for facial palsy, which can be used as ground truth to fully support the proposed framework in its entirety.

Acknowledgment:
The authors acknowledge the financial support of the EPSRC grant (EP/P009727/1).

Funding and declaration of interests:
Conflict of interest: none declared.