Composite recurrent network with internal denoising for facial alignment in still and video images in the wild

Facial alignment is an essential task for many higher-level facial analysis applications, such as animation, human activity recognition and human-computer interaction. Although the recent availability of big datasets and powerful deep-learning approaches has enabled major improvements in state-of-the-art accuracy, the performance of current approaches can severely deteriorate when dealing with images in highly unconstrained conditions, which limits the real-life applicability of such models. In this paper, we propose a composite recurrent tracker with internal denoising that jointly addresses both single-image facial alignment and deformable facial tracking in the wild. Specifically, we incorporate multilayer LSTMs to model temporal dependencies of variable length and introduce an internal denoiser which selectively enhances the input images to improve the robustness of our overall model. We achieve this by combining four different sub-networks that specialize in each of the key tasks required, namely face detection, bounding-box tracking, facial region validation and facial alignment with internal denoising. These blocks are endowed with novel algorithms, resulting in a facial tracker that is accurate, robust to in-the-wild settings and resilient against drifting. We demonstrate this by testing our model on the 300-W and Menpo datasets for single-image facial alignment, and on the 300-VW dataset for deformable facial tracking. Comparison against 20 other state-of-the-art methods demonstrates the excellent performance of the proposed approach. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license.


Introduction
The human face is arguably one of the most important deformable objects for analysis in real-world applications, such as facial animation, human activity recognition and human-computer interaction [1]. Facial alignment, which aims to detect a set of facial landmark positions, is essential for the high-level analysis required in those applications to function well [2]. Currently, the development of facial alignment models is growing rapidly with the availability of large facial landmarked datasets such as 300-W [3] and Menpo [4]. This has made possible the development of powerful deep learning models that have pushed forward the alignment accuracy and are considered the current state of the art [5,6].
However, the performance of current facial alignment models can severely deteriorate when dealing with images in highly unconstrained conditions, e.g. extreme pose or illumination changes, large occlusions [7] or, in general, whenever the test images can be considered to show less favorable conditions than those available for training. In other words, we may say that under challenging conditions, test images contain some form of distortion or noise that will impair the performance of facial alignment models. This limits the real-life applicability of such models to in-the-wild images, which naturally contain such challenging conditions [8,9]. While there have been attempts to improve the robustness of facial alignment models to target in-the-wild data [7,10], most of them have done so without modeling the effects of noise in their formulation. Nevertheless, understanding and incorporating such effects within model training has proven beneficial for improving performance in other facial analysis tasks [8,11,12].
A very important aspect of facial alignment algorithms is their ability to operate reliably on image sequences (i.e. video data). In such cases, the problem becomes more challenging as there is the need for persistent stability to keep track of the facial features throughout the whole sequence [13]. Progress on facial tracking has been relatively slower when compared to single-image facial alignment, and it has been less influenced by deep learning models [14]. Furthermore, the majority of currently available trackers make little use of temporal information: most of them process each frame independently and rely on doing so with sufficient precision to achieve tracking-like performance [15,16], or convert single-image facial alignment to perform tracking by using the results of the previous frame as initialization [17-19]. However, the latter approach makes the models vulnerable to drifting since they are not designed specifically to track [14]. In contrast, other trackers incorporate some temporal modeling, but they are mostly limited to adjacent frames [20,21]. All of the above suggests that current facial trackers may not take full advantage of the temporal information contained in video sequences [22].
In this paper we present a composite end-to-end facial tracking model which jointly addresses both single-image and video-sequence alignment. We introduce a robust facial landmark estimator equipped with an internal denoising autoencoder to selectively enhance the images on a case-by-case basis and boost the accuracy of single-image landmark estimation. Temporal information is handled by means of internal Long Short-Term Memory (LSTM) layers that allow modeling of short and long temporal dependencies between frames when video is available. Combining these two main modules results in an algorithm that produces very accurate facial alignment and is also resilient against drifting, leading to a robust facial tracker.
The contributions of this paper are as follows:
1. We present a unified approach for both single-image alignment and facial tracking with our joint robust facial alignment and tracking model.
2. Our model employs an internal image denoising network to obtain an enhanced intermediate facial image that helps to improve the accuracy of landmark localization.
3. We incorporate temporal modeling between frames using internal multilayer LSTMs which, unlike other approaches, allow considering time dependencies at various time scales.
4. We achieve state-of-the-art results for both targeted tasks: single-image alignment (on the 300-W [3] and Menpo [4] datasets), and deformable facial tracking in the wild (using the 300-VW dataset [2]).
5. We investigate the impact of the considered temporal scale for tracking, as well as the impact of the internal denoiser on the overall accuracy of our model.
Preliminary results on landmark detection with denoising autoencoder networks and composite recurrent tracking were presented in [19,23]. The rest of the paper is organized as follows: Section 2 describes the related work in the context of single-image facial alignment and facial tracking. In Section 3, we explain our Denoised Composite Recurrent Tracker, which consists of multiple sub-networks operating in tandem and merged using our novel tracking algorithm. Section 4 reports our experiments on both single-image facial alignment and deformable facial tracking. Finally, in Section 5 we derive our conclusions.

Related work
In this section, we describe the prior work related to the two fundamental problems addressed by our approach, namely single-image facial alignment and facial tracking.

Single image facial alignment
Face alignment has attracted considerable attention due to its importance for several applications, such as face recognition [24], head pose estimation [25], facial reenactment [26], emotion recognition [67,68] and others [66]. This task is generally conducted by estimating a predefined set of landmark positions which provide a structured representation of the facial geometry. Traditionally, facial landmarks have been estimated using either shape and appearance models [27,28] or regression-based models [29,30], with the latter offering some advantage due to their computational efficiency [31]. Recent examples of regression-based models include the work of Kazemi and Sullivan [32], who use a cascade of regression functions to efficiently regress the landmark locations, and Zhu et al. [33], who refined the regressor cascade with a coarse-to-fine search.
The recent availability of large datasets such as 300-W [3] and Menpo [4] allows for large-scale data modeling and the development of deep learning-based models that benefit from this wealth of data. For instance, the work of Bulat and Tzimiropoulos [5] used multiple hourglass-shaped convolutional networks with heatmap-guided layers to predict the final facial landmarks, while Zadeh et al. [34] introduced multiple patch experts to adjust each of their final landmark estimations, which has been incorporated into the OpenFace framework [35]. Zhang et al. [36], in addition, used facial attributes as auxiliary features in their cascaded convolutional network to improve landmark estimates. These deep learning models currently hold the state-of-the-art accuracy both on the single-image facial alignment [3] and on the facial landmark tracking tasks [2].
Despite the maturity of current facial alignment models, they still face challenges when targeting images with large appearance variations due to heavy occlusion, severe pose or illumination conditions, etc. [7]. This is especially relevant in real-world applications deployed in highly unconstrained settings [8]. Thus, there has been growing interest in improving the performance of facial models under such settings, including efforts such as iterative initialization of regression cascades to minimize the impact of outliers [10] or the combination of data- and model-driven estimators to enhance the robustness of landmark detection [7]. However, most efforts have focused on building robust feature extractors, under the assumption that the model will be able to discard the noise and select only the meaningful features.
In contrast, we address this problem from a different perspective: we focus on modeling image noise by means of our internal denoiser network, which is conceived as an auxiliary block that aims to automatically enhance the quality of the input image to improve the accuracy of landmark localization. This strategy can be justified by recent findings such as those of Dong et al. [11], who aggregated different styles of images using Generative Adversarial Networks to improve their landmark estimates, revealing that even slight color style variations in a given facial image (which is otherwise kept fixed) can impact the accuracy of landmark estimation. Similar findings have been reported in other facial analysis tasks, such as those of Zhou et al. [9], who evaluated several facial detection models with synthetic noise, and Goswami et al. [12], who investigated the robustness of facial classification to synthetically distorted images. The findings in these reports have led us to hypothesize that modeling certain types of noise may help to characterize the behavior of facial analysis systems with unconstrained images, and that incorporating such noise modeling into the training process could improve the robustness of their results.

Facial tracking
Currently, the most popular facial tracking technique is Tracking by Detection, which consists of performing facial detection and landmark localization at each frame. Some examples of this strategy include the work from Uricar et al. [37] and the OpenFace tracker. Uricar et al. use tree-based Deformable Part Models (DPM) for facial landmark detection and localization with Kalman Filter smoothing, while the OpenFace tracker [35] uses the multiple convolutional experts method described in the previous Section [34] initialized with the bounding box from a Multi-Task Cascaded Neural Network face detector [38].
Other tracking methods perform face detection only in the first frame and then apply facial landmark localization using the fitting result from the previous frame as initialization. One such example is the work of Xiao et al. [20], which adopts a multi-stage regression-based approach to initialize the shape of landmarks with high semantic meaning. Other examples include the work of Raja et al. [21], which combines a global shape model with sets of response maps for different head angles indexed on the shape model parameters, and the work of Wu et al. [39], who apply shape-augmented regression. There are also hybrid approaches that combine tracking by detection and initialization based on the latest fitting result. Among these, combinations of the Coarse-To-Fine Shape Search (CFSS) [33] landmark localizer with multiple general-object trackers have been shown to perform particularly well [14].
However, all methods derived from tracking by detection share the limitation of not considering the temporal information contained in video sequences. Furthermore, it is difficult to obtain consistent initializations from most face detectors, which tends to reduce the final landmark localization accuracy [40]. Some approaches try to mitigate this problem by including the information from the adjacent frames to capture short temporal dependencies. For example, Yang et al. [17] used time series regression on pairs of adjacent frames, which led them to achieve the best result reported so far on the original challenge in the 300 Videos in the Wild dataset (300-VW) [2], which is the largest deformable facial tracking benchmark to date.
With the recent growth of facial landmark datasets, such as 300-W [3], Menpo [41], 300-VW and LS3D-W [5], current methodologies on facial analysis started to shift from systems based on handcrafted features towards incorporating deep learning architectures [4,42]. Rapid progress can be seen on the development of various convolutional architectures as the main spatial feature extractor used on both facial detection [38] and landmark localization models [5,43], achieving state of the art accuracy. In spite of this, localization is still mainly performed on every single frame, without taking into account the temporal information.
On the other hand, the introduction of recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) [44], has allowed incorporating temporal information with great success in several applications [45]. This is the case of the recently introduced general object tracker Re3 [46], which is robust against image occlusions and can be trained on long sequences thanks to its internal LSTM networks. Nonetheless, RNNs have received little attention in the context of facial tracking. The only exceptions so far are the methods by Jiang et al. [6] and Peng et al. [47]. Jiang et al. [6] proved that an end-to-end RNN is capable of working on multiple domains including facial landmark tracking, with very low failure rates but without reaching state-of-the-art accuracy. Peng et al. [47] used an internal recurrent encoder-decoder network with heatmap guidance, thus allowing temporal modeling, but at the expense of high model complexity.
In summary, even though there exist models that address both the single-image alignment and facial tracking tasks, such as OpenFace [35], they consist of separate sub-models designed independently for each task, which is likely to produce sub-optimal results [38]. To the best of our knowledge, we are the first to present a facial tracker that incorporates a joint model for single-image denoising, facial landmark localization and temporal modeling. As will be shown in the experiments (Section 4), this strategy improves the single-image alignment accuracy and also enhances the stability of the tracking operation.
End-to-end denoised composite recurrent facial tracker
Our tracking model is a composite network that receives raw frames as input and returns the localization of facial landmarks as the final output. It is composed of four sub-networks, arranged in a way that permits end-to-end training for each of them without involving any hand-crafted features. Specifically, let X_t and X_{t−1} denote the current and previous frame; our Denoised Composite Recurrent Tracker (DCRT) will estimate the position of n facial landmarks in the current frame, l_t:

l_t = DCRT_Φ(X_t, X_{t−1}),

where Φ denotes the parameters {Φ_1, Φ_2, …, Φ_5} of our composite network DCRT and l_t = {x̂_1…x̂_n; ŷ_1…ŷ_n} ∈ ℝ_{>0}^{2n}. Our DCRT consists of four individual sub-networks: the Multi-Task Cascaded Neural Network face detector (MTCNN), the facial bounding Box Tracker (BT), the Facial Validator (FV) and the Denoised Facial Alignment network (DFA). Note that for face detection we relied on the state-of-the-art MTCNN [38]. This task-specific arrangement allows us to optimize each network for its task characteristics and, simultaneously, to inspect their behavior and contributions to our final landmark estimates when chained together using our tracking algorithm (cf. Section 3.4).
A schematic diagram of our tracker can be seen in Fig. 1. We start by assuming a tracking scenario, where we have an existing estimate for the bounding box of the preceding frame. This bounding box, together with the current and previous frames (X_t, X_{t−1}), is fed to our BT network to produce a first estimate of the targeted landmarks (l_t^BT) and bounding box (b_t^BT), while at the same time its internal state is updated.
Once we have our first landmark estimate l_t^BT, we use the FV network to validate the result obtained by the tracker. To do so, we train the FV network to estimate the probability p(f) that the object tracked within l_t^BT is a face. In case of obtaining a low probability, which would suggest that the BT network has lost track, we use the MTCNN to perform face detection on the current frame and re-initialize the whole network for the next time step. In contrast, if l_t^BT is successfully validated by the FV network, the current frame and its bounding box b_t^BT are fed to the DFA network, which produces the final estimates for the target landmarks l_t^F and the corresponding bounding box b_t^F. DFA works by first evaluating the image conditions and, if necessary, applying denoising to the image; subsequently, landmark coordinates are estimated from a cleaned version of the input image. Note that DFA operates from an already detected and validated bounding box, which allows it to achieve a more accurate result. Further, the DFA network is able to operate by itself, independently of the rest of the network (e.g. in the single-image facial alignment task), but in such a case it loses the temporal modeling provided by the previous sub-networks.
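The per-frame control flow described above can be summarized in a short sketch. The function names below (`bt_step`, `fv_probability`, `dfa_align`, `mtcnn_detect`) are hypothetical placeholders standing in for the trained sub-networks, and the threshold value is illustrative:

```python
T_C = 0.5  # illustrative acceptance threshold for the facial validator

def track_frame(x_t, x_prev, b_prev, state,
                bt_step, fv_probability, dfa_align, mtcnn_detect):
    """One step of the tracking loop: BT estimate -> FV check -> DFA refine."""
    l_bt, b_bt, state = bt_step(x_t, x_prev, b_prev, state)
    if fv_probability(x_t, l_bt) < T_C:
        # BT lost track: re-detect the face and reset the recurrent state
        b_new = mtcnn_detect(x_t)
        return None, b_new, None          # caller re-initializes next frame
    l_f, b_f = dfa_align(x_t, b_bt)       # denoise-then-align refinement
    return l_f, b_f, state
```

The key design point is that DFA only runs on validated regions, so its internal denoiser is never applied to a region that is not a face.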

Bounding box tracker
We base our BT network on the structure of the Re3 tracker [46], which is a fully end-to-end object tracker with internal LSTM networks to capture the temporal dependencies in video. Given input frames {X_t, X_{t−1}} cropped as {X_t^Pb, X_{t−1}^Pb} with the previous bounding box (P_b = b_{t−1}), the BT network estimates the landmark positions for the current frame l_t^BT and updates the internal state of the LSTM h_t as follows:

l_t^BT, h_t = LSTM(EL(res(X_t^Pb), res(X_{t−1}^Pb)); W_BT),

where LSTM refers to the set of internal LSTM [44] networks, EL stands for the Embedding Layer, W_BT and W_EL are the sets of weights of the fully connected layers of BT and EL respectively, and res is the Inception-Residual Network [48] (Inception-Resnet). The Embedding Layer is a weighted concatenation of the residual network coefficients:

EL(res(X_t^Pb), res(X_{t−1}^Pb)) = W_EL [res(X_t^Pb); res(X_{t−1}^Pb)].

We use Φ_1 to denote the parameters of all sub-networks contained in BT. Finally, we also generate an estimate of the bounding box for the current frame b_t^BT directly from the estimated landmarks, as the tightest box enclosing l_t^BT. Note that, even though the architecture of BT is based on Re3, we introduce several key modifications to adapt this recurrent tracker model to this new problem domain:
1. First, we preconditioned the convolutional network of our BT to contain common facial features by replacing the internal Skip Convolution Networks (SkipNet) with the more sophisticated Inception-Resnet, pre-trained on the MS-Celeb [49] and CasiaWebFace [50] datasets with triplet loss [51]. Fig. 2 visualizes the differences between the original SkipNet of Re3 and the more complex structure of BT, which is inherited from the Inception-Resnet (version 1). Each block of the Inception-Resnet architecture can be expressed as:

r_{i+1} = H(r_i) + F(r_i),

where r_i and r_{i+1} are the input and output of the i-th block, H(r_i) = r_i is the identity mapping and F represents the combined effect of the various convolutional and ReLU layers.
Notice that SkipNet does not have the advantage of the residual connections of the Inception-Resnet, which ease the gradient flow during optimization [48].
2. Second, we use the BT network to produce a first estimate of the landmark locations (l_t^BT) following the work of [6], but we split the fully-connected layer that receives the output from the LSTMs into five independent fully-connected networks, so that each of them focuses on a specific facial region. To this end, we divide the facial landmarks into the following regions: facial silhouette (the outer contour), eyebrows, eyes, nose and lips. Thus, l_t^BT is formed by concatenating the outputs of these five region-specific networks.
3. Finally, we reduce the input image size to 128 × 128 to accelerate the training process. This in turn also reduces the number of neurons of the original Re3 by half and helps to avoid over-fitting [52], while still achieving state-of-the-art accuracy.
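The residual mapping r_{i+1} = r_i + F(r_i) used by each Inception-Resnet block can be illustrated with a minimal numeric sketch; the single weight matrix below is a toy stand-in for the block's actual convolutional layers:

```python
import numpy as np

# Toy residual block: identity shortcut plus a small learned transform F.
# The matrix W is illustrative; in the real network F is a stack of
# convolution and ReLU layers.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1

def residual_block(r):
    """Compute r_{i+1} = H(r) + F(r) with H the identity mapping."""
    f = np.maximum(W @ r, 0.0)   # F: conv + ReLU, reduced to one matmul here
    return r + f                 # identity shortcut eases the gradient flow

r = rng.standard_normal(8)
out = residual_block(r)
```

Because the shortcut is the identity, the gradient of the output with respect to the input always contains a direct unit path, which is what makes deep stacks of such blocks trainable.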

The facial validator
After the initial estimates produced by the BT network, we use the FV network to validate the results before further processing. The main reason for doing so is to avoid the drifting problem, well known in the tracking literature [13]. Specifically, the FV network can be understood as a conditional function that determines whether to continue the processing pipeline based on the current estimates from BT, or to reset the tracker and attempt to re-detect the facial region because the current estimates are not reliable enough.
We follow the methodology in [14] to build a strong classifier to estimate the probability p(f) that the object currently being tracked by BT is a face. To this end, we use concatenated small patch regions extracted around the estimated landmarks (l_t^BT) as follows:

p(f) = W_FV(CNN(P(l_t^BT))),

where P(l_t^BT) denotes the concatenated landmark patches, CNN is the composite function of standard stacked convolutional layers followed by a bottleneck layer with weights W_FV, parameterized by Φ_2, and 0 < p(f) < 1. We set a threshold T_c to determine the lowest probability that is acceptable for FV to validate the current estimate.
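The construction of the FV input can be sketched as follows: crop a small patch around each BT landmark and stack them before feeding the classifier. The patch size is an assumption for illustration, not the paper's exact configuration:

```python
import numpy as np

PATCH = 4  # half-size of each landmark patch, in pixels (illustrative)

def landmark_patches(image, landmarks):
    """Stack fixed-size patches centered on each (x, y) landmark.

    Patches near the border are zero-padded so the output shape is
    constant regardless of landmark position.
    """
    h, w = image.shape
    patches = []
    for x, y in landmarks:
        x, y = int(round(x)), int(round(y))
        x0, x1 = max(x - PATCH, 0), min(x + PATCH, w)
        y0, y1 = max(y - PATCH, 0), min(y + PATCH, h)
        pad = np.zeros((2 * PATCH, 2 * PATCH))
        pad[: y1 - y0, : x1 - x0] = image[y0:y1, x0:x1]
        patches.append(pad)
    return np.stack(patches)
```

The stacked patches would then go through the convolutional layers and the bottleneck to produce p(f).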

Denoised facial alignment
We argue that, when given a noisy input image, the final landmark estimation can be improved by first minimizing the existing noise and only then passing the image to any facial landmark estimator. To achieve this, we adopt a joint approach involving two major sub-networks to build our Denoised Facial Alignment network (DFA): the Internal Image Denoiser (IID) and the Facial Landmark Estimator (FLE). We use the current cropped facial area X_t^Pb as the input image, which we assume to be potentially contaminated with unknown noise. In other words, X_t^Pb is assumed to be a degraded observation of a noiseless image Y, to be estimated by the IID network. We do so in a two-step approach in which we first detect whether the noise in X_t^Pb complies with any of our noise models and, if so, clean the input image to generate the estimate Ŷ; if the detected condition does not match any of the internal noise models, the image is left unchanged (Ŷ = X_t^Pb). Ŷ is then fed to the FLE network to estimate the n landmark coordinates l_I.

Internal image Denoiser
We adopt a selective denoising approach [19,53] by combining a Multi-Task noise Classifier (MTC) network and multiple Specialized Denoising Auto-Encoder (SDAE) sub-networks working in tandem. This joint model selectively performs denoising based on the detected condition of the input image, triggering separate specialized denoiser sub-models. This has been shown to perform better than directly denoising all input images, as the latter may actually distort an already clean input image [19]. The formulation of the IID block is as follows. Given C known image degradation types, the MTC estimates the probability that degradation type c is present in the input image X_t^Pb:

s_c = W_MTC(CNN(X_t^Pb)), ĉ = argmax_c s_c,

where W_MTC are the multinomial bottleneck regression layers parameterized by Φ_3, CNN is the set of convolutional layers of MTC, and s_c is the estimated probability that the current input X_t^Pb is contaminated with the specific noise class c, with s_0 denoting the noiseless case. If the classifier detects noise in the input image, then ĉ > 0 and the image is denoised by one of the specialized denoisers {DAE_1, DAE_2, …, DAE_C}. We build our Internal Image Denoiser on the hourglass-shaped auto-encoder architecture with skip connections [54]. The structure of our DAE is similar to the work of [55], with a few additional mirror layers in both the encoder and decoder parts. Each of these blocks can be trained to capture a different type of noise:

Ŷ = DAE_ĉ(X_t^Pb; Φ_4),

where Φ_4 contains the parameters learned for all DAE blocks. In case ĉ = 0, the denoising process is skipped to avoid unnecessarily distorting the input image, i.e. SDAE_{Φ_4}(0, X_t^Pb) = X_t^Pb.
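The selective dispatch logic of the IID block can be sketched as follows. The classifier and the single denoiser below are toy stand-ins for the trained MTC and DAE networks; only the control flow (skip denoising when ĉ = 0) mirrors the description above:

```python
import numpy as np

def classify_noise(x):
    """Toy MTC: call an image 'blurred' (class 1) if its variance is low."""
    return 1 if x.var() < 0.01 else 0

# Toy DAE_1: a contrast stretch standing in for a trained denoiser.
denoisers = {1: lambda x: np.clip(x * 2.0 - x.mean(), 0.0, 1.0)}

def internal_image_denoiser(x):
    """Selective denoising: identity for clean inputs, DAE_c otherwise."""
    c_hat = classify_noise(x)
    if c_hat == 0:
        return x                 # noiseless case: never distort a clean image
    return denoisers[c_hat](x)
```

The important property, verified below, is that a clean input passes through completely untouched, which is exactly why selective denoising outperforms unconditional denoising.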

Facial landmark estimator
We build the final pipeline of FLE using the state-of-the-art Squeeze-and-Excitation Network (SENET) [56], which has been pre-trained on the recently published VGGFace2 [57] facial dataset. This landmark localization procedure can be expressed as:

l_I = FLE(Ŷ; Φ_5),

where Φ_5 denotes the parameters of the FLE network.

Recurrent denoised facial tracking algorithm
The operation of our Denoised Composite Recurrent Tracker, DCRT, which combines all the blocks explained previously in this section, is shown in Algorithm 1. When a suitable detection of the facial region is available, e.g. from initialization or the previous frame (lines 8 and 10), the BT network produces a first estimate of facial landmarks (line 13) and bounding box (line 14). Then, the FV network is used to estimate the probability p(f) that the output from BT corresponds to a face. If p(f) is sufficiently high (above threshold T c ), the initial estimate is refined by the DFA network to produce the final tracker estimate (lines 18 and 19) with its selective internal denoisers. Otherwise, it is assumed that the BT has lost track and there is a need to re-initialize the tracker (line 16).
We perform re-initialization in lines 3 to 6. We start by detecting the face in the current frame by means of the MTCNN network. This detector is likely to produce multiple detections, hence its outputs are validated with respect to the bounding box of the previous frame, b_{t−1}. Specifically, we compare the Euclidean distance between the center of each new detection and the center of the previous bounding box, d(b_{t−1}, b_MT), with respect to the magnitude of the previous bounding box, and keep the detection that produces the minimum ratio:

b_t = argmin_{b_MT} d(b_{t−1}, b_MT) / ‖b_{t−1}‖,

as long as there is at least one detection whose ratio is below the threshold T_B. Otherwise, all new detections are too far from the previous tracking result and no re-initialization is performed. The latter is necessary to tackle the cases in which the face being tracked moves out of the visual field. In such cases, without threshold T_B the system might be incorrectly re-initialized to track another face. In contrast, by using T_B the tracker remains at its latest valid coordinates, waiting for the tracked object to come back into the field of view. Finally, SeqBT controls the length of the temporal window that is considered by the tracker (in frame units), which is fixed at training time (see next section). If the tracker is re-initialized or if the sequence length (SeqT) exceeds the temporal window (SeqBT), then the internal state of the BT network is reset to 0 (line 12). This allows the network to refresh its encoding of the facial appearance and adapt to the characteristics of the current view.
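The re-initialization rule can be sketched as follows: among the MTCNN detections, keep the one whose center is closest to the previous bounding box, relative to that box's size, and reject all detections if the best ratio exceeds T_B. The (cx, cy, w, h) box format and the value of T_B are assumptions for illustration:

```python
import numpy as np

T_B = 1.0  # illustrative threshold on the distance-to-size ratio

def select_detection(b_prev, detections):
    """Pick the detection closest to the previous box, or None if all are too far."""
    cx, cy, w, h = b_prev
    size = np.hypot(w, h)                      # magnitude of the previous box
    ratios = [np.hypot(d[0] - cx, d[1] - cy) / size for d in detections]
    best = int(np.argmin(ratios))
    if ratios[best] > T_B:
        return None                            # face likely left the field of view
    return detections[best]
```

Returning `None` here corresponds to the case in which the tracker keeps its latest valid coordinates instead of jumping to another face.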

Overall loss and model training
To train the BT sub-network, we performed curriculum learning with sequence lengths between SeqBT = 2 and SeqBT = 32 frames [46]. Multiple stages of transfer learning [58] were used to condition the pre-trained Inception-Resnet. To do so, we fine-tuned this network to perform single-image landmark estimation by adding an auxiliary fully connected layer and training the resulting network with the ℓ2 loss using the 300-W [3] and Menpo [41] datasets. After convergence, we integrated the internal convolutional layers into our BT network, which was then trained end-to-end for landmark tracking on the 300-VW training dataset with the ℓ1 loss. Furthermore, we also used the same facial landmark alignment datasets to train FV with the standard cross-entropy loss.
We consider two types of image degradation to train the IID sub-network: image blur and illumination differences. This is motivated by our previous findings about the common weaknesses of current facial alignment models against such degradation models [19]. Specifically, each degradation model was obtained as follows:
1. Blur Model. We added Gaussian blur by convolving the input image with two-dimensional Gaussian filters with σ ∈ {1,3,5} [9,19]. In addition, we also included motion blur [59].
2. Illumination Model. We used face images captured under varying illumination conditions (e.g. lit by a wheel-whirled lamp). We used the faces taken under ideal lighting, indicated through the provided 'illuminationQuality' metadata label with value 0 for each associated emotion expression, as the ground truth for this dataset. Finally, we crop-centered the faces from the SOF dataset using the provided 17 facial landmarks, while our previous tracker [23] and facial alignment model [19] were used to obtain the landmarks for the Yale data, given that they are not available for this dataset. Examples of some training images together with their degraded versions are depicted in Fig. 3.
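The Gaussian blur degradation can be sketched as below: convolve the image with a 2-D Gaussian filter for each σ ∈ {1, 3, 5}. The FFT-based convolution with wrap-around edges is a simplification chosen for brevity, not necessarily the paper's exact implementation:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 2-D Gaussian filter of size (2*radius+1)^2."""
    radius = radius or int(3 * sigma)
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Same-size 2-D convolution via FFT (circular boundary, for brevity)."""
    k = gaussian_kernel(sigma)
    pad = np.zeros_like(image)
    pad[: k.shape[0], : k.shape[1]] = k
    r = k.shape[0] // 2
    pad = np.roll(pad, (-r, -r), axis=(0, 1))   # center the kernel at the origin
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(pad)))

# One degraded copy per sigma, as in the training-set construction above.
degraded = [blur(np.random.default_rng(0).random((64, 64)), s) for s in (1, 3, 5)]
```

Because the kernel sums to one, blurring preserves the image mean while reducing its variance, which is the statistical signature the noise classifier can learn to detect.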
These datasets, along with the previous facial alignment datasets, were used to perform joint training of the sub-modules of DFA, i.e. FLE and IID, allowing them to be optimized in parallel. We argue that synchronous optimization of IID and FLE allows IID to produce cleaner intermediate images Ŷ, since that favors their mutual aim of achieving more accurate landmark estimates. Hence the complete loss for DFA training is as follows:

L_DFA = λ_1 ‖l̂ − l‖_2^2 + λ_2 ‖Ŷ − Y‖_2^2 + λ_3 H(ĉ, c) + λ_4 ‖X̂ − X‖_2^2,

where X is the input image, Ŷ is the input image after denoising (i.e. an estimate of the unknown clean image Y), l are the ground-truth landmarks, ĉ is the estimated noise class, H(·) is the cross-entropy and X̂ is the autoencoder reconstruction of X. The λ coefficients act as regularization parameters for each term [5,63]. In the initial training phase, we set a higher value for λ_2 compared to λ_1, for instance λ_2 = 0.75 and λ_1 = 0.25, to accelerate the training of the denoising part (λ_3 and λ_4 are both set to 0.5), since we found that this part took more time to converge than the former. As training progresses, we gradually increase the value of λ_1 in steps of 0.1 and reduce the value of λ_2 by the same amount to better balance the training priorities, reaching a final value of 0.5 for both coefficients. Fig. 4 illustrates the impact of our proposed joint training when the input images present some sort of noise or distortion. We can see that using this joint training scheme, DFA is able to improve the accuracy of the estimated landmarks (5th column) compared to the results using separate losses (6th column). We can also see that the internally denoised images are less distorted (4th and 5th columns), suggesting that these cleaner image characteristics are beneficial for FLE to produce more accurate landmark estimates [59]. Finally, as expected, the use of FLE by itself (7th column), i.e. without IID, yields the worst performance given its lack of any internal denoising modeling (more details can be seen in Section 4).
Finally, throughout the training process we performed data augmentation by means of horizontal flipping, −45° to 45° rotations and artificial strip boxes across the images to simulate occlusions. We trained our model using both Stochastic Gradient Descent (SGD) and the ADAM optimizer [64], with scheduled learning-rate decay every 10,000 iterations. We first initialized training with SGD to speed up convergence, and then used ADAM for fine-tuning in the final training stages. Five NVIDIA Titan GPUs were used for training, which took approximately one to two days for a single BT with a given sequence length, and around two days for both DFA and FV.
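A minimal sketch of the stepped learning-rate decay described above; the decay factor and base rate are assumptions, since the text only states the 10,000-iteration interval:

```python
def learning_rate(iteration, base_lr=1e-2, decay=0.9, interval=10_000):
    """Piecewise-constant decay: multiply by `decay` every `interval` iterations."""
    return base_lr * decay ** (iteration // interval)
```

Such a step schedule would be applied identically whether the underlying optimizer is SGD or ADAM.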

Experiments
We present our results on two major experiments: single image facial alignment and deformable facial tracking. For fair comparison, we use the widespread 68 facial landmark points from [2-4] and compare our model against other state of the art methods by either gathering their reported results or reproducing them from their publicly available code.

Facial alignment experiments
In this section, we evaluate our internally Denoised Facial Alignment (DFA) model, presented in Section 3.3, for single image facial alignment. In addition, we also report results from our FLE component operating without IID to analyze the impact of internal denoising on our final landmark estimates.

Experiment setup
We compare the performance of our DFA model against 8 alternative state of the art models using the 300-W [3] and Menpo Challenge [4] test datasets. We follow the standard procedure of [4] by first initializing each model with the corresponding bounding box obtained from the ground-truth points for each image. Subsequently, we calculate the Normalized Mean Square Error (NMSE), defined as the average error over all landmarks divided by the bounding box size [5,17]. Finally, we also include the Area Under the Curve (AUC) and Failure Rate (FR), computed using the standard NMSE threshold of 0.08 [5], as additional metrics.
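For concreteness, the evaluation metrics can be sketched as follows. The normalization factor and the CDF discretization are assumptions consistent with the description above, not the paper's exact evaluation code.

```python
import numpy as np

def nmse(pred, gt, bbox_size):
    """Per-image normalized error: average point-to-point distance over the
    68 landmarks, divided by the bounding-box size (the exact size definition
    is an assumption). pred, gt: (num_images, 68, 2); bbox_size: (num_images,)."""
    per_image = np.linalg.norm(pred - gt, axis=2).mean(axis=1)
    return per_image / bbox_size

def auc_and_fr(errors, threshold=0.08, steps=1000):
    """AUC of the cumulative error distribution up to the 0.08 threshold
    (normalized to [0, 1]), and Failure Rate = fraction of images whose
    error exceeds the threshold."""
    xs = np.linspace(0.0, threshold, steps)
    cdf = np.array([(errors <= x).mean() for x in xs])
    auc = cdf.mean()                       # Riemann approximation of the area
    fr = (errors > threshold).mean()
    return auc, fr
```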
We compare our results against both traditional and deep learning models, including our previous FADeNN model of facial alignment with an internal denoising auto-encoder [19], to highlight the relative improvement provided by DFA, as well as OpenFace [35], which internally uses multi-patch experts [34]. Table 1 shows the results on both the 300-W and Menpo test datasets, while the corresponding AUC graph is shown in Fig. 5. Based on these results, we can see the improvement obtained by incorporating the internal denoiser (IID) in our facial alignment model. Firstly, DFA obtains higher performance than FLE (namely, the same model without denoising) on all considered metrics. Secondly, DFA produces state of the art results for both datasets, consistently achieving the highest AUC and the lowest FR and NMSE among all compared methods. Finally, we also observe a noticeable improvement over our previous FADeNN [19] model, which highlights the effectiveness of the newly proposed approach.

Results on 300-W and Menpo datasets
We also observe that, in general, models based on deep learning perform particularly well, with SAN [11] and ECT [7] achieving the second and third best performance in terms of AUC, followed by FAN [5], which seems to produce more stable estimates judging from its low FR values. Overall, these models perform better than those using handcrafted features, such as ERT [32] and CFSS [33], which reflects the maturity of deep learning approaches and their advantage over traditional ones. Fig. 6 visualizes examples of estimated facial landmarks for our model without (FLE) and with (DFA) denoising, against the other three best performing approaches in Table 1: SAN [11], ECT [7], and FAN [5]. For each dataset, the different columns show examples of input images under different imaging conditions. The first column of each dataset shows what could be considered an optimal (clean) image (columns 1 and 6). The second and third columns show blurring artifacts, and the fourth and fifth columns show sub-optimal illumination. By visually inspecting these examples, we first notice that for relatively clean input images, our models produce more accurate predictions than the compared alternatives. This might be due to the inclusion of the denoising block during training, which removes part of the artifacts present in the training data and thus enhances the quality of the training set. This correlation between cleaner input images and accurate landmark estimates, i.e. that a cleaner input image yields more accurate landmark predictions, also conforms with a recent related study [59], suggesting the benefit of our denoising approach. Finally, we also notice that the internally enhanced images are visually better than the original ones, which further shows the effectiveness of each of our internal denoiser modules at removing its specific noise class.

Facial tracking experiments
In this section, we evaluate our full DCRT model, described in Section 3, for deformable facial tracking in the wild. We empirically set the threshold values to T_B = 1.0 and T_C = 0.5 for our model to perform tracking. As will be shown later, our model produces the best results when the sequence length is fixed to 2 frames and the internal denoiser module is activated (see Section 4.2.4).
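As an illustration of how two such thresholds can gate the composite tracker, consider the following sketch. The exact semantics of T_B and T_C are not spelled out in this section, so the rule below (accept the tracked bounding box only while the bounding-box tracker score reaches T_B and the face-validation confidence exceeds T_C, otherwise fall back to re-detection) is an assumption, and all names are hypothetical.

```python
def tracking_step(bt_score, fv_confidence, t_b=1.0, t_c=0.5):
    """Hypothetical gating between tracking and re-detection based on the
    empirically set thresholds T_B = 1.0 and T_C = 0.5. Returns the action
    the composite tracker would take for the current frame."""
    if fv_confidence <= t_c:
        return "redetect"   # facial region validation rejected the crop
    if bt_score < t_b:
        return "redetect"   # bounding-box tracker not confident enough
    return "track"          # keep the BT output and run DFA on the crop
```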

Experiment setup
We compared our model against 13 other available state of the art facial tracking models, including our previous results using the CRCT model [23] and FADeNN [19]. We also provide results from the SAN model [11], which obtained the second-best performance in the facial alignment experiments from the previous section, by converting it to a facial tracker. We do so by adopting: 1) a tracking-by-detection approach, which we refer to as SAN_DT; and 2) using the result from the previous frame as initialization for the current one (SAN_PREV). This is interesting in order to assess the ability of current facial alignment models when converted to minimum-effort trackers (i.e. using no or very little temporal information). The compared trackers include the methods of [20], Raja et al. [21], and Wu et al. [39], as well as: 5. OpenFace: the OpenFace tracker [35], which operates with a tracking-by-detection approach. 6. SAN_DT and SAN_PREV: the SAN facial image alignment model [11] converted into a tracker by using the MTCNN face detector (SAN_DT) or the previous frame result (SAN_PREV) as initialization. 7. CRCT: our previous tracking model [23], which, in contrast to the currently proposed model, does not include internal denoising. 8. FADeNN: our previous facial alignment model [19], which uses internal denoising, converted into a tracker by using the result from the previous frame as initialization. (Caption of Fig. 6: blurry input (columns 2 and 3) and low illumination (columns 4 and 5); the first two rows show the results of our FLE and DFA models, respectively, followed by the original and internally enhanced images in the third and fourth rows; rows five to seven show the results from SAN [11], ECT [7] and FAN [5], respectively.)
We performed our experiments using the 300-VW dataset [2], which is the largest available deformable facial landmark tracking dataset. It comprises 55 videos divided into three categories according to difficulty level:
1. The first category contains people recorded in well-lit conditions, and is intended to evaluate facial tracking with images acquired in nearly ideal conditions.
2. The second category consists of people recorded in unconstrained conditions (variable illumination, dark rooms, etc.) with arbitrary poses. This setting evaluates facial tracking models in real-world human-computer applications.
3. The third category contains videos of people recorded in fully unconstrained conditions, including ambient illumination differences, large occlusions, expressions, etc. This scenario aims to assess the models under arbitrary recording conditions.
We use the original 2D facial landmarks directly as ground truth and as first-frame initialization for all models. Following [2,14], we evaluate each tracked landmark with the Area Under the Curve (AUC) and Failure Rate (FR) for the Normalized Mean Error (NME), as previously explained in Section 4.1.1. Table 2 shows the results of all compared models for each category, while the respective AUC curves for several trackers are shown in Fig. 7. We can observe that our model achieves top state of the art performance, consistently outperforming all compared alternatives. We also notice a large improvement with respect to our previous models, namely CRCT [23] and FADeNN [19], which were partial implementations of the short/long-range tracking and the denoising, respectively; this highlights the benefit of a unified approach such as the one presented here.

Results on 300-VW test dataset
Regarding the other compared methods, we found the results of the OpenFace library (one of the most popular tools nowadays) to be among the least accurate, which contrasts with the relatively good results obtained by this library in the single image alignment experiments. This occurs because, as stated previously, OpenFace is not actually designed for tracking, and it is indicative of the accuracy loss that can be experienced by not taking temporal dependencies into account. This is also supported by the results obtained by SAN_PREV, which showed comparable or better performance than several other trackers and was clearly superior to its detection counterpart (SAN_DT). Fig. 8 showcases results of our DCRT model against the state of the art Yang et al. [65] and the widely used OpenFace [35]. We can point out several observations based on these visual examples. Firstly, our model performs better on the majority of clean inputs, as shown in the first two rows, consistently producing accurate estimations even in cases of extreme and profile poses.

Visual results on 300-VW
Secondly, we see that our model is also robust against severe image degradation, e.g. blur, occlusion and poor illumination. This is especially clear when processing the extremely dark frames in the last two rows, where the faces are sometimes hardly visible to a human observer. In these examples, our model still manages to produce consistent landmark estimates thanks to the denoising block, while other models struggle. When we visually inspect such difficult examples, we find that the intermediate images generated by the internal denoiser are considerably changed with respect to the original ones. This is especially noticeable in cases of poor illumination (e.g. the last two rows of Fig. 8), where the denoised images do not always look visually "cleaner" from a perception point of view, but the illumination has clearly been enhanced, especially in the facial area.
Thirdly, in some cases we also notice that our tracker produces facial landmarks that are visually more convincing than the actual ground-truth locations, as can be seen in frames 346, 356, 358 and 706 of the first row. These discrepancies, though small, arise from the need to use semi-automatic annotation methods to generate ground truths for very large datasets [5].

Ablation study
In this section, we perform a detailed analysis of the behavior of our tracker under different operating conditions. Firstly, we analyze the impact of the temporal context in our full tracker, DCRT, by varying the length of the training sequences, i.e. SeqBT = 2, 4, …, 32, which we refer to as DCRT-2, DCRT-4, …, DCRT-32. Next, we also report the results obtained by a partial implementation of our tracker that does not include denoising, namely CRT-2, CRT-4, …, CRT-32. Finally, we also report results under minimum-effort tracking: 1) tracking by detection without denoising (FLE-Det); 2) tracking by detection with denoising (DFA-Det); 3) initialization from the previous frame without denoising (FLE-Prev); and 4) initialization from the previous frame with denoising (DFA-Prev). Table 3 shows the results of all the above variants of our tracker, both separately for each category of the test dataset and jointly over the images in all categories to summarize the overall performance of each model. The respective AUC curves are displayed in Fig. 9. We also include the result of Yang et al. [17] as a baseline, as well as our previous results for both CRCT [23] and FADeNN [19].
We can summarize our findings based on these results as follows:
• Minimum-effort tracking, when using initialization from the previous frame, yields accuracy quite comparable to that of the baseline [17]. However, these results are still inferior to those of our models with internal temporal modeling (DCRT), in all cases by a large margin. Further, all of our full models (DCRT-x, with internal temporal modeling) achieve better results than all tracking-by-detection approaches.
• The difference between the results of models trained on different sequence lengths is minimal. In the majority of cases, shorter lengths were slightly better than longer ones. Nevertheless, these results must be read in relation to the test sequences, which show quite irregular (not necessarily natural) facial movements [23].
• We observe a general improvement on all metrics when introducing the internal denoiser. This correlates well with the results from the previous experiments on single image alignment, demonstrating the effectiveness of the internal denoiser in improving the final landmark estimates.
• We also notice lower Failure Rates (FR) when the internal denoiser is incorporated, suggesting that, apart from increasing accuracy, it also improves the stability of the tracker.

Conclusions
In this paper, we presented an end-to-end composite recurrent tracker with internal denoising that is capable of performing joint single image facial alignment and deformable facial tracking. We incorporate multilayer LSTMs to model temporal dependencies with variable length and introduce an internal denoiser which selectively enhances the input images to improve the robustness of our overall model. We achieve this by combining 4 different sub-networks that specialize in each of the key tasks required, namely face detection, bounding-box tracking, facial region validation and facial alignment with internal denoising. We tested our model against state of the art alternatives on the most popular datasets for facial alignment and tracking. We started by comparing our results on single image facial alignment against 8 other facial alignment models on the 300-W and Menpo datasets. In this experiment, we found that our model consistently outperformed all compared alternatives, producing higher AUC and lower NME and FR values. Qualitative assessment highlighted the usefulness of our internal denoiser, which successfully enhanced the input images to produce intermediate representations in which facial details were easier to visualize, thus facilitating the task of the landmark localization algorithm.
Facial tracking experiments were performed using the 300-VW test dataset, in which we compared our model against 13 other facial trackers. We found that our unified approach consistently outperformed all other compared models over all categories of the test set. Further, our model proved considerably more robust than the compared alternatives on input images with severe degradation, again thanks to our internal denoiser. We also noticed that, overall, tracking-by-detection approaches produced comparatively lower accuracy than models that utilized temporal modeling, which supports the advantage of incorporating temporal dependencies.
Finally, to investigate the impact of the different building blocks of our composite model, we presented a set of ablation experiments using 300-VW, in which one or more of the building blocks were removed or some of their parameters were modified. These tests allowed us to quantitatively confirm: 1) that the inclusion of an internal denoising block outperformed the results without the internal denoiser, even on images from category 1, which are expected to show nearly ideal conditions; 2) that the denoising block also contributed to the stability of the tracker by reducing the failure rates; and 3) that the inclusion of temporal modeling produced more accurate landmark estimates than either tracking by detection or tracking initialized with the result from the previous frame. We also found that the improvements introduced by the temporal modeling were optimal for a temporal context between SeqBT = 2 and SeqBT = 8 frames, although differences between models were usually small as long as SeqBT ≥ 2. Nevertheless, the latter must be read in relation to the test sequences used, which show quite irregular (not necessarily natural) facial movements.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.