Heatmap-Guided Selective Feature Attention for Robust Cascaded Face Alignment

Face alignment methods have been actively studied using coordinate and heatmap regression tasks. Although these regression tasks have the same objective for facial landmark detection, each task requires different valid feature maps. Therefore, it is not easy to simultaneously train the two kinds of tasks with a multi-task learning network structure. Some studies have proposed multi-task learning networks with both kinds of tasks, but they do not suggest an efficient network that can train them simultaneously because of the shared noisy feature maps. In this paper, we propose a heatmap-guided selective feature attention for robust cascaded face alignment based on multi-task learning, which improves the performance of face alignment by efficiently training coordinate regression and heatmap regression. The proposed network improves the performance of face alignment by selecting valid feature maps for heatmap and coordinate regression and using the backward propagation connection for tasks. This study also uses a refinement strategy that detects global landmarks through a heatmap regression task and then localizes landmarks through cascaded coordinate regression tasks. To evaluate the proposed network, we tested it on the 300W, AFLW, COFW, and WFLW datasets and obtained results that outperformed other state-of-the-art networks.


Introduction
The human face provides crucial information for understanding user behavior in human-computer interactions and has been studied in computer vision for a long time. Many methods for analyzing face attributes, such as facial expression recognition [1,2] and head pose estimation [3,4], detect facial regions during the preprocessing step. The face region detection method is divided into the face bounding box, which defines the position of the rectangular region of the face, and face alignment, which extracts the optimal face region. Face alignment is a method for detecting facial landmarks, which are key points representing facial components. It improves the performance of applications by extracting information, such as a face component, size, rotation, and position. Although deep learning algorithms have improved face alignment capabilities in recent years, there is still a need to improve performance in noisy environments.
Face alignment studies using deep neural networks have focused on coordinate and heatmap regression. Coordinate regression directly estimates facial landmark coordinates and is designed based on common deep neural network structures, such as VGG [5] and ResNet [6]. Heatmap regression in face alignment has been studied since Newell et al. [7] first proposed a stacked hourglass network. Heatmap regression methods exhibit robust performance against noise by estimating the probability that a landmark exists at a pixel location. However, the accuracy of landmark locations largely depends on the resolution of the heatmap because pixel units are expressed as integers. This issue is known as the quantization error problem [8].
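The quantization error mentioned above can be illustrated with a minimal sketch. It assumes a 256 × 256 input image, a 64 × 64 heatmap (stride 4), and a hypothetical `render_peak` helper that places a one-hot peak at the nearest heatmap cell; none of these values come from the paper. Decoding the peak by argmax can only return multiples of the stride, so the recovered coordinate can be off by up to half a stride per axis.

```python
def render_peak(true_x, true_y, size, stride):
    """Hypothetical helper: one-hot heatmap with the peak at the nearest cell."""
    heatmap = [[0.0] * size for _ in range(size)]
    col = min(size - 1, round(true_x / stride))
    row = min(size - 1, round(true_y / stride))
    heatmap[row][col] = 1.0
    return heatmap


def decode_heatmap(heatmap, stride):
    """Return the image-space (x, y) of the heatmap maximum (integer cell index times stride)."""
    best_row, best_col, best_val = 0, 0, float("-inf")
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_row, best_col, best_val = r, c, val
    return best_col * stride, best_row * stride


# Assumed example values: landmark at (101, 57) in image space, stride 4.
true_x, true_y = 101.0, 57.0
stride = 4
heatmap = render_peak(true_x, true_y, 64, stride)
x, y = decode_heatmap(heatmap, stride)
# Euclidean distance between the decoded and the true position, in pixels.
error = ((x - true_x) ** 2 + (y - true_y) ** 2) ** 0.5
print(x, y, error)
```

A lower-resolution heatmap (larger stride) increases this bound, which is why the accuracy depends on the heatmap resolution.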
Recently, hybrid methods with coordinate and heatmap regression tasks have been proposed [9,10]. Wu et al. [9] trained on these tasks independently and sequentially. Park et al. [10] trained coordinate and heatmap regression tasks simultaneously, but their method did not perform better than the sequential training method.
Although these regression tasks have the same objective for facial landmark detection, each task requires different valid feature maps. Therefore, it is not easy to simultaneously train the two kinds of tasks with a multi-task learning network structure. Some studies have proposed multi-task learning networks with both kinds of tasks, but they have not yet provided an efficient network that can train them simultaneously because of the shared noisy feature maps.
We propose heatmap-guided selective feature attention for robust cascaded face alignment based on multi-task learning, which improves the performance of face alignment by efficiently training heatmap and coordinate regression tasks. The attention module can select valid feature maps without losing the properties of low- and high-level feature maps, and the cascaded coordinate regression network improves the performance of face alignment using the backward propagation connection for tasks. This study also employs a refinement strategy that detects global landmarks through a heatmap regression task and then localizes landmarks through cascaded coordinate regression tasks.
Heatmap regression estimates a heatmap using stacked hourglass networks and converts it to landmark coordinates. The estimated heatmap and converted landmark coordinates are used as information in the cascaded coordinate regression (CCR) stage. CCR detects local facial landmarks using region of interest (ROI) feature maps around the landmark coordinates in the previous CCR stage. The result of each CCR stage is propagated as input information for the next CCR stage. Figure 1 shows examples of accurate searches using a CCR network. To verify the effectiveness of the network, we evaluated it on popular face alignment benchmarks, including 300W [11], AFLW [12], COFW [13], and WFLW [9]. This study compared the performance of the proposed network with that of previous state-of-the-art networks. Our main contributions are summarized as follows:
1. We propose an effective attention method by selecting multi-level features through the estimated heatmap.
2. We propose backward propagation connections between a heatmap regression network and a coordinate regression network for effective multi-task learning to improve performance.
3. We designed a heatmap-cascaded coordinate regression network and verified its performance for the proposed network through ablation studies.

Related Work
Traditional face alignment methods include the active appearance model [14] and the constrained local model [15], based on a dimension-reduction technique using principal component analysis. These methods can express facial features in low dimensions and work well indoors. A method using shape-indexed features [16,17] has demonstrated the possibility of face alignment in the wild. In recent years, deep-learning-based coordinate and heatmap regression have been proposed and have shown good performance in noisy environments.

Coordinate Regression Methods
Coordinate regression directly estimates the position of facial landmarks. These methods have improved performance with the development of backbone networks, such as VGG [5] and ResNet [6], and additional methods for face alignment. Feng et al. [18] proposed a Wing loss to increase the training contribution of samples with small loss values. Su et al. [19] initialized facial landmarks using a ResNet-based network and searched for regions around the location found in a previous network. Li et al. [20] and Lin et al. [21] demonstrated robust performance in the presence of occlusions using a graph convolutional layer. Xu et al. [22] addressed large poses using multiple predefined landmark templates. Zheng et al. [23] proposed a pre-training method based on contrastive learning using extra datasets. Li et al. [24] and Xia et al. [25] proposed a transformer structure, but the transformer module estimated displacements of facial landmarks using local patches. These coordinate regression methods have been studied for their fast processing and accurate performance.

Heatmap Regression Methods
Heatmap regression indirectly estimates the positions of facial landmarks using a heatmap. A heatmap expresses the probabilities of landmark existence, and high probabilities can be regarded as candidates for facial landmarks. Since the stacked hourglass network [7] was proposed for human pose estimation, heatmap regression for face alignment has also employed an encoder-decoder structure. Bulat et al. [26] and Yang et al. [27] initially proposed the same network as the stacked hourglass network [7]. Wang et al. [28] proposed an AWing loss to reduce the problem caused by many background pixels in a heatmap. Zhang et al. [29] designed an hourglass network using an Inception-Resnet module [30] and refined the landmark coordinates estimated by the heatmaps. Huang et al. [31] proposed an attention module that converts a landmark heatmap into a boundary heatmap. Lan et al. [8] expressed an offset from a previously estimated landmark position to the ground truth as a local high-resolution heatmap to solve the quantization error caused by a lower-resolution heatmap. Jin et al. [32] estimated not only the landmark heatmap but also an offset heatmap and a neighbor heatmap, which indicate the distance to a neighboring landmark. Bulat et al. [33] proposed a Siamese-based training method. Heatmap regression methods have shown good results on various benchmarks, and many methods have recently been proposed for solving quantization errors.

Hybrid Methods
The hybrid model consists of a heatmap and a coordinate regression task. Valle et al. [34,35] initialized landmarks using a heatmap regression and refined the initialized landmarks using ensemble regression trees [17], which is a traditional method of shape-indexed features. Wu et al. [9] estimated a boundary heatmap and applied it to a ResNet-based coordinate regression task. Park et al. [10] improved the performance by converting the results of the coordinate regression task into a heatmap and combining them with the results of the heatmap regression task. In these methods, the heatmap and coordinate regression tasks were trained separately rather than simultaneously. Separate training limits the benefit of multi-task learning [36] because the information needs to be propagated interactively between the tasks during training.

Multi-Task Learning
Multi-task learning in the face alignment field has mainly been studied to estimate other facial attributes along with landmark detection. Ranjan et al. [37] proposed a network that estimates gender and pose, together with facial landmark detection. Kumar et al. [38] estimated the location, probability distribution, and visibility of landmarks. Prados et al. [39] initialized landmarks using a head pose and estimated landmark displacements. Although multi-task learning in the face alignment field efficiently estimates various facial attributes, it does not describe how facial attribute estimations improve face alignment performance except through landmark initialization methods.

Proposed Network
The structure of the proposed face alignment framework is illustrated in Figure 2. The proposed network consists of a feature extractor, selective feature modules for each task, a heatmap regression network for global landmark detection, and a CCR network for local landmark detection. The proposed network has a refinement structure based on multi-task learning, in which the heatmap regression and CCR tasks can be trained simultaneously, and the information from each task is transferred to other tasks. We added attention modules for selecting valid feature maps and designed backward propagation for effective multi-task learning.
Figure 2. Overview of the proposed network structure. Each task shares layer weights in the feature extraction. In the first stage, heatmaps are regressed by stacked hourglass networks. The predicted result is then propagated to the next stage and used to regress the offset coordinates.

Heatmap-Guided Selective Feature Attention
Low-level feature maps in deep neural networks have a lot of noise and unrefined information and are mainly composed of spatially filtered information, such as edges [19]. By contrast, high-level feature maps have less noise as the layer deepens and contain important semantic information for the output. As the feature maps of each level are sequentially filtered toward the target, the rich information in low-level feature maps is gradually reduced. The effectiveness of low-level feature maps has been demonstrated using fully convolutional networks [40] and feature pyramid networks [41]. In the face alignment field, Lin et al. [21] also demonstrated the effectiveness of multi-level features using convolutional block attention modules (CBAM) [42] for removing noisy information from low-level feature maps.
We propose selective feature attention (Figure 3) that selects multi-level feature maps. The attention module can select valid feature maps without losing the properties of multi-level feature maps. We used the estimated heatmap to improve the CCR performance before attention. They are filtered using convolution layers as in Equation (1), where $i$ is an index of the branch, $t$ is an index of the task stage, $\oplus$ is the concatenation, $F$ is the feature map, $H$ is the heatmap, $F_{res}$ is the residual block, and $F_{1\times1}$ and $F_{3\times3}$ are the convolution operations. The superscript on $F_{1\times1}$ is the order of the network layers. The filtered feature maps of each branch are combined using Equation (2). In Equation (2), the feature maps $U$ are integrated through an element-wise summation and calculated using the global average pooling $F_{gap}$ and convolution layer $F^{1}_{1\times1,t}$. The value of $z$ is then calculated using $F^{2}_{1\times1,i,t}$ for each branch excitation, and the final scale $s_{i,t}$ is calculated using the softmax function. The final feature map $F^{att}_{t}$ is calculated through multiplication of $s_{i,t}$ and $U_{i,t}$.
The dimensions of the feature maps were adjusted using the output of a 1 × 1 convolution layer. The effectiveness of the selective feature module is shown in Figure 4. The feature maps were brightly expressed on the face. In addition, we verified that the heatmap causes the feature maps to focus on the facial components and reduces noise.
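The selection steps described above (element-wise summation, global average pooling, a shared layer, one excitation layer per branch, and a softmax across branches) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the 1 × 1 convolutions are replaced by plain matrix products on pooled vectors, the weights `w_squeeze` and `w_excite` are hypothetical, the heatmap-guided filtering of each branch is omitted, and summing the scaled branches into one output is an assumption that the text does not state explicitly.

```python
import math


def global_average_pool(channels):
    """Average each channel (a flat list of pixel values) to one number."""
    return [sum(channel) / len(channel) for channel in channels]


def linear(weights, vector):
    """Stand-in for a 1x1 convolution applied to a pooled vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def selective_attention(branches, w_squeeze, w_excite):
    """branches[i][c] is channel c of branch i, stored as a flat pixel list."""
    num_branches = len(branches)
    num_channels = len(branches[0])
    num_pixels = len(branches[0][0])

    # Element-wise summation of the branch feature maps.
    fused = [
        [sum(branches[i][c][p] for i in range(num_branches)) for p in range(num_pixels)]
        for c in range(num_channels)
    ]
    # Global average pooling followed by the shared layer.
    z = linear(w_squeeze, global_average_pool(fused))
    # One excitation layer per branch produces per-channel logits.
    logits = [linear(w_excite[i], z) for i in range(num_branches)]

    # Softmax across branches, computed independently for each channel.
    scales = []
    for c in range(num_channels):
        peak = max(logits[i][c] for i in range(num_branches))
        exps = [math.exp(logits[i][c] - peak) for i in range(num_branches)]
        total = sum(exps)
        scales.append([e / total for e in exps])

    # Scale each branch and sum over branches (the sum is an assumption).
    attended = [
        [
            sum(scales[c][i] * branches[i][c][p] for i in range(num_branches))
            for p in range(num_pixels)
        ]
        for c in range(num_channels)
    ]
    return attended, scales


# Toy input: 2 branches, 2 channels, 2 pixels per channel.
branches = [
    [[1.0, 3.0], [2.0, 2.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
attended, scales = selective_attention(branches, identity, [identity, zeros])
```

Because the scales of each channel sum to one over the branches, the module reweights the branches rather than amplifying them, which is one way to read "selecting" feature maps.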

Designing Backward Propagation Connections
In the proposed network, as shown in Figure 2, the selective feature module and summation component are essential for exchanging information between stage tasks. The selective feature module connects the feature extractor to the network layer of all stage tasks, and the summation component connects the network layer of the previous stage task to the network layer of the current stage task. Therefore, the learning result for each stage task in the proposed network is significantly affected by the backward propagation connection structure of the summation component.
We designed backward propagation connections for the summation component to control the learning influence between the network layers of each stage task. Figure 5 shows the three backward propagation connections between the network layers of the tasks.

• Task-wise connection: No backward propagation connection for the summation component in any stage task (Figure 5a). It is a common structure for multi-task learning, and all tasks share the feature extractor module in the early network layers. The shared feature extractor prevents overfitting to a single task type. Because the feature extractor module is a front-end network module, it only slightly impacts performance.
• Full connection: A backward propagation connection for the summation component in all stage tasks (Figure 5b). The backward information of tasks affects not only the feature extractor shared by all tasks but also task-specific layers. Because the information from the neighbor stage is backward-propagated to the specific task layers, an improvement or deterioration of performance is clearly observed for the backward propagation of the neighbor stage.
• CCR connection: A backward propagation connection for the summation component only in the CCR tasks, excluding the heatmap regression task (Figure 5c). Compared with the full connection, it removes the backward propagation connection between the heatmap regression task and the first CCR task. By not propagating the bad backward information of the first CCR task to the heatmap regression network, it improves the performance of CCR tasks and makes the training for each task manageable.

We used the CCR connection in the proposed network and evaluated the performance of the three connection types in the ablation study described in Section 5.3.
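The difference between the three connection types can be made concrete with a toy scalar model. The sketch below is an assumption-laden illustration, not the paper's network: every stage is reduced to a single hypothetical scalar weight (`w_s` for the shared extractor, `w_h` for the heatmap stage, `w_1` and `w_2` for two CCR stages), and the gradients are written out by hand. The forward values are the same for all three types; only the gradients that pass back through each summation component differ.

```python
def stage_gradients(connection, x=1.0, w_s=1.0, w_h=0.5, w_1=2.0, w_2=3.0):
    """Toy scalar model of heatmap -> CCR1 -> CCR2 with summation components.

    `connection` is "task-wise", "full", or "ccr". All weights are
    hypothetical scalars standing in for whole network stages.
    """
    if connection not in ("task-wise", "full", "ccr"):
        raise ValueError(f"unknown connection type: {connection}")

    # Which summation components let gradients flow back to the previous stage.
    heatmap_to_ccr1 = connection == "full"
    ccr1_to_ccr2 = connection in ("full", "ccr")

    # Forward pass: identical for all three connection types.
    f = w_s * x          # shared feature
    h = w_h * f          # heatmap stage output
    c1 = w_1 * (f + h)   # first CCR stage (summation with previous stage)
    c2 = w_2 * (f + c1)  # second CCR stage (summation with previous stage)

    return {
        "c2": c2,
        # Gradient of the first CCR output with respect to the heatmap weight.
        "d_c1_d_w_h": w_1 * f if heatmap_to_ccr1 else 0.0,
        # Gradient of the second CCR output with respect to the first CCR weight.
        "d_c2_d_w_1": w_2 * (f + h) if ccr1_to_ccr2 else 0.0,
        # Gradient of the second CCR output with respect to the heatmap weight.
        "d_c2_d_w_h": w_2 * w_1 * f if (heatmap_to_ccr1 and ccr1_to_ccr2) else 0.0,
    }


full = stage_gradients("full")
ccr = stage_gradients("ccr")
task_wise = stage_gradients("task-wise")
```

In this toy model, the CCR connection keeps the gradient between the two CCR stages while the heatmap weight receives no gradient from either CCR output. The shared feature `f` still receives gradients from every stage through its direct path, which corresponds to the shared feature extractor in all three designs.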

Cascaded Face Alignment Network with Heatmap-Guided Selective Feature
The cascaded face alignment network with heatmap-guided selective features (CHS) is the proposed network structure in this study. The heatmap regression task estimates the probability that a landmark exists in each pixel, and the coordinate regression task predicts the position of the landmark in the image. Although they have the same objective for facial landmark detection, each task has parameters with different units and scales. Therefore, it is not easy to simultaneously train two kinds of tasks with a multi-task learning network structure. For effective multi-task learning, the proposed network is composed of four types of modules, as shown in Figure 2.

• Feature extractor: The feature extractor extracts feature maps from an input image, and they are shared by all tasks. It consists of a convolution layer and B + 1 residual blocks for B input branches of the selective feature module.
• Selective feature module: The selective feature module in this paper selects valid feature maps from several branches extracted from the feature extractor.
• Heatmap regression network: The heatmap regression network estimates landmark heatmaps and a boundary heatmap, like the stacked hourglass network in AWing [28].
• Cascaded coordinate regression network: The CCR network extracts the ROI feature map for each landmark through the ROI pooling layer and concatenates the coordinate channels [43]. The coordinate channels, which represent the coordinates in the feature map, can improve coordinate regression performance when concatenated with the original feature channels [28,43]. In this paper, the coordinate channels are concatenated to the feature map for each landmark to improve CCR performance. Here, the ROI feature map was independently created through a residual block and a convolution layer for each landmark. The feature maps were concatenated in the last layer and used to estimate the offset coordinates using the fully connected layer. The global landmark coordinates of the current stage were obtained by adding the estimated offset to the global landmark coordinates of the previous stage.
We experimentally found that the CHS with four CCR stages (4-CCRs) has the best performance (Section 5.1), but there is only a slight difference in performance between the stages of the CCR network. The network parameters can therefore be reduced by pruning later CCR stages after training. Figure 6 shows the loss value and NME measured at each CCR stage in the last epoch of the trained heatmap-4CCRs model for the 300W dataset. This study experimentally confirmed an effective performance improvement at the second of the four CCRs. Finally, we pruned the third and fourth CCR networks from the entire model after training.
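The cascaded update and the post-training pruning can be sketched in a few lines. The sketch assumes that per-stage offsets are already available (in the network they come from the fully connected layer of each CCR stage), and the `keep_stages` parameter is a hypothetical way to express pruning. The `coordinate_channels` helper follows the common CoordConv-style convention of x and y channels normalized to [-1, 1]; that normalization is an assumption, not a detail given in the text.

```python
def refine_landmarks(initial_landmarks, stage_offsets, keep_stages=None):
    """Add each stage's estimated offsets to the previous stage's landmarks.

    `keep_stages` (hypothetical) keeps only the first stages, mimicking the
    pruning of later CCR stages after training.
    """
    stages = stage_offsets if keep_stages is None else stage_offsets[:keep_stages]
    landmarks = [tuple(point) for point in initial_landmarks]
    history = [landmarks]
    for offsets in stages:
        landmarks = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(landmarks, offsets)]
        history.append(landmarks)
    return history


def coordinate_channels(width, height):
    """Return x and y coordinate channels for an ROI feature map (width, height > 1)."""
    xs = [[-1.0 + 2.0 * c / (width - 1) for c in range(width)] for _ in range(height)]
    ys = [[-1.0 + 2.0 * r / (height - 1) for _ in range(width)] for r in range(height)]
    return xs, ys


# One landmark decoded from the heatmap, followed by four assumed stage offsets.
initial = [(100.0, 56.0)]
offsets = [[(1.0, 0.5)], [(0.25, 0.25)], [(0.0, 0.0)], [(0.0, 0.0)]]
history = refine_landmarks(initial, offsets, keep_stages=2)
xs, ys = coordinate_channels(3, 2)
```

With offsets that shrink toward zero after the second stage, dropping the third and fourth stages leaves the final landmarks unchanged, which mirrors the motivation for pruning.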

Sensors 2023, 23, x FOR PEER REVIEW 9 of 23
The final loss function in multi-task learning is a combination of task losses. To overcome training problems caused by the different units and scales of the task losses, previous studies [37,38], which dealt with face alignment and other facial attribute classifications, used a weighted sum of task losses with fixed weights. However, the performance of this approach largely depends on the predefined weights. Kendall et al. [44] proposed adaptive weights through maximum likelihood inference, as shown in Equation (5), assuming that each task follows a Gaussian distribution, to determine the optimal weights for image segmentation and depth regression losses.
$$\mathcal{L}_{\mathrm{adaptiveWeight}} = \sum_{t=1}^{T} \left( \frac{1}{2\sigma_t^{2}} \mathcal{L}_t + \ln \sigma_t \right) \tag{5}$$
where $\mathcal{L}$ is the loss, $T$ is the number of tasks, and $\sigma$ is a trainable parameter adjusted to suit the training state. Figure 7 shows the change in the adaptive weight proposed in this paper. The weight of the heatmap regression task loss is significantly larger than that of the coordinate regression task loss, and the difference between them increases as training progresses. The significant difference makes it difficult to determine the optimal weights. To reduce the difference, we first used the fixed loss weights and then alleviated them using the adaptive loss weights.
$$\mathcal{L}_{\mathrm{scaledAdaWeight}} = \sum_{t=1}^{T} \left( \frac{1}{2\sigma_t^{2}} \alpha_t \mathcal{L}_t + \ln \sigma_t \right) \tag{6}$$
In Equation (6), $\alpha$ is a fixed weight. The proposed method uses the AWing loss [28] for heatmap regression, and the log function in the Wing loss [18] was used for coordinate regression (Equation (7)).
$$\mathrm{Wing}_{\mathrm{small}}(\Delta y, \Delta\hat{y}) = w_{\mathrm{wing}} \ln\left( 1 + \frac{|\Delta y - \Delta\hat{y}|}{\epsilon_{\mathrm{Wing}}} \right), \quad \Delta\hat{y} = y - \hat{y} \tag{7}$$
$\Delta y$ is the estimated offset, and $\Delta\hat{y}$ is the ground-truth offset. We did not use the linear function of the Wing loss because the ROI constrains coordinate regression. A large error caused by an offset estimated outside the ROI makes it difficult to reach the global minimum of the loss function during training. In this study, because the log function maps a large error to a small loss value, such an error has little effect on the shared layer. The hyperparameters of the heatmap loss function and the CCR loss function are the same as those in AWing [28], and the Wing loss was set to $w_{\mathrm{wing}} = 8.0$ and $\epsilon_{\mathrm{Wing}} = 10.0$. The final loss function is shown in Equation (8):
$$\mathcal{L} = \frac{1}{2\sigma_1^{2}} \alpha_1 \mathrm{AWing}(H, \hat{H}) + \ln \sigma_1 + \sum_{t=2}^{T} \left( \frac{1}{2\sigma_t^{2}} \alpha_t \mathrm{Wing}_{\mathrm{small}}(\Delta y, \Delta\hat{y}) + \ln \sigma_t \right) \tag{8}$$
where $H$ is the estimated heatmap and $\hat{H}$ is the ground-truth heatmap.
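The log-only Wing term and the scaled adaptive weighting can be sketched as follows. The defaults `w=8.0` and `eps=10.0` are the values stated above. The 1/(2σ²) factor follows the form commonly attributed to Kendall et al. and is an assumption here, as is the choice to store each trainable σ as `log_sigma` and the use of one fixed weight per task (`alphas`). The AWing heatmap loss is not reimplemented; its value is passed in as a plain number.

```python
import math


def wing_small(pred_offset, true_offset, w=8.0, eps=10.0):
    """Log part of the Wing loss for one offset coordinate (no linear part)."""
    return w * math.log(1.0 + abs(pred_offset - true_offset) / eps)


def scaled_adaptive_loss(task_losses, log_sigmas, alphas):
    """Sum of alpha_t * L_t / (2 * sigma_t^2) + ln(sigma_t) over all tasks.

    The 1/(2 sigma^2) factor and the log-sigma parameterization are assumptions.
    """
    total = 0.0
    for loss, log_sigma, alpha in zip(task_losses, log_sigmas, alphas):
        sigma_squared = math.exp(2.0 * log_sigma)
        total += alpha * loss / (2.0 * sigma_squared) + log_sigma
    return total


# Assumed example: one heatmap loss value and one CCR stage loss.
heatmap_loss = 2.0                 # placeholder for an AWing loss value
ccr_loss = wing_small(10.0, 0.0)   # offset error of 10 pixels
total = scaled_adaptive_loss([heatmap_loss, ccr_loss], [0.0, 0.0], [1.0, 1.0])
```

Because the log term grows slowly, an offset that lands far outside the ROI contributes a bounded gradient magnitude, which matches the stated reason for dropping the linear part.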

Experiments
To demonstrate the effectiveness of the proposed method, we conducted experiments on four popular benchmark datasets: 300W [11], AFLW [12], COFW [13], and WFLW [9]. The 300W dataset is the most widely used, and the 300W private test dataset is used for competition among models trained on the 300W public dataset. AFLW focuses on large poses, whereas COFW focuses on large occlusions. WFLW is currently the most challenging dataset, and its attribute annotations allow the performance to be evaluated separately for each type of noise. Since the proposed method does not address training with extra data, we compared it to state-of-the-art methods trained from scratch. Results of state-of-the-art studies [9,23,24,33] known to have used extra data were excluded for a fair comparison.

Normalized Mean Error
The normalized mean error (NME) was used to evaluate the distance between the facial landmark detection result and the ground truth as
$$\mathrm{NME} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lVert \hat{x}_i - x_i \rVert_2}{d} \qquad (9)$$
where $\hat{x}$ is the predicted landmark, $x$ is the ground truth, $N$ is the number of landmarks, and $d$ is the normalization factor. We employed the inter-pupil distance (IPD) and inter-ocular distance (IOD) as normalization factors on the 300W and COFW datasets. The IPD is the distance between the centers of the two eyes, and the IOD is the distance between the outer endpoints of the two eyes. AFLW uses the face size, and the 300W private test dataset and WFLW use the IOD.

Failure Rate
The failure rate (FR) is another metric of detection quality: it is the ratio of samples whose NME exceeds a threshold to all samples. A larger FR therefore means more failed samples. In this study, the NME threshold was set to 10% for the COFW and WFLW datasets.

Area under the Curve
The area under the curve (AUC) is calculated by integrating the cumulative error distribution (CED) curve. The CED can be expressed as a curve by connecting the ratio of the samples to the corresponding NME. In general, the curves are expressed with an NME below a certain threshold, which we define as 7% for the AFLW dataset and 10% for the COFW and WFLW datasets.
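The three metrics above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the authors' evaluation code: landmarks are (x, y) tuples, NME values are fractions (0.10 for 10%), the AUC is normalized by the threshold so that it lies in [0, 1], and the CED curve is integrated with a simple trapezoidal rule. All function names are hypothetical.

```python
import math


def nme(pred, gt, d):
    """Mean Euclidean landmark error of one sample, normalized by d."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / (len(dists) * d)


def failure_rate(nmes, threshold=0.10):
    """Fraction of samples whose NME exceeds the threshold."""
    return sum(e > threshold for e in nmes) / len(nmes)


def auc(nmes, threshold=0.10, steps=1000):
    """Area under the CED curve on [0, threshold], divided by threshold."""
    n = len(nmes)
    xs = [threshold * i / steps for i in range(steps + 1)]
    # CED value at x: ratio of samples with NME not larger than x.
    ced = [sum(e <= x for e in nmes) / n for x in xs]
    area = sum(
        (ced[i] + ced[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(steps)
    )
    return area / threshold
```

For a dataset evaluated with the IOD, `d` would be the distance between the outer eye corners of that sample; for AFLW it would be the face size.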

Implementation Details
We cropped the face region of every image in the training and test sets using the bounding box provided by the dataset. Because no bounding box is provided for the 300W private test dataset, we used a ground-truth bounding box spanned by the outermost ground-truth landmarks. The cropped facial images were resized and used as inputs to the network.
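The bounding box built from the outermost landmarks can be sketched as follows. The helper names, the output size of 256, and the independent scaling of each axis are assumptions made for illustration; the text does not state the input resolution or whether a margin or aspect-ratio preservation was used.

```python
def bbox_from_landmarks(landmarks):
    """Tight box (x_min, y_min, x_max, y_max) around the outermost landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return min(xs), min(ys), max(xs), max(ys)


def to_crop_coords(landmarks, box, out_size):
    """Map landmarks into a crop of `box` resized to out_size x out_size."""
    x0, y0, x1, y1 = box
    sx = out_size / (x1 - x0)  # horizontal scale (assumed independent per axis)
    sy = out_size / (y1 - y0)  # vertical scale
    return [((x - x0) * sx, (y - y0) * sy) for x, y in landmarks]


points = [(10, 20), (30, 60), (20, 40)]
box = bbox_from_landmarks(points)
resized = to_crop_coords(points, box, out_size=256)  # 256 is a hypothetical size
```

The same mapping, inverted, converts predictions made on the resized crop back to the coordinates of the original image.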
In the heatmap regression network, the heatmap size was 64 × 64 pixels. We used a 4-stacked hourglass network. The heatmap regression network estimates landmark heatmaps and boundary heatmaps [28]. The input feature maps of the selective feature had dimensions of 64 × 64 × 128 pixels and used three branches. ROI pooling in the coordinate regression network applied a fixed ROI with a pixel resolution of 5 × 5 pixels. The residual block in the heatmap regression network used a hierarchical, parallel, and MS block [45], whereas the residual block in the coordinate regression network used a basic block [6]. We set the fixed loss weights to 1.0, except for the last CCR loss weight of 3.0.
The coordinate transformation in the heatmap was conducted in the same manner as that used by Newell et al. [7]. We trained the network model using the SGD optimizer and set the momentum to 0.9 and the weight decay to $5 \times 10^{-4}$. The batch size was 10, and the learning rate schedule was defined over the epochs $k \in \{120, 200, 240, 245, 250, 255, \ldots, 330\}$ (Equation (10)), where $k$ is the epoch and the initial learning rate $lr_0$ is $1 \times 10^{-2}$. This learning rate schedule was determined experimentally, and the networks were trained from scratch using only the training set provided by each dataset. We applied the same data augmentation, learning rate schedule, and network structure to all datasets. The only difference in the implementation is the fully connected layer at the final output of the network, owing to the different number of landmarks in each dataset.

Evaluation of 300W
The 300W dataset consists of 3148 training images and 689 test images with 68 landmarks. The test set is divided into a common subset with less noise and a challenging subset with large poses and occlusions. The 300W private test dataset consists of 300 indoor images and 300 outdoor images. COFW-68 [46] provides 68 landmarks for the test set of COFW. The CHS trained on the 300W dataset was also evaluated on the 300W private test dataset and the COFW-68 dataset. Figure 8 shows the results of the CHS for large occlusions. As shown in Table 1, the CHS showed performance improvements of 0.3%, 2.2%, and 0.7% in the common subset, challenging subset, and full set, respectively, compared to ADNet [47] in the IOD evaluation. The CHS also showed performance improvements of more than 2.3% compared to SPIGA [39] on the 300W private test set and 1.3% compared to ACN [10] on the COFW-68 dataset, as shown in Table 2.

Evaluation of AFLW
AFLW consists of 24,386 face images with 21 landmarks and large poses. Of the 21 landmarks, we used only 19 points, excluding the points on the left and right sides of the face. Zhu et al. [49] divided the AFLW dataset into 20,000 images as the training set and 4386 images as the test set, and separately evaluated 1314 images in the test set to measure performance on frontal faces. Figure 9 shows the results of the CHS for various poses. As shown in Table 3, we evaluated the CHS according to each criterion. The CHS showed performance improvements of 8.6% and 2.3% for the full and frontal NME_diag, respectively, in comparison to FaRL [23], and 2.2% for the full NME_box in comparison to DTLD [24]. Figure 10 shows the CED curves of the test samples for the CHS, LUVLi, and KDN results. As the figure indicates, the CHS (blue curve) has a higher proportion of samples with small errors than LUVLi (orange curve) and KDN (green curve).

Evaluation of COFW
The COFW dataset consists of 1315 training images and 507 test images with large occlusions and provides 29 landmarks. Since the COFW training set is smaller than those of the other datasets, the generalization of the network is important for the evaluation. Figure 11 shows the performance of the CHS on heavily occluded samples of the COFW dataset. As shown in Table 4, the CHS obtained NMEs of 4.56 and 3.16 in the IPD and IOD metrics, respectively. These results are improvements of 2.6% and 0.6% over the state of the art in the IPD and IOD metrics, and the failure rate is 33.9% lower than that of ADNet [47] in the IPD metric. The low failure rate of the CHS demonstrates its generalization performance and robustness to large occlusions on COFW.

Figure 11. Visualized examples on the COFW datasets. Green dots denote predictions by the CHS, and red dots denote the ground truth.
Table 4. Evaluation of the COFW dataset. The best scores are denoted in bold.

Evaluation of WFLW
As shown in Table 5, the CHS obtained NMEs of 4.04, 6.76, 4.33, 3.98, 3.87, 4.71, and 4.64 for the entire test set, pose, expression, illumination, make-up, occlusion, and blur subsets, respectively. Compared to SPIGA [39], the performance changed by 0.5%, 5.3%, 2.9%, 0.5%, −1.6%, 4.8%, and 0.2%, respectively; that is, it improved on every subset except make-up. The improvement is largest for the pose and occlusion subsets. Figure 12 shows the landmark detection results of the CHS on the WFLW test set for various types of noise.

Ablation Study
In this section, we present four ablation studies. The proposed CHS network performs multi-task learning, which trains a heatmap regression task and CCR tasks. We evaluated the contribution of the components to network models that deal with heatmap-CCR networks, loss weights, and selective feature modules. In addition, we compared the performance of the proposed selective feature attention to other attention methods. For effective multi-task learning, we evaluated the performance of the proposed network for backward propagation connections between tasks. Finally, we compared the cost with the networks of other studies through an experiment.

Evaluation of Different Components
To evaluate the effectiveness of the CHS, we trained network models with different components on the 300W dataset. Table 6 lists the performance of the network models. When neither fixed nor adaptive loss weights for multi-task learning were applied, all loss weights were set to 1.0 for training. The four-stacked hourglass network (four HGs) trained with only the AWing [28] loss did not achieve a state-of-the-art NME. However, all the proposed configurations showed good performance, and the effect of each component was confirmed experimentally. When all the components were applied, the performance improved by 8.33% over the four HGs.
Table 6. Evaluation of different components. The best scores are denoted in bold.

Comparison of Feature Map Attention Methods
We compared the proposed selective feature module with conventional attention modules. Figure 13a shows the attention structure with single-level feature maps, as used by CBAM [42] and the selective kernel [52]. Figure 13b shows the attention structure with multi-level feature maps using the CBAM proposed by SDFL [21]. Figure 13c shows the selective feature module proposed in this study, and Figure 13d shows the structure of heatmap-guided attention with multi-level feature maps. Table 7 shows the evaluation results on the 300W dataset for models trained with the different attention modules applied to the network layer of the four CCRs. Owing to the structure of the ROI feature maps in the CCR network layer, the performance of the attention modules without the heatmap deteriorated because they used features outside the ROI of the landmark. In contrast, heatmap-guided attention improves performance by focusing on features around landmarks. The selective feature module in this study performed 1.45% better than the SDFL attention using multi-level feature maps and heatmaps together.

Figure 13. Structures of the feature attention module. We applied these attention modules to our network. (a) The single-level feature map attention includes CBAM [42] and Selective Kernel [52]. These use only the feature maps in the last residual block of the feature extractor. (b) The attention module in SDFL [21] is designed using the CBAM for the multi-level feature maps. The last 1 × 1 convolution layer after the SDFL attention adjusts the feature map dimension. (c) The selective feature module proposed in this study. (d) The heatmap-guided attention is applied to the SDFL attention and the selective feature. The estimated heatmap is concatenated with the feature maps before the multi-level feature attention module, such as the SDFL attention or the selective feature.

Evaluation of Different Backward Propagation Settings in the CCR Stage
The CHS is a multi-task learning-based network model consisting of a heatmap regression task and CCR tasks. The CCR stages propagate information to one another because each stage adds its coordinate offset to the landmark coordinates from the previous stage. In multi-task learning, the connections between task stages significantly affect the learning performance through forward and backward propagation. Table 8 shows the results of models trained on the 300W dataset with the three types of backward propagation connections to the previous stage described in Section 3.2. The CHS proposed in this study showed good performance with the connected CCRs. The results show that backward propagation from the CCR network layer degrades the performance at the end of the heatmap regression network, but the feature extractor shared by the heatmap regression network and the CCR network improves not only the performance of heatmap regression but also the performance of backward propagation between CCRs.
Table 8. Evaluation of different backward propagation settings. NME_heatmap-stage and NME_4th-CCR-stage are the values measured at each task stage, and NME_heatmap-stage − NME_4th-CCR-stage is the difference between the two NMEs. The best scores are denoted in bold.
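The forward accumulation across CCR stages described in this section, where each stage adds its offset to the coordinates of the previous stage, can be sketched in plain Python. The function name and data layout are hypothetical, and the sketch covers only the forward pass; whether gradients flow back through a stage's input coordinates is the design choice compared in Table 8, and showing it would require an automatic differentiation framework.

```python
def refine(initial, stage_offsets):
    """Accumulate per-stage offsets onto the initial landmark coordinates.

    initial: list of (x, y) coordinates, e.g. from the heatmap stage.
    stage_offsets: one list of (dx, dy) offsets per CCR stage.
    Returns the coordinates after every stage, starting with the initial ones.
    """
    coords = [tuple(p) for p in initial]
    history = [coords]
    for offsets in stage_offsets:
        coords = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(coords, offsets)]
        history.append(coords)
    return history


# One landmark refined by two stages with made-up offsets.
stages = refine([(10.0, 20.0)], [[(1.0, -2.0)], [(0.5, 0.5)]])
```

Because each stage's output depends on all earlier offsets, a loss on the last stage can in principle send gradients to every earlier stage, which is why the choice of backward connection matters.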

Model Complexity
The CHS requires a relatively high computational load. The proposed heatmap regression network is based on the AWing [28] network, which has many parameters, and the CCR network also has many parameters because it uses independent convolution kernels for each landmark after the ROI pooling. However, as described in Section 3.3, CCR pruning reduces the computational cost while maintaining the NME. Table 9 and Figure 14 present the results of the network models trained on the WFLW dataset in comparison with other state-of-the-art models. Although the four-CCR network proposed in this study has a large number of parameters, it achieves the best NME. The pruned one-CCR network requires 1.14 times the FLOPs of the AWing network but achieves a 7.1% improvement in the NME. A network with all CCRs pruned has the same structure as the AWing network using the selective feature. Its computational cost is only slightly higher than that of the AWing network, yet it performs better than the AWing network with the selective feature trained from scratch, because the feature extractor and the heatmap regression network are improved by multi-task learning with the CCRs.
Figure 14. NME on the WFLW test dataset versus FLOPs. Our CHSs outperformed other methods in terms of the NME. In particular, the pruned CHSs maintained their performance while reducing the computational cost.

Discussion
This study improved the quality of feature maps and the performance of face alignment using heatmap-guided selective feature attention and the multi-task learning-based network.
The heatmap-guided selective feature attention selects valid feature maps, which are robust in noisy environments, from multi-level feature maps with different properties. Figure 4 visualizes feature maps whose noise is reduced by the proposed method. In Section 4, the proposed method obtained good results even on noisy datasets, including the 300W challenging subset, AFLW, and COFW. The proposed method also improved on the previous best method by 5.3% and 4.7% in the WFLW pose and occlusion subsets, respectively.
The multi-task learning-based network controlled the information propagation using the proposed backward propagation connection at the last layer connected between task stages of the CHS. Table 8 shows the performance improvement of the CCR connection, which connects to the same type of task. In addition, the proposed network showed good generalization performance. As shown in Table 5, the CHS performed better on FR 10 relative to other evaluation metrics. This result means that the proposed method makes fewer estimation failures with challenging data.
Previous face alignment methods based on multi-task learning trained regression tasks separately [9,10] or with other facial properties [37,38]. This study demonstrated the positive effect of multi-task learning by improving the performance of the proposed network trained with the same objective tasks, which are coordinate and heatmap regression for facial landmark detection. As shown in Table 9, the network trained from scratch with selective feature attention and four-stacked hourglass networks but without CCRs obtained a 4.27 NME. However, the network trained with all components and 4-CCRs obtained a result of 4.06 NME, which improved by 4.9% even after pruning the 4-CCRs.
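The percentage quoted above appears to be a relative reduction in NME, which can be checked with a short calculation. The helper name is hypothetical, and the reading of the percentage as a relative reduction is an assumption that is consistent with the two NME values given in the text.

```python
def relative_improvement(baseline_nme, new_nme):
    """Relative NME reduction in percent (positive means lower error)."""
    return (baseline_nme - new_nme) / baseline_nme * 100.0


# NME without CCRs (4.27) versus NME trained with CCRs and then pruned (4.06).
gain = relative_improvement(4.27, 4.06)
```

Here `gain` is about 4.9, matching the reported improvement. A negative value would indicate a higher error than the baseline.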
The proposed method showed robust performance on most face images but poor performance on a few large-pose face images. Figure 15 shows several failure cases of the proposed method. The leading cause of failure in large-pose cases is that all the datasets used for training contain many frontal samples but few large-pose samples. Similar failure cases are also observed in other face alignment methods: as shown in the comparison experiments in Section 4, the performance of all face alignment methods is low on the challenging subset of 300W and the pose subset of WFLW, which contain relatively many large-pose face images. Nevertheless, the proposed method performs better than other methods on the large-pose subsets. To deal with the imbalance of samples for large poses, in future work we will apply oversampling, such as pose-based data balancing [18], or data augmentation using synthesized samples, such as GEAN [48].

Conclusions
We propose a heatmap-guided selective feature attention and a multi-task learning-based network with a refinement strategy that combines a heatmap regression task and cascaded coordinate regression tasks for face alignment.
The proposed method improves the quality of feature maps by using valid feature maps, which are selected from multi-level feature maps. Multi-level feature maps provide rich information on various properties, and heatmap-guided selective attention helps to select valid feature maps by focusing on facial components. The designed backward propagation connection improves multi-task learning of the proposed network with coordinate and heatmap regression tasks.
Various experiments have shown that the proposed method is superior to conventional methods. In terms of NME, the proposed method improves on the previous best method by 5.3% and 4.7% for the WFLW pose and occlusion subsets, respectively.
Ablation studies have shown that the proposed method improves the performance of face alignment. Each proposed component has demonstrated effectiveness through evaluations of various network structures in which the components are combined. The heatmap-guided selective feature attention has been compared to other attention methods, and an experiment on designing backward propagation connections finds that the CCR connection with the same task type helps multi-task learning.
The proposed method has shown robust performance on most face images but has failed on large pose face images because most of the datasets consist of mainly frontal pose face images. In the future, we plan to study methods to improve performance using data balancing.
