Suppressing Spoof-Irrelevant Factors for Domain-Agnostic Face Anti-Spoofing

Face anti-spoofing aims to prevent false authentications of face recognition systems by distinguishing whether an image originates from a human face or a spoof medium. In this work, we note that images from unseen domains with different spoof-irrelevant factors (SiFs), e.g., background patterns and subjects, induce a domain shift between source and target distributions. Moreover, when spoof and genuine images share the same SiFs, they exhibit a high level of visual similarity, which hinders accurate face anti-spoofing. Hence, we aim to minimize the discrepancies among different domains by alleviating the effects of SiFs, and thereby improve generalization to unseen domains. To realize this goal, we propose a novel method called the Doubly Adversarial Suppression Network (DASN), which is trained to neglect irrelevant factors and to focus on faithful task-relevant factors. DASN consists of two adversarial learning schemes. In the first, multiple SiFs are suppressed by deploying multiple discrimination heads that are trained against an encoder. In the second, each discrimination head is also adversarially trained to suppress the spoof factor, while the encoder and a secondary spoof classifier jointly aim to intensify the spoof factor by overcoming this suppression. We evaluate the proposed method on four public benchmark datasets and achieve remarkable results in generalizing to unseen domains, demonstrating the effectiveness of the proposed method.


I. INTRODUCTION
Computerized face-recognition techniques [1]-[4] have been successfully deployed in a wide range of real-world applications, such as criminal identification and e-commerce systems. Despite their remarkable accuracy, face recognition systems can erroneously authenticate deceivers as genuine users if spoofing images of the genuine users are presented. Therefore, to ensure the security of face recognition, it is important to devise a way to prevent this error.
Various methods have been proposed to prevent such deceptions. Initially, handcrafted features were utilized to capture discriminative cues such as highlight distortions [5], moiré patterns [6], and eye blinking [7]. Recently, several deep feature-based methods have been proposed that use auxiliary supervision to improve robustness to unseen domains. References [8] and [9] utilized the fact that spoof media and genuine faces differ distinctively in terms of facial depth, and exploited pseudo-depth maps as auxiliary supervision. Other methods have shown improvements by utilizing reflection maps [10] or physiological signals [9] along with the facial depth maps. However, these methods are inherently limited because they depend heavily on the performance of the prior model that estimates the auxiliary information. We note that one major reason why the generalization of such models degrades is the lack of consideration of task-irrelevant factors that exist in each image; hence, we aim to mitigate the effects of these factors instead of utilizing auxiliary information.
(The associate editor coordinating the review of this manuscript and approving it for publication was Huiyu Zhou.)
Images captured in unconstrained real-world scenarios are far more diverse in terms of spoof-irrelevant factors, such as identities or background patterns, than the images of existing benchmark datasets. Since it is impossible to collect images covering every possible combination of these factors, large discrepancies between a training set and real-world data are inevitable. Therefore, it is desirable to mitigate the effect of such spoof-irrelevant factors, so that the model learns faithful spoof factors independently of them even from training datasets of limited diversity.
Here, the term spoof-irrelevant factors (SiFs) denotes factors that are uninformative and irrelevant to face anti-spoofing, yet induce visual similarities. For instance, the spoof and genuine images of the same identity may look more similar to each other than two images of the same class (e.g., spoof or genuine) with different identities. Therefore, if features are naïvely extracted from images without taking the SiFs into account, the extracted features may be arranged according to the SiFs, and face anti-spoofing accuracy deteriorates (Figure 1).
FIGURE 1. Top: Spoof and genuine images that share the same SiFs, such as identity, background, and illumination, are more visually similar to each other than two genuine images that do not share these factors. Bottom: Our method deploys two types of adversarial learning strategies, hence the term Doubly Adversarial Learning. The first adversarial learning suppresses SiFs, so samples that share the same SiFs are dispersed. The second adversarial learning intensifies the spoof factor in the encoded features.
Some methods [11], [12] have been proposed to increase generalization ability by alleviating the effects of SiFs. Reference [11] trains a model to be invariant to noise patterns incurred by sensory devices. Reference [12] makes a model invariant to identities by explicitly disentangling identity features using identity labels. However, multiple types of SiFs exist in face images, and the existing methods consider only a single factor, so they do not readily generalize.
Our insight is that we need a learning scheme that efficiently suppresses multiple SiFs simultaneously without disturbing the learning of the spoof factor. Various types of SiFs exist in images, so reducing the effect of only a single SiF may erroneously guide the network to encode the remaining SiFs as spoof factors. In this regard, it is important to design the model to be invariant to multiple SiFs. In addition, we aim to suppress multiple SiFs stably, so that the suppression neither dominates the overall learning procedure of the encoder nor weakens its discriminative power. To realize this goal, we propose an architecture called the Doubly Adversarial Suppression Network (DASN), which suppresses SiFs by adopting a doubly adversarial learning strategy. To consider various types of SiFs, we deploy multiple discrimination heads after an encoder, and adversarial learning is conducted between the discrimination heads and the encoder. Furthermore, DASN performs additional adversarial learning between the group of the encoder and the secondary spoof classifier on one side, and the intermediate layer of each discrimination head on the other, to intensify the spoof factor in the encoded features. Incorporating the suppression of multiple SiFs is more challenging than suppressing a single SiF, since the training process is hindered by instability. We note that the second adversarial learning scheme stabilizes the overall training process and further boosts performance. Because the two adversarial learning procedures are performed jointly, the overall learning scheme is termed Doubly Adversarial Learning.
The contributions of our work can be summarized as follows:
• We propose DASN, which adopts doubly adversarial learning to effectively suppress the spoof-irrelevant factors and intensify the spoof factor for enhanced generalization ability.
• DASN achieves state-of-the-art performance on various benchmark datasets [13]-[16] for domain generalization of face anti-spoofing. Moreover, our extensive ablation studies show that the suppression of SiFs is effectively conducted by adopting DASN.

II. RELATED WORKS
A. FACE ANTI-SPOOFING METHODS
Face anti-spoofing is becoming increasingly important as concerns about the security of face-recognition systems grow. Many face anti-spoofing methods have been proposed; early methods used hand-crafted features, such as Local Binary Patterns (LBP) [17], [18] or Histograms of Oriented Gradients (HOG) [19], [20]. After the success of deep learning in various tasks, many researchers proposed methods that use deep features [7], [21], [22], achieving improvements over the traditional methods. Other researchers noted that ethnic bias affects the performance of the models [23], and various methods have been proposed to alleviate ethnic bias via single- or multi-modal learning [24]. Recently, researchers have considered auxiliary information, such as facial depth maps [8], physiological signals [9], and reflection maps [10], to capture discriminative cues that could be universally applied to different domains. These works exploited the differences in such auxiliaries between spoof media and real humans (e.g., live humans exhibit physiological signals, whereas spoof media do not), and showed improved generalization to unseen domains. However, the existing methods depend on inaccurate pseudo-labels (e.g., depth maps are limited to facial areas), and the performance is still not satisfactory due to the large variations among different domains.

B. DOMAIN GENERALIZATION METHODS
To address these problems, researchers have tackled face anti-spoofing from the perspective of domain generalization, to take advantage of multiple seen domains [25]-[27]. Reference [26] proposed a multi-adversarial network that enhances generalization to unseen domains by training a generator to learn a feature space shared by multiple discriminators pre-trained on different source domains. Reference [27] proposed a method to supervise the network with generalized learning directions by incorporating domain-shift scenarios into a meta-learning framework.
Reference [25] learned spatio-temporal features by deploying both an image-based network and a video-based network, and utilized class-conditional domain discriminators to further improve generalization capability. Reference [12] proposed a method that disentangles identity factors to achieve invariance to identity. However, [12] considered only a single spoof-irrelevant factor that can disturb face anti-spoofing. In contrast, we consider all types of spoof-irrelevant factors for which labels are available.

III. PROPOSED METHOD
In this section, we first discuss the spoof-irrelevant factors. Then we introduce a doubly adversarial learning scheme for suppressing spoof-irrelevant factors and intensifying the spoof factor in the encoded features. Finally, we describe a Doubly Adversarial Suppression Network (DASN), which is a spoof classification network that is trained using the proposed doubly adversarial learning.

A. SPOOF-IRRELEVANT FACTORS
We define spoof-irrelevant factors as factors that are irrelevant and uninformative to face anti-spoofing, and that incur visual similarity when the same types of SiFs are shared by images. For example, information regarding the facial structure or gender of a subject is not meaningful for detecting face spoofs, yet different images of the same identity are visually alike (Figure 1). Hence, the identity of a face is an irrelevant factor. A variety of SiFs can exist, and a lack of consideration of these factors can cause difficulties in classifying spoof and genuine images that share the same SiFs. To avoid this problem, we aim to build a model that learns features that are discriminative for spoof classification but insensitive to variations of the SiFs.
Face anti-spoofing databases have been collected under various acquisition scenarios by varying SiFs, such as identities and camera sensors. To cover various SiFs, we consider every factor that is provided in the form of labels. Across the databases, at most three types of SiFs are commonly provided as labels: (1) the identity of each face, (2) the environment (i.e., illumination and background conditions), and (3) the sensor (i.e., the type of camera) (Table 1). In our method, adversarial learning between an encoder and multiple discrimination heads is conducted to suppress these SiFs.

B. SPOOF-IRRELEVANT FACTORS SUPPRESSION
We introduce the first adversarial learning scheme that suppresses spoof-irrelevant factors, so that the trained model becomes invariant to them. In this learning scheme, an encoder and multiple discrimination heads are adversarially trained, and SiFs are suppressed in the encoded features ( Figure 2).

1) SPOOF CLASSIFICATION
We are given a set X of images, a corresponding set Y of spoof class labels, and corresponding sets F_k of SiF labels, where k indexes a SiF in the set K = {identity, environment, sensor}. The entire spoofing network consists of an encoder E and a spoof classifier C (Figure 3). Given an image and its corresponding label (x, y) ∼ (X, Y), the encoder E and the spoof classifier C are trained to predict whether the image is spoof or genuine by minimizing the following spoof classification loss:

L_cls(E, C) = −[(1 − y) log(1 − σ(C(E(x)))) + y log(σ(C(E(x))))],   (1)

where σ denotes the softmax function. Training the network solely with the spoof classification loss leads to degraded results; we hypothesize that this degradation results from features that are entangled with SiFs.

FIGURE 3. Overview of DASN; the network is trained by our doubly adversarial learning in two steps. In the first learning step, the encoder, the spoof classifier, and the secondary classifier strive to suppress SiFs and intensify the spoof factor in a collaborative way [28]. A GRL reverses the sign of the gradients so that the encoder is updated in a way that suppresses SiFs. In the second learning step, the discrimination heads learn to suppress the spoof factor and classify SiFs; the SiF-aware intermediate layer I_k and the discriminator D_k are updated with another GRL to suppress the spoof factor in I_k. In each step, only the blue-colored parts are updated. As the training progresses, the discrimination heads gradually diverge, and DASN successfully suppresses the three SiFs, k ∈ {identity, environment, sensor}, while the spoof factor is intensified as the encoder overcomes the spoof-factor suppression.
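As a concrete illustration, the spoof classification loss above can be sketched as a binary cross-entropy over a two-way softmax. This is a minimal sketch, not the authors' implementation; the function names and the label convention (1 = spoof, 0 = genuine) are our assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def spoof_classification_loss(logits, y):
    """Binary cross-entropy mirroring Eq. 1.

    logits: [genuine_logit, spoof_logit] from the classifier head C(E(x)).
    y: 1 for spoof, 0 for genuine (label convention assumed here).
    """
    p_spoof = softmax(logits)[1]
    eps = 1e-12  # guard against log(0)
    return -((1 - y) * math.log(1 - p_spoof + eps) + y * math.log(p_spoof + eps))
```

The loss is smallest when the softmax probability assigned to the true class is high, which is what drives E and C toward discriminative spoof features.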

2) SPOOF-IRRELEVANT FACTORS SUPPRESSION
We introduce a learning scheme that makes the model focus on the spoof factor by discarding SiFs, so that accurate face anti-spoofing is conducted. In this scheme, we deploy multiple discrimination heads, each consisting of a SiF-aware intermediate layer I_k and a discriminator D_k, where k corresponds to each of the SiFs in K (Figure 3). Each discriminator classifies the corresponding SiF from the encoded features by minimizing the SiF classification loss:

L_sif^k(E, I_k, D_k) = −Σ_{n=1}^{N_k} 1_[n = f_k] log(σ(D_k(I_k(E(x))))_n),   (2)

where f_k is the SiF label that corresponds to x, and 1_[n = f_k] is an indicator function that equals 1 if n = f_k and 0 otherwise. N_k denotes the number of classes of each type of spoof-irrelevant factor (Table 1). As an opponent of the discrimination heads, the encoder is adversarially trained to maximize this loss with the objective of suppressing SiFs. This learning scheme corresponds to the following adversarial procedure:

min_{I, D} max_E Σ_{k∈K} L_sif^k(E, I_k, D_k),   (3)

where I and D denote the sets of I_k and D_k over all SiFs in K, respectively. The maximization with respect to the encoder is implemented by a gradient reversal layer (GRL) [29] inserted between the encoder and each of the discrimination heads (Figure 3). The GRL acts as an identity function during forward propagation, and reverses the sign of gradients by multiplying them by −1 during backpropagation. If SiFs are abundant in the encoded features, the discriminators can easily distinguish the SiFs with small errors. Hence, to maximize the loss, the encoder strives to suppress the SiFs during feature encoding. We observe that as training proceeds, the SiF classification losses gradually diverge; this trend implies that the discriminators fail to classify the SiFs from the encoded features, i.e., the encoder succeeds in suppressing the SiFs so that the encoded features become invariant to them.
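The GRL's behavior can be sketched in framework-agnostic form: identity on the forward pass, sign-flipped (and optionally scaled) gradient on the backward pass. This is a conceptual sketch only; in practice the GRL is realized through an autograd framework's custom-backward mechanism, and the class and parameter names here are our assumptions.

```python
class GradientReversal:
    """Conceptual gradient reversal layer (GRL): identity in the forward
    pass, negated (optionally scaled) gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad):
        # Gradients flowing back to the encoder are negated, so minimizing
        # the SiF loss downstream maximizes it with respect to the encoder.
        return -self.scale * grad
```

Because the downstream discriminators still receive ordinary gradients, they keep learning to classify SiFs, while the encoder receives reversed gradients and learns to erase them.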

C. DOUBLY ADVERSARIAL LEARNING
The first adversarial learning alleviates the effects of SiFs in the encoded features; however, it may cause the encoder to lose discriminative power, since the SiF suppression procedure does not guarantee that the features remain discriminative for spoof detection. Moreover, the overall learning of the encoder can be dominated by the SiF suppression procedure, since the encoder has to compete against multiple discrimination heads. To address this problem, we deploy an additional learning strategy. Along with the spoof classification loss (Eq. 1), this scheme guides the encoder to learn more discriminative features with intensified spoof factors. In this scheme, the group of the encoder and the secondary classifier is adversarially trained against each of the intermediate layers of the discrimination heads (Figure 2).

1) SECONDARY SPOOF CLASSIFICATION
We deploy a secondary spoof classifier S after each of the SiF-aware intermediate layers (Figure 3). The objective of the secondary spoof classifier is to minimize the secondary spoof classification loss, which is defined as:

L_s^k(E, I_k, S) = −[(1 − y) log(1 − σ(S(I_k(E(x))))) + y log(σ(S(I_k(E(x)))))].   (4)
In contrast, each of the intermediate layers is adversarially trained to maximize this loss. Thus, this learning scheme corresponds to the following min-max procedure:

min_{E, S} max_I Σ_{k∈K} L_s^k(E, I_k, S).   (5)

As the intermediate layer I_k (Figure 3) tries to maximize the secondary spoof classification loss, it suppresses the spoof factor in the features passed from the encoder. The encoder, on the other hand, has to minimize this loss even though its features are subsequently suppressed by I_k; therefore, the encoder strives to intensify the spoof factor in the encoded features so that they can still be classified correctly. Moreover, as the encoder is adversarially trained to overcome the suppression of multiple SiF-aware intermediate layers, it is collaboratively [28] trained on varied gradients from each intermediate layer.

2) DOUBLY ADVERSARIAL LEARNING
The overall learning process involves two different adversarial learning processes, so we call it Doubly Adversarial Learning. The full objective is summarized by aggregating the defined losses with corresponding weight terms λ_sif^k for each k in K, and is written as:

min_{E, C} L_cls(E, C),
min_{I, D} max_E Σ_{k∈K} λ_sif^k L_sif^k(E, I_k, D_k),   (6)
min_{E, S} max_I Σ_{k∈K} L_s^k(E, I_k, S),   (7)

where Eq. 6 is the min-max procedure that suppresses SiFs, and Eq. 7 is the min-max procedure that intensifies the spoof factor.
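Since the adversarial (max) directions of Eqs. 6-7 are handled by gradient reversal layers, an implementation can aggregate everything into a single weighted sum before backpropagation. The sketch below illustrates that aggregation only; the function signature and dictionary layout are our assumptions, not the authors' code.

```python
def doubly_adversarial_objective(spoof_loss, sif_losses, secondary_losses, lambda_sif):
    """Aggregate DASN's losses into one scalar, mirroring the full objective.

    sif_losses, secondary_losses, and lambda_sif are dicts keyed by the SiF
    type k in {"identity", "environment", "sensor"}. The sign reversals of
    the min-max procedures are assumed to be applied by GRLs, so the
    aggregate here is a plain weighted sum.
    """
    total = spoof_loss
    for k, loss in sif_losses.items():
        total += lambda_sif[k] * loss  # lambda_sif^k * L_sif^k (Eq. 6 term)
    for loss in secondary_losses.values():
        total += loss  # L_s^k (Eq. 7 term)
    return total
```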

D. DOUBLY ADVERSARIAL SUPPRESSION NETWORK
We propose DASN to realize the introduced learning scheme; it learns complex objectives that are in opposition to each other, so we divide the overall learning scheme into two steps ( Figure 3).

1) STEP 1: UPDATE ENCODER AND SPOOF CLASSIFIERS
We train the encoder E, the spoof classifier C, and the secondary spoof classifier S. E and C are updated by minimizing the spoof classification loss (Eq. 1). E is simultaneously trained to maximize the SiF classification loss (Eq. 2). In this way, E learns to suppress the SiFs and to encode features that are informative for spoof classification. To maximize the SiF classification loss, the GRL is inserted between the encoder and each of the discrimination heads. The discrimination heads are not updated during this step and only pass gradients to the encoder.

2) STEP 2: UPDATE DISCRIMINATION HEADS
We train the discrimination heads, i.e., the SiF-aware intermediate layers I_k and the discriminators D_k. Each head is updated by minimizing the SiF classification loss (Eq. 2) to classify its SiF. In addition, another GRL is deployed so that each I_k is adversarially updated to suppress the spoof factor by maximizing the secondary spoof classification loss. The encoder and the spoof classifiers are not updated during this step.

3) INFERENCE STAGE
The discrimination heads and the secondary spoof classifier are only utilized for the training, hence the proposed DASN does not require any additional computational resources in the inference stage.
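The alternating two-step schedule can be summarized by which modules are trainable in each step. This is a descriptive sketch; the module names are hypothetical labels for the components named in the text, not identifiers from the authors' code.

```python
def trainable_modules(step):
    """Modules updated in each training step of DASN's alternating schedule.

    Step 1 updates the encoder and both spoof classifiers; step 2 updates
    the discrimination heads (SiF-aware intermediate layers and
    discriminators). All module names are illustrative.
    """
    if step == 1:
        return {"encoder", "spoof_classifier", "secondary_classifier"}
    if step == 2:
        return {"intermediate_layers", "discriminators"}
    raise ValueError("DASN alternates between steps 1 and 2")
```

Note that the two sets are disjoint: in each step only one side of each adversarial pair is updated, while the frozen side merely passes gradients.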

IV. EXPERIMENTS
A. DATASETS AND METRICS
We train and evaluate the proposed method using four public benchmark datasets (Table 1):

5) METRICS
We evaluate the proposed method following the same evaluation protocols as existing methods [12], [27]: one of the datasets is selected as the testing set, and the remaining three are utilized as training sets. Therefore, four evaluation tasks are possible: O&C&I to M, O&M&I to C, O&C&M to I, and I&C&M to O. We report performance using the Area Under the Curve (AUC) and the Half Total Error Rate (HTER) = (False Acceptance Rate + False Rejection Rate)/2 [30].
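The HTER computation can be sketched directly from its definition. The score/threshold convention below (higher score = more spoof-like, reject when the score exceeds the threshold) is our assumption for illustration.

```python
def error_rates(scores, labels, threshold):
    """FAR and FRR from spoofness scores; label 1 = spoof attack, 0 = genuine.

    A sample is rejected (flagged as spoof) when its score exceeds the
    threshold. FAR: fraction of attacks wrongly accepted. FRR: fraction of
    genuine samples wrongly rejected.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 1]
    genuine = [s for s, y in zip(scores, labels) if y == 0]
    far = sum(1 for s in attacks if s <= threshold) / max(len(attacks), 1)
    frr = sum(1 for s in genuine if s > threshold) / max(len(genuine), 1)
    return far, frr

def hter(far, frr):
    """Half Total Error Rate: the mean of FAR and FRR."""
    return (far + frr) / 2.0
```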

B. EXPERIMENTAL SETUP
We use all frames of all videos to train and test our method. For each frame, we locate face regions using a face detector [31]; the located regions are then cropped and resized to 256×256. As a backbone network, we use a ResNet-18 [32] pre-trained on ImageNet [33]. We use Xavier initialization [34] for the parameters of layers that are not part of the pre-trained network. We apply global average pooling to the output feature map of the backbone network to obtain 512-dimensional feature vectors. Except for the last fully connected (FC) layers, every FC layer is followed by a ReLU activation function and has 512 hidden nodes. We train the network with a constant learning rate of 10^-5 using the Adam optimizer [35] on a single NVIDIA V100 GPU. The weight terms λ_sif^k are selected simply by observing the initial loss values so as to balance the losses to similar scales; e.g., on O&C&I to M we use 0.05 for λ_sif^identity, 0.08 for λ_sif^environment, and 0.08 for λ_sif^sensor. We set the mini-batch size to 32 on the I&C&M to O task, and to 64 on the remaining three tasks.
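The paper only states that the weights λ_sif^k were chosen by observing initial loss values; one simple normalization rule consistent with that description is sketched below. This heuristic, its name, and its arguments are entirely our assumption, not the authors' procedure.

```python
def balance_loss_weights(initial_sif_losses, reference_loss):
    """Illustrative heuristic: choose lambda_sif^k so that
    lambda_sif^k * initial_loss_k matches the scale of a reference loss
    (e.g., the spoof classification loss) observed at the start of training.
    """
    return {k: reference_loss / v for k, v in initial_sif_losses.items() if v > 0}
```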

C. ABLATION STUDIES
We demonstrate the effectiveness of the proposed method by conducting extensive ablation studies. In the ablation studies, all models are also based on ResNet-18; the baseline model is trained solely with the spoof classification loss.

2) COMPARISON WITH STATE-OF-THE-ART METHODS
The proposed method shows significant improvements over the baseline, and also shows superior results over the state-of-the-art methods in most of the protocols (Table 3). For O&C&I to M, DASN improves HTER by 10.00 percentage points over the baseline; the result is 5.56 percentage points lower than the HTER of the state-of-the-art method. For I&C&M to O, DASN improves HTER by 4.13 percentage points over the baseline; the result is 1.39 percentage points lower than the HTER of the state-of-the-art method. For O&C&M to I, despite the lower performance of the baseline, DASN performs comparably to the state of the art and achieves a significant improvement (6.12 percentage points of HTER).

3) LIMITED SOURCE DOMAINS
We evaluate the proposed method on another task that uses a training set consisting of only two domains (Table 4). Our DASN also outperforms the other methods on these tasks. For M&I to C, DASN improves HTER by 6.77 percentage points over the best competing model; for M&I to O, by 12.25 percentage points.

4) SPOOF-IRRELEVANT FACTORS SUPPRESSION
The performance of models can be significantly improved by suppressing SiFs, because these factors can disturb face anti-spoofing (Table 2). Among the SiFs, suppressing the identity factor is the most effective, as its improvements over the baseline are the largest, except on the O&C&I to M task.

5) DOUBLY ADVERSARIAL LEARNING
We also compare the performance of the models by varying the combinations of the suppressed SiFs (Table 2). We observe that increasing the number of suppressed SiFs does not guarantee performance improvements; some combinations even show degraded results compared to those that consider fewer SiFs. This observation implies that as the number of discrimination heads that the encoder competes against increases, the encoder can be disturbed by them, and the spoof factor in the encoded features can be suppressed. By deploying the doubly adversarial learning scheme, the model stably incorporates the multiple SiFs, as the proposed DASN explicitly intensifies the spoof factor in the encoded features; the collaborative learning [28] behavior also contributes to the improved results.

6) ASN d VS. DASN
To distinguish our method from the conventional domain-generalization approach [29], we compare DASN with ASN_d, which uses domain information as a SiF and is trained with only the first adversarial learning scheme (Table 2). DASN shows better generalization ability than ASN_d, which implies that DASN more precisely removes the factors that interfere with face anti-spoofing.

D. VISUALIZATION
1) GRAD-CAM VISUALIZATION
We present Gradient-weighted Class Activation Mapping (Grad-CAM) [40] visualizations, which produce localization maps highlighting the regions important for predicting each class, to gain insight into how the proposed network makes decisions (Figure 4). The visualized results show that the attended regions differ even when the spoof and genuine images share the same SiFs, supporting that our method is insensitive to the SiFs.
In addition, we observe that for spoof images the attended regions are diversely located in the image, whereas for genuine images the attention consistently falls on facial areas. This trend implies that the clues for discriminating spoof images are more diversely located than those for genuine images.
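For reference, the Grad-CAM map itself is computed from a convolutional layer's activations and gradients: each channel is weighted by its spatially averaged gradient, and the weighted sum is passed through a ReLU. The minimal pure-Python sketch below (nested lists standing in for tensors) illustrates that formula; it is not the visualization code used in the paper.

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM sketch.

    activations, gradients: lists of C channels, each an HxW nested list,
    taken from the same convolutional layer. Channel weights are the
    spatially averaged gradients; the map is ReLU(sum_c w_c * A_c).
    """
    n_ch = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Global-average-pool the gradients to get one weight per channel.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for c in range(n_ch):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weights[c] * activations[c][i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    return [[max(v, 0.0) for v in row] for row in cam]
```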

2) t-SNE VISUALIZATION
We visualize the distributions of the output features of DASN using t-SNE [41] (Figure 5). The output features of DASN are more clearly clustered by the spoof factor. Compared to the output features of the baseline model, those of DASN are more dispersed even when they share the same SiFs, and this trend is most pronounced for identity. This implies that the proposed method is effective in suppressing the SiFs and intensifying the spoof factor.

V. CONCLUSION
In this paper, we proposed DASN, which adopts a doubly adversarial learning scheme to improve generalization capability for face anti-spoofing. In our learning scheme, the encoder is trained against multiple discrimination heads to maximize the SiF classification loss, with the objective of suppressing the SiFs in the encoded features. In addition, the encoder learns an intensified spoof factor through additional adversarial learning, in which it aims to minimize the secondary spoof classification loss and thereby overcome the suppression by the SiF-aware intermediate layers. Extensive empirical evaluation on public benchmark datasets demonstrates the effectiveness of the proposed method.