Self-supervised 2D face presentation attack detection via temporal sequence sampling

Conventional 2D face biometric systems are vulnerable to presentation attacks performed with different face artefacts, e.g., printouts, video-replays and wearable 3D masks. The research focus in face presentation attack detection (PAD) has recently been shifting towards end-to-end learning of deep representations directly from annotated data rather than designing hand-crafted (low-level) features. However, even state-of-the-art deep learning based face PAD models have shown unsatisfactory generalization performance when facing unknown attacks or acquisition conditions due to the lack of representative training and tuning data in the existing public benchmarks. To alleviate this issue, we propose a video pre-processing technique called Temporal Sequence Sampling (TSS) for 2D face PAD that removes the estimated inter-frame 2D affine motion in the view and encodes the appearance and dynamics of the resulting smoothed video sequence into a single RGB image. Furthermore, we leverage the features of a Convolutional Neural Network (CNN) by introducing a self-supervised representation learning scheme, where the labels are automatically generated by the TSS method, as the stabilized frames accumulated over video clips of different temporal lengths provide the supervision. The learnt feature representations are then fine-tuned for the downstream task using labelled face PAD data. Our extensive experiments on four public benchmarks, namely Replay-Attack, MSU-MFSD, CASIA-FASD and OULU-NPU, demonstrate that the proposed framework provides promising generalization capability and encourage further study of the topic.


Introduction
Face recognition (FR) has become an indispensable component in numerous real-world application domains requiring reliable person verification or identification, such as device unlocking, online banking, smart homes, airports, video surveillance and law enforcement. The threat of presentation attacks (spoofing) is one of the main issues with FR systems, as conventional FR techniques are vulnerable to direct sensor-level attacks, where an artificial biometric sample is presented to confuse the recognition system using different presentation attack instruments (PAI), e.g., printouts, displays, paper masks and wearable 3D masks. Compared with other biometric traits, such as iris and fingerprint, face biometric samples of the targeted person are much easier to obtain. For instance, people share their pictures openly on the Internet using different social media platforms, from which attackers can easily acquire photographs to create face artefacts. Since conventional FR algorithms are not inherently capable of discriminating attacks from bona fide faces, dedicated presentation attack detection (PAD) methods are needed to mitigate the vulnerabilities to spoofing.
The accuracy of automatic FR is no longer a major concern in numerous real-world applications, thus the focus of the FR research community has shifted towards mitigating the threat posed by presentation attacks. Traditionally, software-based face PAD techniques have been founded on hand-crafted (low-level) features describing liveness and motion cues, like eye blinking and lip movements [1], as well as facial texture [2,3] and image quality [4,5] properties of bona fide and artificial faces. However, low-level features rely heavily on human experience to extract detailed information, and the designed feature spaces might not be able to distinguish subtle differences between bona fide samples and various face artefacts. In the past few years, end-to-end learning of deep features, e.g., Convolutional Neural Networks (CNNs) with different loss functions, has been successfully utilized to overcome some limitations of hand-crafted descriptors [6-15]. A comprehensive overview of the recent advances in deep learning based face PAD can be found in [16].
Although promising results have been achieved, even the state-of-the-art deep learning based face PAD techniques have shown unsatisfactory generalization performance when facing the unknown operating conditions of unconstrained real-world applications. The lack of generalization is largely due to the domain shift between source (train) and target (test) data, as the existing public face PAD benchmarks suffer from severe bias across different covariates, including user demographics, PAIs, sensors, image/video resolution, frame rate, illumination conditions and stand-off distance between face and sensor. The domain generalization issues of software-based face PAD methods have been widely acknowledged, and the recent trend in face PAD research has been increasingly on improving the performance in: 1) cross-database studies, where a method is trained and tested with different datasets [17], and 2) specific intra-database cross-test evaluation protocols, where pre-defined subsets of a dataset are used to introduce unseen test conditions, e.g., cameras, PAIs, illumination and environments [8,18].
Leveraging the potential of the state-of-the-art deep learning architectures and tuning well-generalizing face PAD models are still very difficult problems due to the huge number of parameters and the limited amount of representative training data available in the existing public datasets. The approaches proposed in the context of generalized face PAD can be roughly categorized into: 1) face PAD-specific feature learning to capture the intrinsic differences between real and fake faces [7], 2) data augmentation and synthesis [13], 3) auxiliary supervision [8,12-15], 4) domain adaptation [11,19,20] and generalization [6], and 5) continual detection and learning of novel attack types [9]. While face PAD has traditionally been treated as a "black box" binary classification problem, Jourabloo et al. [7] proposed a deep CNN architecture for explicitly extracting PAI dependent spoof noise, e.g., characteristic reflections, colour distortions and moiré patterns, from facial images and then using spoof noise modelling for discriminating attacks from bona fide samples. Yang et al. [13] introduced a data synthesis technique to simulate digital medium-based spoofing attacks and were able to significantly improve the PAD performance with their augmented training data. Liu et al. [8] proposed to increase the generalization of face PAD methods by exploiting spatial and temporal auxiliary supervision, where face depth can be considered as spatial information while remote photoplethysmography (rPPG) signals (pulse) are used as temporal cues. Several works have also approached the generalization issues in face PAD from the domain adaptation and domain generalization points of view. Domain adaptation [11,19,20] based approaches exploit some data from the target domain to match the feature distributions of the source and target domains, whereas domain generalization [6] based techniques try to minimize the bias between diverse source domains without using any data from the target domain. Rostami et al. [9] proposed to tackle the problem of unknown attacks using continual detection and learning of novel attack types and developed a method to update a face PAD model with test samples that do not fit the training distribution in an embedding space.
Although the generalization ability of the face PAD methods proposed in the literature has been gradually improving, the results are still far from satisfactory for real-world applications. For instance, the performance of methods using auxiliary supervision depends largely on the accuracy of the estimated auxiliary information. Monocular depth estimation from single face images or even short video sequences is rather difficult if active user interaction, e.g., a challenge-response approach, is not utilized during the liveness check. Also, reliable estimation of rPPG signals is hard when the subject is moving or the lighting conditions are challenging. A major problem with domain adaptation based approaches is that collecting data from the target domain is expensive or even impossible in some real-world use cases.
In this work, we propose to use spatiotemporal information for face PAD, because we argue that both static and dynamic information provide important visual cues for discriminating artificial faces from real ones. However, the successive frames in PAD videos are highly redundant. The videos might comprise hundreds of frames repeating similar patterns, which makes it difficult to extract meaningful liveness cues even with deep learning based approaches. Therefore, we propose a simple, yet effective pre-processing method called Temporal Sequence Sampling (TSS) to accumulate the appearance and dynamic information of video sequences into single RGB images. This is achieved by splitting an input video sequence into non-overlapping segments, and then estimating the trajectories of keypoints within each video clip. We focus on the problem of print and display attacks (consisting of both digital photos and video-replays), where the PAIs can be considered as planar 2D objects. Therefore, we stabilize each video segment by removing the inter-frame 2D affine motion estimated based on the keypoint trajectories and then aggregate the resulting video frames into a single image. Fig. 1 provides an illustration of the steps described above. A comparison between straightforward frame aggregation and the output of the proposed TSS approach is shown in Fig. 2, which highlights the amount of inter-frame 2D affine motion in the original print attack video clip. It is worth noting that the cumulative 2D affine transformation estimated within a video clip is not directly used for face alignment but to enrich the spatiotemporal discrepancies between real 3D faces and flat 2D face artefacts in the observed view. The proposed approach can also handle print attacks where the 2D surface is warped, as bending a photograph leads to a highly distorted cumulative 2D affine mapping that is not characteristic of real faces. The problem of video-replay attacks exhibiting also non-rigid facial motion is tackled by focusing on appearance information, which has shown promising generalization in detecting display attacks, e.g., in [21], due to evident screen bezels, video compression artefacts, display noise signatures, moiré effects, and luminance and colour distortions, for instance.
Recently, self-supervised learning [22] has been receiving increasing attention, as solving pretext tasks, like patch location, order and rotation prediction, in an unsupervised manner has been shown to be successful in learning meaningful and more interpretable visual representations from the data itself, thus mitigating the need for human annotations for the downstream task. Inspired by the work on frame order prediction where 3D CNNs and optical flow information have been utilized [23,24], we propose a self-supervised learning scheme where the pretext task is to predict the length of the original video clip based on the TSS encoded data. To be more specific, the stabilized frames accumulated over video segments of different temporal lengths provide the supervision for training a 2D CNN with the aim of learning more meaningful representations from the videos aggregated into single RGB images. The learnt visual features are then fine-tuned for the downstream face PAD task.
The main contributions of this work can be summarized as follows:
1. In order to reduce temporal redundancy and remove inter-frame 2D affine motion in videos, Temporal Sequence Sampling (TSS) is introduced to encode video clips into a compact representation in the form of a single RGB image.
2. The need for annotated data in face PAD is mitigated using self-supervised learning.
3. The effectiveness of the proposed approach is demonstrated using the official cross-test evaluation protocols of the OULU-NPU database [18] and several widely used cross-database configurations, where promising generalization ability with new state-of-the-art results is achieved.
We also provide the source code 2 to the research community for reproducing, verifying and extending our results.

Proposed method
The backbone of the proposed face PAD approach is the TSS method, which removes inter-frame 2D affine motion within a video segment and accumulates the frames of the resulting motion-compensated video clip into a single RGB image (see Figs. 1 and 2). The main architecture of our face PAD framework is illustrated in Fig. 3. During the self-supervised learning phase, the CNN receives unlabelled TSS encoded frames accumulated over video segments of different temporal lengths, and the pretext task is to predict the length of the original video clip. The learnt feature representations are then fine-tuned on labelled TSS encoded video clips by performing PAD as the downstream task (stage 1 in Fig. 3). Finally, a Bidirectional Long Short-Term Memory (BiLSTM) [25] subnetwork is trained using the fine-tuned CNN features to make the final face PAD decision (stage 2 in Fig. 3). A more detailed description of the proposed TSS method and self-supervised learning scheme is provided in the following, while the implementation details and the training process are discussed later in Section 3.3.

Temporal sequence sampling (TSS)
The steps of the proposed Temporal Sequence Sampling method are illustrated in Fig. 1. First, the input video is equally partitioned into S non-overlapping segments (clips), where each video clip contains the same number of frames, e.g., 45. We estimate the 2D affine motion between all adjacent frames of a video clip based on sparse point correspondences. We first use the Speeded Up Robust Features (SURF) descriptor [26] to detect keypoints from both the face and background regions in each video frame and then find the corresponding points between all adjacent video frames using the Hamming distance.
The M-estimator SAmple Consensus (MSAC) algorithm [27] is utilized to mitigate the impact of incorrect point correspondences and to obtain robust estimates of the 2D affine transformations between the adjacent frames. MSAC is an improved version of RANdom SAmple Consensus (RANSAC) in which an M-estimator assigns a constant penalty to outlier point correspondences, while inliers are weighted based on how well they fit the estimated transformation.
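A minimal sketch of this matching and robust estimation step is given below, assuming OpenCV as the implementation library (SURF requires the opencv-contrib build). The helper name, the Hessian threshold, the L2 matching (rather than the Hamming distance used in the paper) and the use of OpenCV's RANSAC-based estimateAffinePartial2D in place of MSAC are illustrative choices, not the authors' actual implementation.

```python
import cv2
import numpy as np

def estimate_interframe_transform(frame_a, frame_b):
    """Estimate a robust 2D similarity transform (scale, rotation, translation)
    between two adjacent frames from sparse keypoint matches."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    # Keypoints are detected on the whole frame (face and background regions).
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_a, desc_a = surf.detectAndCompute(gray_a, None)
    kp_b, desc_b = surf.detectAndCompute(gray_b, None)

    # SURF descriptors are floating point, so L2 matching is used in this sketch.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Robust estimation of a 4-parameter (similarity) transform mapping
    # points of frame_b to frame_a; outliers are rejected by the RANSAC loop.
    A, _inliers = cv2.estimateAffinePartial2D(pts_b, pts_a, method=cv2.RANSAC,
                                              ransacReprojThreshold=3.0)
    return A  # 2x3 matrix [s*cos(t), -s*sin(t), tx; s*sin(t), s*cos(t), ty]
```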
The resulting 2D affine transformation between adjacent frames is a $3 \times 3$ matrix

$$A = \begin{bmatrix} a_1 & a_2 & t_x \\ a_3 & a_4 & t_y \\ 0 & 0 & 1 \end{bmatrix},$$

where the coefficients $a_n$ represent scale, rotation and shearing, and $t_x$ and $t_y$ correspond to translation. However, we convert the 2D affine transformation described above into a simpler and more stable four-parameter transformation to produce the final motion-compensated video clip:

$$A = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix},$$

where $s$ is a scale factor and $\theta$ is the rotation angle.
The 2D affine transformation between two frames $F_i$ and $F_{i+1}$ is denoted with $A_i$, and the cumulative 2D affine transformation of a frame $F_i$ with respect to the first (reference) frame $F_0$ of the video segment corresponds to the cascade of the inter-frame transformations:

$$A_{0,i} = A_0 A_1 \cdots A_{i-1}.$$

The estimated cumulative transformation $A_{0,i}$ is used to remove inter-frame 2D affine motion by warping each frame $F_i$ relative to the first frame $F_0$ of the video clip. It is worth highlighting that the cumulative transformation aims at removing the inter-frame 2D motion, but the frame $F_i$ is not necessarily aligned with the reference frame $F_0$ due to the cumulative errors in estimating the motion between adjacent frames and changes in the observed view within the video clip.
Finally, we take the temporal average of the motion-compensated frames to encode the whole video clip into a single RGB image. An example of the TSS output is shown in Fig. 2.
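To make the accumulation step concrete, the following sketch composes the per-frame transformations into the cumulative transform, warps each frame towards the reference frame and averages the stabilized frames. It assumes the estimate_interframe_transform() helper sketched earlier and the composition order given above; it is not the authors' released code.

```python
import cv2
import numpy as np

def tss_encode_clip(frames):
    """Encode one video clip (list of BGR uint8 frames) into a single image:
    remove the estimated inter-frame 2D motion by warping every frame towards
    the first frame F0 and take the temporal average."""
    h, w = frames[0].shape[:2]
    accumulated = frames[0].astype(np.float64)
    cumulative = np.eye(3)                     # maps current frame -> F0

    for prev, curr in zip(frames[:-1], frames[1:]):
        A = estimate_interframe_transform(prev, curr)   # 2x3, curr -> prev
        A3 = np.vstack([A, [0.0, 0.0, 1.0]])            # lift to 3x3
        cumulative = cumulative @ A3                    # cascade: curr -> F0
        stabilized = cv2.warpAffine(curr, cumulative[:2, :], (w, h))
        accumulated += stabilized

    # Temporal average of the motion-compensated frames.
    return (accumulated / len(frames)).astype(np.uint8)
```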

Self-supervised representation learning
The amount and nature of spatiotemporal variations within video segments, and, consequently, the TSS outputs and their corresponding CNN feature representations, depend largely on the duration of the input sequence. Therefore, the key idea of our self-supervision scheme is to generate sets of TSS encoded video clips with different temporal lengths L and then learn the spatiotemporal variations and context across these different length settings. To be more specific, we first generate T classes of TSS outputs with, e.g., L = {5, 15, 30, 45, 60} and then use these labels to train a deep CNN with softmax loss to predict the length of a given TSS encoded video segment. After this self-supervised spatiotemporal context adaptation step, the learnt 2D visual features are further fine-tuned for the actual downstream task of face PAD, and finally the BiLSTM subnetwork is trained using the resulting CNN features.
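The following sketch illustrates how such pretext labels could be generated and how the classifier head is swapped between the pretext and downstream stages. The function and constant names are hypothetical and only the overall procedure follows the description above.

```python
import torch.nn as nn
from torchvision import models

# Temporal lengths used as pretext classes (the paper uses, e.g., L = {5, 15, 30, 45, 60});
# each TSS output is labelled by the index of the clip length it was generated from.
CLIP_LENGTHS = [5, 15, 30, 45, 60]

def make_pretext_samples(video_frames, tss_encode_clip):
    """Generate (image, label) pairs for the pretext task: the label is the
    class index of the temporal length used to produce the TSS image."""
    samples = []
    for label, length in enumerate(CLIP_LENGTHS):
        for start in range(0, len(video_frames) - length + 1, length):
            clip = video_frames[start:start + length]
            samples.append((tss_encode_clip(clip), label))
    return samples

# Stage 0 (pretext): the head predicts one of len(CLIP_LENGTHS) classes.
backbone = models.resnet101(weights="IMAGENET1K_V1")   # torchvision >= 0.13 weights API
backbone.fc = nn.Linear(backbone.fc.in_features, len(CLIP_LENGTHS))
# ... train on make_pretext_samples(...) with nn.CrossEntropyLoss() ...

# Stage 1 (downstream): replace the classifier for the binary PAD task.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)
# ... fine-tune on labelled (bona fide / attack) TSS images ...
```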

Experimental setup
In the following, we briefly introduce the public benchmark face PAD datasets and describe the evaluation metrics and protocols used in our experimental analysis. Finally, the implementation details of the proposed approach are provided.

Experimental data
To assess the generalization of the proposed face PAD approach, we considered four widely used publicly available databases consisting of bona fide and 2D face presentation attack videos, namely Idiap Replay-Attack Database [28] (denoted as I), CASIA Face Anti-Spoofing Database [29] (denoted as C), MSU Mobile Face Spoofing Database [5] (denoted as M), and OULU-NPU Database [18] (denoted as O).
Idiap Replay-Attack Database [28] consists of bona fide and attack videos of 50 subjects captured under two different lighting conditions.Five different attacks are launched with iPhone 3GS (digital photo and video-replay), 1st generation iPad (digital photo and video-replay) and hard copies.All videos are recorded with a built-in webcam of a MacBook Air laptop.Altogether, the database contains 1200 videos, which are divided into three subject-disjoint subsets for training, development and testing (15, 15 and 20 subjects, respectively).
CASIA Face Anti-Spoofing Database (CASIA-FASD) [29] contains bona fide and attack videos of 50 subjects recorded with three different imaging qualities (low, normal and high) and considers three kinds of attack presentations (warped photo, cut photo and video-replay). Consequently, each subject has three kinds of bona fide videos and nine different attack presentations. Altogether, the database contains 600 videos, which are divided into two subject-disjoint subsets for training and testing (20 and 30 subjects, respectively).
MSU Mobile Face Spoofing Database (MSU-MFSD) [5] includes bona fide and attack videos of 35 subjects recorded with two mobile devices (a Google Nexus 5 smartphone and a MacBook Air laptop). Three kinds of attack presentations are considered, including two video-replay attacks of different quality (iPhone 5S and iPad Air) and a print attack. Consequently, each subject has two kinds of real videos and six different attack presentations. Altogether, the database contains 280 videos, which are divided into two subject-disjoint subsets for training and testing (15 and 20 subjects, respectively).
OULU-NPU Database [18] is one of the most recent commonly used face PAD datasets. It contains bona fide and attack videos of 55 subjects recorded in several acquisition conditions (six high-resolution smartphone front cameras and three sessions) and considers two kinds of print attacks and two kinds of video-replay attacks. Four cross-test protocols are used to evaluate the generalization performance of a face PAD method across different covariates. Protocols 1, 2 and 3 each introduce a single previously unseen test condition, namely illumination, PAI and sensor, respectively, while the fourth and most challenging protocol evaluates the generalization performance simultaneously across unknown sensors, attacks and illumination conditions. Altogether, the database contains 5940 videos, which are divided into three subject-disjoint subsets for training, development and testing (20, 15 and 20 subjects, respectively).

Evaluation metrics and protocols
All four datasets are used in our cross-database experiments, while the OULU-NPU database is also utilized for intra-database experiments following its official cross-test protocols that assess generalization across different covariates.
For our cross-database experiments, we follow the widely used evaluation metrics and protocols introduced in [17]. The results are reported using the Half Total Error Rate (HTER), which denotes the mean of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). The HTER is computed on the test set of the target domain using the threshold τ corresponding to the equal error rate (EER) operating point on the development set of the source domain. In the case of the CASIA-FASD and MSU-MFSD datasets, the threshold τ is computed on the training set because they lack pre-defined validation sets (see Section 3.1).
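Written out explicitly (a standard formulation consistent with the protocol described above):

$$\mathrm{HTER}(\tau) = \frac{\mathrm{FAR}(\tau) + \mathrm{FRR}(\tau)}{2}, \qquad \tau = \underset{\tau'}{\arg\min}\;\bigl|\mathrm{FAR}_{\mathrm{dev}}(\tau') - \mathrm{FRR}_{\mathrm{dev}}(\tau')\bigr|,$$

where the FAR and FRR on the left are evaluated on the target test set and the threshold $\tau$ is fixed at the EER operating point of the source development (or training) set.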
The intra-database results on the official cross-test protocols of the OULU-NPU database are reported in terms of Average Classification Error Rate (ACER), which denotes the mean of the Attack Presentation Classification Error Rate (APCER) and the Bona Fide Presentation Classification Error Rate (BPCER). APCER and BPCER essentially correspond to FAR and FRR, respectively, but APCER is computed separately for each PAI, e.g., print and video-replay, and the final PAD performance corresponds to the attack with the highest APCER, i.e., the most successful PAI. Similarly to the HTER, the ACER is computed on the test set using the threshold τ corresponding to the EER operating point on the development set.
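Analogously, the intra-database metric can be summarized as:

$$\mathrm{ACER}(\tau) = \frac{\max_{\mathrm{PAI}} \mathrm{APCER}_{\mathrm{PAI}}(\tau) + \mathrm{BPCER}(\tau)}{2},$$

with $\tau$ again chosen at the EER operating point of the development set.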

Implementation details
We utilize both the face and background regions for PAD, thus no face cropping is applied. The TSS method processes the frames of the input video clips at their native resolution, and the accumulated output frames generated by TSS are resized to 224 × 224 according to the input image size of the pre-trained CNN (ResNet-101 [30]). The video segment length for the TSS method was set to 45 frames, and the number of TSS encoded segments depends on the total length of an input video sequence. For instance, a video of 270 frames results in six TSS encoded video clips. No data augmentation is applied during training.
To evaluate the generalization performance of the proposed TSS method, the pre-trained CNN is fine-tuned using Stochastic Gradient Descent (SGD) with a mini-batch size of 32 and a validation frequency of 30, shuffling the data every epoch. We do not use a fixed number of epochs; instead, an early stopping function is utilized to automatically stop the model training when over-fitting is observed [31]. The learning rate is fixed to 0.0001 in our cross-database experiments, while we adjust the learning rate to 0.001 on all four intra-database cross-test protocols of the OULU-NPU dataset.
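A rough sketch of such a fine-tuning loop is shown below, assuming PyTorch. The mini-batch size of 32 would be set in the data loaders, and the patience value and the loss-based stopping criterion are assumptions, since the paper only states that an early stopping function [31] halts training when over-fitting is observed.

```python
import torch
import torch.nn as nn

def finetune_with_early_stopping(model, train_loader, val_loader,
                                 lr=1e-4, val_every=30, patience=5):
    """SGD fine-tuning with a fixed learning rate, validation every
    `val_every` iterations and early stopping on the validation loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_val, bad_checks, step = float('inf'), 0, 0

    while bad_checks < patience:
        for images, labels in train_loader:          # loader shuffles every epoch
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            step += 1

            if step % val_every == 0:                # periodic validation check
                model.eval()
                with torch.no_grad():
                    val_loss = sum(criterion(model(x), y).item()
                                   for x, y in val_loader) / len(val_loader)
                bad_checks = 0 if val_loss < best_val else bad_checks + 1
                best_val = min(best_val, val_loss)
                if bad_checks >= patience:           # stop when over-fitting
                    break
    return model
```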
The fine-tuned feature vectors are extracted from the output of the last pooling layer and have a size of 2048. The BiLSTM subnetwork is trained using the cross-entropy loss and the Adam optimizer with a fixed learning rate of 0.0001. The number of hidden units is fixed to 100 in the cross-database experiments, while it is increased to 500 on all four intra-database cross-test protocols of the OULU-NPU dataset. We initialize the recurrent weights with the He initializer [32], which performs best in all scenarios of our experiments.
During the self-supervised training stage, we first fine-tune the pre-trained CNN with the aforementioned settings on the unlabelled data of the pretext task, i.e., sets of TSS outputs with different temporal length combinations. Then, the model is further fine-tuned using the binary labels (bona fide and attack) of the downstream task by replacing the fully connected layer with a new one with an output size of 2. Finally, the BiLSTM subnetwork takes the output of the last average pooling layer of the fine-tuned CNN and gives the final binary PAD decision.
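The stage-2 classifier could look roughly like the following PyTorch sketch, where each video is represented as a sequence of 2048-d features (one per TSS encoded clip); the way the BiLSTM output sequence is reduced to a single decision is an assumption.

```python
import torch
import torch.nn as nn

class PADBiLSTM(nn.Module):
    """Bidirectional LSTM over the sequence of 2048-d CNN features extracted
    from the TSS encoded clips of one video (stage-2 classifier sketch)."""

    def __init__(self, feat_dim=2048, hidden=100, num_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip_features):
        # clip_features: (batch, num_clips, 2048), one row per TSS encoded clip
        out, _ = self.bilstm(clip_features)
        return self.classifier(out[:, -1, :])   # decision from the last time step

model = PADBiLSTM(hidden=100)                   # 500 hidden units for OULU-NPU protocols
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```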

Experimental results
Our experimental analysis focuses on assessing the generalization performance of the proposed approach under two settings: 1) different cross-database configurations, and 2) the official intra-database cross-test protocols of the OULU-NPU dataset. In the following, we first investigate the effect of the input video sequence length on the TSS method and then study the effectiveness of the proposed self-supervised learning scheme in cross-database tests between the CASIA-FASD and Replay-Attack datasets. Finally, we compare the performance of the proposed approach against the state of the art in several widely used cross-database configurations and on the four official evaluation protocols of the OULU-NPU database.

The effect of video segment length
We begin our experiments by exploring how the performance of the proposed TSS method depends on the length of the video segment. We examine the generalization performance by varying the video segment length L from 5 to 60 frames. The cross-database results on the Replay-Attack and CASIA-FASD databases shown in Fig. 4 depict that the HTER decreases as the number of frames per video segment increases. However, when the segment length is increased beyond 45 frames, the face PAD performance on the CASIA-FASD and Replay-Attack datasets starts to degrade. Therefore, we set L = 45, where the best performance is achieved, i.e., an HTER of 9.3% on the Replay-Attack and 18.1% on the CASIA-FASD database.

Comparison against the state of the art
The results presented in Table 2 show that our TSS method achieves strong cross-database performance between the Replay-Attack and CASIA-FASD datasets, and that the generalization ability can be further improved when the proposed self-supervised learning stage is included in training the PAD model. The TSS method combined with self-supervised learning sets a new state of the art in this widely used cross-database configuration, improving the HTER from 9.3% to 5.9% on the Replay-Attack dataset and from 18.1% to 15.2% on the CASIA-FASD dataset, respectively. Thus, the proposed self-supervised learning scheme indeed helps in fine-tuning a 2D CNN to learn more meaningful representations from the TSS encoded video segments.
Table 3 presents the generalization performance of our TSS approach in another cross-database configuration, combining the MSU-MFSD and Replay-Attack databases for training and using the CASIA-FASD and OULU-NPU databases for testing. The proposed TSS method with the CNN-BiLSTM framework achieves the best results, improving the state-of-the-art HTER from 31.89% to 28.66% on the CASIA-FASD and from 36.01% to 30.12% on the OULU-NPU dataset.
The results of the intra-database experiments following the official evaluation protocols of the OULU-NPU database are presented in Table 4. From these results it can be seen that the proposed TSS method with the CNN-BiLSTM framework ranks first on protocols 1 and 2 of the OULU-NPU database, obtaining ACERs of 0.1% and 0.6%, respectively, and achieves very competitive ACERs of 1.5% and 7.1% on protocols 3 and 4, respectively. We have also included the performance of the proposed TSS method without the BiLSTM subnetwork in the cross-database experiments and the intra-database tests on the OULU-NPU in order to demonstrate the importance of the BiLSTM component for the final face PAD performance.

Network visualization and analysis
In this section, we use Gradient-weighted Class Activation Mapping (Grad-CAM) [33] to help explain why the proposed method makes a particular decision. Sample Grad-CAM visualizations of real faces, video-replay attacks and print attacks are presented for further analysis in Fig. 5. The first row represents samples of real faces, from which one can see that the network focuses clearly on the actual facial region due to, e.g., head motion, non-rigid facial movements, eye blinking and skin texture, while the background region does not provide liveness cues. In contrast, the samples of video-replay and print attacks in the second and third rows, respectively, show that the discriminative visual and motion cues are PAI related and the attention is more dispersed, also focusing on background regions, i.e., non-face related information.
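For reference, a minimal self-contained Grad-CAM sketch is shown below. It follows the standard formulation of [33] (gradient-weighted sums of the target layer's activations) rather than the exact visualization code used in this work, and the choice of target layer (e.g., the last convolutional block of the fine-tuned ResNet-101) is an assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Compute a coarse Grad-CAM heatmap for `class_idx` on a single image
    tensor of shape (3, H, W), using forward/backward hooks on target_layer."""
    activations, gradients = [], []

    def fwd_hook(_module, _inp, out):
        activations.append(out)

    def bwd_hook(_module, _grad_in, grad_out):
        gradients.append(grad_out[0])

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    score = model(image.unsqueeze(0))[0, class_idx]   # class score for one sample
    score.backward()
    h1.remove()
    h2.remove()

    acts, grads = activations[0], gradients[0]         # (1, C, H', W')
    weights = grads.mean(dim=(2, 3), keepdim=True)      # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode='bilinear',
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```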

Conclusions
In this paper, we addressed the generalization issues in 2D face presentation attack detection.We proposed a Temporal Sequence Sampling (TSS) method that removes the estimated inter-frame 2D affine motion within short video clips and encodes the appearance and dynamics of the resulting frames into a single colour image.We also introduced a self-supervised learning scheme where the stabilized video frames accumulated over sequences of different temporal lengths provide the supervision to train a 2D Convolutional Neural Network.We conducted extensive experimental analysis using the official intra-test protocols of the OULU-NPU database and several cross-database configurations on four public face PAD databases to demonstrate the robustness of the proposed framework.
The proposed approach requires capturing and processing relatively long input sequences, i.e., approximately two seconds of video, in order to achieve robust face PAD performance, thus it cannot be used in authentication applications requiring real-time response or in biometric systems operating on single facial images. A drawback of encoding stabilized video clips into a compact representation in the form of a single colour image is that the subtle inter-frame motion (direction) information is lost due to frame aggregation. Therefore, we plan to extend our work by developing methods that explicitly model the geometrical differences in the feature or facial landmark based trajectories between motion-compensated bona fide and attack videos. In this work, we focused only on detecting attacks launched with 2D PAIs, i.e., prints and displays, thus it is yet unknown whether the proposed approach generalizes well to unseen or other types of facial artefacts, including paper and 3D masks. In the future, we will explore the robustness of our method on new emerging face PAD datasets and evaluation protocols.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. In the proposed TSS method, the input video is first divided into N equal video clips and the inter-frame 2D affine motion is estimated within each video clip based on the trajectories of SURF keypoint matches. Then, the estimated inter-frame 2D affine motion is removed from the video frames and the resulting clips are accumulated into single RGB images.

Fig. 2 .
Fig. 2. First frame of a sample print attack video clip (left) and the mean of the corresponding stabilized video clip (middle). The result of simple frame averaging (right) is included for comparison to demonstrate the amount of inter-frame 2D affine motion in the original print attack video clip.

Fig. 3 .
Fig. 3. An illustration of the proposed self-supervised and TSS training tasks. During the self-supervised learning phase, the CNN receives unlabelled TSS sampling outputs accumulated over different temporal lengths and the pretext task is to predict the length of the original video clip. The learnt feature representations are then fine-tuned on labelled TSS encoded video clips by performing PAD as the downstream task (stage 1). Finally, the BiLSTM subnetwork is trained using the fine-tuned features (stage 2) to make the final PAD decision.

Fig. 5 .
Fig. 5. Grad-CAM visualizations for TSS encoded videos corresponding to real faces (first row), video-replay attacks (second row) and print attacks (third row).

Table 1
Cross-database performance of the proposed self-supervised learning scheme in terms of HTER (%).

Table 2
Cross-database performance in terms of HTER (%) on the Replay-Attack and CASIA-FASD databases. Comparative results are obtained from Yu et al. [34].

Table 4
Intra-database evaluation on the four official protocols of the OULU-NPU database. Comparative results are obtained from Yu et al. [34].