1 Introduction

Iris recognition systems are vulnerable to presentation attacks (PAs). An imposter can present a printed image or replay an iris video to impersonate an enrolled user, or wear textured contact lenses to evade recognition. Developing a reliable iris PAD algorithm therefore remains a challenging task. Since neural networks have successfully improved performance in many computer vision fields, deep learning-based algorithms have also been applied to iris PAD [6, 15, 17, 30, 39]. However, most neural networks suffer from overfitting, where the network does not generalize well on an unseen test set. Several strategies have been proposed to improve the generalizability of networks, e.g., Dropout [34] and batch normalization [25]. In contrast to such methods, data augmentation targets the root problem: insufficient training data variability. Owing to privacy concerns, most iris PAD datasets are small-scale compared to datasets collected for general purposes. Data augmentation can be categorized into data warping and oversampling. Data warping creates more images through affine transformations such as rotation or translation. Oversampling generates synthetic images, for example using Generative Adversarial Networks (GANs) [18]. Data augmentation techniques undoubtedly improve the performance of modern image classifiers [10, 23, 32]. In the iris PAD field, several studies have also reported performance improvements from augmentation. Gragnaniello et al. [19] used data augmentation to generate more training data by rotating the original images for the iris PAD task; their results improved slightly when applying it. Raghavendra et al. [30], Chen et al. [6], and Choudhary et al. [7] also used augmentation techniques to avoid overfitting during training (see Table 1). However, the contribution of these augmentation techniques is unclear because no analysis or experimental comparison is provided, as summarized in Table 1. It is worth noting that iris images generated by a GAN [18] cannot be used as augmented data to improve performance in our application, as is done for general computer vision tasks, because such generated iris images are considered another type of presentation attack for impersonation [37]. Given this restriction on augmentation techniques in the PAD field, we chose to explore the effect of data warping on iris PAD performance.

Furthermore, the detailed effect of data augmentation on iris PAD performance is relatively understudied. In this regard, this work provides answers to the following questions: (1) What is the relative effect of various data augmentation techniques on iris PAD performance? (2) Does combining all augmentation techniques at various design levels always lead to superior performance, or can there be a formal approach to augmentation method selection? (3) Do different augmentation strategies improve PAD performance by bringing the “same” misclassified samples to the correct classes, or do they have a less overlapping effect?

To answer these questions, we explore the impact of different augmentation techniques, specifically data warping techniques, on the generalization of deep learning-based iris PAD. The main contributions of this work are as follows: (1) we provide a first in-depth analysis of the role of data augmentation techniques in iris PAD performance and reliability, (2) we propose a classification error overlap-based augmentation selection protocol, (3) we evaluate fine-tuned and trained-from-scratch networks with various augmentations on multiple datasets and cross-validation scenarios, (4) we visualize and discuss the overlapping effect of different augmentation techniques to better explain the generalizability induced by augmentation.

Table 1 Algorithm properties, including the used augmentation techniques and whether their effect was studied

2 Related work

Iris recognition systems have been widely applied in different recognition scenarios due to the uniqueness and high accuracy of iris features [2,3,4,5]. However, the operational security of iris recognition has raised many concerns. This section provides a brief review of deep learning-based iris PAD algorithms and general data augmentation techniques. Recent iris PAD competitions, such as Iris-LivDet-2017 [42] and Iris-LivDet-2020 [11], were organized to evaluate the generalizability of iris PAD algorithms, and their datasets and protocols indicate that improving this generalizability remains a major challenge. Because the 2020 edition [11], in contrast to Iris-LivDet-2017 [42], did not offer any official training data and its test data are not yet publicly available, the experiments and analysis in this work are based on the protocols designed in the Iris-LivDet-2017 competition [42]. Hence, we focus here on the algorithms and results of Iris-LivDet-2017. The protocols in this competition are designed under cross-dataset and cross-PA scenarios to reflect real-world situations. In this competition [42], CASIA proposed training two SpoofNets to detect printouts and textured contact lenses separately, while UNINA relied on the Scale Invariant Descriptor (SID) and Bag of Words (BoW) to classify the attacks. Afterward, Kuehlkamp et al. [27] proposed combining 61 lightweight CNNs via meta-fusion to classify multiple Binarized Statistical Image Features (BSIF) views of the iris image to overcome such generalization problems; their results outperformed the winners of the competition. Furthermore, Sharma et al. [31] proposed a DenseNet-based iris PA detector, D-NetPAD, and evaluated it on a proprietary dataset and four public competition datasets. They trained a D-NetPAD model on their private dataset, comprising 12,772 training images. This pre-trained model was then used in three ways to examine generalizability on the competition datasets: 1) the pre-trained D-NetPAD is applied directly to the competition test sets, 2) a D-NetPAD model is trained from scratch on the competition training sets, 3) the pre-trained model is fine-tuned on the competition training sets. As expected, the fine-tuned model performed best, achieving the lowest error rate (0.30% ACER) on the Notre Dame dataset in the competition, whereas the second-lowest error, 3.28%, was obtained by the earlier Meta-Fusion method. However, their proprietary training data includes the Notre Dame test data. For a fair comparison using the same data, we only report the D-NetPAD trained from scratch, along with the Meta-Fusion results, later in Table 12. In addition, we compare our results with the multi-layer fusion (MLF) method, which achieves 2.31% ACER on Notre Dame, and the recently published micro-stripe analysis (MSA) method ([14, 17]), which obtains good performance (11.13% ACER) on the IIITD-WVU dataset.

Even though such neural network-based algorithms achieve good performance, they still suffer from overfitting. One reason is that the training data are insufficient in both quantity and variation. For example, there are only 1200 training iris images in the Notre Dame dataset of the competition [42], which is quite limited compared to datasets designed for generic computer vision tasks. Moreover, this problem is not unique to iris PAD algorithms; most networks suffer from overfitting, leading to low generalization. Under these conditions, data augmentation can help reduce overfitting and enhance the generalizability of networks by virtually generating more training images (more variation) from the original data. Data augmentation techniques can be categorized into data warping and synthetic oversampling [38]. The term data warping can be traced back to the distortion of handwriting in [1]. Warped data are created by applying geometric and color augmentations, such as rotation, shift, flipping, and contrast changes. In contrast to data warping, which operates in data-space, synthetic oversampling creates images in feature-space, for example by using GANs. Recent iris PAD studies and their augmentation techniques are presented in Table 1. Notably, many works did not mention applying data augmentation, and those that did did not study the effect of that augmentation in an ablation study. Only [19] measured this effect; however, like all the other works, it neither studied multiple augmentation methods nor provided a formal selection protocol. It should be noted that synthetically generated iris images [26, 41] are classified as a type of presentation attack in the PAD field, i.e., they would only increase the number of attack samples without adding bona fide samples. Such synthetic iris images can be exploited by an adversary to impersonate someone else’s identity; for example, Yadav et al. [41] studied the impact of synthetic data on PAD algorithms when used as a presentation attack. Hence, we explore the impact of augmentation techniques on the performance of iris PAD algorithms, but we only apply data warping methods, since synthetic oversampling would generate imbalanced (attack-only) data.

As summarized in Table 1, the augmentation techniques used in most iris PAD works are rotation, flip, and shear. However, the exact impact of these transformations on PAD performance is unspecified in these works. Moreover, our experimental results (in Sect. 5) show that not all single or combined augmentations increase iris PAD performance. Therefore, it is essential to find the augmentations that contribute most by considering the unique characteristics of iris data, e.g., NIR illumination, specific sensors, and a noise-free background. Furthermore, as shown later in Sect. 5, the individual data augmentations that improve the performance and generalizability of networks help us understand the nature of the variations in the attacks. Consequently, studying the specific role of augmentations inspired us to fuse them by sorting their misclassification overlap rates.

3 Methodology

In this section, we introduce the investigated data augmentation techniques, the augmentation selection and fusion protocols, and the three CNNs used in our iris PAD study.

3.1 Data augmentation techniques

Collecting large-scale iris datasets is challenging for iris research because of various factors, e.g., privacy concerns and demanding acquisition environment specifications. Deep learning-based iris PAD studies are thus limited by inadequate datasets. Compared to datasets designed for general purposes, such as the ImageNet dataset [12], most iris PAD datasets contain only dozens to hundreds of distinct irises (distinct subjects), as summarized in [8]. Training on such small-scale datasets leads to overfitting, the phenomenon where a trained network cannot generalize well on unseen data. Moreover, the Iris-LivDet-2017 competition results suggest that cross-PA and cross-dataset scenarios are the major challenges in current iris PAD. To address both insufficient data resources and cross scenarios, we explore the impact of data augmentation methods on iris PAD generalization ability.

To observe the respective impact of data augmentation strategies, we apply six basic augmentation techniques. Notably, oversampling augmentation is neglected in this work because iris data generated by a GAN [18] are considered fake irises [26, 41], i.e., an attack. The six explored augmentations are: horizontal shift, vertical shift, rotation, brightness adjustment, zoom in/out, and horizontal flipping. Such augmentation techniques are widely used in the computer vision field with proven positive effects [10, 23, 32], and also in the iris PAD field [6, 7, 19, 30]. Further reasons that led us to choose these augmentations are: (1) even in a controlled environment, irises are not captured in exactly the same position and viewpoint, so a small geometric variation remains between iris images; (2) the capture lighting conditions vary between datasets when performing cross-dataset evaluation; (3) the size of the captured irises varies slightly depending on the collectors; (4) iris textures are distinct between the left and right eyes of the same person [8], yet in some cases only a single eye per person is contained in PAD datasets [12], so it is interesting to explore whether horizontally flipping iris images can improve PAD performance. Considering that the position, direction, size, and illumination differences between iris images are small, we augment the images to a relatively small degree to avoid inducing unwanted noise. The detailed augmentation parameters are listed in Table 4, and the corresponding explanation is in Sect. 4.2. Most interestingly, we examine the effect of each of these augmentations with respect to the others.

3.2 Fusion and augmentation protocol

Furthermore, we investigate two methods to fuse the individual augmentation strategies above: strategy-level and score-level fusion. For the former, the training data are generated by using a combination of several augmentation strategies; for example, an iris image can be rotated, shifted, and zoomed simultaneously. For the latter, the prediction scores of each network (trained with one of the single augmentation methods) are fused to compute a final prediction.
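To make the distinction concrete, the following minimal sketch illustrates the two fusion levels in Python; the model objects and function names are illustrative assumptions, not the exact implementation used in this work.

```python
import numpy as np

# Score-level fusion: average the bona fide probabilities predicted by
# networks trained with individual augmentation strategies (sketch; the
# model list and function name are assumptions, not the paper's code).
def score_level_fusion(models, images):
    """Fuse per-model prediction scores by simple averaging."""
    # Each model outputs one bona fide probability per image.
    scores = np.stack([m.predict(images).ravel() for m in models], axis=0)
    return scores.mean(axis=0)  # final fused score per image

# Strategy-level fusion instead happens at training time: a single
# network is trained on data transformed by several augmentations
# simultaneously (e.g., a generator configured with shift, zoom, and
# flip together), so no score combination is needed at test time.
```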

In addition, we investigate an augmentation selection protocol. This protocol is based on the overlap ratio of misclassified samples caused by the different augmentations (as explained below) and thus on their relative effect on the performance. The selection step rests on two assumptions: (1) different augmentation techniques contribute to different aspects of the PAD performance, (2) selecting augmentations with a lower overlap of misclassified samples for fusion may improve the results, as they address different types of variability in the images.

Let \(A = \{A_1, ..., A_n\}\) define a set of augmentation techniques. \(I_{A_n}^{a} = \{ I_{A_n}^{a_1}, ..., I_{A_n}^{a_m} \}\) denotes the set of misclassified attack images under augmentation \(A_n\), and \(I_{A_n}^{bf} = \{I_{A_n}^{bf_1}, ..., I_{A_n}^{bf_k}\}\) the set of misclassified bona fide images under augmentation \(A_n\). The misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) denotes the ratio of attack samples classified incorrectly with augmentation technique \(A_p\) that are also classified incorrectly with augmentation \(A_q\). Similarly, the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) denotes the ratio of bona fide samples misclassified with augmentation technique \(A_p\) that are also misclassified with augmentation \(A_q\). The ratios are computed as follows:

$$\begin{aligned} O_{A_{pq}}^{a} = \dfrac{\#(I^{a}_{A_p} \bigcap I^{a}_{A_q})}{\#I^{a}_{A_p}} \end{aligned}$$
(1a)
$$\begin{aligned} O_{A_{pq}}^{bf} = \dfrac{\#(I^{bf}_{A_p} \bigcap I^{bf}_{A_q})}{\#I^{bf}_{A_p}} \end{aligned}$$
(1b)

where \(p, q \in \{1, ..., n\}\). Then, the overall overlap ratio \(O_{A_{pq}}\) between augmentation techniques \(A_p\) and \(A_q\) is:

$$\begin{aligned} O_{A_{pq}} = (O_{A_{pq}}^{a} + O_{A_{pq}}^{bf}) / 2 \end{aligned}$$
(2)
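For illustration, the two overlap ratios and the overall overlap can be computed directly from sets of misclassified sample identifiers, as in the following sketch (variable names are illustrative):

```python
# Sketch of Eqs. (1a), (1b) and (2): misclassified samples are
# represented as Python sets of image identifiers per augmentation.

def overlap_ratio(misclassified_p, misclassified_q):
    """Ratio of samples misclassified under A_p that are also
    misclassified under A_q (Eq. 1a / 1b)."""
    if not misclassified_p:
        return 0.0  # assumption: define the ratio as 0 for an empty set
    return len(misclassified_p & misclassified_q) / len(misclassified_p)

def overall_overlap(attacks_p, attacks_q, bonafides_p, bonafides_q):
    """Overall overlap O_{A_pq}: mean of the two ratios (Eq. 2)."""
    return 0.5 * (overlap_ratio(attacks_p, attacks_q)
                  + overlap_ratio(bonafides_p, bonafides_q))

# Example: two augmentations sharing one of two misclassified attacks
# and no misclassified bona fides give an overall overlap of 0.25.
o = overall_overlap({"img1", "img2"}, {"img2"}, {"img3"}, {"img4"})
```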

The detailed pseudo-code of the selection protocol can be found in Algorithm 1. We set \(k = 3\) in our experiments and select the \(A_b\) with the minimum Equal Error Rate (EER) value.

Algorithm 1: Least overlap-based augmentation selection (pseudo-code figure)
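Since the pseudo-code is given as a figure, the following Python sketch shows one possible reading of the protocol under stated assumptions: the best single augmentation \(A_b\) is chosen by minimum EER, and the remaining augmentations are ranked by their overall overlap with \(A_b\), keeping those with the least overlap (whether \(k\) counts \(A_b\) itself is an assumption here).

```python
# A possible reading of the selection protocol (Algorithm 1 is not
# reproduced in the text, so this is only a sketch).

def select_augmentations(eer, overlap, k=3):
    """eer: dict mapping augmentation name -> EER value.
    overlap: dict mapping (name_p, name_q) -> overall overlap O_{A_pq}.
    Returns k augmentations to fuse (A_b plus the k-1 least-overlapping)."""
    a_b = min(eer, key=eer.get)             # best single augmentation A_b
    others = [a for a in eer if a != a_b]
    # Rank remaining augmentations by ascending overlap with A_b.
    others.sort(key=lambda a: overlap[(a_b, a)])
    return [a_b] + others[: k - 1]
```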

3.3 Neural networks

To evaluate the effect of data augmentation on iris PAD more generally, we train three neural networks: (1) a fine-tuned ResNet50, (2) a fine-tuned VGG16, and (3) a MobileNetV3-small trained from scratch. On the one hand, ResNet and VGG networks are widely used, either as feature extractors or as end-to-end architectures, in biometric research [29, 36, 39]. For example, Nguyen et al. [35] used ResNet [21], VGGNet [33], and other networks to extract image features for iris recognition, and Yadav et al. [39] fused features extracted from an off-the-shelf VGG16 model with handcrafted Haralick features to detect iris presentation attacks. Therefore, we fine-tune the pre-trained ResNet50 [21] and VGG16 [33] to perform iris PAD. On the other hand, generic models trained on the ImageNet dataset [12] have seen patterns quite different from iris images. Therefore, we additionally train a lightweight architecture, MobileNetV3-small [22], from scratch to target iris PAD. MobileNetV3-small has only 2.25M parameters, which makes it suitable for deployment on mobile devices and for training on limited iris data, while ResNet50 has 25.64M parameters and VGG16 has 138M. MobileNetV3 [22] uses depth-wise convolutions and squeeze-and-excitation blocks to reduce the parameter count while preserving accuracy. The training hyperparameters are listed in Table 3. In this work, we focus on the impact of various augmentation techniques and aim to assess the consistency of augmentation effects, the augmentation selection protocol, and the fusion protocols under diverse network architectures and training strategies. We therefore intentionally selected a diverse set of networks and training protocols that have shown good performance on iris PAD in previous works, fine-tuning ResNet50 and VGG16 and training MobileNetV3 from scratch following the experimental settings adopted in [14, 16, 17, 39].
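As an illustration of the three setups, the sketch below builds each backbone with a binary PAD head in TensorFlow/Keras; the head design and weight choices are assumptions, not the paper's exact configuration (which follows Table 3 and [14, 16, 17, 39]).

```python
from tensorflow.keras import layers, models, applications

def build_pad_model(backbone_name, input_shape=(480, 640, 3)):
    """Attach a binary PAD head to one of the three backbones.
    Head design and weight choices are illustrative assumptions."""
    if backbone_name == "resnet50":        # fine-tuned from ImageNet
        base = applications.ResNet50(weights="imagenet",
                                     include_top=False,
                                     input_shape=input_shape)
    elif backbone_name == "vgg16":         # fine-tuned from ImageNet
        base = applications.VGG16(weights="imagenet",
                                  include_top=False,
                                  input_shape=input_shape)
    else:                                  # MobileNetV3-small from scratch
        base = applications.MobileNetV3Small(weights=None,
                                             include_top=False,
                                             input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(1, activation="sigmoid")(x)  # bona fide probability
    return models.Model(base.input, out)
```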

4 Experimental setup

This section describes the datasets, the parameters used for the neural networks and data augmentation techniques, and the evaluation metrics.

Fig. 1

Samples of iris images from the used datasets. The bona fide and attack samples of the different datasets have distinctive appearances, affected by capture sensors, lighting conditions, and printer types, among other factors. This variation indicates the challenge of cross-dataset PAD

4.1 Datasets

The experiments are conducted on the publicly available benchmark datasets used in the Iris-LivDet-2017 competition [42] to explore the impact of different data augmentation techniques on PAD performance. The Iris-LivDet-2017 competition [42] contains four datasets: Clarkson, Warsaw, Notre Dame, and IIITD-WVU. Because the Warsaw dataset is no longer publicly available, we use the remaining three datasets in our experiments. The Iris-LivDet-2017 datasets are designed for cross-PA, cross-sensor, and cross-dataset evaluation. Figure 1 presents iris samples from the training and test sets of each of the used datasets. The varying appearance between datasets indicates the challenge of cross-dataset PAD. Table 2 summarizes the used datasets, including the number of images in the training and test sets and the sensors.

Table 2 Summarized information of the used datasets

Clarkson dataset The Clarkson dataset is designed for cross-PA evaluation. The test set contains additional unknown attack types that are not present in the training set; these unknown data include visible-light image printout attacks and extra-pattern contact lenses produced by different manufacturers. Bona fide visible-light images are present neither in the training set nor in the test set.

Notre Dame dataset The Notre Dame dataset contains bona fide iris images (without lenses) and textured contact lens attacks. The test set is a combination of a known subset and an unknown subset, corresponding to the cross-PA scenario. The unknown subset includes iris images with textured lenses produced by different manufacturers (different patterns) that are not represented in the training data. Another difficulty of this dataset is its limited training data.

IIITD-WVU dataset The IIITD-WVU dataset is an amalgamation of two datasets: the IIITD dataset used for training and the WVU dataset for testing. The experiments performed on the IIITD-WVU dataset correspond to a cross-dataset evaluation because the sensors, data acquisition environments, subject populations, and PA generation procedures differ between training and testing. The training set (IIITD set) was selected from the IIIT-Delhi Contact Lens Iris (CLI) dataset [40] and the IIITD Iris Spoofing (IIS) dataset [20], where the images were captured by multiple sensors under a controlled environment. The test set (WVU set) was captured using a mobile iris sensor under both controlled (indoor) and uncontrolled (outdoor) environments. The Iris-LivDet-2017 competition results [42] indicated that cross-dataset evaluation is the most challenging task on account of these significant variations.

4.2 Parameters setting

To make our experimental setting compliant with the Iris-LivDet-2017 competition [42], we use the pre-defined training and test sets as described in Sect. 4.1. Additionally, 20% of the images are selected randomly from each training set to serve as a validation set during training. The hyperparameters listed in Table 3 are used to fine-tune the ResNet50 [21] and VGG16 [33] networks and to train the MobileNetV3-small [22] from scratch. The input size of the three networks is \(480 \times 640 \times 3\), where the grey-scale iris images are converted to three-channel images by replicating the same pixel values. The number of actual training epochs is controlled by early stopping: training stops if the validation loss does not decrease for ten epochs, or when the maximum number of training epochs is reached.
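The following sketch shows how these two settings could look in Keras; the channel-replication helper and the restore_best_weights flag are illustrative assumptions rather than the paper's stated configuration.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Grey-scale iris images are replicated into three identical channels so
# they match the networks' (480, 640, 3) input (sketch; names assumed).
def to_three_channels(gray_batch):           # shape (N, 480, 640)
    return np.repeat(gray_batch[..., np.newaxis], 3, axis=-1)

# Early stopping as described: halt when the validation loss has not
# decreased for ten epochs (the maximum epoch count comes from Table 3;
# restore_best_weights is an added convenience, not stated in the paper).
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
```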

Table 3 The training hyperparameters

The parameters of the augmentation techniques are listed in Table 4. An image can be shifted horizontally or vertically by a specific ratio of the image width or height; the possible range of the shift is 0 to 100%, and we set this ratio to 10%. Rotation augmentation randomly rotates the image clockwise between 0 and 360 degrees; we limit the maximum rotation to 15 degrees. The brightness of the image can be augmented by randomly darkening or brightening it. The brightness argument ranges from 0 to 200%: the brightness is unchanged at 100%, values below 100% darken the image, and values above 100% brighten it. Furthermore, the iris image can be zoomed in or out by a specific ratio; the zoom argument ranges from 0 to 200%, with 100% leaving the image unchanged. In our experiments, we zoom the images randomly between 85% and 115%. Finally, more iris images can be produced by horizontal flipping. The code is implemented based on the Keras library.Footnote 1
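A possible Keras configuration of the six strategies, matching the ranges described above, is sketched below. Keras expresses brightness and zoom as multipliers, so 0.85-1.15 corresponds to 85%-115%; the exact brightness bounds are listed in Table 4 and are assumed here to mirror the zoom range.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One generator per augmentation strategy (sketch; the brightness bounds
# below are an assumption, the other values follow Sect. 4.2).
augmenters = {
    "shift_h":    ImageDataGenerator(width_shift_range=0.1),   # 10% width
    "shift_v":    ImageDataGenerator(height_shift_range=0.1),  # 10% height
    "rotation":   ImageDataGenerator(rotation_range=15),       # max 15 deg
    "brightness": ImageDataGenerator(brightness_range=[0.85, 1.15]),
    "zoom":       ImageDataGenerator(zoom_range=[0.85, 1.15]),
    "flip_h":     ImageDataGenerator(horizontal_flip=True),
}
```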

Table 4 The parameters of different data augmentations

4.3 Evaluation metrics

The following metrics are used to measure the PAD algorithm performance:

  • Attack Presentation Classification Error Rate (APCER): The proportion of attack images incorrectly classified as bona fide samples.

  • Bona Fide Presentation Classification Error Rate (BPCER): The proportion of bona fide images incorrectly classified as attack samples.

  • Average Classification Error Rate (ACER): The average of APCER and BPCER.

The APCER, BPCER, and ACER follow the standard definitions presented in ISO/IEC 30107-3 [24]. The threshold used to decide whether an iris image is bona fide is 0.5, as defined in the Iris-LivDet-2017 protocol [42]. Moreover, the Detection Equal Error Rate (D-EER) and the BPCER value at a fixed APCER of \(1\%\) are reported for further analysis.
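As a sketch, the three error rates at the fixed 0.5 threshold can be computed as follows (assuming scores are bona fide probabilities and labels use 1 for bona fide and 0 for attack; the label convention is an assumption):

```python
import numpy as np

# Sketch of the ISO/IEC 30107-3 error rates at the fixed 0.5 threshold.
def pad_metrics(scores, labels, threshold=0.5):
    scores, labels = np.asarray(scores), np.asarray(labels)
    attacks = labels == 0
    bona_fides = labels == 1
    # APCER: attacks accepted as bona fide; BPCER: bona fides rejected.
    apcer = np.mean(scores[attacks] >= threshold)
    bpcer = np.mean(scores[bona_fides] < threshold)
    return apcer, bpcer, (apcer + bpcer) / 2.0   # ACER
```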

Furthermore, we use the Fisher Discriminant Ratio (FDR) to examine the class separability (attack versus bona fide) achieved under different augmentation settings, as an indicator of classification generalizability. The FDR is described in [28] and [9] as a measure of the separability between genuine and imposter scores. In our work, a high separation between bona fide and attack scores indicates a higher reliability of the applied augmentation technique in the iris PAD system. The FDR is defined in Eq. (3):

$$\begin{aligned} FDR = \dfrac{(\mu ^{bf} - \mu ^{a})^2}{(\sigma ^{bf})^2 + (\sigma ^{a})^2} \end{aligned}$$
(3)

where \(\mu ^{bf}\) and \(\mu ^{a}\) are the respective mean values of the bona fide and attack scores, and \(\sigma ^{bf}\) and \(\sigma ^{a}\) are their standard deviations. We also analyze the differences in the augmentation-induced enhancement of the different augmentation strategies with the help of a confusion matrix plotted from the overlap of misclassified samples, as introduced in Sect. 3.2. The details of this confusion matrix are described in Sect. 5.2.
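A minimal sketch of Eq. (3), assuming the bona fide and attack score arrays are available:

```python
import numpy as np

# Fisher Discriminant Ratio (Eq. 3): separation between the bona fide
# and attack score distributions; higher FDR means better separability.
def fdr(bona_fide_scores, attack_scores):
    mu_bf, mu_a = np.mean(bona_fide_scores), np.mean(attack_scores)
    var_bf, var_a = np.var(bona_fide_scores), np.var(attack_scores)
    return (mu_bf - mu_a) ** 2 / (var_bf + var_a)
```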

5 Experimental evaluation

This section evaluates the augmentation techniques using the three models on three datasets in terms of the different metrics. In addition to the individual augmentation methods (see Tables 5, 6, 7, 8, 9, 10), we report the results of the strategy-level and score-level combinations in Table 11. We also draw the ROC curves of the single augmentation techniques (Fig. 2) and the multiple fusion methods (Fig. 3). Furthermore, we analyze the overlapping misclassified images using confusion matrices (as shown in Figs. 6, 7 and 8).

5.1 Results

In this subsection, we first analyze the results on each individual dataset per augmentation technique. Then, the fusion-based results are discussed. Finally, we compare our results with the SoTA algorithms for an overall analysis.

Clarkson Results Table 5 reports the iris PAD performance in terms of D-EER, the BPCER value at 1% APCER, and FDR. It can be observed that (1) translation, brightness, and horizontal flip augmentations produce better results in some cases, e.g., with the MobileNetV3 model, (2) not all augmentations improve the PAD performance, and (3) higher FDR values mostly coincide with lower D-EER values. Looking at Table 5 and Table 6 together, the FDR value appears to be a better indicator of a low ACER value than the D-EER metric.

Table 5 Iris PAD performance (%) reported in terms of D-EER and the BPCER value at 1% APCER on the Clarkson dataset
Table 6 Iris PAD performance (%) reported in terms of APCER, BPCER and ACER on the Clarkson dataset
Table 7 Iris PAD performance (%) reported in terms of D-EER and the BPCER value at 1% APCER on the Notre Dame dataset
Table 8 Iris PAD performance (%) reported in terms of APCER, BPCER and ACER on the Notre Dame dataset
Table 9 Iris PAD performance (%) reported in terms of D-EER and the BPCER value at 1% APCER on the IIITD-WVU dataset
Table 10 Iris PAD performance (%) reported in terms of APCER, BPCER and ACER on the IIITD-WVU dataset
Table 11 Fusion-based PAD performance (%) reported in terms of D-EER, ACER and FDR

Notre Dame Results Table 7 and Table 8 describe the iris PAD performance on the Notre Dame dataset. As shown in Table 7, the models fine-tuned without augmentation (ResNet50 and VGG16) outperform most augmented variants. In contrast, the performance of the MobileNetV3 (trained from scratch) mostly improves compared to training without any augmentation. Moreover, unlike on the Clarkson dataset, where MobileNetV3 acquired the lowest ACER, ResNet50 achieved the best result (9.56% ACER) in Table 8 by using brightness augmentation on the Notre Dame dataset. The Clarkson and Notre Dame datasets both correspond to cross-PA scenarios that include unseen cosmetic lens patterns. However, the same network architectures show significant differences: as shown in Table 6 and Table 8, ResNet50 performed worst on Clarkson and best on Notre Dame, whereas MobileNetV3 performed best on Clarkson and worst on Notre Dame. One possible reason for this opposite behavior is the insufficient training data of the Notre Dame dataset (4937 training images in Clarkson versus 1200 in Notre Dame). Another is the difference in the ratio of unknown PAs in the test sets (21.03% unknown attacks in the Clarkson attack test subset versus 50% in Notre Dame). Considering these two reasons, we argue that models pre-trained on large-scale datasets may perform better on unseen-pattern data when the training data for fine-tuning are insufficient. Moreover, similar to the third finding on the Clarkson dataset, the augmentation techniques with higher FDR values also achieve lower ACER values under a pre-defined threshold.

IIITD-WVU Results Table 9 and Table 10 present the results on the IIITD-WVU dataset, which corresponds to a challenging cross-dataset scenario. The D-EER and ACER values of IIITD-WVU are higher than those of the Clarkson and Notre Dame datasets. As shown in Fig. 2, for fixed APCER values (x-axis), the ROC curves indicate that the IIITD-WVU dataset has higher BPCER values (the y-axis is \(1-BPCER\)) than the Clarkson and Notre Dame datasets. Moreover, the variation between individual augmentation techniques is more pronounced on the IIITD-WVU dataset. In addition to these variations across datasets, the behavior of the augmentation techniques differs slightly across networks. For example, ResNet50 and VGG16 achieve better results with vertical shift on all datasets, whereas the MobileNetV3 model performs worse with vertical shift (see the corresponding AUC values). Looking at Table 9, horizontal shift and zoom yield better results with the VGG16 and MobileNetV3 networks. The lowest D-EER (9.26%) and the lowest ACER (10.05%) are achieved by vertical shift when fine-tuning the ResNet50 model. Consistent with the observations on the Clarkson and Notre Dame datasets, a higher FDR value points to a lower ACER value in most cases. Therefore, we conclude that the FDR metric is more suitable than the D-EER metric for measuring the reliability and generalizability of PAD algorithms.

Fusion-based Results Table 11 presents the performance of the Best Single augmentation (BS) for each dataset and network, and of four fusion-based methods: (1) STrategy-level fusion (ST) with all augmentations, (2) SCore-level fusion (SC) with all augmentations, (3) Least Overlap-based strategy-level fusion (\(LO_{ST}\)), and (4) Least Overlap-based score-level fusion (\(LO_{SC}\)). The augmentations used for \(LO_{ST}\) and \(LO_{SC}\) are selected by Algorithm 1 described in Sect. 3.2. It can be observed in Table 11 that strategy-level fusion is more likely to produce the best results than score-level fusion. For instance, the ST method obtains the lowest D-EER values on the Clarkson and Notre Dame datasets with the ResNet50 model, and \(LO_{ST}\) fusion achieves the best performance on Clarkson with VGG16 and on IIITD-WVU with the MobileNetV3 network. Moreover, for the VGG16 and MobileNetV3 networks, our augmentation selection protocol achieves one of the two lowest ACERs in five of the six experimental setups. Although a pre-defined threshold can influence the ACER value, a higher FDR value consistently suggests a lower ACER value; hence, the higher the FDR value, the higher the reliability of the PAD algorithm.

Fig. 2

ROC curves of the single augmentation techniques. The columns from left to right are ResNet50, VGG16, and MobileNetV3. The rows from top to bottom are for the Clarkson, Notre Dame, and IIITD-WVU datasets. The x-axis is the APCER values, and the y-axis is the \(1-BPCER\) values

Fig. 3

ROC curves of the multiple fusion methods (corresponding to Table 11). The rows from top to bottom are for the Clarkson, Notre Dame, and IIITD-WVU datasets. The x-axis is the APCER values, and the y-axis is the \(1-BPCER\) values

Comparison with SoTAs We also compare our results with several SoTA algorithms in Table 12. The first three rows are the winners of the Iris-LivDet-2017 competition [42], followed by four of the latest SoTAs, and then the best results of our three networks. A detailed description of the competition and SoTA algorithms is given in Sect. 2. The Meta-Fusion approach [27] combined 61 CNNs to classify multiple BSIF views of the iris images via SVM meta-fusion. The D-NetPAD method [31] adopted a DenseNet model pre-trained on a private combined iris dataset; the authors also trained a DenseNet model on the competition datasets from scratch, and we report these scratch D-NetPAD results for a fair comparison on the same data. The MLF method [13] fused information from multiple network layers to make a PAD decision. The MSA approach [14, 17] focuses on artifact differences in the image dynamics around the iris/sclera border by extracting information from micro-stripes. Because MLF and MSA do not report results on the Clarkson dataset, we mark these entries with ’-’ in Table 12.

For the Clarkson dataset, the lowest ACER value (0.84%) is produced by the MobileNetV3 trained with horizontal shift augmentation. For the IIITD-WVU dataset, our ResNet50 model trained with vertical-shift-augmented data achieves the best result, with an ACER of 10.05%. However, the MLF method [13] achieves the best results on the Notre Dame dataset, where our solutions perform worse than the Anon1, D-NetPAD, Meta-Fusion, and MSA methods. Due to the lack of training data in the Notre Dame dataset (1200 training images versus 3600 test images), the model still overfits even though data augmentation improves the results. Considering all the previous results, we conclude that shift augmentation is worth attempting to improve PAD performance, and that fusing various augmentations at the strategy level is a good starting point for iris PAD.

Cross-dataset evaluation In addition to the intra-dataset evaluation, we also report cross-dataset results in terms of D-EER, ACER, and FDR values in Table 13. In the cross-dataset scenario, the training data are the training subset of one dataset, while the test data are the test subsets of the other two datasets. For instance, the model trained on the Clarkson dataset is used to produce prediction scores on the test subsets of the Notre Dame and IIITD-WVU datasets. The threshold is set to 0.5, as defined in the Iris-LivDet-2017 competition protocol. To demonstrate the generalizability of the different fusion strategies, we provide the results of the BS, ST, SC, \(LO_{ST}\) and \(LO_{SC}\) settings, analogous to the intra-dataset results in Table 11. In addition to the fusion methods, the results of training without any augmentation technique (denoted as No) are reported for comparison. The bold values in the D-EER and ACER columns are the two lowest error rates, and the bold values in the FDR column indicate the Top-2 separability measured by FDR. For further comparison, we also provide a visual representation of the D-EER values achieved by the different experimental settings in Fig. 4 and the ROC curves in Fig. 5. From Table 13 it can be concluded that (1) training without augmentation performs worse than using augmentations in most cases; (2) the BS and ST methods achieve one of the two lowest ACER values in half of the experimental setups; (3) the SC method obtains one of the lowest D-EER values in nine of the eighteen experimental setups, and notably eight of these nine lowest D-EER values are produced by the fine-tuned ResNet50 and VGG16. Furthermore, the reliability of the FDR value is consistent with the previous intra-dataset observation that a higher FDR value hints at a lower ACER value, even though the ACER value can be affected by the pre-defined threshold. It can also be noticed in Table 13 that training MobileNetV3 from scratch with \(LO_{ST}\) performs better than with the other augmentation strategies in most cases. A similar observation can be made from Fig. 4: the SC (yellow) and \(LO_{SC}\) (green) methods achieve lower D-EER values than the ST (grey) and \(LO_{ST}\) (navy blue) methods for the ResNet50 and VGG16 networks, whereas SC and \(LO_{SC}\) produce higher D-EER values than ST and \(LO_{ST}\) for the MobileNetV3 network. One possible reason is the different training strategies of the networks.

Table 12 Iris PAD performance (%) reported in terms of APCER, BPCER and ACER in comparison with the SoTAs
Table 13 Iris PAD performance reported in terms of D-EER (%), ACER (%) and FDR on cross-dataset scenarios

5.2 Analysis and discussion

This section explores whether different augmentations lead to the same or different kinds of performance improvement. To do so, we analyze the overlap of misclassified samples between the different augmentation protocols, including the four fusion methods, with the help of confusion matrices, and then discuss the limitations and potential of our analyses. The confusion matrices for each dataset are shown in Figs. 6, 7 and 8. The horizontal axis (X axis) from left to right and the vertical axis (Y axis) from top to bottom correspond to the augmentation strategies: No, Shift\(_{h}\), Shift\(_{v}\), Rotation, Brightness, Zoom, Flip\(_{h}\), ST, SC, \(LO_{ST}\) and \(LO_{SC}\), respectively.

The matrices from left to right are generated by ResNet50, VGG16, and MobileNetV3, respectively. The values in the top matrices are the misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) computed as in Eq. (1a), and the bottom matrices present the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) computed as in Eq. (1b) in Sect. 3.2.

As seen in the previous results, different augmentation strategies improve the performance on different datasets; shift, rotation, and horizontal flip play a relatively prominent role. Most overlap values lie between 0.2 and 0.7 in the confusion matrix plots. In general, the lower overlap rates indicate that different augmentation techniques help the model adapt to different variations in the iris samples. As shown in Figs. 6, 7 and 8, the misclassification overlap rates of the MobileNetV3 network are lower (lighter blue) than those of ResNet50 and VGG16 for each dataset. A further general observation from Figs. 6, 7 and 8 is that the fusion of multiple augmentation techniques (all of them, or those chosen by our proposed selection protocol), especially at the score level (SC and \(LO_{SC}\)), leads to a higher overlap with the basic augmentation methods. This indicates success in addressing a larger number of variations in the data simultaneously. This is not the case for the strategy-level fusion method, as the multiple augmentation methods used in the training phase might confuse the model.

Fig. 4

The histogram of performance on the cross-dataset evaluations. The x-axis shows the different experimental settings, and the y-axis the D-EER (%) value

Fig. 5

ROC curves of the cross-dataset evaluations (corresponding to Table 13). The rows from top to bottom are: training on Clarkson and testing on Notre Dame and IIITD-WVU; training on Notre Dame and testing on Clarkson and IIITD-WVU; and training on IIITD-WVU and testing on Clarkson and Notre Dame. The columns from left to right are the ResNet50, VGG16, and MobileNetV3 networks. The row and column order are the same as in Table 13. The x-axis in each ROC plot is the APCER values, and the y-axis is the \(1-BPCER\) values

Fig. 6

Overlap confusion matrix for the Clarkson dataset. The horizontal axis (X axis) from left to right and the vertical axis (Y axis) from top to bottom correspond to the augmentation strategies: No, Shift\(_{h}\), Shift\(_{v}\), Rotation, Brightness, Zoom, Flip\(_{h}\), ST, SC, \(LO_{ST}\) and \(LO_{SC}\), respectively. The values in the top matrices are the misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) computed as in Eq. (1a), and the values in the bottom matrices are the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) computed from Eq. (1b)

Fig. 7

Overlap confusion matrix for the Notre Dame dataset. The horizontal axis (X axis) from left to right and the vertical axis (Y axis) from top to bottom correspond to the augmentation strategies: No, Shift\(_{h}\), Shift\(_{v}\), Rotation, Brightness, Zoom, Flip\(_{h}\), ST, SC, \(LO_{ST}\) and \(LO_{SC}\), respectively. The values in the top matrices are the misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) computed as in Eq. (1a), and the values in the bottom matrices are the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) computed from Eq. (1b)

Fig. 8

Overlap confusion matrix for the IIITD-WVU dataset. The horizontal axis (X axis) from left to right and the vertical axis (Y axis) from top to bottom correspond to the augmentation strategies: No, Shift\(_{h}\), Shift\(_{v}\), Rotation, Brightness, Zoom, Flip\(_{h}\), ST, SC, \(LO_{ST}\) and \(LO_{SC}\), respectively. The values in the top matrices are the misclassified attacks overlap ratio \(O_{A_{pq}}^{a}\) computed as in Eq. (1a), and the values in the bottom matrices are the misclassified bona fides overlap ratio \(O_{A_{pq}}^{bf}\) computed from Eq. (1b)

Summing up all the results, training with augmentation techniques significantly improves PAD performance compared to training only with the original data. Each augmentation method plays a positive role on a particular dataset or network, and shift augmentation performs better than the other methods in most cases. However, the results are not exactly consistent across all networks, augmentation techniques, and datasets. One possible improvement is to store the generated images rather than augmenting them randomly on the fly during training; the advantage is exact knowledge of the numbers of original and augmented images, whereas the drawback is higher hardware requirements. Data augmentation techniques fall into two general categories, data warping and oversampling. Because images generated by oversampling methods, such as GANs, should themselves be detected as attack images, oversampling could easily exacerbate the imbalance in the data; for iris PAD, only data warping can be applied to augment the training data. However, there is no consensus on the best augmentation strategy, and especially not on the best way to combine strategies, because the intrinsic biases in the capture environment, subject population, and dataset scale differ. Consequently, a first avenue for future work is to learn an optimal augmentation strategy automatically. We also need to find an optimal dataset size after augmentation by balancing the used strategy against the memory available for storing augmented images. Moreover, the imbalance between bona fide and attack samples could be addressed.

6 Conclusion

This paper addresses a clear research gap by providing an in-depth analysis of the role of data augmentation in iris PAD. Data augmentation is one of the crucial steps in addressing the limited iris attack data. We explore the impact of widely used data augmentation strategies, and of two combination methods, strategy-level and score-level fusion, on the generalization of iris PAD. We also propose a least overlap-based augmentation selection protocol that aims to bring different types of wrongly classified samples into the correct class, based on a detailed analysis of the overlap between the effects of different augmentation techniques. The experiments are performed on three datasets of the Iris-LivDet-2017 competition [42] and with three neural networks for comparison and analysis. The experimental results link certain data augmentation methods to significant enhancements in generalizability and indicate the relatively low-overlapping effect of these augmentations.