Non-reference image quality assessment and natural scene statistics to counter biometric sensor spoofing

Non-reference image quality measures (IQM) as well as their associated natural scene statistics (NSS) are used to distinguish real biometric data from fake data as used in presentation/sensor spoofing attacks. An experimental study shows that a support vector machine directly trained on the NSS used in the blind/referenceless image spatial quality evaluator provides highly accurate classification of real versus fake iris, fingerprint, face, and fingervein data in a generic manner. This contrasts with using the IQM directly, the accuracy of which turns out to be rather data set and parameter choice-dependent. While the approach provides very low average classification error rates when the training data are complete, generalisation to unseen attack types is difficult in open-set scenarios, and the obtained accuracy varies in an almost unpredictable manner. This implies that for each given sensor/attack set-up, the ability of the introduced methods to detect unseen attacks needs to be assessed separately.


Introduction
We have observed a drastic increase in biometric authentication techniques being applied in various applications, ranging from border control to financial services. This is done to either complement or even replace classical authentication techniques based on tokens or passwords. Of course, this increased usage has also caused fraudulent attacks to be mounted more often against biometric systems. Besides injecting fraudulent data into the communication inside a biometric system or attacking the template database, attacks against the proper functioning of the biometric sensor are gaining importance. Such attacks are usually termed 'presentation' or 'sensor-spoofing' attacks and are conducted by presenting artefacts mimicking real biometric traits to the biometric sensor to be deceived, or by replaying earlier captured biometric sample data on some suited device, thus also attempting to deceive the sensor ('replay attack').
Counter-measures to this type of attack have of course been considered already and are typically termed 'anti-spoofing' or 'presentation-attack detection' measures [1]. In this context, very different approaches have been followed. The first type of anti-spoofing approach targets the liveness of the presented biometric traits in a passive or active manner and is thus termed 'liveness detection'. For example, pulse can be measured from facial video or hippus can be determined from temporal high-resolution iris video (in both cases, passive liveness detection is conducted). An example of active liveness detection is to determine the reaction of pupil dilation to illumination changes during data acquisition for facial, periocular, or iris recognition systems. Passive liveness detection is able to efficiently prevent attacks conducted with e.g. gummy fingers or facial masks; however, it can be fooled by a replay attack, as signs of liveness are also present in recaptured video. Active liveness detection, on the other hand, is able to withstand both types of attacks.
The second anti-spoofing approach directly focuses on the replay of previously recorded biometric sample data. As this attack involves the recapturing of previously recorded data by the biometric sensor, the corresponding counter-measure is termed 'recapturing detection'. Techniques in this category include the detection of unnatural movement in video footage as an indication of an attack, e.g. caused by hand motion when presenting a photo or display device to the sensor. Other approaches look into the interference between the display refresh rate of the replaying device and the temporal resolution of the video captured by the biometric sensor to detect an ongoing replay attack. Obviously, these methods are not able to detect attacks conducted with artefacts, as they are directly and solely focused on the replay of the data.
The third type of anti-spoofing approach is more generic and uses texture properties of real biometric trait data acquired by the biometric sensor to discriminate it from either recaptured data or data resulting from presenting some spoofing artefact to the sensor. In contrast to liveness-based methods, which are specific to the target modality, and recapturing detection, which has to be tailored to specific sensor/display type combinations (including print-outs, of course), texture-based methods usually employ generic texture descriptors together with subsequent machine learning techniques to discriminate real biometric data from spoofed variants. Of course, training data are required for classifier training. For example, a large variety of local image descriptors have been compared with respect to their ability to identify spoofed iris, fingerprint, and face data [2], and highly successful texture descriptors like local binary patterns have been extensively used for this purpose. However, it is often cumbersome to identify and/or design texture descriptors suited for a specific task in this context. Therefore, generative techniques like deep learning employing convolutional neural networks have also been successfully applied to discriminate real from spoofed biometric data [3,4].
A very different way to identify spoofed data is to look into the quality of the imagery, assuming that the quality of the real biometric data is better or at least different from spoofed data. This of course can be seen as a specific type of texture-based discrimination approach. Related work considers two approaches in this context: first, the approach can be entirely agnostic of the considered modality by using general purpose image quality measures (IQM) [5,6], and second, image quality metrics can be tailored to the biometric modality under investigation (see e.g. [7] which use face-specific data quality in order to recognise spoofing attacks against face recognition systems). The major contribution of this paper is to employ general purpose non-reference IQM (also termed 'blind' IQM) as well as the underlying natural scene statistics (NSS) in biometric spoofing attack/presentation attack detection and to assess their corresponding performance in different application settings. Complementing earlier results [5], we use (i) a different and larger set of non-reference IQM (six instead of two) and (ii) do not fuse the results with full-reference IQM values but focus on using one or several fused blind IQMs as generic spoofing detection technique.
Extending our own prior work on using non-reference IQM for presentation attack detection [6,8], (i) we add a support vector machine (SVM) as a second classifier, avoiding data-dependent parameter optimisation in its employment and thus achieving better generalisability of the results, and present an ISO/IEC 30107-3 compliant evaluation; (ii) we directly train on the blind/referenceless image spatial quality evaluator (BRISQUE) NSS of our data instead of using IQM output as classification input features; and (iii) we experimentally evaluate a specific type of open-set classification scenario, where our presentation attack detection schemes are confronted with real sample data of different sensors (i.e. looking into cross-sensor spoofing detection) and fake sample data of unseen subjects. Section 2 introduces and explains the blind IQM as used in this paper. The databases specifically provided to test presentation attack detection techniques for iris, fingerprint, face, and fingervein recognition used in the present work are described in Section 3. Section 4 presents corresponding experimental anti-spoofing results in three distinct experimental set-ups, while Section 5 provides the conclusions of this paper.

Non-reference image quality metrics
Non-reference or blind IQM are easier to deploy than full-reference or reduced-reference IQM, as no information about the full-quality reference image is required for their application. On the other hand, this lack of comparison data makes them much harder to design. There are different ways to design blind IQM, depending on the necessity and type of training data used and the extent of generalisation potential of the admissible distortion types considered. Thus, depending on these design principles, we face some limitations. Among the techniques designed so far, we may distinguish opinion-aware (OA) IQM designs, where the IQM are trained on databases containing distorted imagery for which human annotations in terms of quality are available, and opinion-unaware (OU) IQM designs, which only rely on deviations from statistical regularities seen in natural images without the requirement of training on human annotated distortion databases. OA IQM are intrinsically limited, as their assessment is restricted to quality impairment resulting from the distortion types they have been trained on. Examples of the first type, i.e. OA IQM, are distortion identification-based image verity and integrity evaluation (DIIVINE), blind image quality index (BIQI), and BRISQUE, while natural image quality evaluator (NIQE), blind image integrity notator (BLIINDS-II), and blind image quality assessment through anisotropy (BIQAA) are OU IQM.
Systematic comparisons of the non-reference or blind IQM (NR IQM) considered subsequently in spoofing detection have been published for traditional IQM tasks [9,10]. In both the non-trained [9] and the specifically trained [10] setting, the correspondence to human vision is highly dependent on the target data set and on the nature of the distortion present in the data. Thus, existing studies did not identify a 'winner' among the available techniques concerning the correspondence to subjective human judgement and objective distortion strength.

OU NR image quality metrics
NIQE: The NIQE [11] is a spatial-domain IQM relying on an NSS model. The image is partitioned into patches for which sharpness is determined and only patches with sufficient sharpness are considered further. Those patches are pre-processed by local mean removal and divisive normalisation. From these data, for each patch, 36 NSS features are computed and these are fit to a multivariate generalised Gaussian (MVG) model. This MVG model is then compared to the 'natural' MVG model which is obtained by conducting the same procedure on natural images of good quality only. The extent of deviation from this model determines quality.
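The pre-processing step described above (local mean removal followed by divisive normalisation) can be illustrated with a short sketch. It is not the reference implementation: it assumes a uniform square window instead of the Gaussian weighting commonly used, and the function name `mscn`, the window `radius`, and the stabilising constant `c` are hypothetical choices made for illustration only.

```python
import math

def mscn(image, radius=1, c=1.0):
    """Mean-subtracted, contrast-normalised coefficients of a 2D list of grey values.

    Assumption: a uniform (box) window clipped at the image border is used
    here instead of a Gaussian-weighted window.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the local neighbourhood, clipped at the image border.
            vals = [image[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mu = sum(vals) / len(vals)                           # local mean
            var = sum((v - mu) ** 2 for v in vals) / len(vals)   # local variance
            sigma = math.sqrt(var)                               # local contrast
            out[y][x] = (image[y][x] - mu) / (sigma + c)         # divisive normalisation
    return out

# A perfectly flat patch carries no structure: all coefficients vanish.
flat = mscn([[128] * 4 for _ in range(4)])
# A vertical edge yields coefficients of opposite sign on its two sides.
edge = mscn([[0, 0, 255, 255] for _ in range(4)])
```

The NSS features are then statistics of such normalised coefficients, which is why they respond to distortions rather than to image content as such.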

BLIINDS-II: The BLIINDS-II [12] computes NSS from a local discrete cosine transform (DCT) domain. After partitioning the image into patches, a local 2D DCT is computed on each of the blocks. Subsequently, the DCT domain in each block is partitioned into a low-frequency, mid-frequency, and high-frequency DCT subband, respectively. Furthermore, the DCT block is partitioned into three differently oriented subregions. Subsequently, an MVG fit is computed for each of the DCT subbands defined in this manner. From these parameters, the quality is derived in comparison to corresponding MVG parameters computed from high-quality imagery.
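How the coefficients of a block DCT might be grouped into low-, mid-, and high-frequency subbands can be sketched as follows. The grouping by the index sum u + v and the band boundaries are assumptions chosen purely for illustration; the exact partition used by BLIINDS-II [12] may differ, and the helper names `dct2` and `radial_subbands` are hypothetical.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * s
    return coeffs

def radial_subbands(coeffs):
    """Split the AC coefficients into three bands by u + v (assumed boundaries)."""
    n = len(coeffs)
    limit = 2 * (n - 1)  # largest possible index sum
    bands = {"low": [], "mid": [], "high": []}
    for u in range(n):
        for v in range(n):
            if u == 0 and v == 0:
                continue  # the DC coefficient is not part of any band
            r = u + v
            if 3 * r <= limit:
                bands["low"].append(coeffs[u][v])
            elif 3 * r <= 2 * limit:
                bands["mid"].append(coeffs[u][v])
            else:
                bands["high"].append(coeffs[u][v])
    return bands

constant_block = [[10.0] * 4 for _ in range(4)]
c = dct2(constant_block)
bands = radial_subbands(c)
```

For a constant block all energy ends up in the DC coefficient, so every subband is (numerically) empty of energy; distortions such as blur or compression change how energy is distributed across the bands.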
BIQAA: BIQAA [13] is the only NR IQM considered in this work which does not rely on NSS. In contrast, BIQAA measures the variance of the expected entropy of the image to be assessed in a set of predefined directions. Entropy is computed on a local basis by using a spatial-frequency distribution as an approximation for a pre-defined probability density function. For BIQAA, the generalised Renyi entropy and the normalised pseudo-Wigner distribution (PWD) are chosen in the used implementation. In this context, a pixel-by-pixel entropy value is computed enabling the generation of entropy histograms. The variance of the expected entropy is measured for different directions, and the differences are used to indicate anisotropy. Directional selectivity can be achieved by using an orientation-selective one-dimensional (1D) PWD implementation.

OA NR image quality metrics
BRISQUE: BRISQUE [14] operates in the spatial domain and uses virtually the same NSS as NIQE. The major difference to NIQE is the training on distorted images. For this purpose, distortions similar to those present in the LIVE image quality database were introduced in each training image with varying strengths to create a set of distorted images: JPEG 2000, JPEG, white noise, Gaussian blur, and fast fading channel errors. Subsequently, a mapping is learned from feature space to quality scores, resulting in a measure of image quality. For that purpose, an SVM regressor is used.
DIIVINE: The DIIVINE [15] employs a two-stage framework consisting of distortion identification with subsequent distortion-specific quality determination. DIIVINE considers three common distortion types, i.e. JPEG compression, JPEG2000 compression, and blur. In order to compute statistics from distorted images, the steerable pyramid decomposition is used. The steerable pyramid is an over-complete wavelet transform offering enhanced orientation selectivity when compared to a classical wavelet transform as used e.g. in BIQI.
BIQI: The BIQI [16] is also based on a two-stage framework like DIIVINE and employs a classical wavelet transform over three scales using the biorthogonal Daubechies 9/7 wavelet basis. The computed wavelet subband coefficients are used to compute NSS parameters (again an MVG fit is conducted). The first step is image distortion classification, which is based on a measure of how the NSS are modified and uses five distortion types: JPEG, JPEG2000, white noise (WN), blur, and fast fading (FF). The second step is quality assessment, using an algorithm specific to the distortion identified.

Natural scene statistics
IQM applied to images result in a single quality score in a certain range ([0, 100] in our set-up) for each IQM. Typically, these scores are obtained by applying machine learning techniques to map NSS to quality scores based on human judgement of distorted images or on undistorted images only. Thus, the actual quality score delivered by an IQM is neither directly related to our application case of discriminating real from spoofed biometric data, nor does it necessarily fit it well. An alternative that avoids this drawback is to skip the detour via quality scores and to train on the NSS directly, using the 'real' and 'fake' labels of our data. Doing this, we also avoid the dimensionality reduction to a 1D quality score and retain the full NSS information for training.

Used spoofing/presentation attack databases
[…] of the BioSecure data set [17]. Four samples of each iris were captured in two acquisition sessions with the LG Iris Access EOU3000. Thus, the database holds 800 real image samples (100 irises × 4 samples × 2 sessions). The fake samples were also acquired with the LG Iris Access EOU3000 from high-quality printed images of the original samples. As the structure is the same as for the real samples, the database comprises 800 fake image samples (100 irises × 4 samples × 2 sessions). Fig. 1 displays example images.
The data set has been used before in spoofing/presentation attack detection investigations, e.g. [2,5,18].
ATVS-FFp DB: The ATVS-FFp database consists of fake and real images taken from a human's index and middle finger of both hands. Those fingerprints can be divided into two categories: with cooperation (WC) and without cooperation (WOC). 'WC' means that acquisition assumes the cooperation of the fingerprint owner, whereas images taken 'WOC' are latent fingerprints which had to be lifted from a surface.
Independent of the category, four samples of each finger were captured in one acquisition session with three different sensors:
• flat optical sensor Biometrika Fx2000 (512 dpi),
• sweeping thermal sensor by Yubee with Atmel's Fingerchip (500 dpi),
• flat capacitive sensor by Precise Biometrics model Precise 100 SC (500 dpi).
As a result, the database consists of 816 real/fake image samples (68 fingers × 4 samples × 3 sensors) taken WC and 768 real/fake image samples (64 fingers × 4 samples × 3 sensors) taken WOC. Fig. 2 displays example images from this data set.
IDIAP replay-attack DB [22]: The replay-attack database for face spoofing consists of 1300 video clips of photo and video attack attempts to 50 clients under different lighting conditions. All videos were generated by either having a real client trying to access a laptop through its webcam or by displaying a photo/video to the webcam. Real as well as fake videos were taken under two different lighting conditions:
• Controlled: the office light was turned on, the blinds were down, and the background was homogeneous.
• Adverse: the blinds were up, the background was more complex, and the office lights were out.
To produce the attacks, high-resolution videos were taken with a Canon PowerShot SX150 IS camera. The attacks can be divided into two subsets: the first subset is composed of videos generated using a tripod to present the client biometry ('fixed'). For the second subset, the attacker holds the device used for the attack with his/her own hands ('hand').
In total, 20 attack videos were registered for each client, 10 for each of the attacking modes just described:
• four mobile attacks using an iPhone 3GS screen (with a resolution of 480 × 320 pixels),
• four high-resolution screen attacks using an iPad (first generation, with a screen resolution of 1024 × 768 pixels),
• two hard-copy print attacks (produced on a Triumph-Adler DCC 2520 colour laser printer) occupying the whole available printing surface on A4 paper.
As the algorithms used in our experiment are not compatible with videos, we extracted every Xth frame from each video and used them as test data in our experiment. Fig. 3 displays example images used in experimentation.
The spoofing-attack finger vein database [23]: This data set is provided by the IDIAP Research Institute and consists of 440 index finger vein images (both real authentications and spoofed ones, i.e. attack attempts) corresponding to 110 subjects. Two different types of samples are available (as shown in Fig. 4): full (printed) images and cropped images, where the resolution of the full images is 665 × 250 pixels and that of the cropped images is 565 × 150 pixels.
This data set has been released in the context of the '1st Competition on Counter Measures to Finger Vein Spoofing Attacks' [23] and is now the data basis for most research in finger vein sensor spoofing [24–26].

Experimental set-up
For each image in the databases, quality scores were calculated with the IQM described in Section 2. We used the MATLAB implementations from the developers of BIQI, BLIINDS-II, NIQE, DIIVINE, BRISQUE (all available from http://live.ece.utexas.edu/research/quality/) and BIQAA (available at https://www.mathworks.com/matlabcentral/fileexchange/30800-blind-image-quality-assessment-through-anisotropy). In all cases, we used the default settings. We normalised the results such that 0 represents good quality and 100 bad quality, which is already the default in all cases except BIQAA. Originally, the BIQAA output lies between 0 and 1. However, the values are so small that we had to define our own limits for the normalisation. A thorough analysis shows that our values all lie between 0.00005 and 0.05; therefore, we used these figures as our limits. Moreover, we had to reverse the 'orientation' of the BIQAA quality scores to conform to our definition. Summarising, the following formula (1) was built, where $x$ denotes the raw BIQAA value:

$$q_{\mathrm{BIQAA}} = 100 - 100 \cdot \frac{x - 0.00005}{0.05 - 0.00005} \qquad (1)$$

Experiment 1: training sensor/setting identical to evaluation sensor/setting:
In the first stage of experiment 1, we only consider the distribution of the quality scores. Our aim was to find, if possible, a threshold between the values of the real data and the fake ones for the various IQM. Afterwards, in the second stage, we used the quality scores in a leave-one-subject-out cross-validation (for each subject in turn, the training data is all data except the samples of the subject currently being classified) to obtain a precise statement about the classification capability of NR IQM. To classify our data, we used k-nearest neighbours (kNN) as well as SVM classification. For kNN, the values of k used in this experiment were 1, 3, 5, 7, and 9 (denoting the number of images with the closest feature vectors considered), and we exhaustively evaluated all combinations of IQM (i.e. resulting in different feature vector dimensions and compositions).
Thus, we combined several quality scores of the different measures into one vector and used this for the kNN classification. The distance for the kNN classification was the distance between the two vectors corresponding to the two images in question. The kNN results presented are the best ones, which means that we introduce a bias in the results here to see what is possible, but the best configuration in terms of IQM combination and k will be data dependent and will probably not generalise. For SVM, we use feature vectors consisting of all IQM scores (i.e. dimension 6), applying LIBSVM [27] with an RBF kernel for training; thus, no bias from selecting certain IQM is introduced, as all IQM are used. The SVM parameters (c, g) were searched on a grid in logarithmic space. In order to conduct a fair evaluation in the cross-validation used, (c, g) are optimised within each training fold and then applied to the evaluation data.
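The exhaustive kNN evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: Euclidean distance is assumed as the vector distance, and the sample layout (dictionaries with `subject`, `label`, and `iqm` keys) as well as all function names are hypothetical.

```python
from itertools import combinations
import math

def knn_predict(train, query, k):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(1 for _, label in nearest if label == "fake")
    return "fake" if votes * 2 > len(nearest) else "real"

def loso_error(samples, dims, k):
    """Leave-one-subject-out error rate using only the IQM indices in dims."""
    errors = 0
    for subject in {s["subject"] for s in samples}:
        # Training data: everything except the samples of the current subject.
        train = [([s["iqm"][d] for d in dims], s["label"])
                 for s in samples if s["subject"] != subject]
        for s in samples:
            if s["subject"] == subject:
                pred = knn_predict(train, [s["iqm"][d] for d in dims], k)
                errors += pred != s["label"]
    return errors / len(samples)

def best_configuration(samples, n_iqm, ks=(1, 3, 5, 7, 9)):
    """Search all non-empty IQM subsets and all k; keep the lowest error."""
    best = None
    for size in range(1, n_iqm + 1):
        for dims in combinations(range(n_iqm), size):
            for k in ks:
                err = loso_error(samples, dims, k)
                if best is None or err < best[0]:
                    best = (err, dims, k)
    return best

# Toy data: IQM 0 separates the classes, IQM 1 is uninformative.
samples = [
    {"subject": 0, "label": "real", "iqm": [10, 50]},
    {"subject": 0, "label": "fake", "iqm": [80, 52]},
    {"subject": 1, "label": "real", "iqm": [12, 49]},
    {"subject": 1, "label": "fake", "iqm": [78, 51]},
    {"subject": 2, "label": "real", "iqm": [11, 53]},
    {"subject": 2, "label": "fake", "iqm": [82, 48]},
]
best = best_configuration(samples, n_iqm=2, ks=(1, 3))
```

Because `best_configuration` picks its winner using the same data on which the error is reported, the returned error is optimistically biased, which is exactly the caveat attached to the kNN results.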
The quantitative performance of the different techniques is measured according to the metrics defined in ISO/IEC 30107-3 in terms of: (i) attack presentation classification error rate (APCER), which is defined as the proportion of attack presentations incorrectly classified as normal (or real) presentations (false-negative spoof detection); (ii) normal presentation classification error rate (NPCER), which is defined as the proportion of normal presentations incorrectly classified as attack presentations (false-positive spoof detection). Finally, the performance of the overall technique is assessed in terms of the average classification error rate (ACER), i.e. the mean of the two error rates:

$$\mathrm{ACER} = \frac{\mathrm{APCER} + \mathrm{NPCER}}{2}$$

Of course, the lower the values of ACER (as well as APCER and NPCER), the better the performance of the spoofing detection.
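As a small worked example of these error rates, the following sketch computes APCER, NPCER, and ACER from ground-truth labels and classifier decisions. It assumes that ACER is the arithmetic mean of the two rates, as its name suggests; the function name `error_rates` and the label strings are hypothetical.

```python
def error_rates(labels, predictions):
    """Return (APCER, NPCER, ACER) in percent.

    labels and predictions are sequences of "real" / "fake" strings.
    Assumption: ACER is the arithmetic mean of APCER and NPCER.
    """
    attacks = [p for l, p in zip(labels, predictions) if l == "fake"]
    normals = [p for l, p in zip(labels, predictions) if l == "real"]
    # Attack presentations accepted as real.
    apcer = 100.0 * sum(p == "real" for p in attacks) / len(attacks)
    # Normal presentations rejected as fake.
    npcer = 100.0 * sum(p == "fake" for p in normals) / len(normals)
    return apcer, npcer, (apcer + npcer) / 2.0

labels = ["fake"] * 4 + ["real"] * 5
predictions = ["fake", "fake", "fake", "real",          # one missed attack
               "real", "real", "real", "real", "fake"]  # one false alarm
apcer, npcer, acer = error_rates(labels, predictions)
```

Averaging the two rates, rather than counting all errors together, keeps the measure independent of how many real and fake samples a data set happens to contain.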

Experiment 2: training with BRISQUE NSS:
In our experiment 2, we applied the BRISQUE NSS data and trained it on our labels. As first option, we applied kNN to the 36-dimensional BRISQUE NSS again using different values for k, presenting the best result achieved. As second option we applied SVM: The BRISQUE software does not only provide a pre-trained model delivering quality scores but also offers the option for training on different labels than quality scores using LIBSVM [27]. This is applied within the cross-validation evaluation.

Experiment 3: training sensor/setting different from evaluation sensor/setting:
As correctly pointed out in [28], sensor spoof detection can of course not be considered a closed-set problem. This means that in a real-world scenario, the training data for a specific sensor will never be complete, as in general we do not know which artefacts will be used by an attacker; thus, the classifier should also work on unseen spoof types. This of course raises the question of how to train a classifier based on such incomplete training data, a typical case of open-set binary classification. The fact that the performance of a classifier will decrease when testing with samples unseen in training has been well studied in machine learning and pattern recognition, e.g. related to the 'over-fitting problem'. This generalisation problem of data-trained classifiers has also been discussed in the context of general image classification [29] and in biometrics (see e.g. gender classification [30]). The general open-set recognition problem has recently been addressed [31–33], and the developed open-set classification techniques have been successfully applied to soft biometrics (mark, scar, and tattoo classification [34]), camera attribution and device linking [35], and fingerprint spoof detection [28]. In the latter work, the emphasis is on also detecting attacks with unseen spoofing artefact fabrication material. In contrast, one aspect of experiment 3 covers the issue of cross-sensor or inter-database spoofing detection. This means that unseen attacks involve samples acquired with sensors different from those the anti-spoofing system has been trained on, a topic that has gained increasing importance. Recent work has considered this scenario with various presentation attack detection methods for fingerprint [36–38], face [39], speaker [40], and iris [41,42] recognition techniques, respectively.
In experiment 3, we investigated two different settings to simulate open-set scenarios. First, the ATVS-FFp database contains classical fingerprint imprints (WC) and latent fingerprints (WOC). So far, we have strictly separated those two sets, as real and fake versions are available for both types. In order to simulate the open-set scenario, we trained the classifier on classical imprints and evaluated it on the latent fingerprint data. This can be seen as a special case of considering unseen fabrication material.
As a second setting, we used real sample data captured by different sensors and investigated how the spoof detection techniques trained on the sample data used before react. In this setting, it is not entirely clear what to count as a correct or incorrect decision (i.e. how to define APCER and NPCER): a (real) sample captured by a different sensor could be rated as 'real', as it corresponds to data captured from a real finger; on the other hand, it could be rated as 'fake', as it has been captured by a different sensor and might be the result of a successful injection attack. We follow the first consideration, also because it allows cross- and multi-sensor spoof detection techniques to be considered. Thus, a real sample captured by a different sensor should be rated as 'real', and errors (samples rated as 'fake') are accumulated in NPCER.
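The convention adopted here can be made explicit with a small sketch: when an evaluation set contains samples of one class only, every error falls into a single error rate, and no ACER can be formed. The function name `one_sided_error` and the label strings are hypothetical.

```python
def one_sided_error(predictions, true_label):
    """Error rate in percent on an evaluation set containing one class only.

    For a set of real samples every error is a false alarm (NPCER);
    for a set of fake samples every error is a missed attack (APCER).
    """
    if true_label not in ("real", "fake"):
        raise ValueError("true_label must be 'real' or 'fake'")
    wrong = sum(p != true_label for p in predictions)
    rate = 100.0 * wrong / len(predictions)
    name = "NPCER" if true_label == "real" else "APCER"
    return name, rate

# Real samples from an unseen sensor: one of four is rejected as fake.
cross_sensor_decisions = ["real", "real", "fake", "real"]
result = one_sided_error(cross_sensor_decisions, "real")
```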
For iris samples, we used the SDUMLA-HMT data: this multimodal data set was collected during the summer of 2010 at Shandong University, Jinan, China. One hundred and six subjects, including 61 males and 45 females aged between 17 and 31, participated in the data collection process, in which five biometric traits (face, finger vein, gait, iris, and fingerprint) were collected for each subject [43]. SDUMLA-HMT is available at http://mla.sdu.edu.cn/sdumla-hmt.html. Every subject provided ten iris images, i.e. five images for each of the eyes.
For fingerprint samples, we employed samples of 49 individuals from the CASIA-FingerprintV5 data set (http://biometrics.idealtest.org/dbDetailForUser.do?id=7) also used in [44,45] […] differ between the two data sets). Thus, in contrast to the cases before, samples from this data set should be correctly classified as 'fake', and errors (i.e. samples rated as 'real') were counted in APCER. See Fig. 5 for an example of each data set.
Note that in experiment 3, we have an intrinsic separation of training and evaluation data (in contrast to experiments 1 and 2). Therefore, we did not apply a leave-one-subject-out cross-validation but a direct classification of the evaluation samples based on the training data. In the second setting, involving the additional data sets containing real or fake data only, we only obtain NPCER or APCER results; thus, ACER does not make sense and is omitted.

Experiment 1 - results:
In Figs. 6 and 7, we display the distribution of single IQM values for real and fake data. For some cases, we notice a decent separation of the values, almost allowing a separation threshold to be specified. In the figures, we have depicted the threshold leading to the lowest ACER and have coloured the correctly classified areas in green. However, for most configurations, this simple strategy does not lead to useful results.
In many cases (see e.g. Fig. 7), we could not recognise any separation between the distributions because they exhibited a similar mean and spread for the real and the fake data. That was the reason for employing training-based classification techniques and fusion techniques.
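The single-IQM threshold strategy can be sketched as a scan over candidate thresholds that keeps the one with the lowest ACER. This is an illustration under assumptions, not the procedure used to produce the figures: candidate thresholds are taken from the observed scores, both orientations (fake above or below the threshold) are tried, ACER is taken as the mean of APCER and NPCER, and the function name `best_threshold` is hypothetical.

```python
def best_threshold(real_scores, fake_scores):
    """Return (ACER in percent, threshold, fake_above) with the lowest ACER.

    fake_above=True means that scores greater than the threshold are
    classified as fake; fake_above=False means the opposite orientation.
    """
    candidates = sorted(set(real_scores) | set(fake_scores))
    best = None
    for t in candidates:
        for fake_above in (True, False):
            if fake_above:
                missed = sum(s <= t for s in fake_scores)       # fake rated real
                false_alarms = sum(s > t for s in real_scores)  # real rated fake
            else:
                missed = sum(s > t for s in fake_scores)
                false_alarms = sum(s <= t for s in real_scores)
            apcer = missed / len(fake_scores)
            npcer = false_alarms / len(real_scores)
            acer = 100.0 * (apcer + npcer) / 2.0
            if best is None or acer < best[0]:
                best = (acer, t, fake_above)
    return best

# Mostly separable toy scores: one fake sample lies inside the real range.
separable = best_threshold([20, 25, 30, 35], [40, 55, 60, 28])
# Identical distributions: no threshold does better than chance.
overlapping = best_threshold([1, 2], [1, 2])
```

When the real and fake score distributions share a similar mean and spread, the best achievable ACER approaches 50%, which corresponds to the cases where no useful threshold could be found.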
In the case of kNN classification with only one IQM, we already obtain surprisingly good results [6,8]. However, we were not able to identify a single IQM specifically well suited for the target task. In contrast, it seems that the different distortions present in the spoofed data are quite specific in terms of the nature and characteristic of the distortions, which is the only explanation of different IQM performing best on different data sets.
In fact, our results confirm the general results on IQM quality prediction performance [9,10] in that it is highly data set and distortion-dependent which IQM provides the best results.
A further increase in classification accuracy (as computed by 100 − (APCER + NPCER)) is obtained by the combination of several IQM. Table 1 shows the best metric combinations in the case of kNN-classification for the considered databases from an exhaustive search. On average, we could improve our results by 7% compared to the single measure results [6,8] and so most of the results are over 90%. From the latter table, we notice that there is a trend of getting best results when combining a larger number of IQM, confirming earlier results in this direction [5]. In order to look into this effect more thoroughly (and to clarify the role of the k-parameter in kNN classification), we have systematically investigated the results of the exhaustive classification scenarios in [6,8]. We found that combining more metrics and choosing k large leads to better results on average, whereas the top results are achieved when using three to six metrics depending on the considered data set. For optimal values of k, we are not able to give a clear statement as k was also found to be 1 for three data sets in Table 1.
In Table 2, we display ISO/IEC 30107-3 compliant results comparing kNN and SVM classification. For correctly interpreting these results, it is important to consider that for kNN classification, we present the best result in terms of ACER achieved when considering all admissible values for k and all possible combinations of IQM. For the kNN case (left table half), we also provide the corresponding k value and the number of employed IQM in this best configuration. For SVM, we do not introduce any bias by including all six IQM score values into the feature vector.
The overall trend in terms of ACER is quite comparable for both kNN and SVM, as the databases exhibiting large and small ACER are identical for both techniques. For both kNN and SVM, there is no clear trend as to whether APCER is usually larger than NPCER or vice versa. Also, there is no clear trend as to which classification approach is better: SVM is superior for three data sets, while kNN is superior for six data sets. However, given that the kNN results come from a data-dependent parameter optimisation, SVM is strongly preferable, as its results are expected to generalise better because they are not fitted to the evaluation data. Overall, we face quite significant variations in terms of achieved ACER magnitude, which implies that the methodology cannot be recommended as a general spoof detection approach but is restricted to suited data sets.

Experiment 2 - results:
In Table 3, we show the results achieved when using BRISQUE NSS feature vectors instead of IQM ones, for both kNN and SVM classification. We observe identical behaviour with respect to the relation between APCER and NPCER as observed for IQM feature vectors (no clear trend as to which type of error is more frequent). When considering ACER, there is no clear improvement when changing from IQM to NSS feature vectors in the case of kNN classification.
The situation is very different when considering the SVM results. NSS-based ACER values are clearly better than IQM ones for all but a single database (for which the values are identical), in part considerably so. For example, ACER is reduced from 13.14 to 0.64 for the optical fingerprint data set (WC) and from 4.69/5 to 0.1 for both the capacitive fingerprint data set (WOC) and the full-sized fingervein data set, respectively. Also, ACER values are superior for SVM compared to their kNN counterparts in the case of the NSS feature vectors. This is of particular interest, as the SVM results are expected to be highly generalisable due to the avoidance of data-specific bias.
The significant superiority of SVM-NSS compared to kNN-NSS can probably be attributed to the significantly higher dimension of its feature vectors compared to SVM-IQM, for which SVN is much better able to exhibit its strengths as compared to kNN. As a consequence, we propose the employed SVM-NSS technique as a generic and rather accurate spoof detection methodology.

Experiment 3 - results:
The last set of experiments is devoted to the open-set topic, i.e. looking into the effects that occur when the type of evaluated samples is not part of the available training set. Table 4 shows the results when classifying real and fake latent fingerprints (denoted as WOC) when the classifier is trained on classical fingerprint data (WC). We compare all four considered classification techniques in this table.
When comparing the obtained ACER results with the corresponding ones in Tables 2 and 3, we realise that in all four classification cases, ACER values are clearly worse in the 'open-set' scenario. Interestingly, the worst ACER results are now exhibited by SVM-NSS, the approach clearly performing best in the 'closed-set' scenario. While this is surprising at first sight, it is in fact plausible. SVM-NSS is able to generate a very accurate model of the training data and thus performs quite well when working on seen spoof data. In contrast, when confronted with unseen data very different from the training data, many errors occur. Interestingly, not a single real sample is incorrectly classified as a fake one. However, almost every second fake sample is misclassified as a real one. This behaviour is also very different from that seen in the closed-set scenario. In the open-set scenario, we observe significantly different magnitudes for APCER and NPCER, and the relation depends on the feature vector type. While NPCER is clearly larger for IQM-based feature vectors, the opposite is true for NSS-based ones.
Finally, in Table 5, we display results when confronting our spoof detection methodology with samples from unseen sensors or unseen subjects. As explained earlier, we only present NPCER or APCER values, as the employed data sets contain only real or only fake samples, but not both. Again, we compare the four classification methodologies considered so far. Additionally, kNN-IQM ∅ denotes the NPCER/APCER averaged over all results obtained by exhaustively varying the set of used IQM, taking the minimal value over k = 1, 3, 5, 7, 9 in each case. The aim is to show that the average behaviour of kNN may deviate significantly from the best results presented so far. The results in the table clearly confirm this: average results are clearly worse than the best ones shown in the first column. In some cases, the difference is small (e.g. fingervein full); in other cases, such as iris, results change from perfect spoof detection to entirely useless results when moving from the best result to the average behaviour. This also implies that the data dependency of kNN is rather high, which leads to poor generalisation potential for this approach.
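The difference between the best kNN configuration and the averaged kNN-IQM ∅ figure can be illustrated with a small sketch. It follows one reading of the description above: for every feature subset, the minimal error over k = 1, 3, 5, 7, 9 is taken, and these minima are then either minimised (best) or averaged (∅). All names (`knn_predict`, `error_rate`, `best_and_average`) and the two-feature toy data are hypothetical; the experiments themselves use six IQM and real biometric data. Since the test set contains fake samples only, the error rate corresponds to APCER.

```python
from itertools import combinations
from statistics import mean

K_VALUES = (1, 3, 5, 7, 9)


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes_fake = sum(y for _, y in dists[:k])
    return 1 if 2 * votes_fake > k else 0


def error_rate(train_x, train_y, test_x, test_y, cols, k):
    """Percentage of misclassified test samples using only feature columns `cols`."""
    train_sel = [[row[c] for c in cols] for row in train_x]
    wrong = 0
    for row, target in zip(test_x, test_y):
        query = [row[c] for c in cols]
        if knn_predict(train_sel, train_y, query, k) != target:
            wrong += 1
    return 100.0 * wrong / len(test_y)


def best_and_average(train_x, train_y, test_x, test_y):
    """Return (best, average) of the per-subset minima over K_VALUES."""
    n_feat = len(train_x[0])
    per_subset = []
    for size in range(1, n_feat + 1):
        for cols in combinations(range(n_feat), size):
            per_subset.append(
                min(error_rate(train_x, train_y, test_x, test_y, cols, k)
                    for k in K_VALUES)
            )
    return min(per_subset), mean(per_subset)


# Toy data, invented for illustration: label 0 = real, 1 = fake.
train_x = [[0.10, 0.50], [0.15, 0.55], [0.20, 0.45], [0.25, 0.52], [0.30, 0.48],
           [0.70, 0.90], [0.75, 0.95], [0.80, 0.85], [0.85, 0.92], [0.90, 0.88]]
train_y = [0] * 5 + [1] * 5
# Unseen fakes whose second feature resembles the real training samples
test_x = [[0.80, 0.50], [0.85, 0.47]]
test_y = [1, 1]

best, average = best_and_average(train_x, train_y, test_x, test_y)
print(best, average)
```

In this toy setting, the best configuration detects every fake sample, while the subset relying only on the second feature fails completely, so the average is far worse than the best value. This mirrors, in miniature, the gap between the first column and the ∅ column.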
When looking at the results overall, we hardly observe any general trends apart from the fact that results seem to depend strongly on the data sets considered and the features/classification schemes employed. SVM-NSS, the classification scheme of choice for the closed-set scenario, performs perfectly for fingervein data, thus enabling cross-sensor spoof detection. On the other hand, for iris data, it does not work at all, classifying almost every real sample as a fake one, while for fingerprint data, every other real sample is classified as fake. When looking at the actual pictorial data, it seems that fingervein data from different sensors are more similar to each other than iris or fingerprint data from different sensors are (e.g. compare Figs. 4 and 5 for the fingervein case). NSS used with kNN classification exhibits the best overall results, with perfect classification for iris, for three out of six fingerprint settings, and for correctly detecting fake fingervein samples of unseen users captured with an identical sensor. Applying SVM to IQM directly leads to consistent misclassifications in many cases; for two cases, however, the classification is almost perfect. The kNN results using IQM underpin the necessity of the k-parameter optimisation if sensible results are expected. One might expect that corresponding fingerprint sensor types (i.e. optical versus optical) lead to lower error rates than different ones; however, the results do not reflect this behaviour. Overall, it is impossible to explain most effects in a sound manner, like the almost opposite behaviour of kNN and SVM on iris data for both feature vector types or the single outlier result of kNN-NSS for full fingervein data versus UTFVP. Also, the reasons for the complete failure of IQM feature vectors, as opposed to NSS feature vectors, for the full fingervein versus VERA data are hard to figure out.

Conclusion
We have found a high dependency on the actual data set/modality under investigation when trying to answer the question about the optimal settings when using non-reference IQM for biometric spoof detection. For some data sets, we obtain almost perfect separation of real and fake sample data, while for others, ACER values up to 10% can be observed. The situation changes considerably when training directly on NSS features (in our experiments those used by BRISQUE) extracted from our data, especially when using SVM classification. In this setting, the worst ACER values are bounded by 3.8%, with a majority of computed ACER values being significantly <1%, which makes this approach an interesting candidate for a generic spoof detection methodology.
When the proposed spoof detection techniques are confronted with data from unseen sensors and/or subjects (modelling a more realistic open-set classification scenario with incomplete training data), many results seem to be rather unpredictable. Thus, it seems advisable to apply recent open-set classification schemes to obtain more stable and more generalisable results in case unseen data is to be expected.

Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 700259. Also, this work has been partially supported by the Austrian Science Fund, project no. 27776.