Likelihood ratio data to report the validation of a forensic fingerprint evaluation method

The data referred to throughout this article are likelihood ratios (LRs) computed from the comparison of 5–12 minutiae fingermarks with fingerprints. These LR data are used for the validation of an LR method in forensic evidence evaluation. They are a necessary asset for conducting validation experiments on LR methods used in forensic evidence evaluation and for setting up validation reports. The data can also be used as a baseline for comparing fingermark evidence in the same minutiae configuration as presented in (D. Meuwly, D. Ramos, R. Haraksim) [1], although the reader should keep in mind that different feature extraction algorithms and different AFIS systems may produce different LR values. Moreover, the data may serve as a reproducibility exercise for training in the generation of validation reports of forensic methods, according to [1]. Alongside the data, a justification and motivation for the choice of methods is given. These methods calculate LRs from the fingerprint/mark data and are subject to a validation procedure. The choice of using real forensic fingerprints in the validation and simulated data in the development is described and justified. Validation criteria are set for the validation of the LR methods, which are used to calculate the LR values from the data and to produce the validation report. For privacy and data protection reasons, the original fingerprint/mark images cannot be shared. However, these images do not constitute the core data for the validation, in contrast to the LRs, which are shared.


Journal homepage: www.elsevier.com/locate/dib
© 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject area
Forensic Biometrics

More specific subject area
Forensic Fingerprints

Type of data
Empirical validation report example based on real forensic fingerprint images.
Likelihood ratio values computed from those real forensic fingerprints, in order to replicate the validation report.

How data was acquired
Fingerprints scanned using the ACCO 1394S live scanner, converted into biometric scores using the Motorola BIS 9.1 algorithm.

Data format
Text files, calibrated likelihood ratios supporting either the Hp or the Hd proposition.

Experimental factors
Biometric scores were treated as per description in paragraph 4.

Experimental features
Same-source [SS] and different-source [DS] scores were produced using a Motorola AFIS comparison algorithm and used to compute the LR values as described in paragraph 5.

Data source location
Netherlands Forensic Institute, Laan van Ypenburg 6, 2497 GB, The Hague, The Netherlands

Data accessibility
Data is with the article.

Value of the data
Real forensic data in the form of LR values suitable for validation and performance evaluation are provided. The availability of LRs from forensically relevant data is limited, which increases the value of these data.
A complete empirical validation case study is provided in the form of a validation report that includes a validation decision. The data serve for reproducibility of validation reports of automatic forensic evaluation methods as described in [1].
The performance characteristics of the LR method developed are measured in terms of accuracy, discriminating power, calibration, generalization, coherence and robustness [1], and are provided in the form of calibrated likelihood ratios for both the baseline and the multimodal LR method.

Data
The term "data" is used to denote the LR values, which are produced using the two LR methods presented below. The data are shared with the forensic biometric community, alongside the description of an empirical example of a validation report generated using the LR values, which is included in [2]. The LR data can be used to reproduce the validation experiments for accuracy, discriminating power and calibration in the validation report in [2]. The validation report is of potential interest to forensic researchers who aim to validate and accredit their LR systems/LR methods, and the data presented here can be used to assess the reproducibility of the results presented in the report. Presented below are the experimental design, materials and methods, as well as the datasets used to produce the LR values.

Experimental design, materials and methods
We start with the validation matrix, in which the performance characteristics, metrics and graphical representations used are organized (Section 3); introduce the similarity scores (Section 4); describe the datasets used for validation and LR method development (Section 5); define the LR methods (Section 6); define the validation criteria (Section 7); present the validation report, organized in 6 tables, one per performance characteristic (Section 8); and conclude by introducing the validation decision (Section 9).
A more complete example of the validation report using this particular data can be found in [2].

Validation matrix
A validation report must include the specification and description of the different aspects of the validation process. Sometimes, these aspects are summarized in a so-called "Validation matrix" (Table 1).
The following aspects are essential to any validation process:
Performance characteristic: characteristic of a LR method that is thought to have an influence in the validation of a given method. For instance, LR values should be discriminating in order to be valid, that is, provide clear distinction between comparisons under different hypotheses. In this case, discriminating power is a performance characteristic.
Performance metric: variable whose numeric or categorical value measures a performance characteristic. For instance, the minimum log-likelihood ratio cost (Cllr min) can be interpreted as a measure of discriminating power, and therefore it can be used as a performance metric of the discriminating power.
Graphical representation: representation of a performance characteristic, its distribution or its variation in the form of a graph. Note that not all graphical representations recommended in the original article [1] are included in the Validation Matrix, but at least one for each characteristic.
Validation criteria: these define conditions for validating the method for each of the performance characteristics considered (i.e., rows in the matrix). For instance, if we are measuring accuracy using Cllr as a metric, the validation criterion can be Cllr < 0.2. The establishment of these criteria depends on the policy of each forensic laboratory, and should be transparent and not easily modified during the validation process. Some implications of this are discussed previously in this document.
Data: description of the database used for validation, both in the development and in validation stages.
Experiment: a description of the experimental protocol to generate the likelihood ratio values. Each experimental protocol might vary among different performance characteristics, especially for the secondary ones. For instance, in order to measure coherence, the protocol might significantly vary with respect to the measure of accuracy [3].
Analytical result: value of the performance metric for the experiment. For instance, if we are measuring accuracy using Cllr as a metric, the analytical result can be Cllr = 0.2. It is also often useful to express the result as a relative improvement with respect to a clearly defined baseline or reference.
Validation decision: for each performance characteristic, the validation decision will be pass if the validation criterion is met by the analytical result, and fail otherwise.
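To make the link between metric, criterion, analytical result and decision concrete, the following minimal Python sketch computes Cllr with the usual definition from the literature (the text above does not spell the formula out) and applies a pass/fail rule. The function names and the toy LR values are hypothetical and are not taken from the shared data.

```python
import math

def cllr(ss_lrs, ds_lrs):
    """Log-likelihood-ratio cost (standard definition, assumed here).

    ss_lrs: LRs from same-source comparisons (should be large).
    ds_lrs: LRs from different-source comparisons (should be small).
    """
    ss_penalty = sum(math.log2(1.0 + 1.0 / lr) for lr in ss_lrs) / len(ss_lrs)
    ds_penalty = sum(math.log2(1.0 + lr) for lr in ds_lrs) / len(ds_lrs)
    return 0.5 * (ss_penalty + ds_penalty)

def validation_decision(analytical_result, threshold):
    """Pass if the criterion 'metric < threshold' is met, fail otherwise."""
    return "pass" if analytical_result < threshold else "fail"

# Hypothetical toy LR values, for illustration only.
ss_lrs = [100.0, 20.0, 5.0]
ds_lrs = [0.01, 0.05, 0.2]

result = cllr(ss_lrs, ds_lrs)
decision = validation_decision(result, threshold=0.2)
print(f"Cllr = {result:.3f}, decision = {decision}")
```

With a strict criterion such as Cllr < 0.2, an analytical result of exactly 0.2 yields a fail; a laboratory that wants the boundary value to pass would state the criterion with a non-strict inequality.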

Fingerprint evidence evaluation using AFIS scores
The method to be validated in this example is based on the output scores of an Automated Fingerprint Identification System (AFIS) comparison algorithm. The aim is to compute a likelihood ratio for each score provided by the AFIS in a comparison between a fingermark and a fingerprint. The "commercial off-the-shelf" AFIS algorithms producing comparison scores are primarily developed to support the selection of candidates for forensic investigation, and are not aimed at describing the evidential value for forensic evaluation [4]. However, the information of the AFIS can be evaluated by means of a LR in order to yield complementary information to forensic examiners, especially if they are unsure about the conclusions of a comparison between a fingerprint and a fingermark. Previous work regarding this procedure can be found in [2,5-7]. As a consequence, different methods to compute LR values from AFIS scores have been implemented and evaluated at the Netherlands Forensic Institute [2,3,8].
The AFIS comparison algorithm (Motorola BIS - Printrak 9.1) is used here as a black box, without the aim of scrutinizing its internal approach to compute scores. A detailed description of the algorithm inside the black box can be found in [2]. Recent work [9] shows that the larger the number of scores available to train the models, the more adequate the plug-in method.
In this example, the propositions for the computation of the LR are established at source level: Hp (the fingermark and the fingerprint originate from the same source) and Hd (the fingermark and the fingerprint originate from different sources). The determination of the relevant propositions in a specific case is mandatory. However, the hypotheses determined in this particular example are generic and not intended as a recommendation in the original article [1]. They are given only for the purpose of illustration. Each particular case will lead to a different set of propositions, and this should be considered in the scope of the validation process. The determination of the hypotheses is part of the scope of the validation procedure conducted, which should be incorporated into the other requirements of each particular laboratory or institution.

Datasets used
As recommended in the original article [1], different datasets are used for the development and validation stages. A "forensic" dataset, consisting of fingermarks from real cases, was used in the validation stage. The LRs generated by the methods are the values used to conduct the validation process, and are the data presented in this contribution.

Development dataset
Since it is notoriously difficult to find forensically relevant, sufficiently large datasets including the known ground truth about the origin of the specimens, we decided to use a set of simulated (see footnote 1) [9,10] 8-minutiae (see footnote 2) fingermarks from 26 individuals paired with their corresponding fingerprints. The fingermarks were obtained by capturing an image sequence of the finger of each individual from an optical live scanner (Smiths Heimann Biometrics ACCO 1394S live scanner) and splitting the frames captured into 8-minutiae configurations.
For generating same-source (SS) scores we used the AFIS scores of simulated fingermarks and the corresponding reference fingerprint of the same finger, captured from the same individual under controlled conditions. For generating different-source (DS) scores we used the mark in the case compared against a 200'000-fingerprint subset of the population database provided by the National Services of the Dutch National Police. The number of comparisons used to generate scores is summarized in Table 2.
In order to generate an appropriate modelling of the scores for the development stage, scores are obtained on a "leave-one-person-out" basis, meaning that in the computation of a likelihood ratio from a score, the latter is eliminated from the training data for the models.
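The leave-one-person-out scheme can be sketched as follows. This is only an illustration under the assumption that scores are grouped by donor; the dictionary layout, the function name and the score values are hypothetical.

```python
def leave_one_person_out(scores_by_person):
    """Yield (person, held_out_scores, training_scores) triples.

    For each person, the held-out scores are the ones for which LRs are
    computed; the training scores come from all the other persons, so the
    held-out scores never take part in fitting the models.
    """
    for person, held_out in scores_by_person.items():
        training = [
            score
            for other, scores in scores_by_person.items()
            if other != person
            for score in scores
        ]
        yield person, held_out, training

# Hypothetical AFIS scores grouped by donor, for illustration only.
scores_by_person = {
    "A": [510.0, 620.0],
    "B": [480.0],
    "C": [700.0, 655.0],
}

splits = list(leave_one_person_out(scores_by_person))
for person, held_out, training in splits:
    print(person, held_out, training)
```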
It is worth noting that, in score-based LR computation, there is some theoretical controversy about the way in which scores are computed from the training dataset (see e.g. [11]). However, we think that the proposed scheme to obtain scores is adequate for the sake of illustration in the original article [1], and it is by no means proposed as a recommendation for score-based systems.

Validation dataset
The validation dataset consists of data from real forensic cases: 58 identified fingermarks in 12-minutiae configuration and their corresponding fingerprints. The ground-truth labels of the dataset, indicating whether a fingermark/fingerprint pair originates from the same source as stated by forensic examiners, are denoted as "ground-truth by proxy" because of the nature of the pairing between fingermarks and fingerprints: they have been assigned after examination by human examiners, indirectly taking into account not only the 12 minutiae, but also the correspondence of other features. The minutiae feature vectors of the fingermarks have been manually extracted by examiners, while the minutiae feature vectors of the fingerprints have been automatically extracted using the feature extraction algorithm of the AFIS used, and manually checked by examiners. Those feature vectors are used to feed the AFIS comparison algorithm for the computation of scores.
In order to obtain multiple minutiae configurations for the validation of the LR method, the minutiae extracted from the fingermarks have been clustered into configurations of 5-12 minutiae, according to the method described in [10]. Following the clustering procedure, we obtain 481 minutiae clusters in a 5-minutiae configuration from the 58 fingermarks with 12 minutiae. For each cluster in the marks, a same-source (SS) score is obtained by comparing each minutiae cluster from a fingermark with the corresponding reference print. Similarly, a different-source (DS) score distribution is obtained by comparing each minutiae cluster from a fingermark to a subset of a police fingerprint database. This database consists of roughly 10 million 10-print cards captured at 500 dpi. The higher the number of minutiae in each cluster, the lower the number of clusters, as can be seen in Table 3.

Table 2. Number of comparisons used to generate scores in the development dataset: same-source (SS), 3'758 marks against 1 print; different-source (DS), 3'758 marks against 200'000 prints.

Footnote 1: Simulated fingermarks in this case refer to series of image captures of a finger moving on the glass plate of the fingerprint scanner (the procedure is described in detail in [10]).
Footnote 2: Please note that the performance characteristics of the LR model described in Section 6 have been evaluated using the development dataset based on the fingermarks in the 8-minutiae configuration (which is the quality threshold for usability of fingermark evidence in some countries). Subsequently the LR model was validated using the validation dataset for a range of 5-12 minutiae configuration fingermarks.

Description of the behaviour of AFIS scores
Before the LR model under validation (and its baseline) is introduced, an analysis of the AFIS scores is performed in order to determine the set of desirable performance characteristics (qualities) of the LR models. It is worth noting that this analysis is performed on training data, which are not used as the validation database afterwards.
Additionally, the AFIS technology used employs the concept of early-outs. Thus, there are three consecutive stages in each comparison:
1. First, the system performs a quick comparison between the mark and the print. If the score obtained in this first comparison is -1, it is called a first-level early-out and the score is delivered for that comparison, stopping the comparison process. Otherwise, a second comparison is performed.
2. If the score was not a first-level early-out, the AFIS does not yet output the score, but performs a more sophisticated (but still fast) comparison between the mark and the print. If the score obtained is between 0 and 300, it is called a second-level early-out, and it is delivered for that comparison, stopping the comparison process. Otherwise, a third-level comparison is performed.
3. If the comparison does not result in a first-level or second-level early-out, the AFIS performs a more computationally intensive comparison, where a final score bigger than 300 is delivered.
This behaviour of the system divides the range of scores into three regions: -1, [0, 300] and more than 300. This is shown in Fig. 1, where the scores that result from the AFIS algorithm applied to a subset of the development data are clearly distributed in those three regions (R). In Region 1 (R1) (score of -1) the first-level early-outs are found. In Region 2 (R2) (scores in the [0, 300] range) the second-level early-outs are distributed. Finally, in Region 3 (R3) the full comparison of all the features is performed (the algorithm outputs scores bigger than 300). Additionally, it should be considered that the family of probability distributions of SS and DS scores observed in each region might be different, mainly because the early-out scoring process implies the use of different comparison algorithms.
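The partition of the score range into three regions can be written down as a small Python helper. This is a sketch only: the function name is hypothetical, and treating both 0 and 300 as belonging to R2 is an assumption, since the text does not state how the boundary values are handled.

```python
def score_region(score):
    """Map an AFIS score to its early-out region (R1, R2 or R3).

    Assumption: the R2 interval is closed, i.e. 0 <= score <= 300.
    """
    if score == -1:
        return "R1"  # first-level early-out
    if 0 <= score <= 300:
        return "R2"  # second-level early-out
    if score > 300:
        return "R3"  # full comparison of all the features
    raise ValueError(f"score {score!r} is outside the documented ranges")

# Hypothetical scores, for illustration only.
regions = [score_region(s) for s in (-1, 0, 150, 300, 301, 2500)]
print(regions)
```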
The original fingerprints cannot be shared with the forensic biometric community due to restrictions related to privacy and data protection. However, the likelihood ratios produced by the two compared LR methods can be shared with the biometric community. They are the core data of the experiment and allow the published results to be reproduced.

Multimodal LR method and baseline KDF
In this section, we describe the model to validate and its baseline. The aim of the LR method to validate (the so-called multimodal method, briefly described below) is to outperform the baseline, as we discuss later. This description is needed in the validation report, if there is not a proper bibliographic reference to address it.

Data produced using the baseline LR method: Kernel Density Functions
The multimodal nature of the SS and DS score distributions and the non-overlap of the three regions suggest the use of flexible, non-parametric score-to-LR transformation models. A popular choice in the literature [12,13] has been the Kernel Density Functions (KDF or KDE). For this reason, KDE will be used as the baseline model in our validation experiment. In the KDE baseline experiment we treat all the SS (and DS) scores in all three regions together to calculate LRs from the AFIS scores.
KDE (or any other parametric or non-parametric modelling method) will not be of much use in the R1 region in particular, since all the scores in this region have the same discrete value S = -1. This is an excellent example of a limitation of the use of KDE for this kind of score distribution. However, as KDE is typically chosen and recommended by many references in forensic science, and is also theoretically grounded, we choose it as a baseline.
Let S denote the score obtained by the AFIS in the comparison between the fingermark found at the crime scene and the fingerprint of the donor. The baseline KDE LR model implements the general LR expression:

$$\mathrm{LR} = \frac{f(S \mid H_p)}{f(S \mid H_d)}$$

where, for fingerprint evidence evaluation, the two densities are fitted to datasets defined in the following way:
S_ss: a set of scores obtained from comparing a training set of simulated fingermarks of the donor with the reference fingerprint of the donor. They are used to fit the KDE probability density in the numerator.
S_ds: scores obtained from comparing the crime scene fingermark and a subset of fingerprints from the population database used in the model (in this case a subset of the operational AFIS database of the National Unit of the Dutch Police). They are used to fit the KDE probability density in the denominator.
This approach has been proposed in [12-14], and has been dubbed asymmetric anchoring [8,11]. As mentioned before, there is some discussion about the usage of the databases in score-based likelihood ratio computation [8,11]. The selection of asymmetric anchoring as a procedure to generate the scores should not be seen as a recommendation, and discussions about this are outside the scope of this example. We use it here as a choice for data usage in order to compute scores for training the models, just for the sake of illustration in the original article [1]. The outcomes of this method are two sets of LR values, supporting either Hp or Hd.
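A score-to-LR transformation of this kind can be sketched in a few lines of Python. This is not the implementation used to produce the shared data: the Gaussian kernel, the fixed bandwidth, the function names and the score values are all assumptions made for illustration.

```python
import math

def gaussian_kde_pdf(x, samples, bandwidth):
    """Gaussian kernel density estimate evaluated at x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

def kde_lr(score, ss_scores, ds_scores, bandwidth=25.0):
    """LR as the ratio of the SS density to the DS density at the score.

    The bandwidth is a hypothetical fixed value. Far from the DS training
    scores the denominator becomes tiny (or underflows to zero), which is
    one way in which LRs of very large magnitude can arise.
    """
    numerator = gaussian_kde_pdf(score, ss_scores, bandwidth)
    denominator = gaussian_kde_pdf(score, ds_scores, bandwidth)
    return numerator / denominator

# Hypothetical training scores, for illustration only.
ss_scores = [520.0, 560.0, 610.0, 640.0]
ds_scores = [40.0, 90.0, 150.0, 310.0]

lr_high = kde_lr(580.0, ss_scores, ds_scores)  # score typical of SS comparisons
lr_low = kde_lr(100.0, ss_scores, ds_scores)   # score typical of DS comparisons
print(lr_high > 1.0, lr_low < 1.0)
```

Even in this toy setting the LR for a score far from one of the training sets is driven by the tails of the kernels, which is why the choice of bandwidth and the amount of training data matter for the magnitude of the resulting LRs.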

Data produced using the Multimodal LR model
In order to obtain the LR for a given score, the proposed multimodal LR model to be validated in this example independently assigns probabilities to each score region by regional models, and then combines them by following the rules of probability. A detailed description of the method to compute LRs can be found in [15].
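One way to read this combination, offered here only as a hedged sketch and not as the model of [15], is to weight a region-specific density by the probability of a score falling in that region under each proposition. The symbols $R_k$ (score region), $P(R_k \mid H)$ (region probability) and $f_k$ (regional density) are notation introduced for this illustration:

```latex
\mathrm{LR}(s) =
  \frac{P(R_k \mid H_p)\, f_k(s \mid H_p)}
       {P(R_k \mid H_d)\, f_k(s \mid H_d)},
  \qquad s \in R_k,\quad k \in \{1, 2, 3\}
```

Under this reading, a score in R1 carries no information beyond the region itself, because every score there equals -1; the density terms cancel and the LR reduces to the ratio of the two region probabilities.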
As a result of the application of the LR model, one LR per comparison is generated, both for development and for validation. The resulting set of LRs constitutes the data included in this contribution.
Table 4. Validation criteria. First 3 columns of the Validation Matrix used in this example. Note that not all metrics recommended in [1] are included in the Validation Matrix, but at least one for each characteristic.

Validation criteria
The validation criteria are established with respect to the results of the performance characteristic of the baseline method, as mentioned in Table 4 below.

Validation report
In this section, we present a validation report following the EN ISO/IEC 17025:2005 recommendations, where all the items in the validation matrix above are addressed (Table 4). The report is presented per performance characteristic in Subsections 8.1 to 8.6 below.

Accuracy
Accuracy is defined in [1] as "the closeness of agreement between an assigned LR and the ground truth status of the proposition in a decision-theoretical framework". It is measured by the Cllr and represented by the ECE plot, as shown in Fig. 2.

Validation criterion
The validation criterion for accuracy is based on the Kernel Density Function (KDE) baseline LR method. Using the development dataset in 8 minutiae configuration, Cllr = 0.16 for the baseline.
A better or comparable Cllr value on the development dataset in 8 minutiae configuration is expected for the multimodal LR method than for the KDE baseline (e.g. Cllr ≤ 0.16).

Experiment
The Cllr (solid line in the ECE plot) is measured for both methods, the KDE baseline and the multimodal LR, on the development and validation datasets.

Data
Development dataset consists of fingermarks in 8 minutiae configuration, corresponding fingerprints, reference subset of operational police database. Validation dataset consists of the fingermarks in 8 minutiae configuration and corresponding fingerprints originating from the real forensic casework.

Validation decision for the accuracy
Based on the results presented the validation criterion was satisfied.

Discriminating power
Discriminating power is defined in [1] as "representing the capability of a given method to distinguish amongst forensic comparisons under each of the propositions involved". It is measured by Cllr min and EER and represented by the ECE and DET plots, as shown in Figs. 3 and 4 respectively.

Validation criterion
The validation criterion is based on the Kernel Density Function (KDE) baseline LR method. Using the development dataset in 8 minutiae configuration, Cllr min = 0.145 and EER = 3.7% for the baseline method.
Better or comparable Cllr min and EER values on the development dataset in 8 minutiae configuration are expected for the multimodal LR method than for the KDE baseline.

Experiment
The Cllr min (the dashed line in the ECE plot) and the EER are measured for both methods, the KDE baseline and the multimodal LR, on the development and validation datasets.
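As an illustration of how an EER can be estimated from two sets of LR values, the following Python sketch sweeps a decision threshold over the observed LRs and reports the error rate where the false acceptance and false rejection rates are closest. It is a simple empirical approximation with hypothetical names and toy values, not the procedure used for the report; Cllr min additionally requires an optimal recalibration step that is not shown here.

```python
def equal_error_rate(ss_lrs, ds_lrs):
    """Approximate EER by sweeping a threshold over the observed LR values.

    A comparison is 'accepted' when its LR is >= the threshold.
    """
    thresholds = sorted(set(ss_lrs) | set(ds_lrs))
    thresholds.append(float("inf"))  # threshold that rejects everything
    best_gap, best_eer = None, None
    for t in thresholds:
        frr = sum(lr < t for lr in ss_lrs) / len(ss_lrs)   # false rejections
        far = sum(lr >= t for lr in ds_lrs) / len(ds_lrs)  # false acceptances
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer

# Hypothetical toy LR values, for illustration only.
ss_lrs = [0.5, 5.0, 50.0, 500.0]
ds_lrs = [0.01, 0.1, 1.0, 10.0]

eer = equal_error_rate(ss_lrs, ds_lrs)
print(f"EER = {eer:.2%}")
```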

Data
Development dataset consists of fingermarks in 8 minutiae configuration, corresponding fingerprints, reference subset of operational police database. Validation dataset consists of the fingermarks in 8 minutiae configuration and corresponding fingerprints originating from the real forensic casework.

Validation decision for the discriminating power
Based on the results presented the validation criterion was satisfied.

Calibration
Calibration is defined in [1] as "the property of a given set of LR values to yield the same set of LR values when computing the LR trained from the same data (in other words, the LR of the LR is the LR for a given set of LR values)". It is measured by Cllr cal and represented by the ECE plot, as shown in Fig. 5.

Validation criterion
The validation criterion for calibration is based on the Kernel Density Function (KDE) baseline LR method. Using the development dataset in 8 minutiae configuration, Cllr cal = 0.02 for the baseline method. Hence we defined the calibration criterion as Cllr cal (val) ≤ Cllr cal (dev) + 0.1.
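The criterion can be checked mechanically, as in the Python sketch below. It assumes the usual definition of the calibration loss from the literature, Cllr cal = Cllr - Cllr min, which this text does not state explicitly; with the baseline development values quoted in this report (Cllr = 0.16, Cllr min = 0.145) this gives about 0.015, consistent with the reported 0.02 up to rounding. The function names and the validation-side value are hypothetical.

```python
def calibration_loss(cllr, cllr_min):
    """Calibration loss, assuming Cllr cal = Cllr - Cllr min."""
    return cllr - cllr_min

def calibration_criterion_met(cllr_cal_val, cllr_cal_dev, margin=0.1):
    """Check Cllr cal (val) <= Cllr cal (dev) + margin."""
    return cllr_cal_val <= cllr_cal_dev + margin

# Baseline development values quoted in the report.
dev_cal = calibration_loss(0.16, 0.145)

# Hypothetical validation-side value, for illustration only.
hypothetical_val_cal = 0.05
print(round(dev_cal, 3), calibration_criterion_met(hypothetical_val_cal, 0.02))
```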

Experiment
The Cllr cal is measured for both methods, the KDE baseline and the multimodal LR, on the development and validation datasets.

Data
Development dataset consists of fingermarks in 8 minutiae configuration, corresponding fingerprints, reference subset of operational police database. Validation dataset consists of the fingermarks in 8 minutiae configuration and corresponding fingerprints originating from the real forensic casework.

Validation decision for the calibration
Based on the results presented the validation criterion was satisfied.

Robustness to the lack of data
Robustness to the lack of data is described in [1] in the following way: "Data driven LR methods do have a tendency to provide LR values of inappropriate magnitude when the data used to train them is not enough. Inappropriate (not suitable) LR methods may result in LR values of huge magnitudes, which given the limited amount of data cannot resemble reality." It is observed for a range of LR values and represented in a Tippett plot, as shown in Fig. 6.

Validation criterion
Multimodal LR method yields LR values that present moderate weight-of-evidence for the values in the baseline KDE that are extremely high (see [2] page 84).

Experiment
The range of the LR values is analysed in search of LR values of large magnitude.

Data
Development dataset consists of fingermarks in 8 minutiae configuration, corresponding fingerprints, reference subset of operational police database. Validation dataset consists of the fingermarks in 8 minutiae configuration and corresponding fingerprints originating from the real forensic casework.

Analytical results
The KDE baseline method yields evidence of enormous magnitude supporting the wrong proposition (in extreme cases bigger than 10^90) (shown in [1], page 84), as opposed to the proposed method, in which the support for the wrong proposition is much more confined (not bigger than 10^9 in a single extreme case). Hence the multimodal LR method developed is more robust to the lack of data than the KDE baseline method.
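A search for misleading evidence of large magnitude can be sketched in Python as follows. The function name and the LR values are hypothetical; the sketch only shows the kind of check involved, not the analysis behind the figures quoted above.

```python
import math

def worst_misleading_log10(ss_lrs, ds_lrs):
    """Largest log10 magnitude of support for the wrong proposition.

    Same-source LRs below 1 and different-source LRs above 1 are misleading.
    Returns (worst among SS, worst among DS); 0.0 means none was found.
    """
    worst_ss = max((-math.log10(lr) for lr in ss_lrs if lr < 1.0), default=0.0)
    worst_ds = max((math.log10(lr) for lr in ds_lrs if lr > 1.0), default=0.0)
    return worst_ss, worst_ds

# Hypothetical LR values, for illustration only.
ss_lrs = [1000.0, 0.001, 10.0]
ds_lrs = [0.01, 100.0, 1e9]

worst_ss, worst_ds = worst_misleading_log10(ss_lrs, ds_lrs)
print(f"worst misleading SS: 10^-{worst_ss:.0f}, worst misleading DS: 10^{worst_ds:.0f}")
```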

Validation decision for the robustness to the lack of data
Based on the results presented the validation criterion was satisfied.

Coherence
Coherence is defined in [1] as a characteristic that "measures the agreement in the variation of performance metrics (Cllr, EER) when the amount of information in the evidence varies, like the quantity of minutiae in a fingerprint and a fingermark." It is measured using the Cllr, Cllr min and the EER and represented in ECE and DET plots, as shown in Figs. 7 and 8 respectively.

Validation criterion
Observe improvement in the performance metrics (accuracy and discriminating power) with the increasing number of minutiae (presenting additional information).

Experiment
Vary the number of minutiae from 5 to 12 minutiae and observe improvement in Cllr, Cllr min and EER.
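The coherence check amounts to verifying that a metric does not get worse as minutiae are added. The Python sketch below lists the steps at which a lower-is-better metric such as Cllr increases; the function name and all metric values are hypothetical and are not results from this report.

```python
def coherence_breaks(metric_by_minutiae):
    """Return the (n, n_next) steps at which a lower-is-better metric gets worse.

    An empty list means the metric improves (or stays equal) at every step.
    """
    counts = sorted(metric_by_minutiae)
    return [
        (a, b)
        for a, b in zip(counts, counts[1:])
        if metric_by_minutiae[b] > metric_by_minutiae[a]
    ]

# Hypothetical Cllr values per minutiae count, for illustration only.
hypothetical_cllr = {
    5: 0.40, 6: 0.35, 7: 0.30, 8: 0.22,
    9: 0.18, 10: 0.25, 11: 0.15, 12: 0.10,
}

breaks = coherence_breaks(hypothetical_cllr)
print("coherent" if not breaks else f"coherence fails at steps: {breaks}")
```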

Data
Multimodal LR method was trained using the development dataset. Validation dataset consists of the fingermarks in 5 to 12 minutiae configurations and corresponding fingerprints originating from the real forensic casework.
Table 5. False Acceptance Probability (in %)

Validation decision for the coherence
Based on the results presented, the validation criterion was satisfied with the following remark: the AFIS minutiae comparison algorithm consists of two different algorithms. The first algorithm is used for comparing fingermarks in 5 to 9 minutiae configuration; the second algorithm is used for comparing fingermarks in 10+ minutiae configuration.
This makes the coherence fail in the transition between algorithms. However, this is a consequence of the AFIS black-box technology and not of the LR method, because the discriminating power is also affected by this, and not only the calibration.
Therefore, the proposed method clearly shows coherence within each of the algorithms. In order to show full coherence, it would be beneficial to replace the twin-cored comparison algorithm by a dedicated minutiae comparison algorithm that works across the whole range of minutiae configurations. However, as the use of this particular AFIS algorithm is specified in the scope of the validation process, we conclude that the coherence criterion is accomplished.

Generalization to the previously unseen data under the dataset shift
Generalization is defined in [1] as the "capability of a method to keep its performance under dataset shift, which is here defined as the difference in the conditions between the training data (used to train the LR methods) and the data that will be used as evidence in operational conditions." It is measured using the Cllr, Cllr cal, Cllr min and the EER and represented in ECE and DET plots, as shown in Figs. 9 and 10 respectively.

Experiment
The multimodal LR method is trained using the development dataset and tested using the previously unseen validation dataset. An example using fingermarks in 8 minutiae configuration is used. Both the baseline LR method and the multimodal LR method are trained using the development dataset, and the multimodal LR method is then validated using the previously unseen validation dataset.

Data
Development dataset consists of fingermarks in 8 minutiae configuration, corresponding fingerprints, reference subset of operational police database. Validation dataset consists of the fingermarks in 8 minutiae configuration and corresponding fingerprints originating from the real forensic casework.
Table 6 and Table 7.

Validation decision for the generalization to the previously unseen data
Based on the results presented the validation criteria were satisfied.

Validation decision
The multimodal LR method developed for forensic fingerprint evidence evaluation appears to satisfy the validation criteria specified above, with a remark regarding the coherence. A summary across the different performance characteristics is presented in Table 8 below.