Measuring calibration of likelihood-ratio systems: A comparison of four metrics, including a new metric devPAV
Introduction
A growing majority of forensic scientists is convinced that forensic evidence should be interpreted by means of a likelihood ratio (LR). An LR is a statistic that discriminates between two propositions and that can be used in Bayes' rule to update the prior odds for these propositions to posterior odds. To advance forensic science, automated systems that calculate likelihood ratios have been proposed over the past few decades for several evidence modalities, e.g. fingerprints [4], [5], glass [6], [7], [8], XTC [9], gasoline [10], [11], ignitable liquid class in fire debris [12], [13], DNA (see for example [14]), etc.
With the arrival of these automated LR-systems, questions about the objective testing of their validity have arisen, and several publications have addressed this issue [3], [14], [15], [16], [17], [18], [19]. A key issue in the testing of validity is the measurement of calibration of the LR-system. It is important that an LR-system is well-calibrated [20]. If it is not, the value of the LR cannot be trusted, and updating by Bayes' rule results in misleadingly large or small posterior odds.
An intuitive explanation of what well-calibrated means is given by [3], [21]. Imagine a weather forecaster who keeps track of his predictions of the probability of rain tomorrow over a sequence of days. Preferably, when examining those days for which his predicted probability of rain was, say, 0.9, he would like it to have rained on 90% of those days. When his predicted probability is (nearly) equal to this relative frequency, the weather forecaster is said to be well-calibrated: his probability statements make 'empirical sense'. A deviation between empirical frequency and prediction may be caused by statistical models or parameter estimates that do not reflect reality well. This is, of course, undesirable: a probability statement of 0.9 that does not mean that it will rain on 90% of such days is very misleading!
In analogy to a dataset of probability statements, a dataset of LR-values may also make 'empirical sense' (or not). Consider the general likelihood ratio formula:

\[ \mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)} \tag{1} \]

where $E$ denotes the evidence and $H_p$ and $H_d$ the prosecution and defense propositions, respectively. Just as the forecaster is well-calibrated if predicted probabilities match empirical frequencies, $P(\text{rain} \mid \text{prediction} = V) = V$, an LR-system is well-calibrated if 'the LR of the LR is the LR':

\[ \frac{P(\mathrm{LR} = LR_0 \mid H_p)}{P(\mathrm{LR} = LR_0 \mid H_d)} = LR_0 \tag{2} \]

for any scalar $LR_0$ [1], [22]. As the likelihood ratio is a comprehensive representation of the evidential value, re-calculating the evidential value with this scalar itself regarded as the evidence should give the same evidential value. This also gives a handle on how to measure calibration of an LR-system: if an empirical deviation from Eq. (2) is found for a dataset of LR-values, this indicates that the LR-system is ill-calibrated. For instance, say we have an LR-system that we apply to a dataset of 100 situations in which $H_p$ is true, and also to 100 situations in which $H_d$ is true. Assume the results are as follows: when $H_d$ is true, 90 of the 100 LR-values have a value of 10, erroneously indicating support for $H_p$; when $H_p$ is true, 50 of the 100 LR-values have a value of 10. We then have an ill-calibrated LR-system, for which a deviation from Eq. (2) is found: the LR according to Eq. (2) associated with $LR_0 = 10$ would be 0.5/0.9 ≈ 0.56, which is quite different from 10 and more reflective of the fact that this value is observed in 90% of instances when $H_d$ is true.
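As a minimal illustration (ours, not code from the paper), the 'LR of the LR' in this toy example can be computed directly:

```python
# Minimal sketch: empirical check of Eq. (2) for the discrete toy example
# above, where LR = 10 occurs with relative frequency 0.5 under Hp and
# 0.9 under Hd.
freq_given_hp = {10: 0.5}  # P(LR = 10 | Hp): 50 of 100 Hp-true cases
freq_given_hd = {10: 0.9}  # P(LR = 10 | Hd): 90 of 100 Hd-true cases

for lr0 in freq_given_hp:
    lr_of_lr = freq_given_hp[lr0] / freq_given_hd[lr0]
    print(f"reported LR = {lr0}, LR of the LR = {lr_of_lr:.2f}")
    # A well-calibrated system would return lr0 itself; here 0.56 != 10.
```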
An LR-system can be ill-calibrated in several ways. In this research, four general types of ill-calibrated LR-sets were investigated by creating them through simulation. In the first set, all LRs are too large; in the second, all LRs are too small; in the third, the LRs are too extreme (i.e. too large for LRs greater than 1 and too small for LRs less than 1); and in the fourth, too weak (for details see Section 2.2). A rough sketch of how such sets can be generated is given below.
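The sketch below is written under our own assumptions, not the authors' Section 2.2 parametrization: the four conditions are mimicked by distorting samples from a well-calibrated Gaussian LLR-system, for which LLR | H_p ~ N(μ, 2μ) and LLR | H_d ~ N(−μ, 2μ), and the shift and scale values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def calibrated_llrs(mu, n_ss, n_ds):
    """Natural-log LRs from a well-calibrated Gaussian LLR-system:
    LLR | Hp ~ N(mu, 2*mu) and LLR | Hd ~ N(-mu, 2*mu)."""
    sd = np.sqrt(2.0 * mu)
    llr_hp = rng.normal(mu, sd, n_ss)    # same-source (Hp-true) comparisons
    llr_hd = rng.normal(-mu, sd, n_ds)   # different-source (Hd-true) comparisons
    return llr_hp, llr_hd

llr_hp, llr_hd = calibrated_llrs(mu=1.0, n_ss=300, n_ds=300)

# Four distortions corresponding to the four ill-calibrated conditions:
shift, scale = 2.0, 2.0
too_large   = (llr_hp + shift, llr_hd + shift)   # all LRs too large
too_small   = (llr_hp - shift, llr_hd - shift)   # all LRs too small
too_extreme = (llr_hp * scale, llr_hd * scale)   # pushed away from LR = 1
too_weak    = (llr_hp / scale, llr_hd / scale)   # pulled towards LR = 1
```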
From the literature [7], [9], [10] it is known that often a small number of LRs in a dataset are ill-calibrated: they constitute misleading evidence with excessive values. Therefore, in a second simulation experiment, this situation was imitated by adding a misleading LR with an excessive value (either under $H_p$ or under $H_d$) to data simulated from a perfectly calibrated LR-system. Together these simulation experiments cover the typical ways in which LR-data can be ill-calibrated.
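Continuing the sketch above, this second experiment amounts to appending a single extreme misleading value to an otherwise calibrated set (the magnitude −10 is our illustrative choice):

```python
# One excessively misleading LR: an LLR of -10 (LR ~ 4.5e-5) in a
# comparison where Hp is in fact true.
llr_hp_contaminated = np.append(llr_hp, -10.0)
```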
One of the examined calibration metrics in this study (see Section 2.3) is defined in terms of misleading evidence. This refers to LRs pointing in the wrong direction, i.e. an LR greater than 1 when in fact the defense proposition is true, or an LR less than 1 when in fact the prosecution proposition is true. A relatively large fraction of misleading LRs in a set may be indicative of ill-calibration.
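In code, the rates of misleading evidence for a labeled set of natural-log LRs can be computed as follows (the names mislHp and mislHd mirror the metric labels used later in the paper; the exact definitions there may differ):

```python
import numpy as np

def misleading_rates(llr_hp, llr_hd):
    """Fractions of misleading natural-log LRs in a labeled LR-set."""
    misl_hp = float(np.mean(np.asarray(llr_hp) < 0.0))  # LR < 1 while Hp true
    misl_hd = float(np.mean(np.asarray(llr_hd) > 0.0))  # LR > 1 while Hd true
    return misl_hp, misl_hd
```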
Before embarking on the study, it is important to define criteria that a good metric has to satisfy. For this study, we defined two such criteria: differentiation and stability.
Obviously, a metric should differentiate between well- and ill-calibrated LR-systems. But when one wants to interpret the outcome of a metric, it also has to be stable. Stability of results is important so that a typical range of values can be defined for well-calibrated LR-systems. Differentiation is measured by comparing metric outputs for well- and ill-calibrated LRs under the same simulation conditions. Stability is measured by comparing metric outputs for well-calibrated systems under different simulation conditions.
There may be ambiguity in the use of the terms discrimination and differentiation. We use the terms as follows. Discrimination is used in the context of the LR-system itself, i.e. the way an LR-system discriminates between $H_p$ and $H_d$. One summary of the discriminating power of an LR-system is the equal error rate. A second, distinct notion is the ability of a calibration metric to tell well- and ill-calibrated LR-systems apart. To avoid confusion, in the context of the performance of a calibration metric we will not speak of discrimination but of 'differentiation', or of the ability to 'differentiate' between well- and ill-calibrated LR-systems.
The interest of this study is how best to measure calibration of LR-systems. Although several studies have used various metrics to measure calibration, as far as the authors are aware their performance has not yet been compared. In this study, four metrics to measure calibration are compared: three from the existing literature [1], [2], [3] and one newly developed method (coined devPAV; a rough sketch of the underlying idea follows below). We compare them by simulating well- and ill-calibrated LR-datasets and studying how well the metrics differentiate between these datasets, and how stable the outcomes are for well-calibrated LR-sets under different conditions. Finally, we discuss limitations of the comparison study and conclude by recommending some metrics for the measurement of calibration and a further comparison study.
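To give an impression of the idea behind devPAV (the precise definition is given in Section 2.3; the summary statistic below is our own plausible reading, not necessarily the paper's exact formula), one can recalibrate the LLRs with the pool-adjacent-violators (PAV) algorithm and summarize how far the recalibrated values deviate from the originals:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression  # a PAV implementation

def pav_calibrated_llrs(llr_hp, llr_hd):
    """PAV-recalibrate natural-log LRs (equal numbers per proposition,
    so that posterior log-odds equal the calibrated LLRs)."""
    llrs = np.concatenate([llr_hp, llr_hd])
    labels = np.concatenate([np.ones_like(llr_hp), np.zeros_like(llr_hd)])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    p = iso.fit_transform(llrs, labels)   # isotonic estimate of P(Hp | LLR)
    p = np.clip(p, 1e-6, 1 - 1e-6)        # avoid log(0) at the extremes
    return np.log(p / (1.0 - p))

def dev_pav(llr_hp, llr_hd):
    """Illustrative summary: mean absolute difference between the
    pre- and post-PAV LLRs."""
    llrs = np.concatenate([llr_hp, llr_hd])
    return float(np.mean(np.abs(pav_calibrated_llrs(llr_hp, llr_hd) - llrs)))
```

For a well-calibrated set the PAV transform stays close to the identity line and the summary is near 0; distortions such as those described above move it away from 0.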
Methods
In this study calibration metrics will be applied to several datasets of LRs. It is required to know the characteristics of these datasets, so that the desired behavior of the calibration metrics can be defined. We therefore use simulated data, so that the ground truth and the LR-distributions are known.
Results
The results are presented in three sections. First, the differentiation between condition 'P' and the other conditions is studied, by showing results for a fixed set of simulation parameters and n_SS = 300. These results are used to zoom in on the four most promising calibration metrics, for which some further results with n_SS = 300 are shown.
Second, stability of the results with respect to the distribution parameters and n is also important, since this allows for a good interpretation of the value
Discussion and conclusion
In this study, the performances of several calibration metrics were compared with the use of simulated LR-data. The studied calibration metrics were either known for this purpose from the literature (Cllr^cal, mom0, mommin1, mislHp and mislHd) or proposed by us (devPAV). For Gaussian LLR-distributions with varying discrimination (mean parameter ranging from 1 to 17, corresponding to EERs from 24% to 0.18%) and varying sample sizes (number of datapoints under $H_p$ from 50 to 300), the performance of the above metrics was studied.
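The quoted EER values are consistent with the standard well-calibrated Gaussian LLR model (an assumption on our part, not stated explicitly in this snippet), in which the LLR has mean ±μ and variance 2μ under the two propositions; the equal error rate then follows from thresholding the LLR at 0:

```latex
% EER for a calibrated Gaussian LLR model with
% LLR|Hp ~ N(mu, 2*mu) and LLR|Hd ~ N(-mu, 2*mu):
\mathrm{EER} = P(\mathrm{LLR} < 0 \mid H_p)
             = \Phi\!\left(-\sqrt{\mu/2}\right),
\qquad
\Phi\!\left(-\sqrt{1/2}\right) \approx 0.24,
\qquad
\Phi\!\left(-\sqrt{17/2}\right) \approx 0.0018 .
```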
For the varying
CRediT authorship contribution statement
P. Vergeer: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization, Supervision. Y. van Schaik: Methodology, Software, Formal analysis, Writing – original draft, Writing – review & editing, Visualization. M. Sjerps: Methodology, Validation, Writing – original draft, Writing – review & editing, Supervision.
Conflict of interest
All co-authors have seen and agree with the contents of the manuscript and there are no conflicts of interest to report.
References (41)
- Reliable support: Measuring calibration of likelihood ratios. Forensic Sci. Int. (2013).
- An online application for the classification and evidence evaluation of forensic glass fragments. Chemom. Intell. Lab. Syst. (2015).
- Implementation and assessment of a likelihood ratio approach for the evaluation of LA-ICP-MS evidence in forensic glass analysis. Sci. Justice (2017).
- Different likelihood ratio approaches to evaluate the strength of evidence of MDMA tablet comparisons. Forensic Sci. Int. (2009).
- Likelihood ratio methods for forensic comparison of evaporated gasoline residues. Sci. Justice (2014).
- A method for forensic gasoline comparison in fire debris samples: a numerical likelihood ratio system. Sci. Justice (2020).
- Class-conditional feature modeling for ignitable liquid classification with substantial substrate contribution in fire debris analysis. Forensic Sci. Int. (2015).
- Model-effects on likelihood ratios for fire debris analysis. Forensic Chem. (2018).
- DNA Commission of the International Society for Forensic Genetics: Recommendations on the validation of software programs performing biostatistical calculations for forensic genetics applications. Forensic Sci. Int. Genet. (2016).
- A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic Sci. Int. (2017).
- Measuring the validity and reliability of forensic likelihood-ratio systems. Sci. Justice.
- Application-independent evaluation of speaker detection. Comput. Speech Lang.
- Testing likelihood ratios produced from complex DNA profiles. Forensic Sci. Int. Genet.
- Why calibrating LR-systems is best practice. A reaction to "The evaluation of evidence for microspectrophotometry data using functional data analysis", in FSI 305. Forensic Sci. Int.
- Score-based likelihood ratios for handwriting evidence. Forensic Sci. Int.
- Are low LRs reliable? Forensic Sci. Int. Genet.
- Are reported likelihood ratios well calibrated? Forensic Sci. Int. Genet. Suppl. Ser.
- Statistical Evidence: A Likelihood Paradigm.
- Fingermark evidence evaluation based on automated fingerprint identification system matching scores: the effect of different types of conditioning on likelihood ratios. J. Forensic Sci.