Measuring calibration of likelihood-ratio systems: A comparison of four metrics, including a new metric devPAV
Introduction
A growing majority of forensic scientists is convinced that forensic evidence should be interpreted by means of a likelihood ratio (LR). An LR is a statistic that discriminates between two propositions and that can be used in Bayes' rule to update the prior odds for these propositions to posterior odds. To advance forensic science, automated systems that calculate likelihood ratios have been proposed over the past few decades for several evidence modalities, e.g. fingerprints [4], [5], glass [6], [7], [8], XTC [9], gasoline [10], [11], ignitable liquid class in fire debris [12], [13], DNA (see for example [14]), etc.
With the arrival of these automated LR-systems, questions about the objective testing of their validity have arisen, and several publications have addressed this issue [3], [14], [15], [16], [17], [18], [19]. A key issue in the testing of validity is the measurement of calibration of the LR-system. It is important that an LR-system is well-calibrated [20]. If it is not, the value of the LR cannot be trusted, and updating by Bayes' rule results in misleadingly large or small posterior odds.
An intuitive explanation of what well-calibrated means is given by [3], [21]. Imagine a weather forecaster who keeps track of his predictions of the probability of rain tomorrow over a sequence of days. Preferably, when examining those days for which his predicted probability of rain was, say, 0.9, he would like it to have rained on 90% of those days. When his predicted probability is (nearly) equal to this relative frequency, the weather forecaster is said to be well-calibrated: his probability statements make 'empirical sense'. A deviation between empirical frequency and prediction may be caused by statistical models or parameter estimates that do not reflect reality well. This is, of course, undesirable: a probability statement of 0.9 that does not mean that it will rain on 90% of such days is very misleading!
In analogy to a dataset of probability statements, a dataset of LR-values may also make 'empirical sense' (or not). Consider the general likelihood ratio formula:

\[ \mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)} \tag{1} \]

where $E$ denotes the evidence and $H_p$ and $H_d$ the prosecution and defense propositions, respectively. Just as the forecaster is well-calibrated if predicted probabilities match empirical frequencies, $P(\text{rain} \mid \text{prediction} = V) = V$, an LR-system is well-calibrated if 'the LR of the LR is the LR':

\[ \frac{P(\mathrm{LR} = LR_0 \mid H_p)}{P(\mathrm{LR} = LR_0 \mid H_d)} = LR_0 \tag{2} \]

for any scalar $LR_0$ [1], [22]. As the likelihood ratio is a comprehensive representation of the evidential value, re-calculating the evidential value with this scalar itself regarded as the evidence should give the same evidential value. This also gives a handle on how to measure calibration of an LR-system: if an empirical deviation from Eq. (2) is found for a dataset of LR-values, this indicates that the LR-system is ill-calibrated. For instance, say we have an LR-system that we apply to a dataset of 100 situations in which $H_p$ is true, and also to 100 situations in which $H_d$ is true. Assume the results are as follows: when $H_d$ is true, 90 of the 100 LR-values have a value of 10, erroneously indicating support for $H_p$; when $H_p$ is true, 50 of the 100 LR-values have a value of 10. We then have an ill-calibrated LR-system, for which a deviation from Eq. (2) is found: the LR according to Eq. (2) associated with $LR_0 = 10$ would be 0.5/0.9 ≈ 0.56, which is quite different from 10 and more reflective of the fact that this value is observed in 90% of instances when $H_d$ is true.
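As a minimal illustration (ours, not code from the paper), the 'LR of the LR' in this toy example can be computed directly:

```python
# Minimal sketch: empirical check of Eq. (2) for the discrete toy example
# above, where LR = 10 occurs with relative frequency 0.5 under Hp and
# 0.9 under Hd.
freq_given_hp = {10: 0.5}  # P(LR = 10 | Hp): 50 of 100 Hp-true cases
freq_given_hd = {10: 0.9}  # P(LR = 10 | Hd): 90 of 100 Hd-true cases

for lr0 in freq_given_hp:
    lr_of_lr = freq_given_hp[lr0] / freq_given_hd[lr0]
    print(f"reported LR = {lr0}, LR of the LR = {lr_of_lr:.2f}")
    # A well-calibrated system would return lr0 itself; here 0.56 != 10.
```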
An LR-system can be ill-calibrated in several ways. In this research, four general types of ill-calibrated LR-sets were investigated by creating them through simulation. In the first set, all LRs are too large; in the second, all LRs are too small; in the third, the LRs are too extreme (i.e. too large for LRs greater than 1 and too small for LRs less than 1); and in the fourth, too weak (for details see Section 2.2). A rough sketch of how such sets can be generated is given below.
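The sketch below is written under our own assumptions, not the authors' Section 2.2 parametrization: the four conditions are mimicked by distorting samples from a well-calibrated Gaussian LLR-system, for which LLR | H_p ~ N(μ, 2μ) and LLR | H_d ~ N(−μ, 2μ), and the shift and scale values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def calibrated_llrs(mu, n_ss, n_ds):
    """Natural-log LRs from a well-calibrated Gaussian LLR-system:
    LLR | Hp ~ N(mu, 2*mu) and LLR | Hd ~ N(-mu, 2*mu)."""
    sd = np.sqrt(2.0 * mu)
    llr_hp = rng.normal(mu, sd, n_ss)    # same-source (Hp-true) comparisons
    llr_hd = rng.normal(-mu, sd, n_ds)   # different-source (Hd-true) comparisons
    return llr_hp, llr_hd

llr_hp, llr_hd = calibrated_llrs(mu=1.0, n_ss=300, n_ds=300)

# Four distortions corresponding to the four ill-calibrated conditions:
shift, scale = 2.0, 2.0
too_large   = (llr_hp + shift, llr_hd + shift)   # all LRs too large
too_small   = (llr_hp - shift, llr_hd - shift)   # all LRs too small
too_extreme = (llr_hp * scale, llr_hd * scale)   # pushed away from LR = 1
too_weak    = (llr_hp / scale, llr_hd / scale)   # pulled towards LR = 1
```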
From the literature [7], [9], [10] it is known that often a small number of LRs in a dataset are ill-calibrated: they constitute misleading evidence with excessive values. Therefore, in a second simulation experiment, this situation was imitated by adding a misleading LR with an excessive value (either under $H_p$ or under $H_d$) to data simulated from a perfectly calibrated LR-system. Together these simulation experiments cover the typical ways in which LR-data can be ill-calibrated.
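Continuing the sketch above, this second experiment amounts to appending a single extreme misleading value to an otherwise calibrated set (the magnitude −10 is our illustrative choice):

```python
# One excessively misleading LR: an LLR of -10 (LR ~ 4.5e-5) in a
# comparison where Hp is in fact true.
llr_hp_contaminated = np.append(llr_hp, -10.0)
```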
One of the examined calibration metrics in this study (see Section 2.3) is defined in terms of misleading evidence. This refers to LRs pointing in the wrong direction, i.e. an LR greater than 1 when in fact the defense proposition is true, or an LR less than 1 when in fact the prosecution proposition is true. A relatively large fraction of misleading LRs in a set may be indicative of ill-calibration.
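In code, the rates of misleading evidence for a labeled set of natural-log LRs can be computed as follows (the names mislHp and mislHd mirror the metric labels used later in the paper; the exact definitions there may differ):

```python
import numpy as np

def misleading_rates(llr_hp, llr_hd):
    """Fractions of misleading natural-log LRs in a labeled LR-set."""
    misl_hp = float(np.mean(np.asarray(llr_hp) < 0.0))  # LR < 1 while Hp true
    misl_hd = float(np.mean(np.asarray(llr_hd) > 0.0))  # LR > 1 while Hd true
    return misl_hp, misl_hd
```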
Before embarking on the study, it is important to define criteria that a good metric has to satisfy. For this study, we defined two such criteria: differentiation and stability.
Obviously, a metric should differentiate between well- and ill-calibrated LR-systems. But when one wants to interpret the outcome of a metric, it also has to be stable. Stability of results is important so that a typical range of values can be defined for well-calibrated LR-systems. Differentiation is measured by comparing metric outputs for well- and ill-calibrated LRs under the same simulation conditions. Stability is measured by comparing metric outputs for well-calibrated systems under different simulation conditions.
There may be ambiguity in the use of the terms discrimination and differentiation. We use the terms as follows. Discrimination is used in the context of the LR-system itself, i.e. the way an LR-system discriminates between $H_p$ and $H_d$. One summary of the discriminating power of an LR-system is the equal error rate. A second, distinct notion is the ability of a calibration metric to tell well- and ill-calibrated LR-systems apart. To avoid confusion, in the context of the performance of a calibration metric we will not speak of discrimination but of 'differentiation', or of the ability to 'differentiate' between well- and ill-calibrated LR-systems.
The interest of this study is how best to measure calibration of LR-systems. Although several studies have used various metrics to measure calibration, as far as the authors are aware their performance has not yet been compared. In this study, four metrics to measure calibration are compared: three from the existing literature [1], [2], [3] and one newly developed method (coined devPAV; a rough sketch of the underlying idea follows below). We compare them by simulating well- and ill-calibrated LR-datasets and studying how well the metrics differentiate between these datasets, and how stable the outcomes are for well-calibrated LR-sets under different conditions. Finally, we discuss limitations of the comparison study and conclude by recommending some metrics for the measurement of calibration and a further comparison study.
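To give an impression of the idea behind devPAV (the precise definition is given in Section 2.3; the summary statistic below is our own plausible reading, not necessarily the paper's exact formula), one can recalibrate the LLRs with the pool-adjacent-violators (PAV) algorithm and summarize how far the recalibrated values deviate from the originals:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression  # a PAV implementation

def pav_calibrated_llrs(llr_hp, llr_hd):
    """PAV-recalibrate natural-log LRs (equal numbers per proposition,
    so that posterior log-odds equal the calibrated LLRs)."""
    llrs = np.concatenate([llr_hp, llr_hd])
    labels = np.concatenate([np.ones_like(llr_hp), np.zeros_like(llr_hd)])
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    p = iso.fit_transform(llrs, labels)   # isotonic estimate of P(Hp | LLR)
    p = np.clip(p, 1e-6, 1 - 1e-6)        # avoid log(0) at the extremes
    return np.log(p / (1.0 - p))

def dev_pav(llr_hp, llr_hd):
    """Illustrative summary: mean absolute difference between the
    pre- and post-PAV LLRs."""
    llrs = np.concatenate([llr_hp, llr_hd])
    return float(np.mean(np.abs(pav_calibrated_llrs(llr_hp, llr_hd) - llrs)))
```

For a well-calibrated set the PAV transform stays close to the identity line and the summary is near 0; distortions such as those described above move it away from 0.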
Methods
In this study calibration metrics will be applied to several datasets of LRs. It is required to know the characteristics of these datasets, so that the desired behavior of the calibration metrics can be defined. We therefore use simulated data, so that the ground truth and the LR-distributions are known.
Results
The results are presented in three sections. First, the differentiation between condition 'P' and the other conditions is studied, by showing results for a fixed set of simulation parameters and n_SS = 300. These results are used to zoom in on the four most promising calibration metrics, for which some further results with n_SS = 300 are shown.
Second, stability of the results with respect to the distribution parameters and n is also important, since this allows for a good interpretation of the value
Discussion and conclusion
In this study, the performances of several calibration metrics were compared with the use of simulated LR-data. The studied calibration metrics were either known for this purpose from the literature (Cllr^cal, mom0, mommin1, mislHp and mislHd) or proposed by us (devPAV). For Gaussian LLR-distributions with varying discrimination (mean parameter ranging from 1 to 17, corresponding to EERs from 24% to 0.18%) and varying sample sizes (number of datapoints under $H_p$ from 50 to 300), the performance of the above metrics was studied.
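The quoted EER values are consistent with the standard well-calibrated Gaussian LLR model (an assumption on our part, not stated explicitly in this snippet), in which the LLR has mean ±μ and variance 2μ under the two propositions; the equal error rate then follows from thresholding the LLR at 0:

```latex
% EER for a calibrated Gaussian LLR model with
% LLR|Hp ~ N(mu, 2*mu) and LLR|Hd ~ N(-mu, 2*mu):
\mathrm{EER} = P(\mathrm{LLR} < 0 \mid H_p)
             = \Phi\!\left(-\sqrt{\mu/2}\right),
\qquad
\Phi\!\left(-\sqrt{1/2}\right) \approx 0.24,
\qquad
\Phi\!\left(-\sqrt{17/2}\right) \approx 0.0018 .
```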
For the varying
CRediT authorship contribution statement
P. Vergeer: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization, Supervision. Y. van Schaik: Methodology, Software, Formal analysis, Writing – original draft, Writing – review & editing, Visualization. M. Sjerps: Methodology, Validation, Writing – original draft, Writing – review & editing, Supervision.
Conflict of interest
All co-authors have seen and agree with the contents of the manuscript and there are no conflicts of interest to report.
References (41)
- Reliable support: Measuring calibration of likelihood ratios. Forensic Sci. Int. (2013).
- An online application for the classification and evidence evaluation of forensic glass fragments. Chemom. Intell. Lab. Syst. (2015).
- Implementation and assessment of a likelihood ratio approach for the evaluation of LA-ICP-MS evidence in forensic glass analysis. Sci. Justice (2017).
- Different likelihood ratio approaches to evaluate the strength of evidence of MDMA tablet comparisons. Forensic Sci. Int. (2009).
- Likelihood ratio methods for forensic comparison of evaporated gasoline residues. Sci. Justice (2014).
- A method for forensic gasoline comparison in fire debris samples: a numerical likelihood ratio system. Sci. Justice (2020).
- Class-conditional feature modeling for ignitable liquid classification with substantial substrate contribution in fire debris analysis. Forensic Sci. Int. (2015).
- Model-effects on likelihood ratios for fire debris analysis. Forensic Chem. (2018).
- DNA Commission of the International Society for Forensic Genetics: Recommendations on the validation of software programs performing biostatistical calculations for forensic genetics applications. Forensic Sci. Int. Genet. (2016).
- A guideline for the validation of likelihood ratio methods used for forensic evidence evaluation. Forensic Sci. Int. (2017).
- Measuring the validity and reliability of forensic likelihood-ratio systems. Sci. Justice.
- Application-independent evaluation of speaker detection. Comput. Speech Lang.
- Testing likelihood ratios produced from complex DNA profiles. Forensic Sci. Int. Genet.
- Why calibrating LR-systems is best practice. A reaction to "The evaluation of evidence for microspectrophotometry data using functional data analysis", in FSI 305. Forensic Sci. Int.
- Score-based likelihood ratios for handwriting evidence. Forensic Sci. Int.
- Are low LRs reliable? Forensic Sci. Int. Genet.
- Are reported likelihood ratios well calibrated? Forensic Sci. Int. Genet. Suppl. Ser.
- Statistical Evidence: A Likelihood Paradigm.
- Fingermark evidence evaluation based on automated fingerprint identification system matching scores: the effect of different types of conditioning on likelihood ratios. J. Forensic Sci.