Measuring calibration of likelihood-ratio systems: A comparison of four metrics, including a new metric devPAV

https://doi.org/10.1016/j.forsciint.2021.110722

Highlights

  • Four metrics that measure calibration of sets of likelihood ratios are compared.

  • Three are taken from the existing literature.

  • A new metric devPAV is introduced.

  • Comparison is based on simulating Gaussian LLR-distributions.

  • devPAV performs as well as or better than the existing metrics.

Abstract

Numerical likelihood-ratio (LR) systems aim to calculate evidential strength for forensic evidence evaluation. Calibration of such LR-systems is essential: one does not want to over- or understate the strength of the evidence. Metrics that measure calibration differ in their sensitivity to calibration errors of such systems. In this paper we compare four calibration metrics in a simulation study based on Gaussian log-LR distributions. Three calibration metrics are taken from the literature (Good, 1985; Royall, 1997; Ramos and Gonzalez-Rodriguez, 2013) [1–3], and a fourth metric is newly proposed by us. We evaluated these metrics by two performance criteria: differentiation (between well- and ill-calibrated LR-systems) and stability (of the value of the metric across a variety of well-calibrated LR-systems). Two metrics from the literature (the expected values of LR and of 1/LR, and the rate of misleading evidence stronger than 2) do not behave as desired in many simulated conditions. The third one (Cllrcal) performs better, but our newly proposed metric (which we coin devPAV) is shown to behave as well as or clearly better under almost all simulated conditions. On the basis of this work, we recommend using both devPAV and Cllrcal to measure calibration of LR-systems, where the current results indicate that devPAV is the preferred metric. In the future, the external validity of this comparison study can be extended by simulating non-Gaussian LR-distributions.

Introduction

An increasing majority of forensic scientists is convinced that forensic evidence should be interpreted using a likelihood ratio (LR). An LR is a statistic that discriminates between two propositions and that can be used in Bayes' rule to update prior odds for these propositions to posterior odds. To advance forensic science, automated systems that calculate likelihood ratios have been proposed in the past few decades for several evidence modalities, e.g. fingerprints [4], [5], glass [6], [7], [8], XTC [9], gasoline [10], [11], ignitable liquid class in fire debris [12], [13], DNA (see for example [14]), etc.

With the arrival of these automated LR-systems, questions about the objective testing of their validity have arisen, and several publications have addressed this issue [3], [14], [15], [16], [17], [18], [19]. A key issue in the testing of validity is the measurement of calibration of the LR-system. It is important that an LR-system is well-calibrated [20]. If it is not, the value of the LR cannot be trusted, and updating by Bayes' rule results in misleadingly large or small posterior odds.

An intuitive explanation of what well-calibrated means is given by [3], [21]. Imagine a weather forecaster who keeps track of his predictions of the probability of rain tomorrow over a sequence of days. When examining those days for which his predicted probability of rain was, say, 0.9, he would like it to have rained on 90% of them. When his predicted probability is (nearly) equal to this relative frequency, the weather forecaster is said to be well-calibrated: his probability statements make ‘empirical sense’. A deviation between empirical frequency and prediction may be caused by statistical models or parameter estimates that do not reflect reality well. This is, of course, undesirable. A probability statement of 0.9 that does not mean that it will rain on 90% of such days is very misleading!

In analogy to a dataset of probability statements, a dataset of LR-values may also make ‘empirical sense’ (or not). Consider the general likelihood ratio formula:

$$ LR(F) := \frac{P(F \mid H_p)}{P(F \mid H_d)} = V, \qquad (1) $$

where $F$ denotes evidence, $H_p$ and $H_d$ the prosecution and defense proposition respectively, and $V$ a particular value of the likelihood ratio. Just as the forecaster is well-calibrated if predicted probabilities $p_o$ match empirical frequencies, $P(\text{rain} \mid p_o = V) = V$, an LR-system $LR_o$ is well-calibrated if ‘the LR of the LR is the LR’:

$$ LR(LR_o = V) := \frac{P(LR_o = V \mid H_p)}{P(LR_o = V \mid H_d)} = V \qquad (2) $$

for any scalar $V$ [1], [22]. As the likelihood ratio is a comprehensive representation of the evidential value, re-calculating the evidential value with this scalar itself taken as the evidence should give the same evidential value. This also gives a handle on how to measure calibration of an LR-system. If an empirical deviation from Eq. (2) is found for a dataset of $LR_o$-values, this indicates that the $LR_o$-system is ill-calibrated. For instance, say we apply an $LR_o$-system to a dataset of 100 situations in which $H_d$ is true and to 100 situations in which $H_p$ is true, and assume the results are as follows: when $H_d$ is true, 90 of the 100 $LR_o$-values equal 10, erroneously indicating support for $H_p$; when $H_p$ is true, 50 of the 100 $LR_o$-values equal 10. We then have an ill-calibrated LR-system: the LR according to Eq. (2) associated with $LR_o = 10$ would be 0.5/0.9 ≈ 0.56, which is quite different from 10 and more reflective of the fact that this value is observed in 90% of instances when $H_d$ is true.
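To make this check concrete, here is a minimal Python sketch (illustrative only, not the paper's code) that computes the empirical ‘LR of the LR’ of Eq. (2) from the hypothetical counts in the example above:

```python
# Minimal sketch: empirical check of Eq. (2) using the hypothetical counts above.
n_hp_total, n_hd_total = 100, 100   # validation cases with Hp true / Hd true
n_hp_at_10, n_hd_at_10 = 50, 90     # cases in which the system outputs LRo = 10

# Empirical "LR of the LR" for LRo = 10: P(LRo = 10 | Hp) / P(LRo = 10 | Hd)
lr_of_lr = (n_hp_at_10 / n_hp_total) / (n_hd_at_10 / n_hd_total)
print(f"LR(LRo = 10) ≈ {lr_of_lr:.2f}")  # ≈ 0.56, far from the reported 10 -> ill-calibrated
```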

An LR-system can be ill-calibrated in several ways. In this research, four general types of ill-calibrated LR-sets were created by simulation and investigated. In the first set, all LRs are too large; in the second, all LRs are too small. In the third, the LRs are too extreme (i.e. too large for LRs greater than 1 and too small for LRs less than 1), and in the fourth, too weak (for details see Section 2.2).
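As an illustration of how such sets can be generated, the sketch below draws natural-log LRs from the standard well-calibrated Gaussian model (LLR | Hp ~ N(μ/2, μ), LLR | Hd ~ N(−μ/2, μ)) and then distorts them in the four ways just described. The shift and scale factors are arbitrary illustrative choices and not necessarily the parameterization used in Section 2.2:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_calibrated_llrs(mu, n, rng):
    """Draw natural-log LRs from the standard well-calibrated Gaussian model:
    LLR | Hp ~ N(mu/2, mu) and LLR | Hd ~ N(-mu/2, mu)."""
    llr_hp = rng.normal(mu / 2.0, np.sqrt(mu), n)
    llr_hd = rng.normal(-mu / 2.0, np.sqrt(mu), n)
    return llr_hp, llr_hd

llr_hp, llr_hd = simulate_calibrated_llrs(mu=6.0, n=300, rng=rng)

# Four illustrative ways to break calibration (shift/scale values are arbitrary):
too_large   = (llr_hp + 2.0, llr_hd + 2.0)   # all LRs too large
too_small   = (llr_hp - 2.0, llr_hd - 2.0)   # all LRs too small
too_extreme = (llr_hp * 1.5, llr_hd * 1.5)   # LLRs stretched away from 0
too_weak    = (llr_hp * 0.5, llr_hd * 0.5)   # LLRs shrunk towards 0
```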

From the literature [7], [9], [10] it is known that often a small number of LRs in the dataset are ill-calibrated: they constitute misleading evidence with excessive values. Therefore, in a second simulation experiment, this situation was imitated by adding a misleading LR with excessive value (either under Hp or under Hd) to data simulated from a perfectly calibrated LR-system. Together these simulation experiments cover the typical ways in which LR-data can be ill-calibrated.
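Continuing the illustrative sketch above, the second experiment can be mimicked by appending a single excessive misleading LLR to an otherwise well-calibrated set; the value of ±10 on the natural-log scale is an arbitrary choice, not the paper's:

```python
# Append one excessive misleading LLR to a well-calibrated set (illustrative values):
llr_hd_with_outlier = np.append(llr_hd, 10.0)    # strong support for Hp although Hd is true
llr_hp_with_outlier = np.append(llr_hp, -10.0)   # strong support for Hd although Hp is true
```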

One of the examined calibration metrics in this study (see Section 2.3) is defined in terms of misleading evidence. This refers to LRs pointing in the wrong direction, i.e. an LR greater than 1 when in fact the defense proposition is true, or an LR less than 1 when in fact the prosecution proposition is true. A relatively large fraction of misleading LRs in a set may be indicative of ill-calibration.
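For illustration, a small sketch of such a rate-of-misleading-evidence computation on natural-log LRs; the threshold log 2 mimics the ‘misleading evidence stronger than 2’ mentioned in the abstract, and the exact definition used in Section 2.3 may differ:

```python
import numpy as np

def misleading_rates(llr_hp, llr_hd, log_threshold=0.0):
    """Fractions of misleading natural-log LRs: LLR < -log_threshold when Hp is true,
    LLR > log_threshold when Hd is true. With log_threshold = np.log(2) this approximates
    the 'rate of misleading evidence stronger than 2' (assumption: natural logs)."""
    misl_hp = float(np.mean(np.asarray(llr_hp) < -log_threshold))
    misl_hd = float(np.mean(np.asarray(llr_hd) > log_threshold))
    return misl_hp, misl_hd
```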

Before embarking on the study, it is important to define criteria that a good metric has to meet. For this study, we defined two such criteria: differentiation and stability.

Obviously, a metric should differentiate between well- and ill-calibrated LR-systems. But when one wants to be able to interpret the outcome of a metric, it also has to be stable. Stability of results is important so that a typical range of values can be defined for well-calibrated LR-systems. Differentiation is measured by comparing metric outputs for well- and ill-calibrated LR-sets under the same simulation conditions. Stability is measured by comparing metric outputs for well-calibrated systems under different simulation conditions.

There may be ambiguity in the use of the terms discrimination and differentiation. We use the terms as follows. Discrimination is used in the context of the LR-system itself, i.e. the way an LR-system discriminates between Hp and Hd. One summary of the discrimination power of an LR-system is the equal error rate. A second type of discrimination is the differentiation of a calibration metric between well- and ill-calibrated LR-systems. To avoid confusion, in the context of the performance of a calibration metric we will not speak of discrimination but of ‘differentiation’ between well- and ill-calibrated LR-systems.

This study focuses on how best to measure the calibration of LR-systems. Although several studies have used various metrics to measure calibration, as far as the authors are aware their performance has not yet been compared. In this study, four metrics to measure calibration are compared: three from the existing literature [1], [2], [3] and one newly developed metric (coined devPAV). We compare them by simulating well- and ill-calibrated LR-datasets and studying how well the metrics differentiate between these datasets, and how stable the outcome is for well-calibrated LR-sets under different conditions. Finally, we discuss the limitations of the comparison study and conclude by recommending metrics for the measurement of calibration and a further comparison study.
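The precise definition of devPAV is given in Section 2.3. As a rough, hedged illustration of the underlying idea only, the sketch below PAV-recalibrates a set of natural-log LRs with scikit-learn's isotonic regression (the PAV algorithm) and summarizes the deviation between the PAV transform and the identity line as a mean absolute deviation; this proxy is our illustrative stand-in, not necessarily the metric as defined in the paper:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression  # implements the PAV algorithm

def pav_llrs(llr_hp, llr_hd, eps=1e-6):
    """PAV-recalibrate natural-log LRs: fit isotonic regression of the labels
    (1 = Hp true, 0 = Hd true) on the LLRs, then convert the fitted posterior
    probabilities back to LLRs by removing the log prior odds of the sample."""
    scores = np.concatenate([llr_hp, llr_hd])
    labels = np.concatenate([np.ones(len(llr_hp)), np.zeros(len(llr_hd))])
    post = IsotonicRegression(out_of_bounds="clip").fit_transform(scores, labels)
    post = np.clip(post, eps, 1.0 - eps)          # avoid infinite LLRs at the extremes
    pav = np.log(post / (1.0 - post)) - np.log(len(llr_hp) / len(llr_hd))
    return pav, scores

def devpav_proxy(llr_hp, llr_hd):
    """Illustrative proxy for devPAV: mean absolute deviation between the PAV
    transform and the identity line (a well-calibrated set stays close to 0)."""
    pav, scores = pav_llrs(llr_hp, llr_hd)
    return float(np.mean(np.abs(pav - scores)))
```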

Section snippets

Methods

In this study calibration metrics will be applied to several datasets of LRs. It is required to know the characteristics of these datasets, so that a desired behavior of the calibration metrics can be defined. We therefore use simulated data, so that the ground truth and the LR-distributions …

Results

The results are presented in three sections. First, the differentiation between condition ‘P’ and the other conditions is studied by showing results for parameters μs = 6 and nSS = 300. These results are used to zoom in on the four most promising calibration metrics. For these four calibration metrics, some results concerning parameters μs = 17 and nSS = 300 are shown.

Second, stability of results with respect to μs and n is also important, since this allows for a good interpretation of the value

Discussion and conclusion

In this study, the performances of several calibration metrics were compared with the use of simulated LR-data. The studied calibration metrics were known for this purpose from the literature (Cllrcal, mom0, mommin1, mislHp and mislHd) or proposed by us (devPAV). For Gaussian LLR-distributions with varying discrimination (μs = 1 to μs = 17, corresponding to EER = 24% to 0.18%) and varying sample sizes (number of datapoints under Hp = 50 to 300), the performance of the above metrics was studied.

For the varying

CRediT authorship contribution statement

P. Vergeer: conceptualization, methodology, software, validation, formal analysis, writing – original draft, writing – review & editing, visualization, supervision. Y. van Schaik: methodology, software, formal analysis, writing – original draft, writing – review & editing, visualization. M. Sjerps: methodology, validation, writing – original draft, writing – review & editing, supervision.

Conflict of interest

All co-authors have seen and agree with the contents of the manuscript and there are no conflicts of interest to report.

References (41)

  • G.S. Morrison, Measuring the validity and reliability of forensic likelihood-ratio systems, Sci. Justice (2011).

  • N. Brümmer et al., Application-independent evaluation of speaker detection, Comput. Speech Lang. (2006).

  • D. Taylor et al., Testing likelihood ratios produced from complex DNA profiles, Forensic Sci. Int. Genet. (2015).

  • P. Vergeer et al., Why calibrating LR-systems is best practice. A reaction to “The evaluation of evidence for microspectrophotometry data using functional data analysis”, in FSI 305, Forensic Sci. Int. (2020).

  • A.B. Hepler et al., Score-based likelihood ratios for handwriting evidence, Forensic Sci. Int. (2012).

  • J.S. Buckleton et al., Are low LRs reliable?, Forensic Sci. Int. Genet. (2020).

  • J. Hannig et al., Are reported likelihood ratios well calibrated?, Forensic Sci. Int. Genet. Suppl. Ser. (2019).

  • I.J. Good, Weight of Evidence: A Brief Survey, Bayesian Statistics 2, 1985,...

  • R. Royall, Statistical Evidence: A Likelihood Paradigm (1997).

  • I. Alberink et al., Fingermark evidence evaluation based on automated fingerprint identification system matching scores: the effect of different types of conditioning on likelihood ratios, J. Forensic Sci. (2014).