An overview of log likelihood ratio cost in forensic science – Where is it used and what values can we expect?

There is increasing support for reporting evidential strength as a likelihood ratio (LR) and increasing interest in (semi-)automated LR systems. The log-likelihood ratio cost (Cllr) is a popular metric for such systems, penalizing misleading LRs more heavily the further they are from 1. Cllr = 0 indicates a perfect system, while Cllr = 1 indicates an uninformative one. Beyond these anchors, however, what constitutes a "good" Cllr is unclear. Aiming to provide guidance on when a Cllr is "good", we studied 136 publications on (semi-)automated LR systems. Results show that use of the Cllr depends heavily on the field, the metric being absent, for example, in DNA analysis. Despite a growing number of publications on automated LR systems over time, the proportion reporting a Cllr remains stable. Notably, Cllr values lack clear patterns and depend on the area, analysis and dataset. As LR systems become more prevalent, comparing them becomes crucial. Such comparison is hampered by different studies using different datasets. We advocate the use of public benchmark datasets to advance the field.


Introduction
In forensic science, support for using likelihood ratios (LRs) to assess the strength of forensic evidence has been growing [1][2][3][4][5][6][7]. Consequently, the issue of evaluating the performance of LR systems has been addressed in the recent past, focusing on their validation for use in casework [8]. In particular, several performance metrics have been proposed. The scientific community and forensic practitioners notably use the log-likelihood ratio cost, also known as Cllr, which was initially introduced in Ref. [9] in the context of likelihood-ratio-based speaker verifiers and subsequently adapted for forensic speaker recognition [10]. However, it is important to note that the use of the Cllr extends beyond speech-based systems and can be applied to any method that produces LRs.
The Cllr is defined as:

Cllr = 1/2 · [ 1/N_H1 · Σ_i log2(1 + 1/LR_H1,i) + 1/N_H2 · Σ_j log2(1 + LR_H2,j) ]

Here, N_H1 is the number of samples for which H1 is true, N_H2 is the number of samples for which H2 is true, LR_H1,i are the LR values predicted by the system for samples where H1 is true, and LR_H2,j are the LR values predicted by the system for samples where H2 is true. Hence, when we have a set of empirical LR values predicted by a certain LR system and the corresponding ground truth labels are available, we can calculate a Cllr. Furthermore, the metric can be split into two parts, giving an indication of the error due to imperfect discrimination ('do H1-true samples get a higher LR than H2-true samples?') and imperfect calibration ('is the value of the assigned LR correct, neither under- nor overstating the evidence?'). These two metrics can be calculated by applying the Pool Adjacent Violators (PAV) algorithm [11,12] on the evaluation set, mimicking 'perfect' calibration [9], and re-calculating the Cllr. The resulting value is taken as an assessment of discrimination (called Cllr_min), and the difference (Cllr_cal = Cllr − Cllr_min) as an assessment of calibration [13,14]. It is worth highlighting some advantages and drawbacks of the Cllr as a performance metric for LR systems. Among its advantages, the Cllr is a so-called 'strictly proper scoring rule', possessing favorable mathematical properties such as a probabilistic and information-theoretical interpretation. As already mentioned, the metric provides an indication of both the calibration and the discriminating power of a method, allowing separate estimation of these two aspects of performance [15]. It thus considers not just whether an evaluation was misleading (i.e. the LR supports the wrong hypothesis), but also the associated evidential strength (e.g. a misleading LR = 100 is worse than a misleading LR = 2). It furthermore gives forensic practitioners an incentive to offer accurate and truthful LRs, a critical aspect in a field where inaccurate or biased LRs can have significant implications for the criminal justice system, and it imposes strong penalties on highly misleading LRs. In addition, the Cllr is a scalar that can easily be thresholded for validation, ensuring comparability between different systems, methods, and setups.
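To make the definition concrete, the following minimal sketch computes the Cllr directly from two sets of empirical LRs (the function name and input format are ours, not taken from any standard package):

```python
import numpy as np

def cllr(lrs_h1, lrs_h2):
    """Log-likelihood-ratio cost from empirical LRs.

    lrs_h1: LRs assigned to samples for which H1 is actually true.
    lrs_h2: LRs assigned to samples for which H2 is actually true.
    """
    lrs_h1 = np.asarray(lrs_h1, dtype=float)
    lrs_h2 = np.asarray(lrs_h2, dtype=float)
    # H1-true samples are penalized for small LRs, H2-true samples for
    # large LRs; the penalty grows with how misleading the LR is.
    term_h1 = np.mean(np.log2(1.0 + 1.0 / lrs_h1))
    term_h2 = np.mean(np.log2(1.0 + lrs_h2))
    return 0.5 * (term_h1 + term_h2)
```

Note that a system that always returns LR = 1 scores exactly Cllr = 1 under this definition, matching the 'uninformative' anchor, while strongly misleading LRs can push the value above 1.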
However, the Cllr also has limitations. Primarily, like any empirical performance measure, it requires an empirical set of LRs, introducing challenges in selecting the database used to generate these LRs. Ideally, these databases should resemble actual casework conditions, but such data is often limited, sometimes requiring a two-stage validation procedure using laboratory-collected data. Additionally, the Cllr is affected by small sample size effects, i.e., scarcity in empirically generated LRs, potentially leading to unreliable performance measurements. Data handling is a crucial concern in any validation process [8], as emphasized in many publications in the field [16]. Although they certainly impact the Cllr, these problems are common to all empirical evaluation metrics.
Another significant limitation of the Cllr is that, as a scalar, it provides a highly condensed statistic of model performance. In certain situations, further analysis may be beneficial to detect and correct model issues, which has led to the proposal of alternative or specialized measures, as described below. The Cllr value can be split as the sum of Cllr_cal and Cllr_min, describing the calibration error and the discrimination error separately. While a large Cllr_cal value indicates that an LR system overstates and/or understates the evidential strength, it does not tell us by how much or for which evidence class this tends to happen [17]. In addition, the Cllr weighs the two different types of misleading evidence (misleadingly supporting H1 or H2) symmetrically, using a logarithmic scoring rule. One could debate whether or not this is appropriate in a forensic setting. Finally, the interpretation of the actual numerical value of the Cllr is not intuitive. Although lower is better and the value should be below 1, many researchers struggle with whether a value of, say, 0.3 can be considered 'good'.
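The decomposition can be sketched as follows, with a hand-rolled PAV step. This is a simplified illustration under equal-weight assumptions; the names, the clipping constant eps, and the prior-odds correction are our own choices, not a reference implementation:

```python
import numpy as np

def _pav(y):
    """Pool Adjacent Violators: non-decreasing least-squares fit to y."""
    means, weights = [], []
    for v in y:
        means.append(float(v))
        weights.append(1)
        # merge the last two blocks while they violate monotonicity
        while len(means) > 1 and means[-2] > means[-1]:
            w = weights[-2] + weights[-1]
            means[-2] = (means[-2] * weights[-2] + means[-1] * weights[-1]) / w
            weights[-2] = w
            means.pop()
            weights.pop()
    return np.repeat(means, weights)

def cllr_and_min(lrs_h1, lrs_h2, eps=1e-6):
    """Return (Cllr, Cllr_min); Cllr_cal is their difference."""
    def _cllr(l1, l2):
        return 0.5 * (np.mean(np.log2(1 + 1 / l1)) + np.mean(np.log2(1 + l2)))

    l1 = np.asarray(lrs_h1, dtype=float)
    l2 = np.asarray(lrs_h2, dtype=float)
    total = _cllr(l1, l2)
    # 'Perfect' calibration: sort all log-LRs and isotonic-fit the labels,
    # yielding monotone posterior probabilities of H1 on the evaluation set.
    scores = np.concatenate([np.log(l1), np.log(l2)])
    labels = np.concatenate([np.ones(len(l1)), np.zeros(len(l2))])
    order = np.argsort(scores, kind="stable")
    post = np.clip(_pav(labels[order]), eps, 1 - eps)
    # posterior odds -> LR by dividing out the evaluation-set priors N_H1:N_H2
    lrs_cal = (post / (1 - post)) * (len(l2) / len(l1))
    lab = labels[order]
    return total, _cllr(lrs_cal[lab == 1], lrs_cal[lab == 0])
```

For a toy example in which the two LR sets are well separated, cllr_and_min returns a Cllr_min close to zero, so nearly all of the remaining loss is attributed to calibration error.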
Various alternative performance metrics and representations have been proposed in the literature. Firstly, a more comprehensive picture can be obtained by looking at the full distribution of LRs under H1 and H2, e.g. using Tippett plots [8]. Similarly, the Empirical Cross-Entropy plot (ECE plot) is often inspected, as it generalizes the Cllr to unequal prior odds. Furthermore, many metrics exist that focus solely on either discriminating power or calibration. In the former category fall commonly used scalars such as accuracy or the false positive rate, and representations such as the Receiver Operating Characteristic (ROC) curve or its normalized version, the Detection Error Tradeoff (DET). ROC curves enable the computation of the Area Under the Curve (AUC), summarizing the discriminating power of a given method [15]. More recently, tools for assessing the calibration of LR systems have been proposed in the form of fiducial calibration discrepancies and corresponding plots [17], and the metric devPAV [18].
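As an illustration of the first of these representations, the two empirical curves that a Tippett plot displays (the proportions of H1-true and H2-true LRs exceeding each threshold) can be computed in a few lines; the function name and grid choice here are ours:

```python
import numpy as np

def tippett_curves(lrs_h1, lrs_h2, n_points=200):
    """Data behind a Tippett plot: for each log10(LR) threshold t, the
    proportion of H1-true and of H2-true LRs strictly greater than t."""
    logs1 = np.log10(np.asarray(lrs_h1, dtype=float))
    logs2 = np.log10(np.asarray(lrs_h2, dtype=float))
    lo = min(logs1.min(), logs2.min())
    hi = max(logs1.max(), logs2.max())
    grid = np.linspace(lo, hi, n_points)
    prop_h1 = np.array([(logs1 > t).mean() for t in grid])
    prop_h2 = np.array([(logs2 > t).mean() for t in grid])
    return grid, prop_h1, prop_h2
```

Plotting prop_h1 and prop_h2 against grid gives the familiar pair of decreasing curves; their values at log10(LR) = 0 relate directly to the rates of misleading evidence under the two hypotheses.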
This paper focuses on the practical use of the Cllr. We note an increasing number of scientific studies proposing LRs across various disciplines. Therefore, one may expect a corresponding increase in the use of the Cllr. We conduct a systematic review of these studies, aiming to provide the forensic community with some intuition for what numerical values of the metric to expect or aim for. This may provide some guidance on how well a particular LR system performs beyond the anchors of Cllr = 0, indicating a perfect system, and Cllr = 1, indicating an uninformative system equivalent to one that always returns LR = 1. Additionally, we give an overview of the usage of this metric over time and across different forensic disciplines.

Materials and methods
We performed a literature search on October 31st, 2022 using a keyword search on the University of Amsterdam library and Scopus databases (Table 1). We took April 2006 as the start of the search range, since one of the first papers proposing the use of the Cllr metric to measure LR system performance was published in this month [9]. Hence, only English publications published between April 2006 and October 30th, 2022 were included in this review. In addition, we checked the references and citations of every selected paper to find additional relevant literature. We did this manually, but also performed an automated network search via the online tools Inciteful.xyz and connectedpapers.com. All articles were screened against several inclusion and exclusion criteria (Table 2).
If articles reported multiple Cllr values, the most forensically relevant Cllr was chosen, meaning the Cllr value resulting from evaluation on data and conditions most closely resembling casework in the view of the authors. If multiple models were evaluated on the same forensically relevant dataset, the Cllr of the best-performing model was selected. If Cllr values were reported for multiple forensically relevant datasets, these values were included separately, as they can be viewed as independent evaluations.

Proportion of publications reporting Cllr
In total, we found 136 publications on (semi-)automated LR systems [9,10,13,14,…], of which 80 (58.8 %) reported a Cllr value in some way, either by giving an explicit (list of) value(s) or by plotting an ECE plot from which the Cllr value could be read. Out of the 80 publications that reported a Cllr, 45 (56.3 %) also reported a Cllr_min, making it possible to differentiate between discrimination and calibration error. As shown in Fig. 1, the number of publications on (semi-)automated LR systems as well as the proportion of reported Cllr values differs widely per forensic expertise area. The various areas included were chosen based on a manual clustering of the selected literature, taking into account common forensic expertise areas in the literature and in forensic practice. While the exact clustering is necessarily to some degree arbitrary, it is nevertheless helpful for illustrating patterns across the various forensic sciences.
From Fig. 1 it can be seen that by far the largest number of publications on (semi-)automated LR systems is in the area of speaker recognition. The proportion of publications reporting a Cllr in this area is also quite high (93.3 %). This makes sense, given that speaker recognition is one of the most developed areas with respect to automated LR systems and is the forensic area in which the application of the Cllr was first proposed [9]. A relatively high proportion of publications report Cllr values in the area of materials and microtraces (81.0 %), which mainly includes publications on the individualisation of traces such as glass, paint and other substances. The area with the third largest number of publications is forensic DNA analysis. This is mainly due to the large number of probabilistic genotyping models that have been developed for the interpretation of DNA evidence (e.g. STRMix [111], EuroForMix [158], DNAxs [159], etc.). The complete absence of publications reporting a Cllr in this area (0 %) is striking. For stylometric analysis (100.0 %) and face comparison (85.7 %) the Cllr seems to be frequently used as well, although it should be noted that all publications found on stylometric analysis were written by the same author(s). In fingermark comparison, the Cllr seems to be used less often (50.0 %). Publications from the remaining expertise areas, in which fewer than 5 publications could be found, were pooled together in a single "Other" category. For this category, a bit less than half (44.0 %) of the publications use the Cllr metric.

Proportion of publications reporting Cllr over time
Besides looking at the differences between forensic areas, it is also interesting to look at the development of the application of the Cllr over time. Fig. 2 shows the number of publications on (semi-)automated LR systems and the proportion reporting a Cllr per year. There is a clear upward trend in the number of publications per year on (semi-)automated LR systems. Until 2018, no significant change is visible in the proportion of papers using the Cllr metric for evaluation. After 2018, however, the proportion of papers using the metric seems to be much smaller than in the period before. This is to some extent due to an increased number of publications in the DNA analysis field in the last couple of years. When filtering out all DNA publications (Fig. 3), the difference between the periods becomes smaller, but remains present.

Comparison of Cllr values between forensic expertise areas
From the 80 publications on (semi-)automated LR systems that reported one or more Cllr values, we selected 95 forensically relevant Cllr values. Fig. 4 shows the distribution of forensically relevant Cllr values per forensic expertise area, including only the best values per unique dataset. Because different models evaluated on the same dataset can be compared, and one would naturally pick the best model for application in practice, only values from the best model per dataset are shown. Note that Fig. 4 shows the best values per distinct dataset in a certain area, whereas Fig. 1 shows the number of unique publications per area. This explains why stylometric comparison, but not signature comparison, is shown separately in Fig. 1, even though the former has fewer data points in Fig. 4: for the former, we found 8 publications with 3 Cllr values on distinct datasets; for the latter, 3 publications with 8 Cllr values on distinct datasets.
From Fig. 4 it can be seen that there is quite a lot of variance in Cllr values within every area. Even in relatively well-defined areas such as speaker recognition and face comparison, Cllr values are distributed between [0.0002, 0.98] and [0.12, 1.662], respectively. The values between the areas do follow, to some extent, the expected pattern. For example, we would expect much more discriminating results in materials and microtraces than in speaker recognition, and thus lower Cllr values. In contrast, for signature comparison, we would expect higher Cllr values. This is clear from the figure, with Cllr values predominantly below 0.4 for the former, ranging across [0, 1] for speaker recognition, and mainly around 0.6 for signature comparison. The literature does allow for drawing tentative conclusions on which Cllr values one may expect in general for a state-of-the-art system per area. We should be very careful with any conclusions though, as the values reported may fluctuate more because of the dataset queried than the method employed. Unfortunately, for many areas, such as drug analysis and forensic ballistics, only one or two Cllr values are available, making it hard to draw any conclusions. This also makes it more difficult to observe any clear pattern in Cllr values in general.

Discussion
There are slight differences visible between the Cllr values reported in different forensic expertise areas, with fields that generally provide higher discriminating power generally showing lower Cllr values. However, the abundance of diverse datasets used for evaluation and the relative scarcity of reported Cllr values make it difficult to distinguish any clear patterns. This is a disappointing finding, as the current state of the literature thus does not allow for further intuition on what constitutes a 'good' Cllr. To address the noise caused by evaluation on these diverse datasets, different systems should be evaluated on standard, forensically relevant benchmark datasets, which would allow for a much more direct comparison between systems. The set of Cllr reference values resulting from such evaluation could serve as a valuable resource for assessing the quality of (semi-)automated systems developed in the future.
There is no objective system for assessing whether a dataset is 'forensically relevant'. For this study, we subjectively assessed whether evaluations were conducted under circumstances resembling casework conditions. In this way, we aimed to get at least a global view of the Cllr values that can be expected for a system in practice, while mitigating the confounding impact of unrealistic setups. Nevertheless, what constitutes forensic relevance remains a subjective decision that may differ per jurisdiction or organization.
Even though the number of publications on (semi-)automated LR systems has been increasing since 2006, the use of the Cllr metric has remained relatively constant. It is challenging to explain this phenomenon, but it is plausible that the stability of metric usage is attributable to an existing user base that has been publishing more over the years, rather than to widespread adoption of the metric across the field.
In addition, it is worth noting the seemingly non-existent use of the Cllr in forensic DNA analysis, especially since its use seems particularly well suited to forensic DNA comparison, given the well-established Bayesian reasoning methodology in the field. One possible reason is that there are cases where the Cllr may not be very relevant. For instance, in cases without contamination, mixtures, dropout, etc., and with abundant samples, DNA LR values generally present high discriminating power. As the Cllr is an empirical measure, generating a sufficiently large number of LR values is necessary to obtain statistically stable Cllr values, and in cases with very high discriminating power, all Cllr values may be numerically zero or very close to zero. This reduces the usefulness of the metric. Another reason could be that, because the metric has historically been rarely used in the field, subsequent publications may also overlook it.
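The small-sample point can be illustrated with a toy simulation (our own construction, not from the reviewed literature): drawing log-LRs from a roughly calibrated Gaussian score model shows that Cllr estimates based on a few LRs scatter far more than those based on many.

```python
import numpy as np

rng = np.random.default_rng(0)

def cllr(lrs_h1, lrs_h2):
    lrs_h1, lrs_h2 = np.asarray(lrs_h1), np.asarray(lrs_h2)
    return 0.5 * (np.mean(np.log2(1 + 1 / lrs_h1)) + np.mean(np.log2(1 + lrs_h2)))

def cllr_spread(n, reps=200):
    """Std of Cllr estimates when each evaluation uses n LRs per hypothesis.
    Toy model: natural-log LRs ~ N(+1, 2) under H1 and N(-1, 2) under H2
    (variance twice the |mean| keeps the LRs approximately calibrated)."""
    vals = []
    for _ in range(reps):
        lr_h1 = np.exp(rng.normal(1.0, np.sqrt(2.0), n))
        lr_h2 = np.exp(rng.normal(-1.0, np.sqrt(2.0), n))
        vals.append(cllr(lr_h1, lr_h2))
    return float(np.std(vals))

spread_small = cllr_spread(10)    # few LRs per evaluation
spread_large = cllr_spread(1000)  # many LRs per evaluation
```

In this toy setting the estimate from 10 LRs per hypothesis is roughly an order of magnitude noisier than the one from 1000, which is the kind of instability that makes single Cllr values from small validation sets hard to interpret.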
Although much attention in forensic DNA analysis is devoted to validation, there is little emphasis on specific metrics [160][161][162][163]. Usually, well-known metrics are employed, e.g. ROC, AUC, false positive rates (given a certain LR threshold), or rates of misleading evidence [164][165][166]. We have not seen metrics specifically for calibration, even though the concept is certainly recognized in the field [138,167,168]. There may thus be a role to play for strictly proper scoring rules, which can comprehensively indicate the performance of an LR system.

Conclusions
The aim of this review was to gain insight into what Cllr values can be expected for state-of-the-art systems in the forensic sciences, and to investigate the current application of the log-likelihood ratio cost (Cllr) in the evaluation of (semi-)automated likelihood ratio (LR) systems. While no claim is made that all existing relevant literature was found, considerable effort was put into getting as complete a picture as possible. We found that the use of the Cllr metric heavily depends on the field, with the metric being conspicuously absent in the forensic DNA analysis field. In addition, we found that the number of publications on (semi-)automated LR systems has increased over the years, but the use of the Cllr metric has remained more constant over time. The results do not show a clear pattern for what Cllr values can be expected, and values vary considerably between forensic expertise areas, types of analyses, and datasets. Hence, we cannot establish a clear range of variation in general to assess the goodness of a system computing LR values. We set out to investigate whether we could get a feeling for what a 'good' Cllr value is. We are not quite there yet. A path forward may be provided by benchmarks, i.e. publicly available datasets and evaluation protocols. Such benchmarks are standard practice in many fields of machine learning, are also used in several forensic fields [169], and allow for a much more direct comparison between systems. Given the trend that the number of LR systems developed and used increases yearly, the ability to properly evaluate them will be increasingly important. Thus, setting up and curating such benchmark datasets, and establishing suitable evaluation protocols, is an investment in the future of the field.

Fig. 1. The number of publications on (semi-)automated likelihood ratio (LR) systems per forensic expertise area. The total number of publications per area is indicated by a dashed bar and the number of publications reporting one or more Cllr values per area is indicated by a solid bar. The bars are overlapped, not stacked. The areas are ordered by the absolute number of publications on (semi-)automated LR systems. The percentage on top of each bar indicates the proportion of publications reporting a Cllr value in that area. All publications from areas with fewer than 5 publications were pooled together in a single "Other" category.

Fig. 2. The number of publications on (semi-)automated likelihood ratio (LR) systems per year. The total number of publications per year is indicated by a dashed bar and the number of publications reporting Cllr values per year is indicated by a solid bar. The bars are overlapped, not stacked. The percentage of papers reporting a Cllr is printed on top of each bar.

Fig. 3. The number of publications on (semi-)automated likelihood ratio (LR) systems per year, excluding all publications in the area of DNA analysis. The total number of publications per year is indicated by a dashed bar and the number of publications reporting Cllr values per year is indicated by a solid bar. The bars are overlapped, not stacked. The percentage of papers reporting a Cllr is printed on top of each bar.

Fig. 4. The best values per dataset (lower Cllr values indicate better performance) of the selected forensically relevant Cllr values per forensic expertise area. The number in brackets indicates the number of Cllr values plotted.
All code for plotting and analysis, written using libraries in Python 3.10 [26], as well as the dataset itself, is available in the supplementary information.

Table 2
Inclusion and exclusion criteria used for systematic literature search.