Feeling the future of eyewitness research

For decades, eyewitness memory research has had the worthy goal of minimizing the chances that an innocent suspect is falsely identified. However, this is not the only goal. Partial receiver operating characteristic (ROC) curves provide a way to identify lineup procedures that keep the false alarm rate low while also maximizing the hit rate. Recently, there have been attempts to extend the ROC curve into high false alarm rate regions that fair lineups are intentionally designed to avoid. These new full ROCs could provide a way for the police to circumvent the protections offered by fillers in a fair lineup. Moreover, these attempts to extend the ROC curve are not based on a mathematically coherent model of latent diagnostic signals. In this article, we empirically demonstrate how this lack of a solid foundation can lead to dubious conclusions, such as eyewitnesses possessing precognition and being able to reliably identify the person they will see commit a crime in the future.

After seeing a crime, witnesses are often asked by police officers to attempt to identify the perpetrator from a lineup. A lineup in the United States typically consists of one suspect, who may be innocent or guilty, and five people who share the same general features with the suspect but are known to be innocent, often called fillers or foils. Constructing a lineup in this way aligns with recommendations for collecting eyewitness identification evidence, which state that "there should be only one suspect per lineup, and the lineup should contain at least five appropriate fillers who do not make the suspect stand out in the lineup based on physical appearances or other contextual factors such as clothing or background" (Wells et al., 2020, p. 17). Fillers are known-innocent members in the lineup because they come from various face databases rather than databases of potential suspects.
The goal of the eyewitness is to identify the perpetrator if present in the lineup. However, police often investigate the wrong person and place an innocent suspect in the lineup instead. When this happens, the goal of the eyewitness is to identify no one, but eyewitnesses regularly do not achieve this goal. Since 1989, over 258 wrongful convictions involving eyewitness misidentification have been overturned in the United States (Innocence Project, 2023). The number of wrongful convictions is likely considerably higher, however, because these are only the subset of wrongful convictions based on eyewitness misidentification that have so far been overturned due to DNA evidence. These exonerees spent, on average, 14 years in prison for a crime that they did not commit, and the real perpetrators remained free to commit additional crimes, including murder and sexual assault (Innocence Project, 2023).
To reduce the costs of eyewitness misidentifications, researchers have long focused on the worthy goal of developing lineup procedures that reduce their likelihood. Some prominent examples include presenting lineup images sequentially rather than all at once (Lindsay & Wells, 1985; see also Lindsay et al., 2009), ensuring that the lineup is fair and that the suspect does not stand out (Lindsay & Wells, 1980), using pre-lineup instructions to warn the witness that the real perpetrator may or may not be in the lineup (Malpass & Devine, 1981), and using double-blind administration, whereby the lineup administrator does not know who the police suspect is, so that the administrator cannot (accidentally or deliberately) steer the witness to select the suspect (e.g., Charman & Quiroz, 2016; Greathouse & Kovera, 2009; Wells et al., 1998; Zimmerman et al., 2017; for a review see Kovera & Evelo, 2017). Most of these recommendations are still widely accepted by the research community and feature prominently in evidence-based best practice guidelines written by psychological scientists for the legal system (e.g., Seale-Carlisle et al., 2024; Wells et al., 2020).
Researchers have also developed novel procedures for specific witness populations that may be particularly prone to making misidentifications of innocent suspects. For example, the wildcard lineup includes an additional photograph of a silhouetted figure with a large question mark superimposed to make it easier for children to state that the real perpetrator may not be in the lineup (Zajac & Karageorge, 2009). Enhanced non-biased lineup instructions were developed to help older adults remember that the perpetrator may not be in the lineup (Wilcock et al., 2005). In short, the primary goal of eyewitness researchers has long been to reduce the harm caused by misidentifications of innocent suspects. Given the profound societal, economic, and psychological impacts of misidentifying innocent suspects, minimizing the false identification rate is an understandable goal for legal policymakers.
Yet minimizing misidentifications of innocent suspects should not be the only goal. A witness also needs to be able to identify the perpetrator if present in the lineup. More recently, experimental psychologists noticed that many of the recommended lineup procedures reduced the misidentification rate and protected innocent suspects, but did so at the cost of also reducing the correct identification rate, thereby protecting guilty suspects (e.g., Clark, 2012). In 2012, Mickes, Flowe, and Wixted introduced receiver operating characteristic (ROC) analysis to the field of eyewitness identification to better conceptualize this tradeoff. The benefit of ROC analysis is that it allows researchers to investigate procedures that minimize misidentifications of innocent suspects (the longstanding goal) and simultaneously maximize correct identifications of guilty suspects. From here on, we use the term false alarms to refer to identifications of innocent suspects and hits to refer to identifications of guilty suspects.
To plot an ROC curve, the cumulative proportions of hits and false alarms are plotted over confidence judgments, from the most conservative judgments (100% confidence) to the most liberal judgments (0-100% confidence). Hit rates (plotted on the y-axis) are calculated by dividing the number of guilty suspects identified by the total number of target-present lineups (i.e., lineups containing the guilty suspect). False alarm rates (plotted on the x-axis) are calculated by dividing the number of innocent suspects identified by the total number of target-absent lineups (i.e., lineups containing the innocent suspect). The hit and false alarm rates are influenced by filler IDs (i.e., the eyewitness identifies a filler rather than the suspect) and lineup rejections (i.e., the eyewitness says the perpetrator is not in the lineup) because the total number of target-present or target-absent lineups is included in the denominators. However, similar to the diagnosticity ratio comparison approach used in earlier work (e.g., Lindsay & Wells, 1985), the ROC does not specifically differentiate filler identifications from lineup rejections. Moreover, it is important to reiterate that filler IDs are not considered false alarms because fillers are known to be innocent, and their identification does not directly imperil innocent suspects.
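The bookkeeping described above can be sketched in a few lines of Python. The counts below are hypothetical; a real analysis would use observed identification frequencies from an experiment.

```python
# Sketch of cumulative ROC points for suspect IDs (hypothetical counts).
# Confidence bins run from most conservative (high) to most liberal (low).
n_tp = 500  # target-present lineups
n_ta = 500  # target-absent lineups

guilty_ids = [120, 80, 50]   # guilty-suspect IDs per confidence bin (TP lineups)
innocent_ids = [10, 20, 35]  # innocent-suspect IDs per confidence bin (TA lineups)

hit_rates, fa_rates = [], []
cum_hits, cum_fas = 0, 0
for hits, fas in zip(guilty_ids, innocent_ids):
    cum_hits += hits
    cum_fas += fas
    # Filler IDs and lineup rejections affect these rates only through
    # the denominators (all TP or all TA lineups), as described above.
    hit_rates.append(cum_hits / n_tp)
    fa_rates.append(cum_fas / n_ta)

print(list(zip(fa_rates, hit_rates)))  # leftmost (most conservative) point first
```

Because the rates are cumulative, each successive point can only move up and to the right of the previous one.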
Policymakers should opt for the lineup procedure that yields a higher ROC curve. This is because the lineup procedure that yields a higher ROC curve can achieve a higher hit rate at whichever false alarm rate is preferred. Consider the two hypothetical ROC curves shown in Fig. 1. If policymakers decide that a false alarm rate of ~0.05 is desirable, a hit rate of ~0.27 can be achieved with the d' = 1 ROC curve shown in open circles. However, at a false alarm rate of ~0.05, a hit rate of ~0.64 can be achieved with the d' = 2 ROC curve shown in closed circles. This is the value of choosing the lineup procedure on the higher ROC curve: it can accomplish the goal of keeping the false alarm rate low while also maximizing the hit rate. This is analogous to the logic used to advocate for the use of higher power in experimental research. For whichever false alarm rate is preferred (e.g., an alpha level of 0.05), increasing sample size leads to higher power, which fundamentally is a higher hit rate (Button et al., 2013; Wilson & Wixted, 2018; Witt, 2019).
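Hit rates like those read off Fig. 1 follow from the equal-variance Gaussian signal detection model. As a rough sketch (ignoring the complications that lineups add over a simple yes/no recognition task), the hit rate achievable at a chosen false alarm rate can be computed as:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def hit_rate(d_prime, far):
    """Hit rate achievable at a given false alarm rate under the
    equal-variance Gaussian signal detection model."""
    criterion = Z.inv_cdf(1 - far)   # criterion placement implied by the FAR
    return Z.cdf(d_prime - criterion)

print(round(hit_rate(1, 0.05), 2))  # ~0.26 for the d' = 1 curve
print(round(hit_rate(2, 0.05), 2))  # ~0.64 for the d' = 2 curve
```

At a fixed false alarm rate, the procedure with the higher d' yields the higher hit rate, which is the point the figure illustrates.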
One way to statistically compare two ROC curves is to compute the area underneath the ROC curve. Typical ROCs for fair double-blind lineups are truncated (i.e., they do not span to [1,1] in ROC space, where hit rate = 1 and false alarm rate = 1), even when witness responding is the most liberal. That is, even if every witness made an identification, some identifications would fall on filler faces in the target-present lineup, unless memory was perfect. Likewise, in a fair target-absent lineup, the fillers all match the description of the perpetrator and therefore resemble each other. This means that an innocent suspect is just as likely to be misidentified as one of the fillers. Therefore, the false alarm rate is limited to 1/k, where k is the number of lineup members. Offering such protection to an innocent suspect is the whole idea behind using a fair lineup (rather than a showup, which contains only the suspect, for example). Because lineup ROCs are truncated, the partial area underneath the curve (pAUC) is computed. Comparing the pAUC for two procedures within the range of false alarm rates that can be achieved in practice on a lineup task gives a theory-free and base-rate-independent way of determining which lineup procedure best enables witnesses to tell the difference between innocent and guilty suspects.
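A minimal pAUC computation, truncating at the maximum false alarm rate a fair lineup allows, might look like the sketch below. The ROC points are hypothetical, and published analyses typically use dedicated software with interpolation at the truncation boundary; this sketch simply drops points beyond it.

```python
def pauc(fa_rates, hit_rates, far_max):
    """Partial area under a piecewise-linear ROC from FAR = 0 up to
    far_max, by trapezoidal integration anchored at the origin."""
    xs, ys = [0.0], [0.0]
    for x, y in zip(fa_rates, hit_rates):
        if x <= far_max:  # discard points beyond the truncation limit
            xs.append(x)
            ys.append(y)
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))

k = 6  # lineup members, so the fair-lineup FAR is capped near 1/k
print(pauc([0.02, 0.06, 0.13], [0.24, 0.40, 0.50], far_max=1 / k))
```

The procedure with the larger pAUC over this forensically relevant range is the one that better supports discrimination between innocent and guilty suspects.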
Although there has been a proliferation of eyewitness identification research using ROC analysis to compare pAUC values (e.g., Carlson & Carlson, 2014; Dobolyi & Dodson, 2013; Gronlund et al., 2012), relatively recently there has been some momentum to develop new ways of conducting ROC analysis to determine which lineup procedures best support eyewitness performance. For example, Smith et al. (2020; hereafter referred to as Smith et al.) voiced concerns about the pAUC statistic and developed a new ROC method to replace it. This new ROC analysis is called full ROC analysis, and it allows policymakers to operate at a false alarm rate that is higher than what can be achieved using a fair lineup. One can imagine circumstances in which doing so might be desirable. For example, if the base rate of guilt were believed to be very high in a particular jurisdiction (i.e., if innocent suspects were almost never included in lineups), then a high false alarm rate could be more tolerable than would otherwise be the case.
Nevertheless, we ask eyewitness memory researchers to pause and critically evaluate whether presenting options with high false alarm rates as potentially reasonable operating points truly serves as an asset rather than a liability. Presenting extremely high false alarm rates as reasonable can encourage police and policymakers to consider using a showup (i.e., a procedure where only the suspect, and no fillers, is shown to the witness) or an unfair lineup (i.e., a lineup where the suspect stands out) in preference to a fair lineup so long as they convince themselves that it is rare for a suspect in their jurisdiction to be innocent. Indeed, Smith et al. explicitly tout this as a virtue of their approach: "In fact, using a classic utility analysis, Smith, Lampinen et al. (2019; see also Lampinen et al., 2019) demonstrated that under certain assumptions about base rates and about the relative costs of missed culprit identification and innocent-suspect identifications, the low-similarity [i.e., unfair] lineup is actually superior to the high-similarity lineup" (p. 598). Thus, the central problem with this new analysis is that the assumptions in question are largely unconstrained by empirical reality and may therefore simply reflect the wishes of police investigators.

Fig. 1. Example of two partial ROC curves for suspect identifications. For whichever false alarm rate is preferred, the d' = 2 ROC curve results in a higher hit rate than the d' = 1 ROC curve. Notice, however, that the maximum false alarm rate (FARmax) will vary somewhat for different lineups and eyewitnesses, but in a fair double-blind lineup will always be less than or equal to 1/k, where k is the number of lineup members.
Unfortunately, the police will almost never have any objective information about the base rate of guilt in their jurisdiction. Thus, being human, they will be at risk of basing their estimate on what they want to be true, and what they want to be true is that they have already found the perpetrator (namely, the suspect they have placed in the lineup). For example, as reported by Brewer et al. (2002): "…our target-absent rate was guided by estimates suggested by police who combined considerable experience in detective work with a formal university education (e.g., psychology, law). They argued that the conduct of a formal live or photo-spread lineup (at least in their jurisdiction) was dependent on the investigating officers having strong grounds for believing that their suspect was the offender, and estimated that the proportion of target-absent lineups (i.e., the suspect in the lineup turned out not to be the offender) was unlikely to exceed .10" (p. 47). In other words, these police officers convinced themselves that at least 90% of suspects in their lineups are guilty. If so, then the high suspect choosing rates induced by showups and unfair lineups (and nonblind lineups, for that matter) would rarely land on an innocent suspect. Based on utility calculations grounded in subjective police estimates of the base rate of guilt in lineups, these identification methods would be strongly preferred to a fair lineup. And the full ROC analysis (which we describe in more detail next) would provide the police with the scientific stamp of approval.
Is this the direction the field should take? That is, if police assume a high enough base rate of guilt, then should a showup be preferred over a fair lineup? That seems like an odd preference, especially after decades of research working to ensure low false alarm rates for innocent suspects. As noted by Steblay (2006), "The purpose of fillers in a lineup is to reduce the suggestiveness of the procedure and to draw any errors away from the suspect and toward the fillers. The fillers help to make certain that the lineup is a test of the witness's ability to actually recognize the culprit and not merely to identify the suspect present in the lineup" (pp. 349-350). The new full ROC provides a convenient way to circumvent the protections offered by fillers in a fair lineup because any eyewitness decision (suspect ID, filler ID, lineup rejection) potentially can be used by police to decide that the lineup is target-present, which therefore means that the suspect should be prosecuted. With this approach, errors are never drawn away from the suspect, because investigators know who the suspect is and use that knowledge. We see no value in this, at least not until such a time as there are empirical constraints on the hopes and dreams of well-intentioned police officers.1 Instead, we strongly recommend that the police follow recent guidelines presented in Wells et al. (2020). According to those guidelines, a pristine lineup is usually preferable to a showup (an exception is when the police are working an active crime scene) and is always preferable to an unfair lineup. This is true no matter what the police believe about how good they are at picking guilty suspects to place in lineups.
In addition to the practical concern, it is crucial to note that the new techniques have major statistical flaws that can yield misleading conclusions, such as favoring a procedure even when eyewitness performance is at chance level. In what follows, we outline problems with the new full ROC approach and illustrate that, as set out by signal-detection theorists years ago, an ROC analysis calculating the pAUC provides the information that legal policymakers need to know when deciding which lineup procedure should be implemented.

Investigator discriminability
Smith et al. were concerned that partial ROC curves focus solely on suspect identifications and do not span the full region of ROC space up to a false alarm rate of 1 (see Fig. 1). They therefore developed a new analytical approach, full ROC curves, to measure a new construct called investigator discriminability. The idea is that the investigator's role is to determine if the suspect is innocent or guilty, and since they already know which lineup member is the suspect and which are fillers, all they need to determine is if the lineup is target-present or target-absent. Once they convince themselves that the lineup is target-present, they can then classify the suspect as being guilty. Alternatively, once they convince themselves that the lineup is target-absent, they can then classify the suspect as being innocent. In other words, a central premise of this approach is that police decide whether the lineup contains the perpetrator and then get to use the information (that is normally concealed from eyewitnesses in fair lineups) about who the suspect is to label the suspect as being guilty or innocent. Careful readers might become apprehensive once they realize that this approach allows police to decide that the lineup is target-present from potentially any decision (suspect ID, filler ID, or rejection) made by the eyewitness. With this investigator discriminability approach, innocent suspects are never protected by fillers because police always know who the suspect is, and this approach necessitates using that knowledge. This is why strange data transformations can occur, such as a filler ID in a target-present lineup being counted as a hit. Perhaps one saving grace is that police would never feel the need to steer an eyewitness to the suspect because, even if an eyewitness lands on a filler, police can still count this filler ID essentially as a suspect ID.
During a lineup, a witness can either identify someone (the suspect or one of the fillers) or reject the lineup at different levels of confidence. Smith et al. propose that all of these different eyewitness decisions can be used to plot full ROCs for lineups to measure investigator discriminability. To begin, the cumulative proportions of guilty and innocent suspect identifications are plotted over confidence judgments, from the most conservative (100% confidence) to the most liberal (0-100% confidence) identifications. This stage is usually identical to the partial ROC curves that have previously been constructed in the eyewitness literature (i.e., partial ROCs focus on suspect IDs). Next, the authors extend the curve by also plotting the cumulative proportions of filler IDs and lineup rejections at different levels of confidence. Note the curious fact that filler IDs count as hits if they appear in target-present lineups or as false alarms if they appear in target-absent lineups. This is because the approach is an attempt at discriminating target-present and target-absent lineups, rather than directly discriminating innocent and guilty suspects. With this full ROC approach, it is perfectly reasonable for police to decide that a filler ID means that this is a target-present lineup and that the suspect should therefore be prosecuted to the full extent of the law. Moreover, it is easy to see why police might be biased to conclude that it is a target-present lineup if they believe that 90% of their suspects are guilty before presenting the lineup to an eyewitness. Smith et al. propose two alternative ways of ordering the points on the ROC: using an a priori order or ordering by the diagnosticity ratio (i.e., the hit rate divided by the false alarm rate). We consider these in more detail later, but we pause here to note that there is no principled way of ordering the remaining points on the ROC, and that is the core problem with this approach. The reason why there is no principled way to order the remaining ROC points is that the approach is not tethered to a mathematically coherent model of latent diagnostic signals.
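To make the construction concrete, here is a sketch of the full ROC bookkeeping as we understand it, using hypothetical counts and one arbitrary cumulation order. Because every decision is eventually counted, the final point necessarily reaches a hit rate and false alarm rate of 1.

```python
# Hypothetical counts illustrating the full ROC bookkeeping: every
# decision in a target-present lineup feeds the hit rate, and every
# decision in a target-absent lineup feeds the false alarm rate.
n_tp = n_ta = 100

# (decision type, TP count, TA count), in one chosen cumulation order
decisions = [
    ("suspect ID, high confidence",   30,  2),
    ("suspect ID, medium confidence", 15,  5),
    ("suspect ID, low confidence",    10,  8),
    ("filler ID, any confidence",     20, 45),  # filler IDs count too
    ("rejection, any confidence",     25, 40),  # so do rejections
]

cum_tp = cum_ta = 0
for label, tp, ta in decisions:
    cum_tp += tp
    cum_ta += ta
    print(f"{label}: HR = {cum_tp / n_tp:.2f}, FAR = {cum_ta / n_ta:.2f}")
# The final point is necessarily (1, 1) because all decisions are counted.
```

Note that the first three points match the standard suspect-ID partial ROC; it is only the extension beyond them that requires an arbitrary ordering choice.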
In any case, whichever way one arbitrarily chooses to order the remaining points on the ROC (simply take your pick between various options), this method yields a "full ROC" that spans to [1,1] in ROC space, where the maximum hit rate = 1 and the maximum false alarm rate = 1. Thus, relative to a standard partial ROC of suspect ID rates (which is tethered to a mathematically coherent model of latent diagnostic memory signals), the new approach extends the ROC curve into a higher false alarm rate range than researchers and practitioners have ever previously been interested in investigating. It seems fair to say that doing so encourages serious consideration of identification procedures that yield a very high false alarm rate. Thus, if we were to accept the underlying premise that the false alarm rate should possibly be higher than is achieved with standard partial ROCs plotted from fair lineups, this would mean that nearly every reform eyewitness memory researchers have proposed over the past 50 years (e.g., instructing the witness that the perpetrator may or may not be in the lineup) was ill-advised whenever the police imagine themselves to be very good at placing only guilty suspects in lineups. We do not accept this premise. Instead, we first explain that for applied purposes, focusing on a low false alarm rate range (as is the case in a standard pAUC analysis) is exactly the information that legal policymakers are, and should be, interested in.

1 The only formal attempt to estimate the base rate of guilt in a particular jurisdiction placed the number at approximately 35%, which is even lower than the 50% base rate used in most lab studies (Wixted et al., 2016).

Legal policymakers are interested in keeping the false alarm rate low and therefore need to know about pAUC, not AUC
For decades, eyewitness researchers and legal practitioners have been concerned with protecting innocent suspects because identifications of innocent suspects are especially costly. To decide which lineup procedure best supports witness discriminability when the false alarm rate to innocent suspects is low (i.e., when innocent suspects are protected), it makes sense to calculate the pAUC focusing on the leftmost portion of the x-axis of an ROC plot and to draw conclusions from this portion of the curve (e.g., Colloff et al., 2016; Gronlund et al., 2012; Mickes et al., 2012). A prominent analogy from the medical literature may clarify the value of this approach. It is well accepted that although a full ROC can be easily plotted using a typical diagnostic medical test, it does not make sense to draw conclusions from a full ROC when the concern is with part of that ROC. For example, a false positive from a screening test used to detect prostate cancer leads to a biopsy that can cause permanent disability. Therefore, research has focused on identifying the best medical test (i.e., yielding the highest ROC) in the low end of the false positive range (0 to 0.05; Yan & Zhang, 2018). This is analogous to the eyewitness identification situation, where false identifications of the innocent have long been judged to be especially costly (e.g., Blackstone's famous ratio). Conversely, on tests for detecting breast cancer, false positives leading to a biopsy generally result in minimal harm, whereas a false negative is life threatening. Therefore, research has focused on identifying the best test in the high end of the false positive range (Jiang et al., 1996). In both the lineup literature and the medical literature, pAUC is not a misleading measure but instead tells legal policymakers what they wish to know by focusing the analysis on the clinically relevant (or forensically relevant) range.2 For decades, the forensically relevant range has been the low false alarm rate region, and we submit that this has not changed.
Relying on the full ROC raises further complications when two ROCs intersect, such that one ROC does not consistently remain above or below the other. In such cases, one ROC might excel in regions of low false alarm rates, while the other outperforms in high false alarm rate areas. For example, in their Fig. 7, Smith et al. plotted full ROCs for fair versus unfair lineup data from Colloff et al. (2016). Colloff et al. focused on suspect IDs (in the low false alarm rate range), found that fair lineups yield a higher partial ROC curve, and therefore concluded that witnesses are better able to discriminate guilty from innocent suspects in fair compared to unfair lineups. When full ROCs are plotted, fair lineups are on a higher ROC for the low false alarm rate region, but unfair lineups are on a higher ROC for the high false alarm rate region. When Smith et al. calculated the AUC for the full ROC curves, there was no statistically significant difference between the two conditions, and they concluded that the difference between fair and unfair lineups "appears to be one that is better characterized by a change in conservativism rather than by a change in discriminability (cf. Colloff et al., 2016; cf. Smith et al., 2018)" (p. 602). When discussing changes in discriminability, it is important to specify the comparison. Partial ROC curves examine discriminability for guilty and innocent suspects, whereas full ROC curves (if anything) examine discriminability for target-present and target-absent lineups. The Smith et al. full ROC approach is predicated on police using their additional knowledge about which lineup member is the suspect.
Critically, when ROCs intersect, the full ROC approach of calculating the area under the entire length of the curve obscures the fact that one curve is higher for a particular region and another curve is higher for a different region. This is analogous to examining only the main effect in the presence of a crossover interaction. Looking at only the main effect can give the impression that there are no differences between two conditions, when in actuality the differences could be substantial. For example, consider a study on the effectiveness of a teaching method where the factors are the teaching method (traditional vs. innovative) and the students' initial skill level (low vs. high). The main effect might show that the innovative teaching method is no more effective on average than the traditional method. However, there could be a crossover interaction effect whereby the innovative method is more effective for students with low initial skill levels but less effective for those with high initial skill levels. Looking only at the main effect would miss this nuance, leading to the incorrect conclusion that the new teaching method is uniformly the same as the traditional method.
Likewise, when two ROCs cross over, rather than calculating the area under the entire curve, two partial ROCs would need to be calculated: the usual partial for the low FAR region and a second partial for the high FAR region. Doing this makes the choice between operating in either a low FAR region or a high FAR region more apparent. Basing conclusions on the high FAR region rather than the low FAR region would explicitly endorse this region as the one where investigators should consider operating. In short, in these situations, the key is to determine which region is more important. In line with the longstanding tradition in eyewitness memory research, we encourage prioritizing the low false alarm rate range to safeguard the innocent from wrongful convictions. Again, this is exactly the information gleaned from a pAUC analysis as introduced to the eyewitness field by Mickes et al. (2012). The forced choice (choosing either to focus on a low FAR or a high FAR) when ROCs cross over would be required even if the full ROC approach were functionally useful at providing information to investigators. But as we explain next, the new full ROC approach does not appear to be functionally useful for investigators.
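The two-region computation described above can be sketched as follows. The two ROCs are hypothetical crossing curves of our own construction, and for simplicity the sketch includes only the points that fall inside each region, without interpolating at the region edges.

```python
def region_pauc(points, lo, hi):
    """Trapezoidal area under piecewise-linear ROC points whose
    FAR falls in [lo, hi] (no edge interpolation, for simplicity)."""
    xs = [x for x, _ in points if lo <= x <= hi]
    ys = [y for x, y in points if lo <= x <= hi]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))

# Hypothetical crossing ROCs: A is higher at low FARs, B at high FARs.
roc_a = [(0.0, 0.0), (0.05, 0.40), (0.15, 0.55), (0.60, 0.70), (1.0, 1.0)]
roc_b = [(0.0, 0.0), (0.05, 0.25), (0.15, 0.45), (0.60, 0.85), (1.0, 1.0)]

low = (region_pauc(roc_a, 0.0, 0.15), region_pauc(roc_b, 0.0, 0.15))
high = (region_pauc(roc_a, 0.15, 1.0), region_pauc(roc_b, 0.15, 1.0))
print(low, high)  # A wins the low-FAR region; B wins the high-FAR region
```

A single whole-curve AUC would average these two regions together and could easily show no difference, which is the crossover problem described above.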

The full ROC ordering of the ROC points is unstable
The ordering of points for the positive suspect identification region of the ROC (i.e., the area of the ROC included in the normal pAUC) is remarkably stable. High-confidence suspect IDs consistently provide more evidence of guilt than medium-confidence suspect IDs, which consistently provide more evidence of guilt than low-confidence suspect IDs. This can be observed in all tables of Smith et al., which display the diagnosticity ratio of guilt for suspect IDs made with different levels of confidence. When suspect IDs are made with increasing confidence, the diagnosticity ratio increases. This pattern can also be observed in a review of 20 eyewitness studies by Wixted and Wells (2017), which shows the confidence-accuracy relationship for positive suspect identifications. High-confidence suspect IDs almost invariably provide greater evidence of guilt than lower-confidence suspect IDs. It is this reliable pattern that makes confidence a useful piece of information for investigators. If high-confidence suspect IDs did not reliably provide more evidence of guilt than lower-confidence suspect IDs, investigators would not improve their "innocent"/"guilty" classification accuracy by considering witness confidence.

2 … relevant range. In medicine, the base rate denotes the prevalence of a disease in a specific population. This base rate is essential because when dealing with rare diseases, there is a priority to maintain a low false positive rate. Medical professionals can rely on "gold standard" biopsy tests to establish the true prevalence of a disease. In the legal realm, the base rate pertains to the frequency of guilty individuals in lineups. In simpler terms, it is about distinguishing lineups with the actual culprit (target-present) from those with only innocent individuals (target-absent). Current methods do not offer concrete information to establish the ground truth in police lineups. Hence, a pragmatic approach is to assume a base rate of 50%, erring on the side of caution. Nonetheless, as noted earlier, some evidence suggests that the base rate might actually be under 50% (Wixted et al., 2016). Furthermore, a recent consensus paper by experts (Wells et al., 2020) advises ensuring some independent evidence against a suspect before including them in a lineup, out of concern that the real base rate of guilt could be extremely low. Given that the aim is to minimize false positives, especially when guilty suspects are few, there appears to be little practical value in scrutinizing lineups at extremely high false alarm rates.
The ROC points for positive suspect identifications are consistently ordered from the highest to lowest diagnosticity ratio as they move from high to medium to low confidence. However, the problem arises when considering additional points added using filler IDs and lineup rejections. These points are no longer consistently ordered from highest to lowest diagnosticity ratio, leading to major inconsistencies across different experimental conditions. The reason for this inconsistency is that the new method was developed without any connection to a coherent model of latent diagnostic signals. The inconsistent ordering of points across conditions in the new full ROC method can easily be appreciated by examining the tables in Smith et al. Consider the extreme variability in Table 2 of Smith et al. (reproduced in Fig. 2 here), which shows the problem with extending the ROC by even a single point beyond the standard pAUC region. The table displays participant identification rates after viewing a mock crime video in either degraded (low ROC) or clear (high ROC) encoding conditions. These values were used to construct the ROC curves displayed in Fig. 2. Keep in mind that with the Smith et al. approach, any decision made in a target-present lineup contributes to the hit rate, and any decision made in a target-absent lineup contributes to the false alarm rate. This is how these lineup ROC curves span the full range up to a hit rate and false alarm rate of 1; all decisions (e.g., suspect IDs, filler IDs, lineup rejections) are counted as either hits or false alarms.
The ROC points are ordered by the diagnosticity ratio (from high to low). While the first three suspect identification points are consistently ranked in terms of diagnosticity ratios (from high to medium to low confidence) for both conditions as expected, all additional points lack this stability. For example, in the degraded condition, the fourth point on the full ROC (and the fourth entry in the table) is a filler ID made with medium confidence. However, in the clear condition, the fourth point is not a filler ID made with medium confidence, but a reject decision with low confidence. The filler ID made with medium confidence is the last point on the clear condition ROC. One type of decision (in this case, a filler ID with medium confidence) can sometimes be the first point used to extend the ROC curve and other times be the last point used to extend the ROC curve. But the problems with the full ROC do not stop there, as we illustrate next.

A statistical problem with the sorting on diagnosticity ratio approach, or was Bem (2011) right about precognition?
Another fundamental problem with sorting on the diagnosticity ratio (i.e., HR/FAR) to determine the ordering of points is that noise artificially raises the ROC. As described previously, confidence-based ROCs are constructed by cumulating the hit and false alarm rates across decreasing levels of confidence. First, consider Point 1 on the ROC in Fig. 3. This is the point with the lowest HR and FAR, and it includes only IDs made with high confidence. Now consider the second point on the ROC. This is the point with the next lowest HR and FAR and is constructed by including the sum of IDs made with high confidence and medium confidence. Because the hit and false alarm rates that make up these earlier points are included in the later points, the region in which later points can fall gets progressively limited by earlier points. The cumulative process means that hit and false alarm rates can only increase for later points and never decrease. If high-performance points (i.e., with a high HR and low FAR) are artificially moved earlier on the curve by the DR sorting process, the ROC is likely to stay high when the lower-performance points are added later.
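The claim that noise alone inflates a DR-sorted ROC can be checked with a small no-signal simulation. The sketch below is our own construction (not the authors' code or data): nine equiprobable response categories stand in for the 3 decision types × 3 confidence levels, and target-present and target-absent responses are drawn from the very same distribution, so there is no true signal to detect.

```python
import random

CATEGORIES = 9     # stands in for 3 decision types x 3 confidence levels
N_LINEUPS = 100    # witnesses per condition
N_RUNS = 2000

def simulate_rates(rng):
    """Multinomial response rates over equiprobable categories (pure noise)."""
    counts = [0] * CATEGORIES
    for _ in range(N_LINEUPS):
        counts[rng.randrange(CATEGORIES)] += 1
    return [c / N_LINEUPS for c in counts]

def cumulative_auc(hrs, fars, order):
    """Trapezoidal area under the cumulated ROC, starting from (0, 0)."""
    area, hr, far, prev_hr, prev_far = 0.0, 0.0, 0.0, 0.0, 0.0
    for i in order:
        hr, far = hr + hrs[i], far + fars[i]
        area += (far - prev_far) * (hr + prev_hr) / 2
        prev_hr, prev_far = hr, far
    return area

rng = random.Random(1)
dr_aucs, fixed_aucs = [], []
for _ in range(N_RUNS):
    # No signal: TP and TA responses come from the very same distribution.
    hrs, fars = simulate_rates(rng), simulate_rates(rng)
    # DR sorting ranks categories by HR/FAR estimated from this same noisy run.
    dr_order = sorted(range(CATEGORIES), reverse=True,
                      key=lambda i: hrs[i] / fars[i] if fars[i] else float("inf"))
    dr_aucs.append(cumulative_auc(hrs, fars, dr_order))
    # A fixed a priori ordering cannot capitalize on the same noise.
    fixed_aucs.append(cumulative_auc(hrs, fars, range(CATEGORIES)))

mean_dr = sum(dr_aucs) / N_RUNS
mean_fixed = sum(fixed_aucs) / N_RUNS
print(mean_dr, mean_fixed)   # DR-sorted AUC sits well above 0.5; the fixed order does not
```

Because the DR ranking is estimated from the same noisy counts being plotted, categories that happen by chance to show a high hit rate and a low false alarm rate are placed first, pushing the cumulated curve above chance even though no discriminative signal exists.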

"Feeling the Future" time-reversed lineup experiment
To illustrate the problem that noise artificially raises the full ROC, we conducted an experiment. The experiment tested precognition (Bem, 2011) by giving participant eyewitnesses the lineup test before witnessing the crime. We recruited participants from Amazon's Mechanical Turk (MTurk). Participants received 50 cents for the ~3 min needed to complete the experiment. The research was approved by the UC San Diego Institutional Review Board for studies involving human subjects, and informed consent was obtained from all participants prior to their participation in the study. A total of 163 participants were randomly assigned to receive either a "target-present" or "target-absent" lineup (84 to a TP lineup and 79 to a TA lineup). We chose this sample size because sample sizes of ~150 are common in the eyewitness memory literature.

Materials and procedure
All participants were told, "In a moment, you're going to see a video. We would like you to try to predict who is going to be in that video. We find that people are likely to be accurate at this task when they trust their intuitions. The person from the video may or may not be shown." All participants were then given the exact same 6-person simultaneous lineup and told, "Please select the person you think is going to be in the video you are about to see, or select the 'Not Present' option if you think none of these people will be in the video." Participants used a confidence scale ranging from 0% to 100%, with options available in 10% increments to indicate their level of confidence.
Participants were then randomly assigned to view a 30-s mock-crime video that showed one of the lineup members or a 30-s mock-crime video that showed a person other than one of the lineup members. Having identical lineups for all participants ensured that differences in performance could not be attributed to some faces potentially being inherently more likely to be selected than other faces. The only difference between the target-present and target-absent conditions is the video that was shown after the lineup test was completed. That is, we had a designated innocent suspect in the target-absent condition, which was the same lineup face that was the guilty suspect in the target-present lineups.

Fig. 3. Diagram showing how confidence-based ROC points are constructed. Imagine this hypothetical study had 200 eyewitnesses. 100 eyewitnesses got a TP lineup that each contains one guilty suspect, and 100 eyewitnesses got a TA lineup that each contains one innocent suspect. The data in the table show the guilty suspect being identified 7 times with high confidence, 6 times with medium confidence, and 16 times with low confidence. The data also show the innocent suspect being identified 1 time with high confidence, 2 times with medium confidence, and 6 times with low confidence. All rates are therefore calculated by dividing each of these numbers by 100. The important fact to consider is that later points are constructed by cumulating the hit and false alarm rates from earlier points. While Point 1 includes only suspect identifications made with high confidence, Point 2 includes those same high-confidence suspect identifications in addition to medium-confidence identifications, and Point 3 includes those same high-confidence and medium-confidence identifications in addition to low-confidence suspect identifications. The points do not operate independently of earlier points. For example, the Point 2 hit rate must be ≥0.07, and the Point 2 false alarm rate must be ≥0.01. Because the hit and false alarm rates can never decrease, the region of ROC space that can be occupied by later points is constrained by earlier points.
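The cumulation described for Fig. 3 can be reproduced directly from the hypothetical counts given in its caption (7/6/16 guilty-suspect IDs and 1/2/6 innocent-suspect IDs per 100 witnesses):

```python
# Counts per 100 witnesses, from the hypothetical data described in Fig. 3.
tp_suspect_ids = {"high": 7, "medium": 6, "low": 16}   # guilty-suspect IDs
ta_suspect_ids = {"high": 1, "medium": 2, "low": 6}    # innocent-suspect IDs
N = 100

points = []
cum_hits = cum_fas = 0
for level in ("high", "medium", "low"):   # cumulate across decreasing confidence
    cum_hits += tp_suspect_ids[level]
    cum_fas += ta_suspect_ids[level]
    points.append((cum_fas / N, cum_hits / N))

print(points)   # [(0.01, 0.07), (0.03, 0.13), (0.09, 0.29)]
```

Each successive point folds in the counts from the point before it, which is why the hit and false alarm rates can only grow as the curve extends rightward.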

Standard suspect ID partial ROC curves
We first constructed the standard suspect ID partial ROC curves by plotting each level of confidence. The first point (i.e., the point with the lowest hit and false alarm rates) included only suspect IDs made with 100% confidence. The second point (i.e., the point with the second lowest hit and false alarm rates) included suspect IDs made with 100% confidence and 90% confidence. The third point included suspect IDs made with 100%, 90%, and 80% confidence. Additional points were constructed in this same manner, ending with all suspect IDs made at all levels of confidence. Fig. 4 shows these points falling along the line of chance, as would be expected in the absence of precognition. That is, as is reasonable to expect, participants cannot predict who is going to be the perpetrator in a future crime.

Diagnosticity ratio sorted full ROCs
We next constructed full ROC curves according to the Smith et al. diagnosticity ratio sorting guidelines. The "hit rate" is calculated by taking the number of times each decision (guilty suspect IDs, filler IDs, rejections, at all confidence levels for each) occurs in target-present lineups and dividing this number by the total number of target-present lineups. The "false alarm rate" is calculated by taking the number of times each decision (innocent suspect IDs, filler IDs, rejections, at all confidence levels for each) occurs in target-absent lineups and dividing this number by the total number of target-absent lineups. A diagnosticity ratio (DR) is calculated by dividing each "hit rate" by each "false alarm rate." The ordering of points is then determined by rank ordering these DRs from highest to lowest. Fig. 5 shows these points falling well above the line of chance. The area under this ROC curve is 0.68, 95% CI = [0.60, 0.76]. Since the value of 0.5, which represents chance performance, is not within this interval, we can conclude that performance is statistically significantly higher than chance at the 0.05 significance level. Therefore, according to a full ROC analysis using the Smith et al. method, participants can predict who is going to be the perpetrator in a future crime.
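For concreteness, the mechanics of the DR-sorted full ROC can be sketched as follows. The counts below are invented purely for illustration (they are not our data or Smith et al.'s); the key features are that every decision category is counted, so the curve necessarily terminates at (1, 1), and that non-suspect points can slot anywhere the noisy DRs send them.

```python
# Invented counts out of 100 target-present (TP) and 100 target-absent (TA)
# lineups -- purely illustrative, not data from any study.
# Every decision is counted, so the counts in each condition sum to 100.
tp = {("suspect", "high"): 20, ("suspect", "med"): 10, ("suspect", "low"): 10,
      ("filler", "high"): 5,   ("filler", "med"): 10,  ("filler", "low"): 10,
      ("reject", "high"): 10,  ("reject", "med"): 10,  ("reject", "low"): 15}
ta = {("suspect", "high"): 2,  ("suspect", "med"): 4,  ("suspect", "low"): 8,
      ("filler", "high"): 10,  ("filler", "med"): 15,  ("filler", "low"): 15,
      ("reject", "high"): 20,  ("reject", "med"): 16,  ("reject", "low"): 10}

# Rank every decision category by its diagnosticity ratio (HR / FAR).
cats = sorted(tp, key=lambda c: tp[c] / ta[c], reverse=True)

# Cumulate hit and false alarm counts in descending-DR order.
roc, h, f = [], 0, 0
for c in cats:
    h += tp[c]
    f += ta[c]
    roc.append((f / 100, h / 100))

# With these counts a low-confidence rejection (DR = 1.5) outranks the
# low-confidence suspect IDs (DR = 1.25), so a non-suspect point lands in the
# middle of the curve -- the kind of reordering discussed in the text.
print(cats[:4])
print(roc[-1])   # the full ROC always terminates at (1.0, 1.0)
```

Note that nothing in the procedure prevents a rejection or filler point from interleaving with the suspect ID points whenever the sampled DRs happen to fall that way.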
We invite readers to make their own judgment about which of the following two hypotheses is better supported by the experimental data presented in Figs. 4 and 5.
Hypothesis 1. People are good at knowing who they will later see committing a crime (i.e., Bem was right all along about precognition being real), but partial ROC curves obscure this fact.
Hypothesis 2. People are not able to see into the future and predict who they will later see committing a crime, and the Smith et al. full-ROC approach provides a biased estimate of discriminability.

The alternative a priori ordering approach
Instead of computing diagnosticity ratios after the fact and using them to order the ROC points, Smith et al. suggest that another approach for ordering the points beyond the normal partial ROC range is to decide in advance of the study how to order these points. As they put it: "As an alternative to ordering the operating points by diagnosticity ratios, for example, we ordered the operating points for these ROC curves as follows: high-confidence suspect picks, medium-confidence suspect picks, low-confidence suspect picks, low-confidence rejections, low-confidence filler picks, medium-confidence rejections, medium-confidence filler picks, high-confidence rejections, high-confidence filler picks" (p. 604).
There is no principled basis for choosing this ordering over the one based on diagnosticity ratios, and that is the crux of the issue. Moreover, even this ad hoc alternative does not exhaust the list of possibilities. For example, Lucas and Brewer (2022) used the following approach to order the full ROC points: "With the objective of ordering identification evidence from strongest to weakest, we plotted suspect identifications across descending levels of confidence, followed by filler identifications and lineup rejections across ascending levels of confidence. Low confidence filler identifications were followed by low confidence lineup rejections and so forth. Our decision to order filler identifications before lineup rejections was based on the premise that although filler identifications and lineup rejections have been indicated to provide equivalently strong exonerating evidence, the latter is likely given more weight by investigators" (p. 116). This is yet a third option for plotting the remaining operating points to achieve a full ROC. It is simply arbitrary, with researchers free to choose whatever full ROC method best supports their intuitions.
One advantage of the approach recommended by Lucas and Brewer (2022) is that it is based on a rudimentary theory of underlying latent variables. Specifically, the points are ordered with lineup rejections appearing last (and the filler ID points appearing in the middle) based on the assumption that the latent weights in the minds of investigators are incorrectly higher for rejections than they are for filler IDs. That is not the kind of mathematically coherent theory of latent diagnostic signals that we believe is needed to order the remaining ROC points in a coherent way, but perhaps it is the beginnings of one. Ultimately, using an a priori ordering for ROC points extending to the right of the suspect ID rates is not going to be an easy problem to solve because there are a multitude of factors that influence the ordering of the filler IDs, which we explain next.

The assumed decision rule and lineup fairness dramatically influence the DR ordering of filler IDs
In the "feeling the future" experiment, we demonstrated how noise within an individual run of an experiment can inflate the full ROC curve. This is because each experiment provides a noisy estimate of the true DR for each ROC point. However, even if the DR for each point is measured precisely, the ordering of points based on the DR will continue to vary dramatically from one experiment to another. Moreover, the ordering requires researchers to endorse (although perhaps implicitly) one decision model or another, which means that full ROC curves ordered using the DR do not provide a theory-free way of estimating empirical discriminability.
Two decision models are currently reasonable contenders for how eyewitnesses base their confidence in a lineup decision (Shen et al., 2023; Wixted et al., 2018). These two models are the independent observations model and the ensemble model. The independent observations model assumes that eyewitnesses first determine which face in the lineup generates the strongest memory-match signal with the perpetrator (i.e., the MAX rule). If this memory-match signal exceeds a particular criterion, that face (i.e., the MAX face) is identified. If this memory-match signal does not exceed the criterion, the MAX face is not identified, and the lineup is instead rejected. Confidence is similarly determined by the value of the memory-match signal of the MAX face, wherein higher memory-match signals result in higher confidence ratings, and lower memory-match signals result in lower confidence ratings. The ensemble model has the same initial MAX rule: eyewitnesses first determine which face in the lineup generates the strongest memory-match signal with the perpetrator. However, according to the ensemble model, the decision is based not on the raw memory-match signals generated by each face but on a new ensemble variable. This ensemble variable is calculated by subtracting the average memory-match signal generated by all faces in the lineup from the MAX face. If this ensemble variable exceeds a particular criterion, the MAX face is identified. If not, the MAX face is not identified, and the lineup is rejected. Confidence is similarly determined by the value of the ensemble "MAX minus mean" decision variable, wherein higher values (i.e., representing a larger difference in memory strength between the average and the MAX face) result in higher confidence ratings, and lower values result in lower confidence ratings. The confidence rating given to lineup rejections is yet another variable to consider and might be based on the raw memory-match signal generated by the MAX face, the ensemble variable, or some other variable. Empirically, the ordering of reject points based on the diagnosticity ratio may be fairly stable, wherein the next point after the standard partial ROC curve would be a low-confidence rejection, followed by a medium-confidence rejection, and finally a high-confidence rejection. The main challenge is to accurately place filler IDs around the rejection points, which could be before, after, or in between them.
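The two decision rules can be sketched in a few lines. This is a minimal sketch under standard signal detection assumptions; the function names, criterion value, and signal vector below are our own illustrative choices, not part of any published implementation.

```python
def independent_obs(signals, criterion):
    """Independent observations model: identify the MAX face if its raw
    memory-match signal exceeds the criterion; that same raw signal
    would also drive confidence."""
    best = max(range(len(signals)), key=lambda i: signals[i])
    strength = signals[best]
    return (best if strength > criterion else None), strength

def ensemble(signals, criterion):
    """Ensemble model: the same MAX face, but the decision variable is the
    MAX signal minus the mean signal across all faces in the lineup."""
    best = max(range(len(signals)), key=lambda i: signals[i])
    strength = signals[best] - sum(signals) / len(signals)
    return (best if strength > criterion else None), strength

# Illustrative 6-face lineup: face 0 matches memory best,
# but every other face also matches fairly well.
signals = [2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
io_decision, io_strength = independent_obs(signals, criterion=1.0)
ens_decision, ens_strength = ensemble(signals, criterion=1.0)
print(io_decision, io_strength)    # 0 2.0 -> face 0 is identified
print(ens_decision, ens_strength)  # lineup rejected: 2.0 - 7/6 falls below the criterion
```

Because the two models map the same lineup onto different decision variables, they can disagree about which lineups yield IDs versus rejections and about which yield high confidence, which is why they imply different DR orderings for the extended points.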
Table 1 shows the ordering of points for simulated lineup data when the decision is based on an independent observations or an ensemble decision variable. The ordering of filler identification points for the independent observations model is the reverse of the ordering of filler identification points for the ensemble model. It is also worth noting that the ordering difference can be as extreme as a filler ID made with high confidence being either the first or the last point after the suspect ID points. Simulation details are provided in the Supplementary Materials.
Likewise, lineup fairness is another factor that influences the ordering of filler IDs and how these IDs are ordered among reject IDs. Table 2 shows the ordering of points for differing levels of lineup fairness. In a fair lineup, the mean of the innocent suspect distribution is the same as the mean of the filler distribution (M = 0), since both the innocent suspect and the fillers are equally similar, on average, to the witness's memory of the perpetrator. As a lineup becomes progressively more unfair, the mean of the innocent suspect distribution becomes greater than that of the filler distribution. In a very unfair lineup, the mean of the innocent suspect distribution (M = 0.9) approaches the mean of the target distribution (M = 1). The question here is how to organize the filler IDs among the reject decisions. What makes this especially problematic in the context of research is that a different ordering would be needed for different lineup conditions within a single experiment, and it would necessitate considerable speculation by researchers about the precise degree of lineup unfairness to attempt to determine an order a priori.
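The effect of unfairness on the underlying signals can be sketched with a small Monte Carlo simulation. This is a hedged illustration assuming unit-variance normal memory-match signals; the function name and parameter values are ours, and the simulation only shows the shared MAX step that both decision models begin with.

```python
import random

def ta_suspect_max_rate(suspect_mean, k=6, n=20000, seed=7):
    """Proportion of target-absent lineups in which the innocent suspect
    generates the MAX memory-match signal. Fillers ~ N(0, 1); the innocent
    suspect ~ N(suspect_mean, 1), so suspect_mean indexes lineup unfairness."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        suspect = rng.gauss(suspect_mean, 1)
        best_filler = max(rng.gauss(0, 1) for _ in range(k - 1))
        if suspect > best_filler:
            wins += 1
    return wins / n

# Fair -> very unfair, using the innocent-suspect means from Table 2.
for m in (0.0, 0.25, 0.5, 0.9):
    print(m, ta_suspect_max_rate(m))
```

In a fair lineup the innocent suspect wins the MAX comparison about 1/6 of the time; as the suspect's mean rises toward the target's, the suspect wins ever more often, so filler IDs become rarer in target-absent lineups and their "false alarm rates", and hence their DRs and ordering, shift accordingly.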
These considerations illustrate why it is essential for ROC analysis to be tethered to a coherent model of underlying diagnostic signals. The ordering of ROC points corresponding to suspect IDs made with different levels of confidence is coherently tied to (and is the same for) the different signal detection models that have been used to interpret lineup performance. This is why the corresponding discriminability measure (pAUC) is not susceptible to yielding misleading conclusions (such as participants having the ability to feel the future). The ordering of points that make up the full ROC, by contrast, amounts to mostly intuitive guesswork, which makes it vulnerable to yielding false conclusions.

Could researchers fit a model to the data to extend the ROC?
There are possible ways to extend the ROC to the right into a higher FAR range if that were a desirable goal (contrary to what we believe to be true). In our view, the only way to do so is to fit a mathematically coherent model to the data and then use the model to predict the remaining trajectory of the suspect ID ROC. This does not involve including filler IDs or reject decisions on the ROC, though those ID decisions are used when the model is fit to the data. As noted before, the suspect ID ROC for a fair lineup (e.g., Seale-Carlisle et al., 2019) is empirically constrained to have a maximum false ID rate of 1/k, where k is lineup size. However, the mathematical modeling approach would allow one to plot the remaining trajectory of a fair lineup ROC that extends past 1/k. Some researchers have already used such a plotting approach to illustrate the correspondence between model-predicted values and the empirical data, but not because they have been specifically interested in extending the ROC past 1/k (e.g., see Seale-Carlisle et al., 2019). However, if one decided that the ROC point with a false alarm rate of 0.50 was attractive after viewing the model-based trajectory, that false alarm rate could be achieved empirically (in an experiment) by updating the lineup procedure. For example, a false alarm rate of 0.50 could be achieved by simply revealing to the witnesses, in advance of the lineup decision, who the suspect in the lineup is.

Table 1
The ordering of points based on the diagnosticity ratio for the independent observations and ensemble models. S = suspect ID; F = filler ID; R = lineup rejection; High, Medium, and Low refer to the level of confidence for each decision.

This is exactly what Colloff and Wixted (2020) did in their study, where the suspect in the lineup had a red border and participants were asked whether or not the suspect in the red box was the perpetrator of the crime. This type of procedure was used for theoretical purposes for tracking the suspect ID ROC; the authors did not advocate for the use of this procedure in practice. Other approaches, such as revealing the suspect to the witness after a reject decision is made (e.g., Yilmaz et al., 2022), could also allow lineup rejection confidence ratings to be included in the ROC along with suspect identification ratings. Starns et al. (2022) also noted the inconsistent and problematic ordering of points in the new full ROC approach. They suggested one solution: researchers could fit a theoretical model to experimental lineup data to determine the ordering of the full ROC points for a particular experiment (including suspect IDs, filler IDs, and reject decisions). If researchers are interested in extending the ROC for theoretical reasons, we agree with the need for a formal model, as noted above. But our preferred approach would be to use that model to project the suspect ID ROC (not to include fillers and reject decisions). Then, to extend the suspect ID ROC empirically, our preferred approach would be to change the lineup procedure to enable a higher false alarm rate to the suspect, just like in the simultaneous showup in Colloff and Wixted (2020). In our view, this would be preferable to relying solely on intuition to plot ROC points associated with filler IDs and lineup rejections.
It is worth contrasting these potential approaches with the discussion in Smith et al. about how to measure identification performance at higher false alarm rates than observed with partial ROCs. They state that "Option 2 is also problematic because it would require making parametric assumptions and extrapolating the shorter curve well beyond the observed data for that procedure. But we do not really know where that curve will project to, and this extrapolation approach is really a guessing game (Colloff et al., 2016; Wixted & Mickes, 2018)" (p. 598). On the contrary, it is the full ROC approach that is grounded in guesswork. The additional full ROC points (for filler IDs and reject decisions) could be ordered using the diagnosticity ratio, or they could alternatively be ordered a priori in the way described by Smith et al., or they could instead be ordered in the way described by Lucas and Brewer (2022). Researchers (and police investigators) are free to take their best guess because there is no coherent model of latent diagnostic signals that would favor one approach over the others. To us, that is a guessing game, one that is far inferior to an approach that relies on a coherent mathematical model of memory based on principles that have been worked out in the basic-science literature over many decades.
Yet even if one were to use the model-based approach we describe (or Starns's model-based approach that includes filler ID rates and lineup rejection rates), it would no longer be a theory-free way of analyzing empirical discriminability, which is a strength of the current pAUC method. The pAUC method is compatible with a coherent mathematical model of underlying diagnostic memory signals, but it does not depend on any one of the competing models (e.g., independent observations, ensemble). The approaches suggested above would be model-dependent, and researchers should be clear about the question they are attempting to answer when using these approaches.

Conclusion and future directions
We, of course, advocate for researchers not to construct full ROC curves using the method introduced by Smith et al. and instead to stick with constructing only the suspect ID partial ROC curves introduced by Mickes et al. (2012). This is because legal policymakers are interested in keeping the false alarm rate low and therefore need to know about pAUC, not AUC, and because the full ROC approach is statistically flawed: the ordering of the ROC points is theoretically unprincipled and empirically unstable. Ordering on the DR for each study capitalizes on noise and leads to misleading conclusions (such as the existence of precognition). Moreover, following a different a priori order without guidance from a coherent model of latent diagnostic signals amounts to guesswork.
It is also important to consider that these full ROC curves are not simply a new way to analyze ROC data; they are a new lineup procedure. The only way the extended ROC region is of any potential value is if investigators are sometimes arresting a suspect on the basis of a filler ID or a lineup rejection. As we demonstrated, there is no coherent way of reliably determining the ordering of points in the extended region. However, theoretically, policymakers could decide that the hit rate achievable by a fair lineup is too low and that suspects should be arrested both when they are identified by an eyewitness and when, for example, a filler is identified with medium confidence. This full ROC could then provide a valid measure of investigator discriminability for this lineup procedure. Similarly, whenever police were highly confident they had a guilty suspect, if they regularly arrested the person both when the suspect was identified and when the lineup was rejected with medium confidence, then this full ROC could provide a valid measure of investigator discriminability. However, is there any compelling evidence (or any evidence at all, for that matter) that investigators across the thousands of U.S. law enforcement agencies consistently follow this lineup procedure and set their criterion for arresting a suspect in the filler ID or reject region of ROC space? We would be surprised if police do this at all and even more surprised if they do so consistently. If police do this, we would have serious concerns about the frequency of innocent suspects being wrongly sent to jail. However, in that seemingly unlikely scenario, we would concede the value of this full ROC investigator discriminability measure. If police do not do this, then this full ROC investigator discriminability measure is not providing information about investigator discriminability in the world we currently live in.
For researchers determined to analyze full ROCs, a more sensible approach would be to plot "detection ROCs" (Shen et al., 2023). In this type of ROC, a hit occurs whenever anyone (guilty suspect or filler) is identified from a target-present lineup, and a false alarm occurs whenever anyone (innocent suspect or filler) is identified from a target-absent lineup. This approach maintains the spirit of the Smith et al. approach, wherein a suspect no longer needs to be identified in order to be counted as a hit or a false alarm.

Table 2
The ordering of points based on the diagnosticity ratio for the ensemble model and varying degrees of lineup fairness.S = Suspect ID; F = Filler ID; R = Lineup Rejection; High, Medium, and Low refer to level of confidence for each decision.
Fair (M = 0)  Unfair (M = 0.25)  More Unfair (M = 0.5)  Very Unfair (M = 0.9)

The added bonus is that the detection ROC is theoretically sensible. These types of ROCs can be useful for testing competing theoretical models that make competing predictions about which detection ROC curve should be higher or lower. A major benefit of detection ROC curves is that they do not suffer from the same point-ordering issues as the Smith et al. ROC curves. Just like standard old/new recognition or showup ROC curves, the ordering is always high-confidence IDs followed by lower-confidence IDs, then low-confidence rejections, and finally high-confidence rejections. Note that, from our perspective, detection ROCs do not have any direct application for police investigators even though they are useful for testing theories. Just like the Smith et al. full ROC curves, the only way for these curves to measure empirical discriminability for applied purposes would be if police arrested suspects when filler IDs occur, thereby thwarting the protections offered by fillers.
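Constructing a detection ROC in this fixed order is straightforward. The counts below are invented purely for illustration; the point of the sketch is that the ordering is set by theory in advance, not re-estimated from each data set's diagnosticity ratios.

```python
# Invented counts out of 100 TP and 100 TA lineups (illustrative only).
# "id" = anyone in the lineup was identified (suspect or filler);
# "reject" = the lineup was rejected.
tp = {("id", "high"): 40, ("id", "med"): 20, ("id", "low"): 15,
      ("reject", "low"): 10, ("reject", "med"): 10, ("reject", "high"): 5}
ta = {("id", "high"): 15, ("id", "med"): 15, ("id", "low"): 15,
      ("reject", "low"): 15, ("reject", "med"): 20, ("reject", "high"): 20}

# Fixed, theoretically motivated order, strongest to weakest evidence of
# presence: high- to low-confidence IDs, then low- to high-confidence rejections.
order = [("id", "high"), ("id", "med"), ("id", "low"),
         ("reject", "low"), ("reject", "med"), ("reject", "high")]

roc, h, f = [], 0, 0
for key in order:
    h += tp[key]
    f += ta[key]
    roc.append((f / 100, h / 100))
print(roc)   # first point (0.15, 0.4); final point (1.0, 1.0)
```

Because the order is fixed by theory, the same decision category can never jump from early to late on the curve across conditions, which is exactly the stability the DR-sorted full ROC lacks.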
For researchers determined to use the full ROCs advocated by Smith et al., or some variation of them, we propose one potential compromise. If plotting these types of full ROC curves, make it clear when the points switch from the standard partial ROC points (which are ordered coherently) to the extended points (which are not ordered coherently). This could easily be done by using a different type of marker for the extended ROC points. For example, if the standard partial ROC curve points are indicated by circles, the extended points could be indicated by diamonds. This would at least allow readers to easily see which differences are likely to be real (in the partial ROC area, where we can be confident of the ROC point orderings) and which differences might be due to methodological artifacts. Thus, all readers can remain ESP skeptics even if certain ROC methods beg to differ.

Fig. 2 .
Fig. 2. (A) Table 2 in Smith et al., which presents noncumulative rates as a function of the witness's identification decision and associated level of confidence. HR = hit rate, or decisions made in target-present lineups; FAR = false alarm rate, or decisions made in target-absent lineups; DR = diagnosticity ratio of guilt, or the likelihood that the suspect is guilty given the witness's response. S = suspect, F = filler, R = rejection; high = 90% to 100% confidence; medium = 70% to 80% confidence; low = 0% to 60% confidence. +Inf = positive infinity. (B) Diagnosticity ratio sorted full ROC curves for these data. These curves are created by cumulatively summing the HR and FAR values down the respective columns in the table. For example, the first Clear point has an HR of 0.383 and an FAR of 0. The second Clear point is constructed by adding the next HR and FAR values, resulting in an HR of 0.383 + 0.250 and an FAR of 0 + 0.013. Additional points are constructed in the same manner by continuing this cumulative process. Reference: Smith et al., Perspectives on Psychological Science, Vol. 15(3), pp. 589-607, © 2020 by The Author(s). Reprinted by Permission of SAGE Publications.

Fig. 4 .
Fig. 4. Standard partial ROC curves for suspect IDs in the time-reversed lineup test.

Fig. 5 .
Fig. 5. Smith et al. DR-sorted full ROC curves for our time-reversed lineup test.