A new way of classifying developmental prosopagnosia: Balanced Integration Score

Despite severe everyday problems recognising faces, some individuals with developmental prosopagnosia (DP) can achieve typical accuracy scores on laboratory face recognition tests. To address this, studies sometimes also examine response times (RTs), which tend to be longer in DPs relative to control participants. In the present study, 24 potential DPs (according to self-report) and 110 age-matched controls completed the Cambridge Face and Bicycle Memory Tests, an Old New faces task, and a famous faces test. We used accuracy and the Balanced Integration Score (BIS), a measure that adjusts accuracy for RTs, to classify our sample at the group and individual levels. Subjective face recognition ability was assessed using the PI20 questionnaire and semi-structured interviews. Fifteen DPs showed a major impairment using BIS compared with only five using accuracy alone. Logistic regression showed that a model incorporating the BIS measures was the most sensitive for classifying DP and showed the highest area under the curve (AUC). Furthermore, larger between-group effect sizes were observed for a derived global (averaged) memory measure calculated using BIS versus accuracy alone. BIS is thus a highly sensitive novel measure for attenuating speed-accuracy trade-offs that can otherwise mask impairment measured only by accuracy in DP.


Introduction
Developmental prosopagnosia (DP) is a neurodevelopmental syndrome that manifests in severe face recognition problems due to the visual mechanisms for face processing having failed to develop (Duchaine & Nakayama, 2006). DP occurs despite normal vision and IQ and a lack of obvious brain damage. Prevalence of DP is estimated at around 2 %–2.9 % in adults (Bowles et al., 2009; Kennerknecht et al., 2006, 2008) and between 1.2 % and 4 % in children (Bennetts et al., 2017). DP is thus as common as severe dyslexia (~2 %–4 %, European Dyslexia Association, n.d.) and more common than autism (~.6 %, World Health Organisation, 2018) despite being relatively unknown. However, recent work by DeGutis et al. (2023) demonstrates that prevalence estimates vary considerably depending on the measures and cut-offs used.

Different approaches to classifying DP
No clinical definition of DP exists; instead it is usually classified by poor performance relative to neurotypical controls (usually 1.7 or 2 standard deviations, SDs, below the control mean) on at least one laboratory (i.e., objective) test of face processing alongside personal (i.e., subjective) report. Some studies argue that DP is an identifiable disorder with qualitative, as well as quantitative, differences between DPs and controls (Behrmann et al., 2007; Bobak et al., 2017; Burns et al., 2014; Towler et al., 2018), while others argue that it may simply represent the lower end of the face recognition (FR) ability spectrum (Bowles et al., 2009; Johnen et al., 2014). More generally, many studies have noted that DP is a heterogeneous condition, both in presentation and severity (Bobak et al., 2017; Corrow et al., 2016; Dobel et al., 2007; Duchaine & Nakayama, 2005; Wilcockson et al., 2020). This heterogeneity has even been observed among members of the same family (De Haan, 1999; Duchaine et al., 2007; Lee et al., 2010; Schmalzl et al., 2006). For example, Lee et al. (2010) reported three members of the same family with impaired face memory, only one of whom additionally had impaired object recognition.
Overall, the literature to date contains many mixed findings. These could be due to genuine heterogeneity, perhaps accentuated by the small sample sizes typical in neuropsychology. However, it is possible that the competing findings may, in part, also be explained by the varied approaches taken by different research groups to classifying DP (Bate & Tree, 2017) and by the range of different tests and measures used for classification and assessment of face processing in DP (for an overview see DeGutis et al., 2023; Robotham & Starrfelt, 2018). Broadly, the literature shows that approaches to categorisation differ across studies in three main, yet interrelated, ways: firstly, in the selection of test(s) and associated cut-off levels; secondly, in the selection of the outcome measure(s) used to quantify test performance; and, thirdly, in the inclusion and exclusion criteria used for both the DP and typical control groups. As we show here, DP heterogeneity might also arise from study participants adopting different strategies in approaching the tasks, e.g., by taking longer to complete them and thus potentially masking impairments when only accuracy is taken into consideration.

Different approaches to the selection of tests for classification
The criteria used to classify DP vary between research groups, which is problematic for study and case comparison (Corrow et al., 2016). To address this, Dalrymple and Palermo (2016) recommended that, in addition to subjective reports of face recognition difficulties, individuals should exhibit impairments on at least two objective tests of face processing (cf. Burns et al., 2022, who argue that subjective report alone is sufficient for classification). Notably, these guidelines did not specify that these should test face recognition (as opposed to face perception) specifically. This may be problematic because several studies report that as many as 50 % of DPs show no face perception impairment (e.g., Dalrymple et al., 2014), meaning that perceptual tests may not always be suitable for classifying DP. Indeed, DeGutis et al. (2023) have recently proposed that two tests of face recognition specifically should be used for classification, but this has yet to become the norm.
Since the 2016 guidelines were published, other researchers have gone further, arguing that converging evidence of impairment across multiple tests of face processing would offer stronger evidence for classifying potential DPs (Bate & Tree, 2017), and that participants scoring more than 1 SD below the mean on two or more tests should be classified as impaired (Mishra et al., 2021; Stumps et al., 2020), since this approach is common in other areas of neuropsychology such as mild cognitive impairment (Sachdev et al., 2014). At the other end of the face recognition ability spectrum, it has also been argued that super recognizers should be assessed using converging evidence from multiple tests (Bobak et al., 2023). The guidelines for the best combination of tests predicting real-life face recognition are constantly evolving in both superior (Mayer & Ramon, 2023) and typical face recognition (Bobak et al., 2023).

Different approaches to the selection of outcome measures to quantify test performance
Impaired accuracy is the outcome measure traditionally used to classify DP. However, it has been shown that DPs can sometimes perform within typical accuracy limits when the task has extended or unlimited presentation time (Albonico et al., 2017; Dobel et al., 2007; Duchaine & Nakayama, 2004), leading to recommendations for both accuracy and response time (RT) to be considered (Fysh & Ramon, 2022; Stacchi et al., 2020), typically using Inverse Efficiency Scores (IES; Townsend & Ashby, 1983). Unfortunately, the IES is only suitable when mean accuracy is above around 85 %, a level not commonly found in DP research, and/or when there are no accuracy versus RT trade-offs (Bruyer & Brysbaert, 2011). Pertinently, such speed-accuracy trade-offs are observed in common face perception tests (Stacchi et al., 2020) and have been shown to differ between not only DPs and controls but also DPs and participants with acquired prosopagnosia (Behrmann et al., 2005; Fysh & Ramon, 2022).
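As a concrete illustration of why IES breaks down at the accuracy levels typical of DP samples, the sketch below (values hypothetical, not from this study) applies the standard formula, IES = mean correct RT / proportion correct:

```python
def inverse_efficiency(mean_rt_correct, prop_correct):
    """Inverse Efficiency Score (Townsend & Ashby, 1983): mean RT on
    correct trials divided by proportion correct. Expressed in RT
    units; lower values indicate more efficient performance."""
    if not 0 < prop_correct <= 1:
        raise ValueError("proportion correct must be in (0, 1]")
    return mean_rt_correct / prop_correct

# At high accuracy the RT penalty is modest:
ies_high = inverse_efficiency(800, 0.95)   # ~842 ms
# Below the ~85 % level common in DP research, the same RT is
# inflated disproportionately by a small accuracy difference:
ies_low = inverse_efficiency(800, 0.60)    # ~1333 ms
```

Because the denominator shrinks, a modest accuracy drop can swamp genuine RT differences, which is the distortion Bruyer and Brysbaert (2011) warn about.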
A better single measure that combines accuracy and RT is the Balanced Integration Score (BIS; Liesefeld & Janczyk, 2019, 2022). Using drift diffusion modelling, the authors demonstrated that BIS is relatively unaffected by speed-accuracy trade-offs (SATs) and is appropriate for between-subjects designs and tasks with varying levels of mean accuracy and decision thresholds, making it suitable for individual differences research in face processing. Face processing tasks often instruct participants to respond as accurately and as fast as possible. This instruction is not neutral, as it requires participants to decide whether to prioritise speed or accuracy, since one may affect the other. Even when instructions are silent on this matter, or the time allowed for responses is not constrained, participants must still decide how to balance speed and accuracy. Controlling for differential speed-accuracy trade-offs is essential in a lifespan study of DP. Draheim and colleagues (2019) point out in their review of RT and individual differences that multiple studies have shown speed-accuracy trade-offs differing across ability level (participants with lower ability are more likely to sacrifice speed for accuracy versus those with higher ability) and age (older adults tend to proceed more slowly and carefully than young adults regardless of instructions).
As Liesefeld and Janczyk (2019) point out, the relative weightings applied to speed or accuracy may not only differ between participants, but also within participants, e.g., an individual might prioritise speed on easier trials and accuracy on harder trials. BIS addresses these challenges by equally weighting accuracy and RT and can be thought of as accuracy adjusted for RT, thereby controlling for differential speed-accuracy trade-offs. BIS is calculated by subtracting a participant's standardised RT score on correct trials from their standardised accuracy score [BIS = Z(accuracy) − Z(RT)], see Fig. 4 below. So, for example, the BIS for a hypothetical DP participant whose accuracy z score is −1.2 (less accurate than average) and RT z score is 1 (slower than average) would be −2.2 (−1.2 minus 1), whereas the BIS for a hypothetical control participant with an accuracy z score of −1.2 (also below average) and an RT z score of −1.2 (faster than average) would be 0 (−1.2 minus −1.2).
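The calculation, including the worked example above, can be sketched as follows (the z-scoring step assumes standardisation against the relevant comparison sample):

```python
import statistics as st

def zscores(values):
    """Standardise a list of raw scores against its own mean and SD."""
    mean, sd = st.mean(values), st.stdev(values)
    return [(v - mean) / sd for v in values]

def bis(z_accuracy, z_rt):
    """Balanced Integration Score (Liesefeld & Janczyk, 2019):
    standardised accuracy minus standardised mean correct RT.
    Higher = better; slow responding lowers the score."""
    return z_accuracy - z_rt

# Worked example from the text: a hypothetical DP who is both less
# accurate (z = -1.2) and slower (z = 1) than average...
bis_dp = bis(-1.2, 1.0)        # approximately -2.2
# ...versus a control with the same accuracy deficit who has traded
# accuracy for speed (RT z = -1.2):
bis_control = bis(-1.2, -1.2)  # 0.0
```

The control's below-average accuracy is offset by their above-average speed, so BIS treats the two strategies even-handedly rather than rewarding slow, careful responding.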

Different approaches to defining inclusion and exclusion criteria
How researchers choose to define study inclusion and exclusion for the DP and control groups has important implications for our ability to fully understand DP. A key question is why so many individuals who report severe problems recognising familiar faces in everyday life do not appear to meet the classification for DP (for a discussion see Burns et al., 2022; DeGutis et al., 2023). For example, a recent large-scale study of 165 adults who reported severe everyday face recognition problems (Bate et al., 2019) showed that 61.8 % of the 165 suspected DPs did not meet the commonly used diagnostic threshold of at least 2 SDs below control means on a minimum of two of the three commonly used diagnostic tests. These were the Cambridge Face Memory Test (CFMT; Duchaine & Nakayama, 2006), the Cambridge Face Perception Test (CFPT; Duchaine et al., 2007) and a Famous Faces Test (FFT). The CFMT is the gold standard test for detecting DP, but even on this test 107/165 (65 %) participants did not meet strict cut-off criteria (−2 SD). Burns et al. (2022) also report that a similar proportion of DPs (56 %) did not score more than 2 SD below control means on the CFMT.
The studies by Bate et al. (2019) and Burns et al. (2022) are relatively unusual in assessing the performance of all the self-reported DPs. More commonly in the literature, only those participants who meet strict objective criteria for DP are included in studies (for a review see DeGutis et al., 2023). Unnecessary exclusion of a high proportion of potential participants who report this rare condition is problematic, both for practical and, arguably, ethical reasons (Burns et al., 2022).
A final consideration is exclusion criteria based on face test performance. It is common practice to exclude control participants whose performance falls more than 1.7 or 2 SDs below the control mean. However, although this approach avoids the unwelcome possibility of including potentially undiagnosed DPs in the control group, it could also lead to an overestimation of the FR abilities of the remaining control group by inflating the control group mean. This is problematic because excluding individuals who produce low test scores but report no day-to-day face recognition problems, even when prompted through a comprehensive and ecologically valid instrument such as the PI20 questionnaire (Shah et al., 2015), could, in turn, create two artificially distinct groups. After some initial uncertainty, the PI20 has been shown to be a reliable and valid means of classifying DP (Burns et al., 2022; Gray et al., 2017; Tsantani et al., 2021).
Subjective face recognition difficulty is widely considered a prerequisite for classification as a DP. Arguably, therefore, it logically follows that an absence of subjective difficulty, provided subjective experience has been interrogated using a questionnaire, rules out the possibility of undiagnosed DP. Some low-scoring individuals may not experience face recognition difficulties in day-to-day life because they have developed effective compensatory techniques, or because their poor test performance simply represents natural variability in face processing or other cognitive processes such as attention, or both. In reality, there are likely to be overlapping face recognition abilities between participant groups who do, and do not, report face recognition difficulties, but individuals falling within this overlapping range (exact scores will vary test by test) are currently rarely researched due to prevailing selection and classification methods.
Increasingly, therefore, several research groups have begun to question whether the current approaches to classifying DP are appropriate and have suggested that broader inclusion criteria might help inform our understanding of DP (Berger et al., 2022; Burns et al., 2022; Dalrymple & Palermo, 2016; DeGutis et al., 2023; Mishra et al., 2021; Stumps et al., 2020). Unless future research includes both the full range of individuals who report FR problems (potential DPs) and the full range of neurotypical individuals who score poorly on face processing tasks but do NOT report FR problems on the PI20 or a similar questionnaire, we are unlikely to be able to satisfactorily characterise DP.

The present study
In this study we therefore adopt a new approach that allows a more comprehensive understanding of the nature of the underlying deficits in participants who report severe and noticeable face recognition difficulties in everyday life (for simplicity we call these DPs). Specifically, we compare the classification of a group of 24 self-reported DPs on traditional accuracy measures alongside a new classification method using the Balanced Integration Score (BIS). BIS accounts for speed-accuracy trade-offs that may mask face and object processing impairments in some DPs. The aim of this study was to identify whether accuracy or BIS yields an objective measure that better accounts for self-reported face recognition ability.

Definitions
We use the term face processing generically to refer to the overall process involved in recognising a human face. More specifically, here face recognition and face memory are used to refer to the ability to say whether a newly learned target face has been seen before. Both the Old New faces task and the CFMT test face memory. We also use (object/bicycle) memory to mean the ability to say whether a specific bicycle has previously been seen. Face memory tasks do not require participants to be able to name the facial exemplar or to recall any biographical details about an individual. By contrast, when we refer to face identification, we mean the ability to identify a face, either by name or other biographical detail (e.g., the actor who plays Mr Bean). The Famous Faces task tests face identification. It also tests face familiarity, by which we mean the ability to correctly judge (yes/no) whether the face of a personally known celebrity looks familiar.

Research transparency and openness
The data presented here were used to classify participants in a wider study which investigates face perception in developmental prosopagnosia; findings from the perceptual task battery will be reported separately in a future publication. Although the study was exploratory in nature, we decided to preregister it after data collection had commenced, but prior to analysis, to avoid any suspicion of fishing for results or hypothesising after results were known. The preregistered hypotheses, analysis plan, study data and R analysis scripts are available via the Open Science Framework: https://osf.io/qne8d/. Analysis of one of the initial screening tasks, the Matrix Reasoning Item Bank (Chierchia et al., 2019), showed that the self-reported DP group adopted a different speed-accuracy strategy to controls, preferring to proceed more slowly and carefully on this visual processing task that did not involve faces (Lowes, Hancock, & Bobak, 2024). To account for these observed speed-accuracy trade-offs, we therefore report here one additional measure that was not preregistered, the Balanced Integration Score or BIS (Liesefeld & Janczyk, 2019, 2022), and describe this further in section 3.4. We also conducted additional unregistered logistic regression and regression analyses to formally assess whether BIS or our preregistered variables best classified potential DPs.
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study.
Legal copyright restrictions prevent public archiving of the Cambridge Face Memory Tests, the Cambridge Bicycle Memory Test, Old New Faces and the Famous Faces Test which can be obtained from the copyright holders in the cited references (details at https://osf.io/f496b/).

Participants
Participants were 24 individuals aged 8–71 years (7 men; 17 women) who self-reported severe everyday face recognition difficulties (DPs) and 110 age-matched controls aged 6–74 years (50 men; 60 women). All participants were living in the UK, Ireland or the USA and reported normal, or corrected-to-normal, vision as well as a sufficiently good level of English to understand the instructions and participant information.
Participants needed access to a laptop or computer with stable broadband.
Preregistered exclusion criteria were as follows: any neurodevelopmental or neurological condition; learning difficulty (other than mild dyslexia) or psychiatric illness; a history of major/moderate brain injury at any time or a mild head injury or concussion during the preceding 12 months; or acquired prosopagnosia (i.e., any face recognition difficulty that developed suddenly).
We enquired whether participants had previously participated in other face recognition or face training studies or had attempted any of the test battery online. Control participants who had done so were excluded (number excluded n = 0). DPs who had previously completed a face recognition test, but not face training, were included provided that at least three months had passed (n = 7, AF006, AF013, AF017, AF018, AF019 and CF008). Four of these reported prior participation in other studies but did not know which ones, two had attempted an (unspecified) online test and one child had been tested by a neuropsychologist using tasks not contained in our battery.
Participant data from any single test where responses were indicative of repeated random key presses, technology failure, a clear lack of understanding of the task or failure to follow instructions were excluded. Following preliminary analysis, data from two child control participants were removed because their performance on two or more tasks suggested suboptimal effort or failure to follow instructions. Two adult controls were also excluded because of inconsistencies between their PI20 scores, which were borderline, indicating possible difficulty, and their follow-up interviews, which revealed no real evidence of lifelong subjective difficulty (e.g., PI20 scores were influenced by a recent one-off failure to recognise a neighbour who was wearing a face mask). Because these two cases met neither the DP criteria (clear subjective lifelong impairment) nor the control criteria (no subjective impairment as measured by the PI20), we excluded them.
Participants were recruited through media coverage, social media, personal networks, and prosopagnosia support groups. Data were collected online and participants were offered a £10 gift voucher to recompense them for their time.
Cortex 172 (2024) 159–184

An overview of individual DP participant accuracy performance on the tests that follow is provided in Table 1. … complete the MaRs-IB, but their performance on other tasks was within typical norms, so they were retained (n.b. reasoning serves as a check to ensure that poor test performance in DP cannot be accounted for by impaired cognitive function or difficulty following test instructions). An overview of the recruitment and classification process is shown in Fig. 1.

Subjective report
To be initially classified as a potential DP, participants had to have subjective report of lifelong difficulty with familiar face recognition. For adults this was operationalised as a score of ≥61 on the PI20 (Shah et al., 2015), as recommended by Tsantani et al. (2021). This cut-off represented roughly 2.5 standard deviations above the age group control means (and 3.6 SD above the mean in the 14–35 years age group). Inclusion and exclusion criteria were confirmed by follow-up semi-structured interview, which also captured details of daily face recognition experiences. For children and young people aged 6–17 years we required parental report of face recognition difficulties, classified as a score >1.7 SD above the control group mean on a questionnaire comprising 32 items: 17 items from the PI20 (Shah et al., 2015), rephrased where necessary to make it suitable for reporting a third party's ability, and 15 additional items identified as hallmarks of prosopagnosia for non-experts (Murray et al., 2018). As with the PI20, a higher score indicates more difficulties with face recognition. As with adults, inclusion and exclusion criteria and case histories were confirmed through screening interviews with a parent.
For ease of comparison with other tasks, negative self-report z scores indicate performance worse than the control average.

Face tests
OLD NEW FACES.
The Old New Face task (see Fig. 2) is a test of delayed face identity recognition. Full methods are described in the original paper (Dalrymple et al., 2014); both children and adults completed the version with child faces. Since z scores were computed using the means and SDs of the appropriate control age group, any own-age bias is controlled for because each participant is compared to the typical score for their relevant age group (Fig. 4). Briefly, participants memorise ten target child faces presented one at a time for 3 sec each. Each target was then immediately shown again in the same order for 3 sec. Participants were instructed to try to memorise the faces. At test, the task is to indicate which of two faces presented is the previously seen face. One target and a similar-looking distractor (matched for age, expression and orientation) were presented simultaneously for 1 sec. Response options appeared under each face ("LEFT" or "RIGHT") and remained on screen until a response was made by mouse click. Targets appeared in randomised order three times each for a total of 30 trials, alongside 30 unique distractors that were never repeated. Stimuli were presented in grayscale with hair, ears, and any obvious moles or blemishes removed. DVs were accuracy and BIS (which uses mean RT on correct trials only). We administered an amended version of the test by introducing a distractor task between study and test. This took the form of a 17-item Ishihara colour vision test, lasting approximately two minutes. Data from one child control participant were not included in analysis of this test since their mean RT was 2.94 SD above the age group mean. Their other test performance was within typical norms, so they were retained as a participant.

CAMBRIDGE FACE MEMORY TESTS.
Three versions of the CFMT, suitable for different ages, were administered. The CFMT is considered the gold standard test for the detection of prosopagnosia. It tests viewpoint-dependent and viewpoint-independent recognition memory for newly learned faces. Impaired performance on this test is widely used to classify DP since difficulty learning and individuating faces is the core behavioural manifestation of DP. There is a matched object recognition test, and comparison of participants' standardised scores on both tests can indicate whether individuation deficits are face selective or also extend to object recognition, at least for the object class being tested. The DVs for all versions of the CFMT were accuracy and BIS. BIS is calculated using mean RT on correct trials only.

CAMBRIDGE FACE MEMORY TEST (ADULTS AND ADOLESCENTS AGED ≥14 YEARS).
A full description of the methods can be found in the original paper (Duchaine & Nakayama, 2006). Briefly, participants study six target faces, each from three different viewpoints. Cropped and greyscale faces are presented for 3 sec. The test stage uses a three-alternative forced choice (3AFC) format, comprising one target and two distractor faces. In the first introduction test phase each target face is tested with three identical viewpoints. In the second test phase, after review, each target face is shown in a novel viewpoint from that learned at study and, finally, the noise test section introduces novel views of the target faces with added Gaussian noise. There are 72 trials in total and chance = 33 %.

In the version used with younger children (Dalrymple & Palermo, 2016), participants study 12 unfamiliar target faces, one at a time, from three different viewpoints. Each identity is tested three times, once from each viewpoint, using a 3AFC paradigm. Like the introduction test phase in the CFMT and CFMT-Kids, testing occurs immediately after the study phase for each identity, thus minimising memory demands. There are 36 trials in total and chance = 33 %.

CAMBRIDGE BICYCLE MEMORY TASK (CBMT).
The CBMT (Dalrymple et al., 2014) is matched in format to the CFMT (adult and kids versions) but uses images of bicycles rather than faces to allow a wider object agnosia to be identified. The CBMT has been used with adults as well as children as young as seven years old and does not appear to have floor effects in this age group (Bate et al., 2020); it is well matched in controls to the CFMT (Biotti & Cook, 2016) and has been argued to have better diagnostic properties than the car memory test (Barton et al., 2019). So, following Bate et al. (2020), the six-target version was used with all age groups. There are 72 trials in total and chance = 33 %. As for the CFMT, the DVs were accuracy and BIS. The primary outcome measure of interest for classification purposes was the standardised CMT difference score, which was computed as the standardised face score minus the standardised bicycle score.
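As a sketch (with invented control norms, not the study's actual values), the difference score combines two separately standardised accuracies:

```python
def standardised_difference(face_raw, face_norm, bike_raw, bike_norm):
    """Standardised CMT difference: z(CFMT) minus z(CBMT), each
    standardised against (mean, SD) of the age-matched controls.
    Strongly negative values point to a face-selective deficit;
    low scores on both tests suggest a wider recognition problem."""
    z_face = (face_raw - face_norm[0]) / face_norm[1]
    z_bike = (bike_raw - bike_norm[0]) / bike_norm[1]
    return z_face - z_bike

# Hypothetical participant: 2 SD below controls on faces (46/72 versus
# an assumed control mean of 58, SD 6) but exactly average on bicycles.
diff = standardised_difference(46.0, (58.0, 6.0), 54.0, (54.0, 5.0))  # -2.0
```

Standardising each test before subtracting means the difference is not distorted by the two tests having different control means or spreads.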

FAMOUS FACES TEST.
Difficulty identifying familiar faces is the core deficit in DP. In contrast to the CFMT, which tests memory for newly learned, unfamiliar faces, the FFT measures long-term familiar face recognition memory. We were not aware of any recent published famous face tests suitable for children and young people as well as adults, so a new test was devised and administered to all age groups. First, we used social media to informally poll parents of children aged 6–16 years. From the longlist of suggested famous identities, we selected 30 identities from multiple sectors covering sport, entertainment, music, politics, and royalty and gathered facial images from the internet. Pilot testing with typically developing participants in the UK aged 6–16 years showed that the median number of facial identities that participants reported knowing was 28.5/30, indicating that the chosen identities were likely to be familiar to children and young people across our target age range.
Participants saw 30 famous faces one at a time (Fig. 3). Stimuli were presented in full colour on a black background with hair cropped but hairline and external contours retained. Identifying blemishes were removed and any jewellery blurred. Each face was presented for 2 sec and participants indicated whether the face was familiar by clicking "yes" or "no". If "no", they moved immediately on to the next trial and a new identity was shown. If "yes", participants then had to click on the correct name/identity (e.g., Boris Johnson, UK Prime Minister at the time of testing) from a choice of four (one target, three foils) before moving on to the next trial. Foils were the descriptions of other famous identities matched for gender and approximate age and, as far as practical, profession. To discourage guessing in the initial familiarity judgement, instructions stressed that it did not matter how many identities looked familiar. The test began with three practice trials using cartoon images to familiarise participants with the task, and feedback was given. No feedback was provided during the test phase. Participants could take a break after every 10 trials. Stimuli and trial files are available from the authors on request. We do not have permission to share them publicly because images were sourced from the internet.

Procedure
All tests were administered using the online platform Testable (www.testable.org) with Google Chrome as the recommended browser, except the matrix reasoning screening task, which was administered using Gorilla (www.gorilla.sc). Participants (or parents in the case of under-18s) were emailed a document containing written task instructions and links to each test and were instructed to complete the tests in the prescribed order. Full on-screen instructions were also provided. Parents supervised children and could help with explaining tasks, but instructions stressed they must not help children with their responses. The tests reported here were completed in the following order: CBMT, FFT, CFMT and then Old New Faces.
Testing took place over a minimum of three self-paced sessions. Each session included only one face recognition test alongside some perceptual tasks with no memory demands. Instructions recommended a minimum break between sessions of at least 12 h for participants under 18 years and one hour for adults. In addition, participants were informed that they could stop at the end of any test to take additional breaks, and parents were advised that children should take a break if they began to appear distracted or tired.

Analysis plan
First, we inspected the descriptive statistics and distributions for each test and compared these to published norms for controls where available. Next, test reliability was calculated using Cronbach's alpha for the sample as a whole and for the control and DP groups separately. We originally planned to include gender as a covariate, so we then checked for gender differences in all tests. Because there were no gender differences on any test (all ps > .05) and there were no significant differences in gender distribution between the DP and control groups (χ² = 2.14, p = .144, and all ps > .144 in individual age groups), data were collapsed across genders for analysis. Finally, to allow comparison across tasks, we standardised all measures by calculating z scores (Fig. 4). We centred z scores on control age group means because accuracy and RT varied on some tasks as a function of age. Age groups were predefined as 6–9, 10–13, 14–35, 36–59 and 60–74 years. The choice of age groups broadly followed previous literature (e.g., Bowles et al., 2009) and was also partially driven by practical considerations, as we aimed to have 20 controls in each age group. The child age groups were chosen primarily to be suitable for the CFMT_Kids and CFMT_Young Kids (7–9). When deciding which tests to administer to adolescents aged 14–17, we followed Bate et al. (2015) and Bennetts et al. (2017), who administered the adult version of the CFMT to a prosopagnosic and a super recogniser (respectively) and to typical controls aged 14 and 15 years. We inspected mean control performance for 14–17 year olds and 18–35 year olds and found no difference, with mean scores of 58.8/72 and 58.0/72 respectively. This age group standardisation allowed us to classify DP participants' performance across the full age range, since here the z scores quantify participants' performance relative to their own age group on any given task. We checked within age groups for any significant age–accuracy correlations and found none, except among 6–9 year olds for Old New Faces accuracy [r(10) = .664, p = .018] and among 10–13 year olds on the FFT [r(20) = .44, p = .042]. We also report unstandardised mean accuracy, z score and BIS by age group for each test.
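The age-group standardisation described above can be sketched as follows. This is an illustrative Python sketch (the study's analyses were run in R) using hypothetical control scores; the function simply z scores a participant against their own age group's control mean and SD.

```python
import statistics

def age_standardise(score, control_scores):
    """Z score a raw score against the participant's own age group's
    control mean and sample SD (hypothetical illustrative values)."""
    mu = statistics.mean(control_scores)
    sd = statistics.stdev(control_scores)  # sample standard deviation
    return (score - mu) / sd

# Hypothetical CFMT scores (out of 72) for eight same-age controls:
controls = [58, 60, 55, 62, 57, 59, 61, 56]
z = age_standardise(50, controls)  # a clearly below-average score
```

A participant scoring 50 against these hypothetical controls lands roughly 3.5 SD below their age group mean, illustrating how the same raw score can be typical in one age group and atypical in another.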
The primary preregistered DV for the CFMT, CBMT and Old New Faces was accuracy (correct trials/total trials). This was because, at the time of pre-registration, we were not aware of BIS. As discussed in section 2 above, to account for the different speed–accuracy trade-off strategies observed between controls and DPs on the matrix reasoning screening task (Lowes, Hancock, & Bobak, 2024), we additionally calculated BIS (Z accuracy − Z RT), which can be thought of as accuracy adjusted for RT, thereby controlling for differential speed-accuracy trade-offs.
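The BIS computation above can be sketched in a few lines. This is an illustrative Python sketch with hypothetical numbers, not the study's R pipeline; consistent with the paper, both z scores are centred on control distributions and RTs are taken from correct trials only (Liesefeld & Janczyk, 2019).

```python
import statistics

def bis(acc, rt_correct, ctrl_acc, ctrl_rt):
    """Balanced Integration Score: z(accuracy) minus z(RT), each
    standardised on the control distribution (hypothetical data).
    RT is mean RT from correct trials only."""
    z_acc = (acc - statistics.mean(ctrl_acc)) / statistics.stdev(ctrl_acc)
    z_rt = (rt_correct - statistics.mean(ctrl_rt)) / statistics.stdev(ctrl_rt)
    return z_acc - z_rt

# A participant with control-level accuracy but very slow responses
# still receives a strongly negative (impaired) BIS:
score = bis(acc=0.80, rt_correct=3.2,
            ctrl_acc=[0.78, 0.82, 0.80, 0.84, 0.76],
            ctrl_rt=[1.1, 1.3, 0.9, 1.2, 1.0])
```

The example makes the paper's key point concrete: accuracy alone (z ≈ 0 here) can mask an impairment that becomes apparent once RT is folded in.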
For the FFT, three DVs were preregistered. In addition to the primary DV of accuracy (the proportion of identities known to the participant that were correctly identified), we also calculated the raw number of identities known to each participant (i.e., a familiarity check), and familiarity (the proportion of identities known to each participant that were reported as familiar). We did not compute RTs, and therefore BIS, for the FFT since we considered reading speed to be an important potential confound.
The fifth and final preregistered measure of face memory was a global memory score computed for each participant from the mean of their standardised scores on the CFMT, Old New Faces, FFT and standardised CMT difference, using pairwise deletion. We report two global memory scores, the first calculated using accuracy and the second using BIS.
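The pairwise-deletion averaging described above can be sketched as follows (an illustrative Python sketch; missing tests are represented as `None`, which is our assumption about how to encode them):

```python
def global_memory(z_scores):
    """Mean of the available standardised scores, ignoring missing
    tests (pairwise deletion): a participant who skipped one task is
    averaged over the tests they did complete."""
    available = [z for z in z_scores if z is not None]
    return sum(available) / len(available)

# Hypothetical participant who did not complete Old New Faces;
# order: CFMT, Old New Faces, CMT difference, FFT
g = global_memory([-2.1, None, -1.4, -1.8])
```

Here the global score is the mean of the three completed tests rather than being dragged toward zero by the missing one.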

Statistical analysis
To calculate group differences, we conducted Bayesian independent samples t-tests on the standardised scores using the default Cauchy prior with a scale of .707. We pre-registered Bayesian analysis because it enables the strength of evidence for both the alternative and null hypotheses to be compared and removes the need to correct for multiple comparisons (Gelman et al., 2012; Kruschke, 2010). For completeness, we also report Welch's t-tests, which are recommended in independent subject designs with different experimental group sizes and/or unequal group variance (Ruxton, 2006) and are more conservative than Student's t-tests. Data were analysed in R (R Core Team, 2021) using RStudio 2021.09.1 and the Tidyverse (Wickham et al., 2019) and jmv, version 2.3.4 (Selker et al., 2022) packages. Cronbach's alpha was calculated in SPSS 28.0.0.0 (IBM).
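Welch's t-test differs from Student's in using per-group variances and the Welch–Satterthwaite degrees of freedom. A minimal Python sketch of the statistic (the study itself used R; the data below are hypothetical standardised scores):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with unequal n and variance."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical standardised scores: 10 controls vs 6 DPs
t, df = welch_t([0.2, -0.3, 0.5, 0.1, -0.1, 0.4, 0.0, -0.2, 0.3, 0.1],
                [-2.3, -1.1, -1.9, -0.4, -2.8, -1.5])
```

Note how the degrees of freedom (well below n1 + n2 − 2) reflect the unequal group sizes and variances, which is what makes Welch's test the more conservative choice.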
At the individual participant level, single case analyses were conducted using the SingleBayes_ES.exe computer programme for Bayesian tests of deficit (BTD) and the DiffBayes_ES.exe programme for the Bayesian standardised difference test (BSDT) (Crawford et al., 2010; Crawford & Garthwaite, 2007). The alpha level was set at .05. These analyses were preregistered. Our sample size of DPs was determined following previous literature, and our control sample size (target of 20 participants per age group) was informed by McIntosh and Rittmo's (2021) study, which recommends a minimum neuropsychological control sample of at least eight participants, notes that twice that number is more desirable, but finds that increasing the control sample size above 16 does little to meaningfully increase power.
Binomial logistic regression modelling and regression analysis were used to assess which outcome measures, or combination of measures, best predicted self-reported group membership as quantified by PI20 or parental report scores. These were not pre-registered analyses but were added in order to formally assess whether the original variables or BIS best predicted group membership.

Results
The results section is structured as follows. We first report test reliability, then group results for each of the four tests of face memory, followed by the computed global (average) memory score. To illustrate the effect of taking RT as well as accuracy into consideration, we compare the group results calculated first using accuracy and second using BIS. Finally, we present logistic regression data showing which models best predicted group membership.

Reliability
Table 2 shows reliability (Cronbach's alpha) for the overall sample and separately for the DP and control groups. Reliabilities are rarely reported in such detail but are needed in order to calculate the maximum possible correlations between tests and to assess their suitability for individual difference studies (see Bobak et al., 2023). Reliability for most tests was excellent or good; the Old New Faces test was acceptable for controls and good for DPs.
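Cronbach's alpha, reported in Table 2, is computed from the item variances and the variance of participants' total scores. A minimal Python sketch with hypothetical item data (the paper computed alpha in SPSS):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from a list of items, each a list of one
    score per participant: alpha = k/(k-1) * (1 - sum(item
    variances) / variance(total scores))."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-participant totals
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Three hypothetical binary items answered by five participants:
alpha = cronbach_alpha([[1, 0, 1, 1, 0],
                        [1, 0, 1, 0, 0],
                        [1, 1, 1, 1, 0]])
```

Values near .8, as here, would conventionally be labelled "good", matching the qualitative labels used in Table 2.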

Old New Faces
The potential DP group (n = 23; one self-reported DP participant did not complete this test) was, on average, less accurate than controls (n = 87; 23 participants did not complete this test) at judging whether a face had previously been seen or not. The results of the Bayesian independent samples t-tests and Welch's t-tests on standardised accuracy scores and BIS scores are reported in Table 8 and illustrated in Fig. 5. There was strong evidence for a true difference between groups both for accuracy (BF10 = 28.8) and BIS (BF10 = 5,090,000). In other words, the alternative hypothesis (a true BIS difference exists between groups) is more than 5 million times as likely as the null hypothesis given the data. Descriptive statistics for each age group showing unstandardised accuracy scores and BIS are shown in Table 3.

Cambridge Face Memory Tests
We analysed standardised accuracy scores to allow comparison across the different versions of the CFMT. Results are shown in Table 8 and illustrated in Fig. 6. As expected, self-reported DPs were less accurate than controls and, overall, there was strong evidence of a true difference between groups for both accuracy (BF10 = 3,290) and BIS (BF10 = 2,453). Unstandardised accuracy scores and BIS are shown in Table 4.

Difference between face (CFMT) and object (CBMT) memory accuracy
To investigate whether any memory difficulties were face-specific, we calculated a CMT difference score by subtracting each participant's CBMT z score from their CFMT z score (Fig. 7). A difference score above zero indicates that a participant performed relatively better at faces than bicycles, and a score below zero indicates that a participant performed relatively better at bicycles than faces. As a reminder, all z scores were centred on age group control means, so this difference score is a relative measure of face/bicycle memory accuracy compared to participants' own age control group. Unstandardised accuracy and BIS scores for the CBMT are shown in Table 5.

Famous Faces Test
Data from one potential DP were not analysed because they notified us of participant error during this task. Data from 23 controls are not available due to participant or technical error or participant drop out (i.e., some participants did not complete all three sessions). No floor effects were observed in any age group: 2 SD below the control mean accuracy was always above chance (25 %). As discussed in section 3.4, the test design meant it was not appropriate to calculate BIS for the Famous Faces Test, so instead we use only accuracy. As discussed below, FFT accuracy was a strong predictor of group and so remains a useful measure.
The raw number of famous faces known to participants was checked at the end of the test. At an age group level, the mean number of identities known ranged from 13.4 in young children (skewed by a higher proportion of non-UK controls, chosen to match the DP in this age group) to 24.2 in the 14–35 years age group. Considering UK control participants only, the mean numbers of identities known were: 16.4 (6–9 years); 21.2 (10–13 years); 25.7 (14–35 years); 23.1 (36–59 years); 20.4 (>60 years). Whilst the number of known identities was not the main variable of interest and was measured in order to calculate personally familiar accuracy and familiarity scores for each participant, we noted that the DP group reported knowing significantly fewer famous faces than controls, t(32.1) = 3.80, p < .001, with extremely strong support (BF10 = 133) for a true group difference. This could be due to previously reported differences in media consumption (Dalrymple & Palermo, 2016).

Famous Face Test: Identification
As shown in Fig. 8, potential DPs (n = 23), on average, correctly identified (named) a lower proportion of faces than controls (n = 100). A Bayesian independent samples t-test on standardised scores provided very strong evidence for a true difference between groups, BF10 = 204,436. In other words, the alternative hypothesis (a true difference exists between groups) is more than two hundred thousand times as likely as the null hypothesis given the data (see Table 8). Unstandardised group-level accuracy descriptive statistics are shown in Table 6.

Famous Face Test: Familiarity
Similar to the primary outcome measure of famous face identification discussed above ("choose the name that matches the face"), the self-reported DP group was significantly less accurate than controls at judging a known face as looking familiar ("does this face look familiar?"), t(24.9) = 4.43, p < .001, d = 1.22. A Bayesian independent samples t-test also provided extremely strong evidence for a true difference between groups (BF10 = 6,130,000). Notably, an almost perfect correlation was observed between participants' familiarity scores and identification scores in both the DP [r(21) = .997, p < .001] and control [r(98) = .983, p < .001] groups. This result suggests that, at least in this cohort, a sense of familiarity was not distinct from the ability to identify a face; however, the 4AFC paradigm used in the identification phase would be expected to result in a higher familiarity/identification correlation than a paradigm which required participants to generate the name themselves. We chose a 4AFC paradigm for this task because we reasoned it would be easier for children than a standard FFT paradigm. A typical FFT requires participants to provide a name, or some other unique identifying detail, from memory, unprompted, usually by writing or typing a response. We were therefore also concerned that parental interference might become an issue with such a design, since children would be likely to need parental help to type and/or spell the celebrity's name, and this could result in parents answering on behalf of children. Due to our design, it was therefore possible that the expected group differences would not be observed on this somewhat easier test (see Rivolta et al., 2013). We did not find this to be the case and observed strong group differences on the FFT (see Table 8), with, as expected, the DP group being less accurate at identifying famous faces known to them than the control group (Table 6). These results suggest that our test design did not unduly assist potential DPs to identify familiar faces versus controls.

Global face memory measures
Because it is possible that a participant may achieve a score on one test that is higher or lower than their true ability due to chance or to random factors such as tiredness, we computed a global memory score by averaging participants' z scores across the four face memory measures of interest (Old New Faces, CFMT, CMT Difference and FFT). Descriptive statistics for the global memory scores are provided in Table 7. As shown in Fig. 9, the self-reported DP group's global memory accuracy and global memory BIS means were both significantly lower than the respective control group means. This finding of lower DP performance versus controls, even when measured across multiple tests, suggests that the group differences observed for each individual face memory outcome measure are not due solely to noise, or chance, and is confirmed by much higher Bayes factors for global face memory versus individual tests, as shown in Table 8. Table 8 also shows that statistical analysis of the group differences in accuracy provided strong support for the alternative hypothesis and that the group difference was significant.
When considering group differences averaged across multiple standardised measures of face memory, the global memory BIS (i.e., accuracy adjusted for RT) showed a larger effect size than the global memory score calculated using accuracy alone (see Table 8). We also separately calculated group differences for adults (aged 18 years and over) and found an identical pattern of results (lower portion of Table 8) and very similar effect sizes. Although face processing continues to develop throughout childhood (e.g., Pascalis et al., 2011) and declines in later life (Bowles et al., 2009), our approach of using age-matched z scores to analyse group differences in this lifespan study ensures that any individual's face processing impairment is best classified with comparison to typical controls at a similar stage of development.

Which measures best predict group membership?
We used binomial logistic regression to formally assess which objective face memory measure, or combination of measures, best predicted self-reported group membership as classified by cut-offs on the PI20 or the parental report questionnaire. We developed five models (see Table 9). All models significantly predicted group membership, and all correctly classified over 94 % of controls. Crucially, however, the models' ability to correctly classify self-reported DPs varied greatly, ranging from only 16.7 % (Model 5, with CFMT accuracy as the predictor) to 68.2 % in the best performing model (Model 2, comprising four separate predictor outcome variables: CFMT BIS, CMT difference BIS, Old New BIS and Famous Face identification accuracy). Model 2 strongly predicted group membership and outperformed all other models, correctly predicting 68.2 % of self-reported DPs and 94.9 % of controls and explaining around 44 %–57 % of the variance in subjective ratings of participants' face recognition ability.
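The AUC values used to compare the models above can be computed directly from a model's predicted probabilities via the Mann-Whitney relation. An illustrative Python sketch with hypothetical predicted probabilities (not the study's fitted models):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney relation: the
    probability that a randomly chosen positive (DP) case receives a
    higher predicted score than a randomly chosen negative (control)
    case, with ties counting one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model-predicted probabilities of DP group membership:
dp_probs = [0.9, 0.7, 0.8, 0.4]
control_probs = [0.1, 0.3, 0.2, 0.5, 0.1]
a = auc(dp_probs, control_probs)
```

An AUC of 1 means the model ranks every DP above every control; .5 means chance-level discrimination, which is why AUCs in the .87 to .9+ range indicate strong classifiers.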

How does using BIS change classification of DP?
As seen in Table 9 above, the effect sizes for the mean group differences were larger using BIS than using accuracy alone. (Note: there was only one potential DP in each of the two youngest age groups, so SDs could not be calculated separately for DPs; mean scores were computed from CFMT BIS, CMT difference BIS, Old New Faces BIS and FFT identification accuracy.) We therefore next investigated how using BIS might change, or confirm, classification as a DP on our objective tests. Results are shown in Tables 10 and 11. As a reminder, to meet the initial classification as a potential DP, all participants first had to score atypically on the parental questionnaire or the PI20.
We then sub-classified these self-reported "DPs" into three groups using the average of four objective measures: CFMT, FFT, Old New Faces and the difference between CFMT and CBMT scores (CMT difference). "Major DPs" scored more than 1.7 SD below their age group control mean; "Mild DPs" scored between 1 and 1.7 SD below the control means; and "Subjective DPs" showed atypical self-report scores but scored within 1 SD of their age group control means, showing no objective impairment despite subjective report of face recognition difficulties. One participant (CF059) did not complete the Old New Faces test but was classified as a Major DP because their CFMT z score (−3.45) indicated severe impairment and their FFT score was also below average. As shown in Table 10, the key finding of our study was that 83.3 % of the self-reported DP group showed objective face recognition deficits (20/24; 15 major, five mild) using the global memory BIS measure, versus 58.3 % who showed deficits when using global memory accuracy alone (14/24: five major, nine mild). This suggests that some self-reported DPs in our sample were able to achieve close to normal, or only mildly impaired, performance by trading speed for accuracy; however, once RT was considered (using BIS), their impairment became apparent. This important finding demonstrates the value of BIS as a measure for classifying DP, since these participants would otherwise have been missed if classification considered accuracy alone.
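The three-way sub-classification above can be sketched as a simple cut-off function. A Python sketch for illustration; the treatment of scores falling exactly on a boundary is our assumption, as the paper does not specify it:

```python
def classify(global_z):
    """Sub-classify a self-reported DP by global memory z score using
    the cut-offs described above (boundary handling is assumed)."""
    if global_z <= -1.7:
        return "Major DP"
    if global_z <= -1.0:
        return "Mild DP"
    return "Subjective DP"  # atypical self-report, no objective deficit

# Hypothetical global z scores echoing values discussed in the text:
labels = [classify(z) for z in (-3.45, -1.32, -0.86)]
```

Applied to a participant's global accuracy score and global BIS score separately, this function reproduces the paper's central contrast: the same person can be "Mild" or "Subjective" on accuracy yet "Major" on BIS.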
Three self-reported DP participants showed no objective impairment regardless of the measure used; we refer to these as having "subjective" DP since no objective impairment was observed. Additionally, one potential child DP showed a mild impairment (Z = −1.32) when considering global accuracy alone but was not considered objectively impaired once RT was accounted for (BIS), although their global memory BIS was still below average (Z = −.86). In summary, many more of the participants with atypically high PI20/parental questionnaire scores were classified as DP using BIS than using classical measures. Here we classify a major impairment as z ≤ −1.7 but, for comparison, using a stricter −2 SD cut-off, our data show that 12 participants would be classified as DP using the BIS measures versus only five when using accuracy alone.
Finally, we compared how classification using individual test scores, rather than the global face memory score shown in Table 10, would differ using accuracy versus BIS and these are reported in Table 11.

Discussion
Twenty-four participants with self-reported face recognition problems (DPs) and 110 age-matched controls completed four tests measuring different aspects of objective face recognition ability alongside a self-report questionnaire. To examine the best way of identifying DP, we used both traditional accuracy measures (proportion correct) and a novel integrated measure, the Balanced Integration Score (BIS), which adjusts accuracy to take account of RT while controlling for speed-accuracy trade-offs. At an individual participant level, 15 individuals who self-reported face recognition difficulties were classified as having a major face recognition impairment using BIS, but only five were classified as having a major impairment when using accuracy alone. Overall, observed between-group effect sizes for computed global (averaged) memory scores were also larger using BIS than using accuracy alone (see Table 8).

Is self-report a useful initial classification measure for DP?
Before comparing how the objective face memory measures were able to detect and classify DP, we first checked that self-report, which we used to make a preliminary classification of participants into potential DPs or likely controls, was in fact a valid basis on which to make this preliminary classification. The extent to which self-report is a valid indication of face recognition ability has been much debated. Furthermore, direct comparisons are difficult to make across studies using different self-report questionnaires. Nonetheless, previous studies have reported that self-reported and objective face recognition measures are, at best, modestly correlated in naïve typical perceivers (Bobak et al., 2019; Matsuyoshi & Watanabe, 2021; Palermo et al., 2017) and in individuals who were aware that they met the diagnostic criteria for DP before completing the self-report questionnaire (Murray & Bate, 2019). By contrast, studies of individuals with DP, mainly using the PI20 (Shah et al., 2015), found that DPs do have insight into their own face processing difficulties (Burns et al., 2022; Gray et al., 2017; Livingston & Shah, 2018; Tsantani et al., 2021; Ventura et al., 2018), or at least into the fact that their face recognition ability was poor relative to others (Palermo et al., 2017). Our data also support the use of self-report measures for classifying DP. All the potential DP participants except one made contact with our lab because they (or a parent) believed they struggled with face recognition or had a family history of DP, but, unlike the DP participants in the Murray and Bate (2019) study, participants in the present study were not told prior to completion of the self-report questionnaire whether they met the 'diagnostic' threshold for DP. They were arguably therefore less likely to be influenced by a DP 'diagnosis' when subjectively rating their face recognition ability. However, four individuals reporting poor face recognition ability had previously participated in other face recognition studies, which may have provided insight into their ability.

Table 10 – Global memory accuracy and BIS classification for all self-reported DP participants. Table 11 – How individual DP case performance compares using accuracy and BIS. Note. FFT = Famous Faces Test; CFMT = Cambridge Face Memory Tests; CBMT = Cambridge Bicycle Memory Test; CMT difference = CFMT − CBMT; Global face memory = standardised mean average. BIS was not calculated for the FFT, so those results are the same as for accuracy alone but are shown again in the second table for ease of comparison. Major impairment (z ≤ −1.7) is shown in red; mild impairment (−1 ≥ z ≥ −1.69) in yellow. All scores are standardised and centred on age-matched control means. The two rightmost columns show, first, the number of independent face processing measures on which a participant was impaired (mild or major) and, second, whether the participant showed an impairment of at least 2 SD below control means on at least two of the four face memory measures; * indicates that the participant completed three rather than four tasks.
It has been reported that women rated their prosopagnosia symptoms as more severe than men (Murray & Bate, 2019), but we found no support for this. Among adult DPs (n = 20; 15 female, 5 male) who showed both objective and subjective impairment, we found no evidence of gender differences in PI20 scores (d = .32, p = .604), and anecdotal evidence for the null hypothesis (BF10 = .52). It is possible that there is an interaction between gender and status (naive; informed) that should be explored further by researchers who disclose status to participants prior to administering a self-report questionnaire, but this was not a relevant issue in our study.
Regression analysis showed that, overall, the standardised subjective rating score was a significant predictor of global face memory BIS, F(1,124) = 57.7, p < .001, explaining around 31 % of the variance in objective scores (adjusted R² = .31). When considering adults and children separately, unsurprisingly, adults' subjective rating of their own face recognition ability was a better predictor of group than parental report of their child's face recognition ability, explaining around 40 % and 6 % respectively of the variance in global face memory BIS. Binomial logistic regression analysis showed that parental report was nevertheless a significant predictor of group, suggesting that parents did, on average, have insight into their children's face memory ability in binary terms, i.e., whether it was much worse than average or not. However, parents' ability to predict more precisely their child's ability relative to their age group, as measured by global face memory BIS z score, was only just above chance (p = .044).
One factor that may have limited parental ability to accurately judge their child's face recognition ability was the parent's own face recognition ability. Face recognition is a highly heritable ability (Wilmer et al., 2010) and many reported cases of DP have a known family history (De Haan, 1999; Duchaine et al., 2007; Grüter et al., 2008; Lee et al., 2010; Schmalzl et al., 2006). In our study, the face recognition ability rating for three of the four child DP candidates was provided by a parent who themselves reported having difficulty recognising faces. Thus, it is likely that these parents lacked an accurate reference point for judging typical, and consequently atypical, face recognition ability.

DP performance on Old New Faces
In line with previous work (Dalrymple et al., 2014), we found that Old New Faces was a useful test for classifying DP.
Although the test could be criticised as being more of an image memory test than a face recognition test, since the images used at study and test are identical, Dalrymple et al. (2014) reported that all adult DPs (n = 16) showed impaired accuracy on the Old New Faces test. By contrast, 0/16 were impaired on a matched old/new houses test and only 1/16 was impaired on a matched old/new horses test. Among children, 4/6 DPs were significantly impaired versus controls on the Old New Faces test but, similar to adults, 0/6 were impaired on the matched object task, in this case an old/new flowers task. By contrast, the authors observed that 10/16 adult DPs were unimpaired on Cambridge Face Perception Test accuracy (CFPT; Duchaine et al., 2007). Together, these results suggest, firstly, that the Old New Faces test is useful for identifying DP and that the impairment detected by the task is face specific. Secondly, the results suggest that the presence of a memory demand in the Old New Faces task, even when using the same image at study and test, produces different patterns of impairment compared with the CFPT, which is also a test of face matching but without memory demands. Overall, Dalrymple et al. (2014) show that the Old New Faces task is a useful source of converging evidence of face recognition difficulties when the CFPT is not suitable. Our data support these findings, namely that the DP group, on average, produced significantly lower accuracy and BIS scores than the control group. Additionally, on the Old New Faces task we found stronger Bayesian evidence and larger effect sizes for a group difference for BIS versus accuracy alone.

DP performance on Famous Faces Test
Despite not being able to calculate BIS for the FFT, since reading speed would have confounded RT, we nevertheless found this novel FFT to be a useful measure for classifying DP using identification accuracy. This measure produced the strongest accuracy effect size (d = 1.15, p < .001) of the four individual face memory tests we administered. Additionally, Bayesian analysis showed extremely strong evidence for a group difference (BF10 = 204,436), again the highest of any single test (Table 8). To check that group differences were not driven by a small number of individuals with extreme scores, we conducted individual case analyses on all accuracy measures (Tables 1 and 11). On the FFT, 19/23 potential DPs produced identification accuracy scores below the age group mean, and of these, 9/23 scored significantly below mean control accuracy. A previous large-scale study (Bate et al., 2019) found evidence of a dissociation between memory for familiar faces (FFT) and memory for newly learned faces in 63 of 165 individuals and suggested that long-term memory for familiar faces (as indexed by a FFT) may be selectively impaired in DP.
Our data support the use of an FFT as one of several measures to classify DP, even though BIS could not be calculated.

Do accuracy or BIS measures best predict group membership?
The logistic regression model that included three BIS face memory measures plus FFT accuracy (Model 2, see Table 9) was the most sensitive and classified the most self-reported DP participants (68.2 %) as being objectively impaired. Importantly, Model 2's higher sensitivity did not come at the cost of reduced specificity, as its ability to classify controls was very similar (94.9 % vs 96.7 % for Models 2 and 5 respectively). In comparison, CFMT accuracy (Model 5), the measure traditionally used to diagnose DP, classified only 16.7 % of our self-reported DP group. Although AUC was only slightly lower for Model 3 (AUC = .87, global face memory accuracy) and Model 4 (AUC = .88, global face memory BIS), these models were much less sensitive than Model 2, correctly classifying only 29.2 % and 54.2 % of self-reported DPs respectively.
One DP participant produced very low scores on three of the four tests. We therefore checked to ensure that this potential outlier was not unduly influencing results by repeating the logistic regression modelling with this participant removed. Although exact values changed, the pattern of results was very similar and Model 2 remained the best model for classifying DP.
We also reran these analyses on a sub-sample of only the self-reported DPs who showed mild or major impairment on the global face memory accuracy score, that is, without the ten participants classed as "subjective DPs", and found the same pattern of results (see supplementary materials sections 2 and 3). Again, Model 2, with BIS as the outcome measure, remained the best model (AUC = .98, p < .001). As would be expected from this approach (i.e., excluding the potential DPs who achieved typical accuracy due to atypical RT), the group effect size differences when comparing accuracy and BIS were smaller than when we analysed the full sample of potential DPs. This is because, in this alternative approach, only the self-reported DPs with impaired accuracy are included in the analysis, and so it logically follows that the utility of accuracy as a classification measure would improve. Nevertheless, the fact that BIS measures, even among this sub-sample, were more sensitive than accuracy measures for classifying DP further strengthens the value of BIS.
Our results confirm the importance of considering RT as well as accuracy when classifying DP, and these findings are in line with recent studies (Fysh & Ramon, 2022; Stacchi et al., 2020) and a large literature review by Geskin and Behrmann (2018).
However, it is possible that RT (and therefore BIS) is more useful in some tasks and paradigms than others. Our data show that for the CFMT, a test specifically designed to detect DP, BIS added little or no additional information compared to accuracy alone. Effect sizes for both measures were similar, and indeed slightly larger for accuracy. Others have also questioned whether RT always adds value over accuracy. A recent study (DeGutis et al., 2022) investigating both accuracy and response times on a face matching task, the Benton Face Recognition Test (BFRT-c; Rossion & Michel, 2018), reported that RT alone did not reliably predict group membership. The BFRT-c is an updated version of the original face matching test (Benton et al., 1983) that was specifically designed to emphasise speed as well as accuracy. However, there are two important differences between the BFRT-c and the tasks used here. Firstly, despite its name, the BFRT is a perceptual task since it involves no memory demands. Secondly, the BFRT-c requires participants to click up to three faces from a choice of six, meaning that motor control is likely to influence RT more than it would in our tasks, where participants had to make a single response. Finally, as the authors explain, the BFRT-c design means that it is not possible to analyse RT on correct trials only, as is common practice in other face cognition tasks. Notably, RT from correct trials only is the measure used to calculate BIS (Liesefeld & Janczyk, 2019). Despite this, some theoretical papers have incorrectly used RT on all trials when comparing BIS with other integrated measures of speed and accuracy, which could lead to confusion about how to calculate BIS (for a fuller explanation see Liesefeld & Janczyk, 2022). BIS may therefore not be informative on tasks where RT measures include both incorrect and correct trials.
Our findings support the use of BIS as an integrated measure that adjusts accuracy to account for RT. However, we make no claims about the use of RT as a sole measure, which was the question DeGutis et al. (2022) investigated. As the authors correctly caution, if researchers wish to use RT instead of accuracy on a given task, RT should first be validated as a measure. In the present study we were instead interested in whether accuracy and RT together might be more informative than accuracy alone. Our data show that accuracy and RT together (BIS) explained more of the variance in self-reported face recognition ability than accuracy alone and, additionally, showed greater sensitivity for classification of DP than accuracy alone. A third benefit of BIS was that, for the global face memory score and the Old New Faces task, the observed effect sizes were larger for BIS than for accuracy (on the CFMT there was little difference). This is an important consideration in neuropsychology research, where effect sizes are typically modest and sample sizes often small, thereby limiting power (McIntosh & Rittmo, 2021). Using BIS thus provides a practical solution to ensuring that no impaired participants are omitted from analyses. This increased power to detect differences between populations will allow for better understanding of the deficits underpinning DP and provide pathways to effective training. Better classification will also ensure that neuropsychological research is more ethical. The British Psychological Society Code of Human Research Ethics calls for maximising the benefit to participants (point 2.4) from inception to dissemination (Oates et al., 2021). Thus, by improving the methods of studying DP, our work contributes to that principle of ethical research.
In addition to offering a practical solution, BIS, as an integrated measure of both accuracy and RT, is a more ecologically valid approach to identifying DP than accuracy alone. This is because in typical social interactions, the amount of time an individual with DP has to make a correct identification is effectively the time it takes the person they are interacting with to recognise them: it is time limited. If the DP has not recognised the face before the person greets or speaks to them, then this will appear to be a failure of recognition (even if, given much longer to study the face, the DP might have been able to identify the person). Response latency and accuracy are therefore both important elements with regard to ecological validity.

Use of global scores and choice of cut-off
In their editorial of a special issue on DP, Bate and Tree (2017) argued that seeking converging evidence of impairment across multiple tests provides more compelling evidence of a true deficit, as well as mitigating the risk that unimpaired controls may be misclassified as DP. The same argument is made in super-recogniser research (Bate et al., 2018; Ramon, 2021). We therefore administered four separate tests and used these scores to compute a global face memory score (see Table 7). Global scores showed large differences between DPs and controls and thus appear useful. However, logistic regression showed that the global face memory scores produced slightly lower AUCs than Models 1 and 2, which used non-averaged scores from the individual tests. Our results therefore suggest that although global effect sizes are larger than those from individual tasks, and will therefore increase the power to detect group-level differences, the four individual BIS measures (CFMT BIS, CMT Difference BIS, Old New Faces BIS and FFT accuracy) best predicted group membership. This could be because averaging results can attenuate the insights provided by multiple individual test scores. Researchers may need to decide whether they wish to prioritise sensitivity (classification) or the ability to detect group differences (effect sizes): individual BIS measures were slightly better for classification purposes, while global measures showed larger effect sizes, which may be an important consideration when sample sizes are small. Finally, we also compared the patterns of impairment across the four independent face memory measures (Table 11) at the individual case level. Using this alternative, and more commonly used, approach, results once again supported the overall conclusion that BIS is a more sensitive measure for classifying DP than accuracy alone. More than twice as many (71 %) of the self-reported DP participants showed severe objective impairment (< −2 SD) on at least two individual face memory BIS measures, compared with only 33 % who were severely impaired on at least two accuracy measures. BIS therefore appears equally valuable whether using a global (averaged) score or multiple independent measures of face memory. Notably, although we used more liberal cut-offs of −1 SD to classify mild and −1.7 SD to classify major impairment in potential DPs, all participants who were classified as impaired (mild or major) on global face memory BIS also showed severe impairment (< −2 SD) on two individual BIS measures, suggesting that a more liberal cut-off is justified provided multiple objective tests are administered.
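The decision rules described above can be made concrete. The following hypothetical Python sketch uses the cut-offs reported in this study (−1 SD mild, −1.7 SD major on the global score; < −2 SD on at least two individual measures for the converging-evidence check); the decision logic is our own simplification for illustration:

```python
def classify_dp(global_bis_z, individual_zs,
                mild_cut=-1.0, major_cut=-1.7, severe_cut=-2.0):
    """Label severity from the global BIS z score and separately report
    whether at least two individual measures show severe impairment
    (the converging-evidence check across multiple tests)."""
    if global_bis_z <= major_cut:
        severity = "major"
    elif global_bis_z <= mild_cut:
        severity = "mild"
    else:
        severity = "subjective"
    converging = sum(z < severe_cut for z in individual_zs) >= 2
    return severity, converging
```

In this study, every case flagged as mild or major on the global score would also pass the converging-evidence check, which is what motivates accepting the more liberal global cut-offs.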

Online versus lab-based data
We compared our test results with the previous literature and found that our control group mean accuracy data were broadly very similar, almost always within 1 SD, to published lab-based studies testing similar age groups, giving us confidence in our results. However, one important difference we wish to highlight is that we observed greater score variability, and thus larger standard deviations, than the previous literature. For example, CFMT accuracy in both the 14–25 and 36–59 years control groups was 81 %, almost identical to previously reported scores of 80.4 % (Duchaine & Nakayama, 2006). By contrast, the standard deviations in these control groups were 14 % and 13 % respectively, versus 11 % in the original Duchaine and Nakayama (2006) study. Among control participants aged 60–74 years, mean accuracy (72 % ± 15 %) was again very similar to the published age-group norm for 60–69 year olds of 70.14 % ± 12 % (Bowles et al., 2009), but standard deviations were once again higher. We observed similar patterns in the child data. This finding is highly relevant for classification because DP is 'diagnosed' or classified using the mean and standard deviation of the control group. For online research, we therefore caution against using "standard cut-offs" on popular tests such as the CFMT, as these may not be valid for online data collection. Instead, online-specific norms should be used. Further, our data support the need to use age-group norms (Bowles et al., 2009), since accuracy and standard deviations both varied by age group, resulting in different cut-offs for mild and major impairment in each age group. This finding is also in line with results from the large sample of 165 DPs (Bate et al., 2019).
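Because cut-offs are a direct function of the control mean and SD within each age group, sample-specific norms are straightforward to derive from the online control data themselves. A minimal stdlib-only sketch (the data structure and function name are hypothetical, for illustration only):

```python
import statistics

def age_group_cutoffs(controls_by_group, mild_sd=1.0, major_sd=1.7):
    """Derive mild and major impairment cut-offs per age group from the
    control sample itself, rather than reusing published lab-based norms."""
    cutoffs = {}
    for group, scores in controls_by_group.items():
        m = statistics.mean(scores)
        sd = statistics.stdev(scores)  # sample SD of the control group
        cutoffs[group] = {"mild": m - mild_sd * sd,
                          "major": m - major_sd * sd}
    return cutoffs
```

Larger control SDs, as observed in our online sample, push these cut-offs lower, which is why lab-based "standard cut-offs" can misclassify online participants.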

Conclusion
In conclusion, our key finding is that, whether using multiple individual scores or an averaged global score, and regardless of whether the cut-off applied was −1.7 SD or −2 SD, BIS was a more sensitive measure of the difference between self-reported DPs and controls than accuracy alone. Furthermore, BIS better predicted group membership than accuracy alone. Our data show that using four measures (Old New Faces BIS, CFMT BIS, CMT difference BIS and FFT identification accuracy) alongside the PI20 to classify DP considerably improved sensitivity (captured more DPs) with no reduction in specificity (did not decrease the proportion of controls correctly classified) compared with traditional accuracy measures. These results have important applied value for researchers who must identify and classify DP. Using measures that show strong effect sizes can increase power, an important consideration in research into rare conditions such as DP, where large sample sizes are difficult to achieve. Improved classification will also allow a better understanding of the underpinnings of DP and avoid the unnecessary exclusion of participants whose impairments may be masked by speed-accuracy trade-offs.

Note.
Self-report = standardised score on PI20 or parental questionnaire. CMT difference = (Cambridge Face Memory accuracy z score − Cambridge Bicycle Memory accuracy z score). Mean RT is response time on correct trials. Effect sizes in bold indicate performance significantly worse than controls, calculated using the SingleBayes_ES.exe computer programme for Bayesian tests of deficit (Z-CC) and the DiffBayes_ES.exe programme for the Bayesian standardised difference test (Z-DCC) from Crawford et al., 2010; Crawford & Garthwaite, 2007. Alpha = .05.

Fig. 1 – Overview of recruitment and classification process.

Fig. 2 – Old New Faces test design.

Fig. 3 – Famous Faces test design.

Fig. 4 – Overview of the method used to calculate z scores and BIS.

Fig. 5 – Group difference (self-reported DPs vs controls) in Old New Faces performance using (A) accuracy and (B) BIS (Z_accuracy minus Z_RT). Each dot represents a single data point, the box shows the interquartile range (IQR) and the midline indicates the group median score. The end of each whisker line represents 1.5 × the IQR.

Fig. 6 – Group difference (self-reported DPs vs controls) in performance using (A) accuracy (correct trials/total trials) and (B) BIS (Z_accuracy minus Z_RT). Each dot represents a single data point, the box shows the interquartile range (IQR) and the midline indicates the group median score. The end of each whisker line represents 1.5 × the IQR. Note. Participants aged 6–9 years completed the CFMT Young Kids, ages 10–13 years completed the CFMT-Kids and participants aged ≥14 years completed the CFMT.

Fig. 7 – Group difference (self-reported DPs vs controls) in participants' relative performance on the Cambridge Face Memory Test and Cambridge Bicycle Memory Test (Z_CFMT minus Z_CBMT) using (A) accuracy and (B) BIS (Z_accuracy minus Z_RT). Each dot represents a single data point, the box shows the interquartile range (IQR) and the midline indicates the group median score. The end of each whisker line represents 1.5 × the IQR.

Fig. 8 – Group difference (self-reported DPs vs controls) in identification accuracy. Scores were calculated using only those facial identities that participants reported knowing. Each dot represents a single data point, the box shows the interquartile range (IQR) and the midline indicates the group median score. The end of each whisker line represents 1.5 × the IQR.

Fig. 9 – Global face memory scores showing group differences (self-reported DPs vs controls) in mean face memory scores using (A) accuracy and (B) BIS (Z_accuracy minus Z_RT). Mean scores were computed from CFMT accuracy, CMT difference, Old New Faces accuracy and FFT identification accuracy. Each dot represents a single data point, the box shows the interquartile range (IQR) and the midline indicates the group median score. The end of each whisker line represents 1.5 × the IQR.
Note. Major DP: z ≤ −1.7, shown in red; Mild DP: −1 ≥ z ≥ −1.69, shown in yellow; Subjective DP: z > −1. The Change column on the far right indicates whether the DP classification severity rating changed when classifying using BIS versus accuracy; ↑ indicates the rating increased, = indicates it was unchanged, ↓ indicates it decreased.

Table 1 – Individual case scores of self-reported DP participants.

Table 2 – Test reliabilities.
Note. CFMT = Cambridge Face Memory Test, CBMT = Cambridge Bicycle Memory Test. Only one DP completed the CFMT-Kids and the CFMT-Young Kids, meaning alpha could not be calculated separately for DPs on these tests.

Table 3 – Old New Faces accuracy and BIS by age group.
Note. Chance = .5. There was only one potential DP in each of the two youngest age groups, so SD could not be calculated separately for self-reported DPs in these age groups. Data from 23 controls are not available due to participant or technical error or participant drop-out. Because BIS is a standardised measure, BIS scores for controls are 0 by definition.

Table 4 – CFMT accuracy and BIS by age group. Note. Chance accuracy = .33. There was only one self-reported DP in each of the two youngest age groups, so SD could not be calculated separately for DPs in these age groups. 20 controls did not complete this test. Because BIS is a standardised measure, BIS scores for controls are 0 by definition.

Table 5 – Cambridge Bicycle Memory Test accuracy and BIS by age group. Note. Chance accuracy = .33. There was only one DP in each of the two youngest age groups, so SD could not be calculated separately for DPs in these age groups. Because BIS is a standardised measure, BIS scores for controls are 0 by definition.

Table 6 – FFT identification accuracy by age group.

Table 7 – Global face memory BIS by age group.

Table 8 – Comparing effect sizes for the difference between potential DP and control group scores before (Accuracy) and after (BIS) accounting for RT. Note. *p < .05, **p < .01, ***p < .001. p values in bold remain significant after Bonferroni correction. CMT difference = CFMT − CBMT. RT (and consequently BIS) was not considered a relevant measure for the FFT, since participants had to read names, meaning RT would reflect reading speed and comprehension as well as face processing ability. Instead, we use only famous face identification accuracy (the proportion of known famous faces that were correctly identified) in both versions of the global scores.

Table 9 – Logistic regression models predicting self-reported group membership (DP or control).