People have modest, not good, insight into their face recognition ability: a comparison between self-report questionnaires

Matsuyoshi, Daisuke; Watanabe, Katsumi

doi:10.1007/s00426-020-01355-8

People have modest, not good, insight into their face recognition ability: a comparison between self-report questionnaires

Original Article
Open access
Published: 20 May 2020

Volume 85, pages 1713–1723, (2021)
Cite this article

Download PDF

You have full access to this open access article

Psychological Research Aims and scope Submit manuscript

People have modest, not good, insight into their face recognition ability: a comparison between self-report questionnaires

Download PDF

3430 Accesses
15 Citations
4 Altmetric
Explore all metrics

Abstract

Whether people have insight into their face recognition ability has been intensely debated in recent studies using self-report measures. Although some studies showed people’s good insight, other studies found the opposite. The discrepancy might be caused by the difference in the questionnaire used and/or the bias induced using an extreme group such as suspected prosopagnosics. To resolve this issue, we examined the relationship between the two representative self-report face recognition questionnaires (Survey, N = 855) and then the extent to which the questionnaires differ in their relationship with face recognition performance (Experiment, N = 180) in normal populations, which do not include predetermined extreme groups. We found a very strong correlation (r = 0.82), a dominant principal component (explains > 90% of the variance), and comparable reliability between the questionnaires. Although these results suggest a strong common factor underlying them, the residual variance is not negligible (33%). Indeed, the follow-up experiment showed that both questionnaires have significant but moderate correlations with actual face recognition performance, and that the correlation was stronger for the Kennerknecht’s questionnaire (r = − 0.38) than for the PI20 (r = − 0.23). These findings not only suggest people’s modest insight into their face recognition ability, but also urge researchers and clinicians to carefully assess whether a questionnaire is suitable for estimating an individual’s face recognition ability.

A Dress Is Not a Yes: Towards an Indirect Mouse-Tracking Measure of Men’s Overreliance on Global Cues in the Context of Sexual Flirting

Article Open access 07 February 2024

Ingo Landwehr, Katrin Mundloch & Alexander F. Schmidt

Development and Validation of the Camouflaging Autistic Traits Questionnaire (CAT-Q)

Article Open access 25 October 2018

Laura Hull, William Mandy, … K. V. Petrides

The ethical application of biometric facial recognition technology

Article Open access 13 April 2021

Marcus Smith & Seumas Miller

Introduction

Self-report measure is one of the important methodologies in many disciplines of psychology, such as educational, developmental, clinical, social, and personality psychology. Using questionnaires, researchers have quantified people’s insight into their skills, intelligence, cognitive ability, personality, or mood, and have created psychological models or theories. Despite the prevalence of self-report in psychological measurements, the correspondence between self-evaluations of ability and objective performance has been debated. Zell and Krizan (2014) synthesized meta-analyses across diverse disciplines and ability domains and reported that the mean correlation between ability self-evaluations and behavioral performance was moderate (M = 0.29). This finding suggests that people have only modest insight into their ability, perhaps reflecting not only the inaccuracy or imprecision of self-evaluations but also the biases (e.g., social desirability or self-esteem) inherent to self-report questionnaires (Choi & Pak, 2005).

Although the meta-synthesis indicated a moderate relationship between people’s insight into their ability and actual performance, recent studies have reported that people have good insight into their face recognition ability using the 20-item prosopagnosia index (PI20) (Livingston & Shah, 2017; Shah, Gaule, Sowden, Bird, & Cook, 2015; Shah, Sowden, Gaule, Catmur, & Bird, 2015). Shah et al. developed the PI20 to serve as a new self-report measure for estimating face recognition ability and developmental prosopagnosia (DP) risk, while criticizing a pre-existing questionnaire (Kennerknecht, Ho, & Wong, 2008), a 15-item questionnaire developed in a Hong Kong population (hereafter, HK questionnaire), on the grounds that it correlates poorly with objective face recognition performance (Palermo et al., 2017) (but see Johnen et al., 2014; Stollhoff, Jost, Elze, & Kennerknecht, 2011). However, although the PI20 was aimed to overcome the weakness of the HK questionnaire (i.e., it contains items irrelevant to face recognition and it has a ‘weak relationship’ to actual behavioral performance), its performance was not validated formally against the HK questionnaire. No direct comparison between the questionnaires was performed not only in terms of their relation to behavioral performance, but also their own relationship. Thus, whether the PI20 outperforms the HK questionnaire remains unclear.

Moreover, whether people have insight into their face recognition ability also remains to be investigated. Recent studies have reached different conclusions regarding the association between self-report and actual face recognition performance (Livingston & Shah, 2017; Palermo et al., 2017; Shah, Gaule, et al., 2015). Not only do they differ in the questionnaire used, but also in their participant demographics. Shah, Gaule, et al. (2015) reported that people have good insight into their face recognition ability (r = − 0.68); they used PI20 and recruited individuals ‘identified themselves as suspected prosopagnosics’ in addition to a normal population. On the other hand, Palermo et al. (2017) reported that people have moderate insight into their face recognition ability (r = − 0.14); they used the HK questionnaire and recruited a normal population, without ‘suspected prosopagnosics’. (The distinction between ‘good’ and ‘moderate’ insight has been arbitrary and seems to be based solely on researchers’ intuition or convention without clarifying the criteria, but here we regard a significant correlation coefficient of r = 0.5 or larger as ‘good’ insight and a significant correlation coefficient less than r = 0.5 as ‘moderate’ or ‘modest’ insight.) These inconsistent results are likely to result from the two methodological differences. First, although the PI20 and the HK questionnaire are so similar and simply asking how good (or bad) people are at recognizing faces, their subtle differences in texts might lead to a difference in correlation between self-report and behavioral performance. Second, because Shah and colleagues used an extreme group approach (i.e., recruited ‘suspected prosopagnosics’), which almost always leads to upwardly biased estimates of standardized effect size (Preacher, Rucker, MacCallum, & Nicewander, 2005), they might observe an inflated correlation between self-report and behavioral performance. Thus, it is crucial to use the two questionnaires in the same population and assess the relationship between the questionnaires and their relation to behavioral face recognition performance. We examined this issue by administering the two questionnaires to a large population and performing a set of analyses including correlation analysis, hierarchical clustering, a brute-force calculation/comparison of reliability coefficients, and a behavioral validation using Taiwanese Face Memory Test (TFMT) (Cheng, Shyi, & Cheng, 2016), an East Asian version of Cambridge Face Memory Test (CFMT) (Duchaine & Nakayama, 2006). If the PI20 is a better self-report instrument in estimating face recognition ability than the pre-existing HK questionnaire, the PI20 is expected to have distinct or more desirable features (i.e., low or moderate correlation between the questionnaires, PI20-specific cluster, or higher reliability) and a greater prediction accuracy of behavioral face recognition performance compared to the HK questionnaire.

Survey

Materials and methods

Participants

All participants were recruited from job and volunteer web sites for students in Tokyo area. The recruitment advertisement did not ask whether they have difficulty recognizing faces. Neither inclusion nor exclusion criteria are related to self-reported face recognition ability. Eight hundred and fifty-five young Japanese adults [427 female, 428 male; mean age: 20.9 ± 2.2 (± 1 SD) years; range 18–36 years] participated in the survey along with another psychological experiments (not including the follow-up Experiment) and received monetary compensation for their 3-h participation [3000 yen (approx. US $30)]. All had normal or corrected-to-normal vision and none reported a history of neurological or developmental disorders.

Procedure

We asked participants to complete the questionnaires using an 8-in. touchscreen tablet PC in the laboratory. They were required to indicate the extent to which 36 items (15 from the pre-existing Hong Kong (HK) prosopagnosia questionnaire (Kennerknecht et al., 2008), and 20 from the PI20 (Shah, Gaule, et al., 2015), and an additional item pertaining to self-confidence in face recognition ability: “I am confident that I can recognize faces well compared to others”) described their face recognition experiences. Responses were provided using a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The participants were instructed to complete the questionnaires at their own pace. The questionnaire took about 5 min to complete.

Data analysis

Because the HK questionnaire developed by Kennerknecht et al. (2008) contains four dummy questions (HK#10, #11, #12, and #13) that are irrelevant with respect to face identity recognition, we excluded these items and calculated the total scores ranging 11–55, using the remaining 11 items (hereafter, ‘HK11’) (score range 11–55). The four dummy items consisted of three items related to face processing [ability to judge facial gender (HK#10), facial attractiveness (HK#12), and facial emotion (HK#13)], but not pertaining to their own face identity recognition abilities, and one item not at all related to face recognition [spatial navigation deficits (HK#11)].

PI20 scores were calculated using all 20 items and ranged from 20 to 100. As females have been shown to exhibit superior performance in behavioral face recognition studies (Shapiro & Penrod, 1986), we examined sex differences between the questionnaire scores. In addition, we used polychoric correlation coefficients to infer latent Pearson correlations between individual items from the ordinal data. The polychoric correlation matrix was estimated using two-step approximation (Olsson, 1979).

Cronbach’s α and Revelle & Zinbarg’s omega total coefficients were calculated to assess the scale reliability of both HK11 and PI20. Omega total coefficients were estimated using a maximum likelihood procedure (Revelle & Zinbarg, 2009). Confidence intervals (CI) for the coefficients were estimated using a bootstrap procedure (10,000 replications) with a bias-corrected and accelerated approach (DiCiccio & Efron, 1996; Kelley & Pornprasertmanit, 2016).

As it was possible that higher reliability coefficients merely reflected the higher number of items in the PI20, relative to that in the HK11 (Cortina, 1993), we performed a brute-force calculation of reliability coefficients for all 167,960 (₂₀C₁₁) possible combinations of PI20 items taken 11 items at a time (i.e., subsets of the PI20 generated by choosing 11 of the 20 items), which allowed us to compare reliability coefficients between the questionnaires with a virtual match of the numbers of items.

Results

Total scores and score distribution

Table 1 shows descriptive statistics for the total HK11 and PI20 scores. Independent two-sample t tests showed no significant differences in HK11 [t₈₅₃ = 0.0511, p = 0.9592, Cohen's d = 0.0035 (95% CI − 0.1306, 0.1376)] or PI20 [t₈₁₀ = 0.9578, p = 0.3384, Cohen's d = 0.0655 (95% CI − 0.0686, 0.1996)] scores between males and females. In addition, a Bayesian analysis using a JZS prior (r scaling = 1) (Rouder, Speckman, Sun, Morey, & Iverson, 2009) showed strong evidence for the null hypothesis (i.e., no sex difference) for both HK11 (Bayes factor BF₁₀ = 0.0544) and PI20 (BF₁₀ = 0.0856) scores. In addition, two-sample Kolmogorov–Smirnov tests showed no significant sex differences between the distributions (Fig. 1) of HK11 (D = 0.0265, p = 0.9982) and PI20 (D = 0.0460, p = 0.7554) scores. These results indicate that females and males showed almost identical mean HK11 and PI20 scores and score distributions, suggesting that sex was not a significant factor.

Table 1 Descriptive statistics of total scores for the questionnaires (N = 855, survey)

Full size table

Correlations between total scores

The results showed a very strong significant correlation between the total scores for the two questionnaires [Fig. 1, r = 0.8228 (95% CI 0.7999, 0.8433), p = 1.6510 × 10⁻²¹¹], suggesting a significant overlap of face recognition abilities assessed via each measure. It should be noted that Fisher’s z test with Zou’s CI (Zou, 2007) showed no significant sex difference in the correlation between total scores [r_diff = − 0.0065 (95% CI − 0.0502, 0.0371), z = 0.2917, p = 0.7705; r_females = 0.8200 (95% CI 0.7863, 0.8489), p = 4.7087 × 10⁻¹⁰⁵; r_males = 0.8265 (95% CI 0.7939, 0.8543), p = 2.3806 × 10⁻¹⁰⁸]. Principal component analysis (PCA) with singular value decomposition of the correlation matrix between total scores showed that the first principal component (PC1) accounted for 91.1% (using standardized scores) and 94.2% (using raw scores) of the total variance in scores.

Correlations between individual item scores

The correlation matrix (Fig. 2) generally showed correlations between individual items across the two scales; however, some items were not correlated with other items to the extent that they would reduce the reliability or internal consistency of a single measure pertaining to a single construct. In fact, hierarchical clustering using the unweighted pair group method with arithmetic mean showed that 8 out of 36 items were distant from a cluster to which most items belonged (shaded areas in Fig. 2, dendrogram). These eight items consisted of (Table 2): the four items already known to be irrelevant with respect to face identity recognition (HK#10, HK#11, HK#12, and HK#13), and two items from the HK questionnaire (HK#2 and HK#7), and two items from the PI20 (PI#3 and PI#13). Previous studies reported that five of the eight item-score differences (suspected prosopagnosics − control) were marginal (score difference < 1) between individuals with suspected prosopagnosics and typically developed control individuals (0.45 for HK#10, − 0.39 for HK#11, 0.11 for HK#12, − 0.45 for HK#13, and 0.62 for PI#3) (Kennerknecht et al., 2008; Shah, Gaule, et al., 2015). However, it should be noted that the score difference exceeded 1 for the remaining three items (1.46 for HK#2, 1.12 for HK#7, and 1.16 for PI#13), suggesting that these three items could measure traits that differ from those measured via the other 28 items.

Table 2 Test items shown with the mean scores (N = 855, survey)

Full size table

Scale reliability

We found that the reliability coefficients for the PI20 were higher relative to those for the HK11 [HK11: α = 0.8449 (95% CI 0.8273, 0.8633), ω_t = 0.8767 (95% CI 0.8571, 0.8880); PI20: α = 0.9174 (95% CI 0.9102, 0.9249), ω_t = 0.9368 (95% CI 0.9300, 0.9424)]. Follow-up Feldt paired tests (Feldt, 1980) confirmed significant differences in reliability coefficient between the HK11 and PI20 (difference in α: t₈₅₃ = 16.4437, p = 5.5132 × 10⁻⁵³; difference in ω_t: t₈₅₃ = 17.4868, p = 9.4696 × 10⁻⁵⁹).

However, the difference in reliability coefficients may merely reflect the difference in the number of items in the questionnaires (Cortina, 1993). To examine this possibility, we compared reliability coefficients of the two questionnaires with a virtual match of the numbers of items (see Data analysis). The brute-force calculation of reliability coefficients showed that the coefficients for the 11-item PI20 subsets were almost comparable (within 1 SD) to those for the HK11 [α: mean = 0.8530 ± 0.0392 (± 1 SD), median 0.8474, range 0.7495–0.9324; ω_t: mean = 0.8914 ± 0.0225 (± 1 SD), median 0.8933, range 0.8122–0.9438], indicating that the HK11 and PI20 demonstrated almost equivalent reliability at the individual-item level.

Discussion

These results showed that the two representative face recognition questionnaires are closely related to each other in terms of correlation analyses, PCA, hierarchical clustering, and item reliability. It is worth noting that a recent meta-analysis showed that test–retest reliabilities for instantaneously administered tests are about r = 0.8 (Calamia, Markon, & Tranel, 2013), which is comparable to our findings (r = 0.8228). This may indicate that the correlation coefficient between the two questionnaires is sufficiently high to consider that the two questionnaires might measure essentially the same trait to the extent of reliability that solid neuropsychological tests can achieve. However, residual variance is not yet trivial in the present case, as 32% of the variance (1 − 0.8228²) remains unexplained. It is possible that one questionnaire has a stronger relationship with actual behavioral performance than the other. In experiment, we examined this issue by comparing correlations of HK11 and PI20 with actual face recognition performance.