Neuropsychological functioning in adult anorexia nervosa: A meta-analysis

Several studies have conceptualized neuropsychological dysfunction as part of the core pathology and defining behaviors seen in the eating disorder anorexia nervosa (AN). The aim of the current review was to synthesize the differences in neuropsychological test performance between individuals with AN and healthy controls, and to quantify and explain their heterogeneity. The search and screening procedures resulted in fifty studies comprising 186 neuropsychological test results. Utilizing random-effects meta-analyses, the results revealed evidence for significant, moderate underperformance in people with AN in overall neuropsychological functioning (g¯ = -0.43, 95 % CI [-0.50, -0.36]). Weighted mean effect sizes ranged from g¯ = -0.53 for visuospatial abilities to g¯ = -0.10 for planning. Study and participant characteristics, including BMI and age, had significant moderator effects, especially on executive function, memory, and visuospatial abilities. The findings from the current study provide an extensive and comprehensive overview of the possible impairments in neuropsychological functioning in adult patients diagnosed with AN.


Introduction
Anorexia nervosa (AN) is a severe mental illness characterized by dietary restriction leading to weight loss or a failure to gain weight, as well as over-evaluation of body shape and weight (American Psychiatric Association, 2013). It is associated with a high risk of premature death (Kask et al., 2016) and is marked by significant concerns regarding body image and persistent efforts to lose weight despite being severely underweight. The classification, diagnosis and treatment of AN have traditionally focused on the behaviors and cognitions of patients. However, in recent years, researchers have increasingly directed their studies towards a broader phenotypic and biological understanding of its phenomenology. One reason for emphasizing cognitive functioning as an area of interest is the association between inferior treatment outcome and poor neuropsychological functioning (Hamsher et al., 1981; Harper et al., 2017). In addition, an increased comprehension of the neuropsychological function of patients with AN has the potential to provide a better understanding of the cognitions and behaviors characterizing the illness, which could aid in diagnosis and treatment. It has also been suggested that neuropsychological deficits could be a trait marker, or endophenotype, for the disorder (Kanakam et al., 2013). Subsequently, some of the core pathology and defining behaviors seen in AN have been conceptualized as a reflection of neuropsychological dysfunction. For example, body size estimation errors have been described as an expression of poor visuospatial abilities (Lang et al., 2016; Lang and Tchanturia, 2014), and cognitive and behavioral inflexibility have been considered a consequence of set shifting impairments (Shott et al., 2012; Steinglass et al., 2006). Consequently, more research has focused on neuropsychological functions such as visuospatial processing and set shifting than on attention and inhibition (Smith et al., 2018).
Despite extensive research, however, findings have been inconsistent. Some studies have reported considerable cognitive deficits in patients with AN (e.g. Lopez et al., 2008; Tchanturia et al., 2004a; Weider et al., 2015), whereas other studies have failed to find a difference in cognitive function between patients and controls (e.g. Jones et al., 1991; Thompson, 1993; Øverås et al., 2017). These inconsistencies have been attributed to variable design, small sample sizes, heterogeneous samples and a failure to control for alternative explanations for test performance (Stedal, 2012; Tchanturia et al., 2005). This lack of coherence in the field has made it immensely challenging for clinicians and researchers to interpret findings from studies and to select tests for neuropsychological assessments. This is further highlighted in the recently published review of reviews by Smith et al. (2018). Out of 28 systematic reviews and meta-analyses, thirteen were based on patients with AN, and only one study (Zakzanis et al., 2010) explored a broad spectrum of cognitive functions in this patient group. The remaining studies investigated specific cognitive domains, such as set shifting (Westwood et al., 2016; Wu et al., 2014), executive functions (Hirst et al., 2017; Miles et al., 2020), decision-making (Guillaume et al., 2015; Wu et al., 2016), central coherence (Lang and Tchanturia, 2014), or attention bias (Aspen et al., 2013; Brooks et al., 2011). In addition, some previous meta-analyses have reported findings from self-reports (Miles et al., 2020), despite research showing a lack of association between performance-based neuropsychological tests and self-report measures (Herbrich et al., 2019; Stedal and Dahlgren, 2015). Consequently, there is a lack of meta-analyses providing an overall framework of cognitive function, based on traditional domain classifications (Lezak et al., 2004), using standardized neuropsychological tests.
One reason for focusing on specific domains when performing meta-analyses could be to avoid issues with dependency of effect sizes. Most primary studies investigating neuropsychological functioning in AN have assessed more than one cognitive domain and often report more than one relevant effect size for each domain. However, combining multiple effect sizes in one meta-analysis can be problematic. For example, a study investigating executive function might report scores from multiple tests which all assess executive functions, or multiple scores from the same test. For traditional meta-analytic procedures this warrants concern, since a premise for the analyses is independence of effect sizes (Cheung, 2019). Until recently, the most common way of handling dependent effect sizes was to either average the effect sizes or to select only one effect size from each study (Cheung, 2019; Smith et al., 2018; Zakzanis et al., 2010) or, in some cases, to simply disregard the dependency of the data. This is concerning because "when effect sizes are not independent, conclusions based on these conventional procedures can be misleading or even wrong" (Cheung, 2019, p. 387). In addition, by selecting only one effect size from each study, there is a notable risk of selection bias in terms of which tests and/or domains are chosen, and it also limits the utilization of available data (Cheung, 2019). However, recent statistical advancements have led to the development of meta-analytical procedures which can address non-independent effect sizes. These procedures can provide more detailed information concerning both the direction and magnitude of difference between patients with AN and healthy controls on neuropsychological tests. To the authors' knowledge, there are no studies which have applied these novel meta-analytical procedures to investigate neuropsychological functioning in patients with AN.
In addition, most previous meta-analyses of neuropsychological function in AN have not taken into account possible confounding factors, including depression and anxiety, weight status, duration of illness and/or age (Smith et al., 2018).

The current meta-analysis
The literature on cognitive function in AN is inconsistent, and most previous research syntheses have focused on single rather than multiple cognitive domains and have included patients at different stages of the illness, including recovered participants. The latter challenges the interpretation of meta-analytic results, as some studies have shown improved cognitive function with weight gain during recovery (Hemmingsen et al., 2020). Further, some previous systematic reviews and meta-analyses have also combined results from different assessment methods, including self-report questionnaires, which further obfuscates the interpretation of findings.
In the current meta-analysis, we focused on six major neuropsychological domains (i.e., attention, executive functions, memory, processing speed, visuospatial abilities, and working memory) and ten subdomains (see Table 1). These domains were aligned with the classification by Lezak et al. (2004) and previous studies of comparable patient populations (Abramovitch et al., 2013; Geller et al., 2018). The included tests were classified "according to the major functional activities they elicit" (Lezak et al., 2004, p. 335).
The primary aim of the current review was to synthesize previously published data to examine the magnitude of difference on neuropsychological tests between individuals with AN and healthy controls. In addition, since the majority of previous studies have not accounted for factors which can potentially influence test performance (Smith et al., 2018), our second aim was to assess the moderation of study and sample characteristics on neuropsychological test performance. These characteristics included, but were not limited to, AN diagnostic subtype, participants' average age, body mass index (BMI), years of education, performance on intelligence tests, and eating disorder severity.

Literature search
We conducted the literature search in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009). The literature search was performed in July 2019, subsequently updated in May 2020, and checked again in September 2020. We restricted the search to the databases MEDLINE, PsycINFO, ISI Web of Science, and Epistemonikos to provide an exhaustive record and documentation from these key databases, which also ensure a degree of documented quality. The databases may include both published and grey literature. No further databases were searched, to avoid additional duplicates, and data were not drawn from other sources. A librarian at the Medical Library of the Oslo University Hospital conducted the search using the following search terms: 'anorexia nervosa', cross-referenced with the terms 'neuropsych*', 'neurocog*', 'executive function', 'memory', 'processing speed', 'visuospatial', 'inhibition', 'planning', 'attention', 'set shifting', 'central coherence', 'flexibility', 'rigidity'. Terms were searched for as Medical Subject Headings (MeSH, MEDLINE) or Thesaurus of Psychological Index Terms (PIT, PsycINFO), as well as in titles and abstracts. We limited the search to publications in English. The full search strategy for all databases, including the corresponding limits, can be found in Appendix A in Supplementary materials.

Screening procedures
The literature search resulted in 5023 titles, which were reduced to 3313 after removing duplicates. These publications were then submitted to the screening of titles and abstracts according to the inclusion and exclusion criteria specified below. A detailed overview of the screening process is presented in Fig. 1.

Inclusion criteria
Published studies investigating neuropsychological functioning in adult patients currently diagnosed with AN, based on criteria from the Diagnostic and Statistical Manual of Mental Disorders (DSM), 3rd edition or newer (American Psychiatric Association, 1987), were considered for inclusion based on the following a priori criteria: (1) At least one comparison on one or more neuropsychological tests between current DSM-diagnosed (i.e. via structured interview) adult (≥ 18 years) patients with AN and a healthy (i.e. screened for absence of psychiatric or neurologic diagnosis) adult (≥ 18 years) control group was conducted.
(2) Studies evaluated one or more of the following neuropsychological domains: Attention, executive functions, memory, processing speed, visuospatial abilities (including central coherence), and working memory. (3) Studies were published in English or had an available English translation.

Exclusion criteria
Studies were excluded if they lacked a healthy control group, if they only reported comparisons between patient groups (e.g., patients with AN compared to patients with depression), if the assessment was done within-subject (e.g., pre/post treatment), or if the study was a single-group investigation. Treatment studies were included if they encompassed a neuropsychological pre-treatment comparison of patients with AN and healthy controls. To ensure the validity of findings, studies were excluded if they did not use validated, traditional, and standardized neuropsychological tests, as determined by experts in the field (Lezak et al., 2004). This included tests of decision making and outcome measures not considered a part of the original tests. Studies using modified versions of the original tests (e.g., Emotional Color Word Interference Test), tests administered during brain scans, or tests rarely used (< 1 % of the studies) were also excluded. Organizational scores on the Rey Complex Figure Test were included, due to the large amount of research assessing organizational strategy in AN (Lang et al., 2016; Lang and Tchanturia, 2014). Books, book chapters, editorials, commentaries, reviews, theses, conference abstracts, errata, and studies presenting data where diagnostic screening was unclear were omitted. When the same dataset was reported in multiple studies, only the original article or the one with the most complete report of the relevant information was included.

Title screening
In this first screening process, we removed titles corresponding to studies which were obviously not eligible (e.g., "Meningioma and psychiatric symptoms: An individual patient data analysis."). Equivocal (e.g., "Cortisol levels and vigilance in eating disorder patients") and plausible titles (e.g., "Exploring the neurocognitive signature of poor set shifting in anorexia and bulimia nervosa") were retained. To assess interrater agreement, two authors (KS and CB) performed an initial screening on a subset of publication titles (n = 101). Titles were labelled "include", "exclude" or "inconclusive". The resultant consistency for title extraction indicated substantial agreement between the two raters (κ = 0.80, p < .001), according to well-established guidelines (Landis and Koch, 1977; Viera and Garrett, 2005). The second author (CB) reviewed the remaining titles. After title screening, 2914 out of the 3313 publications were omitted, and 400 titles were retained.

Abstract screening
Two authors (KS and CB) screened the abstracts of the retained studies for eligibility and classified them as "include" or "exclude". The consistency for abstract extraction on 49 % of the reviewed studies (m = 196) indicated almost perfect agreement between the two raters, κ = 0.87, p < .001. After reviewing the 400 abstracts, 118 studies were retained for full-text review and were screened by the first author. Sixty-eight of these studies were excluded for the following reasons: Not providing a DSM diagnosis (m = 6), investigating a population < 18 years old (m = 3), the results from the sample had been presented in a previous study (m = 7), combining patient groups (e.g., only results from AN and bulimia nervosa combined were presented), investigating patients with a lifetime diagnosis (m = 17), lacking necessary data to compute effect sizes (m = 10), or lacking a healthy control group (m = 6). Finally, studies were excluded for utilizing modified/experimental tests or tests performed in a scanner (m = 13), or because the test was not included in the current meta-analysis (m = 6). A total of 50 studies were included. The authors of the texts and their affiliations were disclosed in the screening and extraction processes.

Coding of primary studies and effect size measures
The status of the included and eligible studies is that of July 6, 2020. Extraction from the included studies followed recommended coding procedures (Valentine, 2009) and the coding scheme presented by Abramovitch et al. (2013). Variables were coded as "participant characteristics", "study characteristics" and "validity and reliability assessments". The first and second author (KS, CB) extracted the following information from the included studies: (a) Publication status, (b) publication year, and (c) the country in which the study was conducted. The following participant characteristics were recorded: (a) Sample sizes for both groups (AN and healthy controls), (b) mean age (in years), (c) mean BMI, (d) mean age of AN onset (in years), (e) mean duration of illness (in months), (f) years of education, (g) percentage of males in the AN group, (h) mean score on measures of AN severity (e.g., Eating Disorder Examination Questionnaire), depression severity (e.g., Beck Depression Inventory), and anxiety severity (e.g., State-Trait Anxiety Inventory), (i) percentage of AN participants with Axis I comorbid illnesses, and (j) the percentage of AN participants receiving serotonin reuptake inhibitors, neurotropic, or neuroleptic medication. Furthermore, we recorded study characteristics, including the specific neuropsychological test used and the associated domains and subdomains of functioning. Reported outcomes for neuropsychological test performance were extracted as means and standard deviations. Table 1 presents the domains, subdomains, and outcomes coded from the studies. For cases where the outcome variables were uncommon (e.g., "Time to first move" on the Tower Test), only the conventional outcome variables were recorded. Finally, validity and reliability assessments were recorded as (a) the number of tests, (b) the number of testing sessions, and (c) the average length of testing sessions.
When studies included more than one measure, either within the same domain or for multiple domains, we extracted all relevant outcomes instead of selecting only one. All outcome variables from the neuropsychological assessments were coded so that positive scores indicated better performance. Variations of the same or similar tests were grouped together. For instance, the Rey Auditory Verbal Learning Test, the California Verbal Learning Test, and the Auditory Verbal Learning Tests were all considered to be "verbal learning tests". Similarly, the Tower of London and Tower of Hanoi tasks were considered to be "tower tests".
Excluding author and publication information, a total of 20 variables were coded for each study. To assess the coding reliability, a random sample of studies was coded by two authors (KS and CB) on all variables (10 %, m = 8). A total of 160 variables were compared. The results revealed discrepancies between coders on only six variables, indicating high interrater agreement (96 %); these discrepancies were resolved through discussion.
Given the means and standard deviations extracted from the primary studies for both the AN and the control group, we calculated Hedges' g from the standardized mean difference ES, following Borenstein et al.'s (2009) procedure. Specifically, with X̄AN and X̄HC denoting the group-specific mean scores on a neuropsychological test, SDAN and SDHC the corresponding standard deviations, and NAN and NHC the sample sizes, Hedges' g and its elements were calculated as follows:

d = (X̄AN − X̄HC) / SDpooled, where SDpooled = √[((NAN − 1)SDAN² + (NHC − 1)SDHC²) / (NAN + NHC − 2)]

g = J × d, where J = 1 − 3 / (4(NAN + NHC − 2) − 1)

The corresponding sampling variance vg and the standard error SEg were then calculated as:

vg = J² × [(NAN + NHC) / (NAN × NHC) + d² / (2(NAN + NHC))]

SEg = √vg

Given this specification of Hedges' g, negative effect sizes indicated a disadvantage of participants in the AN group in their performance on a neuropsychological test relative to the healthy control group. The resultant effect sizes for each primary study are displayed in Appendix B in Supplementary materials.
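For illustration, the Borenstein et al. (2009) computation can be sketched in a few lines. The Python below is a minimal re-implementation for one pair of group summaries, not the authors' analysis code (the analyses in this paper were run in R):

```python
import math

def hedges_g(mean_an, sd_an, n_an, mean_hc, sd_hc, n_hc):
    """Hedges' g, its sampling variance, and standard error for two
    independent groups (Borenstein et al., 2009). Negative g indicates
    worse performance in the AN group."""
    df = n_an + n_hc - 2
    # Pooled standard deviation across both groups
    sd_pooled = math.sqrt(((n_an - 1) * sd_an**2 + (n_hc - 1) * sd_hc**2) / df)
    d = (mean_an - mean_hc) / sd_pooled   # Cohen's d
    j = 1 - 3 / (4 * df - 1)              # small-sample correction factor J
    g = j * d
    v_d = (n_an + n_hc) / (n_an * n_hc) + d**2 / (2 * (n_an + n_hc))
    v_g = j**2 * v_d                      # sampling variance of g
    return g, v_g, math.sqrt(v_g)         # g, v_g, SE_g
```

With equal group sizes and standard deviations, d reduces to the raw mean difference divided by the common SD, and J shrinks it slightly towards zero.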

Quality assessment
On the basis of factors which may influence performance on neuropsychological tests (Lezak et al., 2004; Yang et al., 2018), we assessed the included studies for methodological quality and assigned quality scores to them. These scores ranged from 0 to 7, with higher scores indicating better precision of the neuropsychological test results (Yang et al., 2018). In line with the quality rating developed by Yang et al. (2018), study quality was calculated as the number of potential confounders a study accounted for. For instance, studies which accounted for (i.e. matched for) differences in age, gender, education, IQ, and medication, as well as controlling for depression and anxiety, received a score of 7.
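The scoring logic amounts to counting controlled confounders. The sketch below is hypothetical Python; the exact item list used by Yang et al. (2018) may differ from the set assumed here, which is taken from the factors named above:

```python
# Criteria assumed from the text; the exact Yang et al. (2018) item list may differ.
QUALITY_CRITERIA = ("age", "gender", "education", "iq",
                    "medication", "depression", "anxiety")

def quality_score(controlled):
    """Return a 0-7 quality score: one point per confounder the study
    matched or controlled for."""
    return sum(criterion in controlled for criterion in QUALITY_CRITERIA)
```

A study matching groups on all five demographic/clinical variables and controlling for depression and anxiety would thus receive the maximum score of 7.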

Meta-analytic baseline models
As a first step, we synthesized the effect sizes for the overall sample of neuropsychological tests and, subsequently, for each of the domains, subdomains, and tests. Given that the structure of the meta-analytic data was inherently hierarchical, with multiple effect sizes per study, the independence assumption clearly did not hold (Borenstein et al., 2009). The extant literature has proposed several procedures to account for these dependencies, such as averaging multiple effect sizes per study, robust variance estimation, or multilevel random-effects modeling with or without correlated effects (Cheung, 2019; Fernández-Castilla et al., 2020; Pustejovsky and Tipton, 2021). In the current review, we performed multilevel random-effects modeling to quantify the different variance components explicitly. For instance, the three-level random-effects model quantifies the sampling variance (level 1), the variance between effect sizes within studies (level 2, variance τ²(2)), and the variance of effect sizes between studies (level 3, variance τ²(3)). Such a model accounts efficiently for the dependence of effect sizes and allows researchers to test different assumptions on the variance components (Cheung, 2013). Specifically, for a given meta-analytic data set with a nested structure, the variance components can be tested against zero via model comparisons (e.g., based on information criteria and likelihood-ratio tests). However, these significance tests are performed against the boundary estimate of zero; hence, the confidence intervals of the variances should also be considered. Synthesizing the effect sizes, we tested and compared several models with different variance components (i.e., a three-level random-effects model, random-effects models with variances either between studies or between effect sizes, and a fixed-effects model) to establish baseline models.
These models provided the weighted mean effect sizes for neuropsychological functioning in general and for the (sub-)domains specifically, alongside the heterogeneity indices (I²(2) and I²(3)) and variance components (Cheung, 2013). Moreover, we extended the baseline models to mixed-effects meta-regression models to test for moderator effects.
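As a sketch of how such level-specific heterogeneity indices are obtained, the hypothetical Python function below expresses each estimated variance component as a share of the total variation; the "typical" sampling variance follows the standard Higgins-Thompson form, and the function name is ours (the paper's own computations were done in R):

```python
def multilevel_i2(tau2_within, tau2_between, sampling_variances):
    """I2 at levels 2 and 3 of a three-level meta-analysis: each variance
    component divided by the total variation (within-study variance +
    between-study variance + typical sampling variance)."""
    k = len(sampling_variances)
    w = [1.0 / v for v in sampling_variances]          # inverse-variance weights
    # 'Typical' sampling variance (Higgins & Thompson, 2002)
    typical_v = (k - 1) * sum(w) / (sum(w)**2 - sum(x**2 for x in w))
    total = tau2_within + tau2_between + typical_v
    return tau2_within / total, tau2_between / total   # I2(2), I2(3)
```

When all sampling variances are equal, the "typical" sampling variance reduces to that common value.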

Sensitivity analyses
To establish the robustness of our findings, we examined the sensitivity of the meta-analytic results across several conditions: (a) Type of variance estimation: Restricted maximum-likelihood (REML) estimation vs. Bayesian estimation, (b) Treatment of effect size dependencies: Robust variance estimation vs. multilevel meta-analysis with or without constant sampling correlation, (c) Handling influential effect sizes: Exclusion vs. inclusion, and (d) Treating missing data in the continuous moderators: Pairwise deletion vs. multiple multilevel imputation. Both the analytic code and the results of these analyses are documented in the Supplementary Material S1-S3.

Moderator analyses
As a second step, we tested the possible moderating effects of the study and sample characteristics, specifying and estimating mixed-effects models with the continuous and categorical moderators (Cheung, 2013). The moderator variables in the current meta-analysis were either related to the study (e.g., country, publication year, control for depression/anxiety, study quality score), to all the participants (e.g., BMI, IQ, age, years of education), or were specific to the patient sample (e.g., duration of illness, clinical severity). Participant moderator variables are presented in Table 2. Moderator effects for categorical variables were only considered if at least six to seven effect sizes were available per category (Rubio-Aparicio et al., 2017; Tipton et al., 2019). For moderators with many levels (e.g., countries, outcomes), we implemented the moderator variable as an additional clustering variable that indicated an explicit level of analysis. The respective models were specified as either four-level random-effects models (Fernández-Castilla et al., 2020) or cross-classified random-effects models (Fernández-Castilla et al., 2019), depending on the type of hierarchical data structure. For instance, we tested the possible differences in effects between the specific outcomes of the neuropsychological tests using a cross-classified model with the variance components: Sampling variation (level 1), variation between effect sizes within studies (level 2), variation of effect sizes between studies (level 3), and variation between outcomes (level 4). While level 2 is hierarchically nested in level 3 in this example, level 4 represents a level of analysis that is independent of levels 2 and 3 (Fernández-Castilla et al., 2019). In contrast, we tested the moderator effects of countries using a four-level model with full hierarchical nesting, assuming that studies were directly nested in countries.
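For intuition, the logic of a single-moderator test can be sketched as a weighted meta-regression. The Python below is a deliberately simplified fixed-effect-weight version (the paper's actual models were multilevel mixed-effects models fitted in R), returning a Wald-type QM statistic with one degree of freedom:

```python
import math

def metareg_qm(effects, variances, moderator):
    """Weighted simple regression of effect sizes on one moderator with
    inverse-variance weights; returns slope, SE(slope), QM, and p-value."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, moderator)) / sw
    ybar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - xbar)**2 for wi, x in zip(w, moderator))
    sxy = sum(wi * (x - xbar) * (y - ybar)
              for wi, x, y in zip(w, moderator, effects))
    slope = sxy / sxx
    se = math.sqrt(1.0 / sxx)             # SE of the slope under FE weights
    qm = (slope / se)**2                  # chi-square statistic, df = 1
    p = math.erfc(math.sqrt(qm / 2.0))    # upper tail of chi-square(1)
    return slope, se, qm, p
```

The p-value uses the identity that a chi-square variate with one degree of freedom is a squared standard normal variate.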
All analyses, including the sensitivity and baseline model analyses, were performed in the R packages 'metafor' (Viechtbauer, 2010), 'robumeta' (Fisher et al., 2017), and 'brms' (Bürkner, 2017).

Publication bias, file-drawer issues, and influential effect sizes
To examine possible publication bias and file-drawer issues, we conducted several analyses: First, we conducted trim-and-fill analyses, evaluated the symmetry of the funnel plots (Duval and Tweedie, 2000), and performed Begg's rank correlation test (Begg and Mazumdar, 1994). The trim-and-fill analyses have recently been extended to multilevel meta-analyses, resulting in the two estimates L + 0 and R + 0 as indicators of the number of missed effect sizes (Fernández-Castilla et al., 2021). We further tested the asymmetry of these plots via Egger's linear regression test (Egger et al., 1997). Second, using Rosenberg's procedure, we estimated the fail-safe N's (Borenstein et al., 2009). Third, we performed the funnel plot test and evaluated the precision-effect estimate with standard error (PEESE; Fernández-Castilla et al., 2021). Fourth, we plotted the p-curves underlying the effects and examined their skewness (Simonsohn et al., 2014). Specifically, if a p-curve was right-skewed, the primary studies had evidential value, providing evidence against p-hacking. We used the R package 'dmetar' to obtain the p-curves (Harrer et al., 2019). Finally, we identified influential effect sizes using Viechtbauer and Cheung's (2010) diagnostics in the R package 'metafor'. All of these analyses and their outcomes are documented in the Supplementary Material S1.

Description of the primary studies
A total of m = 50 primary studies, k = 186 effect sizes, and N = 4057 participants were included. The sample comprised n = 1778 participants diagnosed with AN and n = 2279 healthy controls. Notably, only two studies (Weider et al., 2014; Talbot et al., 2015) included men (5 % and 4.2 %, respectively). As noted earlier, all other studies, except for Tchanturia et al. (2002) who only reported the matching for gender, were based on female samples. The included studies were conducted in the following countries: Italy (m = 11), United Kingdom (m = 9), Spain (m = 7), United States of America (m = 4), Australia (m = 4), Germany (m = 3), The Netherlands (m = 2), Norway (m = 2), Japan (m = 2), Republic of South Korea (m = 1), Canada (m = 1), Mexico (m = 1), Argentina (m = 1), Belgium (m = 1), and France (m = 1). The six core domains comprised varying numbers of effect sizes: Attention (k = 6), executive functions (k = 74), memory (k = 38), processing speed (k = 30), visuospatial abilities (k = 29), and working memory (k = 7). Notably, studies including some test of executive functions dominated the meta-analytic sample (m = 39 out of 50). An overview of the included studies is presented in Appendix B in Supplementary materials (references included in the meta-analysis can be found in Appendix C in Supplementary materials), and Table 2 shows sample characteristics.

Overall effect size and moderator analyses
Combining all effect sizes across the neuropsychological functioning domains, we established a three-level random-effects model as the baseline model to report an overall effect size (see Supplementary Material S1). This model resulted in a moderate, negative, and statistically significant effect size (g = -0.431, 95 % CI [-0.503, -0.359]) and indicated significant heterogeneity (QE[37] = 98.4, p < .001). The corresponding heterogeneity indices suggested moderate heterogeneity within studies and small heterogeneity between studies, as did the variance components (0.095 and 0.014, respectively; see Table 3).
The overall effect size provides a reference point for the more detailed analyses of effects for each domain and subdomain. Moreover, as the respective moderator analyses suggested that the effects varied between domains and outcomes, we further performed domain-and outcome-specific analyses.

Effect sizes and moderator analyses per cognitive domain and subdomain
In the following, we present the weighted mean effect sizes for each of the categories of cognitive domains, subdomains, and outcomes, along with the moderator analyses. Given that this differentiation limits the sample sizes available to meta-analyses, we evaluated the baseline models for each of these categories. Tables 3 and 4 exhibit the effect sizes, and Fig. 2 displays the corresponding forest plot.

Attention
Our meta-analytic sample provided six effect sizes in the domain of attention. On the basis of a fixed-effects model, we obtained a weighted mean effect size of g = -0.571 (95 % CI [-0.791, -0.351]). Given the limited number of effects, we refrained from conducting moderator analyses. However, the effect sizes within this domain varied substantially, ranging from g = -0.962 (Go/No-go omission errors) to g = -0.439 (Go/No-go commission errors).

Planning.
Only three effect sizes were available to synthesize the effects for the subdomain of planning on the basis of a fixed-effects model (Table 3). The weighted mean effect size was g = -0.104 (95 % CI [-0.441, 0.234]) and did not significantly differ from zero (QE[2] = 2.5, p = .55). No further moderator analyses were conducted, and all effect sizes were obtained from one type of outcome (i.e., "Tower tests"; see Table 4).

Response inhibition.
The weighted mean effect size for response inhibition was small (g = -0.194, 95 % CI [-0.382, -0.006]), with patients performing significantly worse than the control participants (QE[12] = 30.8, p < .01) and with moderate to high heterogeneity (60.6 %; Table 3). Subsequent moderator analyses revealed a marginal difference between the two AN subgroups, with a more negative effect size for the AN-restrictive subgroup than for the AN subgroup. Five different tasks were used to assess response inhibition (see Table 4), of which the Color Word Interference Task was most commonly administered (k = 6). No evidence for significant differences between outcomes existed, χ²(1) = 0.3, p = .61.

Set shifting.
The effect sizes varied greatly between tasks (Table 4), from g = 0.180 for the phonemic condition of the Verbal Fluency Task to g = -1.168 for the Berg Card Sorting Test. This variation was statistically significant (τ²(3) = 0.135, 95 % CI [0.043, 0.173]), as indicated by the comparison between the baseline model and a three-level random-effects model with an additional outcome level, χ²(1) = 6.2, p = .01.

Memory
We also observed a moderate, significant, and negative effect size for the domain of memory (g = -0.485, 95 % CI [-0.698, -0.273]; QE[37] = 98.4, p < .01), again favoring healthy controls, with substantial heterogeneity (64.7 %; see Table 3). However, we could not find any evidence for significant differences between verbal and non-verbal memory, and no further subgroup or country differences existed. Seven different tasks were used for assessing memory; the most commonly used tests were the RCFT for non-verbal memory (k = 7) and the delayed recall condition of list learning tests (e.g., the California Verbal Learning Test) for verbal memory (k = 7). The outcome-specific effect sizes ranged between g = -1.101 and g = -0.207 (Table 4), yet did not differ significantly, χ²(1) = 0.8, p = .38.

Processing speed
Patients with AN showed significantly worse processing speed performance than the healthy adults, g = -0.390 (95 % CI [-0.530, -0.250]; Q E [29] = 56.1, p < .01). The degree of heterogeneity was high (48.9 %; see Table 3), and the moderator analyses revealed a negative moderation effect of study quality (the better the quality of the study, the more negative the effect size; B = -0.157, SE = 0.061, Q M [1] = 6.6, p = .01). No other moderator effects were detected.

Working memory
Our meta-analytic sample contained seven effect sizes based on measures of working memory. A fixed-effects model resulted in a moderate, negative, and significant effect size, g = -0.455 (95 % CI [-0.818, -0.091]). The underlying, outcome-specific effect sizes were g = -0.626 (Digit Span) and g = -0.309 (WMS Letter-Number Sequencing). We did not conduct any further moderator analyses.

Sensitivity analyses and publication bias
Supplementary Material S1 and S3 show the detailed results of both the sensitivity analyses and the analyses of publication bias. Overall, the specification of the meta-analytic models via Bayesian analysis supported the choice of the baseline models; specifically, the preference for three-level random-effects models over models with fewer variance components was backed by the respective Bayesian credibility intervals (see Supplementary Material S1). Moreover, the sizes of the effects and their variance components were almost identical to those obtained from the REML estimation. The key results obtained from the series of meta-analyses were not sensitive to the exclusion of influential effect sizes (see Supplementary Material S1). Multilevel meta-analysis did not yield results different from the standard random-effects models with robust variance estimation in situations where one of the variance components (next to the sampling variation) was small (see Supplementary Material S1). Notably, an alternative three-level random-effects model that accounts for both hierarchical and correlated effects and assumes a constant within-study correlation between sampling errors (ρ) did not fit the data significantly better and resulted in a weighted average effect size almost identical to that of the model without this correlation (for ρ = 0.40: g = -0.43, 95 % CI [-0.51, -0.36]). Besides, the differences in variance components were negligible (see Supplementary Material S1). Finally, one moderator effect turned statistically significant after imputing the missing data points in the continuous moderators (see Supplementary Material S3). Overall, the results presented earlier show a substantial degree of robustness with respect to the selected conditions.
The trim-and-fill analyses for the entire set of effect sizes indicated that no additional effect size was missing (L0+ = 0 and R0+ = 0; see Supplementary Material S1). Egger's regression test was significant (B = -1.27, SE = 0.51, p = .01), and so was the PEESE (B = -2.41, SE = 0.91, p < .01), suggesting that some selection bias may be present in the data. The funnel plot test, however, did not indicate such bias (B = 0.00, SE = 0.01, p = .77), and neither did Begg's correlation test (r = .08, p = .11). The estimated fail-safe N was high, N = 14,767. Taken together, these findings suggested that some degree of publication bias may be present. For the analysis of publication bias per domain, please refer to the Supplementary Material S1.
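In its classic form, Egger's test regresses the standardized effect on precision and tests whether the intercept departs from zero; a non-zero intercept signals small-study effects. The minimal sketch below uses hypothetical effects and standard errors (the paper's exact specification, e.g., weighting and the PEESE variant, may differ):

```python
import numpy as np
from scipy import stats

def egger_test(g, se):
    """Classic Egger's regression test for funnel-plot asymmetry:
    regress the standardized effect (g/se) on precision (1/se)
    and t-test the intercept against zero."""
    g, se = np.asarray(g, float), np.asarray(se, float)
    res = stats.linregress(1.0 / se, g / se)
    t = res.intercept / res.intercept_stderr
    p = 2 * stats.t.sf(abs(t), df=len(g) - 2)
    return res.intercept, p

# Hypothetical data where small studies (large SE) show larger effects
b0, p = egger_test([-0.3, -0.4, -0.5, -0.55], [0.1, 0.2, 0.3, 0.4])
```

Here the built-in asymmetry pulls the intercept away from zero, mimicking the kind of small-study effect the significant Egger's test in the text points to.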
Besides these analyses of publication bias, we further inspected the p-curve (see Supplementary Material S1). The right-skew binomial test of the k = 184 effect sizes, of which k = 78 were statistically significant (p < .05), was significant (z = -13.2, p < .01), whereas the flatness test was not (z = 6.5, p = .99). These findings suggested that the extracted effect sizes had evidential value.

Discussion
The aim of the current study was to extract all available data on neuropsychological functioning in AN in order to examine the magnitude of the difference in neuropsychological test performance between individuals with AN and healthy controls. By utilizing novel statistical methods, we accounted for dependent effect sizes and examined moderating variables that could be hypothesized to influence test performance. This is the first meta-analysis since 2010 (Zakzanis et al., 2010) to examine neuropsychological performance in patients with AN across a range of cognitive domains. Several primary studies have been published since then and have subsequently been included in the present meta-analysis. To facilitate interpretation and generalizability of results, we only included patients in the acute stage of illness. In addition, to keep the meta-analysis rigorous and focused, the neuropsychological domains and subdomains of interest were chosen a priori and were based on traditional neuropsychological domain classifications (Lezak et al., 2004) as well as previous studies of comparable psychiatric samples (Abramovitch et al., 2013; Geller et al., 2018). A total of 50 studies and 1778 patients were included. An overall analysis (k = 184), i.e. combining all test results in one meta-analysis, can be a valuable tool to provide an indication of general cognitive functioning in the patient group. The overall analysis in the current study indicated a broad and non-specific difference in cognitive function between individuals with AN and healthy controls, with the former performing significantly lower than the latter. Moderator analyses of overall cognitive function revealed an effect of both age and BMI, demonstrating the influence of participant characteristics on task performance.
Our finding that overall cognitive function was moderated by age has some implications for the debate on whether or not cognitive inefficiencies are a trait marker of AN. Despite a large amount of research, there still appears to be no clear understanding as to why adult patients with AN perform worse on tests of cognitive function. Is it the result of malnutrition or is it pre-existing, and does it resolve over time? Longitudinal studies on adults with AN have demonstrated cognitive inefficiencies even after recovery (Tchanturia et al., 2002, 2004b), thereby providing some support for cognitive deficits as a trait marker for the illness. However, studies on children and adolescents have challenged this view, showing that cognitive functioning improves with weight recovery (Bühren et al., 2012; Lang and Tchanturia, 2014; Lozano-Serra et al., 2014). In the current study, we revealed that age was a significant moderator of overall cognitive performance, with older participants performing worse than younger ones. This finding lends some support to the suggestion that a longer duration of illness, as commonly observed in older patients, poses a greater risk for cognitive deficits (Grau et al., 2019). Thus, it could be that the inferior cognitive performance is a consequence of prolonged malnutrition, rather than a trait. However, our moderator analyses revealed no moderating effects of either age of onset or duration of illness. We encourage future studies to include age of onset and/or duration of illness to further investigate the potential effects of illness duration on cognitive performance. To date, there are no comparable meta-analyses of children and adolescents to allow for a direct comparison with the current results.
Regarding BMI, our findings are in line with the study by Zakzanis et al. (2010), which also noted a relationship between weight and test performance. Specifically, lower weight was associated with more severely impaired test performance. However, it should be noted that the aforementioned participant characteristics (age and BMI) were not associated with all the cognitive domains investigated in the current study. For example, BMI was only associated with the effect sizes for memory, inhibition, and visuospatial abilities, whereas age had a positive effect on inhibition (i.e., older participants were better at inhibiting a pre-potent response).
Continuing with our investigation of domains and subdomains, the patients performed worse on all domains, with effect sizes ranging from -0.34 (executive function) to -0.57 (attention). For subdomains, the largest effect size was found in central coherence (-0.622), whereas the smallest, and only non-significant, effect size was revealed in planning (-0.104). The moderator analyses of (sub-)domains revealed some interesting findings. Firstly, the positive moderation effect of eating disorder severity and anxiety on memory and set shifting/cognitive flexibility, respectively, was somewhat surprising. Eating disorder severity has previously been associated with poorer visuospatial memory (Zuchova et al., 2013), and anxiety is commonly reported to negatively influence cognitive performance (Clarke and MacLeod, 2013). There are some studies showing that anxiety does not impair performance when compensatory strategies are used, such as enhanced effort or increased cognitive processing resources (Eysenck et al., 2007). This can be investigated by employing pure measures of inhibition and shifting (Eysenck et al., 2007), which could give an indication of resource utilization, including performance effectiveness and processing efficiency. Nonetheless, the counter-intuitiveness of our findings accentuates that replication is essential before making recommendations for future studies.
Secondly, our moderator analyses revealed a negative effect of education for both memory and set shifting/cognitive flexibility. This is surprising, since education is usually associated with a positive influence on neuropsychological performance (Lam et al., 2013). All our moderator analyses were carefully checked, but we remain cautious about drawing firm conclusions before these findings have been replicated.
As mentioned in the introduction, executive function and visuospatial processing have received the vast majority of attention in the field. This is also reflected in the current study, where executive function, and particularly set shifting/cognitive flexibility, was the most commonly researched domain (k = 58). A recently published review by Miles et al. (2020) reports mixed results from studies investigating cognitive flexibility in AN. Their review found that adult patients with AN performed significantly worse on some perceptual cognitive flexibility tasks, like the Brixton, but that findings were mixed for other tests of cognitive flexibility, including the Wisconsin Card Sorting Test, the Verbal Fluency Test, and the Trail Making Test (Miles et al., 2020). When synthesizing findings from multiple studies, our analyses revealed a medium, significant, negative effect size for set shifting/cognitive flexibility. However, there was significant variation between tests. Thus, as noted in the review by Miles et al. (2020), measures of cognitive flexibility appear to vary in usefulness in terms of differentiating between patients with AN and healthy controls. Since "by definition, executive functions operate on other cognitive processes" (Strauss et al., 2006, p. 405), non-executive cognitive processes will be involved in solving these tasks. Thus, we encourage future studies to utilize tests which isolate specific areas of executive functioning, in order to explore which components of executive function are compromised, rather than complex tasks where it can be difficult to elucidate the cognitive functions involved. The Wisconsin Card Sorting Test and the shifting version of the Trail Making Test were the most commonly used measures of set shifting/cognitive flexibility (k = 19 and k = 13, respectively) in the current study. Both of these measures revealed significantly worse performance in the patient group, with medium effect sizes for both tasks.
Our findings thus confirm previous studies which demonstrate a discrepancy between verbal and perceptual set shifting/cognitive flexibility tasks (Zakzanis et al., 2010). For patients with AN, the perceptual form of set shifting appeared to be associated with cognitive underperformance, whereas verbal set shifting appears comparable, or even superior, to that of the control group (Zakzanis et al., 2010). This divergence could also account for the significant variation between tests within the set shifting/cognitive flexibility domain in the current study. Performance on verbal fluency correlates strongly with speed of information processing (Boone et al., 1998), verbal IQ, and working and semantic memory, in addition to executive functions like monitoring and suppression (Strauss et al., 2006). The Brixton task, in contrast, does not require any verbal abilities and has been shown to load on the same factor (simple alternation) as the Trail Making Test (shifting) in patients with eating disorders (Tchanturia et al., 2004a). Thus, most neuropsychological tests assess combinations of different cognitive functions, and particularly tests assessing executive functions are known to operate on multiple cognitive processes (Miyake et al., 2000). For example, the Wisconsin Card Sorting Test is categorized within the set shifting/cognitive flexibility subdomain of executive functions. However, performance on this task involves several cognitive functions, including memory, visual perception, auditory perception, as well as cognitive flexibility (Keefe, 1995). Therefore, we cannot be conclusive about the specificity of the impairments found, and we caution against overly simple interpretations that do not take account of this complexity.
Interestingly, the moderator effect of psychotropic medication was positive, indicating that patients on medication performed better on the set shifting/cognitive flexibility tasks. Previous studies have shown that some medications can indeed influence cognitive performance (Barker et al., 2004; Goldberg and Burdick, 2001; Pachet and Wisniewski, 2003). With this in mind, it is concerning that half of the included studies fail to report, or control for, medication use in their samples. It is also noteworthy that few of the included studies account for the potential impact of co-morbid disorders on neuropsychological performance, especially since AN is commonly associated with other psychiatric disorders, such as anxiety and/or depression (Kaye et al., 2004; Kennedy et al., 1994). In fact, most studies (m = 35) neither assess nor report the number of participants with a co-morbid Axis I diagnosis. This is particularly striking since the impact of depression and anxiety on neuropsychological performance in AN has been reported in previous studies (Billingsley-Marshall et al., 2013; Ely et al., 2016; Wilsdon and Wade, 2006). Both these illnesses are themselves associated with impaired neuropsychological test performance (Moran, 2016; Rock et al., 2014), and meta-analyses of affective disorders have repeatedly demonstrated that patients express deficits in areas of memory, attention, and executive function when compared to controls (Burt et al., 1995; Zakzanis et al., 1998).
Since the early 2000s there has been interest in developing a neuropsychological test battery specifically aimed at assessing patients with AN (Rose et al., 2011). However, as is apparent from the vast range of tests employed in the source studies included in the current meta-analysis, there is still a lack of consensus regarding which neuropsychological test(s) would be most suitable for assessing individuals suffering from this illness. The results from our meta-analysis revealed four neuropsychological tests which stood out: firstly, because they demonstrated an effect size which would be considered medium or large (i.e., Hedges' |g| ≥ 0.5); secondly, because they encompassed more than five effect sizes, thereby indicating a fairly strong evidence base. The four tests were as follows: the switching condition of the Trail Making Test (k = 13, g = -0.595; Delis et al., 2001; Reitan and Wolfson, 1985), the perseverative responses condition of the Wisconsin Card Sorting Test (k = 9, g = -0.535; Heaton et al., 1993), the Block Design Test from the Wechsler Adult Intelligence Scale (k = 7, g = -0.660; Wechsler, 2008), and the Rey Complex Figure Test (Osterrieth, 1944), for which both delayed recall (k = 7, g = -0.561) and the central coherence index (k = 9, g = -0.669) revealed a medium magnitude of difference between individuals with AN and healthy controls. Thus, based on these findings, clinicians or researchers with limited resources and/or time might wish to consider including one or more of the above-mentioned neuropsychological tests when performing cognitive assessments of individuals with AN.
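The selection rule described above can be expressed directly over outcome-level summaries. The sketch below uses the k and g values reported in the text for the four selected tests, plus the Verbal Fluency phonemic effect as a contrasting row (its k of 6 is an assumed, illustrative value):

```python
# Outcome-level summaries: (test name, number of effect sizes k, Hedges' g).
# The first five rows use values reported in the text; the k for the
# Verbal Fluency row is hypothetical, included only for contrast.
results = [
    ("TMT switching", 13, -0.595),
    ("WCST perseverative responses", 9, -0.535),
    ("WAIS Block Design", 7, -0.660),
    ("RCFT delayed recall", 7, -0.561),
    ("RCFT central coherence index", 9, -0.669),
    ("Verbal Fluency phonemic", 6, 0.180),
]

# Selection rule: medium-or-larger effect (|g| >= 0.5) backed by
# more than five effect sizes (k > 5)
selected = [name for name, k, g in results if abs(g) >= 0.5 and k > 5]
```

Applying the rule retains exactly the four tests (five outcomes) singled out in the text, while the small phonemic fluency effect is filtered out despite its adequate evidence base.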
It is worth noting that an inherent concern with employing neuropsychological tests to assess psychiatric populations is that the majority of these tasks were developed to differentiate between healthy controls and patients with brain lesions or trauma, not psychiatric populations, meaning they are potentially not sensitive enough to detect subtle cognitive inefficiencies (Keefe, 1995). It has been argued that there is a need for instruments specifically tailored for assessing and detecting patients with psychiatric disorders (Kuelz et al., 2004). One first step could be to place greater emphasis on process scores, which are generated by monitoring the behaviors used to solve the task (Stedal et al., 2019). The vast majority of studies included in the current synthesis report global, or overall, achievement scores; the central coherence index is an exception. In fact, our results reveal that central coherence was associated with the largest effect size, indicating that, in comparison to healthy controls, patients with AN apply a more inefficient strategy when copying a complex figure. This is in line with a previous synthesis of studies employing the Rey Complex Figure Test to investigate central coherence in eating disorders (Lang et al., 2016). Taken together, these findings highlight the benefit of recording process scores when examining psychiatric populations with neuropsychological instruments.
Another important point of consideration when interpreting findings from neuropsychological tasks is whether these effect sizes should be considered clinically significant, and whether they reflect clinically significant impairments. Traditionally, a score of 2 standard deviations below the norm (the equivalent of a Cohen's d effect size of +/-2.0) is considered clinically significant (Lezak et al., 2004). On the other hand, Abramovitch and Cooperman (2015) propose that a standard deviation of 1.0 could be considered a clinically meaningful cutoff for neuropsychological test performance. In the current study, none of the effect sizes reached a magnitude of 1 or above. It has been argued that small to moderate effect sizes, like the findings from the current study, should be labelled cognitive underperformance rather than clinically significant impairment (Abramovitch et al., 2013). Another principal consideration is how these findings relate to everyday functioning. Studies examining the ecological validity of neuropsychological test performance in AN have demonstrated only small correlations between measures considered more ecologically valid and traditional neuropsychological tests (Herbrich et al., 2019; Spitoni et al., 2018; Stedal and Dahlgren, 2015). However, the majority of these studies are based on findings from children and adolescents. Future studies should aim to include multiple formats of cognitive assessment, since "self-reports of functioning, as well as observations of behavior while performing testing, are critically important pieces of information" (Harvey, 2012, p. 91).
Our quality of study index revealed that the precision of the neuropsychological test results could be questionable for some of the included studies. Despite the moderator analyses revealing only one association between test results (processing speed) and the quality of study index, it is concerning that several of the included studies failed to report, or control for, variables which have been demonstrated to influence neuropsychological test results, such as anxiety (Ely et al., 2016), depression (Burt et al., 1995; Wilsdon and Wade, 2006), medication (Barker et al., 2004; Goldberg and Burdick, 2001), and intelligence quotient (Diaz-Asper et al., 2004). Consequently, for future studies, we emphasize the importance of reporting possible confounding variables, both to control for them in the primary study and to facilitate future meta-analyses.

Limitations and future directions
The current study has some limitations which are worth noting. First, we used an a-priori classification of neuropsychological domains and subdomains as a way to present our findings; more fine-grained or alternative classifications, however, could lead to locally different effect sizes and moderator effects. We therefore encourage researchers in the field of neuropsychological functioning to examine the convergence and diversity of such classifications in subsequent meta-analyses.
Second, for some (sub-)domains, such as working memory and central coherence, the interpretation of the effect sizes was limited by only two available outcome measures. Besides, some (sub-)domains contained only a few effect sizes. As a consequence, quantifying and explaining variation between outcomes within these domains and performing meaningful moderator analyses was not possible. Moreover, the degree and consequences of publication bias remain to be examined. We therefore suggest updating the sample of primary studies in future meta-analyses to increase the number of studies and effect sizes.
Third, the precision of the effect size interval estimate depends on several factors, including sample size and elements of the study design. Although the study quality index revealed that the precision of the neuropsychological test results could be questionable for some of the included studies, our moderator analyses showed that it only moderated the effect sizes for the domain of processing speed. Similar to the classification of (sub-)domains, these results may depend on the setup of the study quality index and may thus be examined further.
Fourth, with the exception of two, all studies examined all-female samples. This observation is by no means surprising; in fact, recent studies have emphasized that men have been underrepresented in the extant eating disorder literature (Limbers et al., 2018). Given that the generalizability of our findings to male populations is thus limited, it is key for future primary studies to also include male participants.

Conclusion
The findings from the current meta-analysis revealed consistent evidence of cognitive underperformance in individuals with AN. We have provided a framework for comparing results across studies and offer specific suggestions for neuropsychological tests which might be helpful in the assessment of this patient group. To move the field forward, there is a need for greater coherence in assessment procedures. Increased knowledge concerning a potential neuropsychological profile specific to AN could facilitate the development of more sensitive instruments. Until then, utilizing a process-oriented neuropsychological assessment approach might hold some promise.

Declaration of Competing Interest
The authors report no declarations of interest.