Female advantage in verbal learning revisited: a HUNT study

ABSTRACT The argument for a female advantage in word list learning is often based on partial observations that focus on a single component of the task. Using a large sample (N = 4403) of individuals 13–97 years of age from the general population, we investigated whether this advantage is consistently reflected in learning, recall, and recognition and how other cognitive abilities differentially support word list learning. A robust female advantage was found in all subcomponents of the task. Semantic clustering mediated the effects of short-term and working memory on long-delayed recall and recognition, and serial clustering on short-delayed recall. These indirect effects were moderated by sex, with men benefiting more from reliance on each clustering strategy than women. Auditory attention span mediated the effect of pattern separation on true positives in word recognition, and this effect was stronger in men than in women. Men had better short-term and working memory scores, but lower auditory attention span and were more vulnerable to interference both in delayed recall and recognition. Thus, our data suggest that auditory attention span and interference control (inhibition), rather than short-term or working memory scores, semantic and/or serial clustering on their own, underlie better performance on word list learning in women.

Sex differences in cognition have been a matter of considerable controversy since the 1960s (Andreano & Cahill, 2009). Some evidence suggests that women outperform men in verbal abilities, whereas men outperform women in math and spatial reasoning (Herlitz et al., 1997; Maguire et al., 1999; Meinz & Salthouse, 1998). Other evidence suggests that men and women perform at comparable levels in most cognitive tasks, laying ground for the gender similarity hypothesis (Hyde, 2005, 2016), or at least that some cognitive sex differences are less stable over time than others. For instance, sex differences in math skills, with boys outperforming girls, decreased in the 1970s and 1980s (Hyde et al., 1990) and remained small to negligible afterwards (Lakin, 2013; Lindberg et al., 2013). An early meta-analysis of 165 studies involving 1,418,899 subjects noticed a major decline in the magnitude of sex differences in verbal abilities since 1973 (Hyde & Linn, 1988). Focusing on vocabulary knowledge, reading comprehension, speech production and essay writing, the authors found differences between women and men to be so small that they argued that sex differences in these verbal abilities no longer existed. A more recent meta-analysis of 617 studies involving 1,233,921 participants, which investigated verbal memory as an aspect of episodic memory, suggests a female advantage in verbal tasks across different levels of language (word-, sentence- and discourse-level), in the retrieval of names for images and locations, and overall in episodic memory (Asperholm et al., 2019). This meta-analysis derived a variable "verbal" from 341 studies that investigated memory for "words, sentences, facts, conversations, or narrative content" (p. 791), which precludes clear inferences on evidence regarding specific contributing variables, such as those involved in word list learning.
In contrast, large cross-sectional studies afford evidence that does not suffer from typical drawbacks of evidence from meta-analyses and reviews due to heterogeneity of methods and measures, while longitudinal studies provide evidence on changes across the lifespan. Two such recent large-scale studies suggest a female advantage in verbal paired-associates learning (cross-sectional design, Talboom et al., 2019) as well as in immediate recall in word list learning and better semantic fluency in women with high school or higher education relative to men (longitudinal design, Bloomberg et al., 2021).
Looking specifically at word list learning, a female advantage was found in girls as young as 5 through 16 years of age when tested on the California Verbal Learning Test (CVLT): girls outperformed boys on all five learning trials, delayed recall and delayed recognition (Kramer et al., 1997). However, a female advantage in word list learning has not been consistently found across studies (Sobal & Juhasz, 1977). Some studies show that a female advantage is present in learning trials, but not in recognition (Bleecker et al., 1988), in immediate and delayed recall, but not in recognition (Kramer et al., 1988), or in immediate recall but not in long-delayed recall (Ragland et al., 2000). Some studies tested for a sex difference in word list learning by investigating only learning trials, and not delayed recall and recognition (Herlitz et al., 2013). Other studies reported only selective findings, despite administering the standard tests of word list learning in full, focusing on a few variables from CVLT, because of power concerns due to a small sample size (e.g., Ragland et al., 2000) or for other reasons (Bücker et al., 2014; Chipman & Kimura, 1998; Hazlett et al., 2010; Mellet et al., 2014; Otero Dadin et al., 2009). These studies often involve patient populations, which has likely affected their choice of variables and outcomes. For instance, a recent study that argued for a female advantage in verbal memory focused on short- and long-delayed recall of the Rey Auditory Verbal Learning Test (RAVLT), because of these subtests' potential to differentiate between Alzheimer's disease (AD) and healthy cognitive status (Sundermann et al., 2016).
Another methodological challenge in comparing findings on word list learning across studies pertains to differences in testing paradigms. The most common differences across studies are found in the number of learning trials, the number of items per learning trial, the presence of cues to recall and recognition, the mode of stimuli presentation (auditory vs. visual), the required mode of response (verbal vs. written) and the timing of response (untimed vs. timed). That these differences are not trivial is suggested by the findings that a female advantage in word list learning evident in learning trials 1-4 disappeared in trial 5 (Ragland et al., 2000) and that a female advantage in immediate recall was found only in a longer list (16 words from CVLT-II), but not in a shorter list (nine words from the Philadelphia Verbal Learning Test) (Sunderaraman et al., 2013), while another study showed a female advantage when the lists were 10 and 20 words long (Bloomberg et al., 2021). The discrepant and sometimes partial observations on which the argument for female advantage in word list learning is based and the concern that heterogeneity of methods and measures may affect results of meta-analyses indicate a need for a large-scale study that would investigate all components of word list learning in a single sample.
Since word list learning is a complex task, it is possible that sex differences in multiple cognitive processes contribute to sex differences in this type of learning and memory. For instance, female advantage in word list learning has been associated with a difference in use of learning and recall strategies. Encoding strategies are in general considered "cognitive mediational processes that support memory" (Goldstein et al., 1994). Standard tests of word list learning, such as CVLT-II, routinely produce scores that reflect a specific strategy used in encoding and recall, such as semantic, serial and subjective clustering (Delis et al., 2000). These three types of strategies differ in the specific criterion used to mentally group words: specifically, semantic features of words in semantic clustering, the order of their appearance in the list in serial clustering, or some other, subjectively relevant way of grouping words in subjective clustering (Delis et al., 2010). Evidence so far indicates that women rely more on semantic clustering than on serial clustering when learning lists of words, and that they rely more on semantic clustering than men, which led to the notion that female advantage in word list learning might be due to the use of semantic clustering (e.g., Kramer et al., 1988, 1997; Ragland et al., 2000; Sunderaraman et al., 2013). The use of strategies other than semantic clustering in word list learning, such as serial clustering, position of a word on the list, order of presented words, or phonemic properties of words (Delis et al., 2010) is typically less efficient in CVLT-II, RAVLT and similar tests.
Importantly, semantic clustering depends on cognitive resources such as working memory and attention, which means that sex differences in these resources may also contribute to the sex difference in word list learning. Although short-term memory and working memory have not been consistently distinguished across cognitive neuroscience theories (D'Esposito & Postle, 2015), in the present study we use both concepts in the sense in which they are typically associated with tests of digit span forward and digit span backward, i.e., roughly assuming that they both allow brief storage of memory traces but that working memory also allows their manipulation. While learning a list of words is considered an episodic memory task, it requires short-term memory for a simple retention of information, working memory for organisation of briefly stored memory traces, as well as semantic memory, because this type of long-term memory involves stored knowledge about the world and language. Access to this store is relevant for learning, i.e., in relating stimuli when encoding items or in deciding which items are semantically related, which is critical for semantic clustering. Neurofunctional sex differences have been associated with performance on working memory tasks even when sex differences in behavioural scores were absent, suggesting that men and women rely on different brain structures while performing such tasks (Goldstein et al., 2005). Resting state functional neuroimaging evidence shows that, in addition to common brain areas, sex-specific functional networks are also associated with working memory tasks (Hill et al., 2014). Neuroimaging and brain lesion studies provide evidence on sex differences in semantic processing (Gainotti, 2010; Pasterski et al., 2011). Taken together, these findings suggest that the sex differences in reliance on brain structures and functional networks underlying these other types of memory may also underlie the female advantage in word list learning.
Given the role of short-term and working memory in learning, on the one hand, and the role of semantic clustering in word list learning, on the other, we tested the hypothesis that semantic clustering mediates the effects of short-term and working memory on learning, delayed recall and recognition and that this effect is moderated by sex. We also tested whether serial clustering would mediate the effects of short-term and working memory on the outcome variables and whether these effects would be moderated by sex, because some authors argue that participants switch from semantic to serial clustering by default as soon as the working memory capacity has reached its limit (Sunderaraman et al., 2013). Finally, since word recognition requires that representations of words from the target and interference lists be available in memory for items' comparisons, short-term and working memory are highly relevant for keeping representations of words active for a time sufficient to complete such comparisons. Word recognition further requires the ability to discern a target word from other words with overlapping features (pattern separation), such as semantic features of words that belong to the same semantic category, because words representing similar concepts trigger activation of similar semantic features. Thus, we hypothesised that short-term and working memory would affect word recognition via semantic and serial clustering, and that pattern separation would affect true positives in recognition via auditory attention span, testing whether sex would moderate these effects. To further discern mental processes supporting better performance on word list learning, we investigated the effects of auditory attention span on delayed recall and recognition and susceptibility to interference from irrelevant information in recall and recognition.

Methods
The present study was approved by the Regional committee for medical and health research ethics (reference: 155024, dated 08.09.2020) and conducted under agreements with the Trøndelag Health Study (Helseundersøkelsen i Nord-Trøndelag, henceforth HUNT) on use of data from the neuropsychological tests (reference: 2021/14777) and information on the study and the data collection (reference: 2021/11308).

Data
Data used in the present study were collected between November 2017 and January 2021 as part of a large longitudinal population health study in Norway, the HUNT study (https://www.ntnu.edu/hunt/data). The HUNT study is a collaboration between the HUNT Research Centre (Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, NTNU), Trøndelag County Council, Central Norway Regional Health Authority and the Norwegian Institute of Public Health. It contains datasets for over 230,000 participants from Trøndelag county (www.ntnu.edu/hunt), which were collected in four waves, HUNT1-HUNT4, starting in 1984 (e.g., Åsvold et al., 2022; Krokstad et al., 2013; Stordal et al., 2001). The present study is based on the data obtained in the fourth wave, the HUNT4 survey.
The main novelty in HUNT4 relative to previous HUNT surveys is Memoro, a web-based, self-administered platform for neuropsychological testing created by Håberg (Hansen et al., 2015, 2016). Briefly, the battery deployed in HUNT4 consisted of a test of verbal memory, visual memory, simple reaction time, complex reaction time, digit span forward, digit span backward, behavioural pattern separation and a digit-symbol coding test. The primary focus of the present study is the test of verbal memory, i.e., word list learning (section Materials). In addition to word list learning scores, we downloaded HUNT4 participants' data on digit span forward (as a measure of short-term memory), digit span backward (as a measure of working memory), behavioural pattern separation (henceforth pattern separation, as a measure of cognitive ability to separate stimuli with overlapping features), and simple reaction time (as a measure of speed of psychomotor processing) together with their demographic data.

Participants
The sample (N = 4403) consisted of 2571 female and 1832 male participants, who were between 13 and 97 years of age. Although there were more women (58.4%) than men (41.6%) overall, this pattern was reversed in the two oldest age groups (Table 1). The self-reported highest level of education attained by the time of testing indicates that 55.1% of all participants had some form of higher education, having completed at least 16 years of formal schooling, 32.2% had a high-school diploma, while 12.7% had completed primary/middle school education. As Table 1 shows, 91% of participants were right-handed, 7.5% were left-handed and 1.5% were ambidextrous.

Structure of the word list learning test
The word list learning test consisted of three learning trials (T1-T3) of the target list (18 words), immediate recall of an interference list of words (18 words), short- and long-delayed recall of the target list, and a recognition test (36 words). The smaller number of learning trials and longer lists of words in Memoro were deliberately chosen to preclude a ceiling effect.

Structure of word lists
All lists were composed of words from seven semantic categories: animals, clothes, fruits and vegetables, furniture, musical instruments, vehicles/transportation and tools.
The target list consisted of 18 words belonging to four of the seven semantic categories (e.g., animals, clothes, fruits and vegetables and furniture). These semantic categories were first shuffled and the words were then selected from the shuffled list by an algorithm using the following structure: five words from the first (e.g., animals), four words from the second (e.g., clothes), five words from the third (e.g., fruits and vegetables) and four words from the fourth category (e.g., furniture) in the shuffled list.
The interference list was likewise composed of 18 words, half of which were selected from the categories overlapping with the target list: five words from the first category (animals) and four words from the third category (fruits and vegetables). The other half was chosen from two novel categories: five words from the fifth category (e.g., musical instruments) and four words from the sixth category (e.g., vehicles/transportation).
Finally, the recognition list consisted of 36 words, half of which were the words presented on the target list, while the other half was composed of nine randomly chosen words from the interference list and nine words from a novel category (e.g., tools). Thus, only nine words on this list were in fact novel words. Importantly, the proportion of interference words in the recognition list in our study (18/36) was the same as the proportion of distractors in similar, widely used tests of verbal memory, such as CVLT-II (16/32).
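The list structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Memoro implementation: the word pools are placeholders, the function name `build_lists` is hypothetical, and the rule that interference words from the two overlapping categories never repeat target words is assumed rather than stated in the text.

```python
import random

CATEGORIES = ["animals", "clothes", "fruits and vegetables", "furniture",
              "musical instruments", "vehicles/transportation", "tools"]

def build_lists(pools, seed=None):
    """Return (target, interference, recognition) lists for one participant.

    pools maps each category name to a list of candidate words (placeholder data).
    """
    rng = random.Random(seed)
    cats = CATEGORIES[:]
    rng.shuffle(cats)  # shuffle the seven categories first

    def draw(cat, k, exclude=()):
        # Assumption: words already used on the target list are not reused.
        candidates = [w for w in pools[cat] if w not in exclude]
        return rng.sample(candidates, k)

    # Target list: 5 + 4 + 5 + 4 words from the first four shuffled categories.
    target = []
    for cat, k in zip(cats[:4], (5, 4, 5, 4)):
        target += draw(cat, k)

    # Interference list: 5 + 4 words from the first and third categories
    # (overlapping with the target list), 5 + 4 from the fifth and sixth.
    interference = []
    for cat, k in ((cats[0], 5), (cats[2], 4), (cats[4], 5), (cats[5], 4)):
        interference += draw(cat, k, exclude=target)

    # Recognition list: all 18 targets, 9 random interference words,
    # and 9 words from the remaining (novel) category.
    novel = draw(cats[6], 9)
    recognition = target + rng.sample(interference, 9) + novel

    # Presentation order is randomised per participant.
    rng.shuffle(target)
    rng.shuffle(interference)
    rng.shuffle(recognition)
    return target, interference, recognition

# Placeholder pools: 19 made-up labels per category.
pools = {c: [f"{c}_{i:02d}" for i in range(19)] for c in CATEGORIES}
target, interference, recognition = build_lists(pools, seed=1)
```

With these placeholder pools, the three lists contain 18, 18 and 36 items respectively, and half of the recognition list consists of target words.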

Characteristics of stimuli
All words used in the word list learning tests were concrete nouns. Relying on the Norwegian Words (NW) database (Lindt et al., 2015) and associated resources (https://www.hf.uio.no/iln/research/groups/clinical-ling/project/norwegian-words), we assessed the stimuli for phonological neighbourhood density, frequency, imageability and length (number of letters in a word). Based on the NW criteria, there were 26 (19.6%) words with many phonological neighbours (≥ 19) in the stimuli, 68 (51.1%) words with 3-18 phonological neighbours, and 39 (29.3%) words with only few phonological neighbours (0-2). Almost all words in the present study (129/133) were high-frequency words, while four stimuli were medium-frequency words. The words' imageability was either high, as found in 86 (64.7%) words, or medium, as found in 47 (35.3%) words. The word length varied from 2 to 11 letters, with 40 (30%) words containing between 6 and 11 letters and 93 (70%) words containing five or fewer letters.
Summing up, most stimuli used in the present study were relatively short words, with high frequency and imageability, and with a medium number of phonological neighbours.

Procedure
All participants gave informed consent before the testing commenced. The participants self-administered the test battery online using Memoro (Hansen et al., 2015, 2016). The word lists, recorded by a female native speaker of Norwegian (Bokmål), were presented orally. The spoken words were presented to participants at a rate of ∼ 2 sec per word. Each participant received a different randomisation of words on the target list as well as on the interference list, but the same randomised target list was presented to a single participant in all three learning trials (T1-T3).
In the learning and delayed recall tests, participants were required to type in as many words as they could remember from the target list (T1-T3, and the delayed recall tests) or from the interference list (in the immediate recall of that list). The typed words stayed on the screen for the duration of a specific test. Consistent with CVLT-II (Delis et al., 2000), the long-delayed recall test was administered about 20 min after the short-delayed recall test. During the delay, participants were assessed in other, non-verbal, cognitive tests from the battery. In the recognition test, the task was to indicate whether a heard word had appeared on the target list or not by pressing a specific key for "yes" or "no", as appropriate.
After completing the recognition test, participants were required to answer a question on whether they used a strategy to learn words (a "yes"/"no" response). Those who answered positively were offered the following seven strategies, of which they could indicate using one or more: (1) I tried to encode and repeat in my mind the words in the order in which they were presented (henceforth "covert rehearsal"), (2) I tried to visualise the words in my mind ("visualisation"), (3) I tried to create a verbal history of the words ("story line"), (4) I tried to visualise and position the words in a known environment ("method of loci"), (5) I focused on the last part of the word list ("word list-last part"), (6) I focused on different parts of the word list at each word list presentation ("word list-different parts") and (7) other.
Time to complete the word list learning test was not limited.

Scoring
Participants' responses were automatically scored for accuracy by the programme. Each correct response counted as one point. Typos were disregarded, as long as they did not change the word's form or meaning (e.g., "taksi" instead of "taxi", "pinao" instead of "piano"). All other types of incorrect responses were considered errors, earning zero points.

Variables of interest from the word list learning test: learning and delayed recall
The word list learning variables included learning trials T1-T3. A total learning score was calculated by summing up the scores obtained on trials T1 through T3. Following previous research, T1 scores were considered as an indicator of auditory attention span (Donders, 2008). The interference list scores were assessed as an additional word-list learning variable, but these scores were primarily used to derive other variables, as described below. The delayed recall included the scores obtained in short-delayed recall and long-delayed recall. A total recall score was calculated by summing up the scores on T1, T2, T3, the interference list, and the short- and long-delayed recall tests.
A few additional variables were derived from the recall tests' scores. Proactive interference, i.e., the impact that learning words from the target list had on immediate recall of words from the interference list, was calculated from T1 and interference list scores (T1 − interference words). Retroactive interference, i.e., the impact that learning words from the interference list had on recalling previously learned words from the target list, was calculated from T3 and short-delayed recall scores (T3 − short-delayed recall). Number of intrusions from the interference list that each participant generated during delayed recall tests was recorded separately for short- and long-delayed recall.
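As a minimal sketch, the derived recall variables can be computed as below. It reads the two interference measures as simple difference scores, and it counts each intruding word once per test; both are assumptions, and all function names are hypothetical.

```python
def proactive_interference(t1_score, interference_score):
    """T1 score minus immediate recall of the interference list."""
    return t1_score - interference_score

def retroactive_interference(t3_score, short_delay_score):
    """T3 score minus short-delayed recall of the target list."""
    return t3_score - short_delay_score

def count_intrusions(recalled_words, interference_words, target_words):
    """Count distinct recalled words that belong to the interference list only.

    Assumption: a repeated intrusion is counted once.
    """
    intruders = set(interference_words) - set(target_words)
    return len({w for w in recalled_words if w in intruders})
```

For example, a participant who recalls 8 words at T1 and 6 words from the interference list has a proactive interference score of 2 under this reading.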
Furthermore, for each learning trial, as well as for both delayed recall tests, we estimated semantic and serial clustering indices for each participant. In other words, semantic and serial clustering indices in total learning were estimated from participants' T1-T3 scores, and in short- and long-delayed recall from short- and long-delayed recall scores respectively. An average number of semantic and serial clusters for each participant was calculated by summing up the number of clusters at these tests and dividing it by the number of tests from which the indices were estimated.
The estimation of indices of semantic and serial clustering was guided by previous work on the CVLT test using list-based expectancy measures (Stricker et al., 2002) and it was carried out in Python 3.7.4 (http://www.python.org). Specifically, for each trial we randomly drew, 5000 times, as many words from the correct word list as the participant had recalled. The semantic clustering index was then estimated by subtracting the number of times two adjacent words recalled by the participant belonged to the same semantic category from the average number of times that two adjacent words belonged to the same semantic category across the 5000 randomly generated lists. The serial clustering index was estimated by subtracting the number of times that two adjacent words the participant recalled were in the same order in which they were presented on the target list from the average number of times that two adjacent words appeared in the correct order across the 5000 randomly generated lists.
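The simulation-based estimation can be sketched as follows. This is an illustrative reconstruction, not the authors' script. It makes three assumptions: only correct, non-repeated recalls are scored; a serial pair is two adjacent recalled words that were presented consecutively in forward order; and the index is returned as observed minus chance-expected, the usual orientation of list-based measures, so that higher values mean more clustering. Read literally, the wording above gives the opposite sign, so the result should be negated if that convention is intended.

```python
import random

def semantic_pairs(words, category_of):
    """Number of adjacent word pairs that share a semantic category."""
    return sum(1 for a, b in zip(words, words[1:])
               if category_of[a] == category_of[b])

def serial_pairs(words, position_of):
    """Number of adjacent pairs presented consecutively, in forward order."""
    return sum(1 for a, b in zip(words, words[1:])
               if position_of[b] - position_of[a] == 1)

def clustering_indices(recalled, presented, category_of, n_sim=5000, seed=0):
    """Return (semantic_index, serial_index) for one recall trial."""
    rng = random.Random(seed)
    position_of = {w: i for i, w in enumerate(presented)}

    # Keep correct recalls only, dropping repetitions but preserving order.
    correct = []
    for w in recalled:
        if w in position_of and w not in correct:
            correct.append(w)
    n = len(correct)
    if n < 2:
        return 0.0, 0.0  # no adjacent pairs to score

    # Chance expectation: average pair counts over random draws of n list words.
    sem_total = ser_total = 0
    for _ in range(n_sim):
        draw = rng.sample(presented, n)
        sem_total += semantic_pairs(draw, category_of)
        ser_total += serial_pairs(draw, position_of)

    sem_index = semantic_pairs(correct, category_of) - sem_total / n_sim
    ser_index = serial_pairs(correct, position_of) - ser_total / n_sim
    return sem_index, ser_index
```

A participant who recalls words grouped by category rather than in presentation order would obtain a positive semantic index and a serial index at or below zero under this orientation.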
Finally, from the participants' subjective reports on strategies used in word list learning we assessed the number of self-reported strategies and the frequency of use of each of the seven choices.

Variables of interest from the word list learning test: recognition
The following recognition variables were extracted from the data: (1) true positives ("hits"), i.e., words from the target list that participants correctly recalled as being on the target list; (2) false positives, i.e., words from the interference list or novel list that participants incorrectly recalled as being on the target list; (3) true negatives, i.e., words from the interference list or a novel list that participants correctly recalled as not being on the target list; (4) false negatives ("misses"), i.e., words from the target list that participants incorrectly recalled as not being on the target list. Furthermore, for the purposes of error analysis the false-positive errors were split into (5) the errors of identifying words from the interference list as being on the target list ("false positives old") and (6) the errors of identifying words from a novel list as being on the target list ("false positives new"). Finally, a recognition discriminability index, $d'$, which is considered the most sensitive measure of recognition, was calculated using the standard formula, $d' = z(\text{true positives}) - z(\text{false positives})$, adopted from signal detection theory (e.g., Graves et al., 2017; Russo et al., 2017; Sivakumaran et al., 2018; Snodgrass & Corwin, 1988).
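A minimal sketch of the discriminability index follows, taking z as the inverse of the standard normal cumulative distribution applied to the hit and false-positive rates. The text does not say how rates of exactly 0 or 1 were handled; the adjustment used here (adding 0.5 to each count and 1 to each denominator) is an assumption introduced only to keep z finite, and `d_prime` is a hypothetical name.

```python
from statistics import NormalDist

def d_prime(hits, false_positives, n_targets=18, n_distractors=18,
            adjust_extremes=True):
    """d' = z(hit rate) - z(false-positive rate) for one participant."""
    if adjust_extremes:
        # Assumed adjustment so that rates of 0 or 1 do not give infinite z.
        hit_rate = (hits + 0.5) / (n_targets + 1)
        fp_rate = (false_positives + 0.5) / (n_distractors + 1)
    else:
        hit_rate = hits / n_targets
        fp_rate = false_positives / n_distractors
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    return z(hit_rate) - z(fp_rate)
```

With equal hit and false-positive counts the index is zero (no discrimination), and it grows as hits increase or false positives decrease.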

Variables of interest from other cognitive tests
Considering the possibility that short-term memory, working memory, speed of psychomotor processing, and the ability to discern stimuli with overlapping features might affect word list learning, delayed recall and recognition, we included the participants' scores on digit span forward, digit span backward, simple reaction times and pattern separation scores as proxies of these cognitive abilities in the analyses. These measures were also used in the analysis of self-reported learning strategies, since we assumed that such strategies would be affected by auditory attention span, indicated by T1 (Donders, 2008), short-term memory as a passive repository of information, indicated by digit span forward, and working memory as a resource for simultaneous storage and manipulation of short-term memory traces, indicated by digit span backward.

Other variables of interest
Self-reported memory problems, based on participants' subjective ratings of their memory in everyday life, were assessed on a scale of 0-2, where 0 indicates no memory problems, 1 some, and 2 considerable memory problems. Since the questionnaire for the HUNT4-young participants (13-19 years old) does not include these questions, we report these data for participants 20 years of age and older.
For completeness, we report how much time (measured in minutes) participants took to perform the learning trials, delayed recall, recognition and the total time taken to complete the task.

Analyses
Prior to analysis, data were screened for accuracy and compliance with the requirements of the analyses. Where an assumption was not met, an appropriate procedure was applied to remedy the issue. Since the negatively skewed residuals for the outcome variable true positives in the multiple regression analysis (see below) indicated that the normality requirement had not been met, we reflected and logarithmically (to the base of 10) transformed the scores, as recommended (Curran-Everett, 2018; Keene, 1995; Tabachnick & Fidell, 2014; but see Feng et al., 2014, for a different view).
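The reflect-and-log transformation can be illustrated with a short sketch. It assumes the common form in which each score is subtracted from a constant equal to the largest score plus one before taking the base-10 logarithm; the constant actually used is not stated in the text.

```python
import math

def reflect_log10(scores):
    """Reflect negatively skewed scores, then apply a base-10 log.

    Assumption: the reflection constant is max(scores) + 1, so the
    smallest reflected value is 1 and its logarithm is 0.
    """
    k = max(scores) + 1
    return [math.log10(k - s) for s in scores]

transformed = reflect_log10([18, 17, 9])
```

Note that reflection reverses the direction of the scale: after the transformation, higher values correspond to fewer true positives, which has to be kept in mind when reading the signs of regression coefficients.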
An independent samples t-test was used to assess whether women and men differed in age, time to complete the tests, and to compare outcome variables' scores of men and women at each age band. A GLM repeated measures analysis with learning trial as within-subject variable, sex as between-subject variable, and age as a covariate of no interest was used to assess sex differences in learning words from the target list. A significant trial by sex interaction was further inspected in the simple effects analysis to determine (1) whether women and men differed significantly in their scores at each learning trial, and (2) whether there were significant differences in the scores among the trials within each group. The results were corrected for multiple comparisons using the Bonferroni correction. The same GLM analyses were performed for semantic and serial clustering at T1-T3, short- and long-delayed recall.
Multiple linear regression analyses were performed separately for total learning, short- and long-delayed recall as outcome variables, with sex, age and education as predictors in Model 1, and with digit span forward, digit span backward, simple reaction times, pattern separation, semantic and serial clustering for total learning, plus intrusions and auditory attention span (T1) for short- and long-delayed recall in Model 2. The same analysis was performed with the recognition data, with true positive scores as the outcome variable, except that Model 2 did not include intrusions (Table 3).
The accuracy of the regression model was tested by first breaking the dataset into a sample of randomly selected 80% of cases for which we derived a regression equation and then applying the regression coefficients obtained in that sample to obtain predicted values of each outcome variable on a cross-validation sample (the remaining 20% of cases). We report cross-validated correlations ($r_{cv}$) for each outcome variable and its cross-validated counterpart as well as $R^2$ and squared $R_{cv}$ ($R^2_{cv}$) obtained from the two samples. For regression equations of reasonable validity, cross-validation correlations should be high, and even though $R^2_{cv}$ are expected to be less than $R^2$, they are not expected to differ radically (Howell, 2013).
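The 80/20 cross-validation procedure can be sketched with the standard library alone. The analyses reported here were run in SPSS; the code below is only an illustration of the logic, with hypothetical function names and a small least-squares solver based on the normal equations.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coefs, X):
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in X]

def pearson_r(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def cross_validate(X, y, train_frac=0.8, seed=0):
    """Fit on a random 80% of cases, then apply the coefficients to the rest."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    X_te, y_te = [X[i] for i in test], [y[i] for i in test]
    coefs = fit_ols(X_tr, y_tr)
    r_train = pearson_r(predict(coefs, X_tr), y_tr)  # its square is R^2
    r_cv = pearson_r(predict(coefs, X_te), y_te)     # cross-validated correlation
    return {"R2": r_train ** 2, "r_cv": r_cv, "R2_cv": r_cv ** 2}
```

In the derivation sample, the squared correlation between predicted and observed values equals the model's R² because the model includes an intercept; in the hold-out sample, the same computation gives the cross-validated counterpart.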
We performed additional analyses, separately for each outcome variable, to test whether sex moderated the relationship between semantic and serial clustering and each outcome variable and whether it moderated the relationship between T1 and short- and long-delayed recall and true positives in recognition, controlling for age. We chose a moderation analysis, because we were not interested in showing that sex together with one of these other predictors have a combined effect on the outcome variable, but rather in finding out whether sex would alter the relationship between the outcome variable and these predictors, implying a causal relation (Field, 2018; Hayes, 2018). Continuous variables in the interaction, but not the categorical one, were mean-centred. These analyses were carried out in PROCESS 3.5.3 (www.afhayes.com), implemented in IBM SPSS Statistics for Windows, version 27.0 (Armonk, NY: IBM Corp). Parameters were set to determine bias-corrected bootstrap confidence intervals based on 5000 bootstrap samples, and the confidence level was set at 95%. Standard errors were corrected for heteroscedasticity using the HC3 approach (Hayes & Cai, 2007).
Furthermore, we tested whether sex moderated the respective predicted indirect effects of digit span forward and digit span backward on each outcome variable mediated through semantic and serial clustering, and the predicted indirect effect of pattern separation via auditory attention span on true positives. In this model of moderated mediation, the path connecting the mediator and the outcome variable was hypothesised to be moderated by sex (for an illustration of the conceptual and statistical forms of this specific model see Hayes, 2018, p. 409). The respective indirect effects of digit span forward and digit span backward on total learning, short-delayed recall, long-delayed recall and true positives through semantic or serial clustering conditioned on sex quantify the amount by which two cases with the same value of moderator (e.g., two women) that differ by 1 score on digit span forward or digit span backward are estimated to differ in the outcome variable indirectly, through the effect of digit span forward or digit span backward via semantic or serial clustering on that variable. The direct effect in this model quantifies how much two cases that differ by 1 score in digit span forward or digit span backward are estimated to differ in total learning, short-delayed recall, long-delayed recall, or true positives when semantic or serial clustering and sex are kept constant. The moderated mediation was considered significant if the index of moderated mediation, i.e., the difference between the conditional effects at two levels of the moderator, was statistically significantly different from zero, which in this case indicates a sex difference (Hayes, 2013, 2018).
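For readers who prefer equations, a model in which the moderator acts on the path from mediator to outcome can be written as below. The notation is introduced here for illustration and is an assumption rather than the authors' own: X is the predictor (e.g., digit span forward), M the mediator (e.g., semantic clustering), W the moderator (sex, assumed to be coded 0/1), and Y the outcome; covariates such as age are omitted for brevity.

```latex
% X: predictor, M: mediator, W: moderator (sex, assumed coded 0/1), Y: outcome
\begin{align}
M &= i_M + aX + e_M \\
Y &= i_Y + c'X + b_1 M + b_2 W + b_3 MW + e_Y \\
\text{conditional indirect effect at } W &= a\,(b_1 + b_3 W) \\
\text{index of moderated mediation} &= a\,b_3
\end{align}
```

With a dichotomous moderator coded 0/1, the index equals the difference between the two conditional indirect effects, a(b_1 + b_3) - a b_1 = a b_3, and c' is the direct effect. A bootstrap confidence interval for a b_3 that excludes zero then indicates that the indirect effect differs between the two groups.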
A series of ANCOVA analyses with age as a covariate of no interest was performed to determine possible sex differences in participants' (1) errors in recall (proactive interference, retroactive interference, intrusions in short-delayed recall, intrusions in long-delayed recall), (2) errors in recognition (false positives, false negatives, false positives from the interference list, false positives from novel words) and (3) scores on additional cognitive measures (digit span forward, digit span backward, simple reaction times and pattern separation).
Pearson's correlation coefficient (r) was calculated to assess associations between continuous variables, and the point-biserial coefficient (r_pb) for associations between dichotomous and continuous variables. The chi-square test was used to examine the relationship between two discrete variables.
All tests were two-tailed, with a significance threshold of 0.05 unless otherwise stated (i.e., after Bonferroni correction for multiple comparisons), and were performed in SPSS 27.

Data at a glance
The raw scores in total learning, short- and long-delayed recall, and true positive scores for women and men grouped in eight 10-year age strata are shown in Figure 1, and sex differences in these scores across the eight age bands in Table 2. Participants' errors are shown in Figure 2.

Total learning, delayed recall and recognition
The results of multiple linear regression analyses showed that age and sex were negatively related with the outcome variables (Table 3). A higher age and being male were related to lower scores in total learning, short- and long-delayed recall when other predictors were held constant. Model 1 further revealed that relative to those who completed high school, participants with primary/middle school performed worse and participants with higher education performed better in total learning, short-delayed recall and long-delayed recall. This pattern was found in Model 2 only in long-delayed recall, whereas in total learning and short-delayed recall the only significant difference was found between participants with high-school education and those with higher education, indicating better results in those with more education. Model 2 showed that digit span forward was positively associated with total learning and short-delayed recall, whereas digit span backward, pattern separation, semantic and serial clustering were positively associated with total learning, short-delayed recall and long-delayed recall. Auditory attention span was positively associated with short- and long-delayed recall. There was a negative association between simple reaction times and total learning, short- and long-delayed recall, suggesting that faster times in this test were associated with better scores in total learning and delayed recall. Number of intrusions was negatively associated with participants' scores on both delayed recall tests.
Since the regression analysis with true positive scores from the recognition test was carried out on transformed data, we focus on the antilogarithm values of regression coefficients of significant predictors, keeping in mind that the originally higher scores are now lower and vice versa. Significant effects of sex and age were in the same direction as in the learning and recall data. There were significant positive associations between pattern separation, serial clustering, semantic clustering, auditory attention span and true positives, but no significant associations between the remaining predictors (digit span forward, simple reaction times) and true positives.
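The reversal of direction mentioned above follows from reflecting a score before taking its logarithm. The sketch below illustrates this under stated assumptions: the natural logarithm, a reflection constant of the maximum score plus one, and a hypothetical maximum of 16 are all placeholders, since the exact transformation is not given in this section.

```python
import math


def reflect_and_log(score: float, max_score: float) -> float:
    """Reflect a score (so that high becomes low) and log-transform it.

    The constant (max_score + 1) keeps the argument of the logarithm >= 1.
    """
    return math.log(max_score + 1 - score)


def back_transform(value: float, max_score: float) -> float:
    """Invert reflect_and_log: take the antilog, then undo the reflection."""
    return max_score + 1 - math.exp(value)


MAX_SCORE = 16  # hypothetical maximum number of true positives

low_raw = reflect_and_log(10, MAX_SCORE)
high_raw = reflect_and_log(14, MAX_SCORE)

# A higher raw score yields a LOWER transformed value, so the sign of a
# regression coefficient on the transformed scale reads in reverse.
print(high_raw < low_raw)
```

On such a scale, a negative coefficient for a predictor corresponds to a positive association with the original true positive score.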
Squared semi-partial correlations showed that semantic and serial clustering were the predictors with the highest importance in total learning, whereas in short- and long-delayed recall and recognition the strongest predictors were semantic clustering and auditory attention span.

Results of moderation and moderated mediation analyses
Moderation of direct effects by sex. To test whether sex moderated the direct effects of semantic and serial clustering on all outcome variables and the direct effects of auditory attention span on delayed recall and true positives, we performed a series of moderation analyses (see Methods). Significant effects of sex that survived Bonferroni correction (p < 0.006) emerged for serial clustering in total learning and short-delayed recall, for semantic clustering in true positives, and for both types of clustering in long-delayed recall (Table 4). These effects were systematically positive and, as revealed by simple slopes analyses, stronger in men than in women. As an illustration of the direction of these effects, supplementary Figure S1 shows a stronger effect of semantic clustering on true positives in men than in women.
Note: M = model; SDR = short-delayed recall; LDR = long-delayed recall; Const = constant; Edu 1 = primary/middle school; Edu 2 = higher education; DSF = digit span forward; DSB = digit span backward; PS = pattern separation; SRT = simple reaction time; SemCl = semantic clustering; SerCl = serial clustering; Intrus = intrusions; T1 = Trial 1 (auditory attention span); r²sp = squared semipartial correlation. *p < 0.05, **p < 0.001 and ***p < 0.0001. Reference groups: education = high school; sex = female (coded 0). The antilog of coefficients' values of significant predictors for the log-transformed reflected true positives is given in parentheses and marked by an exclamation point in superscript (!).
Sex did not moderate the significant direct effects of auditory attention span on short- or long-delayed recall. However, it moderated the direct effect of auditory attention span on true positives (Table 4). A simple slope analysis showed that this effect was stronger in men than in women. The direct effect of pattern separation on true positives was not moderated by sex.
Mediations moderated by sex. To determine whether a clustering strategy mediated the effect of short-term memory and working-memory on the outcome variables and if so, whether these indirect effects were moderated by sex, we performed a series of moderated mediation analyses. The effects that survived Bonferroni correction (p < 0.004) are reported in Table 5. Inference on statistical significance is based on bootstrap confidence intervals.
Digit span forward and digit span backward were significantly related to short-delayed recall via serial clustering, and to long-delayed recall and true positives via semantic clustering. These indirect effects were consistently positive and significantly higher in men than in women, as shown by the index of moderated mediation in Table 5, suggesting that both groups benefitted from reliance on clustering in recall and recognition, but men benefitted more than women. As expected, direct effects of digit span forward and digit span backward on short- and long-delayed recall and true positives were also statistically significant and consistently positive, suggesting that higher scores in recall and recognition were associated with higher short-term and working memory scores, when sex and a clustering score were kept constant. The effect of pattern separation on true positives was mediated by auditory attention span and it was larger in men than in women.

Error analysis
There were no statistically significant differences between men and women in proactive interference, but the groups differed in retroactive interference (F(1, 4400) = 4.480, p = 0.034, ηp² = 0.001), with men being more susceptible to interference in short-delayed recall than women. There were no statistically significant group differences in the number of intrusions found in either short- or long-delayed recall.
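As a quick consistency check on the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity ηp² = (F × df1) / (F × df1 + df2). The sketch below applies it to F values reported in this paper. The function name is an illustrative assumption.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from F and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Retroactive interference: F(1, 4400) = 4.480
retroactive = partial_eta_squared(4.480, 1, 4400)

# Semantic clustering: F(1, 3695) = 146.189
semantic = partial_eta_squared(146.189, 1, 3695)

# Serial clustering: F(1, 3695) = 3.714
serial = partial_eta_squared(3.714, 1, 3695)

print(round(retroactive, 3), round(semantic, 3), round(serial, 3))
```

Rounded to three decimals, the recovered values agree with the effect sizes given in the text (0.001, 0.038 and 0.001), which shows how small the retroactive interference and serial clustering effects are despite the large sample.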

Semantic and serial clustering as learning strategies
A repeated measures ANOVA comparing groups on semantic and serial clustering respectively at five points of the word list learning test revealed a statistically significant difference between men and women in semantic (F(1, 3695) = 146.189, p < 0.0001, ηp² = 0.038), but not in serial clustering (F(1, 3695) = 3.714, p = 0.054, ηp² = 0.001). Within-subject comparisons of the estimated marginal means within each group revealed the same pattern for semantic clustering in men and women: statistically significant differences that survived Bonferroni correction at p < 0.005 were found in all pairwise comparisons (all ps < 0.0001), except for the comparison between T3 and long-delayed recall in both groups (supplementary Figure S2). This means that semantic clustering increased in effect across the trials, and that a similar number of semantic clusters available at T3 were also available in long-delayed recall for both men and women.
In contrast, serial clustering peaked at T3 and then fell in both groups (supplementary Figure S2). Comparisons between T1/T2 and short-delayed recall and between T1/T2 and long-delayed recall were not statistically significant in either group, but the scores at T3 were significantly higher than the scores at T1, T2, and short- or long-delayed recall after correction for multiple comparisons (all ps < 0.0001) in both men and women. This means that significantly fewer serial clusters were available at any other learning trial or at short- or long-delayed recall than at T3 in each group.
Taken together, these findings indicate that both sexes retained semantic clusters in memory longer (through long-delayed recall) than serial clusters (through T3).

Self-reported learning strategies
Among women, 46.6% did not report using any strategy, 27.7% reported using one, 12.2% two strategies, 5.5% three, 1.3% four, while 2% reported using five and 0.1% six strategies. Among men, 57.3% did not report using any strategy, 22.7% reported using one, 9.9% two strategies, 4.1% three, while four strategies were reported by 0.5%, and more than four strategies by 0.1% of male participants.
Both women and men reported visualisation of words as the strategy they used most frequently; it was slightly ahead of method of loci, which in turn was slightly ahead of covert rehearsal of words (Figure 4).
Associations between each strategy and the auditory attention span, digit spans forward and backward, and semantic and serial clustering are listed in Table 6. However, although statistically significant, most of the observed effects were very small, with the strongest associations emerging between semantic clustering and method of loci, on the one hand, and between serial clustering and placing words in a story, on the other. Still, even the latter two effects were in the range of small effects (r < 0.3).
A summary of results for word list learning is visualised in Figure 5.

Other cognitive tests
Pairwise comparisons of estimated marginal means adjusted for the effect of age showed that men had
Table 5. Results of moderated mediation that survived Bonferroni correction. Indirect effects of digit span forward, digit span backward and pattern separation on outcome variables were higher in men than in women.
Note: DSF = digit span forward; DSB = digit span backward; SDR = short-delayed recall; LDR = long-delayed recall; PS = pattern separation; T1 = trial 1 (auditory attention span); W0 = women; W1 = men; CI = 95% bootstrap confidence intervals. Results corrected using Bonferroni correction (p < 0.0045).
Self-reported memory problems and time to complete the task
Frequency of self-reported memory problems was similar in men and women (χ²(2) = 0.668, p = 0.717, n.s.). Most participants reported having no memory problems (28.1% of women and 21.8% of men), slightly fewer reported having some memory problems (27.8% of women and 20.4% of men), while a small percentage reported having considerable memory problems (1.1% of women and 0.8% of men). Comparing men and women within these "memory" groups on all outcome variables, while adjusting for the effect of age, showed a female advantage in total learning, short- and long-delayed recall, and true positives in each "memory" group, after correcting the results for multiple comparisons (p < 0.004; data not shown).
Negative associations between the outcome variables and participants' scores on self-reported memory problems that remained statistically significant after correcting for multiple comparisons (p < 0.0012) were found for total learning (r = −0.123, p < 0.0001), short-delayed recall (r = −0.128, p < 0.0001) and long-delayed recall (r = −0.108, p < 0.0001). They indicate that lower test scores were associated with reporting more memory problems.
Table 6. Associations between self-reported strategies and DSF, DSB, auditory attention span, and semantic and serial clustering.
Correlating errors with the scores on self-reported memory problems revealed significant positive associations between memory problems scores and retroactive interference (r = 0.079, p < 0.001), false positives (r = 0.1, p < 0.001) and false positives from distractors (r = 0.102, p < 0.001), although these effects were small. The association between self-reported memory problems and proactive interference was not statistically significant, and associations with false negatives and false positive errors from novel words did not survive Bonferroni correction at 0.008 (0.05/6).
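The Bonferroni threshold quoted above is the nominal alpha divided by the number of comparisons. A minimal sketch of that arithmetic follows. The function name is an illustrative assumption, and only the 0.05/6 case is taken from the text.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Six error measures were correlated with self-reported memory problems.
threshold = bonferroni_threshold(0.05, 6)
print(round(threshold, 3))
```

A correlation is then treated as significant only if its p value falls below this per-comparison threshold, which is why some nominally significant associations did not survive correction.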
The two sexes did not differ in time to complete T1-T3, delayed recall, or the entire word list learning test, but women spent more time than men on the recognition test (2.39 ± 0.551 vs. 2.29 ± 0.524; t(4155) = 6.239, p < 0.001). Since women had significantly lower scores on the simple reaction times test, and given that the recognition test required pressing a button to indicate a response, it is likely that the difference in time to complete the recognition test reflects a motor rather than a cognitive difference in task performance.

Discussion
Using a large sample of individuals 13-97 years of age from the general population and a validated, novel, web-based test of word list learning specifically designed for Norwegian (Hansen et al., 2016), this cross-sectional study investigated a female advantage in this ability and how other cognitive abilities contribute to it. The main findings of the study are: (1) an overall female advantage in learning, delayed recall, and recognition; (2) a significant role of auditory attention span in delayed recall and recognition; (3) a significant role of interference in delayed recall and true positives in men; (4) a differential mediating role of semantic and serial clustering in the effects of short-term and working memory on word learning, recall and recognition, moderated by sex and (5) a mediating role of auditory attention span in the effect of pattern separation on true positives in recognition, moderated by sex. For a graphical presentation of the main results see Figure 5.
Some previous studies proposed that a female advantage in word list learning can be explained in terms of semantic clustering, because women use this effective strategy more than men (Kramer et al., 1988, 1997; Sunderaraman et al., 2013). As in these previous studies, women in our sample had higher indices in semantic clustering than men. However, men in our sample had higher short-term and working memory scores and benefited more from clustering. Thus, the female advantage in our study cannot be explained in terms of women's higher semantic clustering scores.
The finding that women in our study performed better on all components of word list learning indicates men's poorer ability to encode words from a list read aloud. Our data show that men had considerably lower auditory attention span, as indicated by their lower T1 scores, which may have contributed to their less efficient encoding. The auditory attention span was a significant predictor of short- and long-delayed recall as well as of true positives. The explanation that less efficient encoding leads to less efficient recall and recognition is plausible, as long as men do not compensate for the initial difference in scores on the subsequent learning trials (Kljajevic, 2022). Although men in our study learned across trials, women learned more, and thus men were unable to compensate for the initial difference in scores during the subsequent trials. Evidence from other studies that have demonstrated a female advantage in word list learning also shows that men's initial learning scores (T1) were lower relative to women's scores (e.g., Bleecker et al., 1988). A large longitudinal study including 15,924 participants found that in all birth cohorts women were better than men on immediate recall of words from lists (Bloomberg et al., 2021). Since men in our study had significantly higher short-term and working memory scores, our data further suggest that having better short-term and working memory cannot mitigate the effect that the difference in auditory attention span has on subsequent learning and recall of a word list. Considering that men had better short-term and working memory but lower auditory attention span, our data appear to be aligned with the notion that mental representations are brought into working memory by attentional processes, and that attentional prioritisation may explain effects of interference in memory (D'Esposito & Postle, 2015).
Unlike our study, other population studies have not always found a sex difference in working memory (e.g., Collaer & Hines, 1995). Nevertheless, neuroimaging evidence suggests significant sex differences in regions associated with working memory during task performance (Goldstein et al., 2005) as well as in resting state (Hill et al., 2014). For example, women activate dorsolateral prefrontal cortex to a greater extent than men while performing a working memory task, despite lack of significant sex differences in scores on the behavioural task (Goldstein et al., 2005). Since this specific brain region has been associated with the executive-attentional component of working memory (Kane & Engle, 2002), this finding has been interpreted to mean that women work harder to counteract the effect of interference and maintain mental representations, thereby achieving behavioural scores similar to those of men (Goldstein et al., 2005). Similarly, a recent resting state functional neuroimaging study showed that in addition to common networks associated with working memory, there also exist sex-specific networks, with higher activations of limbic (hippocampus and amygdala) and prefrontal (right inferior frontal gyrus) regions in women relative to men, and more activated parietal brain areas in men relative to women (Hill et al., 2014). The existence of neurofunctional differences in the absence of behavioural sex differences in working memory is consistent with the notion that men and women rely on different cognitive strategies while performing working memory tasks. Reliance on strategies that differentially engage relevant brain areas may lead to differences in performance on more complex cognitive tasks that require working memory. For instance, Wagner et al. (1998) showed that the strength of a memory trace during word learning depends on the prefrontal and medial temporal lobe structures (e.g., hippocampus), where words that were better remembered later were associated with more activation in these areas during encoding. Other evidence suggests that different neural structures underpin semantic processes, such as categorisation, in men and women (Gainotti, 2010; Pasterski et al., 2011).
Nevertheless, our large sample consisting of 4403 participants from the general population shows a clear sex difference in short-term and working memory, with men having significantly higher scores on standard tests of these abilities, i.e., digit span forward and digit span backward. These scores' effects on learning were consistently positive, indicating that better short-term and working memory scores were associated with better scores in outcome variables. Furthermore, the effects of short-term and working memory on short-delayed recall were mediated by serial clustering, and on long-delayed recall and true positives by semantic clustering. This is not surprising, given that both men and women held semantic clusters in memory longer than serial clusters. Thus, our data are consistent with the notion that in word list learning semantic clustering, as a way of deep encoding, is more effective than serial clustering, which is a passive learning strategy highly dependent on auditory attention span (Delis et al., 2000). Employing such a passive encoding strategy by men, while having lower auditory attention span, led to an initial female advantage in word list learning, which persisted throughout delayed recall and recognition. The persistence of the difference was enabled by executive processes, as suggested by men's greater susceptibility to intrusions in delayed recall and recognition and less shifting among different strategies, as explained below.
Effective use of a retrieval strategy requires cognitive control processes (Baddeley et al., 2020). The striking effect of interference words on men's recall and recognition suggests that they had less ability to inhibit the information that was supposed to be ignored. Similarly, Kramer et al. (1997) found that boys 5-16 years of age were more vulnerable to interference than girls. The idea that the female advantage in episodic memory could critically depend on the ability to inhibit information is not new (e.g., Hasher et al., 1999) and our findings corroborate this notion.
In addition, women in our study reported using more strategies than men, and more shifting among different strategies while performing a task also indicates better executive functioning. The finding that visualisation and covert rehearsal were the most frequently used self-reported strategies in word list learning is consistent with the notion that words may be encoded and recalled via their visual and verbal features (Paivio, 1971). It is also consistent with the multimodal concept of working memory, which postulates distinct components for visuo-spatial and verbal information ensuing binding of information (Baddeley, 2010). However, although the associations between self-reported strategies on the one hand and semantic and serial clustering as implicit strategies on the other intuitively make sense, it is difficult to see their potentially more meaningful psychological implications beyond the simple notion that different types of strategies may interrelate more strongly relative to their associations with auditory attention span and digit span scores. Since women in our study reported using visualisation, the method of loci, and covert rehearsal more than men, our results do not support the view that women use local strategies, whereas men use holistic strategies in cognitive tasks (Pletzer, 2014).
Taken together, these findings open the question of stability of the observed effects, i.e., the rate of change over time in the observed sex differences in word list learning. Recent longitudinal studies have reported discrepant findings regarding sex differences in the rates of change with age in word list learning. For instance, a study with 15,924 participants found that women had slower rates of verbal memory decline based on learning words from two lists (Bloomberg et al., 2021). Another recent longitudinal study that included 1623 participants showed no sex differences in the rates of change over time in learning across trials, short- and long-delayed recall on CVLT (McCarrey et al., 2016). Our study is cross-sectional, and therefore it is not possible to provide evidence on possible sex differences in the rate of change of the outcome variables or draw conclusions on age-related changes in memory across the lifespan in men and women. However, our large study sample spans 84 years (ages from 13 to 97) and we briefly discuss the pattern of presence/absence of female advantage across different age bands in our data, while keeping in mind that cross-sectional studies may confound age effects with birth cohort effects (Arking, 2006).
Looking across the eight age groups, we found that women between 13 and 19 years of age and those between 40 and 79 years of age had better total learning, short- and long-delayed recall scores than men (Table 2). In addition, women aged 30-39 were better in long-delayed recall than men in the same age group. Finally, the youngest group of women and women between 60 and 79 years of age were additionally better in true positives than men in the same age groups. Overall, these data do not support the hypothesis that female advantage in word list learning is driven by female sex hormones (e.g., Otero Dadin et al., 2009), because this advantage was not present in any outcome variable in participants in their 20s and since among those in their 30s it was found only in long-delayed recall.
Our data are also not compatible with the hypothesis that female superiority in verbal memory improves with age (e.g., Bleecker et al., 1988; Ragland et al., 2000), because the advantage was found in the youngest group in total learning and delayed recall, and there was no advantage among the oldest participants. The female advantage appears to be most consistent in participants between 60 and 79 years of age in our sample, where it emerged in total learning, delayed recall and recognition, but less consistent in participants between 40 and 59 years of age, where it was found in total learning and delayed recall but not in recognition. Thus, even though female advantage is more noticeable in midlife and before the age of 80 in our sample, it is also present in the youngest group in our sample.
The loss of this advantage in women aged 80 and older might be associated with accelerated cognitive decline due to neurodegenerative changes in the brain. For instance, Jack et al. (2015) demonstrated a significantly higher load of amyloid-β (Aβ) in women 70 years of age and older than in men, despite women's better performance on word list learning. Other studies that investigated associations between word list learning and β-amyloid as well as other validated biomarkers of Alzheimer's disease (AD), such as tau, reduced hippocampal volume and temporal lobe glucose metabolism showed that women in preclinical AD retain "verbal memory reserve" despite significant neuropathology (Jack et al., 2015; Sundermann et al., 2017). Given that tests of word list learning are routinely used in diagnosing AD, female advantage in this task may delay diagnosis of AD in women (Banks et al., 2021). Our participants' self-reported memory problems were negatively associated with total learning and delayed recall, and the female advantage was consistent in all three "memory" groups. Since we do not have evidence of presence/absence of Alzheimer's pathology in our oldest-old participants (80+ years of age), and given the lack of female advantage in this specific age group, we recognise the need to further investigate this phenomenon as a form of cognitive reserve and determine how it deteriorates by sex in old age.
The present study has some limitations. First, women in our study were an average of 5 years younger than men and age negatively affects inhibitory processes (Hasher et al., 1999). Even though we consistently included age as a covariate of no interest in analyses (see Analyses), the question remains whether the marked inhibitory issues in men in our study would have emerged if the groups were matched in age.
Another possible limitation of the study is that we relied on the digit span forward and digit span backward scores as indicators of participants' short-term and working memory, disregarding the difference in modalities between digit span tests (visual) and the word list learning test (auditory). A potential issue here is that visually presented digit span tests tap into attention and control processes that may be less relevant to the processing of auditory stimuli (e.g., selective visual vs. auditory attention). The fact that men had higher digit span scores, and yet lower word list learning scores, is roughly compatible with this notion. However, the digit span scores significantly, and consistently positively, affected participants' recall scores through clustering, with stronger effects in men than in women. There are several indicators of relevance of this test for our data. As described in the Methods, participants were required to type responses and the written words stayed on the screen for the duration of a test. This could have reinforced learning through more reliance on visual resources, promoting visualisation as the primary choice of learning strategy. It is also possible that imageability of words promoted visual imagery, further contributing to the visualisation strategy, since 65% of our stimuli were highly imageable words. An influential theory from the 1960s posits two routes to retrieval of imageable words, visual and verbal, meaning that such words can be encoded and recalled both in terms of their visual and verbal features (Paivio, 1969). Furthermore, one prominent model of working memory postulates a component that holds and binds into episodes the multidimensional information (visual, verbal) coming from different sources, including other components of working memory, perception and long-term memory (Baddeley, 2010). This suggests that the modality of digit span tests used in the present study might be less relevant than it appears at first sight.
In conclusion, the present study replicates the female advantage in word list learning in a large sample of participants drawn from the general population, using a web-based self-administered test and thus allowing for assessment in ecologically more valid, natural settings. Our findings highlight the role of auditory attention span, control of interference, and the mediating role of semantic and serial clustering in the effects of short-term and working memory on the outcome variables. They also show the relevance of pattern separation in word list learning, which has not been sufficiently explored so far. The sex difference in semantic clustering alone cannot explain our data, because men benefited more from clustering than women. Instead, our findings suggest a critical role of auditory attention span in learning, delayed recall, and recognition, and the role of inhibition of interference in delayed recall and recognition. However, only experimental studies can elucidate which processes contribute most to female advantage, for instance, whether it is a less efficient active inhibition of interference in men, a better ability to disengage from information that is supposed to be ignored in women, or some other cognitive process related to access to semantic memory during encoding and clustering. To fully answer the question of what is driving the female advantage in word list learning, one must recognise the full complexity of the task at hand and the multitude of possible causes. Future studies could assess sex differences in word list learning with respect to sex differences in regional brain structure and functional networks implicated in this aspect of verbal memory, providing a stronger basis for somewhat neglected research on sex differences in verbal memory in neurological diseases (e.g., Banks et al., 2021) and the role of female advantage in cognitive reserve.