Working memory training involves learning new skills

We present a new framework characterizing training-induced changes in WM as the acquisition of novel cognitive routines akin to learning a new skill. Predictions were tested in three studies analyzing the transfer between WM tasks following WM training. Study 1 reports a meta-analysis establishing substantial transfer when trained and untrained tasks shared either a serial recall, complex span or backward span paradigm. Transfer was weaker for serial recall of verbal than visuo-spatial material, suggesting that this paradigm is served by an existing verbal STM system and does not require a new routine. Re-analysis of published WM training data in Study 2 showed that transfer was restricted to tasks sharing properties proposed to require new routines. In a re-analysis of data from four studies, Study 3 demonstrated that transfer was greatest for children with higher fluid cognitive abilities. These findings suggest that development of new routines depends on general cognitive resources and that they can only be applied to other similarly-structured tasks.

Performance on many working memory (WM) tasks can be improved by training. However, the benefits of that training rarely transfer to other activities that also depend on WM. Why is this, and what conditions give rise to transfer? Here we present a new framework designed to explain both how and when the benefits of WM training will transfer from one task to another. Our claim is that training-induced transfer occurs only when we have learned a new complex cognitive skill in the course of training and when that skill can be applied to a novel task.
The potential of intensive training to expand on our intellectual capacities has long fascinated philosophers and psychologists. In recent years, many commercial training products have been developed for individuals keen to boost their cognitive skills (for reviews see: Bavelier, Green, Pouget, & Schrater, 2012;Simons et al., 2016;Strobach & Schubert, 2016). With extensive practice, performance on most trained tasks will improve, and gains are also reflected in changes in underlying brain systems. This kind of learning is often described as neuroplasticity. While there have been some important recent advances in understanding the impact of training on both the structure and functioning of neural networks (Astle et al., 2015;Barnes, Woolrich, Baker, Colclough, & Astle, 2016;Caeyenberghs, Metzler-Baddeley, Foley, & Jones, 2016;Salmi, Nyberg, & Laine, 2018), the field lacks detailed accounts of the cognitive changes that take place. A new framework presented here describes what these changes might be and how these both enable and constrain transfer to novel situations. We do this in one of the most extensively investigated areas of cognitive training, WM.
Training has been investigated in many different areas of cognition ranging from rote learning, problem solving and WM through to expertise in highly specialized domains such as chess and academic learning (see Simons et al., 2016, for a recent review). Two broad conclusions have emerged. First, transfer is much more likely under conditions where trained and untrained activities share many features (near transfer) than few (far transfer; Barnett & Ceci, 2002; Noack, Lövdén, Schmiedek, & Lindenberger, 2009). Second, beyond this broad distinction, there is little understanding of the cognitive constraints on transfer (Shipstead, Redick, & Engle, 2010; Simons et al., 2016; Taatgen, 2013).
Our primary goal is to characterize the task features that engender transfer within WM. We propose that transfer occurs primarily when training leads to the acquisition of a new complex cognitive skill that can be applied to an untrained activity. This learning is conceptualized as the development of cognitive routines that coordinate the execution of the processes necessary to perform an unfamiliar task. For training activities supported by cognitive routines or mechanisms that are already firmly established, a new routine is not required. There is consequently less scope for transfer within these activities, even if the tasks are very similar.
Predictions derived from the framework about the conditions under which transfer to other WM tasks is expected to be strongest are tested in three studies. Study 1 provides a meta-analysis of published randomized controlled trials (RCTs) of WM training. The aim was to discover which features common to both trained and untrained tasks are associated with transfer, and to establish the magnitude of any transfer that does occur. Studies 2 and 3 re-analyze data from several published studies of Cogmed training in children to test whether transfer is indeed mediated by the development and application of new routines. Study 2 investigates transfer across WM tasks following training on a single WM program. Study 3 examines the sources of individual differences in transfer following WM training in a large sample of children with the aim of establishing whether transfer originates in the WM system itself or from more general cognitive resources.

Neuroplasticity
An appealing explanation for transfer is that the WM gains observed following adaptive training reflect cortical plasticity in the neural system underpinning WM. Klingberg (2010) speculated that WM training "might lead to durable neuronal changes in WM-related areas in the same way as perceptual training does for neurons of the visual cortex" (p. 318). Westerberg and Klingberg (2007) suggested that this could be mediated by changes in the response characteristics of single neurons, possibly reflecting plasticity in cellular components including synapses and dendrites. It has also been proposed that training enhances the structure of the white matter tracts in the neural system underpinning WM. These plasticity-based accounts resonate with evidence for neural changes following intensive training in motor activities such as repetitive neural stimulation of fingers in primates (Xerri, Merzenich, Peterson, & Jenkins, 1998), perceptual discrimination learning in monkeys (Law & Gold, 2008), and in golf and juggling in humans (Bezzola, Mérillat, Gaser, & Jäncke, 2011; Draganski et al., 2004; May, 2011).
The problem with this concept of neuroplasticity is that it fails to explain why training has so little benefit for aspects of everyday cognitive functions that are widely considered to depend on WM. If the neural efficiency of WM improves with training, its benefits should extend to these activities too. In fact, even transfer within WM is limited. Consider n-back and complex span, two common WM paradigms. In n-back, participants judge whether each item in a lengthy sequence is the same as the item that appeared n positions back (1 item, 2 items, etc.). In complex span, an unrelated processing activity is interpolated between the presentation of successive memory items (Daneman & Carpenter, 1980; Turner & Engle, 1989). A recent meta-analysis of n-back training established that the magnitude of transfer to WM paradigms such as complex span is very small (Soveri et al., 2017).
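The n-back judgment described above can be made concrete with a minimal sketch. This is an illustration only, not a procedure taken from any of the cited studies; the function name `n_back_targets` and the letter stimuli are assumptions introduced here.

```python
from typing import List, Sequence


def n_back_targets(sequence: Sequence[str], n: int) -> List[bool]:
    """Return, for each item, whether it matches the item n positions back.

    Hypothetical helper for illustration: the first n items can never be
    targets because there is nothing n positions before them.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        i >= n and sequence[i] == sequence[i - n]
        for i in range(len(sequence))
    ]


# Illustrative 2-back stream of letters (stimuli chosen arbitrarily).
stream = ["A", "B", "A", "B", "C", "A"]
print(n_back_targets(stream, 2))
```

The sketch shows why n-back requires continuous updating: every new item must be compared with a stored item whose position shifts on each trial, a demand that is structurally different from recalling a whole list in order.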

Process-specific transfer
An alternative explanation is that rather than expanding the fundamental capacity of the system in an undifferentiated manner, WM training enhances the specific processes within WM that are engaged by particular tasks (Dahlin et al., 2008; Dunning & Holmes, 2014; Holmes et al., 2009; Minear et al., 2016; Shipstead et al., 2012; Soveri et al., 2017; Sprenger et al., 2013; von Bastian & Oberauer, 2013a). This approach accounts for the absence of transfer across WM paradigms by assuming that training gains originate in processes in WM such as updating, inhibitory function and short-term memory (STM) storage that are engaged by some but not all WM tasks (Dahlin et al., 2008; Minear et al., 2016). Transfer should only be observed when training and transfer tasks both place demands on the same processes.
Participants in training studies often report using mnemonic strategies (Holmes et al., 2009; Minear et al., 2016), and these too are potential sources of training-induced change. Strategy transfer will necessarily be limited by the ways in which the stimuli in the untrained task can be represented. Training a mental imagery strategy to assist recall of lists of concrete nouns would not, for example, be expected to benefit the recall of either abstract nouns or movements. Evidence for the content-specificity of mnemonic strategies is provided by Chase and Ericsson's (1981) study of an individual completing a lengthy period of digit span training. SF began with a typical digit span of seven items which had expanded to 79 items after two years of training. He reported that this was achieved by recoding digit sequences into long-distance running times that he was familiar with as a runner. Tellingly, his memory span for letter sequences over the same period did not change. The capacity of verbal STM per se was therefore unchanged. Similar conclusions were reached in a study of two adults who trained on digit span for a period of four months (Martin & Fernberger, 1929).
Strategies may be of limited value even when the stimuli are the same if the WM tasks change. Minear et al. (2016) asked participants completing either spatial n-back or verbal complex span training to describe the mnemonic strategies they had used. Although letters were the memoranda in both cases, participants reported different strategies. N-back trainees employed many different strategies, although a substantial minority reported using no strategy. Participants who underwent verbal complex span training described strategies involving chunking the letter sequences in some way, for example by forming associations between the letters and words and then forming sentences. A transfer measure of serial recall of letters was included in this study. The authors reasoned that if the gains on the trained complex span task reflected the development of such material-specific strategies, they should extend to this task too. No such benefits were found. It therefore appears that any letter-chunking strategy must have been tied in with the broader information processing demands of the paradigm that was originally trained, limiting its transfer (see also, von Bastian & Oberauer, 2013b).
A limitation of process-specific accounts of training to date is that they do not distinguish shared task features that will be sufficient for transfer from those that will not. Often, these accounts are advanced speculatively to explain unexpected transfer rather than generating specific hypotheses that are directly tested in new studies (Sprenger et al., 2013; von Bastian, Langer, Jancke, & Oberauer, 2013). One proposal is that the magnitude of transfer is related to the extent of task overlap, with highest levels of transfer for tasks with the greatest numbers of shared task features (Soveri et al., 2017). As we shall see later, the presence of shared task elements alone is not a sufficient explanation for either the presence or absence of transfer, or its magnitude. One of the greatest challenges for existing theories of transfer is therefore not only to explain why transfer arises, but also why it does not.

Cognitive training as skill acquisition
Here we present a new perspective on transfer following WM training. This broadly conceptualizes transfer as a consequence of acquiring complex cognitive skills that can then be applied to untrained tasks with similar demands. It has its origins in production system models that represent skilled behavior as sets of production rules incorporating specific knowledge (Anderson, 1982;Newell, 1991). Complex new activities are accomplished by combining these rules. The execution of the rules becomes increasingly automatic with practice, a developmental process characterized in Anderson's ACT-R model as progression from a declarative to a procedural stage. As learning progresses, the demand on limited resources diminishes and this leads to performance gains. Transfer arises when the production rules can be applied to new tasks (Singley & Anderson, 1989). Taatgen (2013) incorporated new principles into a production system framework that provide more specific predictions about the conditions for transfer. His primitive elements theory of cognitive skills distinguishes individual low-level elements of production rules that are entirely specific to a particular task, from task-general skills that control the flow of information across the task independently of content. Transfer occurs when the task-general skills are consistent, even if tasks differ in low-level task features. This approach was applied to model data showing transfer from complex WM span training to performance on a Stroop interference task (Chein & Morrison, 2010). Transfer was modelled as an increase in a process of proactive (executive) control that corresponds to a high-level executive state of planning. This enhanced both rehearsal in the WM task and the selection of the ink color in Stroop.
Primitive elements theory places no limits on the transferability of task-general skills across tasks: transfer may in principle occur for any tasks requiring common higher-order processes such as proactive control. We will see later that this is not necessarily the case for WM training, in which some paradigms are more trainable than others. Other models of skill acquisition do impose constraints on transfer, and these offer some insights as to why transfer within WM may not always occur even within paradigms. Fitts and Posner (1967) suggested that learning progresses through three stages: cognitive, associative, and autonomous. As an example, the cognitive stage of acquiring arithmetic skills might involve performing multiplication by explicit calculation. In the associative stage, the results of calculations would already have been stored in long-term memory, requiring only search followed by retrieval. In the autonomous stage, this process would operate automatically. A reasonable expectation is that for typical participants in WM training studies many of the basic processes of storage and retrieval within WM will already be fully established and will have reached the autonomous stage. This will leave relatively little scope for further refinement even with the extensive practice provided in a WM training program. Neither training nor transfer would therefore be expected for tasks supported by systems such as verbal STM that are already fully established.
Some of these principles are incorporated into the present framework. We propose that in many complex WM tasks, training cannot be accomplished by established configurations of processes within WM. Participants must therefore learn how to perform these unfamiliar tasks. This form of learning follows the conventional path to acquiring a new skill. It starts with a period in which execution and coordination of its components are highly demanding of cognitive resources. With experience the skill becomes more autonomous, improving performance. Once established, the new skill will transfer to other tasks with similar structures. For WM tasks that are already served by existing mechanisms there will be much less scope for training or transfer, because the configurations of processes needed to support them are already in place. The framework builds on these principles to explain the limits on WM training and its transfer to new tasks.

The cognitive routine framework
We make two assumptions about WM training. First, training on unfamiliar WM tasks will lead to the development of novel cognitive routines that control the sequence of cognitive processes required to perform the task. Second, these routines can be applied only to other tasks with common structures and only then will transfer occur. A cognitive routine is a structured specification of the coordinated sequence of processes that must be implemented to accomplish a mental activity. In the initial stages of performing a complex WM task, general cognitive resources are required to determine the optimal sequence of the processes, and to execute the routine. With practice, the execution of the routine will become more autonomous, mirroring changes seen in the acquisition of other cognitive skills (Tenison & Anderson, 2016).
A new routine is needed to marshal and execute existing processes in a novel sequence when a task has complex and unfamiliar cognitive requirements. For complex WM tasks, it is envisaged that the routines will have a hierarchical structure composed in part of sub-routines repeatedly executed across the course of a trial. New routines and sub-routines may also support mnemonic strategies such as mental imagery and grouping items into larger meaningful chunks. These strategies require sequences of processes integrating the to-be-remembered material with more permanent knowledge outside of WM. For example, in using mental imagery, knowledge must be retrieved from semantic memory to generate visuo-spatial representations for temporary storage. To use a chunking strategy, items in WM must be supplemented by or linked with representations of stimuli from long-term memory bound into multi-item chunks (Cowan, Rouder, Blume, & Saults, 2012; Miller, 1956). In these ways, strategies involve the coordination of processes in WM with systems in long-term memory. To do this, we propose that a new cognitive skill (or routine) has to be learned.
Current process-specific theories of transfer focus on the features common to both the training and transfer tasks. The framework takes two further steps, specifying both the conditions under which training itself will occur, and why. A core assumption is that transfer following WM training is restricted to cases where the training tasks require the establishment or refinement of a cognitive routine that is not already fully developed. If a routine is already well-established there will be little scope for either training or transfer even if the trained and untrained tasks both call upon the same routines and processes. One domain where the basic cognitive routines will already be well established is verbal short-term memory (STM). The encoding of item and order information and the engagement of a maintenance rehearsal process are core elements of this system which can readily account for many key verbal serial recall phenomena (Burgess & Hitch, 1992;Page & Norris, 1998). It is frequently engaged in everyday occasions such as remembering new words and names, following instructions, and remembering unfamiliar phone numbers, PIN codes and passwords. As the skills required to perform these tasks will have already been acquired there will be no need to develop new routines. If no new routines are developed, there will necessarily be no routine-mediated transfer.
A computer analogy is useful here to highlight the differences between this approach and concepts of plasticity that emphasize the malleability of the neural processes underpinning WM capacity. The framework represents a shift away from thinking of training as a way of modifying the hardware of WM towards viewing it as the generation of new software. This software controls both the operation of the hardware of WM and the interface between WM and other cognitive systems. The capacity of buffers in WM could be considered to be hardware, whereas rehearsal or a chunking strategy would be controlled by software. As discussed above, there is little indication that the storage capacity of verbal STM (hardware) can be increased by training (Chase & Ericsson, 1981; Martin & Fernberger, 1929). However, the ability to rehearse (software) can be trained in non-rehearsing individuals (Broadley & MacDonald, 1993; Johnston, Johnson, & Gray, 1987).
Continuing with the computing analogy, we can see that there may be choices about the particular forms of software (routines) that will influence transfer. For instance, a function to reverse the order of a sequence might be written in a way that accepts only a list of spoken digits as an argument. Such a function would improve performance on the backward recall of spoken digits, but would be of no value for the backward recall of written letters. Alternatively, the function might be written in a more general fashion so that it could accept written and spoken words, digits, letters, and even visual objects as arguments. Either function will result in improved performance on a training task using digits, but only the more general function will be transferable to different materials. The basic capacity of the system would be unchanged but because its software differs, so too will the extent of transfer. It will also be influenced by the particular software solutions selected by the programmer.
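The contrast between the two kinds of function can be sketched in code. This is purely an illustration of the analogy, not a model of any cognitive process; the names `reverse_spoken_digits`, `reverse_sequence`, and `SPOKEN_DIGITS` are hypothetical and introduced here only for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Assumed stand-in for "spoken digits": digit names as strings.
SPOKEN_DIGITS = {
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
}


def reverse_spoken_digits(items: Sequence[str]) -> List[str]:
    """Material-specific routine: accepts only spoken digit names."""
    for item in items:
        if item not in SPOKEN_DIGITS:
            raise ValueError(f"not a spoken digit: {item!r}")
    return list(reversed(items))


def reverse_sequence(items: Sequence[T]) -> List[T]:
    """General routine: reverses any sequence, whatever its content."""
    return list(reversed(items))


trained = ["three", "one", "four"]
# Both routines improve performance on the trained material.
print(reverse_spoken_digits(trained))
print(reverse_sequence(trained))
# Only the general routine transfers to untrained material (letters).
print(reverse_sequence(["K", "B", "R"]))
```

Calling `reverse_spoken_digits(["K", "B", "R"])` raises an error, which corresponds to the absence of transfer: the two routines are indistinguishable on the trained digits and differ only when new materials are presented.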
To develop firm predictions about transfer we need to be able to say something about the detailed routines employed. Although we can speculate on grounds of principle alone, the best guide to the nature of these routines and the limits of their transferability is provided by hypothesis-driven experimental analysis. A good example of this is Chase and Ericsson's (1981) digit span training study of SF. The tenfold increase in SF's digit span across two years of practice does not tell us whether he had developed a general-purpose serial recall routine or one specific to digits. However, his description of chunking digits in terms of running times led to the prediction that this strategy (or routine) would not transfer to letters. This was found to be the case, confirming that the routines developed during training were tied to a specific set of stimuli.
Up to this point, we have argued that training in complex tasks involving the novel coordination of existing processes or the development of new mnemonic strategies will lead to the construction of cognitive routines that are the source of transfer to other routine-compatible tasks. An important caveat is that training-induced changes also originate outside of routines in established processes. A wealth of evidence indicates that performance on almost any cognitive activity, including basic low-level visual discrimination of perceptual features, shows gradual improvement with training (for review see Bavelier et al., 2012). Performance on almost all speeded tasks also continues to improve to some degree, even after extensive practice, a ubiquitous phenomenon termed the "law of practice" (Newell & Rosenbloom, 1981). If components of a WM task can be performed faster, then this too should enhance performance (Barrouillet, Bernardin, & Camos, 2004).
It is therefore likely that extensive practice on all WM tasks will produce some fine-tuning in the efficiency of established processes. These subtle changes may generate relatively small degrees of transfer that cannot be reliably detected in the low- to moderately-powered studies that dominate WM training research. With larger sample sizes or more data-intensive psychophysical testing, however, they should be evident. The primary goal here is to understand the origins of the more substantial effect sizes that can be detected in the studies that typify the field of WM training research. We suggest that these moderate to large transfer effects are the hallmark of routine-mediated learning during training.

A cognitive taxonomy of WM tasks
To generate predictions from the cognitive routine framework about transfer across WM tasks it is necessary to distinguish the tasks that need new routines from those that can be supported by existing processes. To do so requires the development of a cognitive taxonomy of WM tasks. Deriving such a taxonomy is not straightforward because there are many conceptually distinct theories and models of WM that also differ in the scope of the paradigms they address. Baddeley, Hitch and colleagues developed a highly influential modal model of WM that has framed much of the research in the field (Allen, Baddeley, & Hitch, 2006; Baddeley, 2000; Baddeley & Hitch, 1974). At its heart is a limited-capacity central executive sub-system supplemented by an episodic buffer that binds temporary representations both within and beyond WM. Further buffers provide limited and specialized storage for verbal and visuo-spatial material (Baddeley, 1986, 2000; Baddeley & Della Sala, 1996; Baddeley, Lewis, & Vallar, 1984). Cowan, Engle and others have conceptualized WM not as a separate storage medium but as long-term memory (LTM) representations temporarily boosted via a limited attentional resource (Cowan, 1998; Cowan & Morey, 2007; Engle, Tuholski, Laughlin, & Conway, 1999). Others have located WM within a broader framework of executive functions (von Bastian & Oberauer, 2013a) or as the combined product of two parallel memory systems (primary and secondary memory) with distinct temporal and organizational features (Shelton, Elliott, Matthews, Hill, & Gouvier, 2010; Unsworth & Engle, 2007). Even when models focus on the same paradigms there is little consensus about the nature of the component processes (e.g. Barrouillet, Bernardin, Portrat, Vergauwe, & Camos, 2007; Oberauer, Lewandowsky, Farrell, Jarrold, & Greaves, 2012; Towse, Hitch, & Horton, 2007).
In the absence of theoretical convergence, the current taxonomy was generated from an evidence-based task analysis. The analysis was confined to the paradigms requiring the serial recall of list items that dominate the current generation of WM training. In those paradigms in which evidence points to domain-specific differences, verbal and visuo-spatial tasks are analyzed separately.

Verbal serial recall
Serial recall of verbal material is supported by a multi-component system of verbal STM. It consists of processes responsible for encoding item and order information, and for linking together the two sets of representations. Key phenomena including serial position and transposition functions have been successfully modelled as associations between temporary representations of each item and either temporally evolving context or order signals that can be decoded to retrieve item position or order (e.g. Burgess & Hitch, 1992, 1999; Page & Norris, 1998).
Older children and adults spontaneously use a verbal rehearsal strategy to enhance serial recall (Gathercole & Hitch, 1993;Hitch et al., 1983). This has been suggested to involve the reactivation of phonological representations in STM as a means to offset time-based decay, possibly through a process of covert articulation (Baddeley et al., 1984;Baddeley, Thomson, & Buchanan, 1975). Rehearsal has been modelled as the re-presentation of the stored sequence back into verbal STM (Burgess & Hitch, 1992;Page & Norris, 1998). Once established at around seven years, rehearsal is a highly effective strategy for retaining information in verbal STM (Flavell et al., 1966). 1 What happens prior to this developmental milestone is not fully understood. Recent work indicates that rehearsal may be underestimated in children performing poorly on verbal STM as a consequence of low measurement sensitivity to experimental indicators of rehearsal (Jarrold & Citroën, 2013;Jarrold, Tam, Baddeley, & Harvey, 2010). But most importantly for the present purposes, it is reasonable to assume that the contribution of rehearsal to verbal serial recall develops over the early school years and becomes fully functional by seven years or so.
Does training on verbal STM tasks transfer to similar untrained tasks? A key assumption of the framework is that established processes such as those involved in encoding verbal item and order information should not be amenable to further training. Verbal STM performance is therefore predicted to be relatively impervious to training once rehearsal has been established. Performance may, however, be enhanced by the adoption of new material-specific strategies under conditions of extensive and prolonged practice (Chase & Ericsson, 1981; Martin & Fernberger, 1929).
With sufficient practice and instruction, rehearsal can be induced in non-rehearsing children, leading to increases in memory span (Broadley & MacDonald, 1993;Johnston et al., 1987). Training gains can extend to serial recall for untrained verbal content: Comblain (1994) showed that training individuals with Down syndrome to rehearse word lists led to benefits in digit span. Thus, even without explicit strategy instruction, WM training programs that provide extensive practice in verbal serial recall for children who are not yet rehearsing may provide catalytic conditions for a new rehearsal routine (Holmes, Butterfield, Cormack, Loenhoud, Ruggero, Kashikar, & Gathercole, 2015). In older children and adults, the routine will already be well-established and hence not be amenable to further training.
When the number of items to be recalled is close to span, verbal memory depends primarily on phonological coding, although for supraspan sequences there may be a shift to non-phonological strategies (Gathercole & Baddeley, 1990; Salamé & Baddeley, 1982). Other non-phonological strategies such as semantic linkage and visual imagery can also be beneficial in immediate memory tasks (McNamara & Scott, 2001; St Clair-Thompson, Stevens, Hunt, & Bolder, 2010; Turley-Ames & Whitfield, 2003). These recoding strategies may increase the depth of processing (Craik & Lockhart, 1972), permit the generation of multiple representations for each item (Paivio, 1990) and allow multiple items to be formed into single chunks (Cowan, Chen, & Rouder, 2004). We assume that for most individuals these strategies require the development of new routines that have the potential to transfer to other WM tasks with similar stimulus content.
In summary, two aspects of verbal STM may require the development of cognitive routines and hence yield transfer to other verbal STM tasks. The first is subvocal rehearsal in pre-rehearsing children, and the second is the development of new mnemonic strategies. The basic mechanisms of encoding item and order information in verbal STM are already in place early in childhood and do not warrant new routines. Transfer of verbal STM training is therefore expected to be minimal for older children and adults unless novel mnemonic strategies are developed.

Visuo-spatial serial recall
The representations and processes involved in visuo-spatial serial recall are much less well understood. Standard paradigms involve a variety of stimulus forms including spatial locations, continuous movements, static patterns, unfamiliar objects, and scenes. Hallmark experimental phenomena indicate that mechanisms encoding serial order may be similar in the verbal and spatial domains (for review see Hurlstone, Hitch, & Baddeley, 2014). However, verbal and visuo-spatial memory span are largely independent in both children and adults (Alloway, Gathercole, & Pickering, 2006; Baddeley, Papagno, & Vallar, 1988; Della Sala, Gray, Baddeley, Allamano, & Wilson, 1999). Further evidence indicates that within this system, visual characteristics and spatial locations may be stored separately (Darling, Della Sala, Logie, & Cantagallo, 2006; Logie & Pearson, 1997; Pearson, Ball, & Smith, 2014; Pickering, Gathercole, Hall, & Lloyd, 2001), with serial spatial rehearsal providing a means of maintaining either visual or spatial representations (Logie, 1995). This appears to be accomplished through the covert control of eye movements (Awh, Vogel, & Oh, 2006; Logie & Pearson, 1997). While the wealth of evidence distinguishing between the processes involved in verbal and visuo-spatial STM is undisputed, the extent to which it reflects separate but analogous temporary storage for the two domains is a matter of current debate. For many years, the most prominent position has been that information is represented in terms of its visual or spatial characteristics in the relevant domain-specific store (STM) within WM (Baddeley, 2012; Logie, 1995). This conclusion has been strongly challenged in a comprehensive analysis of experimental and neuropsychological evidence by Morey (2018). The issue here is not whether verbal and visuo-spatial STM can be distinguished, but whether STM for visuo-spatial material is domain-specific in nature. There is substantial evidence that it is not.
Across many studies employing a wide range of paradigms, it has been shown that visuo-spatial tasks show a far greater reliance on general attentional resources than their verbal equivalents (Alloway et al., 2006;Kane et al., 2004;Morey & Miron, 2016;Pearson et al., 2014;Thompson et al., 2006). On this basis Morey concludes that "neither the neuropsychological evidence nor the dual-task literature provides strong support for a dedicated visual-spatial STM system" (p. 876).
One reason why a specialized STM system may not have evolved for this domain is that, unlike verbal material, recall of the order of visuo-spatial events is rarely required in everyday life. In our terms, this could mean that there is no established STM system to support this material. In order to make improvements during training, participants must therefore develop new cognitive routines that draw initially on domain-general cognitive resources. These routines may involve learning to exploit the unique configurations of visuo-spatial stimuli in particular tasks. For sequences involving temporally dynamic information such as Corsi block recall (De Renzi & Nichelli, 1975), this may involve encoding multiple transitional features, including the lengths of spatial paths, their crossing points and their angles (Parmentier, Elford, & Maybery, 2005). Neuroimaging data suggest that exploiting some of these spatial properties may impose significant attentional burdens: prefrontal activity associated with the multiple demand system increases when spatial sequences have properties that encourage their recoding into higher-order chunks (Bor, Duncan, Wiseman, & Owen, 2003).
1 On the basis of simulations, Lewandowsky and Oberauer (2015) have argued that there is no evidence for the effectiveness of rehearsal. However, their simulations are based on the assumption that people make errors in rehearsal which they then further rehearse, leading to an accumulation of errors.
In summary, verbal and visuo-spatial serial recall are both supported by domain-specific processes that encode and maintain item and order information. A difference that may turn out to be important for transfer is that visual-spatial STM shows a greater dependence on general cognitive resources than verbal STM. This may provide scope during training for the development of new cognitive routines (or the refinement of existing ones) tailored to meet the relatively unfamiliar task demands. If so, training on visuo-spatial STM should generate greater transfer to untrained tasks with similar recall demands than the corresponding verbal STM tasks.

WM tasks
Most training programs include complex WM tasks that combine the temporary storage demands of simple serial recall tasks with additional processing requirements such as changing the order of items at recall, updating the contents of WM, or handling irrelevant distraction. They share the common feature of requiring participants to store material in highly unfamiliar and challenging cognitive conditions. To cope with these unfamiliar conditions it will be necessary to develop novel cognitive routines.
The cognitive routine framework predicts that these paradigms will not generate transfer unless the untrained task shares the same unfamiliar demands. Transfer will be determined by the fit of the coordinating structure of the routine (for example, executing a particular form of memory updating or reversing the sequence of items at recall) to an untrained task, and not simply by the overlap with individual processes embedded within subroutines. Significant mismatches in task structure will prevent transfer. We suggest that routines readily adapt to changes in lower-level features such as the modality of an input (e.g., auditory or visual) or a response (e.g., spoken or mouse click) that preserve the higher-level structure of the routine. Taatgen (2013) drew a similar distinction between low-level elements of production rules specific to a particular task and the task-general skills that alter the flow of information across the production system. In his conceptualization, transfer can only occur across individual elements in the production system if the same task-general skills can be applied across two tasks.

Complex span
Complex span tasks differ from simple serial recall tasks through the interpolation of episodes of distractor processing between successive memory items. For example, in operation span (Turner and Engle, 1989), sequences of words are presented for serial recall and, after each word is presented, participants must read and verify an arithmetic problem such as "Is (4/2)-1 = 1?". After the final calculation the participant attempts serial recall of the word list. An example of a visuospatial complex span task is symmetry span. In this task, the distractor activity involves judging the symmetry of a pattern and the items to be remembered are the locations of squares presented successively in a matrix (Redick & Engle, 2011).
In these tasks, participants must work out how to protect the memory representations from the interference or decay that might be caused by the punctuating periods of distraction. Several ways in which this could be achieved have been proposed. One is through rapid switching between processing of the distractor events and rehearsal of the memory items (Towse, Hitch, & Hutton, 1998). An alternative proposal is that participants might use attentional refreshing to revive decaying representations by switching between distractor processing and rapid serial reactivation of the encoded memory sequence (Barrouillet, Gavens, Vergauwe, Gaillard, & Camos, 2009). Oberauer et al. (2012) implemented a very different account of complex span in their SOB-CS (serial-order-in-a-box complex span) neural network model. In this, interference is generated by the unwanted encoding of distractor items resulting from a novelty-gating mechanism. This is minimized by the active removal of distractor representations with the aim of restoring the quality of earlier memories. Note that in each one of these accounts, a novel set of cognitive processes (time-switching to permit rehearsal or attentional refreshing, or active removal of distractors) is required to meet the unusual needs of the particular complex span task. From our perspective these represent novel cognitive routines developed across the course of an extended training program that can then be applied to other similarly-structured tasks.
Beyond this point, it is not possible to make firm predictions about the limits on the transferability of a complex span routine to other untrained complex span tasks. We would certainly expect routines to adapt readily to changes in superficial task features such as the sensory modality of inputs or outputs that call on relatively peripheral and specialized processing systems, as these have few consequences for the higher-level structuring of task processes. Whether more profound mismatches such as changes in the interpolated distractor activities will be sufficient to prevent transfer to an untrained task is less clear. The overlap in the cognitive processes involved in performing distractor tasks such as verifying equations and judging whether letters are upright or mirror-reversed when mentally rotated (Harrison et al., 2013) is minimal. Indeed, the distractor activities employed in the trained and untrained complex span tasks differ in all relevant training studies familiar to the authors.
What the different distractor activities in complex span tasks do share is their functional position at the higher level of the task structure: in each case, they are interpolated between memory items and disrupt stimulus maintenance and encoding processes that might otherwise take place. In most complex span tasks, the distractor activities are unrelated to the to-be-recalled stimulus items. 2 The different distractor activities may therefore be supported by substitutable subroutines within a broader routine common to multiple complex span tasks designed to minimize distraction and maintain stimulus representations. Could a match to this higher-order structure be sufficient to allow the routine to be applied to complex span tasks involving different distraction, or is the tolerance to deviance in task structures limited to more superficial task elements? The transfer data from the meta-analysis in Study 1 directly address this question.

Re-sequencing
Another way in which complexity can be introduced into WM tasks is by changing the sequence in which memory items should be recalled. The most common re-sequencing task is backward span in which participants are instructed to recall lists in reverse sequence. Compared with forward recall, backward recall is generally slow (Anders & Lillyquist, 1971) and error-prone (Isaacs and Vargha-Khadem, 1989). Some participants report doing this by engaging in successive forward retrievals in order to peel off the numbers backwards (Anders & Lillyquist, 1971;Conrad, 1965;Thomas, Milner, & Haberlandt, 2003). First, the whole list is run through and the final item reported. The process is then repeated, each time reporting what has now become the final unrecalled item (1, 2, 3, 4 … 1, 2, 3 … 1, 2 … etc.).
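The peeling-off strategy described above can be sketched in a few lines of Python. This is only an illustration of the reported strategy, not a model from the literature: the function name `peel_off_backward` and the idea of counting each forward step as one "retrieval" are assumptions introduced here.

```python
def peel_off_backward(items):
    """Recall `items` in reverse using only forward passes through the list.

    Returns the recalled sequence and the number of forward retrieval
    steps made (a hypothetical cost measure, not taken from the source).
    """
    recalled = []
    retrievals = 0
    remaining = len(items)
    while remaining > 0:
        # Run forward through the part of the list not yet reported.
        last = None
        for item in items[:remaining]:
            last = item
            retrievals += 1
        # Report the item on which the forward pass ended.
        recalled.append(last)
        remaining -= 1
    return recalled, retrievals


recalled, retrievals = peel_off_backward([1, 2, 3, 4])
print(recalled)    # [4, 3, 2, 1]
print(retrievals)  # 10, i.e. 4 + 3 + 2 + 1 forward steps
```

Under these assumptions the number of forward steps grows with the square of list length, which is at least consistent with backward recall being slower than forward recall.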
Like any ad hoc strategy designed to solve the unusual problem of reversing an input sequence, this strategy requires a new routine for the recall phase. Existing cognitive processes must be coordinated in a novel way to make repeated forward covert retrieval attempts. For verbal stimuli, this is likely to involve control of the rehearsal process. As storage in verbal STM is phonological rather than semantic in nature (Baddeley et al., 1975;Gathercole, Frankish, Pickering, & Peaker, 1999), the same routine should be readily extended to backward span tasks employing different categories of verbal stimuli such as words, letters and digits. Transfer is therefore expected to any backward span tasks employing verbal material.
2 An exception is listening span, in which the recall items are the final words in interpolated sentences that participants read aloud (Daneman & Carpenter, 1980).
It is less clear whether training in backward span should transfer across verbal and visuo-spatial domains. Whereas backward recall leads to much lower memory span for verbal stimuli, its impact is minimal when the task involves recalling spatial sequences (Isaacs and Vargha-Khadem, 1989). This raises the possibility that the backward recall of spatial sequences may not involve the effortful and time-consuming peeling-off strategy that can be applied in backward digit span (Norris, Hall, & Gathercole, under review). If this is the case, the routines for verbal and visuo-spatial backward recall will differ substantially and this will limit transfer.
Other re-sequencing tasks include recalling numbers in mixed lists in numerical order and recalling letters according to either alphabetical order (Wechsler, 2008) or semantic category (Sheslow & Adams, 1990). These tasks are also unfamiliar and highly challenging. They require use of stored knowledge such as numerical sequences, the alphabet, and object category to guide retrieval and output. Distinctive routines will be required to meet each specific task requirement. No transfer is therefore predicted across these different re-sequencing variants.

Updating
Many WM training programs use tasks that involve continuous updating of the to-be-remembered items in lists of unknown length. The two updating paradigms used most frequently in training studies are n-back and running span.

N-back.
In n-back tasks, participants encounter a lengthy sequence of items and must judge, for each item, whether it matches the item presented n positions back. There are a number of ways in which n-back tasks might be performed. Most accounts of n-back have focused on how the potential set of memory items might be updated with successive presentations. Alternatively, participants might repeatedly break and then reconstruct bindings between item and order information as each new item is presented. Chatham et al. (2011) developed a neurally inspired model of n-back. In line with the idea that n-back requires the development of a novel routine, it has to be trained explicitly to perform the task with a specific value of n. Juvina and Taatgen (2007) presented behavioral evidence indicating that n-back may be supported by at least two different spontaneous strategies. One group of participants showed decreases in recognition accuracy at later positions. This was interpreted and simulated in an ACT-R production system model as reflecting a strategy of maintaining an active rehearsal set of n items, in combination with inhibition of items that drop out of the window. The remaining participants showed a flatter serial position function simulated in an alternative model without rehearsal.
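The task rule itself (as opposed to any of the cognitive accounts above) can be stated compactly. The following Python sketch simply scores which positions in a sequence are n-back targets; the function name `n_back_targets` and the example letter stream are illustrative assumptions, and the sketch says nothing about how participants actually solve the task.

```python
def n_back_targets(stream, n):
    """Return, for each position, whether the item matches the one n back.

    Positions with fewer than n preceding items cannot be targets.
    """
    return [i >= n and stream[i] == stream[i - n] for i in range(len(stream))]


stream = ["A", "B", "A", "C", "A", "C"]
print(n_back_targets(stream, 2))
# [False, False, True, False, True, True]
```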
There are two key points here. The first is that, irrespective of the specific cognitive processes required to support n-back, the task demands are so unfamiliar and challenging that they cannot be met by ready-to-go mechanisms. We argue that this necessitates the development of a new routine specifying the set of coordinated processes that need to be performed. Second, there is more than one way to perform n-back and it looks as though individuals choose between alternative strategies, and hence routines. We speculate that this variability in how tasks are performed even without training may be a hallmark of the unfamiliar complex tasks that benefit from the acquisition of new routines during training. Similar findings have been reported both for running span, in which participants may adopt either a passive strategy or an active updating strategy (Hockey, 1973), and for backward span (Norris et al., in preparation).
As with all complex WM tasks requiring novel routines, on-task performance is expected to improve with training, and benefits should extend to other similarly-structured tasks. Changes within stimulus domain (e.g., letters, digits), the input modality of the memory items, or the modality of response should not constrain transfer. In the case of these task deviations, the routine requires a minor modification that is unlikely to have repercussions on the higher-order structure of the routine. Whether transfer will extend across different representational domains (verbal, visuo-spatial) is harder to anticipate because it is unknown whether n-back is supported entirely by domain-general processes (in which case there should be cross-domain transfer) or in part by domain-specific processes such as rehearsal (in which case transfer may be limited to same-domain tasks). Transfer is not expected between n-back tasks and complex span tasks because of the mismatch in overall task structures and hence the routines they require. A similar point was made by Shipstead et al. (2012): "learning (e.g., practice, strategies) that occurs during n-back … training may simply not apply to complex span tasks" (p. 646). Outcomes of a recent meta-analysis of n-back training studies indicate that this is indeed the case (Soveri et al., 2017).

Running span.
In this paradigm, sequences of memory items of unpredictable length are presented and participants attempt to recall the last n items when the end of the list occurs (Jaeggi et al., 2008). Continuous updating of the memory items to be potentially recalled is a highly unfamiliar activity that can only be performed by developing a new routine. Although there is no agreed model of running span, there is consensus regarding the high cognitive demands of the task. Postle, Berger, Goldstein, Curtis, and D'Esposito (2001) proposed that it requires not only encoding, storage and rehearsal but also the discarding of previously encoded items and repositioning.
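As an illustration of what the task demands, the sketch below implements one possible way of meeting them: an active updating scheme in which the oldest item is discarded each time a new one arrives. This is an assumption made for illustration (the function name `running_span` is hypothetical), and a passive strategy of the kind mentioned earlier would not work this way.

```python
from collections import deque


def running_span(stream, n):
    """Return the last n items of a stream whose length is not known in advance.

    A bounded deque discards the oldest item whenever a new item arrives,
    mirroring continuous updating of the to-be-recalled set.
    """
    window = deque(maxlen=n)
    for item in stream:
        window.append(item)
    return list(window)


print(running_span([7, 2, 9, 4, 1, 8], 3))  # [4, 1, 8]
```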
The cognitive routine framework and other process-specific accounts make different predictions regarding the transferability of training between n-back and running span. Despite the common updating requirement, no transfer is expected because the broad set of processes required to perform each paradigm are so different. Whereas n-back requires comparisons of each successive stimulus with the most recently presented item, running span requires full serial recall of the updated stimulus set following the unpredictable end of the sequence. Any updating that takes place will therefore occur in the context of completely different preceding and succeeding processes, yielding routines with distinct higher-order structures. In contrast, according to the process-specific account of Dahlin et al. (2008), both tasks employ a common updating process that can be enhanced by training and will therefore show transfer. The data they report are consistent with this position, with transfer to an n-back task following training on running span. However, this study lacked an active control condition for comparison with running span training, potentially over-estimating the specificity of any training effects to updating in particular. To our knowledge there have been no further tests of transfer of training across these paradigms.

Study 1
The aim of this study was to investigate whether transfer following WM training arises when specific task features are shared both by training activities and transfer tasks and to discover the magnitude of their transfer. We performed a meta-analysis of randomized controlled trials (RCTs) of WM training that reported data on transfer to other WM tasks. The analysis was restricted to studies including untrained tasks that required the ordered recall of memory items in the context of simple serial recall or more complex WM paradigms.
The extent to which transfer is mediated by common elements of trained and untrained tasks was examined for the following task features: stimulus input modality (auditory, visual), recall modality (spoken, manual), stimulus category (words, letters, digits for verbal stimuli, objects and spatial locations for visuo-spatial stimuli), stimulus domain (verbal, visuo-spatial), and recall paradigm (serial recall, complex span, backward span). In each, transfer was measured under two conditions: i) when the feature was present in both the untrained task and at least one of the training activities (matched), and ii) when it was present in the untrained task but not during training (unmatched). If training induces changes in any of the cognitive processes associated with specific features of the trained activity, transfer should be greater when that feature is also present in the untrained WM task.
Analyzing the impact of feature overlap on transfer for multiple task features provides a systematic means of identifying the cognitive processes which undergo change during WM training. It also provides a way of testing predictions from the cognitive routine framework and, where predictions could not be made, of providing new information to inform its theoretical refinement. The framework predicts that training-related changes will be most substantial when participants have to perform unfamiliar WM activities that require the development of novel cognitive routines because these new routines can then be applied to untrained tasks sharing the same requirements. The more unusual the cognitive demands of a WM task and the less that they can be met by existing mechanisms, the greater the potential for transfer.
Transfer was not expected to be influenced by whether the input (memory items) or output (response) modalities were matched. Consider the case of linguistic material. Words can be presented and recalled through several modalities: we can recall verbal information by either speaking it aloud or writing it down, and this information may have initially been experienced in the form of spoken language, print, objects or images. In comprehension, production and reading, we readily translate between representations in these different modalities, so it should be unnecessary to develop new routines to enable transfer across modalities.
Predictions about whether transfer should depend on whether training and transfer tasks involve stimuli belonging to the same category are harder to derive. In the case of verbal stimuli there is substantial evidence that, irrespective of the input modality, verbal material is encoded in phonological form only in verbal STM (Salamé & Baddeley, 1982). We might therefore expect that transfer across verbal tasks will be independent of whether the semantic category of stimuli in trained and untrained tasks is matched, regardless of whether they are words, nonwords, digits, or letters. On the other hand, individuals engaging in extensive practice in verbal STM tasks clearly can develop highly material-specific recoding strategies that boost memory span if conditions permit (Chase & Ericsson, 1981). The gain in digit span was achieved by recoding the list items into familiar chunks in long-term memory (running times). We consider this strategy to be an example of a cognitive routine developed to exploit the recodable properties of just one category of stimuli. Category-specific transfer of this kind will necessarily be restricted to the trained stimulus category. However, it is not entirely clear whether the current generation of WM training programs provides the same opportunities for developing such strategies as these earlier studies of digit span training. In these, training typically took place over many months, the participants were small in number and highly motivated, and training was restricted to a single task or highly similar variants. In contrast, contemporary training programs usually involve multiple training activities with differing materials and involve fewer than 20 h of training in total. The strategies developed by individuals in complex WM tasks also appear to be relatively idiosyncratic (Minear et al., 2016;Norris et al., in preparation).
For these reasons, the strength of transfer when the categories of verbal memory items are matched is hard to anticipate.
For the studies included in this meta-analysis, the majority of the visuo-spatial tasks involved recall of spatial locations. It was therefore not possible to test within-domain transfer across alternative forms of visuo-spatial stimuli. We can, however, ask whether transfer is mediated by stimulus domain. In serial recall, predictions differ for simple verbal and visuo-spatial material. Serial recall for verbal items is not expected to require a new routine because it is fully served by the established system for encoding items and order information of verbal STM. A common verbal serial recall paradigm is therefore predicted to generate minimal transfer. For spatial recall, robust transfer is expected in light of the likely absence of an established visuo-spatial STM system (Morey, 2018;Morey & Miron, 2016). It is therefore proposed that in the course of training in the serial recall of spatial locations, participants develop new cognitive routines that allow them to improve performance and diminish reliance on more general cognitive resources. The detailed nature of these routines is not known but could, for example, involve refining a spatial rehearsal strategy or developing task-specific recoding strategies.
The final task feature is the WM paradigm itself. Transfer was analyzed for verbal and visuo-spatial variants of three WM paradigms: simple serial recall and two complex WM paradigms, complex span and backward span. N-back and running span were also coded, although for these paradigms it turned out that there were insufficient data for analysis. We expect that novel routines will be developed during training for visuo-spatial but not verbal serial recall, for complex span, and for backward span. Substantial routine-mediated transfer is therefore predicted across tasks sharing these paradigms.

Literature search and inclusion criteria
A flow diagram summarizing the process of selecting studies for inclusion in this study is provided in Fig. 1. In March 2018 separate comprehensive literature searches of the electronic databases PsycINFO and Google Scholar were carried out by two authors (DD, JH). Studies were identified from searches of keywords and titles that contained both working memory and training. The reference lists of studies and reviews were also checked for additional potentially relevant studies. The searches were then collated and, after duplicates were removed, the abstracts of the remaining studies were independently reviewed (SG, DD). If the abstract suggested that the study may be appropriate for inclusion in the meta-analysis then the full-text article of the study was evaluated against our inclusion criteria. These were: (i) publication in a peer-reviewed journal; (ii) randomized controlled trial of an adaptive WM training program; (iii) restriction of training activities to only one of the following complex WM paradigms with or without additional simple span tasks: complex span, backward span, updating; (iv) data for a minimum of 10 participants in the adaptive training condition; (v) inclusion of an active control training condition that involved either non-adaptive WM training or a form of adaptive training with a low WM load; (vi) outcome measures provided quantitative data from which effect sizes could be calculated for individual tasks. If not provided in the publications the data were requested from the authors and included if supplied; (vii) assessments of untrained WM tasks both before training and within 3 months of the completion of training. Table 1 summarizes the characteristics of the selected studies. Training groups ranged in size from 14 to 62 participants, with a mean group size of 27 (median = 24) for adaptive WM training and 26 (median = 26) for the active control group.
Participants were children or adolescents in 11 studies and adults aged 18 to 60 years in the remaining nine studies. Two of these included a group of older participants (60+). Most studies involved a single group of participants completing a single adaptive WM program. In four studies different groups completed different WM training programs: updating and Cogmed RM (Ang, Lee, Cheam, Poon, & Koh, 2015); complex span and running span (Foster et al., 2017); complex span and simple span (Harrison et al., 2013); and n-back and complex span (Minear et al., 2016). The nature of the active control conditions varied across studies. In 13 studies the control group received a non-adaptive version of the WM training program that was fixed at a level of either low memory load or no memory load. The remaining studies employed either adaptive programs that did not tax WM at all (e.g., visual search training) or other activities assumed to place only low demands on WM (e.g., mathematics training, video games).

Feature coding
Each untrained WM task was paired with a single WM task in the training program and both tasks were then coded according to five categories of feature 3 : stimulus type (digits, letters, words, objects, spatial locations), stimulus domain (verbal, visuo-spatial), stimulus modality (auditory, visual), response modality (spoken, manual), and paradigm (serial recall, complex span, and backward span). The use of the serial recall category was restricted to simple serial recall tasks. The procedure for matching the trained task with each untrained task within each study was as follows.
1. Match on both paradigm and stimulus domain (e.g., verbal & complex span).
2. If 1 is not possible, match on paradigm alone (e.g., complex memory, or serial recall).
3. If 2 is not possible or there are multiple trained tasks for 2, match on the trained task with the greatest total number of other matched features.
4. If two or more training activities are equivalently matched according to the above criteria, select a single representative trained task for matching.
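The four-step matching procedure can be sketched as a small selection function. This is a hedged illustration rather than the procedure actually used: the dictionary keys, the task names in the example, the use of a count over all five coded features in step 3, and the choice of the first tied task as the "representative" in step 4 are all assumptions introduced here. Tasks coded with multiple values for one feature are not handled.

```python
FEATURES = ("paradigm", "domain", "stimulus_type",
            "stimulus_modality", "response_modality")


def shared_features(trained, untrained):
    """Count how many of the five coded features two tasks have in common."""
    return sum(trained[f] == untrained[f] for f in FEATURES)


def select_trained_task(untrained, trained_tasks):
    """Pick the trained task to pair with one untrained task."""
    if not trained_tasks:
        raise ValueError("at least one trained task is required")
    same_paradigm = [t for t in trained_tasks
                     if t["paradigm"] == untrained["paradigm"]]
    # Step 1: paradigm and stimulus domain both match.
    both = [t for t in same_paradigm if t["domain"] == untrained["domain"]]
    # Step 2: otherwise paradigm alone; otherwise consider every trained task.
    candidates = both or same_paradigm or list(trained_tasks)
    # Step 3: prefer the candidate sharing the most features overall.
    best = max(shared_features(t, untrained) for t in candidates)
    tied = [t for t in candidates if shared_features(t, untrained) == best]
    # Step 4 (assumption): take the first remaining tie as the representative.
    return tied[0]


# Hypothetical tasks, for illustration only.
untrained = {"name": "backward digit span", "paradigm": "backward span",
             "domain": "verbal", "stimulus_type": "digits",
             "stimulus_modality": "auditory", "response_modality": "spoken"}
trained = [
    {"name": "trained backward letters", "paradigm": "backward span",
     "domain": "verbal", "stimulus_type": "letters",
     "stimulus_modality": "visual", "response_modality": "manual"},
    {"name": "trained backward digits", "paradigm": "backward span",
     "domain": "verbal", "stimulus_type": "digits",
     "stimulus_modality": "visual", "response_modality": "manual"},
    {"name": "trained spatial complex span", "paradigm": "complex span",
     "domain": "visuo-spatial", "stimulus_type": "spatial locations",
     "stimulus_modality": "visual", "response_modality": "manual"},
]
print(select_trained_task(untrained, trained)["name"])  # trained backward digits
```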
For some tasks, it was necessary to code multiple features within a single category. For example, in dual n-back tasks each stimulus item consists of both a verbal and visuo-spatial stimulus (e.g., Kundu et al., 2013). In total, 113 pairs of trained (T) and untrained (UT) WM tasks met the task selection criteria. Full feature coding for each of these pairs with details of the trained and untrained tasks and sample sizes is reported in Gathercole, Dunning, Holmes, and Norris (in press). For each task pair, each feature was coded as either not present (empty cell), present in the trained task only (T), present in the untrained task only (UT), or present in both tasks (T&UT). In the four studies in which different groups performed different WM training programs, each untrained task was matched with the closest task from each of the different training programs. For the Ang et al. (2015) study, for example, backward letter span performance was analyzed separately for each of the following combinations of groups: adaptive/non-adaptive Cogmed WM training, and adaptive/non-adaptive running span training.
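The four-way coding scheme can be illustrated with a short sketch. The function names and the example feature sets are assumptions made here; representing each task as a set of feature labels is one convenient way of allowing several features within a category, as in the dual n-back case.

```python
def code_feature(in_trained, in_untrained):
    """Code one feature for a trained/untrained task pair."""
    if in_trained and in_untrained:
        return "T&UT"
    if in_trained:
        return "T"
    if in_untrained:
        return "UT"
    return ""  # empty cell: feature present in neither task


def code_pair(trained_features, untrained_features, all_features):
    """Code every feature in `all_features` for one task pair."""
    return {f: code_feature(f in trained_features, f in untrained_features)
            for f in all_features}


# Hypothetical pair: a dual n-back trained task (verbal and visuo-spatial
# stimuli) paired with an untrained verbal serial recall task.
domains = ["verbal", "visuo-spatial"]
paradigms = ["serial recall", "complex span", "backward span"]
trained = {"verbal", "visuo-spatial"}
untrained = {"verbal", "serial recall"}
print(code_pair(trained, untrained, domains + paradigms))
# {'verbal': 'T&UT', 'visuo-spatial': 'T', 'serial recall': 'UT',
#  'complex span': '', 'backward span': ''}
```

On the definitions given for Study 1, a T&UT code corresponds to a matched feature and a UT code to an unmatched feature.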

Meta-analytic procedure
The following data were recorded for each transfer task in each study: the number of participants in the adaptive WM training and the active control groups, the means and SDs for the two groups pre- and post-training. All analyses were conducted using version 3.3 of the Comprehensive Meta-Analysis program (Borenstein, Hedges, Higgins, & Rothstein, 2005). Confidence intervals were calculated for Cohen's d effect sizes (Cohen, 1988).
Due to variation in studies (e.g. type of training program used, age of sample, outcome measures used) a random effects model was chosen for all analyses. For each analysis, outcomes with a standard residual value greater than 2 were classified as outliers and excluded (Hedges & Olkin, 1985). The following effect sizes were excluded: Brehmer et al.  (verbal unmatched, all). For serial recall, complex span, and backward span untrained tasks, data were analyzed separately for the conditions in which the stimuli were both verbal, both visuo-spatial, or crossed the two domains (verbal trained to visuo-spatial untrained and vice versa) if there were at least two effect sizes in each case. A further set of analyses was performed for each paradigm summed across the different domain conditions.

Analysis plan
For each feature, separate analyses were conducted for the matched conditions, the unmatched feature conditions, and the summed comparisons across both categories ('all'). Cohen's d effect sizes were calculated for the pre- to post-training gain for the adaptive relative to the control group (difference between the gain scores for the two groups / summed SD), with confidence intervals, z-scores and p values for the effect sizes. The criterion for significance was set at .05. By convention, an effect size d of .2 is considered small, .5 moderate and .8 large.
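A gain-score effect size of this general kind can be sketched as follows. The exact standardizer used by the meta-analysis software is not specified in the text, so this sketch assumes the pooled pre-training SD of the two groups; the function names, the example values, and the label returned for effects below .2 are likewise illustrative assumptions.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                     / (n1 + n2 - 2))


def gain_effect_size(pre_t, post_t, pre_c, post_c,
                     sd_pre_t, n_t, sd_pre_c, n_c):
    """Difference in pre-to-post gain (training minus control), standardized.

    Assumption: the standardizer is the pooled pre-training SD.
    """
    gain_difference = (post_t - pre_t) - (post_c - pre_c)
    return gain_difference / pooled_sd(sd_pre_t, n_t, sd_pre_c, n_c)


def conventional_label(d):
    """Label an effect size using the .2 / .5 / .8 benchmarks."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "below small"  # label chosen here; the text gives no term for this


# Hypothetical group means and SDs, for illustration only.
d = gain_effect_size(pre_t=4.0, post_t=5.0, pre_c=4.0, post_c=4.5,
                     sd_pre_t=1.0, n_t=27, sd_pre_c=1.0, n_c=26)
print(round(d, 3), conventional_label(d))  # 0.5 moderate
```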
Measures of heterogeneity (Q, p, I²) and publication bias (Egger's test) were also calculated. For I² estimates, a value of 0% equates to no heterogeneity, 25% to low heterogeneity, 50% to moderate heterogeneity and 75% to high heterogeneity (Higgins & Green, 2008).
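For readers unfamiliar with these statistics, the sketch below shows the standard definitions of Cochran's Q (inverse-variance weighted squared deviations from the weighted mean effect) and of I² as the percentage of Q in excess of its degrees of freedom. The effect sizes and variances in the example are invented for illustration and are not taken from the studies analyzed here.

```python
def cochran_q(effects, variances):
    """Cochran's Q: weighted sum of squared deviations from the weighted mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - mean) ** 2 for w, d in zip(weights, effects))


def i_squared(q, k):
    """I² (%) for k effect sizes: share of Q beyond its degrees of freedom."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical effect sizes with equal sampling variances.
effects = [0.2, 0.5, 0.8]
variances = [0.04, 0.04, 0.04]
q = cochran_q(effects, variances)
print(round(q, 2))                               # 4.5
print(round(i_squared(q, len(effects)), 1))      # 55.6
```

On the benchmarks given above, the value in this invented example would fall between moderate and high heterogeneity.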
Moderator analyses tested whether feature match had a significant impact on the magnitude of effect size. Feature match (matched, unmatched) was coded as a categorical moderator variable in each regression model and its influence on effect size was assessed for the summed matched and unmatched data for each feature. The critical outcomes of the moderator analyses are the p value and R². The criterion for significance was set at p = .05. In order to test further whether the effect size (transfer) for matched conditions differs for verbal and visuo-spatial material, the matched feature data were summed across the two domains for each of the serial recall and complex span paradigms. The significance of domain as a moderator variable for matched effect sizes was tested for each paradigm.

Results
The results of the analyses are summarized in Tables 2-4 (Supplementary materials). Across all 113 task pairs in the analysis, the mean effect size (d) was .42, SD = 0.54. The analyses assessed the statistical significance of the effect sizes according to each matched and unmatched feature, and of feature match condition as a moderator of transfer. The patterns of significance for each feature are summarized in Table 5. It should be noted that for three features the number of cases is less than 10, the recommended minimum number of cases for moderator analysis (Higgins & Green, 2008). These features are indicated by parentheses in Table 5. In each of these cases, however, the number of effect sizes included in the moderator analysis was greater than 18.

Digits
The effect size was large for matched task pairs (d = 0.994, p < .001) and smaller but significant for unmatched pairs (d = 0.357, p < .01). Match was a significant moderator of transfer (p = .005).

Letters
Effect sizes were small but significant both when the stimuli were matched across the trained and untrained tasks (d = 0.301, p < .05) and when they were not (d = 0.337, p = .046). Match was not a significant moderator of transfer (p > .05).

Words/nonwords
The effect size was moderate and significant for matched pairs (d = 0.568, p < .05) and nonsignificant for unmatched pairs (d = 0.294, p > .05). Match was not a significant moderator of transfer (p > .05).

Objects
The effect sizes were moderate and comparable in magnitude both when the stimuli were matched (d = 0.575, p < .05) and when they were unmatched (d = 0.403, p < .05). Match was not a significant moderator of transfer (p > .05).

Spatial location
The effect size was moderate for matched task pairs (d = 0.551, p < .001) and nonsignificant for unmatched pairs (d = 0.360, p > .05). Match was not a significant moderator of transfer (p > .05).

Stimulus domain Verbal
The effect size was moderate in magnitude and highly significant when both tasks were verbal (d = 0.455, p < .001) and non-significant when the domain did not match (d = 0.018, p > .05). Match was a significant moderator of transfer (p < .004).

Visuo-spatial
With matched spatial material the effect size was significant and moderate (d = 0.470, p < .001). It remained moderate in magnitude but was nonsignificant when the stimuli were not matched across tasks (d = 0.643, p > .05). Match was not a significant moderator of transfer (p > .05).

Stimulus modality Auditory
The effect size was moderate when both tasks employed auditory presentation (d = 0.470, p < .001) and was nonsignificant when they did not (d = 0.128, p > .05). Match was not a significant moderator of transfer (p > .05).

Visual
The effect size was moderate and significant when both tasks employed visual presentation (d = 0.459, p < .001). There were no data for cross-modal task pairs.

Response modality Spoken
The effect size was moderate in magnitude both when both tasks employed spoken recall and when they did not (d = 0.615 and .401, respectively, p < .001). Match was not a significant moderator of transfer (p > .05).

Manual
The effect size was significant and moderate in magnitude when spoken recall at transfer was combined with manual recall at training (d = 0.463, p < .001). No data are available for cross-modal task pairs.

Serial recall Verbal
The effect size was moderate and significant for matched pairs (d = 0.508, p < .001) and nonsignificant for unmatched pairs (d = 0.173, p > .05). The match significantly moderated transfer (p < .05).

Visuo-spatial
The effect size was very large and significant for matched pairs (d = 1.054, p < .001) and nonsignificant for unmatched pairs (d = 0.138, p > .05). The match was a significant moderator of transfer (p < .001). As there was only one cross-domain matched effect size for serial recall, this condition could not be analyzed. In order to test whether transfer differed significantly across verbal and spatial tasks, a further moderator analysis was performed on the matched feature data across all serial recall tasks with domain entered as a categorical moderator variable. Domain was a significant moderator of effect size, Q = 41.473, I² = 61.688, p < .001, demonstrating greater transfer for matched pairs for spatial than verbal serial recall.

All
The effect size was moderate and significant for matched pairs (d = 0.709, p < .001) and nonsignificant for unmatched pairs (d = 0.158, p > .05). Match was a significant moderator of transfer (p < .001).

Complex span Verbal complex span
The effect size was moderate and significant for matched pairs (d = 0.544, p = .002) and nonsignificant for unmatched pairs (d = 0.046, p > .05). Match was a significant moderator of transfer (p < .005).

Visuo-spatial complex span
The effect size was large and significant for matched pairs (d = 1.010, p < .001) and nonsignificant for unmatched pairs (d = 0.183, p > .05). Match was a significant moderator of transfer (p = .003).
In order to test whether transfer differed significantly across verbal and visuo-spatial tasks, a further moderator analysis was performed on the matched feature data with domain entered as a categorical moderator variable. Domain was not a significant moderator of transfer, Q = 17.706, I² = 58.487, p > .05.

All
The effect size was moderate for matched pairs (d = 0.540, p < .001) and very small for unmatched pairs (d = 0.112, p > .05). The match was a significant moderator of transfer (p < .001).

Backward span Verbal
The effect size was moderate and significant for matched pairs (d = 0.778, p < .001) and nonsignificant for unmatched pairs (d = 0.144, p > .05). Feature match was a significant moderator of transfer (p < .05).

Spatial
There were insufficient data for analysis in this condition.

Cross-domain
The effect size was large and significant for matched pairs (d = 1.294, p < .001).

All
The effect size was large and significant for matched pairs (d = 0.901, p < .001) and nonsignificant for unmatched pairs (see verbal backward span above). Feature match was a significant moderator of transfer (p < .05).
One limitation of the coding method is that although individual features are coded and analyzed independently of one another, matched features are often highly or indeed perfectly correlated with one another. For example, a pair of tasks involving spatial stimuli will necessarily also be coded as visuo-spatial. As a consequence, it is not always possible to identify which matched task features are critical to transfer. We therefore examined whether any of the matched features found to be significantly associated with substantial transfer in the moderator analysis could be explained in terms of another correlated feature that could potentially be the origin of transfer.
There were two such features. One is digits. Matched digits yielded a large transfer effect (d = 0.990). On closer examination it was found that all seven of the task pairs with matched digits employed a backward span paradigm, which was also associated with a high level of transfer (d = 0.901). A second confounded feature is verbal material. For this feature the transfer effect was smaller, but still highly significant (d = 0.455). This feature too was associated with recall paradigm: of the 50 matched verbal pairs, 27 also shared the same paradigm, whereas none of the 14 unmatched verbal untrained tasks did. Thus both for digits and for verbal material more generally, it is possible that the significant levels of transfer observed were consequences of matched paradigms rather than the stimulus items.

Discussion
The meta-analysis evaluated the features associated with transfer within WM in RCTs of adaptive WM training with active control training conditions. Across the 24 studies furnishing data for 113 pairs of trained and untrained WM tasks, the strength of transfer was small to moderate (d = 0.42). The magnitude of transfer was associated with some matched features but not with others. Transfer was high when the tasks employed the same paradigm, whether serial recall, complex span, or backward span. Paradigms differed in the impact of domain on transfer. For complex span, transfer was substantial when both the untrained and trained tasks employed material in the same domain (either verbal or visuo-spatial), but was absent when the domains differed across the task pairs. For serial recall, transfer was very large for spatial material, and reduced, although still significant, for verbal material. In the case of backward span, transfer was large both for verbal tasks and for task pairs that crossed domains. In contrast, the following more specific features did not generate transfer: stimulus category (letters, words/nonwords, objects, locations), the stimulus domain (verbal, visuo-spatial), input modality (auditory, visual), and output modality (manual, spoken). Transfer was observed when the memory items were either digits or verbal material more generally, although this could reflect the confounding influences of shared paradigms.
The findings provide broad support for the predictions of the cognitive routine framework. By this account, substantial transfer following WM training occurs only when both the trained and untrained activities impose the same unfamiliar task demands that are not supported by existing WM sub-systems. In complex span, distractor activities are interpolated between the presentation of items required for subsequent serial recall. This presents a major challenge for the maintenance and retrieval of stimulus presentations in the face of near-continuous distraction that must be addressed by engaging additional cognitive processes either to prevent decay (Barrouillet et al., 2009) or to minimize interference caused by encoding distractor items (Oberauer et al., 2012). It is this novel schedule that we suggest constitutes a cognitive routine, and its construction and refinement represents the process of acquiring a new cognitive skill.
The present findings tell us two important things about the structure and generalizability of a complex span routine. First, high levels of transfer between complex span tasks indicate that the routine can be readily adapted to accommodate novel interpolated activities. In every case included in the meta-analysis, these activities differed between the trained and untrained tasks. Combinations included processing the meaning of sentences and mental arithmetic (Harrison et al., 2013), sentence verification and counting (Henry, Messer, & Nash, 2014), symmetry and orientation judgments (Harrison et al., 2013), lexical decision and arithmetic calculations (Minear et al., 2016), and vowel/consonant and odd/even judgments (von Bastian & Oberauer, 2013a). The processes required to accomplish these activities are highly specific and have relatively little in common. Transfer across these distractor activities indicates that if the high-level structure of alternating stimulus presentation and distractor activity is preserved, distractor subroutines can be substituted one for another with relative ease. Second, the representational domain of the memory items limits the generalizability of the routine. There was no transfer for complex span task pairs that crossed stimulus domain (d = 0.12) although it was substantial when the task pairs shared the stimulus domain (d = 0.66). This suggests that domain-specific encoding and/or maintenance processes cannot readily be adapted to fit a complex span task drawn from another domain. This contrasts with the subroutines for handling distractor activities, which do show a high degree of content generality.
Table 4. Outcomes of the meta-analysis collapsed across feature match condition on transfer and the feature moderator analysis.
S.E. Gathercole et al. Journal of Memory and Language 105 (2019) 19-42
The implication is that the domain influences more than just the local elements (subroutines) of the routine, preventing modular substitutions of one for another. This may not be too surprising given the different nature of stimulus maintenance processes for verbal and visuo-spatial material. Rehearsal depends on the covert control of the relevant domain-specific action production systems, namely articulation in the case of verbal rehearsal (Waters et al., 1992) and eye movements for spatial rehearsal (Pearson et al., 2014; Pearson & Sahraie, 2003). A consequence of the distinctiveness of these underpinning systems may be that the two forms of rehearsal cannot be readily substituted for one another. Rehearsal may also be interleaved with the periods of distractor activity throughout the trial (Barrouillet et al., 2009), influencing multiple points during the execution of the routine. In this way, informational domain may directly shape the broad structure of the routine, limiting its transfer to complex span tasks embedded in the same domain.
Backward span is another complex WM task that is not readily served by existing mechanisms in STM, as a consequence of the forward-going representation of serial order in STM (Hurlstone et al., 2014). One of the ways in which the task can be accomplished is by peeling off the last item across successive forward retrievals of diminishing length (Conrad, 1965; Anders & Lillyquist, 1971; Norris et al., in preparation). Such a strategy would require the establishment of a new routine to coordinate the multiple processes required to achieve the end goal. The present findings demonstrate high levels of transfer across backward span tasks, both for verbal stimuli and for cross-domain task pairs. It is tempting to conclude that this demonstrates the application of a backward recall routine to a novel task on this basis. However, in the present dataset the degree of generalization captured by the task pairs in the verbal condition is very limited. In six of the seven task pairs, the trained and untrained tasks were identical (with digits as memory items), differing only in the modality of response (manual responding during training, spoken recall at transfer). The large degree of cross-modal transfer in backward span suggests that higher-order features of the routine can be adapted to fit new stimuli, although here the data are restricted to just two task pairs. The extent to which the high-order structure of the backward span routine can be generalized across materials has therefore yet to be fully established.
Although tests of visuo-spatial serial recall such as Corsi blocks and spatial span are considered to tap STM, the conclusion of the task analysis was that there is not a highly developed STM system for storing visuo-spatial material (Alloway et al., 2006;Kane et al., 2004;Pearson et al., 2014;Thompson et al., 2006). We therefore speculate that routines may be developed and refined for tasks involving the recall of spatial locations. The present findings of substantial transfer across visuo-spatial serial recall tasks are consistent with this position. As the great majority of matched task pairs employed spatial locations as the memory items, the potential generalizability of this routine to other visuo-spatial characteristics is unknown.
The cognitive routine framework contrasts with other process-specific accounts of WM training in generating predictions not only about the features that generate transfer but also those that will not. One prediction is that verbal serial recall tasks will not require a new routine because they are already supported directly by a specialized system of verbal STM. Transfer was indeed significantly lower for verbal than visuo-spatial recall (d = 0.51 vs .84). However, it extended across different stimuli (letters, digits, words), suggesting that it does not originate from material-specific mnemonic strategies. So, where does it come from? Not, we suggest, from new routines. The training benefits may instead reflect subtle adjustments and re-calibration of the order-based encoding system already in place in verbal STM. These could have two possible sources. One is the optimization of the routines to task-specific characteristics likely to generalize to untrained tasks in the same study. An example is the temporal properties of the task such as the stimulus presentation rate and time allowed for recall. Alternatively, the extended practice may lead to subtle fine-tuning of the core mechanisms themselves.
Minimal levels of transfer were also predicted for local features of tasks such as the stimulus presentation modality and the response format as these are likely to be handled by specialized and highly modularized systems for the processing of inputs and outputs. The present findings provide strong support for this position: neither a common input nor output modality had any impact on transfer. It was predicted that within the verbal domain at least, the specific category of stimuli (letters, digits, etc.) will not modulate transfer because information in verbal STM is represented in a phonological rather than semantic form (Salamé & Baddeley, 1982). There was partial support for this prediction: for letters and words/nonwords a category match across tasks did not generate transfer. For digits, transfer was high. This could be because participants developed digit-specific recoding strategies during training, as reported in previous studies of lengthy intensive training in digit span (Ericsson & Simon, 1981). However, in each of the task pairs included in this analysis, both tasks also employed a backward span paradigm and this is an alternative and possibly more plausible source of routine-mediated transfer. Further data are needed to resolve this issue.

Study 2
The broad set of training paradigms included in the meta-analysis provides a more comprehensive analysis of the boundary conditions to transfer within WM than any single study could provide. However, some of the strengths of the approach are also limitations. One of these is the heterogeneity of the studies included in the meta-analysis: there was variation in the training activities, training regimens, the transfer tests, the active control conditions, the participants, and sample sizes. While this forms a sound basis for the generalization needed for broad theoretical analysis, it cannot provide the more detailed level of analysis needed to test the cognitive routine framework as an explanation for the patterns of transfer observed in individual studies.
A second limitation is the relatively coarse level of analysis yielded by the independent coding and analysis of individual features of the pairs of trained and untrained tasks selected to be most closely matched. We have already seen that this can generate potentially misleading results: in all cases in which both tasks in the meta-analysis employed digits as the memory items, they also happened to be backward span tasks, but the two features were analyzed independently. Whether the robust levels of transfer observed in these data are mediated by the common stimulus category, the complex WM paradigm, or the combined impact of both is therefore indeterminate. Other kinds of interactions between matched and unmatched features that influence transfer could also go undetected in this method. For example, we cannot determine using this method whether cross-paradigm transfer occurs only under conditions of matched stimulus features such as domain.
One way of addressing these limitations is to examine whether the cognitive routine framework can successfully account for the transfer found with a single WM training program. This allows a more fine-grained analysis of the impact of combinations of features on transfer. We adopted this complementary approach in the present training study. In this, we re-analyzed a dataset from two studies of children (n = 106) who had completed either a single adaptive or non-adaptive WM training program and four WM transfer tasks (Dunning et al., 2013; Holmes et al., 2009). The children were selected on the basis of low scores on complex WM tasks. These studies were not included in the meta-analysis and therefore provide an independent evaluation of the cognitive routine framework. Participants in the Holmes et al. study were excluded from Study 1 because they were not randomly allocated to the adaptive and non-adaptive training conditions; instead, they were recruited sequentially (adaptive first, then non-adaptive) using identical recruitment criteria. Dunning et al. reported analyses only for composite scores that combined pairs of transfer tests (e.g., digit and word span for verbal STM). We chose not to incorporate previously unreported descriptive statistics for these measures in Study 1, retaining the data instead for the present independent test of transfer patterns.
Participants completed either the standard adaptive or non-adaptive form of Cogmed RM Working Memory Training (http://www.cogmed.com). It provides training on eight WM activities drawn from a larger set of 12 activities on each daily session for at least 20 daily sessions. Further information on Cogmed RM training activities is provided in the Appendix. The following transfer tests were completed before and after training. The WM measures were taken from the Automated Working Memory Assessment (Alloway, 2007) and consisted of tests of verbal and visuo-spatial serial recall (digit span and dot matrix), backward digit span and Mr. X, a test of visuo-spatial complex span. The Wechsler Abbreviated Scales of Intelligence (WASI: Wechsler, 1999) was also administered. This consists of four subtests: Similarities and Vocabulary (from which a verbal IQ composite is formed), and Matrix Reasoning and Block Design (performance IQ). Analyses were based on standard scores for each measure.
The cognitive routine framework was used to generate predictions based on analysis of the profile of matched and unmatched features in this study, guided by the coding protocol employed in the meta-analysis. Consider first dot matrix, a test of serial recall of successively highlighted spatial locations in cells in a 4 × 4 grid. This task differs only in superficial aspects of the spatial layout from multiple spatial STM activities trained daily in Cogmed RM, including Data Link, in which spatial elements such as colored panels illuminate in sequence. The coded features for dot matrix and Data Link were identical. In the meta-analysis, combinations of spatial recall tasks such as these generated very high levels of transfer (d = 1.12). We propose that this reflects the development and refinement of a routine for the serial recall of spatial material. A high degree of transfer was therefore expected for the dot matrix task. The digit span transfer task involved the spoken serial recall of spoken items. Its closest match, the Decoder task, was completed on each of the first five days of training. Sequences of letters were presented auditorily with recall by mouse selection of each item from a choice of three items at each position. The two tasks therefore differed on two features: stimulus category (digits vs letters) and response modality (spoken vs mouse click). These differences would not be expected to have any impact on ease of the phonological encoding of items and their order in verbal STM, representing input and output conditions that are widely used in studies of verbal serial recall. Findings from the meta-analysis reported in Study 1 indicate transfer across verbal serial recall tasks that, although moderate in magnitude and highly significant (d = 0.51), is markedly smaller than that for visuo-spatial serial recall.
We have argued that this transfer is not mediated by the development of a new routine as verbal STM is already well-established but instead by fine-tuning of STM processes within this system, possibly to optimize task fit. In the present study, a small to moderate transfer effect is therefore predicted for digit span.
The untrained backward digit span task involved auditory presentation and spoken recall. Two daily Cogmed RM tasks also require backward recall of digits. In Input Module with Lid the response pad is not displayed until recall, and in Input Module without Lid it is present during presentation. The trained and untrained tasks differed only in response modality (spoken vs mouse selection). Transfer levels were found to be high for verbal backward span tasks (d = 0.78) in the meta-analysis. The framework predicts high levels of transfer across backward span tasks because the unusual nature of the task demands will require the development of a new routine. We note that because in this study the trained and untrained tasks are distinguished only by the mode of recall, transfer across tasks would represent very near transfer only. In this respect, as in the majority of previous studies, this does not provide a particularly strong test of routine-mediated transfer across backward span tasks.
Mr. X is a complex span task in which participants judge whether pairs of figures are holding a ball in the same hand as one another for each of a series of displays, and then remember the location of the ball held by the figure on the right which can appear at one of six compass positions. At the end of a trial, participants are required to recall the successive locations of the ball. There was no visuo-spatial complex span task in the Cogmed RM training program, and its closest match was a visuo-spatial serial recall activity, Visual Data Link. This shares the requirements for the serial recall of spatial locations in Mr. X, but not the interpolated distractor activity characteristic of complex span tasks. The cognitive routine framework does not generate any strong predictions for this transfer task because the trained and untrained paradigms do not squarely match. However, it is at least plausible that any improvements in encoding, maintaining and recalling spatial locations resulting from extensive training will enhance the same stimulus encoding and maintenance activities in subroutines embedded within the more complex structure of a complex span routine. We would though expect transfer to be diminished to Mr. X relative to the dot matrix, for which there is a precise match in the entire paradigm.
Finally, measures of verbal and performance IQ provided tests of the specificity of transfer. Individual differences studies have established close links between WM and problem-solving skills (Engle et al., 1999; Kyllonen & Christal, 1990; Wiley, Jarosz, Cushen, & Colflesh, 2011). However, the cognitive routines developed to accomplish complex WM tasks would be expected to have very little overlap with the specific structures of the problem-solving tasks with unfamiliar spatial designs that are used to index performance IQ. These typically involve simultaneous presentation of multiple response alternatives as well as the application of rules to transform the spatial form of one stimulus for another (Cattell, 1963; Raven, 2003). The same applies to the vocabulary-based assessment of verbal IQ, which relies on access to stored knowledge based on prior learning with little or no reliance on either the basic processes of WM or the more specific routines that develop them further to address more specific needs. Transfer of WM training benefits to performance on tests of either verbal or nonverbal IQ is therefore not expected.

Method
The participants were children aged 7 to 12 years with low WM scores identified through routine classroom-wide screening on two WM tests. In the adaptive training group there were 32 boys and 24 girls ranging in age from 7 years 1 month to 12 years 0 months (M = 9 years 1 month, SD = 14.42 months), and in the non-adaptive group there were 29 boys and 21 girls with ages ranging from 7 years 6 months to 11 years 6 months (M = 8 years 10 months, SD = 9.82 months). The children participated in either standard Cogmed RM training (n = 56) or the non-adaptive version of the same program in which the difficulty level was fixed at a span length of two (n = 50). Sixty-four participants (34 adaptive, 30 non-adaptive) who participated in the Dunning et al. (2013) study were selected on the basis of standard scores below 86 on the backward digit span and Mr. X tasks from the AWMA (Alloway, 2007). A further 42 children from the Holmes et al. (2009) study (22 adaptive, 20 non-adaptive) had standard scores below 86 for the listening recall and backward digit span tasks from the AWMA.
Written parental consent for participation was provided for all children. Further details of participant recruitment and methods are supplied in the original publications. A post hoc power analysis with a total N of 106 yielded power of .99 to detect a large effect size, f² = 0.35, with linear regression at p = .05. The power to detect a medium effect size of f² = 0.15 was .95, and .23 for a small effect size, f² = 0.02.
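To make the f² benchmarks easier to read, the sketch below converts them to the proportion of variance explained, using Cohen's definition f² = R² / (1 − R²). It does not reproduce the power calculation itself, and the helper name `f2_to_r2` is hypothetical.

```python
def f2_to_r2(f2):
    """Convert Cohen's f-squared to R-squared via f2 = R2 / (1 - R2)."""
    return f2 / (1.0 + f2)


# The three benchmark effect sizes used in the power analysis.
for label, f2 in [("small", 0.02), ("medium", 0.15), ("large", 0.35)]:
    print(f"{label}: f2 = {f2:.2f}, R2 = {f2_to_r2(f2):.3f}")
```

On this definition the small, medium and large benchmarks correspond to roughly 2%, 13% and 26% of variance explained.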

Analysis plan
The effect size Cohen's d was computed as the difference between the gain scores (pre- to post-training) for the two groups divided by their pooled SD (Weisz & Hawley, 2001). Univariate analyses were performed on baseline scores to identify any group differences prior to training. For baseline measures without a significant group effect, a general linear model (GLM) was run with post-training scores as the dependent variable and age and baseline scores as independent variables. For baseline measures with significant group differences, both centered baseline scores and centered baseline score × group product terms were also included in the regression models. Product terms were derived by the product of the centered scores (individual score minus the group mean) and the grouping variable. Where product terms were not significantly associated with post-training scores, GLMs were re-run omitting the product term or centered scores in the final analyses (i.e. with post-training scores as the dependent variable and group and pretraining scores as independent variables). In all cases, the critical term of interest was the group effect in the GLM and the criterion for significance was set at p = .05. Separate regression models were tested with each of the six transfer tests. Corresponding analyses were also performed on verbal and performance IQ scores in order to test the specificity of any transfer effects to WM.
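The two computations described above can be sketched as follows. The sketch assumes that the pooled SD is taken over the gain scores of the two groups and that the grouping variable is coded 0 (non-adaptive) and 1 (adaptive); neither detail is stated in the text, and the function names and example scores are hypothetical.

```python
import math
from statistics import mean, stdev


def cohens_d_gain(pre_a, post_a, pre_c, post_c):
    """Cohen's d for the difference in gain scores between two groups.

    pre_a, post_a : pre- and post-training scores, adaptive group
    pre_c, post_c : pre- and post-training scores, comparison group
    Assumes the pooled SD is computed from the gain scores.
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_c = [post - pre for pre, post in zip(pre_c, post_c)]
    n_a, n_c = len(gains_a), len(gains_c)
    pooled_var = (
        (n_a - 1) * stdev(gains_a) ** 2 + (n_c - 1) * stdev(gains_c) ** 2
    ) / (n_a + n_c - 2)
    return (mean(gains_a) - mean(gains_c)) / math.sqrt(pooled_var)


def centered_product_terms(baseline, group):
    """Center baseline scores on their group mean and form product terms.

    baseline : baseline scores, one per child
    group    : group codes (assumed 0 = non-adaptive, 1 = adaptive)
    Returns (centered scores, centered score x group products).
    """
    group_means = {
        g: mean([b for b, gg in zip(baseline, group) if gg == g])
        for g in set(group)
    }
    centered = [b - group_means[g] for b, g in zip(baseline, group)]
    products = [c * g for c, g in zip(centered, group)]
    return centered, products
```

For instance, `centered_product_terms([90, 100, 80, 100], [1, 1, 0, 0])` returns centered scores of -5, 5, -10 and 10 and product terms of -5, 5, 0 and 0. Under 0/1 coding the product term is non-zero only for the adaptive group, so it captures whether the baseline slope differs between groups.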
In addition to traditional null hypothesis significance testing (NHST), Bayesian analysis was used to test support for the null hypothesis relative to the alternative hypothesis that training has a genuine effect. Bayesian linear regressions were conducted using JASP (Love et al., 2015). Regression models with baseline scores and intervention condition as independent variables were computed individually for each post-test dependent variable. Bayes factors are reported. By convention (Jeffreys, 1961), values are interpreted as follows: 1-3 (anecdotal evidence for the alternative hypothesis, in this case that there is an effect of training), 3-10 (substantial evidence), 10-30 (strong evidence), 30-100 (very strong evidence), 100+ (decisive evidence). The corresponding values in support of the null hypothesis are the inverse values: 0.33-1.0 (anecdotal evidence for the null hypothesis), 0.10-0.33 (substantial evidence), and so on.
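The interpretation scheme above can be written as a small lookup. This sketch assumes that the Bayes factor is expressed as evidence for the alternative over the null (BF10) and that a value falling exactly on a boundary takes the stronger label; both are assumptions, and the function name `interpret_bf10` is hypothetical rather than part of JASP.

```python
def interpret_bf10(bf10):
    """Map a Bayes factor (alternative over null) to a verbal label."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 == 1:
        return "no evidence either way"
    favored = "alternative" if bf10 > 1 else "null"
    # Evidence for the null is read from the inverse of BF10.
    ratio = bf10 if bf10 > 1 else 1.0 / bf10
    bands = [
        (3, "anecdotal"),
        (10, "substantial"),
        (30, "strong"),
        (100, "very strong"),
    ]
    for upper, label in bands:
        if ratio < upper:
            return f"{label} evidence for the {favored} hypothesis"
    return f"decisive evidence for the {favored} hypothesis"
```

A hypothetical BF10 of 150 would be read as decisive evidence for an effect of training, whereas a value of 0.2 would be read as substantial evidence for the null hypothesis.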

Results
Descriptive statistics and statistical outcomes are shown in Table 6. Bayesian analysis provided decisive evidence in favor of the alternative hypothesis for dot matrix and backward digits. The effect sizes for transfer were 0.92 and 0.77, respectively. For Mr. X there was substantial support for the alternative hypothesis and a transfer effect size of 0.58. For digit span, the transfer effect size was 0.29 and the evidence did not favor either the null or alternative hypothesis. For both verbal and performance IQ, the analysis provided substantial evidence in support of the null hypothesis of no transfer.
The same pattern of outcomes was reflected in the GLM analyses. Significant group differences were found at baseline on the dot matrix and Mr. X tests. For these variables the group product term (centered baseline score × group) was entered along with centered baseline scores and group. In the GLMs for dot matrix and Mr. X, the product terms were nonsignificant, indicating that baseline differences did not affect the outcomes. The GLMs were subsequently run without the product terms and centered baseline scores. Highly significant group differences arising from increased performance following adaptive training were found for backward digit recall, dot matrix, and Mr. X (p < .001 in each case). For digit recall, the training group effect was significant (p = .035). It should be noted that this term reflects in part a reduction in post-training scores for the non-adaptive comparison group (.09), with a mean increase in standard scores in the adaptive group of only 2.6 points. The effect of training condition was nonsignificant for verbal IQ (p > .05) but significant for performance IQ (p < .05). The latter effect arose from a greater post-training improvement for the non-adaptive than the adaptive group and so does not reflect a positive transfer effect.

Discussion
The transfer patterns observed in this re-analysis of transfer data from two published studies are consistent with predictions derived by applying the cognitive routine framework to the profile of overlapping and distinct features of the training program and the individual transfer tasks. They also largely uphold the conclusions regarding the features governing transfer from the meta-analysis.
The strength of transfer varied considerably across the untrained WM tasks. Highest levels were found for the two untrained tasks with WM paradigms that were also employed during training and which are hypothesized to require the development of new routines that can be transferred to similarly-structured tasks. These tasks involve the recall of spatial locations (dot matrix) and the recall of digits in reverse sequence. The same paradigms also generated large transfer effects in the meta-analysis.
An intermediate degree of transfer was found for a visuo-spatial complex span task (Mr. X), even though there was no corresponding paradigm in the Cogmed training program. We propose that this is driven by routine-mediated improvements in the recall of sequences of spatial locations in multiple daily training activities. This finding is important as it extends the range of the critical conditions beyond the scope of the meta-analysis, which was not designed to quantify cross-paradigm transfer. Transfer was very weak for the untrained verbal serial recall task, even though participants had trained on another serial recall task employing different verbal stimuli (letters rather than digits) during training. This outcome is entirely consistent with the proposal that cognitive routines are only necessary and only generate transfer when the task cannot be readily accomplished with existing WM processes and mechanisms. The verbal STM system amply meets the needs of verbal serial recall tasks, and training on them will therefore not lead to transfer. In the meta-analysis, tasks that shared this paradigm showed levels of transfer that were moderate but significantly smaller than for visuo-spatial STM tasks. Finally, we found no evidence that WM training influenced performance on either vocabulary or nonverbal reasoning tests. This too we expected: although WM may play a role in nonverbal reasoning, the paradigm-specific routines developed in the course of WM training will have minimal overlap with the cognitive demands of such assessments.
One limitation of the study is that the participants were all children who performed poorly at screening on tests on complex WM. The reason for this is that the motivation of the original studies was to investigate whether their relatively common WM problems (they represented the 16% of children with the lowest WM scores in the school population) could be ameliorated through training. The extent to which these results generalize to other populations of individuals with either typical or atypical WM abilities or of different ages therefore cannot be determined. What we can conclude is that in accordance with the predictions of the cognitive routine framework, training-induced transfer is greatest when the trained and untrained tasks tap aspects of WM other than verbal STM in children with weak WM skills. Moreover, the patterns of transfer observed in this particular population provide a close fit to the broader analysis of the factors driving transfer from the meta-analysis in Study 1 of training studies employing many different populations.

Study 3
One issue we have not yet addressed is how new routines might be created. These are only needed when existing processes and mechanisms are not sufficient to satisfy the demands of the task. The process of constructing a routine must therefore draw in part at least on cognitive resources that fall outside of WM. This position stands in opposition to more general plasticity-based accounts of training, according to which training-induced changes reflect increases in the fundamental capacity of the underlying WM system itself.
One way of conceptualizing the establishment of new routines is as a form of problem-solving behavior involving the decomposition of a complex task into its constituent cognitive parts. When assembled, these parts enable the individual to meet task goals. This might, for example, be to protect memory representations from interference or decay caused by distraction in complex span tasks, or to reverse the input sequence at output in backward span. This kind of problem-solving capacity has been most widely investigated in the context of nonverbal reasoning, where it has been widely considered to be supported by limited general cognitive resources labelled either g or fluid intelligence (Cattell, 1963;Duncan, Emslie, Williams, Johnson, & Freer, 1996). It has been linked to the fronto-parietal brain networks that respond flexibly to multiple kinds of cognitively challenging activities (Duncan, Burgess, & Emslie, 1995;Duncan & Owen, 2000). It has been proposed that the same flexible resources may be responsible for the cognitive segmentation of complex reasoning tasks into their component parts (Duncan, 2013;Duncan, Chylinski, Mitchell, & Bhandari, 2017). Although the concept of a cognitive routine that we use is broadly influenced by the production systems approach to the acquisition of complex skills (Anderson, 1982;Taatgen, 2013), it also has much in common with Duncan's notion of cognitive segmentation. Perhaps, then, cognitive routines depend on the same general attentional resources believed to be critical for the restructuring process in the context of nonverbal reasoning tasks.
To date, investigations of individual differences in transfer following WM training have focused primarily on links with WM performance prior to training. There is some evidence from n-back training that trainees who begin training with relatively high WM scores fare better on the training tasks and, to some extent, on measures of transfer (Au et al., 2015;Jaeggi et al., 2008). In a study of complex span training, participants with higher WM showed greater performance benefits on the trained tasks as the number of sessions increased, but there was little evidence that the magnitude of transfer to other WM tasks varied as a function of pre-training WM ability (Foster et al., 2017). The authors speculated that these findings may reflect greater ability to generate and implement new strategies during training rather than a fundamental increase in WM capacity. Borella, Carbone, Pastore, De Beni, and Carretti (2017) conducted a detailed analysis of individual differences in baseline measures of vocabulary, age and WM in relation to the magnitude of transfer across STM and WM tasks following training on a verbal complex span task. Participants monitored lists of words in a sequence of lists for the presence of animal words and recalled the sentence-final words. In tasks judged to require active processing (which included dot matrix and backward digit span), higher-performing participants benefitted more from training. The converse pattern of greater gains for individuals with lower initial levels of performance was found for digit span. These findings show the same differentiation observed in the previous WM training study between digit span on the one hand (yielding little or no transfer) and dot matrix and backward span on the other (with very substantial transfer). In the Borella et al. study, however, the baseline predictors of post-training scores did not include fluid intelligence, the issue of particular interest in the present study.
In Study 3 we examined whether the magnitude of transfer following adaptive WM training is modulated by the fluid cognitive abilities indexed by nonverbal reasoning. If the development of new routines during training and hence their transferability depends on this ability, the strength of routine-mediated transfer should be more strongly predicted by pre-training measures of fluid intelligence than by the WM measures themselves. This association should be present in the three tasks showing substantial transfer following Cogmed WM training in the previous WM training study: dot matrix, backward span and Mr. X. No link is predicted for post-training digit span, for which transfer was insubstantial.
To achieve the statistical power and heterogeneity required to examine individual differences in transfer following training, we re-analyzed data from four WM training studies in which measures of IQ as well as WM were obtained before and after training. The sample of 108 children was a mixed group of individuals who were either typically developing, had low WM but no other recognized developmental impairments, or had a diagnosis of ADHD. They completed the four tests of WM from the AWMA (digit span, backward digit span, the dot matrix test of visuo-spatial STM, and Mr. X, a visuo-spatial test of interpolated span) and the verbal and performance IQ subtests of the WASI (Wechsler, 1999).

Method
The sample was composed of 34 children with low WM from Dunning et al. (2013), 22 children with low WM from Holmes et al. (2009), 25 children with a diagnosis of ADHD from Holmes et al. (2014), and 27 children (12 with low language abilities and 15 children matched for nonverbal abilities) from Holmes et al. (2015). The participants from the Dunning et al. and Holmes et al. (2009) studies comprised the adaptive training group in WM Training Study 1. All children completed standard adaptive Cogmed RM training. Further information regarding recruitment and methods is provided in the individual publications. Written parental consent for participation was provided for all children. The mean age of the children was 9 years 3 months, ranging from 8 to 11 years, and there were 68 boys and 40 girls.

Analysis plan
The effect size d calculated for this study is the mean training gain (average of the differences between pre- and post-training scores) divided by the pooled SD (average of the pre- and post-training SDs). It should be noted that because all of the participants completed adaptive training, the effect sizes are not directly comparable with those reported in Studies 1 and 2, which compared the gain scores for the adaptive and control training conditions. Univariate ANOVAs were conducted on scores on each of the four WM transfer tests as a function of time (pre-training and post-training). Corresponding Bayesian ANOVAs were also performed. In order to examine individual differences in transfer, separate GLMs and Bayesian regression analyses were performed for each set of post-training WM scores. The criterion for significance was set at p = .05. In these, the post-training measure was the dependent variable and baseline WM scores, verbal and performance IQ scores were the independent variables. Outcomes of these analyses establish the strength of unique associations between each independent and dependent variable.
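The single-group effect size defined at the start of this analysis plan can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name, not the authors' analysis code. It takes the "pooled SD" literally as the simple average of the pre- and post-training sample SDs, as the text describes.

```python
from statistics import mean, stdev


def within_group_d(pre, post):
    """Mean pre-to-post gain divided by the average of the pre and post SDs."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain one score per participant.")
    gains = [after - before for before, after in zip(pre, post)]
    # Per the text: the pooled SD here is the average of the two sample SDs.
    average_sd = (stdev(pre) + stdev(post)) / 2
    return mean(gains) / average_sd


# Made-up scores for five children (not data from the study).
print(round(within_group_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]), 2))
```

Because there is no comparison group in the denominator or numerator, values from this function include any practice or retest gains, which is why they cannot be compared directly with between-group effect sizes.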

Results
Descriptive statistics for the six tests administered before and after training and the outcome of analyses of these data are shown in Table 7. Scores increased significantly after training on all four WM tests (BF10 > 100, p < .001, in each case). The effect sizes were large for both dot matrix (1.27) and backward digit span (0.96), moderate for Mr. X (0.75) and small for digit span (0.37). For verbal and performance IQ, they were very small (0.14 in both cases). Table 8 summarizes the outcomes of the individual differences analyses examining associations between baseline (pre-training) measures and post-training scores on the WM tests. Pre-training scores significantly predicted post-training scores for all four WM tests. In the Bayesian analyses the evidence in favor of a positive effect was decisive for digit span, strong for Mr. X and substantial for both backward span and dot matrix. NHST outcomes were consistent with this pattern, with p < .001 for digit span and p < .05 for all other tasks. The corresponding beta weights were between 2.2 and 2.9 times greater for digit span than for each of the other measures.
Pre-training performance IQ significantly predicted dot matrix and Mr. X scores (p < .005 in both cases) and, less strongly, backward digit span (p < .05). It was not significantly associated with digit span (p > .05). Bayesian analysis provided substantial evidence for the null hypothesis for digit span, anecdotal evidence for the alternative hypothesis for backward span and strong support for the alternative hypothesis for both dot matrix and Mr. X. Baseline verbal IQ scores were not significantly associated with any of the four WM measures (p ≥ .05 in each case). There was substantial evidence only for digit span, where the null hypothesis of no association was favored.

Discussion
The baseline predictors of post-training performance differed markedly across the four WM transfer tests. Post-training digit span was very strongly related to digit span prior to training but was not linked with either verbal or performance IQ. Dot matrix, backward digit span and Mr. X were also related to their own baseline scores, although here the strength of association was much weaker than for digit span. Each of these WM tasks was significantly associated with baseline performance IQ but not verbal IQ. This association was strongest for the two visuo-spatial tasks, dot matrix and Mr. X.
Performance IQ was therefore linked with post-training scores only on the same tasks that showed substantial transfer with adaptive training in the previous study: dot matrix, backward span and Mr. X. This is consistent with the suggestion that the flexible cognitive capacities indexed by this measure are involved in the construction and refinement of new routines. It is not obvious why these general resources would be critical if training simply results in increases in the fundamental efficiency of established processes. If so, post-training performance should be best predicted by pre-training scores on the same task, as an index of the baseline capacity on which training experience could be built. This was not the case.
We note that the composition of the sample included in this study is unusual. It combines data from studies of typically-developing children and two samples with compromised WM skills: children selected on the basis of low WM, and children with ADHD. The cognitive heterogeneity of this large sample is ideal for exploring the cognitive origins of transfer through individual differences analyses. However, generalization across other populations of trainees will require further data.

General discussion
We have proposed that substantial transfer from WM training is a consequence of the development of new routines that are applied to new tasks. In learning a new task, individuals must develop a new cognitive routine that specifies the precise sequence of cognitive processes necessary to perform the task. It is a form of learning that follows well-established principles of the acquisition of complex skills (Anderson, 1982;Fitts & Posner, 1967;Taatgen, 2013). Once established, new routines can be applied to other similar new tasks. For activities that are already highly practiced before training commences, no new routine will be needed. There will therefore be little scope for either improvement on the trained activity or for transfer.
The approach differs from previous conceptualizations of WM training and transfer in several respects. In contrast to accounts of training with a primary focus on neuroplasticity (Klingberg, 2010;Takeuchi, Sekiguchi et al., 2010), it does not characterize training-induced change as an undifferentiated process that simply and automatically propagates itself across any activity relying on WM. Instead, transfer is considered to be the direct consequence of the cognitive routines developed during training and, critically, how they can be adapted to fit new tasks at the point of potential transfer. Unlike process- and task-specific accounts of WM transfer (Dahlin et al., 2008;Dunning & Holmes, 2014;Holmes et al., 2009;Minear et al., 2016;Shipstead et al., 2012;Soveri et al., 2017;Sprenger et al., 2013;von Bastian & Oberauer, 2013a), the framework generates predictions about which overlapping task properties both will and will not lead to transfer, and specifies the cognitive conditions necessary for transfer. It does so by building both on a detailed task analysis of common WM tasks and on a skill acquisition framework. Finally, the approach is clearly differentiated from theories of cognitive training which assume that benefits might arise from "learning to learn" (Bavelier et al., 2012;Harlow, 1949). The learning that we believe drives WM training extends only to tasks that can benefit directly from common routines. Learning to learn implies that there are higher-order routines that can be developed in the course of experience with different learning situations which can then be applied to new learning situations. In WM training we believe there is little opportunity to develop sufficiently higher-order routines that can be applied to other paradigms.
The framework was tested in two steps. First, a task analysis of the cognitive processes involved in common WM tasks was performed. This was used to guide the classification of whether or not each task meets the criteria for requiring a new cognitive routine. Framework predictions were then tested in three studies investigating the task features influencing transfer. The first two studies examined the shared characteristics of trained and untrained WM activities associated with transfer. Study 1 reported a meta-analysis of the features associated with transfer in RCTs of WM training. The studies varied widely in the nature of the training activities, transfer tasks, control conditions, participants, statistical power and the methods of analysis. Study 2 investigated the task features associated with transfer for a single WM training program with a small set of WM transfer tests in children with low WM. Study 3 adopted an individual differences approach to investigate the hypothesis that routine-mediated transfer depends primarily not on expansion of existing WM processes, but instead on the recruitment of more general cognitive resources that fall outside of the WM system.
The findings were broadly consistent with the predictions. The task analysis led to the conclusion that the following paradigms require new routines: visuo-spatial serial recall, complex span, backward span, and the updating tasks of n-back and running span. When these paradigms are shared by trained and untrained tasks, there should be transfer. This was indeed found to be the case: transfer was strongly linked with the serial recall of spatial locations (Studies 1 and 2), complex span (Study 1), and backward span (Studies 1 and 2). Data were insufficient for corresponding analysis of the two updating paradigms.
A further prediction was that there should be little transfer across verbal serial recall tasks because they do not need a new routine. This is because mechanisms for encoding, maintaining and retrieving item and order information are served by an existing and highly-practised system of verbal STM. This prediction was partially supported. Transfer was found across verbal serial recall tasks in the meta-analysis (Study 1). However, its magnitude was significantly lower than for visuo-spatial recall in Study 1 and, in the training study (Study 2), it was very weak. We therefore conclude that there may be subtle but genuine training-related improvements in the efficiency of established processes within verbal STM. These gains are small in magnitude and may be reliably detected only under conditions of higher statistical power than the standard WM training study. We speculate that they reflect either the fine-tuning of basic mechanisms or the recalibration to optimize task parameters that is evident even in low-level perceptual discrimination paradigms (Bavelier et al., 2012). What is clear is that transfer within verbal STM does not have the same robust character associated with more complex WM paradigms that require new routines.
The meta-analysis established that more basic features of tasks such as the input modality (auditory or visual) and output modality (spoken or manual) are not critical for transfer. For example, there were high levels of transfer across backward span tasks that differed only in whether the responses involved spoken recall or mouse-based selection of response alternatives. We had predicted that this would indeed be the case: changes in the sensory modality of either stimulus items or responses should require adjustment only to specialized input and output processing systems employed as low-level elements of routines, and shifting from one to another (from auditory to visual presentation, or spoken to manual responses) should not modify the higher-order structures of the task. It should therefore be possible for routines to be rapidly modified to adapt to these peripheral changes.
A common stimulus category in itself does not appear to be a sufficient basis for transfer (see Minear et al., 2016;von Bastian & Oberauer, 2013a). The meta-analysis established that for letters, words/nonwords and spatial locations, transfer was not influenced by whether the stimulus category was matched; the exception was when digits were the memory items in both tasks. Although this exception may reflect a recoding strategy (or routine) developed during training (Chase & Ericsson, 1981), it could also be due to a confound in the shared features of the relevant task pairs. In all of the cases in this dataset, the trained and untrained tasks shared not only digits but also a backward span paradigm. This was itself predicted to generate high levels of routine-mediated transfer.
The goal of the new framework was to define the boundary conditions on transfer within WM. Our starting position was that routines are hierarchically organized, consisting of subroutines that control the execution of processes necessary to accomplish different task components (e.g., encoding, maintenance, distractor processing, retrieval). At the most peripheral level, these processes include task-specific input and output settings such as the modality of the stimulus inputs or the responses. Switching between these settings should not have an impact on other elements and subroutines: borrowing Taatgen's (2013) terminology, they should not alter the flow of information processing across the system. There should therefore be transfer across low-level settings, as was indeed found in Study 1. In contrast, mismatches in the high-level structure of the subroutines corresponding to the paradigm were expected to limit the adaptability of a routine to a new task. The meta-analysis established that this was the case.
What was much less clear to us in formulating the framework were the limits on transfer when tasks differed at intermediate levels of a routine. The meta-analysis generated new information on this issue. In every pair of complex span tasks included in the analysis, the distractor activity was different. The strength and consistency of transfer found for both verbal and visuo-spatial matched complex span task pairs establishes that transfer does indeed occur when the higher-order structure [(stimulus, distractor processing)_rep, retrieval] of the routine can be preserved but individual subroutines differ. However, this was true only when the stimulus domain was also matched across the two tasks (verbal to verbal, or visuo-spatial to visuo-spatial): transfer did not extend across complex span tasks in which the domains differed. This suggests that although one distractor subroutine can be substituted for another, the intermediate level of organization is so strongly framed by domain-specific encoding and maintenance processes that the routine cannot be adapted to fit complex span tasks with stimuli drawn from a different domain. This interpretation clearly requires further systematic experimental analysis.
Study 3 investigated the cognitive skills that might support the development of new routines during training. We speculated that the flexible cognitive resources believed to play a critical role in the process of decomposing unfamiliar tasks into their constituent cognitive parts (Duncan et al., 2017) may also play a critical role in constructing the bespoke specification of cognitive processes to be executed in a routine. This hypothesis was explored in the re-analysis of training data from a large sample of children completing a single WM training program in Study 3. Higher performance IQ scores were uniquely associated with the magnitude of transfer to the three WM tests (visuo-spatial serial recall, visuo-spatial complex span and backward span) hypothesized to require new routines. In contrast, post-training performance on the verbal serial recall task (digit span) was very strongly associated with baseline levels of performance on the same task but not with nonverbal reasoning. This suggests that when WM tasks do not need new routines, training-induced changes reflect modifications in the capacity of existing systems. In contrast, when routines must be constructed, they depend on general attentional resources.
Other theoretical accounts of WM training are less successful in accounting for both the presence and absence of transfer with different shared task features in the present studies. Accounts that attribute training and transfer to plasticity in the undifferentiated neural substrate of WM (Klingberg, 2010;Westerberg & Klingberg, 2007) lack the specificity to accommodate limited transfer across WM tasks. On the other hand, process-specific explanations of transfer as reflecting increases in the efficiency of individual processes within WM (Dahlin et al., 2008;Minear et al., 2016;Sprenger et al., 2013) struggle to explain why some shared task features and processes are sufficient for transfer whereas others are not.
These challenges disappear if we move away from the idea that training improves existing processes within WM and consider instead that WM training simply involves learning how to perform an unusual task. This learning is conceived as the construction of new cognitive routines and transfer is only expected for tasks that might be expected to share routines. From this we conclude that there is little prospect that WM training with a small number of tasks will ever have a substantial impact on real-life skills such as those required to enhance educational achievement. Such real-life skills are likely to rely on an extensive array of cognitive routines; too many, probably, to be trained with anything other than real-life experience.

Declaration of interests
None.

Data statement
The data reported in this article are available in the CBU Open Data Repository http://www.mrc-cbu.cam.ac.uk/publications/opendata/.