Verbal working memory encodes phonological and semantic information differently

Working memory (WM) is often tested through immediate serial recall of word lists. Performance in such tasks is negatively influenced by phonological similarity: People more often get the order of words wrong when they are phonologically similar to each other (e.g., cat, fat, mat). This phonological-similarity effect shows that phonology plays an important role in the representation of serial order in these tasks. By contrast, semantic similarity usually does not impact performance negatively. To resolve and understand this discrepancy, we tested the effects of phonological and semantic similarity on the retention of positional information in WM. Across six experiments (all Ns = 60 young adults), we manipulated between-item semantic and phonological similarity in tasks requiring participants to form and maintain new item-context bindings in WM. Participants were asked to retrieve items from their context, or the contexts from their item. For both retrieval directions, phonological similarity impaired WM for item-context bindings across all experiments. Semantic similarity did not. These results demonstrate that WM encodes phonological and semantic information differently. We propose a WM model accounting for semantic-similarity effects in WM, in which semantic knowledge supports WM through activated long-term memory.


Introduction
Working memory (WM) is a core function of the cognitive system responsible for holding information briefly available for further processing. It has long been shown that the phonological similarity between items in a to-be-remembered list induces confusion errors (Baddeley, 1966). When participants study lists such as "rat, fat, mat" and are asked to recall them in serial order, they confuse the order of these words more often compared to lists such as "wall, dig, bend". Semantically similar lists such as "leopard, cheetah, lion", by contrast, do not reliably lead to such confusion errors compared to semantically dissimilar lists, such as "sky, pen, pillow" (Saint-Aubin & Poirier, 1999b). In this work, we comprehensively tested the boundary conditions in which semantic similarity could induce confusion errors. Based on our results, we arrived at the conclusion that semantic and phonological information play different roles in the short-term maintenance of serial/positional information.
This study is motivated by models postulating that an item's position in WM is maintained through item-context binding, as implemented in many computational models of serial recall (Burgess & Hitch, 1999, 2006; Henson, 1998; Lewandowsky & Farrell, 2008; Oberauer & Lewandowsky, 2011; Oberauer, Lewandowsky, Farrell, Jarrold, & Greaves, 2012). In these models, serial position is temporarily maintained by binding items, such as words, to contexts, such as a word's serial position in a list (e.g., binding the word "wall" to "Position 1"). We illustrate this assumption in Fig. 1. Suppose the to-be-remembered sequence is "wall, dig, bend". If asked to recall the item that was presented in the third position, one can re-activate the context of the third position and use it as a cue to retrieve the word "bend" that is bound to it (Fig. 1, top). Likewise, if asked to recall where "dig" was presented, one can retrieve "Position 2" (Fig. 1, bottom). The generic associative model in Fig. 1 allows this flexibility: Retrieving an item when cued with a context/position, but also retrieving a context/position when cued with an item. It is this item-context binding that we assume is responsible for maintaining the item's serial order in a list. This assumption is supported by modelling work that has identified item-context bindings as an essential component of working memory for lists (Farrell & Lewandowsky, 2004) as well as for visual-spatial arrays (Oberauer & Lin, 2017; Schneegans & Bays, 2017).
It has long been established that item-context bindings are subject to confusion errors (Henson, 1998). In the serial-recall literature, these errors are typically referred to as order errors, a specific type of confusion error appearing in tasks where people need to recall the items in their serial position in a list. When recalling lists of words in their serial order, people often recall the correct words from the list but in wrong list positions. For instance, when trying to recall the sequence "wall, dig, bend" people sometimes retrieve "wall, bend, dig" instead. These order errors are more likely to occur between items sharing adjacent vs. distant serial positions in the list (i.e., the locality constraint; see Henson, 1998), suggesting some degree of overlap between adjacent positional representations (i.e., the overlapping ellipses in Fig. 2). These confusion errors must be distinguished from item errors, which are failures to recall an item at all. One might not be able to recall "bend" and either respond with another word that was not in the list or leave the response empty. These errors and the way to compute them are illustrated in Fig. 2. Studies have shown a dissociation between confusion and item errors. Confusion errors are more affected by dual-task interference than item errors are (Gorin, Kowialiewski, & Majerus, 2016; Henson, Hartley, Burgess, Hitch, & Flude, 2003), and the two error types are associated with different neural regions (Kalm & Norris, 2014; Majerus et al., 2010). Confusion errors are particularly diagnostic for understanding what kind of representation is bound to context, because they reflect failures of distinctly binding each item to its context. We will therefore focus on these errors when examining the role of phonological and semantic representations for the item-context binding process. The role of item memory will be considered in the Discussion.

Fig. 1. Illustration of the binding process and its interaction with similarity. Note. Through temporary bindings in WM, items (here depicted as circles) can be retrieved from their context (upper panels), and contexts (here depicted as ellipses) can be retrieved from their item (lower panels). When items are dissimilar (left panels), they are sufficiently distinct to allow the original item or context to be retrieved in most cases. When inter-item similarity increases (right panels), the competition between alternative WM representations increases, increasing the probability that a confusion error occurs.

Fig. 2. Scoring procedure typically used to assess item and order memory. Note. When measuring participants' ability to recall items, the total number of items recalled is computed, divided by the number of memoranda. In this example, four items (A, C, D and F) out of six (A, B, C, D, E, and F) have been recalled, leading to an item score of 4/6 = 0.667. When measuring participants' ability to recall the order of a sequence, the number of items recalled in their correct position is computed, divided by the total number of items recalled regardless of their position. In this example, only two items have been recalled in their correct position (A and F), out of four items in total (A, C, D, and F), leading to an order score of 2/4 = 0.5. When computing the order score, items not recalled at all are scored as missing values, because they are not informative regarding participants' ability to recall the items in their order. In this example, items B and E have not been recalled at all; they are therefore scored as missing values. In this way, the order score is independent of the item score: A person can have any order score between 0 and 1 regardless of how many items they recalled (when no item was recalled, the order score is not defined). See also the Methods section for a detailed description of how these scores were obtained in our experiments.
The similarity between to-be-remembered information impacts confusion errors, with similar information being more confusable than dissimilar information. The best studied example of this phenomenon is the phonological similarity effect (Baddeley, 1966), in which phonologically similar list items are confused more often than phonologically dissimilar items. This similarity effect is of critical importance for our understanding of WM. It shows that the phonological representation of items is bound to positional contexts. This impact of between-item similarity has been observed across multiple domains, such as the auditory (Visscher, Kaplan, Kahana, & Sekuler, 2007; Williamson, Baddeley, & Hitch, 2010) and visual (Guitard & Cowan, 2020; Jalbert, Saint-Aubin, & Tremblay, 2008; Logie, Saito, Morita, Varma, & Norris, 2016; Saito, Logie, Morita, & Law, 2008) domains. Therefore, the increased confusability induced by similarity appears to reflect a general property of WM. Confusion errors in WM can be more generally attributed to a discriminability problem. Whatever representation is used during item-context binding, this representational format is subject to confusion errors, especially when the to-be-remembered information becomes difficult to discriminate (i.e., as similarity increases).

Similarity-based confusions and the direction of retrieval
The similarity between items can cause confusion errors in two different ways. The first one occurs when items need to be retrieved from their context, such as retrieving "wall" from "Position 1". This is the best studied case of similarity-based confusion in the WM literature, in which the so-called phonological similarity effect occurs. The second case is rarely studied in the WM literature and involves retrieving a position/context from the item it was bound to, such as retrieving "Position 2" when presented with "dig". In this section, we explain each retrieval direction and the way it is affected by similarity more thoroughly.
When items need to be retrieved from their context, similarity increases confusions because the retrieved WM trace is ambiguous compared to other items (see Fig. 1, upper panels). For instance, in serial recall, participants must reproduce the items in order. In serial-recall models, this is accomplished by re-activating the positions one by one in forward order and using each position as a cue to retrieve the item bound to it (e.g., "Position 1" is used as cue to retrieve "rat"). This initially leads to the retrieval of a partially degraded WM trace of the item. To produce a legitimate response (e.g., a word), the degraded WM traces must be disambiguated by comparing them to a set of response candidates (Schweickert, 1993). Between-item similarity increases confusion errors during this disambiguation stage. For instance, given the item "rat" and its degraded trace "_at", it is more likely to select "fat" than "dig".
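The disambiguation stage described above can be illustrated with a toy sketch. It is not the model of Schweickert (1993) or of any cited serial-recall model: letters stand in for phonemes, `_` marks a lost segment, and the choice rule (match count plus one, normalized) is an assumption made purely for illustration. The function names are hypothetical.

```python
def match_score(trace, candidate):
    """Count positions where the degraded trace agrees with a candidate.

    '_' marks a lost segment and contributes nothing to the match.
    Letters are used as a crude stand-in for phonemes (assumption).
    """
    return sum(t == c for t, c in zip(trace, candidate) if t != "_")


def choice_probabilities(trace, candidates):
    """Turn match scores into choice probabilities.

    Each candidate gets weight 1 + match score (assumed rule), so even a
    non-matching candidate keeps a small chance of being selected.
    """
    weights = {c: 1 + match_score(trace, c) for c in candidates}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Degraded trace of "rat" compared against a similar and a dissimilar candidate set.
similar = choice_probabilities("_at", ["rat", "fat", "mat"])
dissimilar = choice_probabilities("_at", ["rat", "dig", "bend"])
print(similar["rat"], dissimilar["rat"])
```

In this sketch the target "rat" is selected less often among rhyming candidates than among non-rhyming ones, which is the qualitative pattern the paragraph describes.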
The opposite direction of retrieval is to provide an item and ask for the position associated with that item. This direction of retrieval, rarely tested in the WM literature, provides a new way of testing a prediction from the idea of item-context bindings, as shown in Fig. 1. For this direction of retrieval, higher between-item similarity is predicted to increase the probability of confusion errors because the cue itself (i.e., the item) is similar to other cues (i.e., other items in the list) (Mensink & Raaijmakers, 1988; Osgood, 1949; Watkins & Watkins, 1976), and therefore more ambiguous. We will refer to this phenomenon as the cue-similarity principle. During the binding process, all features of an item are bound to the item's context. Similar items have overlapping features. When the item features are activated by the item cue, the overlapping features activate other items' contexts as well as the target item's context. The activation of multiple contexts by the same item cue increases retrieval competition, and hence, the probability of choosing the non-target context. For instance, when presented with the item "rat", and the next list word was "fat", not only the position of "rat" but also the position of "fat" will be strongly re-activated, leading to increased confusion errors. To the best of our knowledge, this cue-similarity principle has never been tested for lists of phonologically similar words or lists of semantically similar words.
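The cue-similarity principle can be sketched as a small, noise-free toy computation. This is an illustrative assumption, not an implementation of any cited model: each item is a set of hypothetical features (an onset and a rime), each context is activated in proportion to the feature overlap between the cue and the item bound to it, and activations are normalized into choice probabilities.

```python
def position_probabilities(cue_item, item_features, positions):
    """Probability of reporting each position when cued with an item.

    item_features: item -> set of features (hypothetical coding).
    positions:     item -> list position the item was bound to.
    A context is activated by the number of features the cue shares
    with the item bound to that context (assumed activation rule).
    """
    cue = item_features[cue_item]
    activation = {}
    for item, pos in positions.items():
        overlap = len(cue & item_features[item])
        activation[pos] = activation.get(pos, 0) + overlap
    total = sum(activation.values())
    return {pos: a / total for pos, a in activation.items()}


def features(onset, rime):
    # Hypothetical two-feature phonological code for a monosyllabic word.
    return {"onset:" + onset, "rime:" + rime}


similar_features = {"rat": features("r", "at"), "fat": features("f", "at"), "mat": features("m", "at")}
dissimilar_features = {"rat": features("r", "at"), "dig": features("d", "ig"), "bend": features("b", "end")}

p_similar = position_probabilities("rat", similar_features, {"rat": 1, "fat": 2, "mat": 3})
p_dissimilar = position_probabilities("rat", dissimilar_features, {"rat": 1, "dig": 2, "bend": 3})
print(p_similar[1], p_dissimilar[1])
```

With rhyming list items, the shared rime feature spreads activation to the contexts of "fat" and "mat", so the target position is chosen less often than in the dissimilar list.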

The present study
The purpose of the present study is to test whether the general similarity principles introduced above also apply to semantic information. Previous studies manipulating semantic similarity have shown that people recall more semantically similar than dissimilar items (i.e., better item memory) (Poirier & Saint-Aubin, 1995). This beneficial effect is generally attributed to people using the semantic category shared by similar items to restrict the set of plausible response candidates during recall (Neale & Tehan, 2007; Saint-Aubin & Poirier, 1999b), or to increased activation in the shared semantic network in long-term memory (Kowialiewski, Lemaire, & Portrat, 2021; Kowialiewski & Majerus, 2020; Tse, Li, & Altarriba, 2011). We will return to the impact of similarity on item memory in the General Discussion. Whereas the evidence for improved item memory is robust, whether semantic similarity increases confusion errors is more ambiguous. Previous studies testing the impact of semantic similarity on confusion errors provided mixed results, some providing evidence for it (Baddeley, 1966; Saint-Aubin & Ouellette, 2005; Tse et al., 2011) and some providing evidence against it (Monnier & Bonthoux, 2011; Nairne & Kelley, 2004; Neale & Tehan, 2007; Neath, Saint-Aubin, & Surprenant, 2022; Saint-Aubin & Poirier, 1999b). A recent meta-regression study suggested that semantic similarity increases order errors (Ishiguro & Saito, 2020). This meta-regression is, however, not completely conclusive, as the (marginally significant) results pertain to a specific measure of semantic similarity. When a different measure of semantic similarity was used, no impact of semantic similarity was observed in the Ishiguro & Saito meta-regression study. These contradictions raise the question of whether semantic information is bound to contexts in the same way as phonology.
In addition to resolving this empirical uncertainty, we also provide a first test of a new prediction from the WM architecture presented in Fig. 1: People should be able to retrieve a context when presented with an item, and confusion errors should arise from the similarities between items. According to the cue-similarity principle, confusion errors should increase when similar items are used as cues to retrieve the positions. The cue-similarity principle has never been tested with this direction of retrieval in verbal WM tasks, despite being a core prediction from positional models of WM.
We tested whether semantic information is encoded in the same way as phonological information, namely by binding that information to appropriate context cues such as positions. Across six experiments, we manipulated semantic (Experiments 1a, 2a & 3a) and phonological (Experiments 1b, 2b & 3b) similarity between items. We used category membership to manipulate semantic similarity (e.g., musical instruments, animals, fruits), based on the assumption that similarity will be very high between members of the same category (e.g., "leopard-lion-cheetah"), compared to items drawn from different categories (e.g., "jacket-tree-letter").¹ The phonological manipulation served as a control to assess the validity of our experimental procedures. As a rough equivalent of semantic similarity, we manipulated phonological similarity by using lists of items drawn from the same rhyming category (e.g., "rat, fat, mat") and compared these lists to lists of non-rhyming items. Both similarity manipulations involve categories that the similar items share (e.g., a semantic or rhyming category), and have been shown to increase the number of items people can recall (Gupta, Lipinski, & Aktunc, 2005; Poirier & Saint-Aubin, 1995). Therefore, the two similarity manipulations are comparable. Participants were asked to bind the study list items either in relation to a temporal context (Experiments 1 & 2) or a spatial context (Experiment 3).
If semantic information is bound to context the same way as phonological information, we should observe more confusion errors in semantically similar than semantically dissimilar lists.
¹ Category membership is a robust and safe way to study semantic similarity. The categories can be directly used to create the similar lists. The dissimilar lists are then created by sampling one word from different categories. This way, all individual characteristics of the stimuli affecting WM performance are controlled for, such as word frequency, imageability/concreteness, or neighborhood density.
The novelty of our study was to test the similarity principle across both retrieval directions. We tested the impact of the similarity manipulations on confusion errors by cueing with the context to access the items (Experiments 1, 2 and 3, see Fig. 3), as classically done in the majority of studies. Critically, we also tested item-context binding by cueing with the items to access the context (Experiments 2 and 3, see Fig. 3). Taking both directions of retrieval together provides an exhaustive and unambiguous test of whether semantic information is bound to contexts the same way as phonological information.

Experiments 1a & 1b
Experiments 1a & 1b assessed the impact of semantic and phonological similarity with similar vs. dissimilar lists in a serial recall and an order reconstruction task (see Fig. 3, upper panel). In the serial recall task, participants had to retrieve the items, given positional cues, by typing the words in a prompt box. The serial recall task provides a way to assess the impact of both similarity manipulations on item and confusion errors. In the order reconstruction task, the items were given at retrieval and participants had to put them in their original order, thus providing a pure measure of order memory. If semantic information is used during item-context binding, we predict that people should confuse semantically similar items more often than dissimilar items.

Participants
Young adults aged between 18 and 35 years participated in Experiments 1a & 1b (N = 60 for each experiment). Sample sizes were first estimated based on previous studies investigating the impact of semantic and phonological similarity, leading to a base sample size of 30. In case the Bayes Factor (see statistical procedure) did not reach a sufficient level of evidence (BF > 10 for either the null or the alternative hypothesis) concerning the critical effects of interest, thirty more participants were recruited. Sixty participants per experiment was set as the maximum N due to financial constraints.
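The recruitment rule described above can be written as a short decision function. This is a sketch under stated assumptions: the function name and arguments are hypothetical, and it assumes that recruitment stops only when every critical effect reaches BF > 10 in either direction, which is one reading of the rule.

```python
def next_sample_size(n_tested, bf10_values, base_n=30, step=30, max_n=60, threshold=10.0):
    """Return the sample size to aim for next (equal to n_tested when stopping).

    bf10_values: Bayes factors (alternative over null) for the critical effects.
    An effect counts as conclusive when BF10 > threshold or BF01 = 1/BF10 > threshold.
    Requiring ALL critical effects to be conclusive is an assumption of this sketch.
    """
    def conclusive(bf10):
        return bf10 > threshold or (1.0 / bf10) > threshold

    if n_tested < base_n:
        return base_n
    if n_tested >= max_n or all(conclusive(bf) for bf in bf10_values):
        return n_tested
    return min(n_tested + step, max_n)


print(next_sample_size(30, [7.0]))         # inconclusive at N = 30: recruit up to 60
print(next_sample_size(30, [25.0, 0.05]))  # both effects conclusive: stop at 30
```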
Participants were recruited on the online platform Prolific. All participants were English native speakers, reported no history of neurological disorder or learning difficulty, and gave their written informed consent before starting the experiment. The experiment has been carried out in accordance with the ethical guidelines of the Faculty of Arts and Social Sciences at the University of Zurich.

Material
We decided to draw the items from clearly defined semantic and rhyme categories to maximize between-item similarity across both the semantic and phonological dimensions. Furthermore, both similarity manipulations have in common that they increase item memory (Gupta et al., 2005; Poirier & Saint-Aubin, 1995). As the phonological manipulation served as a control to draw conclusions about the semantic manipulation, it is important to show that they have a comparable effect on one dependent variable (in our case, item memory). The full list of stimuli is available on OSF. To form a similar list, six items were drawn from the same category. For each list, the six items were randomly drawn from the category, and their order was shuffled. The dissimilar lists were built by randomly sampling one item from each of six different categories. Constraints were imposed when creating the dissimilar lists across both similarity dimensions to ensure that idiosyncratic aspects of the lists would not lead to spurious effects (see Appendix A).
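The list-construction logic can be sketched as follows. This is a minimal illustration, not the authors' stimulus-generation code: the category pool consists of placeholder strings rather than the actual stimuli (which are on OSF), the function names are hypothetical, and the additional constraints of Appendix A are not implemented.

```python
import random


def build_similar_list(categories, rng, list_length=6):
    """Draw all items from one randomly chosen category.

    rng.sample returns the items in random order, which covers the shuffling step.
    """
    name = rng.choice(sorted(categories))
    return rng.sample(categories[name], list_length)


def build_dissimilar_list(categories, rng, list_length=6):
    """Draw one item from each of `list_length` different categories, then shuffle."""
    names = rng.sample(sorted(categories), list_length)
    items = [rng.choice(categories[name]) for name in names]
    rng.shuffle(items)
    return items


# Placeholder pool (assumption): 6 categories with 8 placeholder words each.
categories = {f"cat{k}": [f"cat{k}_word{i}" for i in range(8)] for k in range(6)}
rng = random.Random(2024)  # fixed seed only to make the sketch reproducible
similar = build_similar_list(categories, rng)
dissimilar = build_dissimilar_list(categories, rng)
print(similar)
print(dissimilar)
```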
Several metrics of semantic similarity have been proposed in the literature. Among these, Latent Semantic Analysis (LSA) is the most used. It measures the extent to which two words co-occur within similar contexts in large corpora (Landauer & Dumais, 1997). A recent study found that another variable, WordNet path length, predicts WM performance more accurately than LSA (Ensor, MacMillan, Neath, & Surprenant, 2021). This variable measures the shortest path length that separates concepts in a hypothetical semantic network. Finally, another semantic similarity metric has been proposed, which, contrary to classical measures, is thought to be partially independent from lexical connectivity measures such as LSA or WordNet path length (Ishiguro & Saito, 2020). This metric relies on three main dimensions: valence, arousal, and dominance (i.e., VAD, see Moors et al., 2013). With this metric, similarity at the list level is obtained by first computing the centroid of list items in the semantic space. The mean Euclidean distance of all items from their centroid is then computed. The closer the items are to their centroid, the more similar they are. We used these metrics (i.e., LSA, WordNet path length and mean distance from the centroid) to evaluate the extent to which the similar and dissimilar lists we used differed in terms of semantic similarity. Overall, the semantically similar lists differed from the semantically dissimilar lists across all semantic similarity measures explained above. In contrast, the phonologically similar and dissimilar lists did not credibly differ along any dimension. The results from this analysis are reported in Table 1.
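The centroid-based list similarity can be computed in a few lines. The sketch below assumes each word is represented as a (valence, arousal, dominance) triple; the numeric ratings are invented for illustration and are not taken from any norming study.

```python
import math


def mean_distance_from_centroid(vectors):
    """Mean Euclidean distance of the list items from their centroid.

    vectors: one coordinate tuple per word, e.g. (valence, arousal, dominance).
    Smaller values indicate a more similar list.
    """
    n = len(vectors)
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dims)]
    return sum(math.dist(v, centroid) for v in vectors) / n


# Hypothetical VAD ratings, for illustration only.
tight_list = [(5.0, 4.0, 5.0), (5.2, 4.1, 5.1), (4.8, 3.9, 4.9)]
spread_list = [(2.0, 6.0, 3.0), (8.0, 2.0, 7.0), (5.0, 4.0, 5.0)]
print(mean_distance_from_centroid(tight_list))
print(mean_distance_from_centroid(spread_list))
```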

Procedure
The goal of these first two experiments was to provide a comprehensive direct comparison of semantic and phonological similarity effects on confusion errors, here measured through the ability to report the items in their serial order (i.e., order memory), as classically done in the serial recall literature. The items were words, and their context was the word's ordinal position in the list. The task is illustrated in Fig. 3, upper panel. Each trial began with a central fixation point presented for 500 ms, followed by the presentation of the study list. Study lists consisted of six words presented sequentially at the center of the screen in Courier font. Each word was presented on screen for 1000 ms, followed by the next word with no inter-stimulus interval. Directly after the presentation of the last item, the retrieval phase began. On half the trials, participants were asked to perform serial recall. When this occurred, a prompt box appeared at the center of the screen, and participants were asked to type each word in the order in which they appeared. To validate each response, they pressed "Enter". To help participants keep track of the within-list position, each prompt box was associated with a number at the bottom of it, starting from "1". If participants did not know a given item, they were invited to leave the prompt box empty and move on to the next item, resulting in an omission error. On the other half of the trials, participants were asked to perform order reconstruction. When this occurred, the six words appeared again on the screen on a single line in a pseudorandom order. Using their computer mouse, participants sequentially clicked on each item to reconstruct the order in which the words had appeared at encoding. After each click, the selected word was replaced by a string of "#" characters. This was done to ensure that each word was discarded from the competition after being selected. Participants performed four training trials (i.e., two in each recall condition) before beginning the main experiment.

Fig. 3. Note. Exp. 1a & 1b: six items appeared sequentially in the middle of the screen for 1000 ms each. At retrieval, participants were either asked to perform a serial recall or order reconstruction task. Exp. 2a & 2b: six items appeared sequentially in the middle of the screen for 1000 ms each. At retrieval, participants were sequentially cued with the positions and had to recall the items (cued recall of words) or were sequentially cued with the items and had to recall the positions (cued recall of positions). Exp. 3a & 3b: five items appeared sequentially on the screen on an invisible circle for 750 ms. Each word was preceded by a dot presented for 250 ms, indicating the exact center of the to-be-remembered word on the screen. On half the trials, participants were cued with a spatial location and were required to type the word associated with it (cued recall of words). On the other half, they were cued with a word and were required to report its spatial location (cued recall of spatial locations). The retrieval direction associated with each task is indicated on the right side.
The purpose of this experimental procedure was to test the impact of similarity on memory for order in a more controlled way than has previously been done. Both the serial recall and order reconstruction procedures require the disambiguation of WM traces by comparing them to a set of candidates. In serial recall, these candidates are the items stored in long-term memory. In order reconstruction, the candidates are the list items provided at retrieval. The type of recall test (i.e., serial recall, order reconstruction) was not revealed before the retrieval phase and was pseudorandomly assigned to each trial. This procedure ensured that the lists were encoded in the same way for each recall type, an aspect which has rarely been controlled in previous similarity manipulations. The order-reconstruction task has the advantage of providing a pure test of order memory, as item errors are impossible. The serial recall task prevented participants from memorizing only the first letter of each word, a strategy that would be successful for the order reconstruction task and would have neutralized the similarity manipulations. Instead, each item needed to be encoded as a whole to achieve reasonable recall performance in the serial recall task.
In sum, there were four different experimental conditions: two recall procedures (serial recall, order reconstruction) and two similarity conditions (similar, dissimilar). There were 20 and 21 trials for each experimental condition in Experiments 1a and 1b, respectively.

Scoring procedure
Different scoring procedures reflecting different aspects of WM were used. First, we computed participants' ability to recall the identity of the items in the memory list. Second, we computed participants' ability to recall the items in their correct position. As only the latter is theoretically relevant for item-context binding, it was particularly important to measure it in a way that is not confounded by item memory.² In the following paragraphs, we explain in more detail how these scores were computed.
In the serial recall task, we first computed an item recall score, for which an item was considered correct if recalled, regardless of the position at which it was output at retrieval. For instance, given the target sequence "Item1 - Item2 - Item3 - Item4 - Item5 - Item6" and the recalled sequence "Item1 - Item3 - blank - Item5 - blank - Item6", Item1, Item3, Item5 and Item6 would be considered as correct. This criterion, also illustrated in Fig. 2, measures the ability to recall item identity. Second, we computed order memory, as the proportion of items recalled at their correct position out of the number of items recalled regardless of their position. This proportion, also illustrated in Fig. 2, was computed by first coding all items not recalled at all as missing values, and then averaging for each participant the number of items correctly recalled in correct order at each serial position. These scores are equivalent to the order recall score usually used to assess the impact of experimental manipulations on memory for order information (Saint-Aubin & Poirier, 1999a). One problematic aspect of this measure is that it depends on items being recalled at all; items not recalled cannot provide any information regarding order memory. The order reconstruction task solves this potential issue.
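The two serial-recall scores can be sketched for a single trial as follows. This is a simplified illustration rather than the authors' analysis code: the function name is hypothetical, blanks are coded as `None`, extra-list intrusions are treated like blanks, and the score is computed per trial rather than aggregated per participant and serial position.

```python
def score_serial_recall(target, recalled):
    """Return (item_score, order_score) for one trial.

    target:   list of studied words in presentation order (assumed distinct).
    recalled: list of responses by output position; None marks a blank.
    Item score:  studied words recalled anywhere / list length.
    Order score: words recalled in their correct position / words recalled at all
                 (None when nothing was recalled, i.e. the score is undefined).
    """
    recalled_items = {word for word in recalled if word in target}
    n_recalled = len(recalled_items)
    n_in_position = sum(t == r for t, r in zip(target, recalled))
    item_score = n_recalled / len(target)
    order_score = n_in_position / n_recalled if n_recalled else None
    return item_score, order_score


target = ["Item1", "Item2", "Item3", "Item4", "Item5", "Item6"]
recalled = ["Item1", "Item3", None, "Item5", None, "Item6"]
item_score, order_score = score_serial_recall(target, recalled)
print(item_score, order_score)
```

For the example sequence given in the text, four of six items are recalled and two of those four are in their correct position.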
In the order reconstruction task, participants are asked to reconstruct the order of the to-be-remembered items. Accuracy is measured as the proportion of items chosen in their correct ordinal position. As maintenance of item information is not required in this task, reconstruction accuracy provides an unambiguous measure of the extent to which participants remember their order.

Data analysis
We conducted Bayesian analyses using the BayesFactor package (Morey & Rouder, 2014) implemented in R. Evidence in favor of a model over a comparison model is given by the Bayes Factor (BF). It reflects the likelihood ratio of a given model relative to a competing model, for instance the null model. The BF10 is used to denote the likelihood ratio for the alternative model relative to the null model, and the BF01 to denote the likelihood ratio for the null model relative to the alternative model. We use the classification of strength of evidence proposed in previous studies (Jeffreys, 1998): a BF of 1 provides no evidence, 1 < BF < 3 provides anecdotal evidence, 3 < BF < 10 provides moderate evidence, 10 < BF < 30 provides strong evidence, 30 < BF < 100 provides very strong evidence, and 100 < BF provides extreme/decisive evidence. In the main analyses of Experiments 1 through 3, each effect of interest was tested using a Bayesian paired-samples t-test using the aggregated data (i.e., data averaged for each participant) as dependent variable. We also report the 95% Bayesian Credible Intervals using the highest density intervals of the sampled posterior distribution of the model under investigation (number of iterations = 10^5). We used the default medium Cauchy prior distribution with scale = √2/2. On each graph, we report the 95% within-subject Confidence Intervals for each mean.
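The evidence classification above can be expressed as a small helper. This is a sketch with a hypothetical function name; because the classification is stated with strict inequalities, the treatment of values falling exactly on a boundary (3, 10, 30, 100) is an assumption here, and such values are assigned to the higher category.

```python
def evidence_label(bf10):
    """Map BF10 to (favored model, strength of evidence).

    Values below 1 are inverted (BF01 = 1 / BF10) and read as evidence for the null.
    Boundary values are assigned to the higher category (assumption).
    """
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf10 == 1:
        return "neither", "no evidence"
    favored = "alternative" if bf10 > 1 else "null"
    bf = bf10 if bf10 > 1 else 1.0 / bf10
    for upper, label in [(3, "anecdotal"), (10, "moderate"), (30, "strong"), (100, "very strong")]:
        if bf < upper:
            return favored, label
    return favored, "extreme/decisive"


print(evidence_label(5.47e19))    # a very large BF10
print(evidence_label(1 / 7.035))  # BF01 of about 7, expressed as BF10
```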

Results
Detailed statistical values across all experiments are reported in Table 2.

Serial recall
As can be seen in Fig. 4, left panels, similar items were recalled more often than dissimilar items as shown by better item memory accuracy, and this difference was supported by decisive evidence for both the semantic (BF10 = 5.47e+19) and the phonological dimensions (BF10 = 1.464e+13). In contrast to item memory, confusion errors did not behave the same way across the semantic and phonological dimensions (see Fig. 4, middle panels). As expected, phonologically dissimilar items were recalled more often in their correct order than phonologically similar items, and this difference was supported by decisive evidence (BF10 = 8.675e+14). Hence, phonological similarity increased confusion errors. However, semantic similarity did not influence participants' ability to recall the words in their correct order, and hence had no influence on confusion errors. This absence of an effect was supported by moderate evidence (BF01 = 7.035).

Order reconstruction
In the order reconstruction task, there was no obvious increase of confusion errors for semantically similar over dissimilar lists (see Fig. 4, upper right panel), and moderate evidence supported this absence of a difference (BF01 = 6.321). In contrast, confusion errors increased for phonologically similar vs. dissimilar lists of items (see Fig. 4, bottom right panel), and this difference was supported by decisive evidence (BF10 = 4.07e+8).
² Researchers traditionally report the proportion of items recalled in correct position for serial recall tasks. This score has the disadvantage of blending item memory and item-context binding, and is therefore ambiguous regarding which aspect of WM is affected by a given manipulation. It was therefore not included.

Discussion
Whereas both semantic and phonological similarity increased the number of items people were able to recall to about the same degree (see Fig. 4, left panels), only phonological similarity credibly and consistently impacted confusion errors (see Fig. 4, middle and right panels). These results replicate previous results showing a null impact of semantic similarity on memory for order (e.g., Saint-Aubin & Poirier, 1999b). In the next experiments, we tested the impact of similarity on item-context binding in a more exhaustive manner, by testing both retrieval directions.

Experiments 2a & 2b
Experiments 2a & 2b assessed binding memory between items and ordinal-position contexts, as in Experiments 1a & 1b (see Fig. 3, middle panel). Here we also varied the direction of retrieval: Participants were presented with a position and had to retrieve the item associated with it (i.e., word recall task, context-to-item retrieval direction), or presented with an item and had to retrieve the position associated with it (i.e., position recall task, item-to-context retrieval direction). As for Experiments 1a & 1b, we predicted that semantic similarity would increase confusion errors if semantic information is bound to context the same way as phonological information.

Participants
Young adults aged between 18 and 35 years participated in Experiments 2a & 2b (N = 60 for each experiment). Participants were recruited on the online platform Prolific. All participants were English native speakers, reported no history of neurological disorder or learning difficulty, and gave their written informed consent before starting the experiment. The experiment has been carried out in accordance with the ethical guidelines of the Faculty of Arts and Social Sciences at the University of Zurich.

Material
All materials were identical to those used in Experiments 1a & 1b.

Procedure
Experiments 2a and 2b used the same design as Experiments 1a & 1b, but with two new test procedures: Cued recall of words, given positions, and cued recall of positions, given words. Whereas the cued recall of words requires the retrieval of the words from the positions, the cued recall of positions requires the retrieval of the positions from the items. Each position/item was probed in a random order at retrieval. For instance, the to-be-remembered sequence "freeze, love, puma, artwork, tree, venus" could be probed such that "artwork" had to be retrieved first, followed by "venus", then "freeze", etc. Trials with item cues and trials with position cues were intermixed randomly so that the kind of test was not predictable during list encoding. The task is illustrated in Fig. 3, middle panel. In the cued recall of positions task, participants were presented with a word below a prompt box and were asked to report the serial position at which the word was presented. The recall procedure continued until all positions were probed. The cued recall of words was identical to the cued recall of positions task, except that a number served as cue to retrieve the associated word. The number was presented below the prompt box, indicating the position of the to-be-recalled item.
The novel aspect of Experiments 2a & 2b is the cued recall of positions task, which induces the retrieval direction from item to context. As each position was probed in a random order independent of the order of presentation, this task discouraged participants from mentally recalling the list serially to retrieve the position. This contrasts with the typical serial recall and order reconstruction tasks, in which the retrieval direction from context to item is the most plausible strategy to perform the task. Experiments 2a and 2b manipulated semantic and phonological similarity, respectively. There were again four different experimental conditions: two recall procedures (word recall, position recall) crossed with two similarity conditions (similar, dissimilar). There were 21 trials for each experimental condition in each experiment.

Scoring procedure
For recall of the words from positions (i.e., cued recall of words), the item and order scores were computed as in Experiments 1a & 1b. For recall of the positions from words (i.e., position recall task), performance was analyzed by computing the proportion of positions correctly reported for each cued word. Note that in this task, participants produced a small number of omissions. When this occurred, the observation was treated as missing data to match more closely the order reconstruction and spatial location tasks (cf. Experiments 3a & 3b) in which omission errors are not allowed.

Word recall
As can be seen in Fig. 5, left panels, the results replicate those of Experiments 1a & 1b. Participants recalled more items in the similar than the dissimilar condition, and this difference was associated with decisive evidence both in the semantic (BF 10 = 3.953e+14) and phonological (BF 10 = 1.567e+8) dimensions. Along the phonological dimension, participants recalled the dissimilar items more often than similar items in their correct order (see Fig. 5, lower middle panel), with decisive evidence supporting this difference (BF 10 = 3.79e+7). Hence, people confused the similar items more often than the dissimilar items. In contrast, there was no credible difference (BF 01 = 2.177) in confusion errors between semantically similar and dissimilar lists (see Fig. 5, top middle panel).

Cued recall of positions
Performance in the cued recall of positions task was different for semantic and phonological similarity. As can be seen in Fig. 5, upper right panel, semantic similarity did not credibly (BF 01 = 2.924) impair participants' ability to recall the positions associated with each item. This contrasts with phonological similarity, for which participants confused the positions more often when presented with phonologically similar versus dissimilar items (BF 10 = 1.757e+5), as can be seen in Fig. 5, bottom right panel.

Discussion
Semantic and phonological similarity again enhanced the number of items participants were able to recall (see Fig. 5, left panels). However, only phonological similarity credibly increased confusion errors (see Fig. 5, middle and right panels). The novel result of this experiment is that when the items served as cues to recall positions, phonological similarity impaired recall but semantic similarity did not. According to the cue-similarity principle, similar cues should lead to increased confusion errors compared to dissimilar cues. The absence of an effect of semantic similarity when words were used as retrieval cues forces us to conclude that the meaning of the words played no role in their use as retrieval cues.
In the following experiments, we extended these tests by changing the nature of the context to which items were to be bound, from ordinal position to spatial location. If the findings of Experiments 2a and 2b reflect how meaning is encoded into WM in general, then we should observe them for any item-context binding and not just for item-temporal context bindings.

Experiments 3a & 3b
Experiments 1 and 2 manipulated similarity between items in tasks involving the binding between items and ordinal positions as contexts. Experiments 3a & 3b tested similarity in tasks involving the binding between items and spatial locations as context (Guérard, Tremblay, & Saint-Aubin, 2009). Participants were presented with items at different spatial locations, arranged on a circle (see Fig. 3) and had to memorize each item and its location. At retrieval, they were presented either with a location or an item. When presented with a location, they had to recall the word associated with it (i.e., word recall task, context-to-item retrieval direction). When presented with a word, they were asked to report the location associated with that word on a continuous scale (i.e., spatial location task, item-to-context retrieval direction). The spatial location task enforced the retrieval direction from item to context even more strongly than Experiments 2a & 2b. As the temporal dimension was irrelevant in Experiments 3a & 3b, this further discouraged participants from rehearsing the word list in its presentation order before each response. We expected to find more confusion errors in the semantically similar vs. dissimilar lists if semantic information is bound to context.

Participants
Young adults aged between 18 and 35 years participated in Experiments 3a & 3b (N = 60 for each experiment). Participants were recruited on the online platform Prolific. All participants were English native speakers, reported no history of neurological disorder or learning difficulty, and gave their written informed consent before starting the experiment. The experiment has been carried out in accordance with the ethical guidelines of the Faculty of Arts and Social Sciences at the University of Zurich.

Material
This experiment used the same words as in Experiments 2a & 2b. The number of words to be remembered was reduced from 6 to 5, as the task was slightly more difficult than the previous ones, as informed by a pilot study.

Procedure
Experiments 3a and 3b differed from Experiments 2a & 2b only in the kind of context to which the items were to be bound. The items were words, and the context was the spatial location of each word on the screen. The task is illustrated in Fig. 3, lower panel. Participants encoded 5-item study lists, with each item being sequentially presented in lower case at a pace of 1 item/s (250 ms OFF, 750 ms ON). Each word appeared at a different location on an invisible circle centered around the middle of the screen. The locations were pseudo-randomly sampled with the constraint that the angular distance (in degrees) between any two locations should not be smaller than a pre-defined value (see Appendix B for the methodological details). To ensure that participants could correctly identify the center of each item in an unambiguous manner, the words were preceded by a dot presented for 250 ms, indicating the exact center of each item. Directly after the encoding phase, there was an interval of 1000 ms, followed by the retrieval phase. During the retrieval phase, the circle around which the items were initially presented was always displayed on the screen. As in Experiments 2a & 2b, the items were not tested in their order of presentation, an aspect of the procedure which made the temporal dimension irrelevant. On half the trials, the participants were cued with a previously presented location on the wheel and had to recall the word associated with it by typing it in a prompt box. Participants were asked to leave the box empty if they were not able to retrieve a word. After pressing the "Enter" key, another location was cued, and this process repeated until all memoranda were tested. On the other half of the trials, a word from the to-be-remembered list appeared at the center of the screen written in uppercase. Participants were asked to report on the wheel the spatial location with which the item was associated.
To help participants locate their response as accurately as possible, a dot was continuously presented on the wheel, based on the direction in which the current mouse position deviated from the screen center. To confirm their response, participants clicked on the desired location. The response automatically initiated the next retrieval attempt, until all words were tested. Participants performed four training trials (i.e., two in each recall condition) before beginning the main experiment.
There were again four different experimental conditions: two recall procedures (cued recall, spatial location reproduction) across two similarity conditions (similar, dissimilar). Twenty-one trials were included in each experimental condition in both experiments.
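The constrained sampling of word locations described in the procedure can be sketched with simple rejection sampling. This is only an illustration under stated assumptions: the actual sampling routine and the pre-defined minimum distance are given in Appendix B and are not reproduced here, so the 30-degree value in the usage note, the function names, and the retry limit are all hypothetical.

```python
import random


def circular_distance(a_deg: float, b_deg: float) -> float:
    """Shortest angular distance between two angles on a circle, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def sample_locations(n_items: int, min_separation: float,
                     rng: random.Random, max_tries: int = 100_000) -> list:
    """Draw n_items angles (in degrees) on a circle by rejection sampling.

    A candidate set is accepted only if every pair of angles is at least
    min_separation degrees apart. min_separation is a placeholder for the
    pre-defined value used in the experiments.
    """
    for _ in range(max_tries):
        angles = [rng.uniform(0.0, 360.0) for _ in range(n_items)]
        pairs_ok = all(
            circular_distance(a, b) >= min_separation
            for i, a in enumerate(angles)
            for b in angles[i + 1:]
        )
        if pairs_ok:
            return angles
    raise RuntimeError("No valid set of locations found; relax the constraint.")
```

A call such as `sample_locations(5, 30.0, random.Random(1))` returns five angles for one 5-item trial. Rejection sampling is chosen here for readability only; it becomes inefficient when the minimum separation approaches 360 divided by the number of items.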

Scoring procedure
When participants had to recall the items from their spatial location, the same scoring procedure was used as in Experiments 1a, 1b, 2a and 2b for item memory. Order memory was computed as the proportion of words recalled at their correct spatial location out of the number of words recalled regardless of their location. For the spatial location task, which involved participants reporting the word locations on a continuous circular scale, we measured the absolute angular distance (in degrees) between the participant's response and the target location. We calculated the average absolute angular distance for each condition and each participant.
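The two scores described above can be made concrete with a short Python sketch. It is a minimal illustration rather than the authors' scoring code (which is available on the OSF repository); the function names, the use of `None` to represent an omission, and the handling of trials with no recalled word are assumptions.

```python
def absolute_angular_error(response_deg: float, target_deg: float) -> float:
    """Absolute angular distance (0 to 180 degrees) between response and target."""
    diff = abs(response_deg - target_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_error(responses, targets) -> float:
    """Average absolute angular error over the responses of one condition."""
    errors = [absolute_angular_error(r, t) for r, t in zip(responses, targets)]
    return sum(errors) / len(errors)


def order_score(recalled, studied):
    """Proportion of words recalled at their correct location, out of the
    words recalled regardless of location.

    recalled[i] is the response given for the i-th probed location
    (None for an omission); studied[i] is the word shown at that location.
    Returns None when no list word was recalled (assumed convention).
    """
    studied_set = set(studied)
    n_recalled = sum(1 for r in recalled if r is not None and r in studied_set)
    if n_recalled == 0:
        return None
    n_in_place = sum(1 for r, s in zip(recalled, studied) if r == s)
    return n_in_place / n_recalled
```

Because the order score is conditional on the words that were recalled, it is not inflated by a manipulation that merely increases the number of items recalled. The wrap-around in `absolute_angular_error` ensures that a response at 350 degrees to a target at 10 degrees counts as an error of 20 degrees rather than 340.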

Word recall
As can be seen in Fig. 6, both semantic (upper left panel) and phonological (bottom left panel) similarity credibly (BF 10 = 1.881e+18 and BF 10 = 1.926e+9, respectively) increased the number of items recalled, with decisive evidence supporting a recall advantage for similar vs. dissimilar lists. Semantically similar items were not confused with each other more often than dissimilar items (see Fig. 6, upper middle panel), and this absence of a difference was supported by anecdotal evidence (BF 01 = 2.948). This result contrasts with what is observed in the phonological dimension, with items from phonologically similar lists being confused more often than items from dissimilar lists (see Fig. 6, lower middle panel). This difference was supported by decisive evidence (BF 10 = 7.145e+4).

Cued recall of spatial locations
Results on the spatial location task in Fig. 6, upper right panel, suggest that the semantically similar and dissimilar lists did not substantially differ in angular error, and only anecdotal evidence supported a difference between both semantic conditions (BF 10 = 1.902). If anything, this difference went in the opposite direction of what would be expected if similarity led to more confusion errors. In contrast, it can be seen in Fig. 6, bottom right panel, that phonologically similar lists were associated with higher angular error in reproducing the word's location than phonologically dissimilar lists, and this difference was associated with decisive evidence (BF 10 = 104.162).

Discussion
The present results converge with those from the previous experiments. Whereas both semantic and phonological similarity credibly increased the number of items participants recalled, only phonological similarity increased confusion errors. The phonological similarity effect was still observed even when locations were used as contexts instead of positions. To the best of our knowledge, this result has been reported in the verbal WM literature only once (Guérard et al., 2009) and constitutes an important test of the generality of models in which the core process of encoding into WM is the formation of item-context bindings. In the next section, we re-analyzed our data with a continuous metric of semantic similarity recently proposed in the literature.

Relationship between WM performance and the dimensional view of semantic similarity
A recent meta-regression study suggested that the absence of a detrimental effect of semantic similarity on order memory might be due to an inappropriate measure of semantic similarity (Ishiguro & Saito, 2020). The authors argued that semantic similarity by category membership is confounded with relationships between concepts in a semantic network. Instead, the "true" semantic similarity between items would be better characterized by their shared features. They proposed a three-dimensional feature space encompassing valence, arousal, and dominance (Moors et al., 2013) to measure the similarity between words. The average semantic dissimilarity for a list is computed by taking the Euclidean distance for all list items from their centroid in this space. We explored whether this metric was a credible predictor of confusion errors across all our experiments manipulating semantic similarity (i.e., Experiments 1a, 2a, and 3a). We ran a Bayesian generalized mixed model with serial position and the mean distance from centroid as predictors for the recall success of each list item. Details of this new analysis are reported in Appendix C. We report in Fig. 7, upper panel, the posterior distribution for all models. The results are clear-cut. There was no credible effect of the mean distance from the centroid on confusion errors. No consistent trend was observed throughout the experiments.
We ran similar analyses on the item-memory scores, assuming a Bernoulli distribution. The results are reported in Fig. 7, lower panel. As can be seen, the mean distance from the centroid credibly impacted item memory consistently across Experiments 1a, 2a and 3a. As the distance from the centroid decreased (and therefore semantic similarity increased), memory for item increased. In the next section, we discuss more thoroughly the theoretical implications of these results.
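The list-level dissimilarity metric used in this section can be sketched in a few lines of Python. This is an illustrative reading of the description above (mean Euclidean distance of the list items from their centroid in the valence-arousal-dominance space), not the code used for the reported analyses; the function name and any ratings passed to it are hypothetical.

```python
import math


def mean_distance_from_centroid(vectors) -> float:
    """Mean Euclidean distance of each item from the list centroid.

    vectors holds one (valence, arousal, dominance) tuple per list item.
    Larger values indicate a more dissimilar list on this metric.
    """
    n_items = len(vectors)
    n_dims = len(vectors[0])
    # Centroid: the per-dimension mean across all list items.
    centroid = [sum(v[d] for v in vectors) / n_items for d in range(n_dims)]
    return sum(math.dist(v, centroid) for v in vectors) / n_items
```

A list whose items have identical ratings yields a value of 0, and the value grows as the items spread out in the three-dimensional space. In the analyses reported above, this list-level value served as a predictor alongside serial position.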

General discussion
The present experiments yielded two main outcomes. First, both semantic and phonological similarity enhanced the ability to recall item information. Second, whereas phonological similarity credibly decreased performance in all tasks testing item-context bindings (i.e., order memory, location memory, order reconstruction, cued recall of positions, and cued recall of spatial locations), semantic similarity did not. These results provide strong converging evidence for a dissociation between phonological and semantic similarity effects in WM. Given these results, together with other empirical evidence showing an absence of semantic similarity effect on confusion errors (Neale & Tehan, 2007; Neath et al., 2022; Poirier & Saint-Aubin, 1995; Saint-Aubin & Poirier, 1999b), we conclude that semantic similarity does not negatively affect order and positional memory in tests of WM. If semantic information was bound to a positional or spatial context the same way as phonology, semantic similarity should have led to confusion errors, as observed for phonological similarity (Baddeley, 1966), and other dimensions of similarity (Jalbert et al., 2008; Saito et al., 2008; Visscher et al., 2007).
In the present work, we focused on the item-context binding process of WM. Based on this definition of encoding features into WM, we conclude that WM does not bind meaning to context in the same way as phonology. Other theoretical and modelling approaches would logically reach the same conclusion. For instance, in the Feature Model (Nairne, 1990) as well as its revised version (Poirier et al., 2019; Saint-Aubin, Yearsley, Poirier, Cyr, & Guitard, 2021), items are represented by vectors of perceptual and/or internally generated features. At retrieval, items stored in primary (short-term or working) memory need to be compared to items in secondary (long-term) memory. Similarity in this model leads to increased confusions because the traces in primary memory will be less discriminable when comparing them to items stored in secondary memory. Likewise, in the temporal distinctiveness account (Brown, Neath, & Chater, 2007), similarity is computed as the Euclidean distance between items represented in a multidimensional space (e.g., temporal, phonological). The closer the items in this Euclidean space, the more confusable they are. For all these models, adding the assumption that semantics is represented in WM in the same way as phonology would necessarily result in increased confusion errors for semantically similar vs. dissimilar items, in contrast with our results.

Implications for models of working memory
Based on our results, we propose that semantics does not contribute to WM through the binding of semantic features to context the same way as phonology. One possibility to explain these results is to assume that semantic information is not bound to context at all. How can we explain the recall advantage for semantically similar vs. dissimilar words at the item level, if semantic information is not bound to contexts? There is robust evidence showing that semantic knowledge strongly contributes to WM performance (see Kowialiewski & Majerus, 2020 for a short meta-analysis in serial recall), with lists of semantically similar items being better recalled than lists of dissimilar ones. Results from the present study converge with these observations. The recall advantage for semantically similar vs. dissimilar items can be explained by assuming that WM partly relies on activated long-term memory, as assumed in an embedded processes account of WM (Cowan, 1999; Dell, Schwartz, Martin, Saffran, & Gagnon, 1997; Majerus, 2013; Nee & Jonides, 2013; Oberauer, 2002, 2009). Accordingly, the encoding of an item activates its long-term memory representation, including its meaning. We illustrate in Fig. 8 the mechanistic principles behind this idea. Semantically related items reactivate each other, either via their shared semantic features (Dell et al., 1997) or via lateral excitatory connections (Hofmann & Jacobs, 2014). For instance, when encoding the word "piano", the word "guitar" would in turn be activated (Collins & Loftus, 1975). Thereby, semantically similar list items have increased activation in the semantic network. In many computational models of WM, the success in recalling an item at all depends on its ability to overcome a retrieval threshold. If an item's activation is below the threshold, the model produces an omission.
Accordingly, the higher activation of semantically similar items would help them to overcome this retrieval threshold more often than dissimilar items, leading to a recall advantage for semantically similar vs. dissimilar items which is restricted to item memory. The model presented in Fig. 8 furthermore assumes that semantic features are not directly bound to contexts. This simplifying assumption leads to an absence of a semantic similarity effect on confusion errors.
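The threshold mechanism described above can be illustrated with a deliberately simplified, deterministic toy sketch in Python. It is not the model shown in Fig. 8 and makes no claim about its actual implementation: the category labels standing in for semantic relatedness, the function names, and all parameter values (`binding`, `spread`, `threshold`) are hypothetical and chosen only to show the logic.

```python
def item_activations(categories, binding: float = 0.5, spread: float = 0.2):
    """Activation of each list item at retrieval.

    Each item receives a fixed amount of activation from its item-context
    binding, plus spreading activation from every other list item that
    shares its category (a crude stand-in for semantic relatedness).
    """
    activations = []
    for i, category in enumerate(categories):
        n_related = sum(
            1 for j, other in enumerate(categories) if j != i and other == category
        )
        activations.append(binding + spread * n_related)
    return activations


def recall_outcomes(categories, threshold: float = 0.8, **params):
    """True where an item exceeds the retrieval threshold, False for an omission."""
    return [a > threshold for a in item_activations(categories, **params)]
```

With these hypothetical values, `recall_outcomes(["feline"] * 4)` marks every item as recalled, whereas `recall_outcomes(["body", "plant", "furniture", "animal"])` yields only omissions. Because semantic activation in this sketch adds to item strength without entering the item-context bindings, it changes how many items are recalled but cannot by itself produce confusions between positions.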
Such a model, inspired by embedded processes models of WM, helps to explain the presence of false memories in WM tasks (Abadie & Camos, 2019;Atkins & Reuter-Lorenz, 2008). When presented with a list of semantically similar items such as "leopard, tiger, lion, cheetah", people are more likely to respond "old" in a recognition test when presented with a semantically similar lure, such as "puma", than for a dissimilar lure, such as "desktop". This result can be explained by assuming that activation spreads to similar list items and non-list items. Hence, non-list items that are similar to several or all list items will be activated to some extent (see Fig. 8, similar condition). When presented with a semantically similar lure (i.e., "puma"), people are therefore more likely to say that this item was presented in the list (i.e., responding "old"), because it is now more strongly activated than other dissimilar lures (i.e., "desktop"). In contrast, when people are presented with lists of semantically dissimilar items (e.g., "arm, tree, sofa, mouse"), no such false memories are observed (Cowan, Guitard, Greene, & Fiset, 2022). From the model presented in Fig. 8, this latter result is predicted, because when given a dissimilar list, the activation spreading from list words no longer converges on the same non-list words (see Fig. 8, dissimilar condition).

Alternative explanations
An alternative explanation of the lack of semantic similarity effect on confusion errors is that semantic information is bound to contexts, but for some reason is immune to confusion errors. The only piece of evidence supporting the idea that semantics is bound to contexts comes from Kowialiewski, Gorin, and Majerus (2021). They observed that semantic knowledge can constrain the processing of serial order information. They presented lists composed of two semantically similar triplets (e.g., "leopard, lion, cheetah, arm, elbow, leg"). When items are recalled in a wrong position, they tend to stay within their group of similar items, rather than move to positions that have been occupied by dissimilar items, compared to the same positions in a completely dissimilar list. These results are difficult to explain without assuming that at least some form of meaning is bound to contexts. Meaning could use a different representational format, such as sparse distributed representations (Kanerva, 1988). Using a sparse code for items' meaning would prevent semantically similar items from being confused with each other, while still allowing the cognitive system to have some information about which semantic category was in which list position.
However, Kowialiewski et al.'s (2021) results can also be explained by assuming that people augment the positions of semantically similar items with a shared positional context. A similar assumption is already made in positional models to explain temporal grouping effects in serial recall (Burgess & Hitch, 1999; Henson, 1998). If semantic groups are represented like temporal groups, semantically similar items would be associated with similar positional contexts. This leads to the prediction that transposition errors should occur more often between items from the same (semantic or temporal) group than with items from another group. This explanation doesn't require semantic information to be bound to contexts.
A previous meta-regression study suggested that the absence of a detrimental effect of semantic similarity on order memory boils down to a wrong measurement of semantic similarity (Ishiguro & Saito, 2020). The authors proposed the VAD metric of semantic similarity comprising a three-dimensional space encompassing valence, arousal, and dominance. We did not find credible evidence that VAD negatively impacts binding memory. In contrast, this metric credibly predicted item memory performance. Given these results, we have no reason to believe that the VAD metric should be considered differently than other standard metrics of semantic similarity, such as LSA-cosine (Landauer & Dumais, 1997).
Finally, it could be argued that the absence of a semantic similarity effect on memory for order is due to semantic knowledge not being activated in our WM task, perhaps because it needs more time to be activated. This explanation is unlikely for the following reasons. First, access to meaning is an automatic and extremely fast process, especially in language (Cheyette & Plaut, 2017; Potter, 1976; Potter, Wyble, Hagmann, & McCourt, 2014; Tyler, Moss, Galpin, & Voice, 2002). Second, the fact that we observed very strong beneficial effects of semantic similarity on item memory goes against this claim. It shows that people had access to words' meaning and used it to increase the number of items they could recall. Strong semantic similarity effects can even be observed in running span procedures using fast presentation of memoranda (Kowialiewski & Majerus, 2018).

Fig. 8. A model of the semantic similarity effect. Note. When encoding items into WM, a new binding is created between this item and its context. At the same time, this item becomes activated in semantic long-term memory. Semantically similar items are assumed to have direct connections in the semantic network and spread activation to each other. When trying to retrieve an item by cueing it with its context, this item has an activation level, which is a combination of the activation provided by the item's binding to its context and its activation in semantic memory. If the activation level of the item is beyond a retrieval threshold, it is recalled. Otherwise, an omission is produced. When semantically similar items are encoded in the same list, they have a higher activation level thanks to the spreading of activation principle, which helps them to overcome the retrieval threshold more often than semantically dissimilar items.

Possible limitations
One possible objection to our interpretation is that phonological and semantic similarity measurements were not equivalent. This is unlikely because both kinds of similarity manipulations led to comparably strong impact on item memory, showing that people were able to detect the presence of similarities to about the same extent across both manipulations. In addition, strong phonological similarity effects on order memory can already be observed with much weaker manipulations than ours, for instance when list items share only one phoneme (Camos, Mora, & Barrouillet, 2013; Fallon, Mak, Tehan, & Daly, 2005; Gupta et al., 2005). Furthermore, we are confident that our semantic similarity manipulation was a robust one, as our similar and dissimilar lists strongly differed across several semantic-similarity metrics (see Table 1). If the item-context binding process was subject to confusion errors driven by semantic similarity, we would have expected at least small detrimental effects on memory for order.
It is also possible that the measures we used for item and confusion errors do not reflect what we wanted to measure. For instance, it has been argued that order reconstruction is not a pure measure of confusion errors, and could also partially reflect item memory (Neath, 1997). Contrary to this latter claim, three main outcomes support the validity of our measures. First, none of the semantic manipulations affected confusion errors, despite strongly affecting item memory. If our measures of confusion errors were not process pure, they should have been affected by semantic similarity in one way or another. This was not observed. Second, the rhyming manipulation led to a dramatic drop of performance on order memory, despite strongly enhancing item memory. If our confusion-error measures were also affected by item information, we shouldn't have observed these divergent effects of phonological similarity on memory for item and confusion errors. Finally, all measures of confusion errors converged toward the same pattern of performance. The results illustrated in Fig. 3, middle and right panels, clearly indicate similar performance level and serial position curves across all experiments and similarity manipulations. We can therefore be confident that all our measures of confusion errors reflect the same construct.

Conclusion
To sum up, we tested how phonological and semantic similarity impacted the maintenance of novel item-context bindings in WM. Our exhaustive tests showed that phonological similarity increases confusion errors, leading to a performance decline in all WM tasks we used. By contrast, across all experiments, semantic similarity did not increase confusion errors and did not decrease WM performance. These results imply that there is a fundamental difference between the representation of semantics and phonology in verbal WM. Either semantics is not bound to contexts, or it is bound to contexts, but in a different way than other kinds of information, such that it does not lead to confusion errors. The benefit of semantic similarity on item memory can be explained by assuming that semantically similar items activate each other in long-term memory through their associations in a semantic network.

Open Science statement
All the data and codes have been made available on the Open Science Framework: https://osf.io/tpsg2/

Author contributions
B. Kowialiewski and K. Oberauer designed the experiments. B. Kowialiewski programmed the experiments, collected and analyzed the data, and drafted the initial manuscript. E. Mizrak, J. Krasnoff, and K. Oberauer provided critical feedback and revisions. All authors approved the final manuscript for submission.

Data availability
All the materials, codes, data, and data analyses across all experiments have been made available on the Open Science Framework: https://osf.io/tpsg2/

[...] constrained to have a Levenshtein distance above the value of two to ensure sufficient phonological dissimilarity between the items in a dissimilar list.
In addition, we kept semantic similarity equal between the phonologically similar and dissimilar lists. Therefore, LSA (latent semantic analysis) values were obtained for each pair of stimuli within each list using the TASA semantic space available at the following address: https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces (see also Günther, Dudschig, & Kaup, 2015). We then compared the LSA values between the dissimilar and the similar lists. Dissimilar lists were only included in the experiment if there was no evidence for a difference in LSA similarity between them and the similar lists. As a criterion we determined a BF superior to 3 in favor of the absence of a difference (obtained in a Bayesian independent samples t-test). If the BF was below 3, new dissimilar lists were generated until this criterion was met.

Therefore, if participants applied a forward recall for each item going over each input position until they reach the cued position or the cued item, we would have observed a systematic linear increase in response times as a function of input position. This was not systematically observed.