Modelling orthographic similarity effects in recognition memory reveals support for open bigram representations of letter coding

A variety of letter string representations has been proposed in the reading literature to account for empirically established orthographic similarity effects from masked priming studies. However, these similarity effects have not been explored in episodic memory paradigms, and very few memory models have employed orthographic representations of words. In the current work, through two recognition memory experiments employing word and pseudoword stimuli respectively, we empirically established a set of key orthographic similarity effects for the first time in recognition memory – namely the substitution effect, transposition effect and reverse effect in recognition memory of words and pseudowords, and a start-letter importance in recognition memory of words. Subsequently, we compared orthographic representations from the reading literature, including slot coding, closed-bigram, open-bigram and the overlap model. Each of these representations was situated in a global matching model and fitted to recognition performance via Luce's choice rule in a hierarchical Bayesian framework. Model selection results showed support for the open-bigram representation in both experiments.


Introduction
Recognition memory is an episodic memory task where participants are required to differentiate items they have previously encountered from those they have not. Many models have described retrieval in recognition as a global matching process in which memory strength is determined by calculating the similarity between the test probe and each of the representations stored in memory (e.g. Cox & Shiffrin, 2017; Gillund & Shiffrin, 1984; Hintzman, 1984; Osth & Dennis, 2015a; Shiffrin & Steyvers, 1997; see Osth & Dennis, 2020, for a review). According to global matching, memory errors are a direct consequence of similarity: an increase in the similarity between lures and stored representations in memory will increase false recognition, thus hurting memory performance.
The current study focuses on the similarity structure of word stimuli, and in particular perceptual similarity. Perceptual similarity of words has a clear consequence on memory. In a perceptual category length paradigm where category length (i.e. the number of studied items from a phonological and/or orthographic category, e.g. lack, pack, black, and flack) is varied, false alarm rates to lures from the same category (e.g. rack) increase with increased category length (e.g. Heathcote, 2003; Shiffrin et al., 1995; Steyvers, 2000). In the perceptual Deese-Roediger-McDermott (DRM; Deese, 1959; Roediger & McDermott, 1995; Sommers & Lewis, 1999) paradigm, where perceptual associates are learned during the study phase (e.g. lack, pack, flack) which are all related to a non-presented critical item (e.g. black), false recall and recognition of the critical item has been consistently observed (see Chang & Brainerd, 2021; Coane et al., 2021, for reviews).
Despite the well documented effect of perceptual similarity on memory performance, very little work addresses what type of perceptual representation of words underlies these effects. The present work is focused explicitly on this issue, where perceptual representations are defined using orthographic representations: how letters are arranged together. As we will elaborate below, a variety of orthographic representations has been proposed in the reading and visual word recognition literature, where some of the schemes code letters in their absolute positions, such as slot coding and the overlap model (Gomez et al., 2008), while others code letters in their positions relative to each other, such as closed- and open-bigram models (see Davis & Bowers, 2006; Hannagan et al., 2011, for reviews). Each of these proposed representations has its own consequences for similarity between pairs of letter strings, which have traditionally been tested against similarity phenomena such as priming effects in visual word recognition studies, which reveal the relative similarity structure of letter string pairs (e.g. Davis & Bowers, 2006; Gomez et al., 2008). However, such representations have made little contact with episodic memory data.
A recent exception is work by Osth and Zhang (2023), who applied a number of models of absolute- and relative-position coding to four datasets within a global matching model. While such representations were generally successful in capturing the high false recognition rates to orthographically similar lures, it was difficult to distinguish between the representations, as no representation emerged as a consistent winner across the datasets. In the current work, we attempt to adjudicate between these representations by testing key benchmark orthographic similarity effects in a recognition memory paradigm. As we will elaborate below, although these similarity effects have been consistently observed in psycholinguistic lexical decision and naming tasks (e.g. Ferrand & Grainger, 1992; Forster et al., 1987; Lupker et al., 2020; Perea & Lupker, 2003a, 2003b; Schoonbaert & Grainger, 2004), to our knowledge they have not yet been explored in recognition memory paradigms. Subsequently, different orthographic representations will be situated in global matching models and fit to recognition performance, where we can compare whether each of the orthographic representations is able to capture these effects.

Orthographic representations
As mentioned previously, orthographic representations can be classified according to whether letter orders are coded in their absolute positions within the letter string or whether the letters are coded in relative position to each other. Below we review four different orthographic coding schemes, two from each class, with simple examples of similarity calculations between pairs of letter strings. The complete formulation of the similarity calculation defined by each scheme is provided in the Computational Modelling section.
Absolute position coding of word orthography holds that the positions of individual letters in a word are represented by coordinates in a common reference frame with reference to one or more origins. The simplest approach is slot coding, in which each letter has a single position coordinate originating at the first letter of the word. That is, a word-form can be considered as a series of slots, each containing a letter of the word in its position of occurrence from the beginning of the word. For example, the word stop would be represented as a collection of four letter codes: {s1, t2, o3, p4}, and the word shop can be represented as {s1, h2, o3, p4}, where each letter code Ln is a unique code for letter L in position n. Similarity of two words is determined by the number of matched letter codes, in this example three letter codes: s1, o3 and p4. Common to all representations we explored in the current study, the matches between a pair are summed together and divided by the alignment length, which is defined as the length of the longer word of a pair, to produce a measure of similarity between 0 and 1. Therefore, according to slot coding, the similarity value between the pair stop and shop would be 3/4 = 0.75.
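The slot-coding similarity just described can be sketched in a few lines of Python (the function name is ours; the calculation follows the definition above):

```python
def slot_similarity(a: str, b: str) -> float:
    """Slot coding: count letters matching in the same absolute
    position, normalised by the length of the longer string."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

# Worked examples from the text:
slot_similarity("stop", "shop")         # s, o, p match -> 3/4 = 0.75
slot_similarity("displacing", "place")  # no aligned matches -> 0.0
```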
Slot coding has traditionally been the most extensively used approach to word-form representation. It is adopted in the interactive activation model (McClelland & Rumelhart, 1981), the multiple read-out model (MROM; Grainger & Jacobs, 1996), the dual-route cascaded (DRC) model (Coltheart et al., 2001), the activation-verification model (Paap et al., 1982), and the connectionist dual process model (CDP++; Perry et al., 2010). However, due to the strict coding of letter positions, any slight disruption in the absolute position of letters will pose a challenge for slot coding and result in a loss of similarity. For example, the word displacing, coded as {d1, i2, s3, p4, l5, a6, c7, i8, n9, g10}, will have no matches to the word place, coded as {p1, l2, a3, c4, e5}, and consequently the similarity between the two strings is zero despite the fact that the word place is a subset of the word displacing.
Strict slot coding can be made more flexible by incorporating position uncertainty (Ratcliff, 1981), as in the overlap model (Gomez et al., 2008). In the overlap model, instead of associating letters with single positions, each letter is distributed across multiple positions according to an uncertainty function that is centred on the true position relative to the beginning of the word. This means that letters are associated with multiple positions, but most strongly associated with their true position in the letter string. For example, as shown in the top panel of Fig. 1a, the letter h in the word shop is represented in all four positions, but most strongly in the second position and most weakly in the fourth position.
The position uncertainty in the overlap model is formalized using Gaussian functions. Specifically, each letter is associated with a Gaussian function centred on its true position, and the standard deviation of the Gaussian function is a free parameter varying by letter position (Gomez et al., 2008). The overlap model assumes that position uncertainty only occurs in one letter string of a pair, which is often a briefly presented item. The other letter string of the pair, often presented for a longer if not unlimited time, has no position uncertainty, meaning each letter is coded exactly in its true position, as in slot coding (bottom panels of Fig. 1). Similarity between two letter strings is determined by the extent of overlap shared by the positional distributions of the two strings (illustrated as the shaded area in the top panels of Fig. 1). For example, as shown in Fig. 1a, the match values for the shared letters s, o, and p between the pair stop and shop are calculated as 0.95, 0.29 and 0.28 respectively (for calculation details see Equation (4)). The match values are summed together and divided by the alignment length of the pair (i.e. 4), yielding a similarity value of 0.38. Note that the level of position uncertainty varies by letter position, which affects the corresponding match strengths, a point we will return to in the General Discussion. Due to position uncertainty, the overlap model is able to capture the similarity between the pair place and displacing, which slot coding fails to do. According to the overlap model, the shared letters p, l, a, c are also represented in nearby positions; therefore their representations overlap between the two words, contributing to similarity between the two strings.
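A discretised sketch of this calculation in Python is below. The per-position standard deviations are illustrative placeholders, not the fitted values from Gomez et al. (2008), so the match values will not reproduce the 0.95/0.29/0.28 figures above; the sketch only illustrates how graded position uncertainty produces partial matches.

```python
import math

def overlap_similarity(probe: str, stored: str, sds=None) -> float:
    """Sketch of the overlap model: each probe letter is spread over
    positions by a Gaussian centred on its true position (normalised
    over the alignment length); the stored string is slot-coded
    exactly. A shared letter's match is the probe letter's weight at
    the stored letter's position. SDs here are illustrative only."""
    n = max(len(probe), len(stored))
    # Assumption: uncertainty grows with position; not the fitted SDs
    sds = sds or [0.5 + 0.25 * i for i in range(len(probe))]
    total = 0.0
    for i, letter in enumerate(probe):
        weights = [math.exp(-0.5 * ((p - i) / sds[i]) ** 2) for p in range(n)]
        norm = sum(weights)
        for j, other in enumerate(stored):
            if other == letter:
                total += weights[j] / norm
    return total / n
```

Unlike strict slot coding, this yields a non-zero similarity for place and displacing, and rates the transposition neighbour CLOD as more similar to COLD than the double substitution neighbour CURD.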
Relative position coding schemes, on the other hand, code the order of letters directly by their relationship with each other within a word. In other words, a letter string is represented as a set of n-grams, and similarity between two letter strings is determined by the number of shared n-grams between them. The most common implementation is coding letter strings as bigrams. In the closed-bigram model, words are represented by adjacent bigrams only. For example, the words stop and shop would be coded as {_s, st, to, op, p_} and {_s, sh, ho, op, p_} respectively, where _ indicates the word boundaries associated with the start and end letters of the word. We can calculate the similarity between the pair in the same manner as the previous schemes: the summed number of shared bigrams between the pair stop and shop (i.e. 3 bigrams: _s, op, p_) divided by the alignment length, defined here as the number of closed bigrams of the longer of the pair (i.e. 5), to produce a similarity value of 3/5 = 0.6. Relative position models are also sensitive to similar letter strings where the absolute position is disrupted. Returning to the example word pair displacing and place, the closed-bigram model leverages the common bigrams pl, la, and ac in the common substring plac in the similarity calculation.
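The closed-bigram calculation can be sketched as follows (we use Python sets, which collapse any repeated bigrams; a simplification that does not affect the worked examples here):

```python
def closed_bigrams(word: str) -> set:
    """Adjacent bigrams of a word, with _ marking the word boundaries."""
    w = "_" + word + "_"
    return {w[i:i + 2] for i in range(len(w) - 1)}

def closed_bigram_similarity(a: str, b: str) -> float:
    """Shared closed bigrams divided by the bigram count of the longer word."""
    ga, gb = closed_bigrams(a), closed_bigrams(b)
    return len(ga & gb) / max(len(ga), len(gb))

closed_bigram_similarity("stop", "shop")  # shares _s, op, p_ -> 3/5 = 0.6
```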
An extra level of flexibility can be introduced by loosening the contiguity requirement, which means that bigrams can be separated by one or more letters. This is adopted in open-bigram schemes, in which a word string is encoded in terms of all possible letter pairs instead of just the adjacent letter pairs. For example, the words stop and shop would be coded as {_s, st, so, sp, to, tp, op, p_} and {_s, sh, so, sp, ho, hp, op, p_} respectively. This pair shares 5 bigrams (i.e. _s, so, sp, op, p_), and the similarity between the pair is calculated as 5/8 = 0.625, where 8 is the alignment length (i.e. the number of open bigrams of the longer of the pair). Note that in the current work, we only allow word boundaries for the start and end letters, instead of all possible open bigrams involving the boundaries. To avoid an explosion of possible bigrams as word length increases, we follow a constrained open-bigram model (Grainger & van Heuven, 2003) where we only encode bigrams separated by up to two letters. For example, the 5-letter word trial would be coded as {_t, tr, ti, ta, ri, ra, rl, ia, il, al, l_}, where bigrams separated by three or more letters, such as tl, are not encoded. Open-bigram schemes are employed in visual word recognition models such as the sequential encoding regulated by inputs to oscillations within letter units (SERIOL) model (Whitney, 2001) and the OB1-Reader (Snell et al., 2018). A different approach to letter position coding is spatial coding (Grossberg, 1978), as adopted by the self-organising lexical acquisition and recognition model (SOLAR; Davis, 1999, 2010). Spatial coding blurs the distinction between the absolute- and relative-position coding models: it assigns monotonic values to letters as they appear in a string, but calculates similarity based on the relative alignment of letter positions in two strings. Spatial coding is also flexible in capturing similarity with disruption in absolute positions of letters. However, we did not fit this model to data in the current study since it is computationally intractable. We will return to this model in the General Discussion.
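The constrained open-bigram scheme described above (boundary bigrams for exterior letters only; bigrams separated by at most two letters) can be sketched as:

```python
def open_bigrams(word: str, max_gap: int = 2) -> set:
    """Constrained open bigrams: ordered letter pairs separated by at
    most `max_gap` intervening letters, plus boundary bigrams for the
    exterior letters only."""
    grams = {"_" + word[0], word[-1] + "_"}
    for i in range(len(word)):
        # j ranges over letters at most max_gap positions beyond i + 1
        for j in range(i + 1, min(i + max_gap + 2, len(word))):
            grams.add(word[i] + word[j])
    return grams

def open_bigram_similarity(a: str, b: str) -> float:
    """Shared open bigrams divided by the bigram count of the longer word."""
    ga, gb = open_bigrams(a), open_bigrams(b)
    return len(ga & gb) / max(len(ga), len(gb))

open_bigram_similarity("stop", "shop")  # 5 shared of 8 -> 0.625
len(open_bigrams("trial"))              # 11 bigrams; tl is excluded
```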

Orthographic similarity effects from masked priming studies
A number of benchmark findings regarding the orthographic similarity structure of words have been reported in the psycholinguistics of reading literature, which are used to test proposed orthographic representations. These orthographic similarity effects have typically been observed in masked priming studies (Forster & Davis, 1984), where a prime is presented very briefly (typically for 50 ms or less) in lowercase, followed by a target in uppercase (e.g. abc – ABCD). The similarity between the prime and target stimuli is determined by the amount of facilitation the prime has on the processing of the target. This is commonly measured by the speed at naming the target or at determining whether the target letter string is a word or nonword. Masked priming studies have established that primes which are orthographically similar to the target (typically a nonword formed through minor alteration of the target word) have a facilitatory effect on the processing of the target word in comparison to orthographically unrelated letter string primes (e.g. Ferrand & Grainger, 1992; Forster et al., 1987; Rueckl, 1990; Schoonbaert & Grainger, 2004). Here, we briefly review three main types of orthographic similarity effects revealing the similarity structure between prime and target items that have been consistently reported in masked priming studies and other paradigms, namely the substitution effect, the edge effect, and transposition effects (for reviews see Grainger, 2008; Hannagan et al., 2011). Following this, we discuss whether different orthographic representations can account for these effects. It is important to note that although these orthographic similarity effects have been robustly observed in masked priming studies, to our knowledge they have not yet been tested in recognition memory.

Substitution effect
The most straightforward and intuitive effect is the substitution effect. It is consistently observed that primes formed by substituting a single letter of the target (i.e. Single Substitution Neighbours) have a facilitatory effect on target processing, for example, card – CURD. The facilitatory effect is observed even when the position of the substituted letter is shifted (Davis & Bowers, 2006), but the facilitation is very weak, if present at all, when the substitution involves two letters (i.e. Double Substitution Neighbours, for example cold – CURD) in comparison to unrelated control strings (e.g. file – CURD; Schoonbaert & Grainger, 2004).

Transposition effects
Perhaps the most consequential empirical benchmark for adjudicating between the orthographic representations is the transposition effect, which less flexible schemes are unable to capture (e.g. Davis & Bowers, 2006), as we will discuss in the next section. It is robustly observed in masked priming studies that a prime formed by transposing two adjacent internal letters of the target produces more facilitation in the processing of the target than substituting a single internal letter (e.g. Andrews, 1996; Chambers, 1979; Forster et al., 1987; Rueckl & Rimzhim, 2011), or substituting the two transposed letters entirely (e.g. Acha & Perea, 2008; Lupker et al., 2008; Perea & Lupker, 2003a, 2003b). The transposition effect extends to more extreme transpositions. For example, the transposition effect has been found when primes involve several transposed letter pairs (e.g. Guerrera & Forster, 2008), and when the transposed letters are non-adjacent (e.g. Acha & Perea, 2008; Lupker et al., 2008; Perea et al., 2008; Perea & Lupker, 2004). A special case is Reversed-Interior primes, which involve the reversal of the interior letters (e.g. cetupmor – COMPUTER), as initially investigated by Whitney et al. (2012) using 7-letter-long words. Such primes can be considered a combination of several transpositions (i.e. two pairs of distant transpositions and one pair of adjacent transposition in this example) and have been found to have a greater facilitatory effect than unrelated primes in priming studies (Davis & Lupker, 2017). To reduce confusability, in the remaining text we use the term "transposition effect" to refer to single interior adjacent transposition only, and "reverse effect" to refer to the reversed-interior effect.

Edge effect
The fact that exterior letters (i.e. the first and the last letter in a word) are more consequential for word identification has been demonstrated in a variety of paradigms (e.g. Jordan et al., 2003; Stevens & Grainger, 2003). One intuitive reason for the important role of exterior letters is that the blank spaces at the border of a word lead to less lateral interference from neighbouring letters for the initial and final letters, therefore improving the perception of exterior letters (Townsend et al., 1971). In masked priming studies, primes that share the same exterior letter pair but have different interior letters from the target word (e.g. lert – LOST) produce more facilitation than primes that share interior letters but have different exterior letters from the target (e.g. gosb – LOST), which produce weak or no facilitation effects (e.g. Humphreys et al., 1990; Jordan et al., 2003; McCusker et al., 1981). Investigating the roles of the start and end letters independently, while primes having the same initial letter and a different final letter as the target have been consistently reported to have a facilitatory effect compared to primes that have a different initial letter, some studies have found that maintaining the end letter does not produce a facilitatory effect in comparison to interior substitution (Grainger et al., 2006; Guerrera & Forster, 2008; Lupker et al., 2020; Whitney, 2001). Therefore, depending on whether the start and/or end letter are more important than interior letters in orthographic representation processing in specific paradigms, a useful orthographic representation needs to be able to make corresponding predictions. Next, we will compare and contrast different orthographic representations in their ability to account for these benchmark effects.

Can different orthographic representations account for the benchmarks?
All coding schemes are able to account for the simple substitution effect. Regardless of whether the basic coding unit is single letters or bigrams, and whether letters are coded in absolute or relative positions, all schemes predict that a single substitution neighbour (e.g. CARD – CURD) shares more matches than a double substitution neighbour (e.g. COLD – CURD); therefore, single substitution pairs have higher similarity values than double substitution pairs (see Table 1 for match calculations).
To account for edge effects, an extra assumption needs to be added to all models: allowing exterior-letter matches to contribute more to the similarity calculation than interior-letter matches. This can be easily implemented in absolute coding schemes, where the matches of the start and/or end letter can be weighted differently from matches of interior letters. For bigram models, because the word boundary is normally represented in a pair with the exterior letters, for example _C and D_ in CURD, the boundary pairs can also be weighted differently. If matches of the exterior letters have higher weights than matches of the interior letters, then exterior letter importance can be captured, because string pairs with different exterior letters (ABCD – XBCY) will have a smaller match value than string pairs with different interior letters (ABCD – AXYD).
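As a sketch of this weighting idea in the simplest case (slot coding over equal-length strings), exterior-letter matches can be given a larger weight. The weight values here are illustrative; in a fitted model they would be free parameters:

```python
def weighted_slot_similarity(a: str, b: str,
                             w_ext: float = 2.0, w_int: float = 1.0) -> float:
    """Slot coding with up-weighted exterior-letter matches.
    Assumes equal-length strings; weights are illustrative."""
    n = max(len(a), len(b))
    weights = [w_ext if i in (0, n - 1) else w_int for i in range(n)]
    score = sum(w for w, x, y in zip(weights, a, b) if x == y)
    return score / sum(weights)

# Mismatching exterior letters now hurts more than mismatching interior ones:
weighted_slot_similarity("ABCD", "XBCY")  # interior matches only -> 2/6
weighted_slot_similarity("ABCD", "AXYD")  # exterior matches only -> 4/6
```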
While all schemes are able to account for the substitution and edge effects, as mentioned previously, the transposition effect places considerably more constraint on the different schemes. The less flexible schemes, slot coding and the closed-bigram model, are completely unable to account for the transposition effect. As mentioned above, in the slot coding model, any slight disruption in the absolute position of letters between two strings will reduce the number of matches between them. Slot coding would predict that the transposition pair COLD and CLOD is no more similar than the double substitution pair COLD and CURD (each pair has 2 matched letter codes, C1 and D4), since the letter codes O2 and L3 in the word COLD do not match the codes L2 and O3 in the word CLOD. Although it employs an absolute position code, the position uncertainty assumption allows the overlap model to account for the transposition effect, because letters are represented in multiple positions. According to the overlap model, the representation of the word COLD also represents letter O in the third position and letter L in the second position, albeit to a weaker degree than their true positions; therefore the word COLD is more similar to its transposition neighbour CLOD than to the double substitution neighbour CURD (Fig. 1b, 1c).
Among relative position coding schemes, the closed-bigram model, because of the contiguity requirement, fails to predict the transposition effect. For example, the transposition pair COLD and CLOD shares two bigrams (_C and D_), the same as the double substitution pair COLD and CURD. Therefore the closed-bigram model predicts the same similarity level for transposition pairs and double substitution pairs. The open-bigram model, on the other hand, is able to predict the transposition effect since it allows non-adjacent bigrams to be encoded. In this case, the transposition pair shares 7 bigrams (_C, CO, CL, CD, OD, LD, D_) and therefore has higher similarity than the double substitution pair COLD and CURD (3 matched bigrams: _C, CD, D_).
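These predictions are easy to verify by counting shared bigrams directly. The helper functions below re-implement the closed- and constrained open-bigram sets described earlier (boundary bigrams for exterior letters only, gaps of at most two letters):

```python
def closed(w):
    """Adjacent bigrams with boundary markers."""
    w = "_" + w + "_"
    return {w[i:i + 2] for i in range(len(w) - 1)}

def opened(w, max_gap=2):
    """Constrained open bigrams with exterior boundary markers."""
    g = {"_" + w[0], w[-1] + "_"}
    g |= {w[i] + w[j] for i in range(len(w))
          for j in range(i + 1, min(i + max_gap + 2, len(w)))}
    return g

# Closed bigrams: transposition pair matches no better than double substitution
len(closed("COLD") & closed("CLOD"))   # 2 shared (_C, D_)
len(closed("COLD") & closed("CURD"))   # 2 shared (_C, D_)

# Open bigrams: transposition pair is clearly more similar
len(opened("COLD") & opened("CLOD"))   # 7 shared
len(opened("COLD") & opened("CURD"))   # 3 shared (_C, CD, D_)
```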

The current study
The current work aimed firstly to empirically establish a number of orthographic similarity effects that have been demonstrated to constrain the relevant orthographic representations – the substitution effect, the edge effect, the transposition effect and the reverse effect – for the first time in a recognition memory paradigm.
While the effects of orthographic similarity have been robustly observed in masked priming paradigms, it is not a foregone conclusion that they will generalize to long-term episodic memory tasks, for the following reasons. First, masked priming studies tend to use a very fast presentation time (around 50 ms or less), while presentation times in recognition memory paradigms are often much longer – between 500 ms and 3 s – which may afford storage of additional features or qualitatively different orthographic features. Second, the two paradigms involve different retrieval processes. The facilitatory effect observed in masked priming studies is conventionally believed to depend on the similarity between the presented pair (i.e. the prime and the target). Recognition, in comparison, is believed to involve a global matching retrieval process where the test probe is compared to all items stored in long-term memory (e.g. Clark & Gronlund, 1996; Osth & Dennis, 2020).

Table 1
Example Similarity Calculations for Single vs. Double Substitution Pairs from Different Models. Note that the level of position uncertainty does not affect the conclusion that the overlap model predicts higher similarity for single substitution pairs than double substitution pairs, since the matched letters for double substitution pairs would always be a subset of the matched letters for single substitution pairs.
Subsequently, we aimed to determine which of the orthographic representations (i.e. slot coding, the overlap model, closed bigram, and open bigram) best captures recognition performance on these benchmark similarity effects through formal modelling. As previously discussed, recognition performance is based on the similarity between the test probe and the study content. Therefore, each of the orthographic representation schemes will be situated in a global matching model to calculate a global similarity index for each test probe, which will subsequently be used to predict item-level recognition accuracy via Luce's choice rule in a hierarchical Bayesian framework.
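The text does not spell out the functional form here, so the following is only a schematic sketch of the two-stage idea: sum the probe's similarity to every studied item, then convert that global match strength to a response probability with a Luce-style ratio. The `criterion` parameter is our illustrative stand-in for the model's fitted parameters:

```python
def global_match(similarities):
    """Global matching: sum the probe's similarity to each studied trace."""
    return sum(similarities)

def p_old(match_strength: float, criterion: float = 1.0) -> float:
    """Luce-style choice rule sketch: 'old' probability is match
    strength relative to match strength plus a criterion term.
    `criterion` is illustrative, not the paper's parameterisation."""
    return match_strength / (match_strength + criterion)

# A lure resembling many studied items accrues a higher global match,
# and therefore a higher false alarm probability:
p_old(global_match([0.75, 0.2, 0.1]))  # orthographically similar lure
p_old(global_match([0.1, 0.05, 0.0]))  # dissimilar lure
```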
Two experiments were conducted; the main difference is that Experiment 1 used word stimuli and Experiment 2 used pseudoword stimuli (pronounceable nonwords). Transposition and reverse pairs are very rare in natural English (as will be detailed later in the Method section), and we were only able to obtain around 40 transposition pairs and around 16 reverse pairs through an exhaustive search. Therefore, a second experiment using pseudowords with more trials was conducted to replicate the findings from Experiment 1.
In both experiments, the study and test lists were constructed from target-lure pairs to test each of the orthographic similarity effects (see Table 2). For example, to test the reverse effect, a study item ABCDE is paired in the test list with a reverse lure ADCBE along with a double substitution control lure AXCYE, to check whether reverse lures are more similar to the study content than control lures, which differ by the same number of letters from the study item as the reverse lures. To foreshadow, we established the substitution effect, the transposition effect, the reverse effect and start letter importance in recognition memory of words. Furthermore, we found that while both the overlap and open-bigram representations were able to capture the transposition effect, the open-bigram model performed best in capturing orthographic similarity effects in recognition memory.

Experiment 1

Participants
A total of 172 participants were recruited in Experiment 1. Participants were recruited online through the University of Melbourne SONA system and either received course credit or were reimbursed for their participation. The study was approved by The University of Melbourne Psychological Sciences Human Ethics Advisory Group (Ethics ID: 1851413). All participants provided informed consent before commencing the experiment. Participants with close-to-chance performance (d' < 0.2) were excluded from data analysis. After exclusion, 162 participants were included in the data analysis.

Materials
Experiment 1 was a browser-based online experiment created using the jsPsych JavaScript package (de Leeuw, 2015), which controlled stimulus presentation and recorded responses. The stimuli of Experiment 1 were English words, 4-7 letters in length and 0.2-300 counts per million in frequency, derived from the SUBTLEX English database. All stimuli were presented in uppercase white text positioned in the centre of a black background.

Design
Table 2 summarises the 14 different lure types, which were designed to test each orthographic similarity effect: three lure types testing the transposition effect; two lure types testing the reverse effect; six lure types testing the edge effect; and three types of filler lures, which test the substitution effect. Each lure was orthographically similar to one word from the study list. We will refer to that study list word as the PARENT word.
The design of Experiment 1 was highly constrained by the availability of transposition and reverse word pairs in English; the pairs we employed exhausted the English database. There are 40 adjacent transposition pairs that differ by the transposition of two adjacent interior letters (e.g. CLOD and COLD). Each word of these pairs has at least one single interior substitution (e.g. CLOD and CLAD) or double interior substitution (e.g. CLOD and CARD) control lure word. The transposition pairs we employed incorporated those adopted in one of the original transposition effect studies by Andrews (1996). There are 16 interior reverse word pairs, which are either 5 or 6 letters in length. Each word of a pair has a control substitution lure that differs from the word by the same number of letters as the corresponding reverse lure does. For example, for the 5-letter reverse pair SNAPS and SPANS (letter difference = 2), the word SNAPS has a double substitution control lure SLAMS, and the word SPANS has a double substitution control lure SLATS. For the 6-letter reverse pair PATROL and PORTAL (letter difference = 4), the control lure PENCIL differed by four letters from each word of the pair (a quadruple substitution pair). Each word in the transposition and reverse word pairs was randomly selected to be the parent word or the lure probe across participants.
Word pairs differing in one or both of the exterior letters were used to test one of the three edge effects: start letter importance (StartMiss), end letter importance (EndMiss), and both-letter importance (BothMiss), where the word pairs have the same length and differ only in the start letter, the end letter, or both exterior letters respectively. For each pair, one word was randomly selected to be the parent word and a corresponding substitution control lure was selected: interior single substitution lures for parent words testing start and end letter importance, and interior double substitution lures for parent words testing both-letter importance. Lastly, the filler lure types serve to lengthen the study and test lists to avoid floor effects, and to provide a baseline against which all other lure types can be compared. Filler words comprised single, double and quadruple substitution word pairs. Each word of a pair was randomly selected to be the parent word; the other word of the pair has the same length and the same exterior letters, but differs in one, two or four interior letters respectively. No word was reused across different parent words or lure types.
Experiment 1 consisted of four study-test cycles. Each of the four study lists contained 33 targets: 10 transposition parent words, 4 reverse parent words, 9 edge effect parent words (3 per subclass), and 10 filler parent words (4 single substitution, 4 double substitution, and 2 quadruple substitution). These parent words were randomly sampled from the corresponding orthographic pair sets. Each test list had 89 test probes, comprising 33 targets and 56 lures: one experimental lure and one substitution control lure for each reverse effect and exterior effect parent word; one experimental lure and one substitution control lure (either single or double substitution) for each transposition effect parent word; and one substitution lure for each filler parent word. The presentation of targets was randomized within each study list. The order of targets and lures was randomized within each test list. The presentation of single and double substitution control lures for transposition effect parent words was counterbalanced across participants where possible. The two words in a transposition or reverse pair were also counterbalanced across participants, so that each word of a pair had an equal chance of being the parent word target or the lure.
One thing to note is that we did not control for the similarity between each lure probe and study list items other than the corresponding parent word. Consequently, there may be other study list items that are more similar to a given lure probe than its parent word. This would have the most impact on the quadruple substitution lures, since other study items are likely to be within one or two letters of the lure; the lure would then be more similar to the study content than intended and would be better classified as a single or double substitution lure instead. However, as we will return to in the Results section, this had an inconsequential impact on the effects we observed in the data.

Procedure
Each participant completed a single session lasting approximately 15 min. During each study phase, target words were displayed one at a time for one second with 150 ms inter-stimulus intervals between presentations. Participants then completed a 45 s distractor task in which they solved simple mathematical questions. Specifically, participants were asked to determine the correctness of a series of mathematical statements of the form A + B + C = D (e.g. 3 + 4 + 1 = 11), pressing either the "1" or "0" key on the keyboard to indicate that the statement was true or false. Performance on the distractor task was not recorded. This was followed by a test phase where test probes were presented one at a time and participants responded at their own pace, pressing either the "1" or "0" key on the keyboard to indicate whether each test probe was an old or a new word respectively. Recognition responses and response times were recorded. Any response slower than 7.5 s or faster than 150 ms prompted a warning message of "too slow" or "too fast" respectively to ensure participants' engagement. Before the first study-test cycle, participants completed a practice trial consisting of a shortened study-test cycle with words that did not appear in the actual study and test lists.

Experiment 2
Since the number of transposition and reverse lures was naturally constrained in Experiment 1, Experiment 2 was designed to replicate Experiment 1 with larger numbers of trials using pronounceable nonwords (also termed pseudowords). Experiment 2 was identical to Experiment 1 except for the stimuli used and the length of the experiment.

Participants
A total of 99 participants were recruited in Experiment 2. Participants were recruited online through the University of Melbourne SONA System and either received course credit or were reimbursed for their participation. All participants provided informed consent before commencing the experiment. Participants with close-to-chance performance (d' < 0.2) were excluded from data analysis. After exclusion, 81 participants were included in the data analysis.

Materials
The pseudowords used in Experiment 2 were generated as follows. First, because of the limited number of words in the transposition and reverse conditions, a list of new words from the original word pool that were not used in Experiment 1 was randomly selected and added to these two conditions for the following transformation. A list of pseudowords was then generated from the target words in each condition of Experiment 1 using Wuggy (Keuleers & Brysbaert, 2010), a program which automatically generates pseudowords for a given set of English words. Subsequently, the same types of experimental and control lures were generated for each target pseudoword as in Experiment 1. Specifically, for each pseudoword in the transposition condition (e.g. PUSITES), a transposition neighbour was generated by randomly transposing two adjacent interior letters of the pseudoword (e.g. PUSTIES). For each word of a transposition pair, a single substitution control lure and a double substitution control lure were then generated by randomly substituting one or two of the transposed letters. For pseudowords in the reverse condition, a reverse neighbour was first generated, and control lures were subsequently generated by randomly substituting either two or four interior letters of the reverse pair depending on the length of the pseudoword. For the edge effect and filler conditions, experimental and control lures were generated by randomly replacing letters of the target pseudowords according to the specific condition types. Finally, all pseudowords were manually checked to ensure their pronounceability.

Design and procedure
Experiment 2 had the same design and procedure as Experiment 1 but had eight study-test cycles instead of four, and therefore lasted approximately 30 min. Another difference is that in the instructions for Experiment 2, participants were told explicitly that they would be presented with nonword strings.

Data screening
For both experiments, responses from practice trials at the start of the experiments were not recorded. Responses with reaction times faster than 200 ms or slower than 4 s were excluded to omit fast or slow guess responses, resulting in 0.92 % of responses in Experiment 1 and 0.93 % of responses in Experiment 2 being excluded.

Orthographic similarity effects in recognition memory
As mentioned in the Method, we did not control for the maximum similarity of each lure probe to other items on the study list. Therefore, in the following Bayesian statistical analysis, we re-grouped lures according to their actual maximum similarity to the study list. Lures with higher-than-designated maximum similarity to the study content were re-grouped with the corresponding filler lure types. For example, consider a quadruple substitution lure, which is four letters different from its parent word on the study list: if the same study list contains a target word that is one letter different from this lure, the lure's actual maximum similarity to the study list would be one, and it would therefore be re-grouped as a filler single substitution lure. Table 3 shows the proportion of re-grouped lure trials for each lure type in both experiments. One thing to note is that we treat a letter transposition as equivalent to a one-letter difference, as in the Damerau-Levenshtein string edit distance (Damerau, 1964); transposition lures were therefore not re-grouped. The same Bayesian statistical analysis applied to the re-grouped data was also performed on the original data based on our designated lure types without re-grouping (see Appendix A), and very similar results were found. Fig. 2 shows the hit rate (HR) for targets, and the false alarm rate (FAR) for experimental and control lures testing different orthographic similarity effects in Experiments 1 and 2. Higher FAR indicates greater confusability, and therefore higher similarity between the lure type and the study content. Notably, FAR in Experiment 2 is consistently higher than FAR in Experiment 1 across all lure types, suggesting poorer rejection of pseudoword lures compared to word lures.
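The re-grouping step can be sketched as follows, using the optimal-string-alignment variant of the Damerau-Levenshtein distance, in which an adjacent transposition counts as a single edit. The function names are illustrative; this is not the authors' analysis code.

```python
def dl_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant):
    substitutions, insertions, deletions, and adjacent transpositions each cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition = one edit
    return d[len(a)][len(b)]

def regroup(lure, study_list):
    """A lure's effective condition is its minimum edit distance to any studied item."""
    return min(dl_distance(lure, w) for w in study_list)
```

Under this metric a transposition lure sits at distance one from its parent, which is why transposition lures were never re-grouped.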
To test the various similarity effects statistically, we conducted a series of Bayesian paired samples t-tests between each pair of relevant lure types using the JASP software (JASP team, 2020). We employed the default prior distribution in JASP, a Cauchy distribution centred on a zero effect size with a scale of 0.707. Bayes factors (BF10) were used to assess the strength of evidence for the alternative hypothesis that the FARs differ between the two lure types being compared, relative to the null hypothesis that they do not. For example, a BF10 of 10 indicates that the data are 10 times more likely under the alternative hypothesis than under the null hypothesis.
Evidently, our results show a large transposition effect in both experiments. In Experiment 1, Bayesian paired-samples t-tests showed that the FAR of the transposition lures (M = 0.299) is higher than the FAR of single substitution control lures (M = 0.182), BF10 = 5.496e+18, and also higher than that of double substitution control lures (M = 0.136), BF10 = 5.099e+34. In Experiment 2, Bayesian paired-samples t-tests likewise showed that the FAR of the transposition lures (M = 0.358) is higher than (1) the FAR of single substitution control lures (M = 0.296), BF10 = 4.376e+6, and (2) the FAR of double substitution control lures. Turning to the edge effects in Experiment 1, a Bayesian paired samples t-test revealed moderate evidence for a start letter importance, where lures differing in the start letter from the study content (i.e. StartMiss lures, M = 0.176) had lower FAR than control lures differing in one interior letter (M = 0.209), BF10 = 3.853. However, end letter importance in recognition memory of words is not supported by our results: a Bayesian paired samples t-test revealed no evidence for any meaningful difference between the FAR of lures differing in the end letter (i.e. EndMiss lures; M = 0.233) and corresponding control lures differing in an interior letter (M = 0.226), BF10 = 0.103. For recognition memory of pseudowords in Experiment 2, in comparison, neither start-letter nor end-letter importance was observed in our data. Specifically, a Bayesian paired samples t-test revealed inconclusive evidence that the FAR for StartMiss lures (M = 0.284) differs meaningfully from corresponding control lures (M = 0.259), BF10 = 0.62. Interestingly, pseudowords differing in the end letter from the study content (i.e. EndMiss, M = 0.318) showed higher FAR than pseudowords differing in an interior letter (M = 0.272), BF10 = 8.868, raising the possibility that interior letters are more important than end letters in recognition memory of pseudowords.
Finally, in both experiments, lures differing in both the start and end letter from a studied word (i.e. BothMiss; M = 0.107 in Experiment 1 and M = 0.231 in Experiment 2) had slightly lower FAR than substitution control lures differing in two interior letters from the same target word (M = 0.132 in Experiment 1, and M = 0.261 in Experiment 2). However, Bayesian paired samples t-tests revealed a BF10 of 1.489 in Experiment 1 and a BF10 of 2.372 in Experiment 2, both suggesting inconclusive evidence for a both-exterior-letter importance. In summary, we have established the substitution, transposition and reverse effects in recognition memory of words and pseudowords, and the start letter importance in recognition memory of words, but not of pseudowords.

Computational modelling
To model recognition performance, each of the orthographic representations (i.e. slot coding, the closed-bigram model, the open-bigram model, and the overlap model) was situated in a global matching model in which the similarities of the test probe to each of the items on the study list were calculated, transformed, aggregated and finally converted to recognition responses on an item-by-item basis. To reiterate, we calculate the inter-item orthographic similarity (O) between a test probe and a studied item as the summed matches (m) between the pair divided by the alignment length (γ), defined as the length of the longer word of the pair, producing a similarity value between 0 and 1.
Specifically, for slot coding, we can express a match m on position k between a test probe letter string i and a studied letter string j as:

m_ijk = α if k = 1 and the letters match; β if k = L and the letters match; 1 if 1 < k < L and the letters match; and 0 otherwise, (1)

where L refers to the length of the word (i.e. the number of letters). Matches of all interior letters were fixed at a value of one. α and β, common to all models, are weight parameters that scale the match on the start letter and the end letter respectively in order to capture the edge effects. The α and β parameters are freely estimated but do not vary across individual test trials. Values of α and β estimated to be higher than one indicate exterior letter importance, since matches on the start and/or end letter contribute more to the overall similarity than matches on interior letters. Consistent with Osth and Zhang (2023), we assumed that the β parameter applies only to the end letter of the studied item j and not to the test probe item. This is inconsequential when the studied word and the probe word have the same length. However, if the two words differ in length, any change in the weight of the final letter applies only to matches involving the end letter of the studied word, not the probe word. Another important note is that the α and β parameters are a property of letter position, rather than letter identity. To illustrate, an α of 2 means that a match on the first letter is twice as important as a match on an interior letter, regardless of which letter is at the start of the string. The total number of matches is then the summed m value across all letter positions:

m_ij = Σ_k m_ijk. (2)

The orthographic similarity (O) between the two letter strings i and j is:

O_ij = m_ij / γ_ij, (3)

where γ_ij is the alignment length of i and j, defined as the length (L) of the longer of the two letter strings, with the exterior positions weighted by α and β. The inclusion of the weight parameters α and β in the denominator of Equation (3) ensures that the similarity value remains bounded between 0 and 1 as values of α and β diverge from one.
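A minimal sketch of the weighted slot-coding similarity, under our assumption that the weighted alignment length takes the form α + β + (L − 2), so that self-similarity equals exactly 1:

```python
def slot_similarity(probe, study, alpha=1.0, beta=1.0):
    """Slot-coding orthographic similarity: position-by-position letter matches,
    with the start-letter match weighted by alpha and the end-letter match of
    the studied item weighted by beta, normalised by the weighted alignment
    length (an assumed form) so that self-similarity equals 1."""
    L = max(len(probe), len(study))
    m = 0.0
    for k in range(min(len(probe), len(study))):
        if probe[k] != study[k]:
            continue
        if k == 0:
            m += alpha
        elif k == len(study) - 1:
            m += beta
        else:
            m += 1.0
    gamma = alpha + beta + (L - 2)  # weighted alignment length (assumption)
    return m / gamma
```

Note that under this scheme a transposition lure ("sotp" for "stop") scores no higher than a double substitution lure, which is exactly the failure mode discussed later.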
For the overlap model (Gomez et al., 2008), we assumed position uncertainty to reside in the representation of studied items, as they had limited viewing time. We define k as the position of the letter in the encoded string in memory (j), and q as the position of the same letter in the test string (i). A match m for the letter on position k of the encoded string can be expressed as:

m_ijk = Φ((q − k + 0.5) / σ_k) − Φ((q − k − 0.5) / σ_k), (4)

where Φ is the cumulative distribution function of the normal distribution. α and β are applied in the same manner as in the slot coding model, where k = 1 and k = L(j) refer to the first letter and the last letter of the studied item j respectively. Conventionally, the overlap model assumes each letter position to be associated with a different standard deviation parameter (σ) for position uncertainty, which can be freely estimated to optimize model fit. However, this would result in a large number of parameters (seven standard deviation parameters in our experiments, where we have words up to seven letters in length). Therefore, we adopted the simplification made in Gomez et al. (2008), where the level of position uncertainty was calculated as an exponential function over the letter positions:

σ_k = d(1 − e^(−rk)), (5)

where σ_k is the standard deviation of the normal distribution at slot position k, which increases from the start letter to the end letter following an exponential function. Parameters r and d describe the rate and asymptote of the exponential growth respectively, both of which were estimated as free parameters and fixed across individual test trials. Match values calculated from Equation (4) were subjected to Equations (2) and (3) to obtain the orthographic similarity value for the pair.
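The positional match and the exponential uncertainty function can be sketched as below. The exact parameterisation of the exponential, σ_k = d(1 − e^(−rk)), is our reading of the text and should be treated as an assumption.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def position_sd(k, r, d):
    """Assumed exponential growth of positional uncertainty: the sd rises from
    the start letter toward asymptote d at rate r (k is 1-indexed)."""
    return d * (1.0 - math.exp(-r * k))

def overlap_match(k, q, sigma):
    """Probability mass that a letter encoded at position k with normal
    positional noise sigma is perceived within half a slot of position q."""
    return norm_cdf((q - k + 0.5) / sigma) - norm_cdf((q - k - 0.5) / sigma)
```

The match is largest when q = k and falls off symmetrically with distance, which is how position uncertainty makes transposed letters partially confusable.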
For both the closed and open bigram models, the similarity value is calculated as the number of shared bigrams divided by the alignment length of the pair, which is the number of bigrams in the longer string of the pair. Specifically, for both models, a match m on a bigram k between strings i and j can be expressed as:

m_ijk = 1 if bigram k is present in both strings, and 0 otherwise, (6)

where for the closed-bigram model, k ranges over adjacent bigrams in a string, and for the open-bigram model, k ranges over bigrams separated by fewer than three letters in a string (Grainger & van Heuven, 2003). α and β apply to matches on bigrams that involve the boundaries of the word (e.g. _s and p_ in the word stop). It is important to note that for both closed and open bigram models, each bigram was only matched once, to avoid higher-than-one self-similarity values for strings with repeated bigrams. Using the closed-bigram model as an example, when calculating the self-match similarity value of the string "abab" coded as {_a, ab, ba, ab, b_}, if each bigram "ab" could be matched twice, since there are two "ab" bigrams in the set, the result would be a higher-than-one similarity value (7/5). Therefore, for Equation (6), each matched bigram was removed from the set of possible bigrams during the comparison process. The overall orthographic similarity was calculated in the same manner as in the previous models according to Equations (2) and (3), with the exception that γ_ij in Equation (3) refers to the number of bigrams in the longer of the pair i and j for the bigram models.
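The bigram matching scheme can be sketched as follows (omitting the α/β boundary weighting for brevity). Note how consuming each matched bigram token keeps the self-similarity of "abab" at 1, and how the open-bigram variant, unlike the closed one, rates a transposition lure as more similar than a substitution lure.

```python
def bigrams(word, open_gram=False, max_gap=2):
    """Bigram list with boundary markers. Closed: adjacent letter pairs only.
    Open: ordered pairs separated by fewer than three letters (gap <= 2)."""
    grams = ["_" + word[0], word[-1] + "_"]  # boundary bigrams, e.g. _s and p_
    span = max_gap + 1 if open_gram else 1
    for i in range(len(word)):
        for j in range(i + 1, min(i + 1 + span, len(word))):
            grams.append(word[i] + word[j])
    return grams

def bigram_similarity(probe, study, open_gram=False):
    """Shared bigrams divided by the bigram count of the longer string;
    each bigram token may be matched only once (multiset intersection)."""
    a = bigrams(probe, open_gram)
    b = bigrams(study, open_gram)
    pool = list(b)
    matches = 0
    for g in a:
        if g in pool:
            pool.remove(g)  # consume the token so repeats cannot double-count
            matches += 1
    return matches / max(len(a), len(b))
```

For "stop" vs the transposition "sotp", the closed model shares only the boundary bigrams, while the open model also shares st, so, sp, tp and op.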
The stimuli in Experiment 1 were words, which have both orthographic and semantic features. We initially pursued global matching models that comprised only orthographic features. However, we found that several of the orthographic similarity manipulations impaired performance in the models too severely relative to the data. To remedy this problem, we additionally incorporated semantic representations into each model. This mitigated the performance decrements because the manipulations of orthographic similarity did not change the degree of semantic similarity in a systematic way, which reduced the changes in global similarity across conditions and hence the changes in the false alarm rate.
Specifically, inter-item similarity was expanded to include both semantic similarity (S) and orthographic similarity (O) to form global similarity values on individual trials. We constructed semantic similarity using the word2vec model (Mikolov et al., 2013), a distributed semantic model which has gained popularity in recent years due to its impressive performance on various semantic tasks (e.g. Mandera et al., 2017). We used pre-trained vectors trained on a complete Wikipedia corpus using the fastText model with subword information (Grave et al., 2018). The semantic similarity S_ij between strings i and j is calculated as:

S_ij = cosine(v_i, v_j),

where cosine indicates the cosine of the angle between the two word vectors. Cosine values are bounded between −1 and 1, where a cosine value of zero occurs when two vectors are orthogonal to each other, indicating maximum dissimilarity, which makes values below zero hard to interpret. For this reason, we truncated cosine values in our models at a lower bound of zero.
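The truncated cosine can be sketched directly; the vectors below are toy stand-ins for fastText embeddings.

```python
import math

def truncated_cosine(u, v):
    """Cosine similarity between two word vectors, truncated at zero so that
    negative (hard-to-interpret) values are treated as no semantic overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norm)
```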
The inter-item similarity in Experiment 1 was calculated as an additive combination of semantic similarity (S_ij) and orthographic similarity (O_ij) between a test probe and each of the studied items on the corresponding study list, and these values were aggregated to produce a global similarity value for the test probe. Specifically, the global similarity on trial i (g_i) is calculated as:

g_i = Σ_{j=1}^{n} (w S_ij + (1 − w) O_ij)^p,

where n is the length of the study list. We incorporated an additional weighting parameter w which reflects the relative influence of the semantic and orthographic representations on global similarity, and p is a power parameter applied to each inter-item similarity. Including this non-linear transformation with exponent p (p > 1) in the global similarity calculation has the effect of increasing the signal-to-noise ratio: increases in p punish low similarity values by pushing them closer to zero, while values close to 1 are less affected. This non-linear transformation of similarity originated in the MINERVA 2 model (Hintzman, 1988), where similarity values are raised to a power of 3. Our previous work (Osth & Zhang, 2023) also freely estimated the non-linearity parameter p and found it was critical for capturing the difference between high and low similarity lures.
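Under our assumption that the weighting takes the convex form w·S + (1 − w)·O, the global similarity computation can be sketched as:

```python
def global_similarity(S, O, w=0.5, p=3.0):
    """Global match for one test probe: a weighted sum of semantic (S) and
    orthographic (O) inter-item similarities for each studied item, raised to
    power p and summed over the study list. p > 1 pushes low similarities
    toward zero while leaving values near 1 largely intact."""
    return sum((w * s + (1.0 - w) * o) ** p for s, o in zip(S, O))
```

With p = 3, an inter-item similarity of 0.2 contributes only 0.008 to the global match, while a self-match of 1 still contributes 1, illustrating the signal-to-noise effect described above.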
In Experiment 2, the stimuli were pseudowords, which by definition do not have semantics. However, there may still be dimensions other than orthography on which two pseudowords can be similar, one possibility being phonology. This would be most influential for target self-matches, where two identical strings are being compared and therefore have perfect similarity on all possible dimensions. We therefore assumed similarity on all dimensions other than orthography (S) to be 1 for target self-matches and 0 for all other inter-item comparisons, and calculated the global similarity for each test probe (g_i) in the same manner. In both experiments, the global similarity value (g_i) was used to predict item-level recognition accuracy via Luce's choice rule:

P(old)_i = g_i / (g_i + c),

where P(old)_i is the probability of recognising a test probe i as an old stimulus, g_i is the global similarity value of test probe i, and c is a decision criterion parameter. Luce's choice rule is used in the generalized context model (Nosofsky, 1986, 1991) and in the exemplar-based random walk and linear ballistic accumulator models (Donkin & Nosofsky, 2012; Nosofsky et al., 2011; Osth et al., 2023). Table 4 summarises the model parameters for each orthographic scheme in both experiments; each model contains five parameters per participant, with the exception of the overlap model, which contains seven.
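A minimal sketch of Luce's choice rule in the ratio form g/(g + c) commonly used in global matching models (the way the criterion enters is our assumption):

```python
def p_old(g, c):
    """Luce's choice rule: probability of an 'old' response given global
    similarity g and decision criterion c. Higher g raises P(old);
    a stricter (larger) criterion lowers it."""
    return g / (g + c)
```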

The model fit
In both experiments, models were fit to individual participant data within a hierarchical Bayesian framework, which allowed the estimation of individual participant-level and group-level parameters simultaneously. One advantage of this approach is that parameter estimates for individual participants are pulled towards the more certain group estimates (i.e. shrinkage), and shrinkage is stronger for individuals with less data. This not only allows better estimation of individual participants' parameters, it also naturally accounts for differences in the amount of data between participants (Boehm et al., 2018).
The Bayesian models allow us to quantify the uncertainty of estimates through the posterior distributions of the parameters. The posterior distributions were estimated using Differential Evolution Markov chain Monte Carlo sampling (DE-MCMC; Turner et al., 2013), a method which is robust to parameter correlations. Details of the prior distributions on model parameters and the sampling procedure are provided in Appendices B and C. Convergence was inspected through the Gelman-Rubin (GR) convergence statistic (Gelman & Rubin, 1992). Models were considered converged if GR values were less than 1.1 for both individual-level and group-level parameters. Chains were visually inspected to further confirm convergence.

Model selection
To compare which orthographic representation scheme provided the best fit to the data, we adopted the Widely Applicable Information Criterion (WAIC; Watanabe, 2010) for model selection. The WAIC selects a model based on its ability to predict future data by balancing model complexity against goodness-of-fit; models with lower WAIC scores have better expected predictive accuracy. WAIC is calculated on a log-likelihood scale, which means that a WAIC difference of ten or more can be considered large (e.g. Osth et al., 2017).
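A sketch of WAIC from a matrix of pointwise log-likelihoods (posterior samples by observations), using the variance-based effective-parameter penalty; this illustrates the criterion, not the authors' implementation.

```python
import math

def waic(log_lik):
    """WAIC from pointwise log-likelihoods: log_lik[s][i] is the log-likelihood
    of observation i under posterior sample s. WAIC = -2 * (lppd - p_waic);
    lower values indicate better expected predictive accuracy."""
    S = len(log_lik)
    n = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        mx = max(col)
        # log of the mean likelihood over posterior samples (log-sum-exp trick)
        lppd += mx + math.log(sum(math.exp(x - mx) for x in col) / S)
        mean = sum(col) / S
        p_waic += sum((x - mean) ** 2 for x in col) / (S - 1)  # log-lik variance
    return -2.0 * (lppd - p_waic)
```

Summing each participant's WAIC, as done here, yields the single per-model score reported in Table 5.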
In each experiment, the WAIC scores for individual participants were summed to form a single WAIC score for each model. Table 5 shows the WAIC scores for each of the four models in Experiments 1 and 2. The WAIC scores are presented on a difference scale where the winning model in each experiment has a value of zero and all other models have a positive value indicating the difference of their WAIC score from the winning model. Lower values indicate a better balance between fit and complexity. Both experiments showed that the open-bigram model provided the best model fit, followed by the overlap model.

Table 4
Parameter Names and Descriptions for Different Orthographic Models.

As we will discuss shortly, the advantage of the open bigram and overlap models is likely due to their flexibility in capturing the transposition and reverse effects. Figs. 3 and 4 show group-level model fits against group-level data for hit rates (left column) and false alarm rates (middle and right columns) across the edge effect, transposition, reverse, and filler lure types in Experiments 1 and 2 respectively. Note that in all model fit figures in this paper (Figures 3, 4, 6, and 7), we group data and model predictions for lure trials according to their actual maximum similarity to the study content, instead of their designated lure types. Model fits from the two experiments showed consistent patterns and are therefore presented together; differences between the two experiments' model fitting results will be highlighted where relevant. Fig. 5 shows the group mean μ estimates for all parameters of the winning model (i.e. the open-bigram model) in Experiments 1 and 2. We will discuss these parameter estimates where relevant.
As shown in Figs. 3 and 4, all models captured the HR well; however, the models differ in their ability to capture the FAR across lure types. As expected, the most notable and diagnostic discrepancy in model coverage is in the transposition and reverse effects. As shown in the middle panels of Figs. 3 and 4, the slot coding and closed-bigram models are unable to capture the transposition and reverse effects. Specifically, both models predict that the transposition lures (Trans) are no more similar to the study content than the double substitution control lures (Trans_DS), while the single substitution control lures (Trans_SS) exhibit the highest similarity to the study content. Similarly, as expected, both the slot coding and closed bigram models failed to capture the reverse effect, predicting that reverse lures (Reverse) and substitution control lures (ReverseC) show the same level of similarity to the study content.
In comparison, both the open-bigram and overlap models are able to capture the transposition and reverse effects, as shown in the right columns of Figs. 3 and 4, where both models predict higher similarity and confusability for the transposition and reverse lures than for the corresponding substitution control lures. However, the open-bigram model performed markedly better at predicting the large transposition and reverse effects observed in the data, and successfully predicted that transposition lures have a higher FAR than single substitution control lures.
The overlap model predicts that transposition lures and reverse lures show higher FAR than the corresponding substitution control lures; however, the predicted effects are evidently very small. We were somewhat surprised by this finding given that the transposition effect is one of the benchmark findings that the model was designed to capture. One possibility is that this failure can be attributed to the exponential function of position uncertainty we adopted from Gomez et al. (2008), which may have overly constrained the model. To test this possibility, we ran a free position uncertainty version of the overlap model (Overlap_free) where the standard deviation of the normal distribution (σ) is freely estimated for each position. As shown in Fig. 6, allowing for free uncertainty parameters did not improve the model's predictions. Note also that the overlap model we adopted aligned words at the start letter, which may have constrained the model's ability to capture similarity between words of different lengths. To test this possibility, we included a third version of the overlap model, the Overlap Back model, which we also implemented in previous work (Osth & Zhang, 2023) and which was inspired by orthographic representations where letter position is determined by both the start and the end of the word (i.e. both-edges letter coding; Fischer-Baum et al., 2010; Jacobs et al., 1998). In the Overlap Back model, letters are aligned not only at the start of the word but also at the end of the word. The similarity value for the end-letter alignment is also calculated using Equation (4), with the exception that positions are calculated relative to the end of the strings. The similarity values of the start-letter alignment (O_ij,start) and end-letter alignment (O_ij,end) of two letter strings were combined using an attention constraint parameter (w_OVL):

O_ij = w_OVL O_ij,start + (1 − w_OVL) O_ij,end.

Fig. 6 also shows the model predictions for the Overlap Back model, which does not improve on the overlap model aligned only at the start of letter strings in capturing the transposition and reverse effects. The weight parameter for the start-letter-alignment similarity component (w_OVL) was estimated to be 0.9 at the group level, which further indicates a very slight contribution of the backwards similarity in capturing the data and suggests that the failure of the overlap model in capturing the transposition and reverse effects is not due to letter alignment.
For the less constraining effects, namely the edge effects and the substitution effect, all models were able to capture the main trend of the data. For the filler lure types in both experiments, as shown in the middle and right columns of Figs. 3 and 4, all models were able to predict that single substitution control lures show higher confusability than double substitution control lures, which in turn show higher confusability than quadruple substitution control lures. However, the models in the two experiments differ in predicting the observed edge effects. In Experiment 1, all models were able to capture the start letter importance by showing a lower FAR for the start letter difference lures (StartMiss) than for the single substitution control lures. However, the winning open-bigram model evidently performed better at capturing the actual FAR across all lure types testing the edge effects than the other schemes. The start letter importance is also supported by the estimated group mean value of the start letter importance parameter (α_μ), which is 2.48 for the open-bigram model (Fig. 5), suggesting that the start letter is weighted more than twice as highly as the interior letters.
In Experiment 2, however, none of the schemes was able to capture the FAR patterns across the lure types testing edge effects simultaneously. To reiterate, in Experiment 2, lures with the start and end letters replaced together (i.e. BothMiss) showed slightly lower FAR than interior substitution control lures, thereby demonstrating the importance of the exterior letters as a pair. However, when the letters were replaced individually, lures with different start letters (StartMiss) and lures with different end letters (EndMiss) showed higher FARs (i.e. were more confusable) than the corresponding control lures, showing the opposite of exterior letter importance. This posed a challenge for all models in simultaneously capturing the conflicting patterns: setting the start weight (α) and/or end weight (β) parameters lower than one would help the models capture the higher FARs for the StartMiss and EndMiss lures, but would at the same time produce a higher FAR for the BothMiss lures than for the corresponding control lures, which is the opposite of the observed data. An important note is that, despite the observed differences in group-level mean FARs across lure types here, our data showed no conclusive evidence for a start-letter non-importance or a both-exterior-letter importance, and only moderate evidence for an end-letter importance, as suggested by the Bayesian paired-sample t-tests.

General Discussion
The current study aimed to establish the transposition effect, reverse effect, edge effects and substitution effect in episodic recognition memory. Subsequently, we aimed to explore which orthographic representation (i.e. slot coding, closed bigram, open bigram, or the overlap model) best describes these orthographic similarity effects in episodic recognition. The two experiments showed consistent results. Specifically, we established the substitution effect, transposition effect and reverse effect in recognition of words and pseudowords. We found a start letter importance in the recognition of words, but not of pseudowords. Model selection in both experiments supports an open-bigram model of orthographic representation in recognition memory.
Our results, consistent with previous work in the psycholinguistic literature, showed support for the open-bigram and overlap models over strict slot coding and the closed-bigram model. The slot coding and closed n-gram models, despite being the letter coding mechanisms most extensively employed in the most influential models of visual word recognition and reading (e.g. Coltheart et al., 2001; Grainger & Jacobs, 1996; McClelland & Rumelhart, 1981; Paap et al., 1982; Seidenberg & McClelland, 1989; Wickelgren, 1969), have been shown to have difficulty capturing the facilitatory effect of transposition primes in masked priming studies (e.g. Davis & Bowers, 2006). Our results provide further evidence against these models and generalize their difficulty in capturing transposition effects to a different experimental paradigm, namely an episodic recognition memory task in which the test probe must be compared against an entire list of items to yield a global similarity match value.
The overlap model was proposed to account for letter transposition effects in masked priming studies (e.g. Gomez et al., 2008). Surprisingly, our results showed that the overlap model failed to capture the large transposition and reverse effects observed in our data. The overlap model explains the transposition effect through position uncertainty: the higher the uncertainty in position representations, the more similar transposed pairs become, because the off-position area under the curve for the transposed letters becomes larger. However, this comes at the cost of decreasing similarity between all pairs, since the area under the actual position of the letters becomes smaller, thereby attenuating similarity differences between different lures. Consequently, the overlap model finds it harder to account for the large differences in confusability across lure conditions, especially the difference between single and double substitution lures in our data. Our paradigm therefore placed stronger constraints on the overlap model and revealed its inability to capture the similarity structure of items across multiple orthographic similarity types in recognition memory.
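The trade-off described above can be made concrete with a toy overlap computation. In this sketch (an illustration, not the fitted model, which derives the standard deviations from an exponential growth function with rate r and asymptote d), a probe letter's position is a Gaussian and its match to a target slot is the area of that Gaussian over the slot: raising the standard deviation moves area from the letter's own slot to neighbouring slots, which helps transposed pairs but weakens every in-position match.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def slot_mass(mu, sd, slot):
    """Area under a letter's Gaussian position distribution over a unit-wide slot."""
    return norm_cdf((slot + 0.5 - mu) / sd) - norm_cdf((slot - 0.5 - mu) / sd)

def overlap_similarity(probe, target, sds):
    """Length-normalised sum of slot overlaps for matching letters."""
    total = 0.0
    for i, p_letter in enumerate(probe):        # i: position of letter in the probe
        for j, t_letter in enumerate(target):   # j: slot in the stored target
            if p_letter == t_letter:
                total += slot_mass(i, sds[i], j)
    return total / len(probe)
```

With the illustrative standard deviations from Fig. 1 (0.25, 0.8, 0.35, 0.4), any increase in a position's standard deviation raises off-slot matches while lowering the in-slot match, so boosting transposition similarity necessarily compresses the similarity differences between lure types.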
The estimated values of position uncertainty are relatively low in our overlap models. We calculated position uncertainty for each letter position from the estimated group-level rate (r) and asymptote (d) parameters of the exponential growth function (see Equation (5)) in both experiments. Table 6 shows these calculated values from our studies, along with the position uncertainty values calculated from the average rate (r) and asymptote (d) parameter values estimated in the original overlap model paper (Gomez et al., 2008). Our estimated uncertainty is consistently lower than that estimated in Gomez et al. (2008), except for the first position in Experiment 2. This inconsistency can be attributed to the fundamental differences between the masked priming paradigm and episodic recognition discussed in the Introduction. In episodic recognition, the presentation time of the target items is often longer, up to seconds (e.g. 1 s per item in our experiments), so position uncertainty due to poor perception of the letters may be minimized. However, this does not necessarily mean that the overlap model would also fail with much faster presentation in episodic recognition. One avenue for future investigation is therefore whether position uncertainty increases with faster presentation in recognition, leading to better performance of the overlap model. Research in this direction also raises another theoretical question, however: whether the representations of words are fundamentally different under fast and slow presentations.

Spatial coding model of string orthography
The models we tested in the current study are well documented and computationally simple. However, we acknowledge that other representation schemes may also be able to capture the orthographic similarity effects in our data. One strong candidate is spatial coding (Grossberg, 1978), as adopted by the self-organising lexical acquisition and recognition model (SOLAR; Davis, 1999, 2010).
Similarly to the overlap model, spatial coding also incorporates position uncertainty in the representation of letters. Critically, instead of associating position uncertainty with absolute position slots as in the overlap model, spatial coding associates position uncertainty with the relative order of letters in a string. Specifically, in spatial coding a signal-weight difference value is calculated for every letter shared between two strings, as the difference between the letter's position in one string and its position in the other. For example, for the word pair STOP and POST, the signal-weight difference values for the letters S, T, O, and P would be 2, 2, −1, and −3 respectively. These values indicate that the letters S, T, O and P in the string STOP are 2 slots ahead, 2 slots ahead, 1 slot behind, and 3 slots behind the same letters in the string POST, and thus represent how consistently matching letters are shifted between the two strings. Subsequently, each difference value is associated with a continuous function centred on the actual difference value, with a free variance parameter (σ) representing position uncertainty. These difference functions are then summed to form a superposition function, whose peak represents the match between the two strings. The more consistent the difference values, the more closely the difference functions align with each other, resulting in a higher peak of the superposition function and thus higher similarity.
Position uncertainty allows spatial coding to account for the transposition effect. For transposition pairs, the difference functions of the transposed letters, although slightly displaced from the difference functions of the other in-position letters, still increase the peak of the superposition function. In comparison, the substituted letters in a double substitution pair do not increase the peak at all, resulting in a lower similarity value than for a transposition pair.
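The calculation just described can be sketched as follows. The Gaussian shape of the difference functions, their unnormalised heights, and the grid search for the superposition peak are simplifying assumptions for illustration (the full specification is given in Davis, 2010); σ defaults to 3 here purely for the example.

```python
import math

def signal_weight_differences(s1, s2):
    """Relative-position difference for each letter shared by the two strings.
    Assumes each letter occurs at most once per string (as in STOP / POST)."""
    pos2 = {ch: i for i, ch in enumerate(s2)}
    return [pos2[ch] - i for i, ch in enumerate(s1) if ch in pos2]

def superposition_peak(diffs, sigma=3.0, step=0.01):
    """Peak of the summed (unnormalised) Gaussian difference functions,
    located by a simple grid search over the relevant range."""
    def height(x):
        return sum(math.exp(-((x - d) ** 2) / (2 * sigma ** 2)) for d in diffs)
    lo, hi = min(diffs) - 3 * sigma, max(diffs) + 3 * sigma
    n = int((hi - lo) / step) + 1
    return max(height(lo + k * step) for k in range(n))
```

For STOP vs. POST the differences are [2, 2, −1, −3]; for an identity pair the differences are all 0, so the difference functions align perfectly and yield a higher superposition peak, and hence a higher match.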
Despite the ability of the spatial coding model to account for the transposition effect, it is computationally more difficult than the models explored in the current study because, to our knowledge, there is no analytic approximation of the superposition function. When calculating the match value between a string pair in spatial coding, a position uncertainty parameter (σ) needs to be estimated and the maximum value of the superposition function needs to be found. This is practical when calculating pairwise similarities as in masked priming studies. However, it becomes computationally intractable in a global matching framework of recognition memory, where each test probe needs to be compared to a large number of study list items. We fitted a simple version of the spatial coding model to our data with the position uncertainty parameter (σ) fixed at 3 across participants, as adopted in Davis and Bowers (2006).

Integrating orthographic representations with global matching models
Many existing memory models assume items are represented by randomly generated vectors with a particular similarity structure (e.g. Cox & Shiffrin, 2017; Dennis & Humphreys, 2001; Eich, 1982; Hintzman, 1988; McClelland & Chappell, 1998; Murdock, 1982, 1993; Osth & Dennis, 2015a; Pike, 1984; Shiffrin & Steyvers, 1997). Most models adopting random representations are process oriented: they describe the storage and retrieval processes operating on the representations without explicating the origin of the representations themselves. These models need additional parameters to account for similarity effects. For example, the REM model predicts the category length effect by varying the proportion of elements shared between vectors (Shiffrin & Steyvers, 1997).
Recent work has advanced computational recognition models by integrating realistic semantic representations. Employing high-dimensional vectors derived from word co-occurrence in text corpora, the recognition by semantic synchronization model (RSS; Johns et al., 2012) and recent attempts by Osth et al. (2020) and Reid and Jamieson (2022, 2023) have successfully simulated various benchmark effects, including list length and list strength effects, word frequency effects, and semantic DRM effects, without any additional parameters defining the similarity structure.
Our estimates of the weight on semantic similarity (w) suggest that orthographic similarity is at least as consequential as semantic similarity in recognition memory. The weight parameter (w) indicates the importance of semantic relative to orthographic representations, where w = 0.5 would indicate equal importance. As shown in Fig. 5, the group-level mean estimates of w for the winning model are below 0.5, indicating a dominance of orthographic similarity in our data. Consistent findings were reported by Osth and Zhang (2023), where similar models were applied to four recognition datasets, none of which manipulated orthographic characteristics of the word stimuli. The estimated weights on orthographic similarity were close to or above 0.5 for all datasets, except for a deep processing condition of one dataset that emphasized semantic processing during memory encoding. These findings contradict the notion that long-term memory primarily consists of semantic representations (e.g. Baddeley, 1966) and encourage researchers to consider integrating orthographic representations into computational recognition memory models.
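Schematically, the roles of the weight w, the nonlinearity p, and the criterion c in this architecture can be sketched as follows. The precise combination rule and likelihood are given by the equations in the paper, so this form, and the way w, p, and c enter it, is a hypothetical sketch for orientation only.

```python
def p_old(probe_sims, w, p, c):
    """Schematic 'old'-response probability under global matching.
    probe_sims: list of (orthographic, semantic) similarities between the probe
    and each study item. Each pair is combined with weight w on semantic
    similarity, raised to the power p, summed into a global match value, and
    entered into Luce's choice rule against the criterion c."""
    global_match = sum(((1 - w) * ortho + w * sem) ** p
                       for ortho, sem in probe_sims)
    return global_match / (global_match + c)
```

With w below 0.5, the orthographic component dominates the combined match for every study item, which is the pattern our group-level estimates showed.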
Despite the importance of orthographic representations, very few memory studies have attempted to explore them. Recent work by Reid et al. (2023) incorporated two vector-based relative position orthographic representations in MINERVA 2 (Hintzman, 1986, 1988) and successfully simulated the performance pattern in an item-based directed forgetting paradigm. However, comparing between representations was not a focus of their study. In the current study, we found evidence for an open-bigram model over three competing orthographic representations as a plausible candidate for a realistic perceptual representation in a global matching model. Future studies could therefore integrate the open-bigram representation into global matching models to directly model perceptual DRM effects and perceptual category length effects, leading to a more complete, similarity-based, and computationally plausible account of false recognition. A better understanding of realistic perceptual and semantic representations, as well as how they can be integrated, would aid the development of a recognition model that allows a priori predictions at the level of individual items.

Analogues in memory for serial order
It is worth noting that a similar debate between absolute and relative order coding can be found in the memory for serial order literature, which concerns how an item's order within a sequence is represented (see Hurlstone, 2021; Osth & Hurlstone, 2023 for reviews). Closed and open bigram codes of orthographic representation strongly resemble associative chaining models of serial recall, which encode serial order by forming associations between study items (e.g. Lewandowsky & Murdock, 1989; Murdock, 1993, 1995; Raaijmakers & Shiffrin, 1981; Solway et al., 2012). In contrast, absolute position codes of letter strings resemble serial order models that encode the absolute order of items in serial recall, for example the box model (Conrad, 1965) and position marking models (e.g. Brown et al., 2000; Burgess & Hitch, 1992, 1999, 2006; Farrell, 2006; Lewandowsky & Farrell, 2008).
Several analogues of the benchmarks we considered here can also be found in the serial order literature. The advantage of the start letters in words can be considered a within-word primacy effect. Primacy effects are robustly observed in serial recall tasks, in conjunction with a relatively small recency effect (Farrell & Lewandowsky, 2012) that is sometimes absent entirely (Osth & Dennis, 2015b, c). In addition, an analogue of the transposition effect is the fill-in effect. To illustrate, consider a learned sequence such as ABCDE where a participant skips from A to C. A fill-in occurs when the subsequent response is B, producing the sequence ACB; the term arises because it is as if participants are "filling in" the missing item. An in-fill error, in contrast, occurs when participants continue onward after the skip, recalling an item such as D and producing the sequence ACD. Fill-in errors consistently outnumber in-fill errors, meaning the sequence ACB is considerably more likely than ACD (Logan, 2021; Logan & Cox, 2023; Osth & Dennis, 2015c; Page & Norris, 1998; Surprenant et al., 2005).
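The distinction between fill-in and in-fill errors can be made concrete with a small scoring helper for the ABCDE example above; this function is a hypothetical illustration of the taxonomy, not a scoring scheme used in any of the cited studies.

```python
def classify_skip_error(study, response):
    """Label the third response after a single forward skip over one item.
    E.g. for study ABCDE, responses beginning A, C skip over B."""
    if response[0] != study[0] or response[1] != study[2]:
        return None  # not the single-skip pattern considered here
    if response[2] == study[1]:
        return "fill-in"   # ACB: goes back for the skipped item
    if response[2] == study[3]:
        return "in-fill"   # ACD: continues onward past the skip
    return "other"
```

Empirically, "fill-in" responses (ACB) consistently outnumber "in-fill" responses (ACD), which is the benchmark that chaining models struggle to produce.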
Fill-in sequences are essentially transposition sequences, as the sequence ACB contains a reversal of two items from the original sequence. However, while the current work has shown that transposition errors in recognition memory support relative position coding, the opposite conclusion was reached in the serial recall literature, where chaining models encounter considerable difficulty in accounting for fill-in errors. This is because in serial recall it is commonly assumed that each item from the sequence is recalled individually and used as a cue (but see Dennis, 2009 for an exception that can capture the fill-in pattern). Thus, when a skip occurs, such as from A to C, item C should be used as a cue, which should trigger recall of item D if asymmetric associations are employed, or at least produce equal probabilities of recalling B and D if symmetric associations are employed (e.g. Lewandowsky & Murdock, 1989). In compound chaining models, where multiple items are associated and used as cues (e.g. Murdock, 1995), the most recently recalled item (item C) dominates the retrieval cue, and thus fill-in errors will not be predicted in the absence of other mechanisms.
It remains an open question, considerably beyond the scope of the present work, why the representation preferred in recognition memory (representations of relative position) has not been found to be well supported in the serial recall literature (e.g. Henson et al., 1996). One possibility is that the underlying representation is not the problem; instead, it is the way the representations are accessed that leads to failed predictions for errors in serial recall. For instance, the model of Dennis (2009) employs associations between items in a similar manner to open bigram representations, but retrieves the entire sequence simultaneously instead of retrieving items one at a time. The model is able to produce fill-in errors along with a number of other benchmarks that have been assumed to require representations of absolute position. While this model has only been applied to serial recall, Osth and Dennis (2015d) speculated on how the same architecture could also be applied to benchmarks from recognition memory.
While we focused on serial recall in this section, we acknowledge that serial order tasks, such as the recognition-of-order task and the serial order reconstruction task (see Attout & Majerus, 2015; van Dijck et al., 2013 for example task descriptions), may be more directly analogous to the recognition memory task that is the focus of the present work. However, to our knowledge, the benchmarks we considered here (e.g. transposition, exterior letter importance) have not been investigated in these tasks.

Further constraining orthographic representations with feature frequency
The current models are limited in that they contain only order information about orthographic units, with no differentiation between the identities of different units (i.e. letters or bigrams). This means we have assumed that matches on different letters or bigrams make equal contributions to the final similarity calculation. To illustrate, a match on the letter A and a match on the letter X in the pair ABXD and ACXY contribute equally to the pair's similarity. However, letter identities may have important implications for orthographic similarity calculation.
There is evidence in recognition memory that rare letters are more consequential than common letters. Specifically, a feature frequency effect has been found: when words were equated on normative word frequency, words comprised of rare letters (e.g. x, j, v) were better recognised, showing a higher HR and lower FAR than words comprised of common letters (e.g. a, b, c; Malmberg et al., 2002; Steyvers, 2000). The feature-frequency assumption was adopted by the REM model (Shiffrin & Steyvers, 1997) to account for the well-established word frequency effect, whereby low frequency words are better recognised than high frequency words. This assumption attributes the better performance of low frequency words to the fact that they contain, on average, more rare letters, whereas high frequency words contain relatively more common letters. Rare letters are less likely to be matched by chance and are thus more diagnostic in making old/new decisions. Therefore, a low frequency target produces, on average, higher levels of familiarity because of the more diagnostic matches on rare letters.
Thus, the orthographic order models in the current study could be further specified by integrating feature frequency. Normative letter or bigram frequencies can be calculated from word corpora and then used to scale the match values in the similarity calculation. Future studies could investigate whether the integration of feature frequency is necessary and what the appropriate mechanism is for integrating feature frequencies in recognition memory. This would allow more insight into the similarity structure of words in models of recognition.
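One hypothetical way to implement such scaling is to weight positional letter matches by the negative log of corpus letter frequency, so that rare letters (e.g. x, j, v) contribute more diagnostic evidence than common ones. The function names and the toy two-word corpus below are illustrative assumptions, not a mechanism taken from any fitted model.

```python
import math
from collections import Counter

def letter_frequencies(corpus_words):
    """Normative letter frequencies estimated from a (toy) corpus."""
    counts = Counter(ch for w in corpus_words for ch in w)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def weighted_letter_match(w1, w2, freqs, floor=1e-6):
    """Score positional letter matches by -log frequency: rarer letters
    contribute more diagnostic evidence to the match value."""
    return sum(-math.log(freqs.get(a, floor))
               for a, b in zip(w1, w2) if a == b)
```

Under this scheme, a match on a rare letter contributes far more to similarity than a match on a common one; the same idea extends directly to bigram frequencies for the open-bigram representation.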

Conclusion
In conclusion, the current studies constitute an important first step toward integrating realistic orthographic representations into computational memory models. We have established the substitution, transposition, and reverse effects, as well as start-letter importance, in recognition memory of words, and our results support a computationally simple open-bigram representation in recognition.

Prior Distributions on Model Parameters
Individual parameters were sampled from group-level distributions. Means and standard deviations of the group-level distributions are denoted using μ and σ superscripts. Some parameters with a lower bound of zero were sampled on a log scale to improve sampling; for these, the subject-level parameters were exponentiated in the fitting procedure. For the non-linearity parameter p, we further added a constant of 1 to the exponentiated value p* in the fitting procedure, as shown in Equation B(1), to ensure a minimum value of 1 for the non-linearity transformation.
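As a sketch of the transformation just described: log-scale draws are exponentiated, and the non-linearity p adds 1 to the exponentiated draw p* so that p ≥ 1. The parameter names used for the log-scale draws here are hypothetical placeholders, not the names used in the fitting code.

```python
import math

def transform_params(raw):
    """Map sampled (unbounded) parameter draws onto the model scale.
    Keys ending in '_log' (hypothetical naming) are exponentiated;
    the draw 'p_star' becomes p = 1 + exp(p_star), ensuring p >= 1."""
    out = dict(raw)
    for name in ("c_log", "d_log", "r_log"):
        if name in out:
            out[name.replace("_log", "")] = math.exp(out.pop(name))
    out["p"] = 1.0 + math.exp(out.pop("p_star"))
    return out
```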

Fig. 1.
Fig. 1. A Representation of the Encoding of Letter Positions according to the Overlap Model. Note. Panel a shows the similarity between the word pair SHOP and STOP with position uncertainty on the word SHOP. Panel b shows the similarity between the word pair CURD and COLD with position uncertainty on the word CURD. Panel c shows the similarity between the word pair CLOD and COLD with position uncertainty on the word CLOD. The shaded areas under the normal distributions represent the overlap values. The standard deviations used in all three examples, from position 1 to 4, are 0.25, 0.8, 0.35 and 0.4 respectively. The match value m for each letter is also depicted.
Fig. 2. Hit Rate and False Alarm Rate in Experiments 1 and 2. Note. HR = hit rate. FAR = false alarm rate. P(yes) = probability of giving a yes (old) response at test. Trans = Transposition lure. SS = Single interior substitution. DS = Double interior substitution. QS = Quadruple interior substitution. ReverseC = Reverse control. Trans_SS, Trans_DS, reverseC, StartMiss_SS, EndMiss_SS, and BothMiss_DS are substitution control lures for each similarity effect. Error bars represent 95% within-subjects confidence intervals calculated according to Morey (2008).
Parameters common to all models (Slot code, Closed-bigram, Open-bigram, Overlap):
c: Decision criterion in Luce's choice rule
p: Power parameter for nonlinear transformation of global similarity
α: Start letter importance parameter
β: End letter importance parameter
w: Weight of semantic similarity
Parameters unique to the Overlap model:
d: Asymptote parameter for the exponential position uncertainty function
r: Rate parameter for the exponential position uncertainty function
L. Zhang and Adam F. Osth

Fig. 3.
Fig. 3. Group-level Model Predictions against Group-level Data in Experiment 1. Note. Experiment 1 group-averaged model predictions against group-averaged data on Hit Rates (HR; left column) and False Alarm Rates (FAR; middle and right columns) across different lure types. Lure types testing different similarity effects are separated with background shading, with the transposition, reverse, edge effect and filler conditions displayed from left to right. Black markers are the data. The slot coding, closed-bigram, open-bigram, and Overlap models are shown in red, orange, blue, and green respectively. Error bars indicate 95% highest density intervals (HDIs). Trans = Transposition. SS = Single interior substitution. DS = Double interior substitution. ReverseC = Reverse control. QS = Quadruple interior substitution. P(yes) = probability of giving a yes (old) response at test. Trans_SS, Trans_DS, reverseC, StartMiss_SS, EndMiss_SS, and BothMiss_DS are substitution control lures for each similarity effect. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4. Fig. 5.
Fig. 4. Group-level Model Predictions against Group-level Data in Experiment 2. Note. Experiment 2 group-averaged model predictions against group-averaged data on Hit Rates (HR; left column) and False Alarm Rates (FAR; middle and right columns) across different lure types. Lure types testing different similarity effects are separated with background shading, with the transposition, reverse, edge effect and filler conditions displayed from left to right. Black markers are the data. The slot coding, closed-bigram, open-bigram, and Overlap models are shown in red, orange, blue, and green respectively. Error bars indicate 95% highest density intervals (HDIs). Trans = Transposition. SS = Single interior substitution. DS = Double interior substitution. ReverseC = Reverse control. QS = Quadruple interior substitution. P(yes) = probability of giving a yes (old) response at test. Trans_SS, Trans_DS, reverseC, StartMiss_SS, EndMiss_SS, and BothMiss_DS are substitution control lures for each similarity effect. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6.
Fig. 6. Model Predictions against Data for Three Overlap Model Variants in Experiment 1. Note. Experiment 1 group-averaged model predictions against group-averaged data on Hit Rates (HR; left column) and False Alarm Rates (FAR; middle and right columns) across different lure types. Lure types testing different similarity effects are separated with background shading, with the transposition, reverse, edge effect and filler conditions displayed from left to right. Black markers are the data. The Overlap model with an exponential position uncertainty function, with start and back alignment, and with free position uncertainty are shown in red, blue, and green respectively. Error bars indicate 95% highest density intervals (HDIs). Trans = Transposition. SS = Single interior substitution. DS = Double interior substitution. ReverseC = Reverse control. QS = Quadruple interior substitution. P(yes) = probability of giving a yes (old) response at test. Trans_SS, Trans_DS, reverseC, StartMiss_SS, EndMiss_SS, and BothMiss_DS are substitution control lures for each similarity effect. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 7.
Fig. 7. Group-level Model Predictions for Spatial Coding vs. Open-bigram Models. Note. Experiment 1 group-averaged model predictions against group-averaged data on Hit Rates (HR; left column) and False Alarm Rates (FAR; middle and right columns) across different lure types. Lure types testing different similarity effects are separated with background shading, with the transposition, reverse, edge effect and filler conditions displayed from left to right. Black markers are the data. The open-bigram model and the spatial coding model are shown in red and blue respectively. Error bars indicate 95% highest density intervals (HDIs). Trans = Transposition. SS = Single interior substitution. DS = Double interior substitution. QS = Quadruple interior substitution. P(yes) = probability of giving a yes (old) response at test. Trans_SS, Trans_DS, reverseC, StartMiss_SS, EndMiss_SS, and BothMiss_DS are substitution control lures for each similarity effect. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. A1.
Fig. A1. Hit Rate and False Alarm Rate in Experiments 1 and 2 across Designated Lure Groups. Note. HR = hit rate. FAR = false alarm rate. ES = Edge effect condition, start letter difference. EE = Edge effect condition, end letter difference. EL1 = Edge effect condition, single substitution control. EB = Edge effect condition, both exterior letters different. EL2 = Edge effect condition, double substitution control. Trans = Transposition lure. TL1 = Transposition condition, single substitution control. TL2 = Transposition condition, double substitution control. Reverse = Reverse lures. RCL = Reverse condition, substitution controls. FL1 = Filler condition, single substitution lure. FL2 = Filler condition, double substitution lure. FL4 = Filler condition, 4-letter-substitution control. P(yes) = probability of giving a yes (old) response at test. Error bars represent 95% within-subjects confidence intervals calculated according to Morey (2008).

Table 2
Lure Types to Test Different Orthographic Similarity Effects.
Note. SS: single substitution. DS: double substitution. QS: quadruple substitution. Example lures are in parentheses. *Targets and lures in the QS filler condition are all 6 letters in length.

Table 3
Proportion of Re-grouped Lure Trials in Experiment 1 and 2.
Note. This table shows the designed maximum similarity for each lure type in terms of letter difference. Lures within each type that had more similar items on the study list (i.e. a smaller letter difference than designed) were regrouped into the filler lure types according to their actual maximum similarity. SS: Single interior substitution. DS: Double interior substitution. QS: Quadruple interior substitution. *We treat letter transposition as a letter distance of one.

Table 5
WAIC Difference Scores for Each Model in Experiment 1 and 2.

Table 6
Group-level Estimated Position Uncertainty for the Overlap Model.
As shown in Fig. 7, the spatial coding model fits the data modestly well, but still performs worse than the open-bigram model at capturing the transposition and reverse effects. Model selection further supports the open-bigram model, whose WAIC value is 123 points lower than that of the spatial coding model. Therefore, our results support the open-bigram model as providing an appealing account of orthographic similarity effects in episodic recognition, one that should also be computationally easier for researchers to work with and to test.
Note. r and d are the group-level averaged estimates for the rate and asymptote parameters of the exponential growth function for position uncertainty across letter positions. S1–S7 show the position uncertainty values calculated from the estimated d and r for slot 1 (start letter) to slot 7 (end letter) of a 7-letter string.