Chunking and data compression in verbal short-term memory

Introduction
The idea that verbal STM¹ (short-term memory) capacity is strongly influenced by the number of chunks that can be held in STM has become part of conventional wisdom. Miller's (1956) famous paper on chunking, "The magical number seven", argued that the capacity of STM is a function of the number of chunks that can be stored, not the number of items or the amount of information. However, one important question has received very little attention: what is a chunk?
Let us assume that there is some sense in which FBI is a chunk, and that this confers a memory advantage over an unfamiliar sequence of letters such as IFB. What are the properties of the representation of that chunk that lead to superior memory? STM must represent something other than the raw sequence of letters in the chunk, so what might that be? Miller himself acknowledged that "we are not very definite about what constitutes a chunk of information" (p. 93). One simple suggestion comes from Shiffrin and Nosofsky (1994), who defined a chunk as "a pronounceable label that may be cycled within short-term memory". They went on to point out that "The label must have the critical property that it can be unpacked accurately into its original constituents" (p. 360). That is, chunking is performed by recoding chunks in the input into a different code in STM. Bower (1970) made a similar suggestion to account for chunking in LTM. The idea of chunks as verbal labels would seem to apply most directly to those instances where memory is increased by an explicit recoding strategy. For example, Miller reported an unpublished study by Smith, who taught himself to chunk binary digits by recoding them into octal digits. Three binary digits can be recoded as a single octal digit. Here the octal digit becomes the label for the chunk of three binary digits. Smith could recall about 12 octal digits and, by using the recoding method, he could recall about 36 binary digits. Even more elaborate chunking schemes were reported by Ericsson, Chase, and Faloon (1980). After extensive training, their participant SF was able to recall up to 79 digits. SF achieved this by making extensive use of his knowledge of running times to recode the digits. Training did not increase his STM capacity: despite extensive practice with digits, SF did not increase his memory for consonants. The improvement appeared to be entirely attributable to the development of efficient chunking strategies.
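As a concrete illustration of this kind of recoding, the following sketch (our own illustrative code, not a description of Smith's actual procedure) recodes groups of three binary digits into octal digits and unpacks them again at recall:

# Sketch of recoding binary digits as octal digits and back again.
def encode_octal(bits: str) -> str:
    """Recode a binary string (length a multiple of 3) as octal digits."""
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

def decode_octal(octal: str) -> str:
    """Unpack each octal digit back into its three binary digits."""
    return "".join(format(int(d), "03b") for d in octal)

bits = "110101000111001010110100101101011010"   # 36 binary digits
octal = encode_octal(bits)                      # 12 octal digits stand in for 36 binary digits
assert decode_octal(octal) == bits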
In this paper we begin by reviewing existing suggestions about the nature of chunks in STM. The simplest proposals assume that chunking is achieved by recoding. Recoding schemes are examples of data compression that capitalize on redundancy in the input to allow more items to be stored in the available capacity. We provide an extensive discussion of data compression and its implications for models of STM.
Data compression can operate at multiple levels in both memory and perception. In addition to the situation-specific forms of compression involved, say, in recoding binary digits as octal digits, compression will also play a role in acquiring and developing the representations that underlie STM. For example, the exact form of the linguistic representations supported by STM will depend on the phonetic and phonological properties of one's native language. These representations in turn will constrain the scope for further compression or chunking.
The distinctive feature of data compression schemes is that compression must change the contents of STM. The compressed code must replace the original message in STM. An alternative view is that the benefits of chunking might arise from what is often termed redintegration. According to this proposal, chunking does not change the contents of STM: chunks exist only in LTM, and LTM representations help to recover degraded information from STM. The evidence suggests that both compression and redintegration can play a role in STM, but that it is possible to differentiate between them.
We conclude by offering a revised account of chunking. We suggest that Miller was wrong to reject the idea that the capacity of verbal STM is better viewed as being determined by the amount of information that can be stored. Chunking can help make better use of the available capacity, but its efficiency depends on both the degree of redundancy in the input and the nature of the representational vocabulary underlying a particular memory system (e.g. bits, phonemes, words). For example, a store that can only represent words will be unable to hold an optimally compressed copy of the input. Sometimes this will give the impression that capacity is determined by the number of chunks that can be stored, sometimes by the number of items, and sometimes by the amount of information that can be stored.

Chunking and grouping
Following Miller, we will make a distinction between chunking and grouping. Whereas chunks are determined by pre-existing representations in LTM, grouping is determined by the perceptual structure of the input. For example, lists of digits might be grouped in threes by inserting pauses between the digits, and this will enhance serial recall (Broadbent & Broadbent, 1981; Frankish, 1985; Henson, Burgess, & Frith, 2000; Hitch, Burgess, Towse, & Culpin, 1996; Ryan, 1969a, 1969b; Wickelgren, 1964), but those groups need not correspond to any existing representations in LTM. In contrast, an ungrouped list of letters such as IBMBBCFBI can be parsed into three chunks that each have pre-existing representations: IBM, BBC, FBI.
However, grouping can influence the formation of chunks. For example, McLean and Gregg (1967) had participants learn lists of 24 letters presented in groups of between one and eight letters. The timing of pauses in spoken recall reflected the pattern of grouping, suggesting that the groups had become chunks in LTM. Bower and Winzenz (1969) and Winzenz (1972) found that learning of repeated lists of digits in a Hebb (1961) repetition paradigm was impaired when the grouping of the digits changed from trial to trial. Similarly, Schwartz and Bryden (1971) found that if the initial items in a list varied across repetitions there was no learning, whereas there was learning if the final items varied. It seems that in order to learn a chunk the onset of the chunk must be marked by a grouping boundary, such as the beginning of a list. In each of these studies grouping determines which items become chunks in LTM.

Terminology
The term 'chunk' is used in several different ways. The representation of FBI in LTM might be called a chunk, and so might the representation that codes the chunk in STM. A chunk in STM might be coded by a label, or perhaps by a pointer to the representation in LTM. It is also possible that the representation of a chunk in STM might have the same form as the representation of the chunk in LTM. This is the case with the redintegration theories that we will discuss later. Discussions of chunking rarely pay heed to these differences. Usually the term 'chunk' appears to be used to refer to the combination of the representations of the chunk in LTM and STM. When discussing chunking in general terms, we will use 'chunk' in that theory-neutral sense. However, given that our aim is to specify the nature of the chunking process in more detail, when presenting our theoretical analysis we will try to be more specific.
Although our concern here is with the influence of chunking on STM, the concept of chunking appears in different guises in many areas of cognition. Interestingly, in a recent review of the meanings of chunks and chunking (Gobet, Lloyd-Kelly, & Lane, 2016) there is no mention of chunking in STM. Many models of learning assume that learning involves creating chunks that can be combined to form a hierarchy of chunks (e.g. ACT-R: Anderson, 1983; CHREST: Gobet, 1993; CLASSIC: G. Jones, Gobet, Freudenthal, Watson, & Pine, 2014; SOAR: Newell, 1994; MDL Chunker: Robinet, Lemaire, & Gordon, 2011; EPAM: Simon & Feigenbaum, 1964). Gobet et al. (2001) suggested that it was possible to produce a common definition of a chunk as "a collection of elements having strong associations with one another, but weak associations with elements within other chunks" (cf. Chase & Simon, 1973; Cowan, 2001; Simon, 1974). There are more elaborate notions of chunking. In ACT-R (Anderson, 1983; Anderson, Bothell, Lebiere, & Matessa, 1998) chunks are viewed as schema-like structures containing pointers which encode their contents. These approaches can be seen as investigations into the structure of chunk representations in LTM and how they are learned. The focus in the current paper is on how those representations influence storage in STM. Although, as suggested by Baddeley and Hitch (1974), we assume that the contents of verbal STM are coded phonologically, the formal information-theoretic arguments we present necessarily also apply to semantic or syntactic levels of representation. Finally, when we discuss the information capacity of STM we are using it in the same information-theoretic sense as Miller. If STM has a capacity of four bits, that means that the information retrieved from STM can support a decision between 16 alternatives. Those alternatives could equally well be phonological, lexical or semantic. That is, while the form of the representations in STM may be assumed to vary, the metric determining capacity must remain the same. Some of the most important concepts that will be introduced here are given in Box 1.

Does verbal STM have a constant chunk capacity?
Miller argued that the capacity of STM was not determined by the amount of information that could be held in the store. Nowadays we standardly think of computer storage in terms of information capacity: the number of bits or bytes. How did Miller come to the conclusion that STM capacity was not a function of the amount of information that could be stored? His argument was based on data from studies by Hayes (1952) and Pollack (1953) comparing memory span for different types of materials. If span were determined by the amount of information required to store the items, then items selected from a large set should take up more capacity than items selected from a small set. In a message that is known to consist only of decimal digits, the identity of each digit requires 3.3 bits of information to transmit. One of the 26 letters of the alphabet requires 4.7 bits, and one word from a set of 1000 words requires 9.97 bits. A system with a capacity of about 10 bits should therefore be able to store approximately one word, two letters of the alphabet, three decimal digits, or 10 binary digits. Accordingly, span should be lower for decimal digits than for binary digits, lower again for letters, and lower still for words. Hayes measured memory span for lists of binary digits, decimal digits, letters, letters plus digits, and words from a set of 1000. However, Hayes found that the number of items that could be held in memory was roughly equivalent for all classes of stimulus. A similar result was reported by Pollack (1953). All three authors took this to imply that STM storage was not limited by information capacity as measured in bits. From the chunking study by Smith (reported in Miller, 1956), and many similar studies, we know that span can be increased by chunking. The conclusion therefore was that capacity is not determined by the number of individual items that can be stored, but by the number of chunks into which the message can be recoded. As we will see later, this conclusion that capacity is not determined by information depends on the assumption that people know the set of items that they will be tested on, and that they have developed an optimal code for representing those items in those particular experiments.
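The arithmetic behind this argument can be made explicit with a short sketch (purely illustrative; the set sizes are those discussed above):

import math

# Bits needed to identify one item drawn from a known set of n alternatives,
# and how many such items would fit into a hypothetical 10-bit store.
def bits_per_item(n_alternatives: int) -> float:
    return math.log2(n_alternatives)

capacity = 10.0
for label, n in [("binary digits", 2), ("decimal digits", 10),
                 ("letters", 26), ("words (set of 1000)", 1000)]:
    b = bits_per_item(n)
    print(f"{label}: {b:.2f} bits per item -> about {capacity / b:.1f} items in 10 bits")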
The possibility that capacity might be determined by the number of chunks that can be stored was addressed more directly in a series of studies by Cowan and colleagues (Chen & Cowan, 2005, 2009; Cowan, Chen, & Rouder, 2004; Cowan, Rouder, Blume, & Saults, 2012). The main aim of these studies has been to establish whether memory capacity is limited by the number of items that can be held in memory or by the number of chunks. In these experiments participants learned chunks comprising different numbers of words. They were then tested on either free recall, serial recall, or forced-choice recognition of lists containing different numbers of chunks. In one example, Chen and Cowan (2009) familiarized their participants with pairs of words (e.g. brick-hat, king-desk) and with single words tested by cued recall. Singletons and pairs of words were both deemed to comprise a single chunk. The words were presented repeatedly until participants' performance was perfect. The familiarization phase was followed by serial recall of lists of 2, 4, 6, 8 or 12 singletons, or 4 or 6 learned pairs. Participants performed articulatory suppression during list presentation. Responses were scored in terms of the number of chunks (either singletons or pairs) that were recalled in either the correct position (strict scoring) or anywhere in the list (lenient scoring). The data indicated that recall appeared to be a function of the number of chunks in the input, but only under articulatory suppression and with lenient scoring. With strict scoring and without articulatory suppression, performance was worse for longer chunks. Cowan et al. (2012) extended this work using a greater range of chunk sizes and list lengths and found a similar pattern of results. Thus, contrary to Miller's suggestion, there is little evidence that memory capacity is determined solely by the number of chunks that can be stored, except under the particular combination of articulatory suppression and free-recall scoring.

Data compression
An intuitive notion of chunking is that it enables us to squeeze more information into a limited memory capacity. That is, chunking is an example of data compression (Brady, Konkle, & Alvarez, 2009; Chekaf, Cowan, & Mathy, 2016; Huang & Awh, 2018; Mathy & Feldman, 2012; Norris, Kalm, & Hall, 2020; Thalmann, Souza, & Oberauer, 2019). Data compression is only possible when there is redundancy in the signal. In a series of studies by Cowan and colleagues (Chen & Cowan, 2005, 2009; Cowan et al., 2004; Cowan et al., 2012), considered in more detail later, subjects learned multi-word chunks such as brick-hat. Assuming that neither brick nor hat ever appears alone in the experiment, the chunk contains redundant information. If the list contains brick, the next word must be hat.

Box 1. Basic concepts

Lossless compression
Compression in which the original message can be fully reconstructed, e.g. zip files or FLAC audio files. The strategy of recoding binary digits as octal digits is a form of lossless compression if the message is known to contain only binary digits.

Deletion code
Form a new code for a chunk by deleting redundant information, e.g. brick, hat -> brick.

Lossless compression without chunking
e.g. a Huffman code, which replaces more frequent items with shorter codes (cf. Zipf's law).

Lossy compression
Compression which does not permit the original message to be fully reconstructed, e.g. MP3, JPEG. Most of perception can be thought of as lossy compression: perceiving an acoustic waveform as a word constructs a compressed representation from which the original waveform cannot be reconstructed.

Redintegration
Reconstructing a degraded trace in STM with reference to information in LTM.

p(chunk_i | x) ∝ p(x | chunk_i) · p(chunk_i)

where p(chunk_i | x) is the posterior probability of recalling the items forming chunk_i given some, possibly degraded, information in STM, x, and where p(chunk_i) is the prior probability of that chunk. The greater the prior probability of the chunk, the more likely it is to be recalled correctly.

Minimum description length
A metric for comparing compression methods by computing the amount of information required to represent the entire message, which may consist of a coded message and a codebook.

Codebook/dictionary
Entries in the codebook link the input representations to their compressed codes to support encoding and decoding. For example, an entry might link the sequence of binary digits, 011, to the octal digit 3.

Self-information
The amount of information needed to code each item in a message as a function of its probability: self-information = −log2(p_i), where p_i is the probability of item i.

In order to retrieve the chunk brick-hat it would be sufficient to store only brick or only hat in STM. But what about the case of chunking a random sequence of binary digits by recoding them into octal? By definition, a random sequence contains no redundancy and should therefore not be compressible. The redundancy follows from the fact that when coding a binary sequence verbally we use only two words (zero/oh/nought, one) out of the set of thousands of possible words in the language. For example, there are well over two thousand monosyllabic words in English, so each monosyllabic word could code approximately 11 bits. However, a word used to represent 0 or 1 codes only a single bit. In effect, the remaining 10 bits are redundant (predictable). In principle we could assign a separate monosyllabic word to each possible 11-bit binary number (although this might take quite some time to learn). With a memory span of six words it should then be possible to remember a sequence of 66 binary digits. The redundancy involved in coding binary numbers using two words is analogous to using each word in a 16-bit computer to code a single bit; 15 of the bits will be completely redundant. We will return to this point later: whether the capacity of a memory system appears to be limited by chunks or information will depend on the granularity of the representations available (e.g. bits, phonemes, words). The larger the units of storage, the more the system will appear to have a chunk-based limit. Miller suggested that "Since the memory span is a fixed number of chunks" we can build "larger and larger chunks, each chunk containing more information than before". Instead, we will argue that the benefit of chunking is that it enables us to make more efficient use of the available information capacity, but that capacity is fixed.
A feature shared by many forms of data compression is that there are two parts to the compressed representation: the compressed code itself, and a codebook or dictionary (or possibly a program) that can be used to translate the compressed code back into its full form. For example, because the words 'chunk' and 'compressed' occur frequently in this manuscript, we could compress the text by replacing all occurrences of those words with a shorter sequence of letters (perhaps 'CH' and 'CO'). In order to recover the original text we must also store the codebook (CH = chunk, CO = compressed) along with the compressed sequence. We might consider the entries in the codebook to be analogous to the verbal labels for chunks proposed by Shiffrin and Nosofsky (1994). The codebook enables the compressed representation (chunk labels) to be converted back into its original form. Note that there need be no systematic relationship between the form of the chunks in the original message and their codes. The code for 'compressed' could equally well be '97'.
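A minimal sketch of such a codebook scheme, using the hypothetical codes above, might look like this:

# Toy codebook compression: frequent words are replaced by short, arbitrary codes,
# and the codebook must be kept alongside the compressed text to allow decoding.
codebook = {"CH": "chunk", "CO": "compressed"}

def compress(text: str) -> str:
    for code, word in codebook.items():
        text = text.replace(word, code)
    return text

def decompress(text: str) -> str:
    for code, word in codebook.items():
        text = text.replace(code, word)
    return text

original = "the compressed chunk replaces the chunk in the store"
packed = compress(original)        # 'the CO CH replaces the CH in the store'
assert decompress(packed) == original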
The number of bits required to code an item is given by its self-information, −log2(p_i), where p_i is the probability of item i. The more items there are in a given set, the lower will be their average probability, the greater their self-information, and the more bits will be required to encode them. Importantly, the number of bits required to code an item is not simply a function of the number of items, but also of the probability of each item. Items that occur infrequently are more surprising. They therefore contain more information and require more bits to code optimally. Table 1 shows the self-information for the letters of English determined by their frequency of occurrence in a sample of 40,000 words.² The letters X, Q, J and Z are much less frequent than the letter E, and thus convey more information. The message here is that memory capacity is a trade-off: for a given capacity we can store a lot of items from a small set, or a smaller number of items from a larger set. We can also store more high-probability items than low-probability ones. The limiting factor in STM is not the size of the chunks (that is, the entries in the codebook) but the amount of information needed to select the appropriate chunk. Storing the chunks themselves is the task of LTM.

One of the simplest algorithms for performing compression is Huffman coding (Huffman, 1952). This constructs an economical representation of its input by taking account of the relative frequency of symbols in the input, such that more frequently occurring symbols are replaced with codes requiring fewer bits than less frequently occurring symbols. Shorter codes are assigned to items with lower self-information. As an illustration of Huffman coding, Table 1 also gives the Huffman codes for the letters of written English. An important feature of a Huffman code is that it is a prefix code; that is, no whole code is a prefix of another code. This means that the codes can simply be concatenated with no separation between them and can still be decoded unambiguously. If the ends of the individual code words had to be explicitly marked, those markings would require additional information to store. A fixed-length binary code to represent 26 letters would require five bits. With Huffman coding the most frequent nine letters take four bits or less, with the top two taking only three bits. Thus the length of the Huffman code for each letter is approximately the same as its self-information. In itself, Huffman coding does not perform any kind of chunking. The coding of any character is determined only by its relative frequency; characters are never combined to form larger chunks with distinct representations. The code for a character can be considered to be a pointer or a reference in that it identifies the entry in the codebook that contains the original character in the message.

Table 1. The letters of the alphabet and their probability of occurrence in a 40,000-word dictionary, along with their Huffman code words and their self-information. For the letter E, for example, self-information = −log2(0.12) = 3.06.
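For illustration, a minimal Huffman coder over a handful of letters (the frequencies are rounded illustrative values rather than those of Table 1) shows how more frequent symbols receive shorter, prefix-free codes:

import heapq

def huffman_code(freqs: dict) -> dict:
    # Each heap entry is (total frequency, tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}        # prefix one subtree's codes with 0
        merged.update({s: "1" + c for s, c in c2.items()})  # and the other subtree's with 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"e": 0.12, "t": 0.09, "a": 0.08, "o": 0.075, "i": 0.07,
         "n": 0.067, "q": 0.001, "z": 0.0007}
codes = huffman_code(freqs)
for sym in sorted(codes, key=lambda s: len(codes[s])):
    print(sym, codes[sym])   # frequent letters get shorter codes; no code is a prefix of another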

Pointers
It has often been suggested that STM might store pointers to objects rather than copies (e.g. Bowman & Wyble, 2007; Cowan, 2001; Norris, 2017; Potter, 1993; Ruchkin, Grafman, Cameron, & Berndt, 2003; Trick & Pylyshyn, 1993). In the context of visual STM, Trick and Pylyshyn (1993) were explicit in drawing the parallel between their ideas and the use of pointers in computer languages like C. More recently, Huang and Awh (2018) suggested that chunks might be represented in visual STM by 'handles', defined as 'content-free labels'. The idea of a handle is also borrowed from computer terminology. In informal terms, handles are more elaborate versions of pointers that might refer to a file or some other resource rather than simply a location in memory. In computing, the most basic notion of a pointer is as a fixed-size object that stores a memory address. Storing pointers is more efficient than storing copies of objects, but does not offer further opportunities for compression: each object requires a pointer of the same size. More generally, a pointer is simply a reference. A more economical way of storing information in STM is to store compressed codes, as might be generated by Huffman coding, and use them to reference objects in LTM via a codebook. Here we will assume that a pointer or handle in STM is just the code that selects an entry from the codebook.
Pointers do not provide a way of breaking free of the constraints on the amount of information that can be stored in a given memory system. We noted earlier that to store one word from a set of 1000 would take 9.97 bits. But all that needs to be held in the store is the information required to decide which of those 1000 words must be remembered. No matter how long the words in the lexicon are, the function of the pointer is to select the appropriate word from the set of possible words. Pointers cannot grow without limit. While a six-bit pointer would generally be sufficient to address each phoneme in a language, a pointer that could address any possible representation that might be present in human LTM would have to contain a very large number of bits indeed. Such a system would need a huge capacity.
Miller suggested that we could increase the amount of information stored in memory by creating larger chunks. However, increasing the size of the chunks, or the amount of information that can be encoded in the chunk, does not necessarily increase the amount of information that can be stored in STM. Imagine learning the text of the Bible and the works of Shakespeare as two very big chunks. LTM would need to contain all of the information necessary to accurately encode the sequence of words in each. However, given that those two chunks exist in LTM, a sequence of those chunks can be coded in STM by one bit per chunk, so long as the sequence never contains anything other than those specific chunks. Contrary to Miller's suggestion, increasing the size of the chunks does not increase the size of either the codes or the pointers required to represent them in STM, or the amount of information that can be stored in STM.
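The point can be put in a single line of arithmetic: the size of the code needed in STM depends only on the number of chunks to be distinguished, not on their size (illustrative sketch):

import math

# Two enormous chunks still need only one bit each to be selected from the codebook.
codebook = {0: "entire text of the Bible", 1: "complete works of Shakespeare"}
bits_per_code = math.log2(len(codebook))   # 1.0 bit, however large the chunks themselves are
print(bits_per_code)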

Algorithms for data compression
One of the most familiar uses of data compression is to create computer zip files. This is an example of lossless data compression: when the files are uncompressed the original data can be recovered perfectly. In contrast, with lossy compression, some information is thrown away. The most familiar examples of lossy compression are MP3 encoding of audio signals and JPEG encoding of images. In both cases the original signal can only be approximately recovered from the compressed file. There is a trade-off between the degree of compression and the fidelity of the recovered signal. The greater the compression, the less accurately the signal can be recovered. It might seem that lossy compression must always be a bad thing and should be avoided unless there are strong limits on storage capacity. However, as we will see later, in the context of human memory and perception, compression can have the benefit of revealing important information while discarding unwanted information. Furthermore, as Nassar, Helmers, and Frank (2018) have shown in the case of visual STM, lossy compression can lead to better memory. Their proposals differ from what we have termed chunking in that the chunks are constructed on-line without reference to existing representations in LTM. Their chunking procedure can be thought of as being broadly analogous to MP3 compression; there is a set of principles that can be applied to compress any visual input regardless of whether it has been encountered before. The compressed representations can be considered to be chunks.
An efficient compression scheme is one which minimizes the combined size of the code plus codebook. This is the principle behind Minimum Description Length (MDL) encoding (Grünwald, Myung, & Pitt, 2005; Rissanen, 1978; Wallace & Boulton, 1968), in which length is measured in bits. For example, if a particular word occurred only once in a message, replacing that word with a code representing a chunk would actually increase the combined length, because that word would still have to have an entry in the codebook. In the MDL framework, compression might also be achieved by creating a program that could reconstruct the input. An unending series of ascending numbers would take an infinite amount of storage, but can trivially be encoded in a very short program. MDL is best considered as providing a metric for evaluating the relative success of different compression methods. An ideal compression method would be one which minimized the redundancy in the input. However, MDL does not provide a specific algorithm for generating a compressed code.
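A toy calculation, measuring length in characters as a crude stand-in for bits, illustrates this trade-off (the messages and codes are our own hypothetical examples, not a full MDL implementation):

# Creating a chunk only pays if the word recurs often enough for the savings in the
# message to outweigh the cost of its entry in the codebook.
def description_length(message: str, codebook: dict) -> int:
    coded = message
    for code, word in codebook.items():
        coded = coded.replace(word, code)
    codebook_cost = sum(len(code) + len(word) for code, word in codebook.items())
    return len(coded) + codebook_cost

msg_rare = "the compression principle appears here once"
msg_freq = "compression compression compression compression compression"
print(description_length(msg_rare, {}), description_length(msg_rare, {"C": "compression"}))  # chunk increases length
print(description_length(msg_freq, {}), description_length(msg_freq, {"C": "compression"}))  # chunk reduces length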
The idea that STM might be able to benefit from data compression raises two intriguing questions: what are the codes, and where is the codebook stored? An important feature of the MDL framework is that it applies to the combined length of the code and codebook. In psychological accounts of chunking such as that of Cowan, the usual assumption is that the chunks themselves are stored in LTM and that these are different from the representations in STM. That is, the compressed codes are in STM and the codebook is held in LTM. Returning to the example of recoding binary to octal, the mapping between binary and octal must already be stored in LTM. Incoming sets of three binary digits are replaced in STM by the corresponding octal digit and, at recall, those octal digits in STM must be used to access LTM and read out the corresponding binary digits. The chunks (entries in the codebook) are constructed on the basis of experience of previously encountered material rather than being constructed afresh as each new set of items to be remembered comes along. It might seem that we could therefore expand the codebook without limit, as there are few practical constraints on the capacity of LTM. However, by analogy with the pointers discussed earlier, the larger the codebook, the more information the codes in STM need in order to address the appropriate entry in the codebook. If the codebook is made larger than necessary, the number of items that can be stored in STM will decrease.

Compression and learning
So far we have considered data compression solely as a means of squeezing more information into a given amount of memory. However, data compression follows naturally from something that is even more important than making efficient use of memory: learning. Compression is only possible by virtue of having learned something about the regularities in the data. As noted by Grünwald (2007), "the more we are able to compress the data, the more we have learned about the data" (p. xxv). For example, the infinite sequence of ascending even numbers would be impossible to store in any memory system, but once one learns the underlying regularity it can be compressed into a very short computer program.
However, learning the regularities is a difficult problem with no guarantee of success. Consider the following series of digits: 415926535897932384626433832795028841971. This series can actually be represented very economically because it is the value of π starting from the third digit, but it is unlikely that anybody would discover that fact. Data compression in the form of chunking plays a central role in several models of learning (e.g. Orbán, Fiser, Aslin, & Lengyel, 2008; Robinet et al., 2011; Servan-Schreiber & Anderson, 1990). Both the Robinet et al. (2011) and Servan-Schreiber and Anderson (1990) models operate by building up successively larger chunks. For example, Robinet et al.'s MDL-Chunker model has two components: the first proposes possible chunks, and the second selects those chunks that allow the input to be represented with the minimum description length. The process can be repeated to form yet larger chunks.

Compression and language
The words in the language can be seen as a compressed code that is used for communicating meaning. A simple way to characterize speech communication is as a way of transferring a sequence of words between two speakers. An efficient way to achieve this might be to use a compression scheme like a Huffman code, where the most commonly occurring elements are represented by the shortest codes. Zipf's law (Zipf, 1935) indicates that this is indeed what happens: more frequent words tend to be shorter than rarer words. This enables information to be transmitted at a faster rate. More generally, Piantadosi, Tily, and Gibson (2011) have shown that word length is optimized for efficient communication.
Not surprisingly, then, almost all computational models of how infants learn words from continuous speech rely on developing an efficient compressed code. The chunks in that code then become candidate words. That is, the codebook or dictionary becomes the child's lexicon. For example, Brent and Cartwright (1996) proposed a model based on Minimum Description Length. More recent models are generally expressed in a Bayesian framework (for review see Pearl, Goldwater, & Steyvers, 2010). Bayesian and MDL approaches are closely related in that the MDL principle can always be interpreted in terms of Bayesian model comparison (see MacKay, 2003, p. 354). Note that although some of the early models needed to operate on the entire input corpus in order to identify words (Goldwater, Griffiths, & Johnson, 2009), most models work incrementally (e.g. Brent, 1999; Pearl et al., 2010). For example, Brent's MBDP-1 model operates one utterance at a time.
Compression will also play a role in determining the nature of the linguistic representations underlying verbal STM. In turn, these representations will determine how efficiently the store can be used to perform additional compression or chunking. For example, a phonemic transcription is already highly compressed relative to the original acoustic input. The full processing stream of speech recognition involves moving from an informationally rich stream of acoustic data to a representation in terms of meaningful linguistic units such as phonemes or words. This entails a massive degree of data compression. Assume that we begin with a CD-quality monaural signal with 16-bit samples at 44.1 kHz. This corresponds to a data rate of about 700 kbps (kilobits per second) and is more than enough to cover the full range of human hearing. The same signal can be compressed using MP3 encoding at 160 kbps with negligible loss of fidelity to the human ear. MP3 coding at only 32 kbps is still quite sufficient to transmit intelligible speech. Alternatively, a good speech codec can reduce this to about 800 bps, and to transmit a phonemic transcription would take only 50-100 bps. This in turn can be reduced by at least a factor of 2 by taking account of the redundancies in human language. We can therefore compress the acoustic signal by a factor of over 10,000 and still recover the original message. However, achieving this degree of compression requires a huge amount of knowledge. Even the process of constructing an MP3 encoding requires knowledge about human hearing and the importance of critical bands. The acoustic waveform reconstructed from a high-quality MP3 might not be high quality to an organism with a different hearing system. Converting an acoustic waveform into a phonemic transcription requires a deep understanding of the regularities of human speech. Even the best automatic speech recognition systems can only approximate this level of compression under ideal conditions; they have not yet learned enough about the regularities of human speech.
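The data rates quoted above can be checked with some back-of-envelope arithmetic (all values approximate):

# CD-quality mono PCM versus progressively more speech-specific codes.
pcm_bps = 16 * 44_100        # about 706 kbps
mp3_speech_bps = 32_000      # intelligible speech over MP3
codec_bps = 800              # a good dedicated speech codec
phonemic_bps = 75            # phonemic transcription, mid-range of 50-100 bps
print(pcm_bps / phonemic_bps)   # roughly 9,400x; over 10,000x once linguistic redundancy is also exploited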
The task of recognizing words from a continuous acoustic input therefore involves a huge degree of data compression. However, the goal is not compression, but understanding. Converting the acoustic signal to some form of linguistic representation gives us data compression for free, and we now have a signal that we can hold in a store with limited capacity. However, there is one very important consequence of such extreme data compression: all we can place in that store is speech, because the resulting code is incapable of representing anything else. The difference in timbre between a note played on a piano and a trumpet, for example, simply cannot be represented in terms of phonemes. This is another manifestation of the trade-off discussed in the context of pointers. We can use a given amount of memory to store a large number of things from a small set (phonemes) or a small number of things from a large set (arbitrary sound waves). It also demonstrates that this is a form of lossy data compression: information is thrown away that cannot be recovered. But most of the information that is lost plays no role in understanding speech. More specifically, the information that is thrown away plays no role in the understanding of speech in our native language. Short-term memory for non-native speech is much poorer than for speech in our native language (Thorn & Gathercole, 1999). This offers a possible explanation of why nonwords with a high phonotactic probability are easier to repeat than those with a lower phonotactic probability (Gathercole, Frankish, Pickering, & Peaker, 1999; G. Jones & Macken, 2018); they can be coded more economically in phonological STM (cf. Tamburelli, Jones, Gobet, & Pine, 2012). A consequence of these considerations is that whenever some set of processing operations only needs access to a limited set of representations, the efficiency of storage will be maximized by adapting the compression to encode only the required representations. If some process only needs access to speech, it is best served by a store that can only encode speech. A process that operated on vision should only encode the visual characteristics of stimuli. From the perspective of achieving efficient data compression, there will be a strong incentive to develop specialist modality-specific stores. For a given information capacity, a general-purpose amodal store will never be able to hold as many items or chunks as a modality-specific store. Christiansen and Chater (2016) present an extensive analysis of the role of compression and chunking more generally in language comprehension and production. It is interesting to note that much of their argument is motivated by the idea that forming progressively larger or more abstract chunks primarily serves to reduce interference rather than to achieve more efficient storage.
The fact that there are constraints on how compression might determine the representations used in a phonological store does not imply that the representations are completely fixed. The nature of the phonological or phonetic code must be capable of changing with experience in order to acquire one's native language, or to learn a new language. Different languages have different phonological repertoires. Language acquisition therefore depends on tuning the store to support the necessary phonological representations.
The ability to operate on a compressed representation of an acoustic signal is not something unique to humans. Neurophysiological recordings from the primary auditory afferents of frogs show how the auditory system is tuned to the statistics of the environment. Rieke, Bodnar, and Bialek (1995) found that the rate of information transmission of spike trains was 2-6 times higher for naturalistic sounds than for broad-band stimuli, and that this rate was close to the theoretical maximum.
A contrasting position, taken by Jones and Macken (2018) (see also G. Jones, 2016; G. Jones et al., 2014; G. Jones & Macken, 2015), is that phenomena such as the phonotactic probability effect have nothing to do with a dedicated phonological store. They argue that "performance in vSTM measures primarily reflects domain-general associative learning" (p. 216), and they assume that experience with linguistic input makes it possible to represent the input in terms of successively larger chunks. In line with Miller, they suggest that sequences comprised of fewer chunks are easier to remember. However, contrary to Miller, they wish to deny the existence of STM altogether: "our computational model of associative learning provides a parsimonious explanation of performance in vSTM tasks without the need for additional bespoke processes such as a short-term memory system" (p. 226). They used their model to explain the effects of phonotactic probability on nonword repetition (G. Jones et al., 2014) and have suggested that it explains why digit span is better than word span (G. Jones & Macken, 2018).
Nevertheless, their data make a good case that input sequences that can potentially be encoded in terms of fewer chunks are easier to remember. Such sequences will be easier to compress. The unresolved question is where the compressed code is stored. In the case of the very local statistical dependencies between phonemes that might underlie nonword repetition effects, we suspect that the code and codebook may be part of STM itself. For larger-scale dependencies, such as those present in digit sequences, it is much less clear. They might be better explained in terms of the redintegration mechanism to be described later.
The chunking process suggested by Jones and colleagues has parallels with a much earlier algorithm for redundancy reduction developed by Redlich (1993). This was not intended as a psychological model. Both procedures operate by combining existing chunks into larger chunks. However, in Redlich's procedure, larger chunks (Redlich describes them as features) are constructed incrementally only when they reduce the redundancy in the representation of the input corpus. As new chunks are formed, these chunks in turn can be combined to form larger chunks if those new chunks reduce redundancy. The Jones model simply creates ever larger chunks regardless of whether they can help compress the corpus.
Compression should never be seen as an end in itself. For example, both MP3 and FLAC encoding of auditory signals operate on a purely acoustic level of representation that does not respect the linguistic structure of the input. Both may offer a high degree of compression of a speech signal, but the compressed representations will obscure the properties of the signal that are crucial for speech recognition. That is, although the acoustic signal can be reconstructed (perfectly in the case of FLAC, or approximately in the case of MP3), the compressed code itself obscures critical information. This is a point made very clearly by Barlow (2001): "coding should convert hidden redundancy into a manifest, explicit, immediately recognizable form, rather than reduce it or eliminate it" (p. 246). In the case of speech, that recognizable form might take the form of, for example, phonetic features or phonemes.
Compressed representations will generally not occupy the same similarity space as the uncompressed representations. For example, the words 'had' and 'mad' are phonologically similar, and we know that STM relies heavily on a phonological code (Conrad, 1965). However, the Huffman codes of 'had' and 'mad' might be no more similar than those of 'gem' and 'artichoke'. On the other hand, compression at an acoustic level that resulted in representations of phonological features would necessarily construct a space where 'had' and 'mad' were similar. If the representations supporting verbal STM are primarily phonological, or at least tailored to speech, this will have implications for the kind of chunking process that might operate in studies of verbal STM. If the results of compression must be represented in a phonological code, compression would need to be achieved by replacing one phonologically expressed representation with another. As suggested by Chen and Cowan (2009), this could be done by storing only the first word in a chunk. Alternatively, a chunk may be replaced with a completely different phonological form, as in the case of recoding binary as octal. Note that both of these schemes only work because the experimental procedures limit the number of potential messages that can be conveyed. Coding by storing the first word of a chunk will only be effective if there is only ever one chunk beginning with a given word; otherwise there would be no way of using that word to retrieve the correct chunk. Recoding binary to octal is only possible because the message is known to contain only binary digits. This link between compression and the set of possible messages corresponds to a central result in the field of information theory: efficient storage and transmission of information depends on knowledge of the statistical properties of the signal (Shannon & Weaver, 1949).

Evidence for compression
Only a small number of studies of STM have considered chunking in the context of data compression (e.g. Brady et al., 2009; Chekaf et al., 2016; Mathy & Feldman, 2012; Norris et al., 2020; Thalmann et al., 2019). Mathy and Feldman had participants recall digit sequences which varied in compressibility. This was achieved by varying the number of increasing or decreasing runs of digits within the lists. For example, the list 123486456 has one run with an increment of 1 (1234), one with a decrement of 2 (864), and one with an increment of 1 (56). Lists with longer runs will be more compressible. Span was found to be influenced by both the number of digits in the list and their compressibility. This has parallels to Cowan et al.'s (2012) finding that memory performance in recalling word lists was a function of both the number of items and the number of chunks.
Mathy and Feldman did not offer any suggestion as to the form that the compressed representation might take or where the codebook might be stored. If participants can compress the run of digits 4, 5, 6, 7, then STM must contain a more economical representation of this sequence and also have access to a codebook (in LTM) to convert the code back into the sequence of digits. One possibility might be to incorporate some representation of ellipsis, such as '4-7', where '-' codes the sequence of digits in between the first and last. "5, 6, 7, 8" could be represented as '5-8'. Here '-' would represent a different set of digits than in '4-7', because it would encode "6, 7" instead of "5, 6". We noted earlier that in the MDL framework, compression can be achieved by being able to generate the input from a program. The compressed code for '4-7' might indicate that the first and last digits should be passed to a program to generate the appropriate run of intervening digits. Of course, to achieve compression the call (pointer) to the program must be coded more economically than the sequence of digits it replaces.

Chekaf et al. (2016) examined recall of sequences of symbols constructed from combinations of three features: large/small, square/triangle, black/white. The critical comparison was between a Rule condition where the symbols systematically cycled through the features (e.g. large white square, large black square, small white square, small black square, small white triangle, small black triangle) and a Dissimilar condition where there was no systematic relationship in the order of the symbols (e.g. large white square, small black triangle, large black square, small white triangle, small black square, small white square). Recall was better in the Rule condition. Chekaf et al. suggested that people were performing data compression. This extends Mathy and Feldman's (2012) result by showing that people are able to take advantage of regular patterns in sequences both when compressed representations exist before the experiment and when they must be developed during the experiment. Chekaf et al.'s interpretation was that this could not have been achieved by an MDL method because the sequences did not contain recurring patterns. However, as already noted, MDL coding does not need to be achieved via a codebook and can, for example, recode the input as a program. Recoding using a set of rules is perfectly consistent with the MDL framework. For example, by adding appropriate syntactic symbols combined with appropriate semantics in LTM, the Rule sequence above might be coded more economically as [white, black] (large square, small square, small triangle). Note that the code needs to contain enough information to distinguish unambiguously between different possible rule-based sequences.
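One hypothetical way of recoding such lists, in the spirit of the run-based manipulation and the ellipsis code sketched above (this is our own illustration, not Mathy and Feldman's model), is to replace each run with a (first digit, step, length) triple:

# Recode a digit list as runs and expand the runs again at recall.
def encode_runs(digits):
    runs, i = [], 0
    while i < len(digits):
        j = i + 1
        step = digits[j] - digits[i] if j < len(digits) else 0
        while j < len(digits) and digits[j] - digits[j - 1] == step:
            j += 1
        runs.append((digits[i], step, j - i))
        i = j
    return runs

def decode_runs(runs):
    return [start + k * step for start, step, length in runs for k in range(length)]

lst = [1, 2, 3, 4, 8, 6, 4, 5, 6]      # the example list 123486456
runs = encode_runs(lst)                # [(1, 1, 4), (8, -2, 3), (5, 1, 2)]
assert decode_runs(runs) == lst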
Strong evidence for a genuine effect of data compression comes from a study of visual STM by Brady et al. (2009). They performed two visual STM experiments where participants had to report the color of one of eight rings or circles that had been presented for 1 s. In the first, the display consisted of four circles arranged in a diamond, each containing two concentric colored rings. Participants had to report the color of the ring indicated by a cue. Although all colors appeared equally often overall, the frequency of their pairings varied. Unsurprisingly, colors that appeared more frequently together were better recalled. The critical result was that when the displays contained more frequent pairings of colors, recall of colors that did not appear together frequently was also improved relative to trials where all of the pairings were low probability. This is exactly what would be expected if participants were forming more compressed representations of the high-probability pairs. If the representations of the high-probability pairs are compressed then they will take up less storage capacity, freeing up more capacity for low-probability pairs appearing in the same display. Brady et al. modelled their data under the assumption that participants were learning the frequencies of color pairs, which were represented by a Huffman code (Huffman, 1952). Although the number of items recalled increased throughout the course of the experiment, the information capacity of STM (as given by the length of the Huffman code) remained constant. Brady et al. also constructed a chunking model where the probability of treating a pair as a chunk increased over the course of the experiment. This also gave a good account of the data. More recently, Ngiam, Brissenden, and Awh (2019) have replicated Brady et al.'s results, adding the qualification that the compression effect was only observed when participants were aware of the fact that some pairings were more frequent than others.

Chunks or information?
Brady et al.'s study suggests that, at least in the case of visual STM, chunking helps to make more efficient use of a store with a fixed information capacity. Chunking can improve performance, but the capacity of STM as measured in bits remains constant. This contrasts with Miller's claim that we can increase the number of bits that memory can store by building ever larger chunks. Miller based his conclusions on studies by Hayes (1952) and Pollack (1953) which had found that span varied little as a function of the amount of information per item. However, span could only be a function of information if participants had developed an optimal compressed representation of the stimuli prior to the testing phase of the experiment. This depends on knowing the set of items that is used in the experiment and having the flexibility to develop an optimal code for those items. For example, in an experiment using digits as stimuli, each digit could be represented by a different four-bit pattern, whereas in an experiment using 1000 words, each word would need to be represented by a 10-bit pattern. The word 'four' and the digit '4' would therefore have different representations in the two schemes. A memory system that was adapted to store words would, of course, be able to store digits as the corresponding words, but its capacity for digits and words would be the same. As we will see later, the capacity of STM as measured in bits will be critically dependent on the coding scheme used. The important message here is that even if memory could hold only a fixed number of bits, capacity could only be fully utilized if the compressed code was tailored to a specific set of stimuli.
Once a code has been developed, further scope for compression to capitalize on redundancy in the input will depend on the granularity of those representations. As noted earlier, the larger the units, the more the system will appear to have a chunk-based limit. If the smallest unit of representation is a bit, and the compression is optimal, capacity will be determined by the number of bits available. However, if the smallest units of representation are the words in an individual's lexicon, this will limit the degree of compression that is possible. Compression will therefore depend on how efficiently the input can be recoded into the available representational units. Rather than thinking of capacity as being determined either by information or by chunks, it is perhaps more useful to think of it as being a continuous function of the density with which information can be packed into memory. With small units of information (e.g. discrete bits or continuous representations of information) and optimal recoding, capacity will appear to be determined by information. With larger units (e.g. words), especially when recoding options are limited, capacity will appear to be determined by chunks. The question is not therefore whether capacity is determined by information or chunks, but how efficiently the input can be compressed. The factors determining whether capacity appears to be constrained by chunks, items or information are summarized in Box 2.
Information-theoretic considerations place a number of constraints on how chunking might operate. First, there are fundamental limits on the amount of information that can be coded by chunks. As the number of possible chunks that might need to be stored in STM increases, so will the number of bits required to code an index or pointer to a chunk. Ideally, representations in STM (codes or pointers) should be optimized to store information efficiently (that is, description length should be minimized). Second, the potential for compression will depend on the flexibility and granularity of the representations that STM can support. An inflexible system which has been fine-tuned to represent one kind of information economically may not be able to adapt to perform optimal compression of other material. This will apply even in the case of trying to compress digits using codes developed to represent words. There will be no problem representing the digits, as they are just a subset of the words; the problem is that further compression can only be achieved by developing a completely different coding scheme.

Redintegration and Bayesian inference
The idea that chunking may be a form of data compression conforms to the informal notion that chunks help us squeeze more 'information' into a given amount of memory. An alternative view of chunking is that representations of chunks exist only in LTM, but can be used to reconstruct a degraded trace in STM by a process often referred to as redintegration (Brown & Hulme, 1995; Horowitz & Manelis, 1972; Hulme, Maughan, & Brown, 1991; Hulme et al., 1997; Jacobs, Dell, & Bannard, 2017; T. Jones & Farrell, 2018; Lewandowsky & Farrell, 2000; Poirier & Saint-Aubin, 1996; Roodenrys & Miller, 2008; Schweickert, 1993; Stuart & Hulme, 2009; Thorn, Gathercole, & Frankish, 2002). Redintegration offers a simple explanation of the fact that, for example, words are easier to remember than nonwords (Hulme et al., 1991; Hulme, Roodenrys, Brown, & Mercer, 1995) or that STM for high-frequency words is better than for low-frequency words (Hulme et al., 1997), but can apply equally well to chunks at other levels.
According to this view, the contents of STM are not recoded into chunks on input. Chunks in LTM come into play only at recall and do not alter the contents of STM. If information in STM is degraded as a consequence of forgetting, items or chunks with pre-existing LTM representations can be recovered on the basis of less information than would be required to recover items with no LTM representations.
Redintegration can be achieved by treating recall from memory as a process of Bayesian inference (see Box 1) whereby representations of chunks in LTM provide the priors that can be used to interpret a degraded representation in STM (Botvinick, 2005; Botvinick & Bylsma, 2005). According to this view, STM would be a passive process whose contents remain the same regardless of whether they correspond to chunks or not. Treating memory retrieval as a process of Bayesian inference has a long history in the study of memory (Botvinick, 2005; Botvinick & Bylsma, 2005; Crawford, Huttenlocher, & Engebretson, 2000; Hemmer & Steyvers, 2009a, 2009b; Huttenlocher, Hedges, & Duncan, 1991; Huttenlocher, Hedges, & Vevea, 2000; Shiffrin & Steyvers, 1997; Sims, Jacobs, & Knill, 2012; Xu & Griffiths, 2010). From this perspective chunks in LTM are considered to be hypotheses with associated priors, and recall involves computing the posterior probability of those chunks given the evidence in STM. The identity of more probable representations (those corresponding to chunks in LTM) can therefore be inferred from STM on the basis of less evidence than would be required for less predictable representations. For example, the letters FBI form a familiar chunk with a higher prior probability of occurring together than a random sequence of letters such as SVY. Given an equivalent amount of evidence from a degraded representation in STM, the letters FBI will have a higher posterior probability than the letters SVY. (The computations are identical to those used in models of word recognition such as Norris, 2006, and Norris & McQueen, 2008, where high-frequency words can be identified on the basis of less evidence than low-frequency words; the sole difference is whether the evidence comes directly from perceptual input or from representations held in STM.) The fact that FBI exists as a chunk in LTM will therefore benefit recall from STM. From a computational perspective, the use of Bayesian inference to combine different sources of evidence (in this case STM and LTM) is the same process as, for example, combining syntax and semantics (Padó, Keller, & Crocker, 2006) or vision and touch (Ernst & Banks, 2002).
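A minimal sketch of this inference, with priors and likelihoods invented purely for illustration: LTM supplies a prior over candidate strings, the degraded STM trace supplies a likelihood, and recall selects the candidate with the highest posterior.

# Toy Bayesian redintegration: posterior is proportional to likelihood times prior.
# All numbers are invented for illustration.
prior = {"FBI": 0.10, "SVY": 0.001}       # LTM: FBI is a familiar chunk, SVY is not
likelihood = {"FBI": 0.5, "SVY": 0.5}     # the degraded STM trace fits both equally well

unnormalized = {s: likelihood[s] * prior[s] for s in prior}
total = sum(unnormalized.values())
posterior = {s: p / total for s, p in unnormalized.items()}
print(posterior)   # FBI dominates despite identical evidence from STM

Equal evidence from STM thus yields very unequal posteriors once LTM priors are taken into account.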
We can now distinguish between three broad classes of model employing chunking or compression:

Ideal observer
An ideal observer (for an introduction to the concept of an ideal observer see Geisler, 2003) would take full advantage of the statistical properties of the input to produce an efficient code that maximizes the amount of information that can be stored in STM. It would not be limited to using any particular form of encoding. It is not restricted, for example, to a phonological or lexical code. Furthermore, the form of the codes would not necessarily have any systematic relationship to the form of the input. In the ideal observer model the number of items that could be stored would be a function of the length of the code, as measured in bits. The ideal observer has the same properties as the ideal observer model of visual STM presented by Sims et al. (2012).

Fixed code
A more plausible model of verbal STM would assume that STM has a fixed representational vocabulary such as a phonological code. Changes in that code would be assumed to happen only slowly, as might be the case when learning a new language. By this account, compression can only be achieved by recoding the input using the same representational vocabulary. For example, the phonological representation of binary digits could be recoded into phonological representations of octal digits by consulting the codebook in LTM. Recall then involves consulting the codebook again to decode the encoded representations back into the input representations. Because the underlying representational vocabulary cannot be adapted, compression will not be optimal.
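The fixed-code account can be caricatured as a codebook lookup. The sketch below implements the binary-to-octal example as a dictionary standing in for LTM; it is a deliberately simplified illustration of encode-then-decode, not a claim about the underlying mechanism.

# Codebook stored in LTM: three binary digits <-> one octal digit.
CODEBOOK = {format(i, "03b"): str(i) for i in range(8)}   # '000' -> '0', ..., '111' -> '7'
DECODE = {octal: binary for binary, octal in CODEBOOK.items()}

def encode(binary: str) -> str:
    # Recode the input into octal "chunks" before it enters STM.
    return "".join(CODEBOOK[binary[i:i + 3]] for i in range(0, len(binary), 3))

def decode(octal: str) -> str:
    # At recall, consult the codebook again to unpack each chunk.
    return "".join(DECODE[digit] for digit in octal)

stored = encode("100111010")          # '472': three items held in STM instead of nine
print(stored, decode(stored))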

Redintegration
In the case of redintegration the input is not recoded. LTM comes into play only at the point of retrieval and there is no need for a codebook. Recall from STM will benefit from any form of constraining information (priors) stored in LTM. For example, priors derived from semantic or associative knowledge may lead "flying bird" to be more readily redintegrated than "crying bird".
Chunking and redintegration are not mutually exclusive. Imagine that the binary digits 100 have been recoded in STM into a phonological code for the octal digit four. Even if the phonological code is heavily degraded and might be consistent with four, shore, or more, the knowledge that STM should contain only octal digits will help to retrieve the correct item.
Both chunking and redintegration will lead to improved memory for inputs containing chunks. We assume that linguistic encoding is optimized to code spoken language efficiently, but is not flexible enough to dynamically adapt to the statistical properties of the input on a trial-by-trial basis. For example, even if the input on a particular trial is the same word repeated several times, the system will not be able to take advantage of this redundancy and generate a more efficient representation that can be stored more economically in STM.
Note that the LTM decoding processes in the chunking and redintegration schemes are different. In a slot-based model such as that proposed by Cowan, the LTM decoding might simply consist of recognizing that an item stored in a slot corresponds to a chunk in LTM (e.g. 'F' -> FBI) and then replacing that index with the full chunk to generate a response ('F' -> 'F', 'B', 'I'). In the case of redintegration the output from STM is not a different encoding of the chunk but a degraded form of the input chunk itself ('FBI', not 'F', and not an index or pointer to 'FBI'), and LTM is used to reconstruct that degraded representation and produce a more reliable response.
Assuming that the underlying capacity of STM in bits is the same in all three cases, only the ideal observer will appear to have a capacity determined by the amount of information that can be stored, because only the ideal observer has the flexibility to develop an optimal code. Consider a memory system that has evolved to efficiently store letters of the alphabet when those letters all appear equally often. Each letter will take 4.7 bits to encode. However, as shown in Table 1, if those letters appear with the same frequency as they do in written English, it is possible to develop a more efficient code where more common letters are encoded with fewer bits. A system where the basic unit of storage was the letter would not be able to take full advantage of the redundancy in the input without changing the underlying representation of the letters. However, if there are chunks of letters which appear frequently in the input (e.g. FBI, where F is always followed by BI), then the chunks can be recoded (e.g. FBI -> F). The capacity of such a system will appear to be determined by the number of chunks in the input and not by the information capacity of the store (see Box 2).
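The saving available from a frequency-sensitive code can be illustrated with a short sketch. The skewed distribution below is invented for illustration (real figures would come from letter counts such as those summarized in Table 1); the entropy gives the average bits per letter that an optimal frequency-sensitive code could achieve, compared with the fixed log2(26) ≈ 4.7 bits of a uniform code.

import math

def entropy(probs):
    # Average bits per symbol achievable by an optimal frequency-sensitive code.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1.0 / 26] * 26    # all letters equally likely
# Invented skewed distribution standing in for English letter frequencies.
skewed = [0.12, 0.09, 0.08, 0.075, 0.07, 0.065, 0.06, 0.06, 0.055, 0.05,
          0.04, 0.035, 0.03, 0.03, 0.025, 0.02, 0.02, 0.015, 0.015, 0.01,
          0.01, 0.008, 0.007, 0.005, 0.003, 0.002]

print(f"uniform letters: {entropy(uniform):.2f} bits per letter")   # about 4.70
print(f"skewed letters:  {entropy(skewed):.2f} bits per letter")    # fewer bits on average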

Box 2. Factors determining whether capacity is determined by information, items, or chunks.

Chunks
Chunking involves recoding the message (input) expressed in one vocabulary (perhaps words or phonemes) into a more efficient code in the same vocabulary. An example is recoding sequences of binary digits into octal digits (e.g. 011 <-> 3). In the terminology of data compression, the bi-directional mapping between binary and octal is stored in a codebook or dictionary. In psychological terms, the codebook will be represented in LTM. The number of items that can be stored in STM depends on the form of the chunks used which, in turn, depends on being able to capitalize on redundancy in the input.

Items/words
If the participant has not discovered the redundancy, or the processes of recoding and decoding are too slow or effortful to influence memory, capacity will appear to be determined by the number of items that can be stored. For example, recoding binary to octal breaks down under time pressure (Pollack & Johnson, 1965).

Information
Capacity will only appear to be determined by the amount of information if there are no constraints on the representational vocabulary and the chunking or compression process is free to utilize whatever code can achieve maximum compression. This is most directly analogous to computer compression algorithms such as those used in creating zip files. The encoded message will be in a different vocabulary from the input message. The encoding could make use of a codebook in LTM, or the codebook could form part of the compressed representation itself.
The difference between the models is in the nature of the representations generated by the encoding process. If those representations are words or phonemes, this will limit the flexibility of the system to develop efficient codes.

Evidence for redintegration
The case for redintegration has most frequently been made in the context of the recall of individual items. For example, Hulme et al. (1997) found that low-frequency words were sometimes erroneously recalled as a similar-sounding higher-frequency word ('list' substituted for 'lisp'). However, redintegration can also operate over larger units or whole lists. More direct evidence for a Bayesian account of recall from STM comes from a study by Botvinick and Bylsma (2005). They trained participants to recall sequences of nonwords generated by an artificial grammar and found that their performance improved over time as a function of how likely the sequences were to be generated by the grammar. A similar result has been reported by Majerus, Van der Linden, Mulder, Meulemans, and Peters (2004). In some respects these results are similar to Mathy and Feldman's (2012) finding that sequences of digits are recalled better when they contain runs of ascending or descending digits. The main difference is in terms of whether the familiarity of the sequences comes from pre-existing knowledge or from experimental training. A critical finding in Botvinick and Bylsma's study was that for sequences of equal probability, recall was worse if neighboring sequences were also of high probability. This is exactly what would be expected from a Bayesian framework which assumes that participants will recall the sequence with the highest posterior probability. This will be a function of both the evidence for each possible sequence (the likelihood) and their prior probability (determined from experience with the grammar). Recall will therefore be biased towards higher-probability sequences, which will reduce the probability of correctly recalling a sequence with a high-probability neighbor. This kind of result is difficult to reconcile with compression schemes involving recoding, where the form of the code will generally be completely orthogonal to the form of the input. Sequences that are neighbors in the original message space will therefore not be neighbors in the compressed representation. If that is so, tasks relying on a compressed code should show no systematic neighborhood effects.
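The neighbor effect follows directly from normalizing the posterior over candidate sequences. The sketch below uses invented numbers purely to show the qualitative pattern: a target of fixed prior probability is recovered correctly less often when a confusable neighbor also has a high prior.

# Toy illustration of the neighbor effect under Bayesian recall.
# Likelihoods and priors are invented; only the qualitative pattern matters.
def p_correct(target_prior, neighbor_prior, evidence_for_target=0.6):
    # The degraded STM trace favors the target only slightly over its neighbor.
    likelihood = {"target": evidence_for_target, "neighbor": 1 - evidence_for_target}
    prior = {"target": target_prior, "neighbor": neighbor_prior}
    posterior = {s: likelihood[s] * prior[s] for s in likelihood}
    return posterior["target"] / sum(posterior.values())

print(p_correct(target_prior=0.05, neighbor_prior=0.01))   # low-probability neighbor: ~0.88
print(p_correct(target_prior=0.05, neighbor_prior=0.05))   # high-probability neighbor: 0.60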
Whereas Botvinick and Bylsma used an artificial grammar, Jones and Farrell (2018) studied the influence of English syntax on memory. Recall of syntactically ill-formed sequences was found to be biased towards producing sequences that conformed more closely to English syntax. Like Botvinick and Bylsma, they argued that this was best explained in Bayesian terms as a bias on recall of order. Interestingly, in these experiments the bias could not operate at the level of the order of individual words, as most of the words would never have co-occurred before: the bias must operate at the level of syntax. Jones and Farrell compared four different models, all implemented within Henson's (1998) Start-End Model, and concluded that their results were more consistent with redintegration than chunking.
Although compression cannot easily explain the data from Botvinick and Bylsma or Jones and Farrell, redintegration has no explanation for the recorded cases of exceptional digit recall. Individuals with a very large digit recall capacity, such as SF, invariably adopt complex recoding or chunking strategies during presentation (e.g. Ericcson et al., 1980; Ericsson, Delaney, Weaver, & Mahadevan, 2004). In these extreme cases it is unlikely that redintegration has any significant role to play in memory. If it did, this would imply that someone like SF would have had to store a partially degraded sequence of 79 digits in STM and then use his knowledge of running times to reconstruct that degraded trace. However, in conventional STM studies with untrained participants, both data compression and redintegration might operate.

Is there compression in verbal STM?
Although there is evidence of compression in visual STM (Brady et al., 2009), it is not at all clear whether one might expect to see compression in verbal STM, given that the representations supporting verbal STM are primarily phonological. Consider an experiment similar to those of Cowan and colleagues where chunks are induced experimentally by exposing participants to specific pairings of words, and where all individual words are seen equally often; all that differs between chunks and other words is the pairing of the words (their mutual information). The chunks introduce redundancy into the input and there is therefore the potential for compression. If the input is treated as a sequence of words, both unchunked single words and chunked pairs contain the same information and should be represented by codes of the same length. But what if the input is represented in a purely phonological form? Changing the probabilities of sequences of words (their mutual information) will not necessarily have a significant impact on the statistical regularities at the phonological level. The redundancy introduced by the presence of word chunks will be much less apparent and will only emerge if the system is sensitive to statistical relationships in the input spanning many phonemes. The difference can be illustrated by considering Shannon and Weaver's (1949) examples of a second-order approximation to English letters and a second-order approximation to English words. Letters: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE. Words: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED. In both cases statistics are computed over the same number of units, but the second-order letter approximation contains no useable information about the associations between words. A store that is sensitive only to statistics computed over phonological representations is therefore unlikely to adapt its coding scheme to regularities that are only present at a lexical level. This places strong constraints on the kind of compression that is possible. Compression will only be possible by changing the mapping between phonological codes and chunks. This restricts us to chunking by verbal recoding or deletion.
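One way to see why the induced pairings stand out at the word level is to compute word-level association statistics directly. The sketch below uses a tiny invented set of study lists (the words and list structure are arbitrary) and estimates the pointwise mutual information of adjacent words; the experimentally induced pair shows a strong association, whereas the same dependency, expressed over phonemes, would be spread across many transitions and be far harder for a phonologically based store to exploit.

import math
from collections import Counter

# Toy study lists: the pair ('leafy', 'brick') always co-occurs; other words move around.
lists = [["leafy", "brick", "queen", "sofa"],
         ["queen", "leafy", "brick", "sofa"],
         ["sofa", "queen", "leafy", "brick"]]

words = [w for lst in lists for w in lst]          # list boundaries ignored for simplicity
bigrams = Counter(zip(words, words[1:]))
unigrams = Counter(words)

def pmi(w1, w2):
    # Pointwise mutual information of an adjacent word pair.
    p_pair = bigrams[(w1, w2)] / (len(words) - 1)
    p1, p2 = unigrams[w1] / len(words), unigrams[w2] / len(words)
    return math.log2(p_pair / (p1 * p2))

print(pmi("leafy", "brick"))   # high: the induced chunk
print(pmi("queen", "sofa"))    # lower: no consistent pairing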

Distinguishing between compression and redintegration
Recent studies by Norris et al. (2020) and Thalmann et al. (2019) have applied the same logic as Brady et al. (2009) to investigate compression in verbal STM. As in the experiments by Cowan and colleagues, both studies included experiments in which participants learned three-word chunks. In contrast to Cowan's studies, lists were the same length, but varied in the number of chunks they contained. Norris et al. (2020) found that with three-word chunks memory improved both for the items in the chunks and for singleton items that were not part of the larger chunks. This is exactly what would be expected on the basis of a simple chunking model in which each chunk takes up a single slot in memory: if some words in the list are combined to form chunks they will take up fewer slots and free up more slots to store singletons. At least for three-word chunks this appears to be a robust effect and is found with 2AFC recognition, probe recall, and serial recall, both with and without articulatory suppression. Thalmann et al. (2019) found the same results with three-letter acronyms as chunks. However, Norris et al. (2020) also performed two-alternative forced-choice (2AFC) and serial recall experiments with pre-learned chunks that were pairs of words rather than triples. Under these circumstances there was still a benefit of chunking in that pairs were recalled better than singletons, but recall of both pairs and singletons was unaffected by the number of other pairs in the list. This is the pattern that would be expected on the basis of redintegration. Redintegration will improve recall of the chunks themselves, but will not release any memory capacity that could be used to store additional items. Norris et al. (2020) suggested that people might be more likely to adopt a compression-based chunking procedure with larger chunks, where the benefits of active chunking might be more likely to outweigh the costs. Assume for a moment that people adopt the strategy suggested by Chen and Cowan (2009) and remember chunks by storing the first item of each chunk. Each chunk has to be identified as such and replaced by its first word, and at recall the first word of the chunk must be used to retrieve the remaining words of the chunk. To benefit from active chunking, all of this extra processing has to be balanced against the cost of remembering one extra item per chunk.
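Under a simple slot account, the benefit of this strategy is just arithmetic: each known chunk collapses into a single stored cue, freeing slots for singletons. The sketch below is a caricature for illustration only: the slot capacity, the word lists, and the single learned pair are all invented, and the first-word strategy is reduced to a dictionary lookup.

# Toy slot accounting for chunking by deletion (store only the first word of a learned pair).
CAPACITY = 4                                    # hypothetical number of slots
CHUNKS = {("brick", "hat"): "brick"}            # learned pair -> its retrieval cue in LTM

def encode(list_items):
    stored, i = [], 0
    while i < len(list_items):
        pair = tuple(list_items[i:i + 2])
        if pair in CHUNKS:                      # collapse a known chunk into one slot
            stored.append(CHUNKS[pair])
            i += 2
        else:
            stored.append(list_items[i])
            i += 1
    return stored[:CAPACITY]                    # anything beyond capacity is lost

print(encode(["brick", "hat", "dog", "pen", "cup"]))   # chunked: all five items recoverable
print(encode(["hat", "brick", "dog", "pen", "cup"]))   # no chunk: the final item is lost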
Several lines of evidence indicate that chunking does indeed have a cost. Glanzer and Fleishman (1967) had participants recall strings of nine binary digits presented simultaneously for 0.5 s. Some of their participants were trained over nine days to recode binary digits as octal digits, and others to use different strategies, including ones of their own choice. After nine days of training, the binary-to-octal conversion process might have been expected to become automatic. Based on the findings of Smith (cited in Miller, 1956) one might assume that those trained to use the octal recoding strategy would perform best. In fact, they performed worst. Glanzer and Fleishman suggested that participants may not have been able to apply the strategy efficiently enough to benefit from it. Perhaps this is not too surprising given that their participants would only have had 167 ms to perform each of the three binary-to-octal conversions required to recode a nine-digit binary number. Pollack and Johnson (1965) used a less demanding task, but still found limits on how efficiently subjects could employ recoding strategies. They trained participants for twenty-eight 1.5-h sessions to recode groups of four binary numbers as the decimal numbers 0-15. The rate of presentation was varied and they found that performance dropped from near perfect at a rate of 0.7 digits per second to only 60% of lists being correct at a rate of 2.8 digits per second. Both of these studies highlight the fact that the benefit of chunking has to be weighed against the cost of recoding material into chunks in the first place. Kleinberg and Kaufman (1971) tested memory for visual patterns constructed from displays of 13 illuminated dots in a 5 × 5 matrix. They found that memory was constant in terms of amount of information at fast presentation rates (less than one a second) but constant in chunks at slower rates. Effective use of chunking required time.
Huang and Awh (2018) used a probe paradigm similar to that of Brady et al. (2009), where the stimuli could either be four color pairs or four letter pairs. The pairs were either familiar or not. In the case of letters the familiar pairs formed words. They found a benefit of chunking, but this disappeared when participants had to respond under time pressure. This study indicates that there is a cost in decoding chunks as well as in encoding them. In the chunking model favored by Cowan et al., the use of chunks is assumed to be probabilistic rather than all-or-none. In a concept recognition task, Bradmetz and Mathy (2008) modelled response times as the time taken to decompress a compressed representation. Interestingly, they concluded that participants did not use optimal compression.
We suggest that the cost of chunking may be one factor that modulates the probability of forming and using chunks. The overall picture that emerges from studies of chunking and redintegration in STM is therefore that there is evidence for both, but chunking by recoding is more costly than redintegration. Information must be encoded into chunks and then those chunks must be decoded to recover the original message. Because of this extra cost, chunking happens some of the time but redintegration happens all of the time.

Chunking in other domains
Much of what we have said about chunking in verbal STM applies to other domains. De Groot's (1946) famous study of chess masters' memory for the arrangement of chess pieces first brought the concept of chunking into memory research. Similar results have been found for differences between experts and novices in solving physics problems (Larkin, McDermott, Simon, & Simon, 1980) and between expert and novice programmers in remembering computer code (Ye & Salvendy, 1994).
Chunking has also been invoked to explain better memory for regular than for irregular patterns in visual short-term memory (Bor, Duncan, Wiseman, & Owen, 2003; Brady et al., 2009). Some of these results can be attributed to recoding into a different representational vocabulary. For example, configurations of chess pieces could be recoded as 'knight fork' or 'Opera mate'. What is common to all of these examples is the availability of an alternative code in LTM that can map between the visual scene and higher-order labels. Furthermore, the input itself can readily be translated into a symbolic code. In chess, only the location and identity of each piece needs to be stored in the representational vocabulary of chess. There is no need to remember the orientation of the pieces, or their exact shape. What must be remembered is the standard label for the pieces (knight, rook etc.) and the square on the board they occupy. In contrast, in many visual STM tasks it is the lower-level perceptual details that have to be remembered: the exact physical location, orientation, color or configuration. For some arbitrary visual pattern there is no guarantee that this can be mapped onto a pre-existing representation in LTM. However, the representation of the input can potentially be compressed. To take a very simple example, a red square, a red triangle and a red circle can be more effectively compressed than a red square, a blue triangle and a green circle. In the former case, only one color has to be specified. Note that these principles apply regardless of whether statistical regularities can be coded as simple verbal expressions. Orhan and Jacobs (2013) have provided a formal account of how visual STM can take advantage of the statistical properties of a visual scene to encode it more economically and to improve memory performance (for a review see Orhan, Sims, Jacobs, & Knill, 2014). This is a form of data compression.
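The red-shapes example can be made concrete by counting the symbols needed to describe each display. The sketch below uses a deliberately crude, invented encoding scheme (state a color once if every item shares it, otherwise list every feature); it is only meant to illustrate how shared structure shortens the description.

# Crude description-length count for a display of colored shapes.
def describe(display):
    colors = {color for color, shape in display}
    if len(colors) == 1:                     # shared color: state it once, then the shapes
        return [colors.pop()] + [shape for _, shape in display]
    return [feature for item in display for feature in item]   # otherwise list every feature

same_color = [("red", "square"), ("red", "triangle"), ("red", "circle")]
mixed_color = [("red", "square"), ("blue", "triangle"), ("green", "circle")]
print(len(describe(same_color)), len(describe(mixed_color)))   # 4 symbols versus 6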

Discussion
According to Miller, chunking involves active recoding. In the simplest case this might involve nothing more than deleting redundant information, such as remembering only the first word of a multi-word chunk. Alternatively, it might involve replacing one representation in memory with a completely different one, as when recoding binary as octal. A very different view is that chunks exist only in LTM, and the contents of STM are not recoded. The benefit of having chunks comes about from their ability to support redintegration of degraded traces.
The simplest way to implement chunking as data compression would be to consider chunks to be functionally equivalent to pointers or references to representations elsewhere in memory (cf. Huang & Awh, 2018). This forces us to consider the amount of information that might be encoded in a pointer. Pointers themselves must necessarily take up memory capacity. Those that have the potential to address a large number of chunks will take up more capacity than pointers to a small set of chunks. A given amount of memory can therefore address a large number of items selected from a small set, or a small number of items selected from a larger set. Chunking can be seen as a means of using the pointers in an efficient way. For example, a pointer that could address any word in one's vocabulary would be used inefficiently if it only had to distinguish between the words one and zero. A pointer that could address the names of the digits 0-7 would make much better use of the available capacity. Recoding binary digits into octal will therefore improve recall.
Most demonstrations of exceptional memory performance have been achieved using relatively slow rates of presentation. For example, SF (Ericcson et al., 1980) received one digit per second. He explicitly recoded digit sequences into running times. For most of us, very few sequences of digits other than consecutive runs of increasing or decreasing digits have any great familiarity. We are generally poor at noticing even familiar sequences when they are embedded in longer sequences (75106661, 18191457, and 3117767 all contain a significant historical date beginning at position 3). Perhaps individuals such as SF have a much greater repository of familiar numbers (chunks) in LTM and become practiced in parsing the input into those chunks. In their study of the mnemonist Rajan, Ericsson et al. (2004) attributed his exceptional performance to extensive practice in memorizing phone numbers, dates, cricket scores and statistics, rather than to any increase in underlying memory capacity.
Chunking requires active processes of recoding and decoding and, as we have seen, comes at a cost. In contrast, when using redintegration, knowledge of chunks and statistical regularities in the input need only come into play at recall. Nevertheless, it may take extensive experience to internalize the necessary statistics. Botvinick and Bylsma (2005), who showed that serial recall was influenced by the constraints of an artificial grammar, had their participants practice for one hour a day over 15 consecutive days. Botvinick (2005) also used 15 sessions of training. On a shorter time-scale, Majerus et al. (2004) showed that participants' memory for nonwords improved following about 30 min of exposure to syllables generated by a simple phonotactic grammar (3000 CV syllables). Mathy and Feldman (2012) gave their participants only 100 lists and found that performance improved when lists contained runs of consecutive increasing or decreasing digits. However, in contrast to the more subtle manipulation used by Botvinick and Bylsma (2005), or the experimentally induced chunks used by Cowan and colleagues, runs of digits may tap into well-established representations. They may even be susceptible to high-level cognitive strategies analogous to the binary-to-octal recoding studied by Smith (cited in Miller, 1956).

Compression and resources
The accounts of chunking considered so far all assume that memory capacity is limited by the available number of discrete representations such as bits, items, or chunks. A fundamentally different view of memory is that capacity is determined by a limited resource (Alvarez & Cavanagh, 2004; Bays, 2014, 2015; Bays, Catalao, & Husain, 2009; Fougnie, Cormiea, Kanabar, & Alvarez, 2016; Just & Carpenter, 1992; van den Berg & Ma, 2017) which can be flexibly distributed across items in memory. As the number of items in memory increases, a common pool of resources will have to be distributed over a larger number of items (Sims et al., 2012). Resource-based models have generally been advocated to explain capacity limits on visual STM, but most connectionist models of verbal STM are also limited by the available neural resources (e.g. Grossberg, 1978; Haarmann & Usher, 2001; Page & Norris, 1998). However, in these models there is no flexibility in how these resources are allocated. Furthermore, resources are distributed over a fixed set of representations.
A simple formal way to view resources is as the capacity of the store measured in bits. That capacity can potentially be distributed over the items in memory to vary the precision with which they are coded. In the brain this could perhaps be implemented in terms of a resource-limited pool of spiking neurons (Bays, 2014, 2015). Given that memory has limited capacity, it makes sense to distribute that capacity over items so as to minimize the distortion in the information that can be retrieved. More specifically, it makes sense to minimize the cost of errors. This is the basis of rate-distortion theory (Berger, 1971), which has been successfully applied to model data from visual STM tasks (Sims, 2015, 2016; Sims et al., 2012).
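A minimal sketch of this idea, assuming a fixed overall capacity shared equally across items and the standard rate-distortion function for a Gaussian source under squared error (the capacity figure and source variance below are arbitrary choices for illustration):

# Rate-distortion view of a shared resource: for a Gaussian source encoded with R bits,
# the minimum achievable squared-error distortion is D = variance * 2**(-2 * R).
CAPACITY_BITS = 12.0      # hypothetical total capacity of the store
SOURCE_VARIANCE = 1.0     # variance of the feature being remembered

def expected_distortion(n_items):
    rate_per_item = CAPACITY_BITS / n_items    # capacity spread evenly over the list
    return SOURCE_VARIANCE * 2 ** (-2 * rate_per_item)

for n in (1, 2, 4, 8):
    print(f"{n} items: distortion {expected_distortion(n):.4f}")

Precision falls as the same capacity is spread over more items; distributing the capacity over chunks rather than items would simply change the divisor.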
There continues to be debate as to whether there really is flexibility in resource allocation in visual STM (Chekaf et al., 2016; Fukuda, Awh, & Vogel, 2010; Zhang & Luck, 2011). Joseph et al. (2015) have presented evidence consistent with a resource account of STM for speech. Precision of recall, as measured by participants' ability to adjust a probe stimulus to match one of the items to be remembered, decreased with memory load. However, this result is exactly what would be expected on the basis of any model that assumes that there is decay in STM such that representations become degraded over time: information is simply lost.
One way that the presence of chunks might influence memory in a resource model would be for resources to be distributed equally across chunks rather than items. This would make different predictions from a simple redintegration model, as the presence of chunks in a list would now improve memory throughout the list. As the number of chunks in a list decreases, all chunks can be coded with greater precision. In an ideal communication system, resources should also be distributed so as to reflect the statistical properties of the message and to ensure that less predictable or less easily recoverable parts of the message are encoded with greater precision. This is effectively a form of lossy compression where resources are distributed across list items so as to achieve a uniform improvement in memory performance or, in rate-distortion terms, the lowest cost. This is an example of how the efficiency of a communication channel can be improved by adapting to the statistics of the input (Shannon & Weaver, 1949).
Although others have treated chunking as a form of data compression (e.g. Brady et al., 2009; Chekaf et al., 2016; Mathy & Feldman, 2012; Norris et al., 2020; Thalmann et al., 2019), only the papers by Norris et al. (2020) and Thalmann et al. (2019) have considered the relationship between compression and redintegration. The present paper has extended this work by providing an analysis of the information-theoretic constraints on using and developing compressed representations in verbal STM. Most importantly, we have shown that the potential for chunking will depend critically on the granularity of the available representations. Consequently, capacity will sometimes appear to be limited by the number of chunks that can be stored and sometimes by the number of items, but the underlying limit will always be a function of the amount of information that can be stored.

Conclusion
Miller framed his original discussion in the context of information theory. He suggested that memory capacity was determined not by limitations on information capacity but by limits on the number of chunks that can be stored. This immediately raises the question of what a chunk is, and what it is about a chunk that enables more items to be stored. The simplest view is that a chunk in STM is simply a verbal label which can be used to retrieve the contents of the chunk from LTM. We suggest that the underlying limitation on STM capacity is information after all, and that there are two factors that determine this capacity. The first is the nature of the representational vocabulary of the store (bits, phonemes, words). The vocabulary will be determined by the nature of the representations that have been developed in the process of learning about our native language. In the case of verbal STM that might take the form of a phonological code. Our ability to store those representations will depend on the underlying information capacity. The second is the efficiency with which we can recode or chunk the input into that vocabulary. Both are determined by the form of data compression involved.
Although we normally think of chunking as a conscious strategy, it might not be any different in principle from the automatic data compression that takes place in perception. For example, in speech perception the raw acoustic waveform is compressed into a representation that, in Barlow's (2001) terms, makes hidden redundancy manifest. That compressed code is then an ideal basis for representing speech information in a phonological short-term store with a limited information capacity. The need to both recognize and store phonological information efficiently therefore provides a strong pressure to develop a specialized phonological store and not to rely on some general-purpose amodal storage system. Phonological representations are already chunks that form an efficient compressed representation of a highly redundant acoustic waveform. These representations can compress any message encoded in a listener's native language. However, once these codes have been developed, further compression will be constrained by the nature of these representations. When there is redundancy in the input it may be possible to use chunking to perform an additional level of compression, as when recoding binary digits into octal. This second level of compression will often be time-consuming and effortful to apply. A further way of taking advantage of the presence of familiar chunks is to employ redintegration. Redintegration does not have the same costly overheads as it relies on the same processes that drive perception in general. It has the potential to capitalize on any accessible statistical knowledge of the environment.
Under most circumstances, chunking by compression and by redintegration will produce the same result: performance will improve when material can be represented in terms of fewer and larger chunks. However, the two accounts can be teased apart by appropriately designed experiments. In the case of visual STM, Brady et al. (2009) demonstrated that the presence of a chunk in a display can free up capacity and lead to improved memory for other items in the display. Norris et al. (2020) and Thalmann et al. (2019) found similar effects in verbal STM. On the other hand, the studies by Norris et al. (2020) and Botvinick and Bylsma (2005) provide evidence for redintegration. The data reviewed here suggest that recall is influenced by both compression and redintegration. Performance is dominated by redintegration when the cost of recoding outweighs the benefits, and by compression when the benefits are greater, as when recoding binary to octal in the absence of time pressure.
Our proposal is that any apparent chunk limit comes about from limits on how efficiently we can utilize the existing codes in a limited information capacity store. Chunking will be effective because it makes better use of the available capacity to store information, and not because the capacity itself is determined by the number of chunks (cf. Brady et al., 2009). In the case of verbal STM, the store will already have developed an efficient code for the representation of one's native spoken language. Further compression will require that code to be modified, as perhaps when learning another language. Alternatively, a new code can be constructed using the vocabulary provided by the existing code (phonemes, or perhaps even words); the paradigmatic example of this is coding binary digits as octal digits. Converting a message to and from this code may take time or effort. The benefit of chunking is therefore a trade-off between the benefit of generating a more efficient representation and the cost of the recoding process.
This account considers capacity to be determined by an interaction between the nature of the representational substrate of memory (samples, bits, phonemes, words) and the efficiency with which the input can be recoded in terms of those representations. With efficient use of a fine-grained level of representation (e.g. bits, or a continuous resource), capacity will appear to be determined by information. With less efficient use of a coarser-grained level of representation, capacity will appear to be determined by the number of discrete chunks that can be stored. In a memory system that can store only words, capacity for chunks will be the same as capacity for words. In the absence of any recoding, capacity will be determined by the number of items that can be stored. In all cases, capacity is a function of how effectively information in memory can be compressed.

Declaration of competing interest
None.