A CODE model bridging crowding in sparse and dense displays

Visual crowding is arguably the strongest limitation imposed on extrafoveal vision, and is a relatively well-understood phenomenon. However, most investigations and theories are based on sparse displays consisting of a target and at most a handful of flanker objects. Recent findings suggest that the laws thought to govern crowding may not hold for densely cluttered displays, and that grouping and nearest neighbour effects may be more important. Here we present a computational model that accounts for crowding effects in both sparse and dense displays. The model is an adaptation and extension of an earlier model that has previously successfully accounted for spatial clustering, numerosity and object-based attention phenomena. Our model combines grouping by proximity and similarity with a nearest neighbour rule, and defines crowding as the extent to which target and flankers fail to segment. We show that when the model is optimized for explaining crowding phenomena in classic, sparse displays, it also does a good job in capturing novel crowding patterns in dense displays, in both existing and new data sets. The model thus ties together different principles governing crowding, specifically Bouma's law, grouping, and nearest neighbour similarity effects.


Introduction
Compared to central vision, peripheral vision is severely limited. Arguably the strongest limitation is imposed by what is known as visual crowding. Objects that can be perfectly identified in peripheral vision when presented in isolation can become unrecognizable when closely flanked by surrounding objects (Korte, 1923; for reviews see Herzog & Manassi, 2015; Whitney & Levi, 2011; Levi, 2008). How close is described by Bouma's law, which states that the critical distance within which discrimination of a target stimulus suffers from surrounding stimuli corresponds to half the target's eccentricity (Bouma, 1970).
While many studies have replicated Bouma's findings (see e.g., Toet & Levi, 1992; Van den Berg, Roerdink, & Cornelissen, 2007), it has also become clear that distance is not the only factor. For one, the critical spacing depends on the similarity between target and flankers, with crowding being stronger the greater the similarity (Andriessen & Bouma, 1976; Chung, Levi, & Legge, 2001; Felisberti, Solomon, & Morgan, 2005a; Greenwood & Parsons, 2020; Levi, Toet, Tripathy, & Kooi, 1994; see Pelli, Palomares, & Majaj, 2004, for an overview). This suggests that crowding is at least partly caused by perceptual grouping between target and distractors. The role of perceptual grouping is further corroborated by studies showing that target identification suffers more when target and flankers group into a common global structure, while it benefits when the flankers themselves group into a structure that then more easily segments from the target (Herzog & Manassi, 2015; Livne & Sagi, 2007; Malania, Herzog, & Westheimer, 2007; Manassi, Sayim, & Herzog, 2012, 2013). In fact, merely increasing the number of flankers can already aid target identification under conditions which promote grouping between the flankers into a larger structure, which in turn enables perceptual segmentation from the target (Banks, Larson, & Prinzmetal, 1979; Põder, 2006).
The importance of similarity over distance also becomes clear when, instead of the typical sparse displays used in the majority of crowding studies so far (see Fig. 1A for an example), relatively densely cluttered and heterogeneously arranged stimulus arrays are used (Van der Burg, Olivers, & Cass, 2017; see Fig. 1C for an example). In the Van der Burg et al. (2017) study, participants were instructed to report the orientation of a near-vertical target (i.e., slightly tilted to the left or right) presented in peripheral vision, and surrounded by a heterogeneous set of a total of 284 distractors that were oriented either horizontally or vertically. Given the earlier-mentioned similarity effects, we expected vertical distractors to interfere more with target discrimination than horizontal distractors, and we sought to determine the spatial range within which they would do so. Given that the large number of possible stimulus configurations (2^284) made it impossible to evaluate each potential configuration using a standard factorial design, we applied a technique based on genetic algorithms (Holland, 1975; see also Kong, Alais, & Van der Burg, 2016b; Kong, Alais, & Van der Burg, 2016a; Van der Burg, Cass, Theeuwes, & Alais, 2015). The main idea behind this optimization technique is that displays evolve, such that flanker arrays that yield little crowding and thus good performance survive, while displays that yield strong crowding become extinct (i.e., a survival of the fittest principle). The results illustrated in the left panel of Fig. 1D showed that performance was driven virtually solely by the distractors directly abutting the target, within just 1° of visual angle. That is, throughout the evolution, similar (vertical) distractors in these neighbouring positions were replaced with dissimilar (horizontal) ones, while beyond this range there was no reliably measurable change in relative distractor proportions. It is worth noting that this 1° was well within the 3° that would be predicted on the basis of Bouma's law given the target's eccentricity (6°), as simulation models using the same genetic algorithm also confirmed (see right panel of Fig. 1D). Moreover, under standard sparse display conditions (Fig. 1A), the same line segments empirically resulted in a critical distance commensurate with Bouma's law (Fig. 1B). We concluded that Bouma's law does not necessarily generalize to dense displays. Instead, in such situations crowding appears solely determined by the similarity of the nearest neighbours.
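The comparison between the 1° range observed in dense displays and the 3° predicted by Bouma's law can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the `bouma_factor` argument are our own choices, and the factor of 0.5 simply encodes "half the target's eccentricity" as stated above.

```python
def bouma_critical_distance(eccentricity_deg, bouma_factor=0.5):
    """Critical spacing (deg) below which flankers are expected to crowd.

    bouma_factor=0.5 encodes "half the target's eccentricity"; the argument
    name is an illustrative choice, not a term from the study.
    """
    if eccentricity_deg < 0:
        raise ValueError("eccentricity must be non-negative")
    return bouma_factor * eccentricity_deg


target_eccentricity = 6.0  # deg, as in Van der Burg et al. (2017)
observed_range = 1.0       # deg, range found with the dense displays

predicted = bouma_critical_distance(target_eccentricity)
print(f"Bouma prediction: {predicted:.1f} deg; "
      f"dense-display range: {observed_range:.1f} deg")
```

The observed range is thus a third of the distance that a purely distance-based rule would predict for this eccentricity.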
The important question is then whether such dense arrays represent a fundamentally different situation where the principles governing sparse displays no longer hold, or whether one and the same set of mechanisms can explain both the nearest neighbour effects as observed in dense displays, and the critical distance effects as observed for sparse displays. In the present study we present a model that attempts to bridge the explanatory gap between crowding phenomena associated with sparse and dense displays. The model is based on two simple grouping principles, namely grouping by proximity and grouping by similarity, both basic mechanisms that, ever since the Gestalt psychologists, have been known to govern perception (see e.g. Wagemans, 2018, for a recent overview). We show that when the model is optimized for crowding in standard, sparse displays, it also does a good job accounting for crowding in dense displays. The model thus ties together different principles governing crowding, specifically Bouma's law, grouping, and nearest neighbour similarity effects.

The model
Our model derives from the CODE algorithm originally introduced by van Oeffelen and Vos (1982a) and adapted by Compton and Logan (1993; Logan, 1996). The CODE algorithm is one of the surprisingly few formal models of perceptual grouping by proximity. The algorithm specifies how objects cluster together perceptually on the basis of their proximity, and how such clusters are organised hierarchically. Apart from grouping by proximity, the model has been successfully applied to numerosity judgments (Allik & Tuulmets, 1991; Vos, Van Oeffelen, Tibosch, & Allik, 1988), good continuation (Van Oeffelen, Smits, & Vos, 1985; Vos & Helsper, 1991), and to object-based attention (Logan, 1996). We implemented a simplified version of the original model, and added a similarity parameter to allow grouping strength to vary as a function of the resemblance between elements. We will discuss other small differences to the original implementation in the General Discussion.
Fig. 1. Displays and results from Van der Burg, Olivers and Cass (2017). A) Illustration of a sparse display. Participants were instructed to report the orientation of the target line (tilted to the left or right from vertical), which was always presented at the same location at 6° eccentricity (left or right from fixation), as indicated by a small red dot. The white dot indicates the central fixation point. The target line was surrounded by either four vertical or four horizontal distractors, and the target-distractor distance was manipulated. B) The results of this sparse display task, which are typical for crowding. Interference was largest when the target was surrounded by vertical distractors, and followed Bouma's law (with the predicted critical distance indicated by the vertical dashed line). C) Illustration of the dense displays used in Van der Burg et al. (2017). Here the task remained the same, but the target was surrounded by 284 distractors of randomly selected vertical and horizontal orientation. D) A genetic algorithm was applied to determine the spatial range within which distractors affected target identification. The black areas show where similar, vertical distractors disappeared (and were replaced by horizontal distractors). The left panel reflects the behavioural results, the right panel a simulation of an ideal Bouma-based observer, using the same genetic algorithm. A clear difference in the spatial extent of the crowding between the empirical data and the Bouma model can be observed. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 2 illustrates the basic principles for one-dimensional arrays of elements, but the same principles hold for two-dimensional arrays. The model assumes that individual elements are represented not by a point in space, but by a certain activity distribution (illustrated as black dashed lines in Fig. 2), and that this activity distribution depends on the proximity and feature similarity of neighbouring elements. Specifically, the strength with which one item exerts an influence on the next item with a similar feature diminishes with increasing distance, at a decelerating rate. This is described as a Laplace distribution in Equation 1 (following Compton & Logan, 1993; Logan, 1996; Shepard, 1987), where f(x) expresses the grouping strength that an individual item with a certain feature exerts on its neighbours, µ represents the location of a given element in the display, |x − µ| represents the distance relative to this location, and σ represents the width of the distribution and thus reflects the effect of proximity. Parameter w is an additional weight parameter that reflects effects of feature similarity, to which we will return later. Moreover, in the model, σ scales linearly for different target eccentricities, consistent with the linear relationship between eccentricity and the critical distance posited by Bouma's law (Bouma, 1970; Pelli & Tillman, 2008), as well as with the linear increase in receptive field size in visual cortex (Wandell & Winawer, 2015). The individual item activity distribution is then normalized, such that the maximum value for the total feature activity for each individual element in the display is 1.
Spatial grouping and segmentation. Grouped elements create what has been referred to as a CODE surface, which is illustrated by the solid red lines in Fig. 2. We first describe the spatial component to grouping, followed by similarity grouping. The activity distribution of a spatial group of items, g(x), is simply the sum of the normalized individual element activity, as is described in Equation 2, $g(x) = \sum_{i=1}^{N} f_i(x)$, where i indexes each of the N items. This creates the different activity landscapes that are illustrated by the red solid lines in Fig. 2.
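The construction of the activity landscape described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation (CODE_model.py): we assume that the similarity weight w multiplies the width σ, that normalization means each item's peak equals 1, and the value σ = 0.5 is a hypothetical placeholder rather than a fitted estimate. The function names are ours.

```python
import numpy as np


def item_activity(x, mu, sigma, w=1.0):
    """Laplace-shaped activity of one item, scaled so that its peak equals 1.

    Assumption: the similarity weight w multiplies the width sigma.
    """
    x = np.asarray(x, dtype=float)
    return np.exp(-np.abs(x - mu) / (sigma * w))


def code_surface(x, positions, sigma, w=1.0):
    """Summed activity g(x) over all items (the CODE surface)."""
    x = np.asarray(x, dtype=float)
    return np.sum([item_activity(x, mu, sigma, w) for mu in positions], axis=0)


x = np.linspace(-5.0, 5.0, 2001)   # positions in degrees of visual angle
sigma = 0.5                        # hypothetical width, not a fitted value
close = code_surface(x, [-0.6, 0.0, 0.6], sigma)
far = code_surface(x, [-2.4, 0.0, 2.4], sigma)
print(f"peak activity, close spacing: {close.max():.2f}")
print(f"peak activity, far spacing:   {far.max():.2f}")
```

Closely spaced items reinforce each other and raise the whole landscape, whereas widely spaced items leave three nearly independent peaks.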
Consider then first Fig. 2A, in which the items are close together, and thus strengthen each other considerably as a group, resulting in an activity landscape with high overall activity. Furthermore, this activity landscape creates a hierarchy of representation: There is a large common area under the curve, shared by all three items, reflecting a strong group, or common surface. But there is also some unique additional peak activity for each of the individual items. Thus, the activity landscape reflects the perceptual experience of dealing with individual elements, as well as a cluster of elements. Critically, the model assumes that successful identification of any individual element, including the target, depends on how well it segments from the group. In the model, the strength of segmentation is set as the activity that is uniquely associated with the individual elements, relative to the activity associated with the group as a whole. The model searches for the inflection points (saddle points or local minima) in the landscape that are nearest to the target (cf. Logan, 1996). These are the points where the target breaks from the group, and they are illustrated by the black dots on the red curves in Fig. 2. The model sets a threshold at these points. The total area under this threshold then reflects the activity that is shared between target and flankers (in other words, the group activity), while the total area above the threshold represents unique activity not shared between target and flankers. The stronger the unique activity of items relative to the shared activity, the easier it is to segment them and thus to identify the target.
In the model, the probability of items successfully segmenting from the group is then simply the division of the area above the threshold by the total area under the curve, as per Equation 3.
Here, p_seg|T is the probability of segmentation given a certain threshold T, where ∫f_tgt(x)dx represents the total activity above threshold, and ∫g(x)dx the total activity below the curve (see Equations 1 and 2 for f(x) and g(x), respectively). Two further conditions hold. First, the target group is bound by a minimum activity level of 0.05. Any regions beyond a 0.05 inflection point are not included in Equation 3. This cut-off serves to exclude clusters that clearly segment from, and are therefore not part of, the target's group. Second, as the display may be asymmetrical, the model computes the threshold separately for the left side and the right side (T_left and T_right in Fig. 2A), and accordingly computes p_seg|T values separately for the left and right. Where they differ, the model takes the minimum (in other words, the strongest of the left or right groupings determines the ultimate segmentation strength). Target discrimination performance is then the direct consequence of the segmentation probability p_seg|T, plus a correction for guessing (with the guess rate being 0.5 in the two-alternative target discrimination tasks considered here), as per Equation 4. Here, p_correct reflects the probability of correctly identifying the target orientation, while p_target_alone reflects the probability of identifying the target in isolation, without flankers. The latter is taken into account to set the maximum performance (i.e., ceiling level), and is derived empirically from the experiments we will present later. The effect of increasing the element spacing can then be observed by comparing Fig. 2A to Fig. 2B. With increasing distance, the individual features do not strengthen each other as much, and hence the overall group activity becomes weaker, as is reflected in the lower overall cluster of activity, and thus the reduced area under the inflection points (i.e., the threshold). Importantly, as a consequence, the balance between unique item activity and common group activity has now substantially swayed in favour of the first, and hence the likelihood of successful item segmentation increases considerably, allowing for better target discrimination.
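The segmentation and guessing steps can be sketched as follows. This is one possible reading of the verbal description, not the published code: we assume that the threshold is the height of the nearest local minimum on each side, that the target's group on each side extends until activity first drops below 0.05, and that accuracy rescales the segmentation probability between the guess rate and the target-alone ceiling. The values σ = 0.5 and p_target_alone = 0.95 are hypothetical, and all function names are ours.

```python
import numpy as np


def surface(x, positions, sigma):
    """Summed, peak-normalised Laplace activity of all items."""
    return np.sum([np.exp(-np.abs(x - mu) / sigma) for mu in positions], axis=0)


def side_segmentation(g, target_idx, step, floor=0.05):
    """Segmentation probability on one side (step=-1: left, step=+1: right)."""
    # Walk outward to the nearest local minimum; its height is the threshold T.
    i = target_idx + step
    while 0 < i < len(g) - 1 and not (g[i] <= g[i - 1] and g[i] <= g[i + 1]):
        i += step
    threshold = g[i]
    # Assumption: the group extends outward until activity falls below the floor.
    j = target_idx
    while 0 <= j + step < len(g) and g[j + step] >= floor:
        j += step
    lo, hi = sorted((target_idx, j))
    region = g[lo:hi + 1]
    above = np.clip(region - threshold, 0.0, None).sum()
    return above / region.sum()  # the grid step cancels in the ratio


def p_correct(x, positions, sigma, p_target_alone=0.95, guess=0.5):
    """Return (accuracy, p_seg); the accuracy formula is an assumed form."""
    g = surface(x, positions, sigma)
    t = int(np.argmin(np.abs(x)))  # the target sits at 0 deg
    p_seg = min(side_segmentation(g, t, -1), side_segmentation(g, t, +1))
    return guess + (p_target_alone - guess) * p_seg, p_seg


x = np.linspace(-10.0, 10.0, 4001)
for spacing in (0.6, 1.2, 2.4):
    acc, seg = p_correct(x, [-spacing, 0.0, spacing], sigma=0.5)
    print(f"spacing {spacing:.1f} deg: p_seg = {seg:.2f}, p_correct = {acc:.2f}")
```

Under these assumptions, widening the spacing lowers the threshold, shifts the balance toward unique item activity, and raises predicted accuracy toward the target-alone ceiling.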
Similarity. Now let us turn to similarity. Whereas panels A) and B) in Fig. 2 reflect the situation in which target and flankers are very similar in orientation, panels C) and D) show the scenario for the same spatial distances, but now for dissimilar elements. The model simply assumes that dissimilar elements reinforce each other less strongly, as is implemented by a reduced width of the individual Laplace curves by a factor w, which was here set at 0.1 (see Methods for further description of parameter settings). As can be seen in panels C) and D), this has a strong effect on segmentation, as the area under the curves is now almost entirely linked to just the individual items. The similarity, and therefore the grouping, is not set to zero though, as the line segments still share the fact that they are line segments, share a common onset, share a similar colour, etc. When sufficiently close together, as in panel C), there is still some residual grouping which will affect segmentation performance. However, it is clear that the grouping effect is overall much weaker for dissimilar items, and rapidly disappears with increasing distance. The model has four similarity parameters. Two of these are fixed and reflect the similarity between the distractors, with w_maxsim representing maximum similarity, namely between two identical distractors, and with w_minsim representing maximum dissimilarity, here between horizontal and vertical distractors. This leaves two freely varying parameters which reflect the similarity between target and distractors, with w_sim representing the similarity between the target and similar distractors, and w_dissim representing the similarity between the target and dissimilar distractors.
Nearest neighbour rule. One final important principle of the model is that the feature similarity is only determined locally. That is, the similarity value (w) of an item is only set relative to its immediate neighbours. So in panels C) and D) of Fig. 2 for example, both flankers know that they are dissimilar from the central near-vertical target, and hence their similarity parameter ends up low, leading to a segmented representation. However, the left and right horizontal flankers do not know that they are identical to each other. Note that this does not mean that items do not at all affect each other beyond the nearest neighbour. The spatial distribution of feature activities is typically wide enough to go beyond the directly neighbouring position, especially for similar items. What it means is that the strength of the lateral spread (i.e. the similarity parameter) is only determined by the nearest neighbour. This nearest neighbour rule then has the additional effect that for multiple consecutive identical or similar items, grouping is mutually reinforced into a larger structure as their wide spatial distributions carry beyond the neighbouring item, while for dissimilar items any longer range grouping is effectively diminished by dissimilar intervening items. In Fig. 2 this is illustrated by the various configurations shown in panels E through H. Here it is also worth noting that the mutual similarity between a certain item and each of its left- or right-flanking neighbours may differ, and thus the Laplace distribution need not be symmetrical. For example, a near-vertical target might be flanked by a similar (vertical) element on the left, and a dissimilar (horizontal) item on the right. Therefore, for each item the leftward and rightward relationships were modelled separately (except for the outermost elements in the array, which were modelled symmetrically as they were the last item in the array).
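The nearest neighbour rule and the resulting asymmetric distributions can be sketched as below. This is an illustrative reading rather than the published code: the item labels 't', 'v' and 'h' stand for target, vertical and horizontal, the helper names are ours, and the values w_sim = 0.8 and w_dissim = 0.2 are hypothetical stand-ins for the free parameters. Only the fixed values 1 and 0.1 for w_maxsim and w_minsim come from the text.

```python
import numpy as np

W_MAXSIM, W_MINSIM = 1.0, 0.1  # fixed distractor-distractor similarities


def pair_similarity(a, b, w_sim, w_dissim):
    """Similarity weight between two neighbouring items ('t', 'v' or 'h')."""
    if "t" in (a, b):
        other = b if a == "t" else a
        return w_sim if other == "v" else w_dissim
    return W_MAXSIM if a == b else W_MINSIM


def neighbour_weights(display, w_sim, w_dissim):
    """Return one (w_left, w_right) pair per item, ordered by position."""
    labels = [label for _, label in sorted(display)]
    last = len(labels) - 1
    weights = []
    for i, label in enumerate(labels):
        left = pair_similarity(label, labels[i - 1], w_sim, w_dissim) if i > 0 else None
        right = pair_similarity(label, labels[i + 1], w_sim, w_dissim) if i < last else None
        # Outermost items have a single neighbour and are treated symmetrically.
        if left is None:
            left = right if right is not None else W_MAXSIM
        if right is None:
            right = left
        weights.append((left, right))
    return weights


def asymmetric_activity(x, mu, sigma, w_left, w_right):
    """Laplace-shaped activity whose width differs left and right of the item."""
    x = np.asarray(x, dtype=float)
    width = np.where(x < mu, sigma * w_left, sigma * w_right)
    return np.exp(-np.abs(x - mu) / width)


# Target flanked by a vertical item on the left and a horizontal item on the right.
display = [[-0.6, "v"], [0, "t"], [0.6, "h"]]
weights = neighbour_weights(display, w_sim=0.8, w_dissim=0.2)
print(weights)
```

In this example the target's activity spreads further toward the similar left flanker than toward the dissimilar right flanker, while each flanker's weight depends only on the target, because the two flankers are not direct neighbours of each other.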
Thus, with the σ and w parameters the model incorporates two basic principles that govern crowding as laid out in the Introduction, namely proximity and similarity. While σ shows a gradient that can account for critical distance effects, w reflects nearest neighbour interactions in terms of similarity. In the next sections we will show that with just two fixed (w_minsim, w_maxsim) and three free parameters (σ, w_sim, w_dissim), the model qualitatively captures the essence of a number of data patterns derived from sparse as well as dense displays. In the subsequent simulations, the main strategy for each dataset was to first fit the model to data from a standard, sparse display crowding experiment, and then to assess whether the same model (while retaining the parameter estimates) also captures the pattern of findings for dense displays. We emphasize that neither our goal nor our expectation was for the model to optimally fit all data patterns, as our model is very simple and therefore likely to miss out on the more complex intricacies of the visual system, as well as on numerous possible observer, stimulus and task variations. Rather, the model should be regarded as a proof of principle, demonstrating that common grouping mechanisms can explain crowding in both sparse and dense displays.

Study 1: Validating the model with sparse and dense displays
The main purpose of Study 1 was to assess whether the model captures the essence of crowding in the traditional sparse displays as well as in dense displays. The primary properties that the model should capture for sparse displays are a) worse performance for larger eccentricities; b) an increased critical crowding distance with larger eccentricity (following Bouma's law); and c) reduced crowding with reduced target-distractor similarity. The model should capture these distance and similarity effects through the σ and w parameters, as was tested in Experiment 1a. In the empirical version of this experiment (which is illustrated in Fig. 1A), a nearly vertical target (tilted 5 degrees to the left or right) was presented on the right or left from fixation at one of two eccentricities (4.7° and 7.1° visual angle) together with four flanking distractor elements. The distractors were either similar (vertical) or dissimilar (horizontal) to the target orientation, and distance to the target was systematically manipulated between 0.6° and 4.1° visual angle. The observers' task was to indicate the tilt of the target. Following earlier work, we expected performance to deteriorate with increasing eccentricity as well as with increasing target-distractor similarity. The goal was to test whether the model could provide a reasonable fit of the pattern of results, and to then deliver the parameter estimates to the dense display simulations of Experiment 1b.
For dense displays, the model should then capture the fact that items beyond those immediately flanking the target have little influence on performance, using the exact same parameter estimates. This was tested in Experiment 1b. In addition, the model makes a novel prediction regarding the effect of eccentricity. In contrast to sparse displays, in dense displays eccentricity should have less of an effect on the critical distance. This is because the similarity between items is only determined locally, between direct neighbours. As a consequence, the model predicts that the target segments reasonably well from surrounding items as long as the immediate neighbours are dissimilar. To this end, in Experiment 1b every position in the array was filled (as is illustrated in Fig. 1C). The task of identifying the slightly left- or right-tilted target element, as presented at a predictable location, remained the same. However, the method to determine which item positions (and thus which distance) were crucial to crowding in this type of display was different. As the spacing between all elements remained constant, the similarity of each individual item could vary instead. Due to the many possible combinations this creates, we switched to the genetic algorithm method introduced by Van der Burg et al. (2015; see Method). The algorithm starts from displays randomly filled with similar (vertical) and dissimilar (horizontal) distractors. Those displays that generate relatively weak crowding are allowed to mate and create offspring displays, while those that generate strong crowding go extinct. After a number of generations, we can then compare the surviving displays to the starting situation and assess which element positions mattered. Here we expected to replicate the main result of Van der Burg et al. (2017), who found that virtually only the elements most proximal to the target were affected. That is, the evolution had caused them to converge to orientations most dissimilar to the target, while beyond the nearest neighbours there was little to no change in the statistics of the displays. The model should then replicate this nearest-neighbour effect. Importantly, in an extension of the original experiment, we also varied the target's eccentricity between 4.7° and 7.1° visual angle (the same values as in Experiment 1a). The model makes a unique prediction: While performance overall should suffer with eccentricity, the critical distance should not scale in the same way as for sparse displays. This is because in the model similarity is determined locally, between an item and its nearest neighbour, and in the dense displays as used here the nearest neighbours always occupy the positions closest to the target. Consequently, for targets further into the periphery, the similarity of the nearest neighbour is what counts in dense displays.
It deserves pointing out here that it is explicitly not the case that the model applies different sets of rules to sparse and dense displays. Exactly the same rules apply. So, sparse displays are also subject to the nearest neighbour rule, but here the flanking items are by definition the nearest neighbours, in which case only distance and similarity count.
Conversely, the distance parameter σ has exactly the same value for dense displays as for sparse displays (whose parameter estimates the dense display estimate is based upon). But in dense displays the effect of more distal elements on the target is broken by dissimilar items in between, as similarity is determined by the nearest neighbour.

Method
Participants. Twenty participants from the Vrije Universiteit Amsterdam participated (11 females and 9 males; mean age: 23.5 years, ranging between 20 and 30 years) in both Experiments 1a and 1b, in that order. Nineteen participants were naïve as to the purpose of the experiment and were compensated with credits or money (€8 per hour). One participant was the student who collected the data. After the experiment was explained to them, informed consent was obtained. The research protocol for the current and all subsequent experiments was approved by the Scientific and Ethics Review Board of the Faculty of Movement and Behavioural Science at the Vrije Universiteit Amsterdam.
Apparatus and stimuli. Participants carried out the experiment in a dimly lit cubicle and sat at a distance of approximately 70 cm from the LCD monitor (120 Hz refresh rate, 38.70° width and 24.28° height). The experiment was programmed in Python using OpenSesame software (Mathot, Schreij, & Theeuwes, 2012). A display consisted of a dark grey (luminance = 8 cd/m²) background with a black central fixation dot (0.1°; luminance < 0.5 cd/m²) and two red dots (0.07°, luminance = 23 cd/m²) indicating the two possible target locations. The red dots were present to avoid spatial confusion between the target and distractors. The target and distractor elements were all white line segments (0.43° length, 0.09° width, luminance = 84 cd/m²). The target was oriented either 5° anticlockwise or clockwise from vertical, and was presented left or right from fixation. The target eccentricity was either 7.1° or 4.7°.
In Experiment 1a, there were four distractors, presented left, right, above and below the target. The target-distractor distance was one of 0.6°, 1.2°, 1.8°, 2.4°, 2.9°, 3.5° or 4.1°. The distractors were either all similar to the target orientation (i.e., vertical), or all dissimilar (i.e., horizontal). Throughout the experiment the background was kept grey (luminance = 8 cd/m²). In Experiment 1b, the entire grid was filled with distractors (apart from the target), which could either be similar or dissimilar depending on how the displays evolved. For the first generation, 20 different displays were created by randomly assigning a horizontal or vertical orientation to each of the 512 distractors that formed a grid of 27 × 19 (width: 16.2°; height: 11.4°; distance between elements: 0.6°). Including the target this made 513 elements. As in Van der Burg et al. (2017), the proportion of vertically oriented distractors in the first generation was tailored for each participant so that the accuracy score for the first population approached 70 % correct (to avoid a ceiling/floor effect). This proportion was determined in a pre-experimental session, in which we manipulated the proportion of verticals compared to horizontals. On average, the overall proportion of vertical distractors in the first generation was 36 %.
Design and procedure. In Experiment 1a, a trial began with the presentation of the fixation dot and the two red dots (to indicate the target locations) for a duration of 500 ms. Subsequently, the target and (if present) distractors were shown for 150 ms on either the left or right side. After this period, the fixation dot and the two red dots remained on the screen. Participants were asked to report the orientation of the target by pressing the z- or m-key when the target was anticlockwise or clockwise from vertical, respectively. Target orientation, distractor orientation, stimulus position and target-distractor separation were balanced and presented in random order within blocks. The eccentricity was either 7.1° or 4.7°. Participants performed the task in 14 experimental blocks of 64 trials each (2 distractor orientations × 2 display positions × 7 target-distractor distances × 2 target orientations + 8 target alone trials). Half of the participants started with eccentricity 7.1° (blocks 1-7) followed by eccentricity 4.7° (blocks 8-14), and the remaining participants with the reversed order. In Experiment 1b, for each generation, each distractor configuration was repeated 12 times. The stimulus was presented either on the right or on the left from fixation, as randomly determined. For the first generation, participants performed the orientation discrimination task on the 20 different displays (× 12 repetitions). Subsequently, the fitness value for each display was determined by calculating the mean accuracy over the 12 repetitions, and the four best displays were selected according to a survival of the fittest principle (i.e., those displays with the highest accuracy rates).
The four best displays were used to generate 12 new displays for the next generation. More specifically, evolution was performed by taking 50 % of the elements from one display and 50 % of the elements from another 'best' display (while maintaining the distractor locations). In contrast to Van der Burg et al. (2017), we did not insert mutations. Each of the 12 evolved displays was repeated 12 times within a generation and participants performed the orientation discrimination task on each trial. This procedure was repeated for 4 generations, resulting in five generations of human performance. To determine whether participants were able to perform the task, and to check that participants were not fixating on one of the red dots instead of the fixation dot, we also inserted target alone trials within blocks. If participants fixated on one of the dots, the mean accuracy on these trials would be expected to approach 50 % correct (i.e., an exclusion criterion). Participants performed the task in 8 experimental blocks (one session). Half of the participants started with eccentricity 7.1° (blocks 1-4) followed by eccentricity 4.7° (blocks 5-8), and the remaining participants with the reversed order. Blocks 1 and 5 represent the first generation, consisting of 252 trials each ((20 different displays × 2 display positions × 2 target orientations + 4 target alone trials) × 3 repetitions). The other blocks (generations 2-4) consisted of 156 trials ((12 different displays × 2 display positions × 2 target orientations + 4 target alone trials) × 3 repetitions). Participants performed 3 sessions in total.
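The selection and recombination steps described above can be sketched as follows. This is a minimal illustration, not the experiment code: the text does not specify how the 50 % of elements taken from each parent were chosen or how parents were paired, so here both are assumed to be random. The function names and the random fitness values are hypothetical placeholders for the measured mean accuracies.

```python
import random


def select_best(displays, fitness, n_best=4):
    """Keep the n_best displays with the highest fitness (mean accuracy)."""
    ranked = sorted(range(len(displays)), key=lambda i: fitness[i], reverse=True)
    return [displays[i] for i in ranked[:n_best]]


def crossover(parent_a, parent_b, rng):
    """Child takes half of its elements from each parent, locations preserved.

    Assumption: the half taken from parent_a is a random set of locations.
    """
    n = len(parent_a)
    from_a = set(rng.sample(range(n), n // 2))
    return [parent_a[i] if i in from_a else parent_b[i] for i in range(n)]


def next_generation(displays, fitness, rng, n_best=4, n_children=12):
    """Breed n_children displays from the n_best parents, without mutation."""
    parents = select_best(displays, fitness, n_best)
    children = []
    for _ in range(n_children):
        a, b = rng.sample(parents, 2)  # two different parents per child
        children.append(crossover(a, b, rng))
    return children


rng = random.Random(1)
n_distractors = 512
# First generation: 20 displays with a 36 % chance of a vertical distractor.
displays = [["v" if rng.random() < 0.36 else "h" for _ in range(n_distractors)]
            for _ in range(20)]
fitness = [rng.random() for _ in displays]  # placeholder for mean accuracy
children = next_generation(displays, fitness, rng)
print(len(children), "offspring displays of", len(children[0]), "distractors each")
```

Because no mutations are inserted, every element of an offspring display is inherited from one of the four selected parents at the same location.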
Modelling. The model was implemented in Python version 3.9.7, as per the description in the introduction. The model (CODE_model.py) can be downloaded from the Open Science Framework (OSF) page https://osf.io/tjcbu/. Although all principles hold for two-dimensional displays, to keep things computationally tractable the model lived in a one-dimensional world. The model thus considered one-dimensional arrays representing multiple positions in the visual field (from −50° to 50° of visual angle). The central position (0°) always contained the target (i.e., [0, 't']), while each distractor could be presented at a given distance from the target. For example, distractor [-3.1, 'v'] represents a vertical distractor 3.1° to the left of the target location, and distractor [4.5, 'h'] represents a horizontal distractor 4.5° to the right of the target location.
The model has two fixed parameters and three free parameters, all of which affect the width of the functions representing the elements. First, σ was allowed to vary between 0 and 1, and represents the spatial extent to which an item affects its neighbours. In Experiments 1a and 1b, the target was presented at different eccentricities, and on the basis of what is known about receptive field sizes we assumed that the spatial interactions between target and flanker elements would scale linearly with eccentricity (certainly for the display ranges used here; Wandell & Winawer, 2015). For that reason, and as an additional validation of the model, we did not estimate σ separately for each eccentricity, but estimated σ for one eccentricity only, and then assumed it would scale linearly for the other eccentricity. The effect of σ is then further modulated by similarity. Here w_maxsim represents the maximum similarity, namely that between two identical distractors, and was therefore fixed at 1. In addition, w_minsim represents the maximum dissimilarity in the model's world, namely that between the two distractor types (horizontal versus vertical), and was also fixed, at 0.1. We chose 0.1 and not 0 because even horizontal and vertical items share properties, such as a common hue and brightness, a common shape, and a common onset. With a value of 0, dissimilar items would not affect each other at all. The value of 0.1 was based on pilot simulations, which provided reasonable results. Crucial then are two free parameters, w_sim and w_dissim, which represent the similarity of the near-vertical target to similar (i.e. vertical) and dissimilar (i.e. horizontal) distractors, respectively. Both these parameters were free to vary between w_minsim and w_maxsim.
The model was then fit to the behavioural performance under standard, sparse display crowding conditions, for the larger of the two target eccentricities only, using the curve_fit function from the scipy Python module, which uses non-linear least squares methods. The estimated σ was then linearly extrapolated to the other eccentricity. Importantly, exactly the same σ, w_sim, and w_dissim values were then used in the simulations of Experiment 1b to see if the exact same model could account for the dense display data.
For the simulations of Experiment 1b, the only difference was in the stimulus input, which was now a list with 81 elements, representing the target plus 40 distractors on each side in a 1-dimensional space, with a distance between elements of 0.6°. Here, as in the behavioural task, the initial generation of 20 displays was randomly determined by assigning either a vertical or a horizontal distractor to each location, with each location having a 36 % chance of being vertical and a 64 % chance of being horizontal (following the average proportions for the participants). The model subsequently estimated the performance for each display in the generation, and a small random error (between −5 and 5 %) was added to this performance, to add some noise to the system. Subsequently, the best four displays were selected and combined to generate 12 evolved displays using the same methodology as in the behavioural task, representing the new generation. Then, this procedure (estimate performance for each display, select the best displays, generate new evolved displays) was repeated for four more generations. We present the average display proportions across simulations.
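As an illustration of the dense-display input described above, the following Python sketch builds a one-dimensional display in the [position, type] format and adds the random error to a performance estimate. It is a sketch under stated assumptions rather than an excerpt from CODE_model.py: the names `make_dense_display` and `noisy_performance` are hypothetical, the noise is assumed to be uniform and expressed on the proportion-correct scale (±0.05), and the model itself is represented only by a caller-supplied function `estimate_performance`.

```python
import random

def make_dense_display(n_per_side=40, spacing=0.6, p_vertical=0.36, rng=None):
    """Target at 0 deg plus n_per_side distractors on each side.

    Each element is [position_in_deg, type], with type 't' (target),
    'v' (vertical distractor) or 'h' (horizontal distractor).
    """
    rng = rng or random.Random()
    display = [[0.0, "t"]]
    for k in range(1, n_per_side + 1):
        for sign in (-1, 1):
            kind = "v" if rng.random() < p_vertical else "h"
            display.append([round(sign * k * spacing, 3), kind])
    return sorted(display)  # ordered from leftmost to rightmost position

def noisy_performance(display, estimate_performance, rng, noise=0.05):
    """Model estimate plus uniform noise in [-noise, +noise] (assumed form)."""
    return estimate_performance(display) + rng.uniform(-noise, noise)
```

An initial generation would then be, for example, `[make_dense_display(rng=rng) for _ in range(20)]`, with each display scored through `noisy_performance` before selection.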
Fig. 4A shows the average human performance as a function of generation and eccentricity in Experiment 1b. As the graph suggests, performance improved as the displays evolved, F(1, 19) = 44.0, p < .001, ηp² = .698, and was better for the closer eccentricity, F(3, 57) = 24.2, p < .001, ηp² = .560. These effects were additive, as there was no sign of a reliable two-way interaction, F(3, 57) = 0.77, p = .774, ηp² = .012. To assess in what way the displays changed over generations, Fig. 4B shows a 2D colour map of the changes in the proportion of similar (vertical) distractors, with the largest changes occurring close to the target. Fig. 4C shows the mean change in the proportion of vertical flankers in the final generation (Generation 5) compared to the starting point (of 36 % verticals), as a function of eccentricity and target-distractor distance, which was divided into five bins of 1° each. An ANOVA with the same factors as within-subject variables revealed only a main effect of target-distractor distance, F(4, 76) = 14.0, p < .001, ηp² = .424. The reduction in the proportion of vertical distractors was significant for the first bin (target-distractor distance of 0.5°−1.5°), t(19) = 3.65, p = .002, with no further differences for the other distances (all p values ≥ .377). The main effect of eccentricity as well as the two-way interaction failed to reach significance, F(1, 19) = 1.08, p = .312, ηp² = .054, and F(4, 76) = 0.27, p = .896, ηp² = .014, respectively. Thus, the effect was limited to the bin containing the directly neighbouring distractor elements, the spatial extent was much reduced compared to predictions on the basis of Bouma's law, and, further contrary to this law, there was no effect of eccentricity.
Modelling results. We first fitted the model to the behavioural performance in the large eccentricity condition of Experiment 1a. This resulted in the following parameter estimates: σ = 0.271; w_sim = 0.765; w_dissim = 0.123 (starting values were set to σ = 0.4, w_sim = 0.95 and w_dissim = 0.15; target alone performance was set to 0.93 for the large and to 0.95 for the small eccentricity condition, following the behavioural performance). The model fit is drawn as solid lines in Fig. 3B and appears to capture the data pattern quite well (r² = 0.960). For the small eccentricity condition we did not fit again, but instead used the same parameters, except for the σ parameter, which we simply scaled linearly according to the eccentricity ratio (4.7°/7.1°), resulting in a value of 0.180. The resulting model predictions are shown in Fig. 3A. Again, the fit appears to capture the data pattern well (r² = 0.931). The same parameter estimates were then transferred to the simulations of Experiment 1b.
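The linear scaling of σ amounts to a one-line computation, sketched here in Python (the helper name `scale_sigma` is hypothetical). With the rounded published values the result is roughly 0.18, in line with the reported 0.180; any difference in the third decimal presumably reflects rounding of the published estimates.

```python
def scale_sigma(sigma_fit, ecc_fit, ecc_new):
    """Scale sigma linearly with target eccentricity."""
    return sigma_fit * ecc_new / ecc_fit

# sigma fitted at 7.1 deg, transferred to the 4.7 deg condition
sigma_small = scale_sigma(0.271, ecc_fit=7.1, ecc_new=4.7)
print(round(sigma_small, 2))
```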
For Experiment 1b we then used the same model as for Experiment 1a, including the same parameter values (σ = 0.180 and 0.271 for the small and large eccentricity conditions, respectively; w_sim = 0.765; w_dissim = 0.123). Fig. 4D then shows the performance of the model as a function of generation and target eccentricity. As can be seen, as for humans, model performance improved rapidly with generation, F(4, 396) = 4011.9, p < .001, ηp² = .976, and it also shows worse performance overall for the larger eccentricity, F(1, 99) = 1111.6, p < .001, ηp² = .918. In contrast to the human data, the interaction was also significant, F(4, 396) = 65.1, p < .001, ηp² = .397, as the eccentricity effect was larger for the initial generation than for the other generations. We believe this is largely due to ceiling effects at the later generations, as we will return to below. To assess in what way the displays changed over generations, Fig. 4B shows the outcome of the simulated evolutions for the one-dimensional world the model lives in. Both the colour map and the bottom graph show the proportion of verticals (i.e. similar distractors) surrounding the target position. The vertical dashed lines represent the predictions from Bouma's law. As can be seen, over generations, the number of verticals has dropped to zero for the positions directly flanking the target, with no systematic change in the positions further away. Moreover, there is no discernible effect of eccentricity. This performance pattern was statistically confirmed by conducting an ANOVA on the proportion of verticals in the display with eccentricity and target-distractor distance (10 distractors to the left and 10 distractors to the right of the target) as repeated-measures variables. The ANOVA yielded a significant effect of target-distractor distance, F(19, 1881) = 21.6, p < .001, ηp² = .179, but not when we excluded the nearest neighbours from the analyses, F(17, 1683) = 1.24, p = .228, ηp² = .012. This confirms that the effect of target-distractor distance was primarily driven by the nearest neighbours. Neither the eccentricity effect nor the two-way interaction was significant, F values ≤ 2.29, p values > .134.

Discussion
The model clearly captures some essential characteristics of crowding in both sparse and dense displays. First, it shows overall decreased performance with eccentricity. This is not surprising as σ scales with eccentricity, and thus so does the influence that items exert on each other. Second, it shows the classic flanker proximity effects in sparse displays, as the effect of the flankers reduces with distance. Third, it simulates the coupling of the critical distance to the eccentricity, thus following Bouma's law. Fourth, it captures similarity effects in that proximity effects are much reduced for dissimilar distractors. Fifth, it captures the limited spatial extent of crowding in dense displays. And last, but not least, it correctly predicts little to no effect of eccentricity on the critical spacing in such dense displays. Thus, the model successfully bridges the gap between sparse and dense displays.
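To make the link between σ and the critical distance concrete, the sketch below implements a generic CODE-style grouping rule in Python: each element contributes a spread function, the contributions are summed into an activation surface, and two elements are treated as belonging to the same cluster (i.e. as failing to segment) when the surface stays above a threshold everywhere between them. This is not the authors' CODE_model.py. The Gaussian shape, unit height, threshold of 0.5, and the function names are all assumptions made for illustration, and the similarity weights are left out entirely.

```python
import math

def surface(x, positions, sigma):
    """Summed activation at x from unit-height Gaussian spread functions."""
    return sum(math.exp(-((x - p) ** 2) / (2 * sigma ** 2)) for p in positions)

def same_cluster(a, b, positions, sigma, threshold=0.5, steps=400):
    """True if the surface stays above threshold everywhere between a and b."""
    lo, hi = min(a, b), max(a, b)
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(surface(x, positions, sigma) > threshold for x in xs)

# Target at 0, a single flanker at distance d (arbitrary units).
for sigma, d in [(1.0, 3.0), (1.0, 4.0), (2.0, 4.0)]:
    grouped = same_cluster(0.0, d, [0.0, d], sigma)
    print(f"sigma={sigma}, distance={d}: grouped={grouped}")
```

Under these assumptions, two elements merge when their separation is below about 3.3σ, so the distance at which target and flanker segment grows in proportion to σ, and hence with eccentricity if σ scales with eccentricity.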
While model behaviour qualitatively mimics the human data, there were also some differences, especially for the dense displays. First, the model learns more rapidly and performs better overall than humans, which resulted in much more pronounced nearest neighbour effects. This is the result of the model behaving near perfectly, resulting in near-optimal evolution. For humans, the evolution is likely to be slowed by the more complex 2D displays (with more nearest neighbours as a result), and by task-unrelated factors. Furthermore, human performance suggested that especially the similarity of the distractors above and below the target mattered (see also Van der Burg et al., 2017). This suggests a role for grouping through connectivity or good continuation, which we here regard as a special case of grouping by similarity (as good continuation requires similar orientations). The model does not distinguish between these types, as it only lives in a 1D world anyway.

Study 2: The protective role of the nearest neighbours in dense displays
For the dense displays as used in Van der Burg et al. (2017) as well as in Study 1 here, we observed that the effect of distractors similar to the target is limited to the directly neighbouring locations, while similar distractors beyond these nearest neighbours have little to no influence. The model captures this because similarity is determined locally, between neighbouring items only. In this study we sought to first experimentally confirm this nearest neighbour similarity effect. While the GA method has provided evidence for a specific role of the nearest neighbours, we emphasize that the effects that this method produces are not the result of experimental manipulations, but rather the outcome of an evolutionary process. Moreover, although we were careful to have sufficient numbers of generations, one could always argue that the evolution did not run long enough, and thus that flankers beyond the nearest neighbours may be important if one extends the evolution. The goal of this study was therefore to empirically confirm the nearest neighbour similarity effect using experimental manipulations, and to then assess whether our model also captures the data pattern from this experiment.
As we slightly changed the stimulus characteristics for this experiment relative to Study 1, before we ran the main experiment (Experiment 2b), we first ran a new standard sparse display version of the crowding task (Experiment 2a). The model fits for this sparse display experiment (see Fig. 5) then also served to provide the parameter estimates for Experiment 2b, which sought to experimentally test the role of the similarity of the nearest neighbours to the target in dense displays. Fig. 6A illustrates the main manipulations of Experiment 2b. Observers discriminated the tilt of a near-vertical target in a field filled with distractors. We varied the homogeneity of the distractor field by changing the proportion of randomly placed similar distractors (verticals), and thus also of dissimilar distractors (horizontals), between 0 % and 100 %, in 20 % increments. There were three main conditions, which varied in the constellation of the target's direct neighbours. In the Random Neighbours condition, the similarity (i.e., orientation) of the direct neighbours was subject to the same probabilistic manipulation as the rest of the display (see Fig. 6A, middle row). Thus with increasing proportions of similar distractors, the chance of a target neighbour being similar changed accordingly, and one would thus expect performance to deteriorate. In the Dissimilar Neighbours condition, the eight positions surrounding the target were always occupied by dissimilar, horizontal distractors (see Fig. 6A, upper row). From a nearest neighbour perspective, the prediction is then that adding similar distractors to the display will have little effect, given the dissimilar flankers immediately surrounding the target. Finally, in the Similar Neighbours condition, all neighbours were vertical, and therefore similar to the target (see Fig. 6A, lower row). From a nearest neighbour perspective, target discrimination should now be most affected, regardless of the proportion of verticals in the remainder of the display. We also included two baselines, which contained only the direct neighbours, and in which these were either all similar or all dissimilar.

Method
Participants. In Experiment 2a, 24 participants (17 female, 7 male; mean age: 22.3 years, ranging from 18 to 44 years; all naïve as to the purpose of the experiment) took part. Two of these were excluded from further analyses due to chance level performance, at 0.45 and 0.54 correct. In Experiment 2b, 25 participants (eight females; mean age: 25 years, ranging from 20 to 44 years) completed the experiment. The data from two participants were excluded from further analyses as they performed overall close to chance (0.48 and 0.50), while another participant was excluded as they performed close to chance in the easiest condition (i.e. where the target was surrounded by only horizontal nearest neighbours; 0.58 vs. 0.88 for the group).
Stimulus, design, and procedure. The experiment was programmed and run using E-Prime (Psychology Software Tools, Inc.; Pittsburgh, USA). The experimental procedure of Experiment 2a followed that of Experiment 1a, except where indicated. The display consisted of a white target line presented at 6° eccentricity, surrounded by four white horizontal or vertical distractor lines (length 0.43° of visual angle, width 0.09°) above, below, to the left and to the right. A baseline condition was included in which the target was shown in isolation (i.e., no distractors were present). On a given trial all distractors were equidistant from the target. This target-distractor distance was systematically manipulated as either 0.75°, 1.50°, 2.25°, 3.00°, 4.50°, or 5.25° of visual angle. A trial started with the presentation of the fixation dot and the two red target location dots for 500 ms. Subsequently, the stimuli appeared for 150 ms, after which a display with the fixation dot and the two red target location dots remained on screen until response. Target orientation, target-distractor distance, target-distractor similarity and visual field were balanced and randomly mixed within six blocks of 96 trials each. The first block was considered a practice block, leaving 5 experimental blocks.
In Experiment 2b, the display consisted of a white target line surrounded by 284 white horizontal or vertical distractor lines. The target and distractor lines were presented either to the left or to the right of the white central fixation dot. The target eccentricity was 6° and was fixed during the experiment. In the Similar Neighbours condition, the target's nearest neighbours (the eight distractors directly abutting the target line) were all similar to the target (i.e. vertical). In the Dissimilar Neighbours condition they were all dissimilar (horizontal). In the Random Neighbours condition they were randomly determined, with the proportion of similar nearest neighbours following the proportion of similar distractors in the rest of the display. The orientation of the remaining 276 distractors (284 − 8 nearest neighbours) was manipulated by parametrically varying the proportion of similar distractors in the display (0.0 to 1.0, in steps of 0.2). The distractor and target lines were presented in an invisible 19 × 15 grid (13.5° horizontal and 10.5° vertical, with a 0.75° distance between the line segments). Two baseline conditions were also included (not shown in Fig. 6A), in which the target was surrounded by only the eight nearest neighbours, while the rest of the display was empty. In these conditions the nearest neighbours were either all similar or all dissimilar to the target. As before, a trial started with the presentation of the fixation dot and the two red dots signifying the target locations for 500 ms. Subsequently, the target and the distractors appeared for 150 ms, and then disappeared, leaving the fixation dot and the two red dots signifying the two potential target locations. Participants pressed the z- or m-key when the target was tilted anticlockwise or clockwise from vertical, respectively. The next trial was initiated after participants made the unspeeded response. The location of the line segments (left or right of fixation) was randomly determined on each trial. The target orientation (left or right of vertical), overall proportion of similar distractors across the display (0.0, 0.2, 0.4, 0.6, 0.8 and 1.0), and similarity of the nearest neighbours (all similar, all dissimilar, or random) were balanced and presented in random order within blocks, mixed together with the nearest neighbour only conditions. Participants completed one practice block followed by four experimental blocks of 120 trials each. After each block, participants received feedback about their overall mean proportion correct.
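The display geometry given above can be checked with a few lines of arithmetic, written out here in Python. The only assumption is that the stated horizontal and vertical extents are measured between the centres of the outermost grid positions.

```python
cols, rows, spacing = 19, 15, 0.75   # grid size and inter-element distance (deg)

n_cells = cols * rows                 # 285 grid positions
n_distractors = n_cells - 1           # one position holds the target -> 284
n_remaining = n_distractors - 8       # minus the eight nearest neighbours -> 276
width = (cols - 1) * spacing          # 18 gaps of 0.75 deg -> 13.5 deg
height = (rows - 1) * spacing         # 14 gaps of 0.75 deg -> 10.5 deg

print(n_distractors, n_remaining, width, height)
```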
Model fitting. The model fitting was done as in Study 1: we first fitted the model to the human data for the sparse displays (Experiment 2a) and then assessed to what extent the model's behaviour also resembled the human data for the dense displays (Experiment 2b). For the latter, the stimulus input was a list with 81 elements, representing 80 distractors (40 on each side) and a target in a 1-dimensional space (distance between elements: 0.75°). For each combination of nearest neighbour similarity condition and overall proportion of similar distractors we ran 200 simulations.
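The stimulus input for these dense-display simulations could be generated along the following lines. This Python sketch is illustrative only: the function name `make_study2_display` is hypothetical, and we assume that in the one-dimensional world the "nearest neighbours" are the two elements immediately adjacent to the target (the counterpart of the eight abutting distractors in the 2D displays), which the text does not state explicitly.

```python
import random

def make_study2_display(p_similar, neighbours, n_per_side=40, spacing=0.75, rng=None):
    """Build a 1D display as a sorted list of [position_in_deg, type].

    neighbours: 'similar' (both adjacent elements vertical),
                'dissimilar' (both horizontal), or
                'random' (drawn like the rest of the display).
    p_similar:  proportion of vertical (target-similar) distractors elsewhere.
    """
    rng = rng or random.Random()
    display = [[0.0, "t"]]
    for k in range(1, n_per_side + 1):
        for sign in (-1, 1):
            if k == 1 and neighbours == "similar":
                kind = "v"
            elif k == 1 and neighbours == "dissimilar":
                kind = "h"
            else:
                kind = "v" if rng.random() < p_similar else "h"
            display.append([sign * k * spacing, kind])
    return sorted(display)

# 200 simulated displays for one cell of the design
rng = random.Random(2)
cell = [make_study2_display(0.4, "dissimilar", rng=rng) for _ in range(200)]
```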

Results
Experiment 2a. Fig. 5 shows the results for Experiment 2a, including the model fits. The findings again follow the expected pattern of crowding, with behavioural performance strongly decreasing for nearby similar distractors. Both main effects and the interaction were highly reliable, all F values ≥ 35.7, all p values < .001. For similar distractors, the pattern again nicely complies with Bouma's law. Target alone performance was 0.931. Fitting the model to this sparse display data set then yielded σ = 0.330, w_sim = 0.730, and w_dissim = 0.188 (start values: σ = 0.4, w_sim = 0.95, and w_dissim = 0.15). Fig. 5 (solid lines) shows that the model fits well overall (r² = 0.979). These parameter values were then transferred to Experiment 2b.
Fig. 6. Conditions and results of Experiment 2b. A) Illustration of the different conditions. Note that the displays are for illustrative purposes and not scaled to the original displays, which contained 284 distractor lines (see Fig. 1c for an example display drawn to scale). Also, the target was presented at eccentricity, not centrally. B) Behavioural results of Experiment 2b. Here, the mean proportion correct is shown as a function of the proportion of similar distractors in the display for each nearest neighbour condition. The orange circles signify the condition in which the nearest neighbours (the eight distractors abutting the target element) were all dissimilar to the target orientation. The blue squares signify the condition in which the nearest neighbours were all similar. The orange and blue dashed lines correspond to the flanker alone baseline conditions. The green triangles signify the condition in which the similarity of the nearest neighbours was randomly determined. In this condition, the nearest neighbours followed the proportion of similar distractors in the entire display. Error bars represent the standard error of the mean. C) Same as panel B, but for the model output. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Experiment 2b. Fig. 6B shows the pattern of human performance for Experiment 2b, namely the mean proportion of correct target identification as a function of the proportion of distractors similar to the target in the display for each nearest neighbour similarity condition, including the two baselines. A repeated measures ANOVA on the mean proportion of correct target identifications was conducted with the proportion of similar distractors (0, 0.2, 0.4, 0.6, 0.8, 1.0) and nearest neighbour similarity condition (similar, dissimilar, or random) as within-subject variables. The ANOVA yielded a significant main effect of nearest neighbour similarity, F(2, 42) = 64.5, p < .001, ηp² = .745, a significant main effect of the overall proportion of similar distractors, F(5, 105) = 12.2, p < .001, ηp² = .386, as well as a significant two-way interaction between the nearest neighbour similarity and the overall proportion of similar distractors, F(10, 210) = 13.0, p < .001, ηp² = .383. This interaction is evident from Fig. 6B and was further examined using three separate one-way ANOVAs, one for each nearest neighbour condition, with proportion of similar distractors as factor. First, when the nearest neighbours were randomly determined in tandem with the entire display (green triangles in Fig. 6B), performance deteriorated monotonically as the proportion of elements similar to the target (i.e. the vertical elements) in the display increased, F(5, 105) = 26.5, p < .001, ηp² = .558, as was expected.
However, an exception to this general pattern emerged for the condition in which all the distractors were similar to the target orientation. Indeed, when going from 80 % to 100 % vertical distractors, performance increased significantly from 0.60 to 0.71, t(21) = 3.34, p = .003. Furthermore, performance in the homogeneous similar flanker condition was better than in the baseline condition in which the target was surrounded by just the eight immediate vertical flankers, with the rest of the display being blank (the blue dashed line in Fig. 6B), t(21) = 3.17, p = .005. When the nearest neighbours were all similar to the target orientation (blue squares in Fig. 6B), there was no significant effect of the proportion of similar distractors in the display, F(5, 105) = 2.13, p = .076, ηp² = .092, although the pattern again showed a trend towards improved performance when all distractors in the display were vertical. Finally, when the nearest neighbours were all dissimilar (orange circles in Fig. 6B), the pattern showed two aspects of importance. While the one-way ANOVA yielded a significant effect of the proportion of similar distractors, F(5, 105) = 8.946, p < .001, ηp² = .299, this effect was solely determined by the condition in which all distractors in the field were dissimilar (i.e. horizontal), resulting in better performance than for any of the displays containing any proportion of similar (i.e. vertical) distractors. When we excluded this homogeneous condition from the analyses, no effect of the proportion of similar distractors remained, F(4, 84) = 1.18, p = .326, ηp² = .053. These results suggest three things: 1) Again the nature of the nearest neighbours appears crucial for crowding; 2) The presence of all dissimilar neighbours most proximal to the target provides a sufficient buffer against similar distractors further away, stressing again the importance of the nearest neighbours; 3) However, breaking up the overall homogeneity of the field has considerable impact too, suggesting some effects of global structure.
Fig. 6C shows the model simulation of the same conditions. As before, we derived the parameter estimates from fitting the model to data from the sparse display version, here Experiment 2a. Inter-element distance was also adapted to this experiment. Fig. 6C then shows the model's predictions for the dense displays as used in the main experiment. As can be seen, it resembles the overall human data pattern quite well, with a constant high performance for dissimilar, horizontal neighbours, a constant low performance for similar, vertical neighbours, and a steady decline for random displays with increasing numbers of vertical distractors. Thus, the model captures the crucial nearest neighbour similarity effects in the data. The model even captures the small deflection back up (see green and blue lines in Fig. 6C), leading to relatively better performance for displays in which all distractors are target-similar (i.e. homogeneously vertical). The reason for this relative improvement is that within the model, all-similar distractor arrays create a stronger local group than arrays that are broken up by the presence of target-dissimilar distractors, even when those are sparse. We will return to this effect in Study 3, where we explore a similar occurrence more systematically.

Discussion
The behavioural results suggest that performance is modulated by both the local and the global structure of the display. First, when the displays were heterogeneous (i.e. when they contained both similar and dissimilar distractors), performance was driven by the local structure around the target, that is, by the nearest neighbours. Specifically, it was relatively easy to identify the target when the nearest neighbours were dissimilar to the target, and difficult when they were similar. Importantly, under these conditions the proportion of similar distractors elsewhere in the display had no effect, despite the fact that a large proportion of these items fell within the classic interference zone that would apply under Bouma's law. This confirms the conclusions of Experiment 1b as well as Van der Burg et al. (2017) of a special role for the similarity of the nearest neighbours. In fact, the nearest neighbours appear to provide an effective cordon against interference from items further away. Second, an exception occurred when all distractors in the display were the same. Whether similar or dissimilar to the target, a homogeneous distractor field led to improved performance. This suggests that the overall display structure also played a role, as homogeneous fields interfered less with target discrimination.
Our model captures the data pattern remarkably well, including some of the more global effects. At the same time, it also missed some other aspects of the data. Most notably, while human performance was very high for the homogeneously horizontal (i.e. target-dissimilar) distractor displays, at the same level as when only the immediately neighbouring flankers were present, performance dropped as soon as only a few vertical (i.e. target-similar) distractors were added to the display. In contrast, here the model shows high performance throughout. This is because in the model segmentation of the target is already very strong on the basis of the local dissimilarity to its nearest neighbours. Additional local groupings (or break-ups of such groupings) beyond the nearest neighbours therefore have little effect here on the model's performance. Note that the model also misses the slight decrease in performance when all nearest neighbours are vertical and the rest of the display horizontal, creating a smaller horizontal square structure embedded in a larger vertical structure (most leftward blue condition in Fig. 6), or vice versa (most rightward orange condition in Fig. 6). This indicates that other, additional mechanisms affect human performance. These may include global, long-range grouping, but attentional effects may also play a role. We will return to this in the General Discussion.

Study 3: Adding distractors reduces crowding
Study 2 indicated that the model, despite its local characteristics, captures some grouping effects that go beyond the nearest neighbours. Specifically, performance was better for homogeneous arrays of target-similar distractors than for distractor arrays that also contained dissimilar distractors. This may be a special case of a phenomenon that has been observed before: that crowding can actually diminish when the number of same distractors is increased (Banks, Larson, & Prinzmetal, 1979; Malania, Herzog, & Westheimer, 2007; Manassi, Sayim, & Herzog, 2012, 2013; Põder, 2006). The explanation is that the more of the same distractors there are, the more strongly they cluster together and segment from the target. Here we assess whether our model can mimic this pattern. We focused on the findings from Malania et al. (2007) and Manassi et al. (2012), who used one-dimensional arrays containing line segments, and which therefore come close to our model world. In their experiments, observers judged the arrangement of vertical Vernier line segments presented among varying numbers of vertical, and thus target-similar, flankers. The results revealed a decrease in perceptual threshold with increasing numbers of flankers. Here we sought to simulate this situation and show how our model qualitatively captures the basic pattern. As we simulated existing data, no new behavioural experiment on human observers was run.
Model simulation. Following Malania et al. (2007) and Manassi et al. (2012), for this simulation we only used distractors similar to the target. We symmetrically varied the number of distractors left and right of the target between 1 and 10. We adopted the parameter values estimated in Study 2, so σ = 0.330, w_sim = 0.730, and inter-element distance = 0.75°. Note that w_dissim is not relevant here since we use vertical distractors only. Proportion correct for the target alone condition was set to 0.931, following Experiment 2a. Whereas the empirical studies of Malania et al. (2007) and Manassi et al. (2012) used Vernier direction acuity thresholds as the measure of performance, here we instead used proportion correct tilt discrimination as the (simulated) dependent measure. Fig. 7 shows how the model's performance improves with the number of flanking distractors. Consistent with the empirical pattern, model performance gradually improved when the number of flanking elements increased from one (0.62 accuracy) to ten distractors (0.72 accuracy) on each side. To understand why the model shows this, Fig. 7B is illustrative. Adding more distractors leads to strong above-threshold distractor clusters. As a result, the overall performance increases, as performance reflects the ratio between the total area above and below the threshold. As a general principle, within the model, performance benefits when target-distractor segmentation is strong. Here the distractors themselves contribute substantially to such segmentation by clustering together. A similar effect occurred for the deflections back up in Study 2 as the number of target-similar distractors increased towards homogeneous displays. To conclude then, the model can capture some grouping effects that come with adding extra distractors to the display, on the basis of local grouping mechanisms. We emphasize though that the current simulation is qualitative in nature, as both the original stimuli and the metric used differed. For the same reason we do not wish to exclude the possibility of other, more global grouping mechanisms being at play.
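The input arrays for this simulation are simple to write down. The Python sketch below builds one array per flanker count in the [position, type] format used for the model input; the helper name `make_flanker_array` is hypothetical, and the model evaluation itself is not reproduced here.

```python
def make_flanker_array(n_per_side, spacing=0.75):
    """Target at 0 deg flanked by n_per_side vertical distractors on each side."""
    display = [[0.0, "t"]]
    for k in range(1, n_per_side + 1):
        display += [[-k * spacing, "v"], [k * spacing, "v"]]
    return sorted(display)

# One array for each flanker count from 1 to 10 per side
arrays = {n: make_flanker_array(n) for n in range(1, 11)}
```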

General discussion
Crowding is a ubiquitous visual phenomenon with many different aspects. Most of the characteristics of crowding have been studied using sparse stimulus displays, while most of the crowding that affects our perception in real life is likely to be driven by densely cluttered visual environments. An understanding of crowding in dense displays is therefore crucial. We show here that core principles that have been shown to apply to crowding in sparse displays also help understand crowding in dense displays. Specifically, we show how a relatively simple model captures core crowding-related phenomena in both types of display. The model assumes grouping by proximity and similarity between elements, the strength of which decays at a decelerating rate. The model captures similarity, eccentricity as well as critical distance effects in sparse displays (Studies 1 and 2). It also captures nearest neighbour similarity effects in dense displays, whether these effects emerge through genetic algorithms (Study 1) or experimental manipulations (Study 2). It uniquely predicts that in contrast to sparse displays, in dense displays eccentricity has no effect on the critical distance, as interference from similar distractors remains limited to the nearest neighbour (Study 1). Finally, it captures local grouping benefits on performance, including the beneficial effect of adding more distractors of the same kind to a sparse display (Study 3). We conclude that in explaining these phenomena, there is little reason to assume fundamentally different mechanisms in operation for dense versus sparse displays. Rather, whether a display is sparse or dense determines which property of the model will dominate: proximity or similarity. In sparse displays, the distractors are by definition the target's nearest neighbours and thus proximity of these neighbours becomes a main determinant. In dense displays, the nearest neighbours are by definition in the positions closest to the target, and hence their similarity becomes the major determinant of target discrimination.
After grouping by proximity (Compton & Logan, 1993; Van Oeffelen & Vos, 1982a), numerosity judgments (Allik & Tuulmets, 1991; Vos et al., 1988), good continuation (Van Oeffelen, Smits, & Vos, 1985; Vos & Helsper, 1991), and object-based attention (Logan, 1996), our simulations add crowding to the list of phenomena that can be captured by the CODE algorithm, thus adding to its validity as a model of perceptual grouping in general. We point out that there have been slightly different implementations of the CODE algorithm in the literature. First, where Van Oeffelen and Vos (1982a) originally proposed that σ is determined individually for each element's spread function and is standardly set at half the distance to its nearest neighbour, Compton and Logan (1993) arrived at a single common σ which corresponded to half the average distance between all the elements. Instead, we chose to estimate, rather than pre-set, its value on the basis of fits to the empirical data. Nevertheless, our estimates turned out very similar to the values set by Van Oeffelen and Vos (1982b) as well as Compton and Logan (1993), and correspond to rather local influences of proximity, with little effect beyond a few elements. Second, while Van Oeffelen and Vos (1982a) proposed a normalization of the spread functions (as we did here), Compton and Logan (1993) argued against it as it yielded poorer fits for their data. However, this argument was only relevant in the context of how they set σ and does not affect our conclusions here. Third, Van Oeffelen and Vos (1982a) and Logan (1996) used the sum of element activity as the measure of grouping strength, while Compton and Logan (1993) used the maximum. Here too, this distinction only matters when assuming different spread functions (σ) depending on element distance, which is not the case for our model. Fourth, Van Oeffelen and Vos (1982a) set the segmentation threshold rather arbitrarily at 1, while Compton and Logan (1993) allowed for flexible thresholds to allow for a hierarchy of groups, with inflection (saddle) points being the points where groups segment. This is the version we adopted too. Fifth, while Van Oeffelen and Vos (1982a) used Gaussian spread functions, we, again following Compton and Logan (1993), preferred a Laplace spread function as it provides clearer inflection points. Finally, Logan (1996) in his General Discussion already proposed a way of implementing similarity, but this made use of a multitude of parameters in what was a much more complex model seeking to merge the CODE model with Bundesen's (1990) Theory of Visual Attention. Here we show how adding just a simple similarity-based modulation of proximity grouping suffices for the phenomena we sought to explain.
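
The implementation choices listed above can be contrasted in a few lines of Python. This is a sketch under assumptions rather than a reconstruction of any published implementation: "half the average distance between all the elements" is read here as half the mean over all element pairs, positions are one-dimensional, and all function names are hypothetical.

```python
import numpy as np

def sigma_nearest_neighbour(positions):
    """Per-element sigma: half the distance to each element's nearest neighbour."""
    p = np.asarray(positions, dtype=float)
    d = np.abs(p[:, None] - p[None, :])
    np.fill_diagonal(d, np.inf)  # ignore each element's distance to itself
    return 0.5 * d.min(axis=1)

def sigma_common(positions):
    """Single common sigma: half the mean distance over all element pairs
    (one possible reading of the common-sigma rule)."""
    p = np.asarray(positions, dtype=float)
    i, j = np.triu_indices(len(p), k=1)
    return 0.5 * np.abs(p[i] - p[j]).mean()

def gaussian(x, mu, sigma):
    """Normalized Gaussian spread function."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def laplace(x, mu, sigma):
    """Normalized Laplace spread function (peaked at mu)."""
    return np.exp(-np.abs(x - mu) / sigma) / (2.0 * sigma)

def surface(x, positions, sigmas, spread=laplace, combine=np.sum):
    """Combine the element spread functions by sum or by maximum."""
    sigmas = np.broadcast_to(sigmas, np.shape(positions))
    layers = [spread(x, p, s) for p, s in zip(positions, sigmas)]
    return combine(layers, axis=0)

if __name__ == "__main__":
    pos = [0.0, 1.0, 3.0]
    x = np.linspace(-2.0, 5.0, 701)
    print(sigma_nearest_neighbour(pos), sigma_common(pos))
    summed = surface(x, pos, sigma_common(pos))
    maxed = surface(x, pos, sigma_common(pos), combine=np.max)
    print(float(summed.max()), float(maxed.max()))
```

Passing `spread=gaussian` or `combine=np.max` switches between the variants, which makes it easy to inspect how the choice of spread function affects the sharpness of the minima between elements.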
The current study confirms the value of the model, but also reveals some of its limitations. While the local, short-range character of the grouping as implemented here proves sufficient to capture some core phenomena as reported by us and others, it is not yet able to explain certain situations, and even outright fails to explain some other findings. First, grouping effects will no doubt depend on the exact nature and complexity of the constituting individual elements. For example, many crowding studies have used letters, which in and of themselves consist of multiple features that each may group with features from neighbouring elements in different ways (e.g. Bernard & Chung, 2011; Keshvari & Rosenholtz, 2016; Zahabi & Arguin, 2014). Moreover, letters and other more complex stimuli such as faces may activate holistic representations that may aid segmentation from surrounding stimuli. Our model currently assumes mere unidimensional stimuli (here orientation), and will need additional assumptions to be able to deal with more complex stimuli. Second, additional assumptions would also be necessary to account for the anisotropy in the strength of interference of flanking objects found here (Experiment 1b) as well as in Van der Burg et al. (2017), where vertical neighbours placed above and below the near-vertical target affected performance more than vertical neighbours placed left and right of the target. This deviates from the radial/tangential anisotropy that is more standard for crowding, in which items flanking the target along the radial axis (i.e. the axis through the fovea) interfere more strongly than flankers that are placed tangentially, with peripheral flankers interfering the most (Chambers & Wolford, 1983; Petrov & Meleshkevich, 2011; Toet & Levi, 1992; Van den Berg, Roerdink, & Cornelissen, 2010). Here we found the radial flankers to be weaker than the tangential flankers. The reasons for this difference probably also lie in the specific stimuli we used. Given that the target was near-vertical, it might have grouped with the top and bottom flankers not only on the basis of similarity, but also good continuation. Our model does not include good continuation, and because it lives in a one-dimensional world it is unable to account for any two-dimensional anisotropies. Future versions could model a 2D world and include more types of grouping mechanism, as well as retinotopic anisotropies.
Such additional mechanisms may also need to include longer-range and higher-level grouping than the local grouping implemented here. Note that the model failed to capture the full pattern for entirely homogeneous arrays of dissimilar distractors, where humans showed a marked performance improvement (see Fig. 6). A completely homogeneous background may be treated as a single textured surface from which the target then stands out. The model also failed to predict some smaller effects of local structures consisting of multiple items, which could be caused by interactions with second-order, object-based grouping, and which may be based on feedback mechanisms from for example V4 or higher-order temporal lobe structures back to V1 (Lamme & Roelfsema, 2000; Farzin, Rivera, & Whitney, 2009; Livne & Sagi, 2007; Louie, Bressler, & Whitney, 2007; Manassi, Sayim, & Herzog, 2012, 2013; Saarela, Sayim, Westheimer, & Herzog, 2009). For example, Sayim et al. (2010) have shown that discrimination of a vertical Vernier target suffers considerably less from vertical flankers when those flankers are part of a 3D cuboid structure. Our model currently cannot deal with such structures.
Finally, an alternative or additional factor that may affect human performance beyond grouping is attention (e.g. Felisberti, Solomon, & Morgan, 2005b; Strasburger, 2005; Yeshurun & Rashal, 2010). Specifically, if the target is the only deviating item in a homogeneous array of distractors, it is more likely to attract attention than when there are also other salient disruptions present in the field, which may themselves capture attention (e.g. Nothdurft, 1993; van Zoest, Donk, & Theeuwes, 2004). The likelihood of attention being inadvertently captured by distractors is further increased if these distractors resemble the target (Folk, Remington, & Johnston, 1992). Thus, breaking up groupings may have two effects: a direct effect on figure-ground segmentation, and an indirect effect by drawing attention away from the target. These two effects may be more intertwined than is suggested here: indeed, within Logan's (1996) version of the model, the segmentations of the CODE surface provide the objects of attention.

Neurophysiological correlates
As was the case for the original CODE model, our model describes grouping at an algorithmic level, and as such there is no direct relationship between model components and specific neural mechanisms. Nevertheless, the main aspects of the model come with plausible neural correlates. First of all, representing an item as a distribution rather than a point in space connects with population coding in neural populations (Pouget, Dayan, & Zemel, 2000). The width of these distributions then also represents the strength of the modulation that is exerted on neighbouring neurons, resulting in extra-classical receptive field interactions. Such interactions are likely caused by horizontal and/or short-range feedback connections, which moreover tend to be stronger for iso-orientation populations (e.g. Angelucci, Levitt, Walton, Hupe, Bullier, & Lund, 2002; Kapadia, Ito, Gilbert, & Westheimer, 1995; Knierim & Van Essen, 1992; Lamme, Super, & Spekreijse, 1998; Liang, Gong, Chen, Yan, Li, & Gilbert, 2017; Raizada & Grossberg, 2001; Stettler, Das, Bennett, & Gilbert, 2002), thus accounting for similarity-based interactions. While horizontal connections can be relatively long-range, their density drops with distance in what has been estimated as a Gaussian distribution (Buzás, Kovács, Ferecskó, Budd, Eysel, & Kisvárday, 2006). This may support the largely local character of the grouping that we assumed within our model. While grouping items may lead to a stronger combined percept of the cluster, this may operate at the expense of the perception of individual items (Parkes, Lund, Angelucci, Solomon, & Morgan, 2001). In line with other models of crowding that are more directly neurally inspired (Balas, Nakano, & Rosenholtz, 2009a; Dakin, Cass, Greenwood, & Bex, 2010; Van den Berg, Roerdink, & Cornelissen, 2010; see below), the integrated or pooled code then goes at the expense of the individual item code. Finally, the linear scaling of the distribution width with eccentricity is well supported by receptive field measurements in visual cortex (Wandell & Winawer, 2015).

Other formal models of crowding
Our model is by no means the first formal model of crowding. Greenwood, Bex and Dakin (2009) tested predictions derived from the formalization of two types of model. Both models are based on the increased positional uncertainty of feature representations in peripheral vision, which can cause perceptual overlap of features belonging to two different objects. One type of model computes a weighted average of the feature positions shared by targets and flankers (averaging model), while the other computes a probability of a target feature being substituted by a flanker feature (substitution model). These models were tested on data generated with three-element stimulus configurations, i.e. one target and two flankers, and the weighted averaging model prevailed over the substitution model. A later extension incorporated an elliptical interference zone that accounted for distance effects and anisotropies therein (Dakin et al., 2010). Averaging, or pooling, is also part of the population coding model of Van den Berg et al. (2010). This model assumes a spatial integration of feature signals, in populations of neurons. Through horizontal connections between simulated neurons, the model creates integration fields, within which features interfere with each other, resulting in a pooled signal. The model captures core characteristics of that averaging/pooling model such as the critical distance, the averaging of orientation features, and the asymmetry between foveally and peripherally placed flankers. It also allows for contour grouping effects. Both the averaging model and the population coding model have characteristics that are similar to our model, involving similar mathematical implementations. Most notably, the spatial uncertainty/integration in these models is expressed as spatial spread functions that are very similar to the one used in our model. However, while these previous models have the advantage of capturing 2D worlds and the asymmetries that come with them, they have only been tested with sparse displays. Moreover, they currently do not incorporate similarity (although this could be easy to implement). These are the two aspects where the strength of our model lies. Nevertheless, we foresee that these different types of models could be integrated to a considerable extent.
Another computational approach that assumes a pooling of signals is the Texture Tiling Model of Balas, Nakano, and Rosenholtz (2009a). This model assumes that a high-dimensional set of image properties (orientation, spatial frequency, contrast, etc.) are pooled by higher cortical layers into representations that capture a spatial summary statistic of the underlying features (or spatial filters, Balas, 2006; Freeman & Simoncelli, 2011; Freeman, Ziemba, Heeger, Simoncelli, & Movshon, 2013; Portilla & Simoncelli, 2000). This pooling is especially strong in the periphery. Crowding thus occurs because local features become "jumbled up" into a combined signal (which may be more complex than a simple spatial average, depending on how the signals are convolved). It is important to note that pooling is not necessarily the same as grouping (unless pooling occurs selectively for similar features), and the one study that has so far looked into this came to the conclusion that the Texture Tiling Model indeed does not reproduce the beneficial effects that grouping can have on crowding (i.e. uncrowding; Doerig et al., 2019). However, it can account for some feature similarity effects in visual search (Rosenholtz, Huang, Raj, Balas, & Ilie, 2012) and may thus also capture comparable effects in crowding. Conversely, although we have framed our model in terms of grouping, at a very general level its spatial gradients could also be conceived of as reflecting pooling, in which the Laplace probability distribution then reflects the chances of two features being pooled into a jumbled or swapped representation. In our model, this probability of signals being pooled would then be higher for similar than for dissimilar elements.
A recent model by Francis, Manassi, and Herzog (2017) does account for some of the similarity grouping effects treated here, most notably those in Study 3. It does this while similarity as such is not explicitly incorporated in their model. Rather, such grouping effects emerge from higher order boundary and surface detection mechanisms. Based on the LAMINART model (Cao & Grossberg, 2005; Raizada & Grossberg, 2001), this model is strongly informed by neurophysiological data, as it comprises a neural network simulating the interactions within different layers of V1 and V2, down to the single neuron level. Crucial to the boundary detection are bipole neurons, which integrate signals from similar neighbouring edges, akin to what Field, Hayes, and Hess (1993) have termed 'association fields'. This allows the model to detect boundaries and surfaces of groups of elements, and then, through attentional feedback mechanisms, segment these from the target. Subsequently, a template matching process allows for target identification. The Francis et al. (2017) model and our model have in common the assumption that grouping is an important cause of crowding (or the alleviation thereof). Moreover, in both models such grouping occurs through an excitatory lateral activation gradient.¹ However, there are also important differences. First, whereas Francis et al.'s version of the LAMINART model assumes a specific implementation of boundary grouping, our model is less specific as it assumes a more general similarity-based grouping. As indicated by Francis et al. (2017), similarity could be added to their model in a future version. Currently, the similarity effects they have simulated so far emerge from the boundary grouping, as the more regular (i.e. similar) the items are, the clearer their shared boundary. We believe these boundary or surface grouping effects are important and may well account for the additional grouping effects in Study 2, which were left unexplained by our model. Second, the difference in specificity also returns in the level of implementation. While our model remains largely at the descriptive level, and with three parameter estimates remains simple and tractable, the LAMINART-based model's explanatory core requires 38 parameter estimates (i.e. the interneuronal connection weights) plus a number of more peripheral parameters. Hence, we regard the two models as different types of potentially compatible models, and not as mutually exclusive.
Finally, Bornet, Doerig, Herzog, Francis and Van der Burg (2021) examined a range of pooling models (such as the Popcode model (Van den Berg et al., 2010), Texture Tiling model (Balas, Nakano, & Rosenholtz, 2009b), Bouma model (Bornet et al., 2021; Van der Burg, Olivers, & Cass, 2017), and a Convolutional Neural Network, or CNN (Doerig, Bornet, Choung, & Herzog, 2020; Krizhevsky, Sutskever, & Hinton, 2017)) as well as grouping models (LAMINART model (Francis, Manassi, & Herzog, 2017), Capsule network (Doerig, Schmittwilken, Sayim, Manassi, & Herzog, 2019; Francis, Manassi, & Herzog, 2017) and Popart model (Bornet et al., 2021)) to see which models can predict visual crowding performance in sparse and dense displays (similar to the displays used in the present study). For sparse displays, both grouping and pooling models (except the CNN) were able to capture human performance well (i.e., a similarity effect as well as distance effects were observed). For dense displays, Bornet and colleagues simulated human performance using the same genetic algorithm approach as in the present study and in the Van der Burg et al. (2017) study, but selecting the best displays according to each model's output. For dense displays, all grouping and two pooling models (Bouma and Popcode model) captured human performance, as the accuracy increased over generations. Interestingly though, the evolved displays for the pooling models were somewhat different from those evolved for human observers. Indeed, for the Bouma model and the Popcode model, all distractors within Bouma's region evolved towards a horizontal line, and for the Texture Tiling model and CNN classifier there was no systematic distractor evolution at all, as if the models' performance did not depend upon a systematic part of the display. In stark contrast, for the grouping models, the nearest neighbours evolved towards a horizontal structure, indicating that crowding performance was solely determined by the nearest neighbours' orientation. This not only explains why we and many other researchers reported that crowding varies as a function of target eccentricity in sparse displays, it also explains why we did not observe [...]

¹ [...] quite weak beyond the immediately neighbouring element. In essence our model is then scale-free. Francis et al. (2017) on the other hand refer to their model's grouping as long-range. However, we could not uncover the extent of the spatial grouping gradient (i.e. the bipole kernel) and how this related proportionally to the input stimuli. Thus it is possible that the lateral extent of the grouping in the models is quite comparable.

Fig. 2. Illustration of the model and how it responds to various stimulus patterns (as shown at the top of each graph). The black dashed lines show the feature activity distributions for the different orientations (the near vertical target, and vertical and horizontal flankers). The solid red line reflects the grouped CODE surface, which is simply the sum of the normalized individual element activity. Segmentation of the target from the group is then achieved by setting a threshold (for left and right separately; T_left and T_right) and the nearest point of inflection (saddle point) in the surface. Weak segmentation, and thus crowding, occurs when the unique target-related activity (above the threshold) is minor compared to the group activity (below the threshold). Comparing the activation landscapes in A) to B) shows the effect of distance on segmentation strength (i.e. a relatively smaller item activity/group activity ratio with smaller distance). Comparing A) to C) shows the effect of flanker similarity, with weaker segmentation for similar flankers. Panels F) and G) show that in dense arrays, flankers further away exert relatively little effect on target segmentation. Comparing H) to G) shows how one dissimilar flanker breaks up the grouping. Note that for the dissimilar items the combined activity landscape (red lines) largely overlaps with the individual item activity (black dashed lines), as these items show only very weak grouping. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Results of Experiment 1. A) Mean proportion correct as a function of the target-distractor distance for similar and dissimilar distractors. The target eccentricity is 4.7°. B) Mean proportion correct as a function of the target-distractor distance for each distractor orientation. The target eccentricity is 7.1°. The continuous lines reflect the performance according to the model. The vertical dashed lines represent the critical distance according to Bouma's law. Note that we fitted the model to the data in the large eccentricity condition only, and we used the optimal parameters for the large eccentricity condition to derive the model performance for the small eccentricity condition. The error bars reflect the SEM.
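
As a worked example of the dashed lines in this figure, Bouma's law places the critical distance at half the target eccentricity. The short Python sketch below computes it for the two eccentricities used; the function name and the `bouma_fraction` parameter are illustrative only.

```python
def bouma_critical_distance(eccentricity_deg, bouma_fraction=0.5):
    """Critical distance (deg) under Bouma's law: a fixed fraction
    (here one half) of the target eccentricity."""
    return bouma_fraction * eccentricity_deg

if __name__ == "__main__":
    for ecc in (4.7, 7.1):
        print(f"eccentricity {ecc} deg: critical distance "
              f"{bouma_critical_distance(ecc):.2f} deg")
```

For the two conditions this gives critical distances of 2.35° and 3.55°, respectively.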

Fig. 4. Results of Experiment 1b. A) Behavioural performance as a function of generation for each eccentricity condition. B) Evolution of the display for the behavioural experiment in generation 5. Here the mean proportion of verticals is shown for each distractor location and target eccentricity. The white square illustrates the target location. C) Mean proportion of verticals in generation 5 as a function of the target-distractor distance for both target eccentricities. The target-distractor distance was binned in bins of 1 degree. Here, *** indicates significantly different from the baseline of 0.36 (the initial proportion of verticals). D) Model performance as a function of generation for each target eccentricity condition. Here, the green circles signify the 4.7° target eccentricity condition. The purple squares signify the 7.1° target eccentricity condition. E) Evolution of the display according to the model. Here the mean proportion of vertical distractors is shown as a function of the target-distractor distance for generation 5 (i.e., the last generation). Note that 0° signifies the target location. The heatmap on top of the graph illustrates the proportion of vertical distractors for each location as well. The vertical dashed lines represent predictions for critical distance as made from Bouma's law. Note that the error bars represent the SEM. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
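
The generation-by-generation evolution of displays summarized in this figure can be illustrated with a generic genetic algorithm loop. The Python sketch below is hypothetical throughout: the population size, selection rule, mutation rate, display size, and in particular the toy scoring function (a stand-in for observer or model accuracy) are assumptions and not the procedure used in the experiments. Only the initial proportion of verticals (0.36) is taken from the caption.

```python
import numpy as np

def toy_score(display, neighbour_idx):
    """Hypothetical stand-in for accuracy: fewer vertical (target-similar)
    nearest neighbours give a higher score."""
    return 1.0 - display[neighbour_idx].mean()

def evolve(n_displays=20, n_items=30, n_generations=5, n_keep=5,
           p_vertical=0.36, p_mutate=0.05, neighbour_idx=(0, 1, 2, 3), seed=0):
    """Evolve binary displays (1 = vertical, 0 = horizontal distractor)."""
    rng = np.random.default_rng(seed)
    idx = np.asarray(neighbour_idx)
    population = (rng.random((n_displays, n_items)) < p_vertical).astype(int)
    best_per_generation = []
    for _ in range(n_generations):
        scores = np.array([toy_score(d, idx) for d in population])
        best_per_generation.append(scores.max())
        # Selection: keep the best-scoring displays unchanged as parents.
        parents = population[np.argsort(scores)[::-1][:n_keep]]
        # Offspring: copies of random parents with a few orientations flipped.
        picks = rng.integers(0, n_keep, n_displays - n_keep)
        children = parents[picks].copy()
        flips = rng.random(children.shape) < p_mutate
        children[flips] = 1 - children[flips]
        population = np.vstack([parents, children])
    final_scores = np.array([toy_score(d, idx) for d in population])
    best_per_generation.append(final_scores.max())
    return population, best_per_generation

if __name__ == "__main__":
    population, best = evolve()
    print(best)
    print(population[:, :4].mean())  # proportion of verticals among neighbours
```

Because the best displays are carried over unchanged, the best score can never decrease from one generation to the next; replacing `toy_score` with the output of an observer or a model is what distinguishes the behavioural and simulated versions of such a procedure.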

Fig. 5. Results of Experiment 2a. Here the mean proportion correct is plotted as a function of target-distractor distance for dissimilar (circles) and similar distractors (squares). The continuous lines represent the model performance. The dashed line is the critical distance as derived from Bouma's law. The error bars represent the SEM.

Fig. 7. Results of Study 3. A) Here the mean proportion correct is plotted as a function of number of neighbouring vertical distractor elements. B) Activation landscapes as a function of number of distractors. Note that these activity distributions look different from Fig. 2 (for the same stimulus configurations) due to the closer inter-element spacing here.
