Disentangling neural mechanisms for perceptual grouping

Forming perceptual groups and individuating objects in visual scenes is an essential step towards visual intelligence. This ability is thought to arise in the brain from computations implemented by bottom-up, horizontal, and top-down connections between neurons. However, the relative contributions of these connections to perceptual grouping are poorly understood. We address this question by systematically evaluating neural network architectures featuring combinations of these connections on two synthetic visual tasks, which stress low-level "Gestalt" vs. high-level object cues for perceptual grouping. We show that increasing the difficulty of either task strains learning for networks that rely solely on bottom-up processing. Horizontal connections resolve this limitation on tasks with Gestalt cues by supporting incremental spatial propagation of activities, whereas top-down connections rescue learning on tasks featuring object cues by propagating coarse predictions about the position of the target object. Our findings dissociate the computational roles of bottom-up, horizontal and top-down connectivity, and demonstrate how a model featuring all of these interactions can more flexibly learn to form perceptual groups.


Introduction
The ability to form perceptual groups and segment scenes into a discrete set of object-based representations constitutes a fundamental component of visual intelligence. Decades of research in biological vision have suggested a coarse dichotomy between perceptual grouping tasks that can be solved by feedforward (or bottom-up) processes vs. those that require feedback (or recurrent) processes [1][2][3]. Feedforward grouping processes information sequentially by building up increasingly complex feature conjunctions via a cascade of filtering, rectification and normalization operations. As shown in Fig. 1a, this visual strategy is sometimes sufficient to detect and localize objects, such as when a scene has little or no background clutter, or when an object pops out (e.g., because of its color or contrast; see [4]). However, as illustrated in Fig. 1b, visual scenes are typically complex, with objects interposed in background clutter. In this case, feedforward processes alone are typically insufficient for perceptual grouping [5][6][7][8], and it has been suggested that our visual system leverages feedback mechanisms to refine an initially coarse scene segmentation [2][9][10][11].
Figure 1: "Are both dots on the same object?" (a) Our visual system can segregate an object from its background with rapid, bottom-up mechanisms when the object is dissimilar to its background. In this case, the object is said to "pop out" from its background, making it trivial to judge whether the two dots are on the same object surface or not. (b) Perceptual grouping mechanisms help segment a visual scene. (b, Top) The elements of a behaviorally relevant perceptual object, such as a path formed by tracks in the snow, are transitively grouped together according to low-level "Gestalt" principles. This allows the observer to trace a path from one end to the other. (b, Bottom) Alternatively, observers may rely on prior knowledge or semantic cues to segment an object from a cluttered background. Such object-based visual strategies are disrupted when objects are presented in an atypical pose. (c, Top) The Pathfinder challenge (reproduced with permission from [12]) involves answering a simple question: "Are the two dots connected by a path?" This task can be easily solved with Gestalt grouping strategies. (c, Bottom) Here, we introduce a novel "cluttered ABC" (cABC) challenge. The challenge asks the same question on visual stimuli for which Gestalt strategies are ineffective. Instead, cABC taps into object-based strategies for perceptual grouping.
about the shape and structure of a perceptual object to organize a visual scene by refining an initial, coarse feedforward analysis with high-level hypotheses about the objects they depict [20][21][22][23]. Both strategies are iterative and thus rely on recurrent feedback computations.
What are the neural circuits that implement Gestalt vs. object-based strategies for perceptual grouping? Visual neuroscience studies have suggested that these strategies emerge from specific types of neural interactions: (i) horizontal connections between neurons within an area, spanning spatial locations and potentially feature selectivities [24][25][26][27][28][29], and (ii) descending top-down connections extending from neurons in higher-to-lower areas [30][31][32][33][34][35]. The anatomical and functional properties of these feedback connections have been well-documented (see [31] for a review), but the relative contributions of horizontal vs. top-down connections for perceptual grouping remain an open question.
Contributions Here, we investigate the distinctive function of horizontal and top-down feedback connections for perceptual grouping. We have developed a framework for systematically investigating the effectiveness of bottom-up, horizontal, and top-down connections for perceptual grouping tasks that are designed to be solved by either Gestalt or object-based visual strategies. We disentangle the relative contributions of these connections by training and testing deep recurrent neural network (RNN) architectures on synthetic visual tests of perceptual grouping. By lesioning different connections in the RNN model and parametrically increasing task difficulty, we interrogate horizontal vs. top-down connection contributions to perceptual grouping. Our main contributions can be summarized as follows:
• We introduce the cluttered ABC challenge (cABC), a synthetic visual reasoning challenge to study object-based grouping strategies (Fig. 1c, bottom panel). We pair this dataset with the Pathfinder challenge ([12], Fig. 1c, top panel), which forces network models to leverage low-level Gestalt cues. We use these tasks to systematically investigate object-based vs. Gestalt strategies for perceptual grouping.
• We demonstrate that horizontal connections are key for grouping objects without semantic cues, whereas top-down connections support learning on tasks where low-level Gestalt cues are uninformative. Of the network models tested (which included ResNets), only our proposed RNN model with the full array of connections and a very wide and deep U-Net [36] were able to solve both tasks across levels of difficulty.
• We compare model predictions against human psychophysics data on the same visual tasks and show that human judgements are significantly more consistent with image predictions derived from our proposed RNN model, which includes bottom-up, horizontal, and top-down connections, than with those derived from the wide and deep U-Net. This indicates that neural network models can learn grouping strategies through multiple mechanisms, but that the strategies used by the human visual system are best matched by our highly recurrent network model.

Related Work
Recurrent neural networks Recurrent neural networks (RNNs) are a class of models that can be trained with gradient descent to approximate discrete-time dynamical systems. RNNs are classically featured in sequence learning, but they have also begun to show promise in computer vision tasks. One example of this is autoregressive RNNs that learn the statistical relationship between neighboring pixels for image generation [37]. Another approach, which has been successfully applied to object recognition and super-resolution tasks, is to incorporate convolutional kernels into RNNs [38][39][40][41][42]. These convolutional RNNs share kernels across processing timesteps, allowing them to achieve greater processing depth with a fraction of the parameters needed for a CNN of equivalent depth. Because it is thought that the visual system also uses recurrent processes to perform more complex computations beyond rapid categorization [43], multiple groups have used such convolutional RNNs as a foundation for incorporating biologically-inspired feedback mechanisms [12][44][45][46][47][48].
Here, we combine convolutional-RNN models with the feedback gated recurrent unit (fGRU) proposed by Linsley et al. [49]. The fGRU extends the horizontal gated recurrent unit (hGRU, [12]), to model (i) horizontal connections between units in a processing layer separated by spatial location and/or feature channel, and (ii) top-down connections extending from units in higher-to-lower network layers. The fGRU [49] was applied to contour detection in natural and electron microscopy images, where it performed on par with or better than state-of-the-art feedforward architectures, while also showing far better sample efficiency. We leverage a modified version of this architecture to systematically study the relative contributions of horizontal and top-down interactions for perceptual grouping.
Synthetic visual tasks There is a long history of using synthetic visual recognition challenges for systematically evaluating computer vision algorithms [50,51]. For instance, there are several studies that have increased visual classification difficulty by adding clutter to images [52][53][54] and/or parametrically controlling intra-class image variability [12,48,55,56]. Similar to the current study, synthetic object categorization challenges have also been used to study how feedback mechanisms can help recognition under various degraded conditions [38,45].
In the current study, we begin with the "Pathfinder" challenge [12], which consists of images depicting two white markers that may, or may not, be connected by a long path (Fig. 2). The challenge was used to demonstrate how a shallow network with horizontal connections learned to efficiently recognize a connected path by adopting an incremental grouping strategy, whereas feedforward architectures (CNNs) with orders of magnitude more free parameters struggled to learn this task. Our novel "cluttered ABC" (cABC) challenge complements the Pathfinder challenge by posing a grouping task based on high-level object cues, which cannot be solved using the same local grouping strategy (Fig. 2).

Synthetic perceptual grouping challenges
Both the Pathfinder and the cABC challenge are characterized by the following common features: white shapes are placed on a black background along with two circular and white "markers" (Fig. 2). These markers are either placed on two different shapes or the same shape, and the task posed in both datasets is to discriminate between these two alternatives. The two challenges differ in the shapes they depict. Images in the Pathfinder challenge contain two flexible curves made of co-linear dashes, whereas the cABC challenge uses overlapping English-alphabet letters that are transformed in appearance and position.
Figure 2: An overview of the two synthetic visual reasoning challenges used in this study: the "Pathfinder" (top two rows) and the "cluttered ABC" (cABC, bottom two rows). On both tasks, a model must judge whether the two white markers fall on the same or different paths/letters. For both, we parametrically control task difficulty by adjusting intra-class image variability in a dataset. In the Pathfinder challenge, we increase the length of the target curve; in cABC, we decrease correlation between geometric transformations applied to the two letters while also increasing the relative positional variability between letters.
Local vs. object-level cues for perceptual grouping Differences in the types of shapes used in the two challenges make them ideally suited for different feedback grouping strategies. The Pathfinder challenge features smooth and flexible curves, making it well suited for an incremental Gestalt-based grouping strategy [12]. Conversely, the English-alphabet letters used in the cABC challenge make it a good match for a grouping strategy that relies on object-level, semantic cues.
The two synthetic challenges are designed to yield a high computational burden in models that rely on sub-optimal grouping strategies. For instance, cluttered, flexible curves in the Pathfinder challenge are well-characterized by local continuity but have no set "global shape". This makes the task difficult for feedforward neural network models like ResNets because the number of fixed templates that the network needs to learn to solve it increases with the number of possible shapes the flexible paths may take. In contrast, letters in cABC images are globally stable but locally degenerate (due to pixelation), and letters may overlap with each other. Because the strokes of overlapping letters form arbitrary conjunctions, a feedforward network needs to disambiguate spurious conjunctions from intersections and corners in real letters by learning as many templates as required to detect all possible real vs. false conjunctions of letter strokes. The local arbitrariness of strokes in the cABC images also makes a Gestalt-based feedback strategy unreliable, emphasizing the challenge's suitability for a semantically-grounded grouping strategy.
Overall, we consider these challenges two extremes along a continuum of perceptual grouping tasks found in nature; on one hand, the Pathfinder curves provide reliable gestalt cues without semantic cues. On the other hand, cABC letters are semantic but do not permit local grouping cues. As we describe in the following sections, human observers easily solve both challenges with minimal training, demonstrating the ability of biological vision to flexibly switch between different perceptual grouping strategies depending on the availability of grouping cues.
Parameterization of image variability It is difficult to tease apart the relative contributions of different visual strategies using standard computer vision datasets because an architecture's performance may be confounded by different factors including dataset biases, model hyperparameters, and/or the number of samples available for training in comparison to the network's capacity. Here, we overcome this limitation by systematically varying task difficulty in each challenge with three separate datasets (easy, intermediate and hard) characterized by increasing levels of intra-class variability. We train and test individual network architectures on each of the three datasets separately and evaluate how accuracy changes when the need for stronger generalization capabilities arises. This method, called "straining" [12,55], is an effective way to dissociate an architecture's expressiveness from other incidental factors tied to a particular choice of image domain. Architectures with appropriate inductive biases are less likely to exhibit a decrease in accuracy as intra-class variability increases compared to architectures that rely on "rote memorization" [57].
Difficulty in the Pathfinder challenge is parameterized by the length of target curves in the images (6-, 9-, or 14-length; [12]). Image variability in cABC is controlled in two ways: first, by increasing the number of possible relative positional arrangements between the two letters; second, by decreasing the correlation between random transformation parameters (rotation, scale and shear) applied to each letter in an image. For instance, on the "Easy" difficulty dataset, letter transformations are correlated: the centers of two letters are displaced by a distance uniformly sampled in an interval between 25 and 30 pixels, and the same affine transformation is applied to both letters. In contrast, on the "Hard" difficulty dataset, letter transformations are independent: letters are randomly separated by 15 to 40 pixels, and random affine transformations are independently sampled and applied to each. We balance the variation of these random transformation parameters across the three difficulty levels such that the total variability of individual letter shape remains constant (i.e. the number of transformations applied to each letter in a dataset). Harder cABC datasets lead to increasingly varied conjunctions between the strokes of overlapping letters, making them well suited for models that can iteratively process semantic groups instead of learning fixed feature templates. See SI for more details about the cABC image generation procedure.

Network architectures
Here we test the role of horizontal vs. top-down connections for perceptual grouping using a single recurrent CNN architecture (Fig. 3). This model is equipped with recurrent modules called feedback gated recurrent units (fGRU, [12,49]), which can implement both top-down and horizontal feedback (Fig. 3). By selectively disabling top-down connections, horizontal connections, or both types of feedback, we directly compare the relative contributions of these connections for solving Pathfinder and cABC tasks over a basic deep convolutional "backbone" capable of only feedforward computations. As an additional reference, we test representative feedforward architectures designed for computer vision: residual networks [58] and U-Net variants [36,59].
The fGRU module The fGRU is a recurrent computational module that can model either horizontal or top-down feedback. The fGRU takes input from two unit populations: an external drive, X, and an internal state, H. The fGRU integrates these two inputs to get an updated state for a given processing timestep, H[t]. An fGRU module can implement horizontal feedback by learning connections between units in X, and updating its internal state H[t] with these interactions. Alternatively, the fGRU can serve as a module for top-down feedback if its state H originates from activities in higher layers. Consider the hidden states in two fGRU modules in different layers, denoted by l. In this case, the fGRU can learn top-down connections extending from H^(l+1)[t] to H^(l)[t] using a local (1 × 1) kernel. The resulting H^(l)[t] captures horizontal connections at layer l that have been inhibited and excited by horizontal connections in the hidden state at layer l + 1. Additional details regarding the fGRU module are in the SI.
Recurrent networks with feedback Our full model architecture, which we call TD+H-CNN (Fig. S6a), introduces three fGRU modules into a CNN to implement both top-down (TD) and horizontal (H) feedback. The architecture consists of a downsampling pathway and an upsampling pathway as in the U-Net [36,59], which allows it to implement top-down interactions. It also contains three fGRU modules: the first two learn horizontal connections at a low- and a high-level feature processing layer. The third fGRU module learns top-down interactions between recurrent populations in the first two fGRUs.
The model processes an image over multiple timesteps, after which its low-level persistent activity is sent to a readout module to compute a binary category prediction. We define three variants of the TD+H-CNN by lesioning its horizontal and/or top-down fGRU modules: a top-down-only architecture (TD-CNN), a horizontal-only architecture (H-CNN), and a feedforward-only architecture (BU-CNN).
Reference networks We additionally measure the performance of popular neural network architectures on Pathfinder and cABC. These consist of three Residual Networks (ResNets) [58] with 18, 50, and 152 layers and two variants of a U-Net [36,59]: a deep/wide version we refer to as "Big U-Net" that uses a VGG16 encoder, and a shallow/narrow version we refer to as "Small U-Net" (taken from [59]). Both types of CNNs utilize skip connections to mix information between processing layers. Notably, the U-Nets use skip connections to pass activities between layers of a downsampling encoder and an upsampling decoder. This pattern of residual connections effectively makes the U-Net equivalent to a network with top-down feedback that is simulated for one timestep.

Experiments
Training We tested models on the Pathfinder challenge as in Linsley et al. [12], training them on 900K images and testing them on a held-out set of 25K images. For the cABC challenge, we trained models on 45K images and tested them on a held-out set of 5K images. We trained models for the same number of total iterations (45K) on both challenges. We find that ResNet-50 achieves a similar pattern of performance on both challenges, suggesting that both ramp up difficulty at a similar rate. Each model was trained with a batch size of 32 using the Adam optimizer and learning rates of 1e-3 for Pathfinder ([12]) and 1e-4 for cABC. We train each model five times with random weight initializations. Validation accuracy was sampled every 2000 iterations of training, and we report each model's best performance across training.
Screening feedback computations Our study revealed three main findings. First, we found that human participants performed well on both the Pathfinder and the cABC challenges with no significant drop in accuracy as difficulty increased. Second, the pattern of accuracies from our recurrent architectures reveals complementary contributions of horizontal vs. top-down feedback. The H-CNN solved the Pathfinder challenge with minimal straining as difficulty increased, but it could not solve the easy cABC challenge. On the other hand, the TD-CNN performed well on the cABC challenge but struggled on the Pathfinder challenge as difficulty increased. The BU-CNN (along with ResNets) struggled on both challenges, demonstrating the importance of feedback mechanisms for perceptual grouping. Together, these results suggest that top-down interactions (from higher-to-lower layers) help process object-level grouping cues, whereas horizontal feedback (between units in different feature columns/spatial locations in a layer) helps process local Gestalt grouping cues.
We repeated these experiments on two controlled variants of the cABC challenge: one where two letters are prevented from touching or overlapping (position control) and another where two letters are rendered in different pixel intensities (luminance control). No significant straining was found on these tasks for any of our networks, confirming our initial hypothesis that local ambiguities and mutual occlusion between objects make top-down feedback important for grouping on cABC images. We also repeated our experiment on 1-timestep variants of our recurrent architectures and found that these architectures struggled on both challenges, suggesting the importance of recurrence for both grouping strategies (SI, Fig. S5).

Modeling biological feedback Of the models we tested, only the TD+H-CNN and Big U-Net efficiently solved both Pathfinder and cABC at all levels of difficulty. To what extent do the visual strategies learned by these models match those used by human observers for solving these tasks? We investigated this with a large-scale psychophysics study, in which human participants categorized exemplars from Pathfinder or cABC that were also viewed by the models (see SI for experiment details).
We recruited 648 participants (324 per challenge) on Amazon Mechanical Turk to complete a web-based psychophysics experiment. Participants were given between 800 and 1300 ms to categorize each image in Pathfinder, and between 800 and 1600 ms to categorize each image in cABC. Each participant viewed a subset of the images in a challenge, which spanned all three levels of difficulty. The experiment took approximately 2 minutes, and no participant viewed images from more than one challenge (see SI for details).
This experiment provided us with 20 responses for every image in the Pathfinder and cABC challenge. We represented human performance on each image as average human accuracy converted into a logit score. We correlated the vector of human logits for a hard difficulty dataset with model logits on the same images. We found that both the TD+H-CNN and the TD-CNN were significantly more correlated with human decisions on cABC images than the U-Net (Fig. 5a; TD > U-Net: p < 0.05, TD+H > U-Net: p < 0.01; p-values derived from a bootstrap test [60]). Although all of these models use feedback and solved all visual challenges, this result indicates that our highly recurrent models better capture the visual routines of human participants. This finding was not an effect of model accuracy, as partial correlations which controlled for this factor showed the same pattern of results. We observed similar results on Pathfinder: activities from the TD+H-CNN and H-CNN were significantly more correlated with human decisions than the U-Net (Fig. 5b; H > U-Net: p < 0.01, TD+H > U-Net: p < 0.05).
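The comparison between human and model decisions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the clipping of accuracies before the logit transform is an assumption made here to keep the logits finite, and the resampling scheme is one common form of bootstrap test rather than necessarily the procedure of [60].

```python
import math
import random


def accuracy_to_logit(acc, n_responses=20):
    """Convert mean accuracy on one image into a logit score.

    Assumption: accuracies of exactly 0 or 1 are clipped by half a
    response so that the logit stays finite.
    """
    eps = 1.0 / (2 * n_responses)
    p = min(max(acc, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def bootstrap_p(human, model_a, model_b, n_boot=1000, seed=0):
    """One-sided p-value for corr(human, A) > corr(human, B).

    Images are resampled with replacement; the p-value is the fraction of
    resamples in which model A is NOT more correlated with humans than B.
    """
    rng = random.Random(seed)
    n = len(human)
    not_greater = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [model_a[i] for i in idx]
        b = [model_b[i] for i in idx]
        if pearson(h, a) <= pearson(h, b):
            not_greater += 1
    return not_greater / n_boot
```

In this sketch, `human` would hold one logit per image obtained with `accuracy_to_logit`, and `model_a` and `model_b` would hold the corresponding logits from two models evaluated on the same images.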

Discussion
Perceptual grouping is essential for reasoning about the visual world. Although it is known that bottom-up, horizontal and top-down interactions contribute to perceptual grouping, their relative contributions are not well understood. We directly tested a long-held theory related to the role of horizontal vs. top-down connections for perceptual grouping by screening neural network architectures on controlled synthetic visual tasks. Without specifying any role for feedback connections a priori, we found a dissociation between horizontal vs. top-down feedback connections which emerged from training network architectures for classification. Our study provides direct computational evidence for the distinct roles played by these cortical mechanisms.
Our study also demonstrates a clear limitation of network models that rely solely on feedforward processing, including ResNets of arbitrary depths, which are strained by perceptual grouping tasks that involve cluttered visual stimuli. Notably, we found the U-Net is better suited for learning perceptual grouping than the other "standard" computer vision architectures tested. We attribute this relative success to the U-Net's "encoding" and "decoding" pathways, which effectively approximate a single loop of feedforward and feedback processing.
We also found that the highly-recurrent feedback mechanisms used by our RNN models make decisions on Pathfinder and cABC images that are significantly more similar to those of human observers than a U-Net. Our study thus adds to a growing body of literature [12,48,49,61,62] which suggests that recurrent circuits are necessary to explain complex visual recognition processes. We will release our code and datasets upon publication to encourage progress in modeling perceptual grouping in biological vision.

Supplementary Material
Cluttered ABC The goal of the cluttered ABC challenge is to test the ability of models and humans to make perceptual grouping judgements based solely on object-level semantic cues. Each image in the challenge dataset contains a pair of letters sufficiently close to each other to ensure frequent overlap. To further minimize local image cues which might permit trivial solutions to the problem, we use fonts as letter images that contain uniform and regular strokes with no distinct decorations. Two publicly available fonts are chosen from the website https://www.dafont.com: "Futurist fixed-width" and "Instruction".
Random transformations including affine transformations, geometric distortions, and pixelation are applied before placing each letter on the image to further eliminate local category cues. Positioning and transformations of letters are applied randomly, and their variances are used to control image variability in each difficulty level of the challenge.
Letter transformations Although each letter in the cABC challenge is sampled from just two different fonts, we ensure that each letter appears in sufficiently varied forms by applying random linear transformations to each letter. We use three different linear transformations in our challenge: rotation, scaling, and shearing. Each type of transformation is applied to each letter twice, using three different sets of parameters per image: two sets that are sampled and applied separately to each letter, and a third that is sampled once and applied to both letters. We call the first type of transformation parameters "letter-wise" and the second "common". For example, rotations applied to the first and the second letter are each described by the sum of letter-wise and common rotational parameters, φ_1 + φ_c and φ_2 + φ_c, respectively. Scaling is described by the product of letter-wise and common scale parameters, S_1 S_c and S_2 S_c. Shearing is described by the sum of letter-wise and common shear parameters, E_1 + E_c and E_2 + E_c. Each shear parameter specifies the value of the off-diagonal element of the shear matrix applied to each letter image. Either vertical or horizontal shear is applied to both letters in each image, with the shear axis randomly chosen with equal probability. The random transformation procedure is visually detailed in Fig. S1.
We decompose each transformation into letter-wise and common components to independently increase variability at the level of an image while keeping letter-wise variability constant. This is achieved by gradually increasing the variance of the letter-wise transformation parameters while decreasing the variance of the common parameters. See Table S1 for a summary of the distributions of random transformation parameters used in the three levels of difficulty.
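The decomposition into letter-wise and common parameters can be illustrated with a short Python sketch. It is not the generation code used for cABC: the function names, the Gaussian and log-normal sampling with spreads `sigma_letter` and `sigma_common`, the use of radians, and the order in which rotation, shear and scale are composed are all assumptions made for illustration (the actual distributions are those summarized in Table S1).

```python
import math
import random


def sample_letter_params(sigma_letter, sigma_common, rng):
    """Sample rotation, scale and shear for the two letters of one image.

    sigma_letter and sigma_common are hypothetical spreads for the
    letter-wise and common parameters, respectively.
    """
    # Common parameters: sampled once, shared by both letters.
    phi_c = rng.gauss(0.0, sigma_common)
    s_c = math.exp(rng.gauss(0.0, sigma_common))
    e_c = rng.gauss(0.0, sigma_common)
    # One shear axis per image, chosen with equal probability.
    axis = rng.choice(["horizontal", "vertical"])
    letters = []
    for _ in range(2):
        # Letter-wise parameters: sampled separately for each letter.
        phi_i = rng.gauss(0.0, sigma_letter)
        s_i = math.exp(rng.gauss(0.0, sigma_letter))
        e_i = rng.gauss(0.0, sigma_letter)
        letters.append({
            "rotation": phi_i + phi_c,  # sum of letter-wise and common
            "scale": s_i * s_c,         # product of letter-wise and common
            "shear": e_i + e_c,         # sum of letter-wise and common
            "shear_axis": axis,
        })
    return letters


def affine_matrix(p):
    """Build a 2x2 linear map as scale * (rotation @ shear) (assumed order)."""
    c, s = math.cos(p["rotation"]), math.sin(p["rotation"])
    rot = [[c, -s], [s, c]]
    if p["shear_axis"] == "horizontal":
        shear = [[1.0, p["shear"]], [0.0, 1.0]]
    else:
        shear = [[1.0, 0.0], [p["shear"], 1.0]]
    k = p["scale"]
    return [
        [k * sum(rot[i][m] * shear[m][j] for m in range(2)) for j in range(2)]
        for i in range(2)
    ]
```

With `sigma_letter = 0` both letters receive the same transformation, as on the easy dataset; raising `sigma_letter` while lowering `sigma_common` decorrelates the two letters. Under the Gaussian assumption used here, holding `sigma_letter**2 + sigma_common**2` fixed keeps the variance of each letter's summed rotation and shear parameters constant.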
We apply additional, nonlinear transformations to letter images. First, geometric warping is applied independently to each affine-transformed letter image by computing a "warp template" for each image. A warp template is an image with the same dimensions as each letter image, consisting of 10 randomly placed Gaussians with a sigma (width) of 20 pixels. Each pixel in a letter image is translated by a displacement proportional to the gradient of the warp template in the corresponding location. This process is depicted in Fig. S2. Second, the resulting letter image is "pixelated" by first partitioning the letter image into a grid of 5×5-pixel cells. Each cell is filled with 255s (white) if and only if more than 30% of its pixels are white, resulting in a binary letter image at 1/5 of the original resolution. The squares then undergo random translation with displacement independently sampled for each axis. We use a normal distribution with a standard deviation of 2 pixels, truncated at two standard deviations.
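The pixelation and jitter steps can be sketched as follows. This is a minimal illustration rather than the original implementation: the function names are hypothetical, images are represented as nested lists of 0/255 values, incomplete cells at the image border are dropped, and the truncated normal is drawn by rejection sampling, all of which are assumptions made here.

```python
import random


def pixelate(image, cell=5, threshold=0.30):
    """Reduce a binary image (0 or 255) to one pixel per cell x cell block.

    An output pixel is white (255) only if more than `threshold` of the
    pixels in its block are white. Partial blocks at the border are dropped.
    """
    height, width = len(image), len(image[0])
    out = []
    for top in range(0, height - height % cell, cell):
        row = []
        for left in range(0, width - width % cell, cell):
            white = sum(
                1
                for y in range(top, top + cell)
                for x in range(left, left + cell)
                if image[y][x] > 0
            )
            row.append(255 if white > threshold * cell * cell else 0)
        out.append(row)
    return out


def sample_jitter(rng, sigma=2.0, max_sigmas=2.0):
    """Draw one displacement from a normal distribution truncated at
    +/- max_sigmas standard deviations, using rejection sampling."""
    while True:
        d = rng.gauss(0.0, sigma)
        if abs(d) <= max_sigmas * sigma:
            return d
```

With the default 5×5 cells, the 30% threshold corresponds to 7.5 pixels, so a cell becomes white when at least 8 of its 25 pixels are white. A separate call to `sample_jitter` per axis gives the independent horizontal and vertical displacement of each square.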
Letter positioning We sample the positions of letters by first defining an invisible circle of radius r placed at the center of the image. Two additional random variables, θ and ∆θ, are sampled to specify two angular positions of the letters on the circle, θ and θ + ∆θ. r, θ and ∆θ are sampled from uniform distributions whose ranges increase with difficulty level (Table S1). By sampling these parameters from increasingly larger intervals, we increase the relative positional variability of letters in harder datasets. In easy difficulty, we choose r ∼ U (50,60

Additional experiments
Control cABC challenges To further validate our implementation of the cABC, we added two control cABC challenges to our study (Fig. S3). In the "luminance control" challenge, the two letters are rendered with a pair of uniformly sampled random pixel intensity values that are greater than 128 but always differ by at least 40 between the two letters. In the "positional control" challenge, the letters are disallowed from touching or overlapping. Relative to the original challenge, the two controls provide additional local grouping cues with which to determine the extent of each letter. The luminance control challenge provides an additional pixel-level cue with which a model can infer the image category by comparing the values of the letter pixels surrounding the two markers. The positional control challenge provides an unambiguous local gestalt cue with which to solve the task. Ensuring that the two letters are spatially separated has the additional effect of minimizing the amount of clutter that interferes with feature extraction for each letter.

Table S1: Distributions of transformation parameters used in the three difficulty levels of the cluttered ABC challenge. U (a, b) denotes a uniform distribution with range [a, b]; N (µ, σ) denotes a normal distribution with mean µ and standard deviation σ; logN (z, µ, σ) denotes a log-normal distribution with base z, mean µ and standard deviation σ.

Figure S3: Example images of the two control cABC challenges. In luminance control, the two letters are rendered in different, randomly sampled pixel intensity values. In positional control, the two letters are always rendered without touching or overlapping each other.

Figure S4: Example images of the segmentation variant of the two challenges (each panel shows an input and the corresponding output). Here, a model is tasked with producing as output an image that contains only the object marked in the input image.
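As an illustration of the luminance control constraint, the sketch below draws a pair of intensities by rejection sampling. It assumes integer intensities in the range 129 to 255 (that is, "greater than 128" on an 8-bit scale); the function name `sample_luminance_pair` is hypothetical, and the sampling scheme actually used may differ.

```python
import numpy as np

def sample_luminance_pair(rng, low=129, high=255, min_gap=40):
    """Rejection-sample two uniform integer intensities in [low, high]
    that differ by at least `min_gap`."""
    while True:
        a, b = rng.integers(low, high + 1, size=2)
        if abs(int(a) - int(b)) >= min_gap:
            return int(a), int(b)
```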
In both control challenges, we found no straining effect in any of the models we tested. This strengthens the assumption behind the design of the cABC that the only way to efficiently solve the default cABC challenge is to employ an object-based grouping strategy. Because all networks successfully solved the control challenges, the relative difficulty suffered by the H-CNN or the reference feedforward networks on the original cABC could not have been caused by idiosyncrasies of the image features we used, such as the average size of letters. These networks lack the mechanisms to separate mutually occluding letter strokes according to high-level hypotheses about the categories of letters present in each image.
Segmentation challenges To further understand and visualize the nature of the computations employed by our recurrent architectures when solving the Pathfinder and cABC challenges, we construct a variant of each dataset in which we pose a segmentation problem. Here, we modify each dataset by placing only one marker per image instead of two. The task is to output a per-pixel prediction of the object tagged by the marker (Figure S4). We train each model using 10 thousand images of the cABC challenge and 40 thousand images of the Pathfinder challenge, and validate using 400 images in both challenges. We only use the hard difficulty in each challenge. We use only the TD-CNN and H-CNN in this experiment. We replace the readout module in each of our recurrent architectures with two 1×1 convolution layers, with bias and rectification after each layer. We also include "reference" recurrent architectures in which the fGRU modules have been replaced by either LSTM [63] or GRU [64]. Because the U-Net is a dense prediction architecture by design, we simply remove the readout block that we used in the main categorization challenge.

Figure S5: Top-down feedback is critical when local grouping cues are unavailable. We verify this by constructing two controlled versions of the cABC challenge, where local Gestalt grouping cues are provided in the form of spatial segregation between two letters (a), or in the form of luminance segregation (b). In both cases, the architectures that lack a top-down feedback mechanism can solve the challenge at all levels of difficulty.
We found that the target curves in the Pathfinder challenge are best segmented by the H-CNN (0.79 F1 score), followed by the H-CNN with GRU (0.72) and the H-CNN with LSTM (0.69). No top-down-only architecture was able to successfully solve this challenge (we set a cutoff of greater than 0.6 F1). The cABC challenge is best segmented by the TD-CNN (0.83), followed by the TD-CNN with LSTM (0.64) and the TD-CNN with GRU (0.62). Consistent with the main challenge, no purely horizontal architecture was able to successfully solve this challenge. In short, we were able to reproduce the dissociation in performance between horizontal and top-down connections in other recurrent architectures (LSTM and GRU) as well, although these recurrent modules were less successful than the fGRU.
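For readers unfamiliar with the metric, a per-pixel F1 score between a binary prediction and a binary target mask can be computed as in the sketch below. The function name `pixel_f1` is hypothetical, and the binarization threshold and the way scores are aggregated over images are not specified in the text, so they are left to the caller.

```python
import numpy as np

def pixel_f1(pred_mask, target_mask):
    """F1 = 2*TP / (2*TP + FP + FN) over all pixels of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    target = np.asarray(target_mask, dtype=bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    denom = 2 * tp + fp + fn
    # Convention chosen here: two empty masks score 0.0.
    return float(2 * tp / denom) if denom else 0.0
```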
We visualize the internal update performed by our recurrent networks by plotting, at every timestep, the difference between the norm of each per-pixel feature vector in the low-level recurrent state, H^(1), and its norm at the immediately preceding timestep. Of course, this method does not reveal the full extent of the computations taking place in the fGRU, but it at least serves as a proxy for inferring the spatial layout of how activities evolve over timesteps (Fig. 5b). More examples of both successful and mistaken segmentations and the temporal evolution of internal activities of the H-CNN (on the Pathfinder challenge) and the TD-CNN (on the cABC challenge) are shown in Fig. S7 and Fig. S8.

Network architectures
The fGRU module At timestep t, the fGRU module takes two inputs, an instantaneous external drive X ∈ R^{H×W×K} and a persistent recurrent state H[t − 1] ∈ R^{H×W×K}, and produces as output an updated recurrent state H[t]. A detailed description of its internal operation over a single iteration from [t − 1] to [t] is defined by the following formulae. For clarity, tensors are bolded but kernels and parameters are not:

The evolution of the recurrent state H can be broadly described by two discrete stages. During the first stage, H is combined with an extrinsic drive X. Interactions between units in the hidden state are computed in C^I by convolving H[t − 1] with a kernel W^I. These interactions are next linearly and multiplicatively combined with H[t − 1] via the k-dimensional learnable coefficients µ and α to compute the intermediate "inhibited" activity, Z.
The second stage involves updating H with a transformation of the intermediate activity Z. Here, Z is convolved with the kernel W^E to compute C^E. A candidate output H̃[t] is calculated via the k-dimensional learnable coefficients κ and ω, which control the linear and multiplicative terms of self-interactions between C^E and Z.
Two gates modulate the dynamics of both of these stages: the "gain" G^I[t], which modulates channels in H[t − 1] during the first stage, and the "mix" G^E[t], which mixes a candidate H̃[t] with the persistent H[t − 1] during the second stage. Both the gain and mix are transformed into the range [0, 1] by a sigmoid nonlinearity. This two-stage computational structure in the fGRU captures complex nonlinear interactions between units in X and H. Brackets [·] denote linear rectification. Separate applications of batch-norm are used on every timestep, where r ∈ R^d is the vector of activations that will be normalized. The parameters δ, ν ∈ R^d control the scale and bias of normalized activities, η is a regularization hyperparameter, and ⊙ is elementwise multiplication.
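One way to write out the two stages described above is sketched below. It follows the verbal description term by term, but several details are assumptions that the text does not spell out: the names of the gate kernels (U^I and U^E), the exact placement of batch-norm (BN), and the subtractive form of the inhibition in Z. Rectification is written as [·]_+ and the sigmoid as σ.

```latex
\begin{aligned}
% Stage 1: gain-modulated interactions inhibit the external drive
\mathbf{G}^I[t] &= \sigma\big(\mathrm{BN}(U^I * \mathbf{H}[t-1])\big) \\
\mathbf{C}^I[t] &= \mathrm{BN}\big(W^I * (\mathbf{G}^I[t] \odot \mathbf{H}[t-1])\big) \\
\mathbf{Z}[t]   &= \Big[\mathbf{X} - \big[(\alpha \mathbf{H}[t-1] + \mu) \odot \mathbf{C}^I[t]\big]_+\Big]_+ \\
% Stage 2: candidate state and mixing with the persistent state
\mathbf{G}^E[t] &= \sigma\big(\mathrm{BN}(U^E * \mathbf{Z}[t])\big) \\
\mathbf{C}^E[t] &= \mathrm{BN}\big(W^E * \mathbf{Z}[t]\big) \\
\tilde{\mathbf{H}}[t] &= \big[\kappa(\mathbf{C}^E[t] + \mathbf{Z}[t]) + \omega(\mathbf{C}^E[t] \odot \mathbf{Z}[t])\big]_+ \\
\mathbf{H}[t]   &= (1 - \mathbf{G}^E[t]) \odot \mathbf{H}[t-1] + \mathbf{G}^E[t] \odot \tilde{\mathbf{H}}[t] \\
% Batch-norm with scale \delta, bias \nu and regularizer \eta
\mathrm{BN}(\mathbf{r}; \delta, \nu) &= \nu + \delta \odot \frac{\mathbf{r} - \mathbb{E}[\mathbf{r}]}{\sqrt{\mathrm{Var}[\mathbf{r}] + \eta}}
\end{aligned}
```

In this form, µ and α weight the linear and multiplicative contributions of C^I to the inhibition, while κ and ω play the same roles for the interaction between C^E and Z.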
Learnable gates, like those in the fGRU, support RNN training, but several other heuristics also help optimize performance. We use a number of these to train our models, including Chronos initialization of fGRU gate biases [65]. We also initialized the learnable scale parameter δ of fGRU normalizations to 0.1, since values near 0 optimize the dynamic range of gradients passing through its sigmoidal gates [66]. Similarly, fGRU parameters for learning additive inhibition/excitation (µ, κ) were initialized to 0, and parameters for learning multiplicative inhibition/excitation (α, ω) were initialized to 0.1. Finally, when implementing top-down connections, we incorporated an extra learnable gate, which we found improved the stability of training. Consider the horizontal activities in two layers, H^(l) and H^(l+1), and the function fGRU, which implements top-down connections. The introduction of this extra gate means that top-down connections are learned as: H^(l) = (1 − sigmoid(β)) fGRU(H^(l), H^(l+1)) + sigmoid(β) H^(l), where β ∈ R^K is initialized to 0 and learns how to incorporate/ignore top-down connections.
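The extra top-down gate amounts to a per-channel convex combination, which the following NumPy sketch makes explicit. The function name `gated_top_down` is hypothetical, and `td_update` stands in for the output of fGRU(H^(l), H^(l+1)), which is not computed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_top_down(h_low, td_update, beta):
    """Blend a top-down update with the unmodified low-level state.

    h_low:     (H, W, K) low-level state H^(l)
    td_update: (H, W, K) stand-in for fGRU(H^(l), H^(l+1))
    beta:      (K,) learnable per-channel gate parameter
    """
    g = sigmoid(beta)  # broadcasts over the spatial dimensions
    return (1.0 - g) * td_update + g * h_low
```

With β initialized to 0, sigmoid(β) = 0.5, so the gate starts out averaging the top-down update and the existing state with equal weight; large positive β ignores the top-down signal and large negative β passes it through unchanged.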
The broad structure of the fGRU is related to the Gated Recurrent Unit (GRU, [67]). Nevertheless, one of the main aspects in which the fGRU diverges from the GRU is that its state update is carried out via two discrete steps of processing instead of one. This is inspired by a computational neuroscience model of contextual effects in visual cortex (see [68] for details). Unlike that model, the fGRU allows for both additive and multiplicative combinations following each step of convolution, which potentially encourages a more diverse range of nonlinear computations to be learned [12,49].
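To make the two-stage update concrete, here is a minimal NumPy sketch of a single fGRU iteration. It is an illustration under stated assumptions rather than the implementation used in the experiments: the gate kernels `U_I` and `U_E` (and their 1×1 size), the subtractive form of the inhibition, the per-channel spatial normalization used in place of batch-norm, and the helper names are all hypothetical. Only the initial values of µ, κ, α, ω and δ are taken from the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def conv_same(x, w):
    """Cross-correlate x (H, W, K_in) with w (kh, kw, K_in, K_out),
    using zero padding so the spatial size is preserved (odd kh, kw)."""
    kh, kw = w.shape[:2]
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    win = sliding_window_view(xp, (kh, kw), axis=(0, 1))  # (H, W, K_in, kh, kw)
    return np.einsum("hwcij,ijco->hwo", win, w)

def norm(r, delta=0.1, nu=0.0, eta=1e-3):
    """Per-channel normalization over space; a stand-in for batch-norm."""
    mean = r.mean(axis=(0, 1), keepdims=True)
    var = r.var(axis=(0, 1), keepdims=True)
    return nu + delta * (r - mean) / np.sqrt(var + eta)

def init_params(rng, k=4, ksize=3):
    """Hypothetical parameter set; coefficient values follow the text."""
    return {
        "U_I": rng.normal(0, 0.1, (1, 1, k, k)),  # assumed 1x1 gate kernels
        "U_E": rng.normal(0, 0.1, (1, 1, k, k)),
        "W_I": rng.normal(0, 0.1, (ksize, ksize, k, k)),
        "W_E": rng.normal(0, 0.1, (ksize, ksize, k, k)),
        "mu": np.zeros(k), "kappa": np.zeros(k),             # additive: 0
        "alpha": np.full(k, 0.1), "omega": np.full(k, 0.1),  # multiplicative: 0.1
    }

def fgru_step(x, h_prev, p):
    # Stage 1: gain-modulated interactions inhibit the external drive X.
    g_i = sigmoid(norm(conv_same(h_prev, p["U_I"])))
    c_i = norm(conv_same(g_i * h_prev, p["W_I"]))
    z = relu(x - relu((p["alpha"] * h_prev + p["mu"]) * c_i))
    # Stage 2: build a candidate state and mix it with the previous state.
    g_e = sigmoid(norm(conv_same(z, p["U_E"])))
    c_e = norm(conv_same(z, p["W_E"]))
    h_tilde = relu(p["kappa"] * (c_e + z) + p["omega"] * (c_e * z))
    return (1.0 - g_e) * h_prev + g_e * h_tilde
```

Calling `fgru_step` repeatedly with the same `x` and the returned state mimics unrolling the module over timesteps. A standard GRU would compute its candidate state in a single step, whereas here the candidate depends on the intermediate activity `z` produced by the first stage.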

Architecture details
Using the fGRU module, we construct three recurrent convolutional architectures as well as one feedforward variant. First, the TD+H-CNN (Fig. S6a) is equipped with both top-down and horizontal connections. The TD-CNN (Fig. S6b) is constructed by selectively lesioning the horizontal connections in the TD+H-CNN. The H-CNN (Fig. S6c) is constructed by selectively lesioning the top-down connections in the TD+H-CNN, essentially making recurrence take place only within the low-level fGRU. Lastly, the BU-CNN (Fig. S6d) is constructed by disabling both types of recurrence, turning our recurrent architecture into a feedforward encoder-decoder network. All experiments were run with NVIDIA Titan X GPUs.
TD+H-CNN Our main recurrent network architecture, the TD+H-CNN, is built on a convolutional and transpose-convolutional "backbone". An input image X is first processed by the "Preproc." block, which consists of two convolutional layers of 7×7 kernels with 20 output channels and a ReLU activation applied between them. The resulting feature activity is sent to the first fGRU module (fGRU^(1)), which is equipped with 15×15 kernels with 20 output channels to carry out horizontal interactions. fGRU^(1) iteratively updates its recurrent state, H^(1). The output undergoes batch normalization and a pooling layer with a 2×2 pool kernel and 2×2 strides, and then passes through the downsampling block consisting of two convolutional "stacks" (Fig. S6a, the gray box named "DS"). Each stack consists of three convolutional layers with 3×3 kernels, each followed by a ReLU activation and a batch normalization layer. The convolutional stacks increase the number of output channels from 20 to 32 to 128, respectively, while they progressively downsample the activity via a 2×2 pool-stride layer after each stack. The resulting activity is fed to another fGRU module in the top layer, fGRU^(2), where it is integrated with the higher-level recurrent state, H^(2). The kernels in fGRU^(2), unlike those in fGRU^(1), are strictly local, with a 1×1 filter size. This essentially makes fGRU^(2) a persistent memory module that performs no spatial integration over timesteps. The output of this fGRU module then passes through the "upsampling" block, which consists of two transpose convolutional stacks (Fig. S6a, the gray box named "US"). Each stack consists of one transpose convolutional layer of 4×4 kernels. Each stack also reduces the number of output channels from 128 to 32 to 12, respectively, while it progressively upsamples the activity via a 2×2 stride. Each transpose convolutional layer is followed by a ReLU activation and a batch normalization layer.
The resulting activity, which now has the same shape as the output of the initial fGRU, is sent to the third fGRU module (fGRU^(3)), where it is integrated with the recurrent state of fGRU^(1), H^(1). This module integrates top-down activity (originating from "US") and bottom-up activity (originating from fGRU^(1)) via 1×1 kernels. To summarize, fGRU^(1) in our architecture implements horizontal feedback in the first layer, while fGRU^(3) implements top-down feedback.
The above steps repeat over 8 timesteps, and in the 8th timestep, the final recurrent activity from fGRU^(1), H^(1), is extracted from the network and passed through batch normalization. The resulting activity is processed by the "Readout" block, which consists of two convolutional layers of 1×1 kernels with a global max-pooling layer between them. Each convolutional layer has one output channel and uses ReLU activations.
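The readout block is small enough to write out directly. The sketch below assumes that each 1×1 convolution includes a bias term and takes the already-normalized H^(1) as input; the function name `readout` and the parameter names are hypothetical, and how the scalar output is turned into a "Same"/"Different" decision is not specified here.

```python
import numpy as np

def readout(h, w1, b1, w2, b2):
    """Two 1x1 convolutions (one output channel each, ReLU after each)
    with a global max-pool between them.

    h:  (H, W, K) final low-level state after batch normalization
    w1: (K,) weights of the first 1x1 convolution; b1: scalar bias
    w2, b2: scalar weight and bias of the second 1x1 convolution
    """
    a = np.maximum(h @ w1 + b1, 0.0)           # (H, W) single-channel map
    pooled = a.max()                           # global max-pool -> scalar
    return float(max(w2 * pooled + b2, 0.0))   # second 1x1 conv + ReLU
```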
Lesioned architectures Our top-down-only architecture, the TD-CNN, is identical to the TD+H-CNN, but its horizontal connections have been disabled by replacing the spatial kernel of fGRU^(1) with a 1×1 kernel, thereby preventing spatial propagation of activities within the fGRU altogether. Thus, its recurrent updates in the low-level layer are driven only by top-down feedback (in fGRU^(3)).
The H-CNN is implemented by disabling fGRU^(3). This effectively prevents both the downsampling and upsampling pathways from contributing to the final output. As a result, its final category decision is determined entirely by low-level horizontal propagation of activities in the network.
Lastly, we construct the purely feedforward architecture, called the BU-CNN, by lesioning both horizontal and top-down feedback. It replaces the spatial kernel in fGRU^(1) with a 1×1 kernel and disables fGRU^(3). Because it runs for only one timestep, the BU-CNN is equivalent to a deep convolutional encoder-decoder network.
Feedforward reference networks Because U-Nets [36] are designed for dense image predictions, we attach the same readout block that we use to decode fGRU activities (H^(1)) to the output layer of both U-Nets described in the main text: (i) the "small" U-Net, which is the U-Net described in [59] modified for 2D image processing, and (ii) the "big" U-Net, which has a VGG16 downsampling architecture, combined with upsampling + convolutional layers to transform its encodings into the same resolution as the input image.

pass information from lower-order to higher-order layers; horizontal connections (green) pass information between units in the same layer separated by spatial location and/or channels; top-down connections (red) pass information from higher-order to lower-order layers. We use "feedback" as an umbrella term for both top-down and horizontal connections as these can only be implemented in a model with recurrent activity that supports incremental updates to model units. (Right) A computational diagram of our default (TD+H-CNN) architecture, unrolled through 8 timesteps of processing. The architecture takes an input image, X, and emits an output label Y at the end of the recurrent sequence. Solid arrows denote the passage of activities, and are color-coded according to the type of connections each represents. The fGRU is a recurrent module, described in the main text, that implements the model's recurrent connections. Dotted arrows denote recurrence of internal states of fGRU modules which are iteratively updated over timesteps. By lesioning different types of feedback connections, we can measure the effectiveness of combinations of these connections for solving perceptual grouping tasks.

Human experiments
We devised psychophysics categorization experiments to understand the visual processes underlying the ability to solve tasks that encourage perceptual grouping.
These visual categorization experiments adopted a paradigm similar to Eberhardt et al. [43], where stimuli were flashed quickly and participants had to respond within a fixed period of time. We conducted separate experiments for the "Pathfinder" and the "cluttered ABC" challenge datasets. Participants had several possible response time (RT) windows for responding to stimuli in each dataset. For "Pathfinder", we considered {800, 1050, 1300} ms RT windows, while for the "cluttered ABC", we considered {800, 1200, 1600} ms. We recruited 324 participants for each experiment (648 overall) from Amazon Mechanical Turk (www.mturk.com).
The "Pathfinder" and the "cluttered ABC" are binary categorization challenges with "Same" and "Different" as the possible responses. For both experiments, we present 90 stimuli to each participant and assign a single RT window to that participant. These 90 stimuli consist of three sets of 30, corresponding to the three levels of difficulty present in the challenge datasets. We randomize the order of the three difficulties (sets of 30 stimuli) across participants, as well as the assignment of keys to the classes "Same" and "Different". The performance results (in terms of accuracy) that we obtained from the human participants are shown in Fig. 4.

Stimulus generation
We generated a short video for each stimulus image from the challenge dataset, based on the response time allowed from stimulus onset. We generated three stimulus videos for each challenge image, corresponding to the three different response times chosen. In each stimulus video, we included an image of a cross (a black plus sign overlaid on a white background) for 1000 ms before the onset of the stimulus, which is shown for RT ms. The stimuli were stored as .webm videos and presented for 1000+RT ms during the experiments. We used a single video so that the cross image and the stimulus image are seamlessly combined and shown without the delays that would otherwise be caused by loading them as separate images in the browser, which would introduce noise in stimulus timing.
We selected a set of 200 images per difficulty in both challenges (600/challenge). Each participant responded to 30 images from each difficulty. We collected responses to the 600 images in each challenge from 18 different participants.
Psychophysics experiment We conducted separate experiments for: (1) Pathfinder and (2) cABC images. The image dataset used for each experiment consisted of three difficulties: easy, intermediate and hard. Participants were assigned three blocks of trials, each having stimuli from a single difficulty level. The order of these three blocks was randomized and balanced across participants. The keys for responses to the stimuli were set to be + and -. Their assignment to the "Same" and "Different" class labels was randomized across participants as well.
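One simple way to obtain a balanced block order with randomized key assignment is sketched below. The cycling scheme and the function name `assign_condition` are assumptions for illustration; the text states only that block order was randomized and balanced and that key assignment was randomized.

```python
import itertools
import random

DIFFICULTIES = ("easy", "intermediate", "hard")
BLOCK_ORDERS = list(itertools.permutations(DIFFICULTIES))  # 6 possible orders

def assign_condition(participant_index, rng):
    """Cycle through the 6 block orders so each is used equally often,
    and randomly map the + and - keys to the two class labels."""
    order = BLOCK_ORDERS[participant_index % len(BLOCK_ORDERS)]
    keys = ["+", "-"]
    rng.shuffle(keys)
    return {
        "block_order": order,
        "key_map": {"Same": keys[0], "Different": keys[1]},
    }
```

With 324 participants per experiment, cycling in this way would assign each of the six block orders to 54 participants.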
Each experiment began with a training phase, where participants responded to 6 trials at their leisure. This was to make sure that participants understood the task and the type of image stimuli. After this, participants went through 15 trials, 5 from each difficulty level, where they responded by pressing + or - within a limited response period. For this set of images, participants were given feedback on whether their response was correct or incorrect. We call this the training block of trials. After training, each participant completed 90 trials and responded with either "Same" or "Different" in the stipulated amount of response time, without feedback on their performance. After every 30 trials, participants were given time to rest and the difficulty level of the viewed images changed. Before starting the block of trials for any difficulty, a representative image from that difficulty level was shown and instructions were given to respond as quickly as possible.
These experiments were implemented using jsPsych and custom JavaScript functions. We used the .webm format because it is an HTML5-compliant video format that ensures fast loading times in web browsers. The challenge images used to generate the stimulus videos were resized to 256 × 256 to ensure consistently sized images across trials. The configuration variables for each assignment were stored in a MySQL database and fetched by the server to load the assignment with the required trials. The configuration variables included finger assignment ID, difficulty ordering ID, trials list, assignment ID and response time. The trials list is stored as a comma-separated values (csv) file, where each line has the path to a generated stimulus video.
Processing Responses To generate the final results from human responses on the two categorization tasks, we first filtered out responses rendered in less than 450 ms, which we deem unreliable following the criteria of Eberhardt et al. [43]. We next calculated the accuracy across participants for the three difficulty levels, and used a bootstrapping procedure to determine the standard error.