Implicit Scene Segmentation in Deeper Convolutional Neural Networks

Feedforward deep convolutional neural networks (DCNNs) are matching and even surpassing human performance on object recognition. This performance suggests that activation of a loose collection of image features could support the recognition of natural object categories, without dedicated systems to solve specific visual subtasks. Recent findings in humans, however, suggest that while feedforward activity may suffice for sparse scenes with isolated objects, additional visual operations ('routines') that aid the recognition process (e.g. segmentation or grouping) are needed for more complex scenes. Linking human visual processing to the performance of DCNNs with increasing depth, we explored if, how, and when object information is differentiated from the background on which the object appears. To this end, we controlled the information in both objects and backgrounds, as well as the relationship between them, by adding noise, manipulating background congruence and systematically occluding parts of the image. Results indicated less distinction between object and background features for shallower networks. For those networks, we observed a benefit of training on segmented objects (as compared to unsegmented objects). Overall, deeper networks trained on natural (unsegmented) scenes seem to perform implicit 'segmentation' of the objects from their background, possibly by improved selection of relevant features.


Introduction
When performing an object recognition task, the visual input elicits a feedforward drive that rapidly extracts basic image features (Lamme & Roelfsema, 2000). For sparse scenes with isolated objects, this set of features might be enough for successful recognition, because the objects are effectively pre-segmented. For more complex scenes, however, the jumble of visual information ('clutter') may be so great that object recognition cannot rely on access to such a reliable set of features. For those images, extra visual operations ('visual routines') that aid the recognition process, such as scene segmentation and perceptual grouping, might require the feedforward activity to be modulated by recurrent loops of activity (Roelfsema, 2006; Groen et al., 2018).
While this view seems to suggest that object recognition depends only on the features that belong to the object, many studies have shown that features from the background can also influence the recognition process. For example, objects appearing in a congruent background are detected more accurately and quickly than objects in an incongruent environment (Davenport & Potter, 2004), and many computational models of object recognition use features from both the object and the background (Riesenhuber & Poggio, 1999).
In the current study, we explore how the number of layers (depth) in a DCNN influences object segmentation and how this compares to human vision. We use deep residual networks (ResNets; He, Zhang, Ren & Sun, 2016) to systematically manipulate network depth, because they can be up-scaled by adding their basic building blocks without altering the architecture in another way. We presented seven DCNNs (with increasing depth) and 38 human participants with images of segmented and unsegmented objects. To investigate the influence of features from the background on object recognition, we generated stimuli in which objects were placed on top of congruent or incongruent scenes. Thereby we ask to what extent DCNNs exhibit the same sensitivity to scene properties (i.e. context) as human observers. To complement our findings, we further explore the role of segmentation on learning by training ResNets on a dataset with segmented objects, and a dataset in which objects were embedded in a scene.

Experiment 1: Background congruence

Methods
Stimuli. Images of 27 different object categories were generated by placing cut-out objects from the ImageNet validation set onto white (segmented), congruent and incongruent backgrounds. There were ten exemplars for every category, and backgrounds were sampled from the SUN2012 database (512x512 pixels, full color). For each category, three congruent backgrounds were selected using the five most common places where this object was found within the database. Three incongruent backgrounds were manually chosen.
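The three stimulus conditions differ only in what lies behind the cut-out object. A minimal sketch of such a compositing step is given below, assuming the object is available as an RGB array together with a mask whose values lie between 0 (background) and 1 (object). The function name `composite` and its arguments are hypothetical and are not taken from the original stimulus pipeline.

```python
import numpy as np

def composite(obj_rgb, obj_mask, background=None):
    """Place a cut-out object on a background (white if none is given).

    obj_rgb:    H x W x 3 array with values in 0..255
    obj_mask:   H x W array, 1 where the object is, 0 elsewhere
    background: H x W x 3 array, or None for the segmented (white) condition
    """
    obj_rgb = np.asarray(obj_rgb, dtype=np.float64)
    mask = np.asarray(obj_mask, dtype=np.float64)[..., None]  # H x W x 1
    if background is None:
        background = np.full_like(obj_rgb, 255.0)  # white canvas
    else:
        background = np.asarray(background, dtype=np.float64)
    # Object pixels where mask is 1, background pixels where mask is 0.
    out = mask * obj_rgb + (1.0 - mask) * background
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

Calling `composite(obj, mask)` gives the segmented condition, while passing a congruent or incongruent scene as `background` gives the other two.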
Participants and networks. 38 participants (9 males) aged between 18 and 30 years (M = 22.03, SD = 3.02) took part in the experiment. To investigate the effect of depth on scene segmentation in DCNNs, tests were conducted on ResNets with an increasing number of layers (10, 18, 34, 50, 101, 152), using the fb.resnet.torch implementation by Gross & Wilber (2016). Input images from the ImageNet dataset (Russakovsky et al., 2015) were random 224x224 crops from a resized image, using the scale and aspect ratio augmentation of Szegedy et al. (2015). Downsampling was done by stride-2 convolutions in the 3x3 layer of the first block in each stage (instead of the first 1x1 convolution), and weight decay was applied to all weights and biases (instead of just the weights of the convolution layers). ResNet-10 was trained on ImageNet on a single GPU. We used pre-trained versions of the other ResNets.
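As an illustration of scale and aspect ratio augmentation, the sketch below samples a crop box that would then be resized to 224x224. The area range (8% to 100% of the image), the aspect ratio range (3/4 to 4/3), the number of attempts and the central-square fallback are assumptions based on how this augmentation is commonly configured; they are not settings reported here. The function name `sample_crop_box` is hypothetical.

```python
import math
import random

def sample_crop_box(width, height, rng, scale=(0.08, 1.0),
                    ratio=(3 / 4, 4 / 3), attempts=10):
    """Sample (left, top, crop_width, crop_height) inside a width x height image."""
    area = width * height
    for _ in range(attempts):
        # Draw the fraction of the image area covered by the crop.
        target_area = rng.uniform(scale[0], scale[1]) * area
        # Draw the aspect ratio on a log scale so 3/4 and 4/3 are treated symmetrically.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    # Fallback (assumed): central square crop.
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side, side
```

For example, `sample_crop_box(500, 375, random.Random(0))` returns one box; the cropped region is subsequently rescaled to the network input size.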

Network performance
For human participants, results indicated that features from the background influenced object perception. Do DCNNs show a similar pattern, and how is this influenced by network depth? Experiment 1 showed both substantial overlap and differences in performance between human participants and DCNNs. Both were better at recognizing an object on a congruent than on an incongruent background. However, whereas human participants performed best in the segmented condition, DCNNs performed equally well (or better) in the congruent condition. Performance in the incongruent condition was lowest. This effect was particularly strong for shallower networks.
To further investigate the degree to which the networks use features from the object and/or background, we systematically occluded different parts of the image and evaluated the changes in activation of the correct class, before the softmax activation function (Zeiler & Fergus, 2014). We quantified the importance of features in the object vs. background by averaging the change across pixels belonging either to the object or to the background. For this analysis, positive values indicate that pixels are helping classification (higher values indicating a higher importance). For example, Figure 3A shows that the network is localizing the object in the scene, as the activity drops significantly when the object (china cabinet in this example) is occluded.
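The occlusion analysis can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: `score_fn` is a hypothetical callable that stands in for the network and returns the pre-softmax activation of the correct class for an image, and the patch size, stride and fill value are placeholder choices.

```python
import numpy as np

def occlusion_importance(image, score_fn, patch=8, stride=8, fill=0.0):
    """Per-pixel drop in the correct-class score when that pixel is occluded."""
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape[:2]
    base = score_fn(image)  # score for the unoccluded image
    drop_sum = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = base - score_fn(occluded)  # positive: the patch helped
            drop_sum[top:top + patch, left:left + patch] += drop
            counts[top:top + patch, left:left + patch] += 1
    # Pixels never covered by a patch keep an importance of 0.
    return drop_sum / np.maximum(counts, 1)

def object_vs_background(importance, object_mask):
    """Average importance over object pixels and over background pixels."""
    object_mask = np.asarray(object_mask, dtype=bool)
    return importance[object_mask].mean(), importance[~object_mask].mean()
```

With this sign convention, a large object average combined with a background average near zero would indicate that classification relies mostly on the object.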
To evaluate whether deeper networks are better at localizing the objects in the scene, while ignoring irrelevant background information, we computed the relative drop in performance when pixels of the background vs. pixels of the object were occluded. Results indicated a larger influence of background pixels on classification for shallower networks, in all conditions. For those models, pixels from the object had a larger impact as well, in the segmented and congruent conditions.

Experiment 2: Training
Next, we investigated how training is influenced by network depth. If deeper networks indeed implicitly learn to segment objects from their background, we expect them to show a smaller difference in learning speed when trained with segmented vs. unsegmented stimuli (as compared to shallow networks).

Methods
Stimuli. To train the models, images from 10 categories were selected from ImageNet. We used 10 categories to obtain a reasonable balance between ease of computing and performance gradients that show a substantial difference from untrained to trained. With the selected images, we generated two training sets: one in which the objects were segmented, and one with the original images (objects embedded in scenes). Objects were segmented using a DCNN trained on the MS COCO dataset (Lin et al., 2014), using the Mask R-CNN method (He, Gkioxari, Dollár & Girshick, 2017). Images with object probability scores lower than 0.98 were discarded, to minimize the risk of selecting wrongly classified or low-quality images. Images were resized to 128x128 pixels. In total, the set contained ~9000 images, of which 80% was used for training and 20% for validation.
Networks. As in Experiment 1, we used ResNets with an increasing number of layers (6, 10, 18, and 34). Deeper networks showed overfitting problems and were not included.
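The selection and split described above can be sketched in a few lines. The record layout (dictionaries with a `"path"` and a `"score"` field), the function name and the random shuffling before the split are assumptions made for illustration; the text only specifies the 0.98 threshold and the 80/20 proportions.

```python
import random

def filter_and_split(records, min_score=0.98, train_fraction=0.8, seed=0):
    """Drop low-confidence segmentations, then split into training and validation sets."""
    # Keep images whose object probability score is at least the threshold.
    kept = [r for r in records if r["score"] >= min_score]
    # Shuffle reproducibly before splitting (assumed; not stated in the text).
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_train = int(round(train_fraction * len(kept)))
    return kept[:n_train], kept[n_train:]
```

Applying the same split to the segmented and the unsegmented version of each image would keep the two training sets matched in content.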

Network convergence
Accuracy of the ResNets was evaluated after each epoch (100 total) on the validation set. Results indicated a higher classification accuracy in the early stages of training for the networks trained on segmented objects, compared to those trained on unsegmented objects (Figure 4A). Additionally, these networks converged (accuracy constant >10 epochs) in fewer epochs. In later epochs, accuracy was similar for the two types of networks. Shallow networks trained on segmented stimuli converged earlier than those trained on unsegmented stimuli. The difference in epochs until convergence decreased as network depth increased. These results confirm that networks need to learn to segment objects from their background for optimal performance.
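One way to make the convergence criterion (accuracy constant for more than 10 epochs) operational is sketched below. The tolerance parameter `tol` and the choice to report the first epoch of the constant run are assumptions; the text does not state how "constant" was measured.

```python
def convergence_epoch(accuracies, window=10, tol=0.0):
    """Return the 1-based epoch that starts the first run of more than
    `window` epochs whose accuracies differ by at most `tol`; None if no such run."""
    n = len(accuracies)
    for start in range(n - window):
        run = accuracies[start:start + window + 1]  # window + 1 epochs
        if max(run) - min(run) <= tol:
            return start + 1
    return None
```

Comparing the value returned for a network trained on segmented objects with that of the same architecture trained on unsegmented scenes gives the difference in epochs until convergence.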

Discussion
Classic models of grouping and segmentation presume an explicit process in which certain elements of an image are grouped, while others are segregated from each other, by a labelling process. Our results from behavioral experiments with segmented and unsegmented objects indicate that recognition can take place without an explicit segmentation step. Furthermore, we show that segmentation can, and for DCNNs does, arise implicitly as a function of network depth.
Different accounts of object recognition in scenes propose different loci for contextual effects (Oliva & Torralba, 2007). It has been argued that a bottom-up visual analysis is sufficient to discriminate between basic level object categories, after which context may influence this process in a top-down manner, by priming relevant semantic representations, or by constraining the search space of most likely objects (e.g. Bar, 2003). The current results show that context features may impact object recognition in a bottom-up fashion, even for objects in a spatially incongruent location.
Instead of being an ultra-deep feedforward network, the brain might employ recurrent connections for object recognition in complex natural environments. The interpretation that deeper networks are better at object recognition because they are capable of limiting their analysis to (mostly) the object, when necessary, is consistent with the idea that deeper networks are solving the challenges that are resolved by recurrent computations in the brain (Liao & Poggio, 2016).

Conclusion
We investigated the extent to which object and context information, and the interplay between them, impact object recognition for both DCNNs and human observers. Combined, the current findings show that with an increase in network depth there is better selection of the features that belong to the object category. This process is similar, at least in terms of its outcome, to figure-ground segmentation in humans and might be one of the ways in which scene segmentation is implemented in the brain.