Gestalt-based Contour Weights Improve Scene Categorization by CNNs

Humans can accurately recognize natural scenes from line drawings, consisting solely of contour-based shape cues. Deep learning strategies for this complex task, however, have thus far been applied directly to photographs, exploiting all the cues available in colour images at the pixel level. Here we report the results of fine-tuning off-the-shelf pre-trained Convolutional Neural Networks (CNNs) to perform scene classification given only contour information as input. To do so we exploit the Iverson-Zucker logical/linear framework to obtain line drawings from popular scene categorization databases, including an artist’s scene database and MIT67. We demonstrate a high level of performance despite the absence of colour, texture and shading information. We also show that the inclusion of medial-axis based contour salience weights leads to a further boost, adding useful information that does not appear to be exploited when CNNs are trained to use contours alone.


Introduction
In vision science perceptual organization is thought to be effected by a set of heuristic grouping rules originating from Gestalt psychology (Koffka, 1922). Such rules posit that visual elements ought to be grouped together if they are, for instance, similar in appearance, in close proximity, or if they are symmetric or parallel to each other. Developed on an ad hoc, heuristic basis originally, these rules have been validated empirically, even though their precise neural mechanisms remain elusive. Grouping cues, such as those based on symmetry, are thought to aid in high-level visual tasks such as object detection, because symmetric contours are more likely to be caused by the projection of a symmetric object than to occur accidentally. In the categorization of complex real-world scenes by human observers, local contour symmetry does indeed provide a perceptual advantage (Wilder et al., 2019), but the connection to the recognition of individual objects is not as straightforward as it may appear.
However, perceptually motivated salience measures to facilitate scene categorization have received little attention thus far. This may be a result of the ability of CNN-based systems to accomplish scene categorization on challenging databases, in the presence of sufficient training data, directly from pixel intensity and colour in photographs (Sharif Razavian, Azizpour, Sullivan, & Carlsson, 2014; Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016; Zhou, Lapedriza, Khosla, Oliva, & Torralba, 2018). CNNs begin by extracting simple features, including oriented edges, which are then successively combined into more and more complex features in a succession of convolution, nonlinear activation and pooling operations. The final levels of CNNs are typically fully connected, which enables learning of object or scene categories (Song, Lichtenberg, & Xiao, 2015; Bai, 2017; Girshick, Donahue, Darrell, & Malik, 2014; Ren, He, Girshick, & Sun, 2015). Unfortunately, present CNN architectures do not allow for properties of object shape to be represented explicitly. Human observers, in contrast, recognize an object's shape as an inextricable aspect of its properties, along with its category or identity (Kellman & Shipley, 1991).
Comparisons between CNNs and human and monkey neurophysiology appear to indicate that CNNs replicate the entire visual hierarchy (Kriegeskorte, 2015; Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019; Güçlü & van Gerven, 2015; Cadieu et al., 2014). Does this mean that the problem of perceptual organization is now irrelevant for machine vision? We argue that this is not the case, and show that CNN-based scene categorization systems, just like human observers, can benefit from explicitly computed contour measures derived from Gestalt grouping cues. To do so we use an average outward flux formulation to compute the medial axis (Dimitrov, Damon, & Siddiqi, 2003) and then use it to directly capture salience measures related to local contour separation and local contour symmetry. Figure 1 presents an illustrative example of a photograph from an artist scenes database, along with two of our medial axis based contour salience maps.

Medial Axis Based Contour Saliency
Motivated by the considerations above, we recently introduced novel measures to capture local separation, ribbon symmetry and taper from line drawings of natural scenes (Rezanejad et al., in press), which we now review. Owing to the continuous mapping between the medial axis and scene contours, the scores obtained using these measures can then be mapped to scene contours. We let p be a parameter that runs along a medial axis segment, (x(p), y(p)) be the coordinates of points along that segment, and R(p) be the medial axis radius at each point. We consider the interval p ∈ [α, β] for a particular medial segment.
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Separation Salience
With R(p) > 1 in pixel units (we assume that two distinct scene contours do not touch) we use the following contour separation based salience measure: This quantity falls in the interval [0, 1], increasing with greater spatial separation between the two contours. Scene contours that exhibit further (local) separation are more salient by this measure.
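One form consistent with the properties stated above, offered as an assumption rather than as the authors' confirmed definition, is one minus the average of 1/R(p) over the medial segment. The symbol S_Separation is a label introduced for this sketch.

```latex
S_{\mathrm{Separation}} \;=\; 1 - \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \frac{1}{R(p)} \, dp
```

Because R(p) > 1, the integrand lies strictly between 0 and 1, so the value stays in [0, 1] and approaches 1 as the two contours move further apart.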

Ribbon Symmetry Salience
When two scene contours are close to being parallel locally, R(p) will vary slowly along the medial segment. This motivates the following ribbon symmetry salience measure: This measure also falls in the interval [0, 1] and it increases as the scene contours on either side become more parallel, such as the two sides of a ribbon.
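A form consistent with this description, again an assumption rather than the authors' confirmed definition, penalizes the average magnitude of the radius derivative along the segment. Here R'(p) denotes dR/dp, and S_Ribbon is a label introduced for this sketch.

```latex
S_{\mathrm{Ribbon}} \;=\; 1 - \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \left| R'(p) \right| \, dp
```

For a medial axis parameterized by arc length, |R'(p)| ≤ 1, which keeps the value in [0, 1]. Perfectly parallel contours give R'(p) = 0 everywhere and hence a score of 1.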

Taper Symmetry Salience
A notion that is closely related to that of ribbon symmetry is taper symmetry. Two scene contours are taper symmetric when the medial axis between them has a radius function that is changing at a constant rate, such as the edges of two parallel contours in 3D when viewed in perspective projection. To capture this notion of symmetry we use the following taper symmetry salience measure: This quantity also falls in the interval [0, 1] and it increases as the scene contours on either side become more taper symmetric, as in the sides of a railway track.
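The three measures can be illustrated on a sampled radius function. The sketch below is a discrete approximation under stated assumptions, not the authors' implementation: it assumes unit spacing between samples along the medial segment, takes each measure to be one minus a segment average (of 1/R, |R'| and |R''| respectively), and uses hypothetical function names.

```python
from typing import Sequence


def _mean_abs(values: Sequence[float]) -> float:
    """Average absolute value; 0.0 for an empty sequence."""
    return sum(abs(v) for v in values) / len(values) if values else 0.0


def separation_salience(radii: Sequence[float]) -> float:
    """Assumed form: 1 - mean(1 / R). Requires every R > 1 (pixel units)."""
    return 1.0 - sum(1.0 / r for r in radii) / len(radii)


def ribbon_salience(radii: Sequence[float]) -> float:
    """Assumed form: 1 - mean |R'|, with R' taken as first differences."""
    d1 = [b - a for a, b in zip(radii, radii[1:])]
    return 1.0 - _mean_abs(d1)


def taper_salience(radii: Sequence[float]) -> float:
    """Assumed form: 1 - mean |R''|, with R'' taken as second differences."""
    d1 = [b - a for a, b in zip(radii, radii[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    return 1.0 - _mean_abs(d2)


parallel = [5.0] * 6                       # constant radius, ribbon-like
tapered = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # radius grows at a constant rate

print(ribbon_salience(parallel), taper_salience(parallel))
print(ribbon_salience(tapered), taper_salience(tapered))
print(separation_salience(parallel))
```

In this toy case the constant-radius segment scores highly on both symmetry measures, while the linearly growing radius scores highly on taper but not on ribbon symmetry, which matches the railway track intuition.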

Artist Scenes Database
Color photographs of six categories of natural scenes were downloaded from the internet, and those rated as the best exemplars of their respective categories by workers on Amazon Mechanical Turk were selected (Torralbo et al., 2013).

Machine Generated Logical/Linear Line Drawings
Given the limited number of scene categories in the Artist Scenes database, we extended our analysis to a much larger scene database of photographs, MIT67 (Quattoni & Torralba, 2009) (6700 images, 67 categories).
To produce line drawings from this much larger database we modified the output of the logical/linear edge detector (Iverson & Zucker, 1995), using their publicly available open source implementation. This approach is devised to recover image curves while preserving singularities and junctions. We briefly review the three kinds of image curves modeled in (Iverson & Zucker, 1995).
Consider an image I : R² → R⁺, with P = [α, β], and let C : p ∈ P → R² represent a smooth curve parameterized by arc length. The normal cross section N_p(t) at the curve point C(p) is given by:
N_p(t) = I(C(p) + tN(p)), p ∈ P, t ∈ R. (4)
Using local structural conditions in the directions tangential and normal to the curve, the following three image curve categories are suggested in (Iverson & Zucker, 1995):
1. C is an Edge iff C is an image curve such that the following condition holds for all p ∈ P:
2. C is a Positive Contrast Line iff C is an image curve such that the following condition holds for all p ∈ P:
3. C is a Negative Contrast Line iff C is an image curve such that the following condition holds for all p ∈ P:
In (Iverson & Zucker, 1995) operators are designed to respond when any of the above conditions are met locally in an image, and if so, either an edge or a line is reported. In our experiments we focused on the case of edge points; from the output edge map and its associated edge strength and edge directions, we produced a binarized version. Each binarized edge map was processed and traced to obtain contour fragments having a width of 1 pixel. Figure 2 presents a comparison of an artist-generated line drawing for an office scene from the Artist Scenes database, along with the logical/linear (machine generated) version.
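The normal cross section of Equation (4) can be sampled numerically. The sketch below is only an illustration, not the logical/linear operator itself: it assumes the image is a list of rows, uses bilinear interpolation for off-grid samples, and takes the normal to be the tangent rotated by 90 degrees. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # indexed as image[row][col], i.e. image[y][x]


def bilinear(image: Image, x: float, y: float) -> float:
    """Bilinearly interpolated intensity at (x, y), clamped to the image."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def normal_cross_section(image: Image,
                         point: Tuple[float, float],
                         tangent: Tuple[float, float],
                         ts: Sequence[float]) -> List[float]:
    """Sample N_p(t) = I(C(p) + t N(p)) for each offset t in ts."""
    tx, ty = tangent
    norm = math.hypot(tx, ty)
    nx, ny = -ty / norm, tx / norm  # unit normal: tangent rotated by 90 degrees
    px, py = point
    return [bilinear(image, px + t * nx, py + t * ny) for t in ts]


# A vertical step edge: dark columns 0 and 1, bright columns 2 to 4.
step = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# The curve runs vertically through (x=2, y=2); tangent (0, -1) gives normal (1, 0).
profile = normal_cross_section(step, (2.0, 2.0), (0.0, -1.0), [-2, -1, 0, 1, 2])
print(profile)
```

The sampled profile rises from dark to bright across the curve, which is the kind of cross-sectional structure the edge category is meant to capture.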

[Figure 2 panels: Artist Line Drawing; Machine Generated Line Drawing]
Figure 2: (Best viewed by zooming in on the PDF.) An artist's line drawing of an office scene, and the machine generated version, obtained using logical/linear operators (Iverson & Zucker, 1995).
We have confirmed that on the artist's line drawing database 82% of the machine generated contour pixels are in common with the artist's line drawings.
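An overlap figure of this kind can be computed by counting the machine generated contour pixels that coincide with an artist contour pixel. The text does not say how the comparison was made, so the sketch below is an assumption: it uses binary masks of equal size and an optional pixel tolerance, both of which are hypothetical choices.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # 1 marks a contour pixel, 0 marks background


def contour_overlap(machine: Mask, artist: Mask, tolerance: int = 0) -> float:
    """Fraction of machine contour pixels that have an artist contour pixel
    within `tolerance` pixels (Chebyshev distance)."""
    h, w = len(machine), len(machine[0])
    total = matched = 0
    for y in range(h):
        for x in range(w):
            if not machine[y][x]:
                continue
            total += 1
            ys = range(max(0, y - tolerance), min(h, y + tolerance + 1))
            xs = range(max(0, x - tolerance), min(w, x + tolerance + 1))
            if any(artist[j][i] for j in ys for i in xs):
                matched += 1
    return matched / total if total else 0.0


machine = [[0, 1, 0, 0],
           [0, 1, 0, 0],
           [0, 1, 1, 0]]
artist = [[0, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0]]
print(contour_overlap(machine, artist))               # exact pixel agreement
print(contour_overlap(machine, artist, tolerance=1))  # allow a 1-pixel offset
```

A small tolerance is often useful when comparing hand-traced and detected contours, since the two rarely align to the exact pixel.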

Scene Categorization with Salience Weighted Contours
We report the results of scene categorization using both contours and contours weighted with our perceptual salience measures. We accomplish this by feeding different features into the three channels (normally used for red, green and blue) of a pre-trained network, as illustrated in Figure 3. We have used VGG16 (pre-trained on ImageNet) and VGG16-H (pre-trained on both ImageNet and Places365 (Zhou et al., 2018)). In all the experiments, the last two fully-connected layers of the pre-trained networks were fine-tuned using our feature-coded inputs, i.e., training was done on the feature maps provided by them.
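The channel encoding can be sketched as follows. This is a minimal illustration under assumptions: the particular assignment (unweighted contours, ribbon-weighted contours, separation-weighted contours) is one plausible set of input channels rather than the configuration the experiments necessarily used, and the function names are hypothetical.

```python
from typing import List, Sequence

Map2D = Sequence[Sequence[float]]


def weight_contours(contours: Map2D, salience: Map2D) -> List[List[float]]:
    """Scale each contour pixel by its salience score; background stays 0."""
    return [[c * s for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(contours, salience)]


def encode_input(contours: Map2D, ribbon: Map2D,
                 separation: Map2D) -> List[List[List[float]]]:
    """Stack three feature maps in place of the red, green and blue channels."""
    return [
        [list(row) for row in contours],        # channel 0: unweighted contours
        weight_contours(contours, ribbon),      # channel 1: ribbon-weighted
        weight_contours(contours, separation),  # channel 2: separation-weighted
    ]


contours = [[1.0, 0.0], [1.0, 1.0]]    # binary contour map
ribbon = [[0.9, 0.3], [0.5, 0.2]]      # per-pixel ribbon salience in [0, 1]
separation = [[0.4, 0.8], [0.6, 1.0]]  # per-pixel separation salience in [0, 1]

x = encode_input(contours, ribbon, separation)
print(len(x), len(x[0]), len(x[0][0]))  # channels, rows, columns
```

The result is laid out channels-first, so it can be converted to an array and passed to a network that expects a three-channel image without changing the network's input layer.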
[Figure 3 caption, truncated: ... Table 1 for the specific sets of input channels.]
The results for the Artist Scenes dataset and for MIT67 are shown in Table 1. It is apparent that with these salience weighted contour channels added, there is a consistent boost to the results obtained by using contours alone. In all cases the biggest performance boost comes from a combination of contours, ribbon or taper symmetry salience, and separation salience. This is likely because taper salience is conceptually very close to ribbon salience, while local separation salience provides a more distinct and complementary perceptual cue for grouping. For MIT67 the performance of 79.49% on photographs is consistent with that reported in (Zhou et al., 2018). Remarkably, over 75% of this level of performance (a level of 60.73%) is obtained using only logical/linear line drawings. The overall performance goes up to 65.79% (or 82.8% of the performance on photographs) when using contours weighted by ribbon and separation salience. For MIT67, we have also compared the performance of the (fine-tuned) Hybrid1365 VGG on photographs (79.49% top-1) with its performance on photographs with contours, ribbon, and separation salience weighted contours overlaid (82.05% top-1). Thus, perceptually weighted contour features can boost overall performance as well.
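The relative figures for MIT67 follow from dividing each contour-based accuracy by the accuracy on photographs. The short check below uses only the top-1 numbers quoted in the text; the variable names are hypothetical.

```python
photo_top1 = 79.49     # MIT67, fine-tuned network on photographs
contours_top1 = 60.73  # logical/linear line drawings only
weighted_top1 = 65.79  # contours weighted by ribbon and separation salience
overlay_top1 = 82.05   # photographs with salience weighted contours overlaid


def relative(score: float, reference: float) -> float:
    """Score expressed as a percentage of the reference accuracy."""
    return 100.0 * score / reference


print(round(relative(contours_top1, photo_top1), 1))  # line drawings vs photographs
print(round(relative(weighted_top1, photo_top1), 1))  # weighted contours vs photographs
print(round(overlay_top1 - photo_top1, 2))            # gain from the overlay, in points
```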

Conclusion
Our experiments show that scene contours weighted by perceptually motivated contour salience measures can boost CNN-based scene categorization accuracy, despite the absence of colour, texture and shading cues. Our work indicates that measures of contour grouping, which are simply functions of the contours themselves, are beneficial for scene categorization by computers, leading to recognition performance that is over 80% of the best reported results on the underlying photographs. Whereas this shape information is reflected in the images themselves, it does not appear to be directly learned by present state-of-the-art CNN-based scene recognition systems. Adding shape information computed on the medial axis outside of the CNNs improves scene categorization above the current state of the art.