Visual features drive the category-specific impairments on categorization tasks in a patient with object agnosia

Object and scene recognition both require mapping of incoming sensory information to existing conceptual knowledge about the world. A notable finding in brain-damaged patients is that they may show differentially impaired performance for specific categories, such as for "living exemplars". While numerous patients with category-specific impairments have been reported, the explanations for these deficits remain controversial. In the current study, we investigate the ability of a brain-injured patient with a well-established category-specific impairment of semantic memory to perform two categorization experiments: 'natural' vs. 'manmade' scenes (experiment 1) and objects (experiment 2). Our findings show that the pattern of categorical impairment does not respect the natural versus manmade distinction. This suggests that the impairments may be better explained by differences in visual features, rather than by category membership. Using Deep Convolutional Neural Networks (DCNNs) as 'artificial animal models', we further explored this idea. Results indicated that DCNNs with 'lesions' in higher-order layers showed similar response patterns, with decreased relative performance for manmade scenes (experiment 1) and natural objects (experiment 2), even though they have no semantic category knowledge, apart from a mapping between pictures and labels. Collectively, these results suggest that the direction of category effects depends to a large extent, at least in MS' case, on the degree of perceptual differentiation called for, and not on semantic knowledge.


Introduction
Object or scene recognition requires mapping of incoming sensory information to existing conceptual knowledge about the world. A notable finding in brain-damaged patients is that they may show differentially impaired knowledge of, most prevalently, living things compared to non-living things (Gainotti, 2000). For many years, researchers have been investigating these category-specific semantic deficits. To date, the debate remains unsettled on how this distinction in breakdown of semantic knowledge along the natural/living versus manmade/non-living axis arises (Capitani et al., 2003; Gainotti, 2000; Young et al., 1989).
Some studies have suggested that evolutionary pressures have led to specialized, distinct neural mechanisms for different categories of knowledge (e.g. animals, plants and artefacts) (Caramazza and Shelton, 1998; Nielsen, 1946), and that category-specific deficits arise from damage to one of these distinct neural substrates. However, the most widespread views hold that they emerge because living and non-living things have different processing demands (i.e. they rely on different types of information). The first (most dominant) of those theories assumes that the storage of semantic information is divided into parts dominated by different knowledge aspects (e.g. perceptual, functional) and proposes that the dissociation arises from a selective breakdown of perceptual compared to functional associative knowledge. While man-made objects have 'clearly defined functions' and are mostly differentiated by their functional qualities, animals have less defining functions and are mostly distinguishable in terms of their visual appearance (Warrington and Shallice, 1984). This 'differential weighting' of perceptual and associative attributes might underlie the dissociation between living and non-living things. This theory was later revised to also include other modality-specific knowledge channels, such as a 'motor-related' channel, to support findings indicating greater impairments for certain more 'motor-related' or 'manipulable' items (such as tools or kitchen utensils) compared to larger manmade objects (such as vehicles) (Warrington and McCarthy, 1987).
A number of studies have emphasized the importance of intercorrelations amongst individual semantic features. This intercorrelation theory states that concepts are represented as patterns of activation over multiple semantic properties within a unitary distributed system. The theory is appealing in that it does not rely on damage to specific subtypes of attribute (visual, associative, motor) to produce category-specific deficits (Caramazza et al., 1990; Caramazza and Shelton, 1998; Tyler and Moss, 2001). Still another account holds that living items contain a larger number of structurally similar exemplars (e.g. many different types of trees), requiring a more fine-grained visual analysis for successful recognition (Sartori et al., 1993). In other words, it could be inherently more difficult to visually recognize living things compared to non-living things. This view of the structural description system, and its account of category-specific impairments, is consistent with work on normal subjects and with animal studies (Gaffan and Heywood, 1993).
In line with these findings, a more recent study by Panis et al. (2017) suggested that category-specific impairments may be explained by a deficit in recurrent processing between different levels of visual processing in the inferotemporal cortex. According to them, category-specificity has a perceptual nature, and the direction can shift, depending on perceptual demand. High structural similarity between stored exemplars might be beneficial for integrating local elements and parts into whole representations because the global and local features of these exemplars are more stable and more highly correlated in the real world than the features from categories with low structural similarity. At the same time, however, high structural similarity may be harmful for matching or precise recognition operations, because there may be more competition between the activated representations (Gerlach, 2009).
Here it is important to note that different tasks have been used to evaluate patients' ability to recognize objects from different categories. Category-specific impairments have been established using both semantic memory experiments and visual recognition tasks at different levels (picture naming, picture-word matching, categorization). The differences in perceptual demand for these tasks (i.e. the perceptual information on which they depend) might underlie the differences in category-specificity that have previously been found.
In the current study, we investigate the ability of a brain-injured patient, who is believed to have a category-specific impairment of semantic memory, to perform scene- and object-categorization tasks (Fig. 1). This patient, MS, has played a crucial role in the development of theories on category-specificity, showing a very clear category-specific deficit on semantic category fluency tests in previous studies. He has been shown to perform better than control participants on non-living categories and significantly worse on living items (Young et al., 1989). A recent study showed that his impairments have remained unchanged for more than 40 years (de Haan et al., 2020). MS' problems with living items relative to non-living ones are apparent across a variety of tasks, including mental imagery, retrieval of information and visual recognition (Mehta et al., 1992). However, there is a striking dissociation between MS' preserved ability to access information about category membership in an implicit test (by priming identification of living and non-living items with related category labels), where there is no difference between the categories, and his severe problems in accessing such information in an explicit test. These findings suggest that this is an "access" rather than a "storage" problem. The question remains as to whether MS can access stored representations of visual stimuli and, if so, what the relationships are between perceptual demand, recognition and semantic memory.
In the current study, two types of questions were addressed. The first, asked in order to investigate whether the category-specific impairment depends on semantic category or on perceptual factors, concerned MS' ability to categorize visual images as depicting naturalistic vs. manmade scenes (experiment 1) and objects (experiment 2). Our findings show a difference between the two tasks, with better performance for manmade objects compared to naturalistic objects (as is usually the case), and better performance for natural scenes compared to manmade scenes.
The second question concerned the type of information that might underlie the observed behavior. There is a large body of research looking into semantic category-specific impairments using computational modeling and different ways of implementing artificial damage (Guest et al., 2020). For example, studies using connectionist simulations have shown how the distinctiveness of functional features was related to perceptual features varying across semantic categories and how damage may lead to patterns of impairments for different types of information across varying semantic categories (Tyler et al., 2000). Recently, a class of computational models, termed deep convolutional neural networks (DCNNs), inspired by the hierarchical architectures of ventral visual streams, demonstrated striking similarities with the cascade of processing stages in the human visual system (Cichy et al., 2016; Güçlü and van Gerven, 2015; Khaligh-Razavi and Kriegeskorte, 2014). In particular, it has been shown that internal representations of these models are hierarchically similar to neural representations in early visual cortex (V1-V3), mid-level (area V4), and high-level (area IT) cortical regions along the ventral stream. Therefore, we evaluated performance of different DCNN architectures (all ResNets, with varying depth) and compared it to MS' behavior. Moreover, 'adding lesions' to higher-order layers of a DCNN, by removing certain blocks, resulted in similar response patterns with decreased performance for manmade scenes (experiment 1) and natural objects (experiment 2).
Altogether, results from the current study indicate that, at least in specific cases such as MS, category-specific impairments can be explained by perceptual aspects of exemplars within different categories, rather than semantic category-membership.

Fig. 1. Stimuli and experimental paradigm. A) Exemplars of the two categories in experiment 1 (manmade vs. natural scenes) and experiment 2 (manmade vs. natural objects). B) Experimental design. After a 2000 ms blank screen, the stimulus was shown for 100 ms, followed by a 400 ms blank screen. Then, the image reappeared, and MS was asked to categorize the stimulus by pressing the corresponding button.

Case history
MS is a former police cadet who contracted herpes encephalitis in 1970 (for a full case description, see also Newcombe and Ratcliff, 1975; Ratcliff and Newcombe, 1982). Most of the ventral temporal cortex of both hemispheres was destroyed, extending to occipital cortex on the right, leaving him with a complete left homonymous hemianopia. He suffers from achromatopsia (Chadwick et al., 2019; Mollon et al., 1980), has severe object agnosia and prosopagnosia, but is able to read accurately. His comprehension of what he reads is affected by an impairment of semantic memory. His semantic memory impairment is more marked for living than for non-living things (Young et al., 1989; de Haan et al., 2020).
Anatomical scans (Smits et al., 2019) revealed an, at least partially, intact primary visual cortex (V1) in both hemispheres. Further inspection of the anatomical scan suggests that this part of cortex in the right hemisphere, that could consist of parts of V1 to V4, is disconnected from subsequent cortical areas.

Stimuli: scenes
240 images (640 × 480 pixels, full-color) of real-world scenes were obtained from a previous unpublished study by Chow-Wing-Bom et al. (2019). Of these 240 images, 80 images were labeled natural (>90% naturalness rating in an independent experiment), 80 images were man-made (<10% naturalness rating) and 80 images were ambiguous (between 10 and 90% naturalness rating). Ambiguous trials were collected for a different purpose and were not analyzed in the current study. In total, 160 trials were thus used for further analysis. The stimulus set contained a wide variety of different outdoor scenes including beaches, mountains, forests, streets, buildings and parking lots. Most of the scenes (presented at ~12.3°) contained objects. For example, more than 50% of the manmade scenes contained a vehicle, mostly bicycles and cars. For the naturalistic scenes, more than 25% of the scenes contained an animal.
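The labeling rule for the scene stimuli can be written as a small function. This is a minimal sketch only: the names `label_scene` and `select_analysis_trials` are hypothetical, and treating ratings of exactly 10% or 90% as ambiguous is an assumption, since only the strict thresholds are stated.

```python
def label_scene(naturalness_pct):
    """Map a naturalness rating (0-100 %) to a stimulus label."""
    if not 0 <= naturalness_pct <= 100:
        raise ValueError("rating must be a percentage between 0 and 100")
    if naturalness_pct > 90:
        return "natural"
    if naturalness_pct < 10:
        return "manmade"
    # Assumption: boundary values (exactly 10 or 90) count as ambiguous.
    return "ambiguous"


def select_analysis_trials(ratings):
    """Keep only unambiguous images; `ratings` maps image name -> rating."""
    labeled = [(name, label_scene(r)) for name, r in ratings.items()]
    return [(name, label) for name, label in labeled if label != "ambiguous"]
```

Applied to a stimulus set with 80 images in each of the three classes, such a filter would leave the 160 analyzed trials.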

Experimental design
During the experiments, stimuli were presented for 100 ms, followed by a 500 ms blank screen. This rapid presentation was included for a different purpose (to evaluate EEG responses after a first brief presentation). Then, to allow for reliable behavioral responses, the stimulus reappeared for 2000 ms and MS was asked to categorize the image as accurately as possible using one of two corresponding response buttons. For the scenes, MS was asked to categorize images as being "manmade" or "natural". For the objects, the two categories were "animate" vs. "inanimate". Each image was shown only once in both experiments, resulting in 160 trials for experiment 1, and 80 trials in experiment 2. Stimuli were presented in a randomized sequence, at eye level, in the center of a 23-inch ASUS TFT-LCD display (1920 × 1080 pixels, at a refresh rate of 60 Hz), while MS was seated approximately 70 cm from the screen. The task was programmed in, and performed using, Presentation (Version 18.0, Neurobehavioral Systems Inc., Berkeley, CA, www.neurobs.com). After every 40 trials there was a short break. During the task, EEG was recorded.

Statistical analysis: behavioral data
Choice accuracy and reaction times were computed for each condition (Fig. 2). Differences between the conditions were tested using two-tailed permutation testing with 5000 permutations. In each iteration, we randomly reassigned all items to two samples x and y and computed the difference between the means of sample x and sample y. We repeated this until all permutations were evaluated and stored the differences. The p-value was computed by taking the number of times the stored differences were at least as extreme as the original difference, divided by the total number of permutations. Here, the p-value is defined as the probability, given that the null hypothesis (no difference between the conditions) is true, of obtaining results that are at least as extreme as the results we observed. In each iteration, all samples were taken into account (resampling affected only the assignment of values to condition groups). Behavioral data were analyzed in Python using the following packages: Statsmodels, SciPy, NumPy, Pandas and Seaborn (Jones et al., 2001; McKinney and Others, 2010; Oliphant, 2006; Seabold and Perktold, 2010).
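The permutation procedure described above can be sketched in a few lines of Python. This is an illustrative version using only the standard library, not the analysis code from the study; the function name `permutation_test`, the fixed seed and the small floating-point tolerance are assumptions made for the example.

```python
import random


def permutation_test(x, y, n_permutations=5000, seed=0):
    """Two-tailed permutation test on the difference in means of x and y."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    n_extreme = 0
    for _ in range(n_permutations):
        # Reassign all items to the two groups at random (no resampling).
        rng.shuffle(pooled)
        diff = mean(pooled[:n_x]) - mean(pooled[n_x:])
        # Two-tailed: count differences at least as extreme in either direction.
        if abs(diff) >= abs(observed) - 1e-12:
            n_extreme += 1
    return observed, n_extreme / n_permutations


# Made-up per-trial accuracy (1 = correct) for two conditions.
natural = [1] * 70 + [0] * 10
manmade = [1] * 55 + [0] * 25
difference, p_value = permutation_test(natural, manmade)
```

Because the p-value is the fraction of permuted differences at least as extreme as the observed one, its resolution is limited to 1/5000 with the number of permutations used here.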

Deep convolutional neural networks (DCNNs)
First, to evaluate how many layers were sufficient to accurately perform the categorization tasks, tests were conducted on four deep residual networks (ResNets; He et al., 2016) with an increasing number of layers: ResNet-6, ResNet-10, ResNet-18 and ResNet-34. We selected ResNets because they contain skip connections. Skip connections in deep architectures bypass one or more layers in the neural network and feed the output of one layer as input to later layers (instead of only the next one). This allows us to 'damage' or remove one layer while still evaluating the output of the network. Pre-trained networks were fine-tuned (40 epochs) to perform either the manmade vs. natural scene categorization task, or the object categorization task, using PyTorch (Paszke et al., 2019). Each model was initialized five times with different seeds to allow statistical analyses. For ResNet-10, the most shallow network that was able to successfully perform the task (>95% accuracy on all conditions), we evaluated categorization performance after 'lesioning' higher-order layers. To this end, we removed one of the 'building blocks', while keeping the skip connection intact.
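Why a residual network can still be evaluated after a building block is removed can be illustrated with a toy example. The sketch below is framework-free and is not the PyTorch implementation used in the study: the functions `linear` and `residual_block` are hypothetical, a matrix-vector product stands in for a convolution, and batch normalization and the projection used in downsampling skip connections are left out.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product standing in for a convolutional layer."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(v, w1, w2, lesioned=False):
    """Compute relu(x + F(x)); a lesion removes F and keeps the skip path."""
    if lesioned:
        # The residual branch contributes nothing; the input passes through.
        branch = [0.0] * len(v)
    else:
        branch = linear(w2, relu(linear(w1, v)))
    return relu([a + b for a, b in zip(v, branch)])


x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, -1.0]]
intact_output = residual_block(x, w1, w2)
lesioned_output = residual_block(x, w1, w2, lesioned=True)
```

With the branch removed, the block reduces to passing its (rectified) input forward, so later layers and the classifier still receive a valid activation, only without the transformation that the block would have added.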

Results
First, categorization performance (proportion correct) of MS was computed for both categorization tasks. Results from two-sample permutation tests with 5000 permutations indicated higher performance for natural scenes (experiment 1) and for manmade objects (experiment 2) (p = 0.007 and p = 0.016, respectively). Thus, in the scene categorization task, MS was significantly better at visually classifying natural environments than man-made environments. In contrast, on the object categorization task, he was significantly better at assigning the manmade objects to the correct category compared to the natural objects.
ResNet-10, -18 and -34 all showed virtually perfect performance for both tasks, for all categories (Fig. 3). For the most shallow network, ResNet-6, there was a slight decrease in performance, specifically for manmade scenes (experiment 1) and natural objects (experiment 2) (p = 0.02 and p = 0.03, respectively). Overall, these results indicate that performance of a shallow ResNet-6 may decrease in a fashion similar to that of MS. This supports the idea that performance is decreased for specific categories because those stimuli (in our dataset) are more difficult. Still, even for a shallow ResNet-6, the two-option categorization tasks seem too easy.
Finally, we evaluated the performance of ResNet-10 after 'lesioning' higher-order layers (Fig. 4A). We focused on ResNet-10 because this network was the most shallow network to perform virtually perfectly on both tasks (without lesion), indicating the network's ability to categorize all conditions. In order to mimic lesions to higher-order areas in the visual processing stream, we removed connections to the final building block of the network (Block 4). Permutation tests with 5000 permutations between ResNet-10 without and with lesion indicated a decrease in performance after elimination of higher-order layers, specifically for manmade (experiment 1) and natural (experiment 2) images (both p < 0.001). For natural scenes, there was a slight increase in performance after the removal of higher-order layers (p = 0.023).
To assess whether the networks also perform worse for trials that MS answered incorrectly, we split the data into two groups (trials that MS answered correctly vs. trials that MS answered incorrectly) and computed the networks' performance for both groups. Results indicated that for experiment 1 (manmade vs. natural scenes) ResNet-10 made more errors for images that were answered incorrectly by MS, both with and without a lesion. For experiment 2 (objects), although there were more errors for trials answered incorrectly by MS, there was no significant difference. Overall, these results indicate that there is an overlap in the trials that are 'difficult' for MS and the networks.
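The trial-split comparison described above amounts to computing network accuracy separately for trials that MS answered correctly and incorrectly. A minimal sketch, assuming per-trial correctness is available as two aligned lists (the function name and the example data are hypothetical):

```python
def accuracy_by_patient_outcome(patient_correct, network_correct):
    """Network accuracy on trials the patient got right vs. got wrong."""
    if len(patient_correct) != len(network_correct):
        raise ValueError("both lists must describe the same trials")
    groups = {True: [], False: []}
    for p_ok, n_ok in zip(patient_correct, network_correct):
        groups[bool(p_ok)].append(1 if n_ok else 0)

    def mean(values):
        # An empty group has no defined accuracy.
        return sum(values) / len(values) if values else float("nan")

    return {
        "patient_correct": mean(groups[True]),
        "patient_incorrect": mean(groups[False]),
    }


# Made-up example: five trials, 1 = correct response.
result = accuracy_by_patient_outcome([1, 1, 1, 0, 0], [1, 1, 1, 1, 0])
```

A lower network accuracy in the `patient_incorrect` group than in the `patient_correct` group would correspond to the overlap in 'difficult' trials reported here; whether the difference is reliable still has to be tested, for example with a permutation test.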
Lesions in earlier layers of the network (blocks 1-3) resulted in a strongly biased response, in which the network generally classified all images as belonging to the same category (Supplementary Figure 1). The direction of this bias was variable across different initializations, suggesting that the earlier layers are crucial to obtain a useful representation, and the bias was not caused by the current stimulus set.

Discussion
We evaluated the extent to which MS' ability to recognize visual information shows selective impairments for semantic categories. Our findings show a dissociation between two associated tasks (categorization of manmade vs. natural scenes and objects), with impaired performance for natural objects (as is usually the case), and better performance for naturalistic scenes compared to manmade scenes. For the objects, performance was primarily impaired for insects. Overall, these results indicate that the category-specific effects, at least for patient MS, are better explained as a visual impairment, arguing against the idea that this is a purely semantic disorder (i.e. one explained by category membership only). This is in line with earlier findings from Young et al. (1989), and suggests that, similar to findings by Gerlach (2001) and Låg (2005), the direction of category effects depends to a large extent on the degree of perceptual differentiation called for. Using Deep Convolutional Neural Networks as 'artificial animal models' (Scholte, 2018), we further explored the type of information that might underlie such behavior. Overall, shallow DCNNs and DCNNs with 'lesions' in higher-order areas showed similar response patterns, with decreased performance for manmade (experiment 1) and natural (experiment 2) categories. While there was some overlap in the type of errors, there were differences as well. On the subordinate level, the deficit in the object recognition task was mostly restricted to the subcategory of insects for MS, whereas lesioned ResNet-10 performance was impaired for both mammals and insects.
While DCNNs trained to classify images contain mappings from labels to certain objects, they do not contain semantic knowledge about the images and the objects that are depicted. The similarity in response patterns strengthens the notion that the category-specific effects are driven by visual properties. However, it is important to note that in the current study only one architecture type (ResNet) and one approach to modelling damage (removing a block) were evaluated. In previous studies simulating 'lesions' in computational models there has been considerable variability in the way in which damage has been implemented (Guest et al., 2020), in some cases resulting in different response patterns. Whether the current effects hold for different model architectures or different implementations of damage requires future experimental study. The extent of the damage in MS' brain makes it difficult to model or simulate damage to a specific brain region or stage in the visual processing stream. In the current study, we therefore do not focus on the exact location of the damage. Future research, comparing patients with lesions in different parts of the visual cortex, will be necessary to evaluate a more precise mapping between different types of damage in visual cortex and in ResNet layers.

Category selectivity in the visual ventral stream
There is an ongoing debate on the emergence of category selectivity in the visual ventral stream of healthy subjects. A popular view is that observed category effects indicate a high-level representation in which neurons are organised around either object category or correlated semantic or conceptual features (Konkle and Oliva, 2012; Kriegeskorte et al., 2008; Mahon et al., 2009). An alternative view is that categorical responses in the ventral stream are driven by combinations of more basic visual properties that covary with different categories (Andrews et al., 2015; Long et al., 2018). The conflation of visual and categorical properties in object images means that category-selective responses could be expected under both accounts. Results from the current study do not speak to these findings, nor do they support or exclude the possibility of object category-selective responses driven by categorical or semantic properties. However, these findings do indicate that in object recognition impairments (following brain damage to certain regions), category-selectivity can emerge based on basic visual properties.

Object representations in IT
A question that remains unresolved in this study is which visual features might be involved in classification of the different categories, i.e. which dimensions in stimulus or object space are utilized by MS. Recent work by Bao et al. (2020) shows that specialization of different categories in certain regions in IT can be explained by two dimensions, progressing from animate to inanimate (dimension 1), and from more stubby to spiky (dimension 2). Following these dimensions, lesions to different parts of IT should lead to agnosias in specific sectors of object space. For example, the observation that MS specifically does not recognize insects (which are generally more 'spiky' than mammals, Supplementary Figure 2) as being animate might be explained by a disturbed 'spiky animate corner' in object space.

Effect of typicality on category-membership decisions
The typicality of a target object is known to influence category-membership decisions (Shoben, 1982). For a given semantic category, the more typical members can be accepted as belonging to that category more quickly than less typical members. In earlier studies, MS also showed faster reactions to more typical exemplars (Young et al., 1989). However, on top of this 'typicality effect', MS showed faster responses to non-living things than living things. In the current study, performance on experiment 2 was decreased primarily for insects (Supplementary Figure 2). One explanation could be that insects are less typical for the 'natural' condition than mammals, and therefore performance was decreased for these images.

Objects vs. scene categorization
While object and scene recognition both require mapping of low-level incoming sensory information to high-level representations and semantic knowledge, perceiving a scene may involve different types of information than recognition of objects. Previous work indeed suggests that patients with visual form agnosia may employ different visual properties for the classification of scenes. The unexpected observation in the current study, that performance was worse for manmade scenes compared to natural scenes, has been shown earlier in an individual with visual form agnosia (Steeves et al., 2004). DF, who has a profound deficit in object recognition but spared color and visual texture perception, could classify scenes and was fastest for natural scenes. The fact that natural scenes contained more predictable color, texture information, and spatial structure than non-natural scenes (Oliva and Torralba, 2001; Burton and Moorhead, 1987) could potentially explain why performance was increased for those images.

Conclusion
Overall, these results suggest that semantic impairments for certain categories can, at least in MS' case, be explained by differences in perceptual demand and visual features, rather than by a deficit in semantic memory. The pattern of results seems to indicate that the task and stimuli that one uses have a strong influence on the diagnostic visual features that can be used for a certain task, and that those factors may lead to different, opposing results in terms of semantic category-specific deficits. Because of the differences between the two tasks we cannot pinpoint exactly what visual properties drive our effects. There are several possible explanations. First, it could be that MS' performance is impaired for images containing certain visual properties, such as 'spiky' features and straight lines. Second, it could be that MS' performance is impaired for certain images because they are perceptually more difficult based on other image properties, such as the image complexity, viewpoint or prototypicality. The stimuli used in the current experiment were not controlled for such factors, and future research is needed to determine the exact underpinnings of the impairment in the specific conditions. The finding that deep neural networks (which have no semantic knowledge, apart from adding a label to a visual image) show the same effects for shallow models and models with lesions in higher layers supports the notion that the effects are driven by visual features and not semantic category.

Data and code availability
Data and code to reproduce the analyses in this article are available at the Open Science Framework (#9h7mf) and at https://github.com/noorseijdel/2020_Object_agnosia.

Declaration of competing interest
The authors declare no competing financial interests.