Elsevier

NeuroImage

Volume 23, Issue 1, September 2004, Pages 156-166

Combinatorial codes in ventral temporal lobe for object recognition: Haxby (2001) revisited: is there a “face” area?

https://doi.org/10.1016/j.neuroimage.2004.05.020

Abstract

Haxby et al. [Science 293 (2001) 2425] recently argued that category-related responses in the ventral temporal (VT) lobe during visual object identification were overlapping and distributed in topography. This observation contrasts with prevailing views that object codes are focal and localized to specific areas such as the fusiform and parahippocampal gyri. We provide a critical test of Haxby's hypothesis using a neural network (NN) classifier that can detect more general topographic representations and achieves 83% correct generalization performance on patterns of voxel responses in out-of-sample tests. Using voxel-wise sensitivity analysis we show that substantially the same VT lobe voxels contribute to the classification of all object categories, suggesting the code is combinatorial. Moreover, we found no evidence for local single category representations. The neural network representations of the voxel codes were sensitive to both category and superordinate level features that were only available implicitly in the object categories.

Introduction

How does the brain encode and represent objects? Functional brain imaging has revealed that the human ventral object vision pathway has a complex functional architecture. Different categories of objects evoke different patterns of response in these cortices. Based on standard methods for analyzing and interpreting functional brain imaging results, these patterns are usually described in terms of the locations of regions that respond more strongly to one category, for example, faces, than to all others (Aguirre et al., 1998; Downing et al., 2001; Epstein and Kanwisher, 1998; Hasson et al., 2003; Ishai et al., 1999; Kanwisher et al., 1997; McCarthy et al., 1997). In previous work, however, Haxby et al. (2001) showed that category-related information is also carried by weaker responses in these patterns of response and proposed that strong and weak responses may all play an integral role in the representation of objects. Thus, the representations for multiple categories overlap because a strong response to one category and intermediate or weak responses to other categories in the same piece of cortex are all parts of the representations for these categories. Such representations have an essentially unlimited carrying capacity by virtue of the number of combinatorial possibilities. By contrast, representations based on localized processors or modules, identified by maximal response to the objects for which they are specialized, are limited by the number of category-dedicated regions that can fit into a cortical space.

The similarity method of Haxby et al. (2001) was intended as a demonstration of a concept, designed to measure category-related, distributed patterns of response, but it was inefficient and insensitive to the range of possible distributed codes. It and other analyses (Spiridon and Kanwisher, 2002) have also confused category identification with feature (cortical response) sensitivity, making it unlikely that functional areas could be uniquely identified (cf. Bartels and Zeki, 2004). Others have since applied various multivariate methods for analyzing distributed patterns of response in functional magnetic resonance imaging (fMRI) data sets, such as linear discriminant analysis (Carlson et al., 2003) and support vector machines (Cox and Savoy, 2003). All of these methods examine a form of information in fMRI data that is overlooked in standard methods of analysis (Friston et al., 1994). The usual statistical methods analyze the temporal course of response in each voxel independently of all other voxels, then search for clusters of voxels with similar responses. By contrast, these multivariate methods explicitly analyze how the response varies across clusters of voxels and how these patterns of response, or landscapes, change with cognitive or perceptual state (see Haxby, in press). These types of methods could be used to detect representations that involve specific local codes that index a compact region (cf. Fodor, 1983), perhaps varying in shape or size, or probabilistic maps that vary in intensity over the region in a distributed and possibly overlapping way. There are four logical possibilities for such coding schemes: (1) spatially local or compact codes that indicate the presence or absence of a type of object; (2) spatially local or compact codes that also indicate the “likelihood” of the object type; (3) distributed codes that are non-overlapping and hence act as a potential local code but are distributed through the region in a unique pattern (these types of codes could also vary in intensity); and (4) distributed codes that are either partially or completely overlapping and vary in intensity. The case of a completely overlapping distributed code is often called a combinatorial code. Such codes depend only on the pattern of activity, in which each subregion of the landscape responds in a continuous way to create an object code. Activity in a subregion is therefore more like a specific value that a variable can take on than like a likelihood or intensity measure that indicates the strength of a response in a specific patch or set of patches.

The method of Haxby et al. (2001) measured the similarity of a pattern of response to a template, defined individually for each subject, using a correlation coefficient as the index of similarity. Briefly, the data are divided into statistically independent halves, the patterns of response in each half of the data to each category are calculated, and correlations between these patterns are used as indices of the replicability of the pattern of response to each category (within-category correlations) and the confusability of patterns of response to different categories (between-category correlations). This correlation method is a test of whether a replicable pattern of response in one experimental condition exists that is significantly different from the pattern of response in another experimental condition. To test whether the information carried by a pattern of response resided only in the cortex that responded maximally to one category, the patterns of response to two categories were compared with the cortex that responded maximally to either category excluded from the analysis.
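
The split-half correlation idea described above can be illustrated with a minimal sketch. It is a simplification, not the original analysis: it assigns each category pattern from one half of the data to the best-correlated pattern in the other half, whereas the original method compared within-category and between-category correlations. The function names (`pearson`, `split_half_identify`) and all voxel values are hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of voxel responses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_identify(half_a, half_b):
    """For each category pattern in half_a, return the half_b category
    whose pattern correlates most strongly with it."""
    return {
        cat: max(half_b, key=lambda other: pearson(pattern, half_b[other]))
        for cat, pattern in half_a.items()
    }

# Hypothetical mean responses over five voxels (made-up numbers, not real data).
odd_runs = {
    "face": [2.0, 0.5, 1.0, -1.0, 0.2],
    "house": [-0.5, 1.8, 0.1, 1.2, -0.9],
}
even_runs = {
    "face": [1.7, 0.3, 1.2, -0.8, 0.0],
    "house": [-0.2, 2.1, -0.1, 0.9, -1.1],
}

result = split_half_identify(odd_runs, even_runs)
print(result)
```

A category counts as correctly identified here when its within-category correlation exceeds every between-category correlation, which is the comparison the correlation method relies on.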

Previous methods of topographic pattern analysis, however, have not provided an unbiased test of whether the patterns of response are most consistent with a distributed or a localist code for the representation of faces and objects. We decided, therefore, to reanalyze data from the experiment of Haxby et al. (2001) with a neural network (NN) classifier that could detect either a localized or a distributed code with no initial bias toward either. Neural networks are nonlinear response functions that consist of “nodes”, each of which possesses both an input function and an activation function. The input function defines the integration of inputs to the node; typically this function is a weighted average (dot product) over the input values (in this case voxel values). The activation function, or output function, defines the transformation of the net input from the input function into a “rate of firing”. Often, such a function is sigmoidal in nature, such as a logistic function, so that low net input is transformed to low response rates and high net input is transformed to high response rates. These outputs, which typically vary between zero and one, can also be used to indicate the “likelihood” of a given input vector. Feed-forward neural networks often have layers of intermediate nodes, which are known to make them universal approximators (Hanson and Burr, 1990; Hornik et al., 1989). Because of their broad approximation powers, NNs can detect locally contiguous inputs, “patches”, that are consistent across training examples, or widely dispersed inputs that have no obvious spatial contiguity.
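
The node computation described above (a dot-product input function followed by a logistic activation function) can be sketched as follows. This is an illustrative forward pass only, with no training step; the layer sizes, weights, biases, and voxel values are hypothetical and are not taken from the networks used in the study.

```python
import math

def logistic(net):
    """Sigmoidal activation: maps any net input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

def node_output(inputs, weights, bias=0.0):
    """Input function (dot product of weights and inputs, plus a bias)
    followed by the logistic activation function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(net)

def feed_forward(voxels, hidden_layer, output_layer):
    """One hidden layer of intermediate nodes feeding a layer of output nodes.
    Each layer is a list of (weights, bias) pairs, one pair per node."""
    hidden = [node_output(voxels, w, b) for w, b in hidden_layer]
    return [node_output(hidden, w, b) for w, b in output_layer]

# Hypothetical example: three voxel inputs, two hidden nodes, two output nodes.
voxels = [0.5, -1.2, 0.8]
hidden_layer = [([1.0, -0.5, 0.2], 0.0), ([-0.3, 0.8, 1.0], 0.1)]
output_layer = [([2.0, -1.0], 0.0), ([-2.0, 1.0], 0.0)]

outputs = feed_forward(voxels, hidden_layer, output_layer)
print(outputs)  # one value between 0 and 1 per output node
```

In a classifier of this kind, each output node would typically stand for one object category, and the largest output would be read as the predicted category.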

In addition to providing an unbiased comparison of distributed versus localist models for category-related patterns of response, NN classifiers also offer a more general method for detecting topographic patterns than the correlation method. Because the method of Haxby et al. used correlation as the measure of pattern similarity, the weight given to a single voxel is based on the deviation of the response in that voxel from the mean response across voxels rather than on the discriminating power of that voxel. By contrast, NN classifiers adjust the weight assigned to each voxel to maximize discriminatory power. Therefore, NN classifiers have the potential to detect the more exact form of the topographic pattern.

NN classifiers also address another shortcoming of the correlation method, namely the uncertainty about the precise extent of response pattern overlap. Haxby et al. showed that the pattern of response to an object category was highly specific to that category even when the analysis was restricted to cortex that responded maximally to other object categories. Also apparent from the correlation analysis were extensive negative correlations between categories, suggesting a potential network of associations between object categories that were primarily inverse relationships in activation, ones that could form an associative basis. These results suggested that information about multiple categories is distributed in overlapping representations, but the analysis was not an exhaustive test of whether each voxel contributes information to the representation of all categories. It is possible that nonmaximal responses in a piece of cortex carry information about only one or two categories in addition to the category that elicits the maximal response. Such a representational scheme, therefore, would be localized to scattered, small cortical patches that have some degree of category specificity. With NN classifiers, we can apply a sensitivity analysis to determine whether each individual voxel contributed to the classifier for each category and, thus, make an exact quantitative estimate of the extent of response pattern overlap. This kind of analysis adds noise to an input voxel after the NN has been trained to optimal generalization performance. As the noise added to a specific voxel input increases, the classification error of the trained NN is monitored; a significant increase in error under small perturbations indicates that the voxel is contributing to the overall classification performance. In this way, each voxel can be “queried” as to its contribution to the specific object identity.
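
The noise-based sensitivity analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the classifier is a hypothetical stand-in for a trained network, the noise is Gaussian with a fixed standard deviation, the error is pooled over all categories rather than tracked per category, and all names and values are invented for the example.

```python
import random

def error_rate(classify, patterns, labels):
    """Fraction of patterns that the classifier labels incorrectly."""
    wrong = sum(classify(p) != y for p, y in zip(patterns, labels))
    return wrong / len(patterns)

def voxel_sensitivity(classify, patterns, labels, noise_sd=1.0, repeats=50, seed=0):
    """For each voxel, add Gaussian noise to that voxel alone and return the
    mean increase in classification error relative to the clean inputs."""
    rng = random.Random(seed)
    baseline = error_rate(classify, patterns, labels)
    increases = []
    for v in range(len(patterns[0])):
        total = 0.0
        for _ in range(repeats):
            noisy = [
                p[:v] + [p[v] + rng.gauss(0.0, noise_sd)] + p[v + 1:]
                for p in patterns
            ]
            total += error_rate(classify, noisy, labels)
        increases.append(total / repeats - baseline)
    return increases

def toy_classify(pattern):
    """Hypothetical stand-in for a trained network: it reads only voxel 0."""
    return "face" if pattern[0] > 0 else "house"

# Made-up three-voxel patterns and their labels.
patterns = [[1.0, 0.3, -0.2], [0.8, -0.4, 0.5], [-1.0, 0.2, 0.1], [-0.7, -0.6, 0.3]]
labels = ["face", "face", "house", "house"]

sens = voxel_sensitivity(toy_classify, patterns, labels)
print(sens)  # voxel 0 shows an error increase; voxels 1 and 2 show none
```

With a real trained network, a voxel whose perturbation leaves the error unchanged would be read as not contributing to classification, while an error increase under small noise would mark it as part of the code.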

In the present research, we therefore ask two basic questions: can we show improvement in out-of-sample generalization, and can we identify the object code in the temporal lobe more precisely? Specifically, the kinds of codes that we investigate in this paper are a special case of more general topographic codes: ones in which differential intensities in some fixed spatial pattern code for objects, similar to a piano on which the same set of keys is played but with different amplitude modulation, thus producing unique output with the same keys. From a computational point of view, this might be the simplest type of code to implement that is efficient, high capacity, and rapidly extensible. In the next sections, we examine this specific coding hypothesis and provide results for the neural network classifiers.

Data acquisition

The data consisted of 64 slices 64 × 40 BOLD collected from a GE 3T scanner (repetition time = 2500 ms, forty 3.5-mm-thick sagittal images, field of view = 24 cm, echo time = 30 ms, flip angle = 90°). We used 7–10 slices from this set, together with the feature masks that Haxby had used for his correlation analysis. Haxby had performed feature selection by thresholding for high-variance voxels, which created slice masks for 7–10 slices with 5–150 voxels per slice (500–600 voxels per volume), depending on the subject.

Experimental procedures (Haxby's original procedure from Science 2001)

Patterns

Voxel or feature properties

As described above in the Methods section, Haxby defined voxel masks in the ventral temporal area that resulted in approximately 300–600 voxels (features), depending on the subject. We converted the voxel intensities in these sets to z scores (demeaned and normalized to standard deviation in each time series) and examined their distributions. Fig. 2 shows a typical subject's frequency distribution over the voxel set. For all six subjects, we found no evidence of significant modes
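
The z-score conversion mentioned above (demeaning each voxel's time series and dividing by its standard deviation) can be sketched as follows. The use of the population standard deviation is an assumption, since the text does not say which estimator was used, and the intensity values are made up.

```python
import statistics

def zscore(series):
    """Demean one voxel's time series and scale it to unit standard deviation.
    Uses the population standard deviation (an assumption for this sketch)."""
    mean = statistics.mean(series)
    sd = statistics.pstdev(series)
    return [(x - mean) / sd for x in series]

# Hypothetical raw intensities for one voxel across six scans.
raw = [1012.0, 1005.0, 998.0, 1020.0, 1001.0, 1009.0]
z = zscore(raw)
print(z)
```

After this step every voxel's time series has zero mean and unit variance, so voxels with different baseline intensities can be compared on a common scale.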

Discussion

We have reanalyzed the Haxby et al. (2001) object recognition data using feed-forward neural networks and shown significant out-of-sample generalization performance (82.5%) on scans between blocks of stimulus trials. Networks performing a potential compression of 50:1 of voxels to hidden units were able to correctly classify and recognize all (672) tokens based only on individual scans, indicating that voxel variation alone can be used to code for objects that human subjects are visually

Acknowledgements

This research was supported by a McDonnell Foundation Grant to S. Hanson and NSF ITR Grant EIA-0205178. We wish to thank Maggie Shiffrar and Catherine Hanson for providing feedback on earlier versions of this paper.

References (26)

  • T.A. Carlson et al. Pattern of activity in the categorical representations of objects. J. Cogn. Neurosci. (2003)
  • P. Downing et al. A cortical area selective for visual processing of the human body. Science (2001)
  • R. Epstein et al. A cortical representation of the local visual environment. Nature (1998)