Introduction

Perceiving the facial expressions of other individuals plays a critical role in the social life of primates, including humans. Two neural pathways, the ventral cortical pathway and the subcortical pathway, contribute to this perceptual ability1,2 (Fig. 1A). The ventral cortical pathway consists of a network of areas in the occipito-temporal region of the cerebral cortex and processes a variety of visual features of objects, people, and environments, including shape, color, texture, material properties, and binocular disparity3,4,5,6,7,8,9,10. Neurons that preferentially respond to images of faces or facial features are found in several clusters along this pathway11,12,13,14,15,16,17. They constitute the neural system that analyzes facial details such as expression, identity, and direction of attention. The subcortical pathway consists of a few processing stages in phylogenetically ancient regions: the superior colliculus in the midbrain, the pulvinar nucleus in the posterior thalamus, and the amygdala in the medial limbic system. The subcortical pathway is suggested to mediate rapid behavioral and physiological (autonomic) responses to sensory signals related to possible dangers such as fearful faces1,18,19 (for a critical review, see20). The ventral cortical pathway and the subcortical pathway intersect at the amygdala.

Figure 1

Shallow neural networks (SNNs) and modifications. (A) Cortical and subcortical visual pathways for processing facial expressions in the primate brain. AMYG amygdala, AIT, PIT anterior and posterior parts of inferior temporal cortex, LGN lateral geniculate nucleus, PUL pulvinar nucleus, SC superior colliculus, V1, V2, V4 visual areas 1, 2, 4. (B) A schematic illustration of the SNNs and full-replacement models. In the full-replacement models, processing layers were added, the filters in the initial layer were replaced with Gabor filters, and the range of pooling was narrowed. (C) DoG filters for the SNNs and DoG models (upper) and Gabor filters for the Gabor models (lower). (D) The pooling range for the SNNs (5 × 5) and the narrow-pooling models (3 × 3). (E) Examples of presented face images with seven expressions (angry, disgusted, fearful, happy, sad, surprised, neutral) of two individuals (upper: female, AF06; lower: male, AM27). The original images were obtained from the Karolinska Directed Emotional Faces database21.

Psychological and brain imaging studies suggest that the subcortical pathway subserves the ability of some patients with lesions in the primary visual cortex (V1) to discriminate facial expressions despite lacking visual awareness22,23,24 (“affective blindsight”). These patients also reflexively exhibit specific facial expressions and pupillary reactions when exposed to fearful or happy faces25. Studies have also shown that the subcortical pathway supports unconscious face perception in neurologically healthy subjects26,27. Furthermore, the orienting of newborn babies’ heads toward faces or face-like patterns is suggested to be mediated by the subcortical pathway28,29 (but see30). Importantly, these perceptual abilities are not fully accurate, instead yielding modest performance at above-chance levels. These findings suggest that information on faces conveyed by the subcortical pathway is less accurate than that carried by the ventral cortical pathway.

Electrophysiological studies have demonstrated that processing of facial expression in the subcortical pathway is indeed fast and not very accurate. Méndez-Bértolo and colleagues31 showed that intracranial local field potentials in the human amygdala respond differentially to fearful faces versus other faces within 74 ms after stimulus onset. A recent single-neuron recording study in the monkey revealed that a population of amygdala neurons responded to threatening faces within 50 ms32. This early response, when combined across an ensemble of neurons, carries information that allows linear classifiers to discriminate threatening faces from neutral and affiliative faces. The rate of correctly discriminating the three expressions is approximately 50%; this is well above chance (33%) but significantly worse than perfect.

What architectural and physiological properties of the subcortical pathway are responsible for its fast, crude processing? The fast processing most likely arises from the small number, or “shallowness”, of processing stages in the subcortical pathway, given that the ventral cortical pathway consists of a larger number of regions (at least five before reaching the amygdala) than the subcortical pathway (only two) and that every transition from one cortical region to the next takes at least 10 ms33. It is unclear whether shallow processing similarly explains the low accuracy of the information transmitted by the subcortical pathway to the amygdala. This uncertainty arises from the fact that in addition to the difference in the number of processing stages, visual response properties differ markedly between the two pathways.

Neurons in the superior colliculus at the first stage of the subcortical pathway show circular receptive fields with center-surround antagonistic organization, which can be modeled using the difference-of-Gaussian (DoG) function34,35,36,37. By contrast, simple cells of V1 at the first stage of the cortical pathway have elongated receptive fields with side-by-side ON and OFF sub-regions, which can be modeled by two-dimensional Gabor functions38. Furthermore, the receptive field is typically larger in the superior colliculus (for the foveal field, 1.5°–10° in superficial layers, 10°–20° in deep layers)39,40 than in V1 (1.18° in simple cells, 1.3° in complex cells)41 and the extrastriate areas V2 and V442. Thus, spatial pooling across ascending stages occurs over a wider visual field area in the subcortical pathway than in the ventral cortical pathway.

Examining the impact of these properties of the subcortical pathway on facial expression processing is a vital step towards understanding facial expression recognition in various individuals, including those with immature or damaged visual cortices (such as neonates and people with affective blindsight) and animals with a less developed visual cortex. This examination helps establish a link between the processing carried out by the subcortical pathway and facial expression recognition. To this end, we constructed convolutional neural networks (CNNs) and analyzed their performance in facial expression discrimination. CNNs are one type of multilayer perceptron and can be optimized (“learn”) to classify inputs by varying connection weights between processing units through supervised learning algorithms43. Typical CNNs have several to tens of layers (deep neural networks, DNNs). DNNs developed for classifying visual objects share architectural and representational features similar to those of the ventral cortical pathway44,45,46,47. We designed our CNNs to imitate the subcortical pathway by reducing the number of processing stages and by implementing DoG-type receptive fields and a wider extent of pooling. These CNNs, hereafter referred to as shallow neural networks (SNNs), learned to discriminate facial expressions with modest correct rates. Replacing the three properties one by one, two at a time, or all three simultaneously with the corresponding properties in the ventral cortical pathway gradually improved discrimination performance, suggesting that all three properties are responsible for limiting the performance of the SNNs. We further showed that, similar to some neurons in the amygdala, a major group of units in the final processing layer of the SNNs were sensitive to spatial frequency (SF) in the retina-based reference frame as initially detected in the first processing layer, and that the three subcortical properties contribute to preserving retina-based SF sensitivity. The results characterize the computational processes that underlie the contribution of the subcortical pathway to recognition of facial expression.

Materials and methods

All methods were carried out in accordance with relevant guidelines and regulations.

Architecture of SNNs

The design and performance of CNN models are primarily influenced by the number of stages and the processing characteristics implemented within these stages. Convolutional filters play an essential role in convolutional processing, whereas window sizes are critical for pooling. With this in mind, we developed SNNs that incorporate the three essential subcortical properties (Fig. 1B–D; Table 1). These properties include shallow processing stages, DoG-type organization of receptive fields at the initial processing stage, and larger pooling size.

Table 1 Architecture of the Shallow Neural Network (SNN). Each row describes a layer i with calculation operator Fi, output resolution Hi × Wi, and the number of output channels Ci. Conv denotes convolution layer, and Pool denotes max pooling layer.

Unlike typical DNNs, the SNNs consisted of only two sets of convolution and pooling layers followed by two fully connected layers (FC1, FC2), approximating the small number of processing stages of the subcortical pathway. The first convolution layer incorporated 32 DoG-type filters (Fig. 1C, top) with a spatial resolution of 11 × 11 pixels, whereas weights in the second convolution layer were initially random, i.e., the filters had no structure and gradually changed through training. A rectified linear unit (ReLU) was used as the activation function of a unit in the convolution layers and FC1; the ReLU forwards the processing results directly to the next stage if they are positive; otherwise, it outputs zero. A max pooling operation was performed over 5 × 5 sliding regions with a stride of 4 for the outputs of the convolution layers (Fig. 1D, top). Max pooling selects the largest value among the responses of units within a sliding window over the preceding convolution layer and forwards the value to the next layer. A local response normalization process was added after the pooling layers to aid generalization48 (we used slightly different parameters from theirs; k = 1, n = 5, α = 2 × 10−5, β = 0.75). Every unit in FC1 and FC2 received inputs from all units in the immediately preceding layer, i.e., each was fully connected. FC1 is the final processing layer, and FC2 outputs the result of the entire processing by the SNNs. These features were implemented to capture the architectural and computational properties of the subcortical pathway, i.e., fewer processing stages compared to the ventral cortical pathway (Fig. 1A), DoG-type receptive fields in the superficial layer of the superior colliculus37, and large receptive fields of deeper superior colliculus neurons40. The first three processing layers were intended to represent the superior colliculus, pulvinar, and amygdala, respectively. The processing types of these layers, i.e., convolution and pooling in the first two layers and full connection in FC1, were chosen to align with the retinotopic organization of the three brain regions. The convolution and pooling processes in the first two layers exploit retinotopy, as the superior colliculus and pulvinar contain retinotopic maps49,50. The FC1 layer lacks retinotopic information because of the fully convergent connections from the earlier stage, as the amygdala does not have a retinotopic map51.
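For concreteness, the overall architecture can be sketched as follows. This is a minimal illustration in PyTorch (the original implementation used Chainer 3.0.0); the second-layer filter count and kernel size, the strides and paddings, and the FC1 width are illustrative placeholders rather than the exact values of Table 1, and the DoG weights are assumed to be precomputed as described below.

```python
import torch
import torch.nn as nn

class SNN(nn.Module):
    """Minimal sketch of the SNN: two convolution + max-pooling stages followed
    by two fully connected layers (FC1, FC2). Layer sizes are illustrative."""

    def __init__(self, dog_filters, n_classes=7):
        super().__init__()
        # First convolution: 32 fixed DoG filters (11 x 11) that are not updated
        # during training; `dog_filters` is a tensor of shape (32, 1, 11, 11).
        self.conv1 = nn.Conv2d(1, 32, kernel_size=11, stride=4, bias=False)
        self.conv1.weight.data = dog_filters
        self.conv1.weight.requires_grad = False
        # Second convolution: randomly initialized and learned (kernel size assumed).
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        # Wide max pooling (5 x 5 window, stride 4) and local response normalization
        # with the parameters given in the text (k = 1, n = 5, alpha = 2e-5, beta = 0.75).
        self.pool = nn.MaxPool2d(kernel_size=5, stride=4, ceil_mode=True)
        self.norm = nn.LocalResponseNorm(size=5, alpha=2e-5, beta=0.75, k=1.0)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)       # applied before FC2 during training
        self.fc1 = nn.LazyLinear(1024)         # FC1: final processing layer (width assumed)
        self.fc2 = nn.Linear(1024, n_classes)  # FC2: scores for the seven expressions

    def forward(self, x):
        x = self.norm(self.pool(self.relu(self.conv1(x))))
        x = self.norm(self.pool(self.relu(self.conv2(x))))
        x = torch.flatten(x, start_dim=1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)                     # softmax is applied inside the loss
```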

The SNNs were trained to discriminate images of facial expressions representing seven basic emotions: angry, disgusted, fearful, happy, sad, surprised, and neutral (Fig. 1E; see below for details). For each input image, the seven units in FC2 yielded scores ranging from 0 to 1 for the seven expression categories, representing the probabilities of classified expressions. The expression with the highest score was taken as the output of the model.

The DoG-type filters of the first convolution layer were built using the following formula:

$$\mathrm{DoG}(r) = \pm A_{1} \exp\left( - \frac{r^{2}}{2\sigma_{1}^{2}} \right) \mp A_{2} \exp\left( - \frac{r^{2}}{2\sigma_{2}^{2}} \right),$$
(1)

where r is the polar radius from the filter center, A1 and A2 are the amplitudes of the two Gaussian functions, and σ1 and σ2 are their standard deviations. Values of A1, A2, σ1, and σ2 were chosen empirically so that the DoG curves took the shapes of Mexican hats. A1 values were 0.4, 0.67, 0.8, and 1.0. A2 values were determined based on A1 − A2 = 0.4. When A1 was 0.4 (i.e., A2 is 0, and σ2 cannot be defined), we set the σ1 value at \(1/2\sqrt 2 ,\; 1/4\sqrt 2 , \;1/8\sqrt 2 , \;{\text{or}}\; 1/16\sqrt 2\). Otherwise, the σ1 value was \(1/2\sqrt 2 \;{\text{ or}}\; 1/4\sqrt 2\). The σ2 value was determined so that the ratio σ1/σ2 was 0.5 or 0.25. The same number of filters was generated for each A1 value.
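A short numerical sketch of Eq. (1) may help. The coordinate normalization (σ expressed relative to the half-width of the 11-pixel filter) and the example σ2 value are assumptions made for illustration only.

```python
import numpy as np

def dog_filter(size, a1, a2, sigma1, sigma2, sign=+1):
    """Difference-of-Gaussians filter (Eq. 1) on a size x size grid.
    r is the polar radius from the filter center; sign = +1/-1 selects the
    ON-center or OFF-center version (the +/- and -/+ signs in Eq. 1)."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[-half:half + 1, -half:half + 1] / half  # coordinates in [-1, 1]
    r2 = x ** 2 + y ** 2
    center = a1 * np.exp(-r2 / (2 * sigma1 ** 2))
    surround = a2 * np.exp(-r2 / (2 * sigma2 ** 2)) if a2 > 0 else 0.0
    return sign * (center - surround)

# Example: an ON-center 11 x 11 filter with A1 = 0.8 and A2 = 0.4 (A1 - A2 = 0.4);
# sigma2 is set wider than sigma1 so that the profile takes a Mexican-hat shape.
example = dog_filter(11, a1=0.8, a2=0.4,
                     sigma1=1 / (2 * np.sqrt(2)), sigma2=2 / (2 * np.sqrt(2)))
```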

The primary question in this study was whether the classification accuracy achieved by the SNNs would be modest, similar to that of individuals with affective blindsight or newborn babies. After confirming this prediction (see “Results”), we next explored whether this modest accuracy could be attributed to the three subcortical processing properties. If these properties were responsible for the modest performance, substituting them with cortical counterparts would lead to an improvement in performance. We therefore constructed modified models in which the three properties of the SNNs were replaced one-by-one, two at a time, or all three simultaneously with the corresponding properties in the ventral cortical pathway. First, we added convolution layers with filters of 3 × 3 pixels after each of the first two convolution layers to increase the number of processing stages (add-layer model). In adding the convolution layers, the stride of the sliding filters was set to 1 to keep the output resolutions unchanged before and after adding the new layers. Additionally, to keep the number of output channels unchanged, the new layers contained the same number of filters as the preceding layers.

Second, we replaced the DoG-type filters with Gabor-type filters (Gabor model). Gabor-type filtering occurs in simple cells of V1 and emerges in the first layer of DNNs after they are trained to classify object images48,52. We constructed Gabor-type filters with the following formula:

$$g\left( x, y; f, \theta \right) = A \exp\left( - \frac{x^{2} + y^{2}}{2\sigma^{2}} \right) \exp\left( 2\pi i f \left( x\cos \theta + y\sin \theta \right) \right),$$
(2)

where A is the amplitude of the Gaussian envelope, i is \(\sqrt{-1}\), f is the carrier frequency of a Gabor filter, and θ is the orientation53. A was fixed at 0.4 to match the amplitude of the DoG filters. σ was fixed at 0.125 so that the half-amplitude width was half of the filter width. f values were 2 or 4 cycles/image. Orientation θ was 0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, or 157.5°. We built even- and odd-symmetric filters for every combination of f and θ. In total, we obtained 32 Gabor-type filters (Fig. 1C, bottom).
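The Gabor filter bank of Eq. (2) can be sketched in the same way; treating x and y as coordinates spanning one image width (so that f is in cycles/image) is an assumption made for illustration.

```python
import numpy as np

def gabor_filters(size=11, a=0.4, sigma=0.125, freqs=(2, 4), n_orient=8):
    """Gabor filter bank (Eq. 2). The even- and odd-symmetric filters are the
    real and imaginary parts of the complex Gabor function."""
    half = (size - 1) / 2.0
    y, x = np.mgrid[-half:half + 1, -half:half + 1] / (2 * half)  # in [-0.5, 0.5]
    filters = []
    for f in freqs:                                # 2 and 4 cycles/image
        for k in range(n_orient):                  # 0, 22.5, ..., 157.5 degrees
            theta = k * np.pi / n_orient
            envelope = a * np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
            carrier = np.exp(2j * np.pi * f * (x * np.cos(theta) + y * np.sin(theta)))
            g = envelope * carrier
            filters.extend([g.real, g.imag])       # even- and odd-symmetric pair
    return np.stack(filters)                       # 2 x 8 x 2 = 32 filters

bank = gabor_filters()
```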

Finally, we made the pooling window (convergence field) in the max pooling layers smaller (3 × 3; Fig. 1D, bottom) than that of the SNNs (5 × 5; Fig. 1D, top), enabling better spatial resolution of processing (narrow-pooling model) to mimic the smaller receptive fields in the visual cortices compared to the superior colliculus and pulvinar39,40,41,42. The pooling range of 3 × 3 is often used in DNNs (e.g., AlexNet48; ResNet54).
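The two architectural modifications (add-layer and narrow pooling) can be sketched as follows, continuing the illustrative PyTorch code above; the ReLU between the paired convolutions is an assumption.

```python
import torch.nn as nn

def add_layer_after(conv):
    """Add-layer modification: follow an existing convolution with an extra
    3 x 3 convolution whose stride of 1 and padding of 1 keep the output
    resolution, and whose channel count matches the preceding layer."""
    extra = nn.Conv2d(conv.out_channels, conv.out_channels,
                      kernel_size=3, stride=1, padding=1)
    return nn.Sequential(conv, nn.ReLU(), extra)

# Narrow-pooling modification: shrink the pooling window from 5 x 5 to 3 x 3
# (the text notes that the narrow-pooling models used a pooling stride of 2).
narrow_pool = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
```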

Face images

Face images in the main analysis were obtained from the Karolinska Directed Emotional Faces Database developed in Sweden (KDEF)21 and the Radboud Faces Database developed in the Netherlands (RaFD)55. We chose these databases because they meet the requirements of having a substantial number of face images with the seven basic expressions taken under controlled conditions56 and have been previously used to investigate the classification performance of DNNs57. Images of the seven expressions of 40 individuals (half females, half males) were chosen from each database (the total number of images was 560 = 7 × 40 × 2). We converted the images from color into grayscale and extracted the face region by removing hair, neck, and ears with the face-detection function of a computer vision library, OpenCV (Open Source Computer Vision Library)58. The isolated faces were pasted on a gray background (198 × 198 pixels; RGB values = 128; Fig. 1E). We augmented the number of face images by changing size and position and by flipping horizontally: seven sizes (28 × 28, 56 × 56, 85 × 85, 113 × 113, 141 × 141, 170 × 170, and 198 × 198 pixels), five positions (center, left-top, right-top, left-bottom, and right-bottom; directional displacements = 10 pixels), and two flip states (original and horizontally flipped). The augmentation increased the number of images by 70 times to 39,200. At each training session, we randomly split this augmented set of face images into a training set (29,400 images), a validation set (2450 images), and a test set (7350 images). The number of images per facial expression was identical within each of these stimulus sets. To avoid inadvertently biased assignment of face images of a particular size, position, or horizontal flip state into a given set, all images from the same individual were assigned to the same set.
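A sketch of this augmentation (7 sizes × 5 positions × 2 flip states = 70 variants per image) is given below; the sign conventions of the 10-pixel displacements and the use of PIL are illustrative assumptions.

```python
from PIL import Image

SIZES = (28, 56, 85, 113, 141, 170, 198)
OFFSETS = {"center": (0, 0), "left-top": (-10, -10), "right-top": (10, -10),
           "left-bottom": (-10, 10), "right-bottom": (10, 10)}

def augment(face, canvas=198, bg=128):
    """Return the 70 augmented variants of one grayscale face image, each pasted
    onto a 198 x 198 gray background (gray value 128)."""
    variants = []
    for s in SIZES:
        resized = face.resize((s, s), Image.BILINEAR)
        for dx, dy in OFFSETS.values():
            out = Image.new("L", (canvas, canvas), bg)
            out.paste(resized, ((canvas - s) // 2 + dx, (canvas - s) // 2 + dy))
            variants.append(out)                                    # original
            variants.append(out.transpose(Image.FLIP_LEFT_RIGHT))   # horizontal flip
    return variants
```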

To evaluate the applicability of our main analysis results beyond the KDEF/RaFD databases, we conducted an additional analysis using the KOKORO Research Center Facial Expression Database developed in Japan (KRC)59 for both training and testing. This dataset comprises facial images displaying the seven expressions from a total of 74 individuals (50 females and 24 males). To expand the dataset, we employed an augmentation technique similar to that used for the KDEF/RaFD datasets, again increasing the number of images by a factor of 70. The training and analysis procedures were the same as those used for the KDEF/RaFD-trained SNNs.

Training of the SNNs

The training was performed through supervised learning and was conducted individually 20 times with randomized initial weights except for the built-in weights of the first convolution layer, i.e., 20 SNNs with different initial states were built. In training, the weights other than the first convolution layer were optimized for classification of face images into the seven facial expression categories. Stochastic gradient descent was used for weight optimization. For each iteration, 32 samples were randomly selected from the training set as a mini-batch. The averaged cross-entropy60 across the 32 images in a mini-batch was calculated as an estimate of the loss value, which is a measure of the difference between a model output and a desired output and is used for quantifying the training effect. The number of iterations (i.e., weight-updating processes with single mini-batches) was set at 240,000. Initial weight parameters followed a normal distribution with a mean of 0 and a standard deviation of √(2/N) (N is the total number of weights)61. Weights were updated at each iteration with a constant learning rate of 0.001. This learning rate was determined empirically; a preliminary analysis based on 10 constructed SNNs (different from the 20 SNNs in the main analysis) revealed no decrease in loss values (i.e., no learning) with a learning rate of 0.01, which has frequently been used for DNNs in the literature (e.g., see62). A dropout process was added before FC2 to facilitate learning across all units and to avoid overfitting63. The proportion of units dropped out of each weight update was set to 0.5. The training was conducted in a Python environment (Chainer 3.0.0)64 on a graphics processing unit (GPU) machine (Intel® Core™ i7-5820K Processor, Intel, Santa Clara, CA, USA; The GeForce® GTX 1080 Ti, NVIDIA, Santa Clara, CA, USA) with the single-precision floating-point format (float 32, 4 bytes). During our training process, we utilized 417 MiB of GPU memory, with the SNN accounting for 69 MiB of that total.
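The optimization amounts to plain mini-batch SGD with a constant learning rate. A minimal sketch is given below (in PyTorch rather than the Chainer 3.0.0 environment used originally, and assuming `loader` yields an endless stream of (image, label) mini-batches of 32).

```python
import torch
import torch.nn as nn

def train(model, loader, n_iter=240_000, lr=1e-3, device="cuda"):
    """Plain SGD with a constant learning rate; the loss is the cross-entropy
    averaged over each mini-batch. Only trainable weights (i.e., everything
    except the fixed DoG filters) are optimized."""
    model.to(device)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()            # averages over the 32 samples
    for step, (images, labels) in zip(range(n_iter), loader):
        opt.zero_grad()
        loss = loss_fn(model(images.to(device)), labels.to(device))
        loss.backward()
        opt.step()
    return model
```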

While the SNNs were being trained with the training set, the correct rate and loss value for the validation set were periodically checked to monitor signs of overfitting. After training was completed, the performance of the models was evaluated using the test dataset that had not been used for training. This was done to ensure that the models acquired a genuine ability to classify the facial expressions, as opposed to simply sorting the training images into the seven facial expression categories according to the instruction signals.

Test for reference frames of SF tuning of model units

A marked difference in visual responses between the two pathways is the reference frame of neuronal tuning to SFs65. Neurons in the inferior temporal cortex, the final stage of the ventral cortical pathway, are tuned to object-based SFs (cycles/object) and represent face patterns in a size-invariant, hence distance-invariant, manner (Fig. 2A, right). Thus, the ventral cortical pathway converts the representation of SFs in the retina-based coordinate (cycles/degree) to that of object-based SFs. By contrast, many amygdala neurons preserve sensitivity to retina-based SFs. When the stimulus size is changed, these neurons change their preferred object-based SFs; for large stimuli, they respond to higher object-based SFs, which correspond to the same retina-based SFs (Fig. 2A, left). Thus, the dependence of SF tuning on stimulus size serves as a key differentiating indicator to distinguish between the object-based and retina-based reference frames of SF encoding. We analyzed the reference frame of FC1 SF tuning by examining responses to face images of two different sizes to evaluate how well our models captured this characteristic of subcortical processing.

Figure 2

Test determining whether units are tuned for object- or retina-based spatial frequencies (SFs). (A) Hypothetical tuning curves for object-based SFs of units ideally tuned to retina-based SFs (left: peak shift = 1) or object-based SFs (right: peak shift = 0). (B) The models were fed face images with two different sizes (198 × 198 and 99 × 99 pixels) and 64 different bandpass filters (AF06 of the Karolinska Directed Emotional Faces database). These images were created by applying two-dimensional bandpass filters that shared the same center object-based SF across different sizes. For each unit in FC1, we obtained responses to images of different center SFs to create tuning curves for object-based SFs.

Bandpass-filtered face images were used to examine the SF tunings (Fig. 2B). These images were created by multiplying the spectra of the original face images by Gaussian functions in the polar Fourier domain. The Gaussian functions had 61 different center frequencies between 1 cycle/object and 64 cycles/object. The center frequencies had discrete values at steps of 0.1 octaves (log2 units). The Gaussian functions shared the same variance (2.4 octaves) regardless of their center frequencies. The filtered images had amplitude spectra that were determined solely by the Gaussian function because their spectra were set to be flat before multiplication. To balance the total luminance contrast among the filtered face images, the peak amplitude of the Gaussian function was set inversely proportional to the center image-based SF65. These bandpass-filtered images were created for the seven facial expressions at two different sizes (99 × 99 and 198 × 198 pixels).
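A sketch of this filtering for one center SF is given below; treating cycles/image as cycles/object and the details of amplitude normalization are simplifying assumptions.

```python
import numpy as np

def bandpass_face(img, center_cpo, sd_octaves=2.4):
    """Bandpass-filter a square face image around `center_cpo` (cycles/object)
    with a Gaussian on the log-SF axis. The amplitude spectrum is flattened
    first (phase retained), and the Gaussian peak is scaled by 1/center_cpo."""
    n = img.shape[0]
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    fy, fx = np.meshgrid(np.arange(n) - n // 2, np.arange(n) - n // 2, indexing="ij")
    sf = np.hypot(fx, fy)                      # spatial frequency in cycles/image
    sf[sf == 0] = 1e-6                         # avoid log(0) at the DC component
    flat = np.exp(1j * np.angle(spectrum))     # flat amplitude, original phase
    gauss = np.exp(-(np.log2(sf / center_cpo)) ** 2 / (2 * sd_octaves ** 2)) / center_cpo
    return np.real(np.fft.ifft2(np.fft.ifftshift(flat * gauss)))
```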

To characterize the reference frame of SF tuning of each unit, the peak SFs for the two stimulus sizes were estimated, defined by the SFs at which filtered face stimuli activated a unit most strongly. For a given unit, 14 peak SFs were determined (two sizes × seven expressions). The degree to which unit responses to SFs depended on the stimulus size was quantified by calculating differences in peak SFs on a log scale between the two face sizes. A peak shift of 0 indicates that a unit responds to the same cycles/object regardless of the image size and is perfectly tuned to object-based SFs (Fig. 2A, right). A peak shift of 1 means that an SF tuning curve shifts by the amount corresponding to the change in the stimulus size, indicating that a unit is perfectly tuned to retina-based SFs (Fig. 2A, left). This analysis excluded cases in which units did not respond to face images or were not sensitive to SFs and cases in which peak SFs were at either end of the tested range of SFs and the peak positions could not be determined.
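For a single expression, the peak shift of one unit can be computed as below (a sketch; it assumes the twofold size difference used here, so that a shift of one octave corresponds to a peak shift of 1, and it returns NaN for the excluded cases).

```python
import numpy as np

def peak_shift(resp_large, resp_small, center_sfs):
    """Peak shift of one unit: the difference (in log2 units, i.e., octaves)
    between the preferred object-based SFs for the large and small stimuli.
    0 indicates object-based tuning; 1 indicates retina-based tuning."""
    i_large, i_small = int(np.argmax(resp_large)), int(np.argmax(resp_small))
    last = len(center_sfs) - 1
    # Exclude units whose response peak falls at either end of the tested SF range.
    if i_large in (0, last) or i_small in (0, last):
        return np.nan
    return np.log2(center_sfs[i_large]) - np.log2(center_sfs[i_small])
```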

Analysis of the effects of the max pooling operation on SF selectivity

The max pooling operation collapses the positional information of the edges detected by the convolution filters, information that is critical for encoding the SFs of facial images. We therefore examined the effects of max pooling on the SF selectivity of units. We were particularly interested in the role of max pooling in converting sensitivity to retina-based SFs into sensitivity to object-based SFs. We fed bandpass-filtered images of the two stimulus sizes (198 × 198 and 99 × 99 pixels) to the models and determined how the different stimulus sizes affected the responses of units to SFs in the first convolution layer (before pooling) and the first max pooling layer (after pooling). Changes in response patterns across the units associated with different stimulus sizes were quantified by calculating the dissimilarity index. The dissimilarity index D(x, y) for responses x to the large stimuli and responses y to the small stimuli was defined by the Euclidean distance between x and y as follows:

$$D(x, y) = \frac{\| x - y \|}{NM}$$
(3)

where \(\| \cdot \|\) is the L2 norm, N is the number of elements of x and y, and M is the maximum value among the 6405 Euclidean distances calculated for the 61 center SFs and the seven facial expressions of 15 individuals. To probe the role of max pooling, we then calculated the ratio of the dissimilarity indices before pooling (in the first convolution layer) and after pooling (in the first max pooling layer). This analysis was applied to the cases of wide pooling (5 × 5) and narrow pooling (3 × 3), as well as to the SNNs and the Gabor models. By dividing \(\| {x - y}\|\) by M, the dissimilarity index was normalized across the layers (convolution vs. max pooling) and the models (SNNs vs. Gabor models), taking values from 0 to 1.
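A compact sketch of Eq. (3) applied to a whole layer is given below; the response arrays are assumed to be arranged as (conditions × units), with one condition per combination of center SF, expression, and individual.

```python
import numpy as np

def dissimilarity_indices(resp_large, resp_small):
    """Eq. (3): the Euclidean distance between the responses to the large (x) and
    small (y) stimuli for each condition, divided by the number of units N and
    normalized by the maximum M over all conditions so that values lie in [0, 1]."""
    n_units = resp_large.shape[1]
    dists = np.linalg.norm(resp_large - resp_small, axis=1) / n_units
    return dists / dists.max()                 # M: maximum over the conditions
```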

Ethics statement

Written informed consent was obtained for the publication of any identifiable images included in this article21.

Results

Performance of SNNs in facial expression classification

The SNNs were trained to classify each face image in the training set into one of the seven facial expressions. The training improved the classification performance rapidly over the initial iterations and then slowly thereafter. The correct rate across the seven facial expressions rose from the chance level (0.14), surpassed 0.6 at approximately 50,000 iterations, further improved to approximately 0.8 over 150,000 iterations, and reached an asymptote (Fig. 3A for an example SNN; 3B for the average of the 20 constructed SNNs; orange lines). The correct rate for the validation set saturated at approximately 0.5, which was substantially lower than for the training set, indicating insufficient generalization to “unseen” images. However, the validation correct rate reached a plateau in a similar way to the training correct rate. This indicates that the low correct rate was not the result of inadequate training but represents the limited learning ability of the SNNs. It also indicates that no overfitting occurred. The loss value also quickly decreased over the initial 50,000 iterations and became gradually stable (Fig. 3A,B; cyan lines). The results indicate that within the range of adopted iterations (240,000), the SNNs were trained to classify facial expressions without overfitting.

Figure 3

Learning curves and confusion matrices of an example SNN (A) and the average of 20 SNNs with different initial weights (B). Left panels show changes in correct rates that occurred during training in the training set (orange) and the validation set (red), and loss values (cyan). Confusion matrices on the right indicate the rate of classification of each facial expression (true label) as one of the seven expressions (prediction label). (C) The correct rates for the seven expressions. The black line indicates the mean of the 20 SNNs, and lines with other colors indicate the individual performance of the 20 SNNs. The order of facial expression is based on the mean correct rate.

The correct rates for the test set were higher than the chance level (1/7 = 0.14) for all facial expressions but were modest; the average correct rate across the seven expressions was 0.51 for the example SNN shown in Fig. 3A. The average correct rate across the 20 constructed SNNs was 0.51 (± 0.03, s.d.). This was not different from the average correct rate across 20 additional SNNs that were trained with 3,000,000 iterations (0.50 ± 0.03; p = 0.289, t-test). Performance thus did not improve even when the SNNs underwent far more extensive training, confirming that the modest correct rate was not due to insufficient training but instead reflected the limited ability of the SNNs.

Confusion matrices showed that the correct rates varied among the facial expressions (Fig. 3A,B, right panels). Based on the averaged performance, the classification performance of the SNNs was best for happy (0.74) and surprised faces (0.72), followed by angry (0.50), disgusted (0.47), and fearful (0.44) faces, and was worst for sad (0.37) and neutral (0.34) faces. Sad faces were often confused with neutral, angry, and fearful faces. Neutral faces were often confused with sad and angry faces. This expression-dependent performance was consistent across the 20 constructed SNNs (Fig. 3C; p < 0.001 for expressions, p = 0.562 for models, two-way ANOVA).

Effects of SNN modification on classification performance

We then addressed whether the modest classification accuracy achieved by the SNNs could be attributed to the three subcortical processing properties, namely, the shallowness of processing stages, the DoG-type receptive fields at the initial stage, and spatial pooling over a wider visual field. If these properties were responsible, replacing them with cortical counterparts would lead to an improvement in performance. Our results confirmed this hypothesis, as we found that substitution of one or more of the three subcortical properties with the corresponding cortical properties improved the classification accuracy (Fig. 4A; p < 0.001, ANOVA). The correct rates averaged across the seven expressions and the 20 constructed models of each modification increased from 0.51 in the SNNs to 0.55 in the narrow-pooling models, 0.64 in the Gabor models, and 0.69 in the add-layer models (t-test with Bonferroni correction; p < 0.01/28, where 28 = 8C2 is the number of pairwise model comparisons). Among these one-property replacement models, the add-layer models exhibited the best performance. The results indicate that all three properties had effects on classification performance and that the layer structure was the most influential.

Figure 4

Effects on model performance of replacing subcortical properties with corresponding cortical properties. (A) Discrimination performances of the SNNs and the modified models. The discrimination performance differed across the models (ANOVA, p < 0.001). The pairs of models with statistically significant differences in performance are linked with horizontal lines in the upper part (t-test, Bonferroni correction, p < 0.01/ 28 = 8C2). (B) Confusion matrix for the full-replacement models (average of the 20 constructed models). (C) The correct rate for the seven expressions across the 20 full-replacement models. The black line indicates the mean, and lines with other colors indicate the data for individual full-replacement models.

When two properties were replaced together, the narrow-pooling + Gabor models and the narrow-pooling + add-layer models performed better than the narrow-pooling models and the Gabor models (p < 0.01/28 = 8C2; t-test with Bonferroni correction) but comparably to the add-layer models (correct rate, 0.69, p = 0.778 for the narrow-pooling + Gabor models; 0.69, p = 0.564 for the narrow-pooling + add-layer models). The Gabor + add-layer models performed better than all one-property-replacement models (correct rate, 0.75; p < 0.01/28 = 8C2). When all three properties were replaced together (full-replacement models), the correct rate was 0.77, which was better than all other models (p < 0.01/28 = 8C2) except for the Gabor + add-layer models (p = 0.00846 > 0.01/28 = 8C2). The performance was improved for all facial expressions (Fig. 4B; mean correct rates: happy = 0.92; surprised = 0.85; disgusted = 0.81; neutral = 0.78; angry = 0.75; fearful = 0.66; sad = 0.65). As in the SNNs, it was highest for happy and surprised faces and lowest for fearful and sad faces. The performance was improved most for neutral faces (SNNs, 0.37; full-replacement models, 0.78). The variance of the correct rates was affected both by facial expressions and models (Fig. 4C; p < 0.001 for facial expressions, p = 0.00285 for models, two-way ANOVA). The improved performance of the two-property replacement and full replacement models indicates that the effects of the three features on classification performance were partially additive, suggesting that the three features exerted their effects partially independently.

Spatial frequency representation in the FC1 layer

As shown above, the SNNs exhibited modest performance in facial expression classification, and this performance was improved by changing SNN subcortical properties to corresponding cortical properties. These findings suggest that the SNNs captured aspects of processing in the subcortical pathway to the extent that they explained the suboptimal perceptual performance of V1-lesioned patients. We next looked into individual computational units to gain insights into the processing in the models. We examined the SF sensitivities of FC1 units using two different sizes of input images (198 × 198 and 99 × 99 pixels). This procedure allowed us to determine whether units were sensitive to retina- or object-based SFs (Fig. 2; see “Materials and methods”).

FC1 units of the SNNs exhibited a variety of dependencies of SF tunings on stimulus size (Fig. 5A). Some units responded to the same range of object-based SFs for both large and small stimuli, and the peak positions of the SF tuning curves remained unchanged (Fig. 5Aa–Ac). Other units exhibited different preferred SFs for large and small stimuli, and in these cases the peak position shifted horizontally along the abscissa (Fig. 5Ad–Af). We quantified these shifts by measuring the difference between preferred SFs on a log scale for the two stimulus sizes. A peak shift of 0 means that the unit encoded SFs in the object-based coordinate, whereas a peak shift of 1 means that the unit encoded SFs in the retina-based coordinate. The peak shifts of the example units shown in Fig. 5A were 0.0 (a), 0.1 (b), 0.1 (c), 0.9 (d), 1.0 (e), and 1.3 (f).

Figure 5

SF tuning reference frames of FC1 units of the SNNs and full-replacement models. Responses of FC1 units to SF-filtered face images were examined at two different sizes (198 × 198, 99 × 99 pixels). (A) Six example FC1 units of the SNNs with different peak shifts (PS). (B) Two-dimensional histograms of peak SFs for large versus small images for the SNNs (left) and the full-replacement models (right). Solid lines indicate peak shifts of 0, and dashed lines indicate peak shifts of 1. (C) Distribution of peak shifts of units in the 20 SNNs (left) and the 20 full-replacement models (right). Arrowheads indicate the estimated locations of multiple peaks in the distribution (solid: major peaks, open: statistically detected but less obvious peaks). Gray columns indicate units with a response peak at SFs below 2 cycles/object for large and/or small stimuli.

We plotted the peak positions of 2401 FC1 units of the 20 SNNs in a two-dimensional space defined by the peak SF for the large stimuli on the abscissa and the peak SF for the small stimuli on the ordinate (Fig. 5B, left). Note that 46% of FC1 units were excluded from this analysis, either because they were not sensitive to SFs (23%) or because the largest responses were found at either end of the examined range of SFs and the peak SFs could not be determined (23%). The diagonal solid line in Fig. 5B represents a peak shift of 0, and the dashed line next to it represents a peak shift of 1. FC1 units of the SNNs were clustered in multiple groups in this scatter plot. One conspicuous group was selective to low SFs and was centered on the diagonal, i.e., peak shift values of approximately 0. Another group was selective to higher SFs, and was clustered on the dashed line indicative of peak shift values of approximately 1. The multimodality of the distribution can also be seen in the histogram (Fig. 5C, left). We applied an excess mass test for multimodality66,67 to this distribution. This test statistically determines the number of peaks in the distribution, with the null hypothesis that the true number of peaks is N (N = 1, 2, 3,…). The true number of peaks is estimated as the smallest N under which the null hypothesis is not rejected. The excess mass test also estimates the locations and heights of peaks from Gaussian kernel density estimation. The test revealed that there were three peaks in the distribution of the SNNs (first p-value < 0.001, second p-value < 0.001, third p-value = 0.096). Based on the probability density function derived from the histogram67, the peaks were estimated to be located at − 4.85, 0.144, and 0.909 (open and solid arrowheads in Fig. 5C, left). Units sensitive to low SFs below 2 cycles/object were most frequent around a peak shift of 0 (gray columns). Comparison of this result with the density map indicates that the peak at approximately 0 derived mostly from the low-SF group and the peak at approximately 1 from the high-SF group. Although the peak at the far periphery (− 4.85, open arrowhead) was statistically detected, it was much smaller in height than the other two peaks (1.1% of the heights of the peaks near 0 and 1). The results indicate that FC1 contained two major groups of units: those sensitive to low SFs, encoding SFs in the object-based coordinate, and those sensitive to high SFs, encoding SFs in the retina-based coordinate.
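As an illustration of the peak-location step only (the formal multimodality assessment used the excess mass test66,67), the peak positions and heights can be read off a Gaussian kernel density estimate; the grid range and the use of SciPy are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def density_peaks(peak_shifts, grid=None):
    """Estimate the locations and heights of local maxima of the peak-shift
    distribution from a Gaussian kernel density estimate."""
    shifts = np.asarray(peak_shifts, dtype=float)
    shifts = shifts[~np.isnan(shifts)]
    if grid is None:
        grid = np.linspace(shifts.min() - 1, shifts.max() + 1, 2001)
    density = gaussian_kde(shifts)(grid)
    idx, _ = find_peaks(density)
    return grid[idx], density[idx]
```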

The distribution of peak shift values was drastically altered in the full-replacement models. In the two-dimensional plot shown in Fig. 5B (right), most data points were diffusely distributed in an elongated area between the diagonal and dashed lines, indicating that the SF reference frame of most units was intermediate between retina-based and object-based. An excess mass test again detected three peaks, located at − 4.84, 0.436, and 4.98 (Fig. 5C right, solid and open arrowheads; first p-value < 0.001, second p-value < 0.001, third p-value = 0.096). The peaks at − 4.84 and 4.98 (open arrowheads) were much smaller than the primary peak at 0.436 (3.1% and 2.6% of the primary peak, respectively), making the distribution nearly unimodal. Thus, the three processing properties influenced the SF reference frame of FC1 units.

Given the change of the peak shift distribution in the full-replacement models, we next analyzed the one- and two-property replacement models to determine which subcortical properties were essential for the multimodal distribution observed in the SNNs. All of these modified models exhibited essentially unimodal distributions, with the major peak (solid arrowheads) located at a different position in each model (Fig. 6). An excess mass test for multimodality detected two other less obvious peaks (open arrowheads) in each model, as in the cases of the SNNs and the full-replacement models (Fig. 5C). The heights of these smaller peaks were 2.6–26% of those of the major peaks, and the peaks were located at the periphery of the distribution.

Figure 6

Distributions of peak shifts of FC1 units of the SNNs and one-property or two-property replacement models. (A) Data from the narrow-pooling model, add-layer model, and Gabor model. (B) Data from models with two modifications: narrow-pooling + Gabor, Gabor + add-layer, and narrow-pooling + add-layer. Gray columns indicate units with a response peak at SFs below two cycles/object for large or small stimuli. Arrowheads indicate the estimated locations of multiple peaks in the distribution (solid: major peaks, open: statistically detected but small peaks).

Each of the three one-property replacement models showed a characteristic distribution of the peak shift values. The narrow-pooling models contained units with peak shift values between 0 and 1 in addition to units with peak shift values around either 0 or 1. The distribution became unimodal and broad, and was estimated to be centered at 0.85. In the Gabor models, units with peak shift values intermediate between 0 and 1 were the most abundant, with smaller numbers of units having peak shift values around 0 and 1. The distribution peak was estimated at 0.51. In the add-layer models, units with peak shift values at approximately 0 were predominant, and the distribution exhibited a sharp peak at 0.32. Regarding the two-property replacement models, the narrow-pooling + Gabor models and the Gabor + add-layer models showed a broad distribution straddling the peak values from 0 to 1 (peak for the former, 0.56; peak for the latter, 0.52), whereas the narrow-pooling + add-layer models showed a sharp distribution peak at 0.20. As in the SNNs and the full-replacement models, units sensitive to low SFs (below 2 cycles/object) were most frequently found at a peak shift of approximately 0 in all of the one- and two-property replacement models (gray columns). The results indicate that all three computational properties were responsible for the multimodal distribution of peak shift values observed in the SNNs. In particular, the smaller number of units with peak shift values of approximately 1 in the Gabor models and the add-layer models suggests that the shallowness and the DoG-type filters were critical for preserving the unit sensitivities to retina-based SFs. The broad distribution observed for the narrow-pooling models and the narrow-pooling + Gabor models suggests that the wide pooling employed in the SNNs contributed to the two peaks at 0 and 1 by reducing units with peak shift values intermediate between 0 and 1.

Effects of max pooling on SF tuning

We showed above that the FC1 units of the SNNs were roughly grouped into two populations in terms of the reference frame of SF encoding. Because max pooling yields the same output from a population of convolution-layer units in response to slightly different spatial arrangements of local features, the max pooling operation is likely to affect the encoding of the global configuration of face components. This global configuration information is reflected in a low range of SFs. Therefore, we next compared the effect of max pooling on the representation of SFs across different SF ranges.

We first analyzed the responses of the 96,800 units (32 filters × 55 × 55 resolution) in the first convolution layer. We obtained the response patterns across these units by feeding bandpass-filtered faces of two sizes (Fig. 2B) to the models and quantified the difference between the SF tunings obtained for the two stimulus sizes by calculating the dissimilarity index (see “Materials and methods”). In the SNNs with DoG filters, the dissimilarity index was high (approximately 0.6) for a low SF range up to approximately four cycles/object, but gradually decreased over a higher range of SFs (Fig. 7A, black curve). In the pooling layer, the dissimilarity index of the 6272 units (32 filters × 14 × 14 resolution) became lower for a low SF range of less than four cycles/object than that of the convolution layer. For a high SF range of greater than four cycles/object, by contrast, it became higher than that of the convolution layer (Fig. 7A, compare the blue and orange curves with the black curve). Thus, max pooling resulted in SF tuning becoming similar between the two stimulus sizes for a low SF range, consistent with the results of peak shift analysis (Fig. 5C). Although these changes were observed both for wide pooling (5 × 5) and narrow pooling (3 × 3), the effects were larger for the former than for the latter (Fig. 7A, compare the blue curve with the orange curve). This was more evident when we plotted the ratio of dissimilarity indices before and after pooling (Fig. 7B). Furthermore, the ratio curve for wide pooling had smaller standard deviations than that for narrow pooling (shown as shades in Fig. 7B), indicating that wide pooling exerted its effects more consistently across the 15 individual faces and the seven facial expressions than narrow pooling.

Figure 7

Effects of max pooling on the size-invariant responses to SFs. (A) Dissimilarity index curves of responses of units in the first convolution layer of the SNNs before max pooling operation (black), after wide pooling (blue), and after narrow pooling (orange). Dissimilarity indices, defined by the Euclidean distances of unit responses between different stimulus sizes (see “Materials and methods”), are plotted against the center SFs of input images. Solid lines indicate the means of dissimilarity indices across the seven facial expressions. Shades indicate standard deviations. Each dissimilarity index was normalized by the number of units and the maximum values. (B) Dissimilarity ratios of inputs and outputs of the max pooling operation (upper, after wide pooling; lower, after narrow pooling). (C,D) Data from units in the first convolution layer of the Gabor models. The conventions are the same as in (A) and (B).

By contrast, the convolution-layer units of the Gabor models exhibited a constantly high dissimilarity index over most of the SF range (Fig. 7C, black curve). However, when we applied max pooling with windows of either 5 × 5 or 3 × 3 in size, the dissimilarity index became small over almost the entire SF range, with the largest decrease for 1–16 cycles/object (Fig. 7C; compare the orange and blue curves with the black curve). As in the case of the SNNs, the effect was stronger for wide pooling than for narrow pooling (Fig. 7D). These results demonstrated that max pooling rendered the SF tuning more invariant to stimulus size for units sensitive to low SFs, enabling them to represent SFs in the object-based coordinate. Regardless of the filter type in the first convolution layer (i.e., DoG vs. Gabor), wide pooling was more effective than narrow pooling in creating this response property.

Effects of altering the sliding stride on SF tuning

We next examined the effect of another free parameter of our models, the stride size, on the SF sensitivity of FC1 units. We changed the stride of the two max pooling layers of the SNNs from 4 to 2. A stride size of 2 was also employed in the narrow-pooling model. This modified model with a smaller stride of 2 achieved a mean correct rate of 0.54, which was better than the SNNs (0.51) but similar to the narrow pooling models (0.55) (vs. SNN, p = 0.0020; vs. the narrow pooling, p = 0.23; t-test with Bonferroni correction).

An analysis of FC1 responses to the two stimulus sizes revealed that the distribution of the peak shift had three peaks at − 4.17, − 0.0107, and 0.976 (first p-value < 0.001, second p-value < 0.001, third p-value = 0.098; excess mass test for multimodality; solid and open arrowheads in Fig. 8). Two of them were conspicuous and located near 0 or 1 (solid arrowheads), and the third, at the periphery (− 4.17, open arrowhead), was small (3.2% and 3.3% of the two major peaks). Comparisons with Fig. 5B,C show that the small-stride model exhibited an SF representation similar to that of the SNNs, in that there were two main groups of units, one sensitive to low SFs, representing SFs in the object-based coordinate (peak shift approximately 0), and the other sensitive to high SFs, representing SFs in the retina-based coordinate (peak shift approximately 1). The change of the stride from 4 to 2 thus had little effect on the reference frame of SF sensitivity of FC1 units.

Figure 8

Effects of altering the sliding stride of the pooling windows on the SF tuning reference frames of FC1 units of the SNNs. The responses of the FC1 units were obtained in the same way as in Fig. 5. (A) A two-dimensional histogram of peak SFs obtained with large (198 × 198) versus small (99 × 99) stimulus images. (B) Distribution of peak shifts of units. Arrowheads indicate the estimated locations of multiple peaks in the distribution.

Training and testing of SNNs with a different face database

Finally, we explored the generalizability of our findings based on the KDEF and RaFD databases by assessing SNNs trained with another face database, KRC. We trained and tested an additional set of 20 SNNs using the KRC database, following a similar approach to the main analysis with the KDEF and RaFD databases. Similar to the KDEF/RaFD-trained SNNs, the SNNs trained on the KRC database achieved modest correct rates above chance level for all facial expressions. The mean correct rate across the seven facial expressions was 0.49 (s.d. = 0.02), slightly lower than but still comparable to 0.51 in the KDEF/RaFD-trained SNNs. Furthermore, the performance order for the seven expressions was largely consistent with the models trained on the KDEF/RaFD databases: the performance was best for happy (0.76) and surprised (0.71) faces, moderate for disgusted (0.44), fearful (0.42), neutral (0.42), and angry (0.41) faces, and poor for sad (0.28) faces (Fig. 9A).

Figure 9

Effects of changing the training and test datasets to the KRC, a Japanese facial expression database, on the classification performance of 20 SNNs (A) and the SF tuning reference frame of FC1 units in these networks (B,C). The average performance across the 20 SNNs was obtained in the same way as in Fig. 3, and responses of FC1 units were obtained in the same way as in Fig. 5. (A) A confusion matrix illustrates the rate of classification of each facial expression (true label) as one of the seven expressions (prediction label). (B) A two-dimensional histogram of peak SFs obtained with large (198 × 198) versus small (99 × 99) stimulus images. (C) Distribution of peak shifts of units. Arrowheads indicate the estimated locations of multiple peaks in the distribution. Gray columns indicate units with a response peak at SFs below 2 cycles/object for large and/or small stimuli.

Analysis of the SF representation in the FC1 layer revealed that, similar to the KDEF/RaFD-trained SNNs (Fig. 5B, left), the KRC-trained SNNs exhibited two major groups of units. One group encoded object-based SFs in a low SF range, while the other encoded the retina-based reference frame in a higher SF range (Fig. 9B). The distribution of units in two-dimensional histograms was similar between the KRC-trained SNNs and the KDEF/RaFD-trained SNNs (Pearson’s correlation coefficient, r = 0.42, p < 0.001). The multimodality test revealed that the distribution of the peak shift had three peaks, located at − 4.89, 0.455, and 0.998 (the first and second p-values < 0.001, the third p-value = 0.12; excess mass test for multimodality; solid and open arrowheads in Fig. 9C). The units sensitive to low SFs below 2 cycles/object were most frequently observed at a peak shift of approximately 0 (gray columns), which is also similar to the observations from KDEF/RaFD databases.

The results suggest that our main findings, the modest classification performance and the bimodal distribution of units based on the SF reference frame, are applicable across the employed face databases, even those comprising faces of diverse ethnic backgrounds.

Discussion

This study presents an initial endeavor to construct and investigate a computational model for facial expression processing along the subcortical pathway. While DNNs have achieved notable accomplishments in modeling visual processing in the ventral cortical pathway, it has remained unclear whether and how the CNN architecture can be adapted to processing in the subcortical pathway. Here we demonstrated that the SNNs, which implemented the three subcortical properties (shallow processing, DoG-type filters, and wide-area spatial pooling), can be trained to classify facial expressions. Compared to DNNs previously tested for facial expression classification, the SNNs exhibited modest classification accuracy, akin to that of individuals with affective blindsight. Substituting any of these properties with their cortical counterparts improved performance, indicating the significance of all three subcortical properties as factors that limit the performance of the SNNs. In the FC1 layer, units were divided into two groups: one sensitive to retina-based SFs and the other sensitive to object-based SFs. Most modified models with cortical properties contained fewer retina-based units, indicating the role of these properties in preserving retina-based SF information. These findings suggest that the three processing properties of the subcortical pathway have a limiting effect on facial expression recognition. These insights shed light on the computational processes employed by individuals with affective blindsight and newborns.

Modest performance of the SNNs and neural computations of the subcortical pathway

It has been proposed that affective blindsight is mediated by components of the subcortical pathway spared by the lesions, including the superior colliculus, pulvinar, and amygdala22,23,24. One view assumes that the shortest route directly connecting the three subcortical structures conveys facial expression information from the superior colliculus via the pulvinar to the amygdala1. A different view proposes that information from the pulvinar reaches the amygdala through the facial processing system in the temporal cortex under the assumption that “the direct connections of the pulvinar with the amygdala are likely insufficient in themselves for recognizing emotional expressions”68. By demonstrating that SNNs can successfully acquire the ability to discriminate facial expressions, the present study provides support for the shortest route hypothesis.

The average correct rate (0.49–0.51) of classifying the seven facial expressions in the present study was well above chance (0.14) but was far from perfect. The modest correct rate is in line with the performance of patients with affective blindsight. Pegna et al.23 reported that a patient with bilateral lesions in V1 discriminated happy faces from either angry, sad, or horrified faces at correct rates of 0.58–0.62, marginally above the chance level of 0.5. Another patient with bilateral lesions in V1 exhibited correct rates of 0.64–0.67 for happy vs. fearful or angry faces (chance level = 0.5)24. The residual ability of facial expression classification in these patients was only moderate compared to the nearly perfect performance in healthy people. This raises the question of why subcortical processing supports vision more poorly than visual functions mediated by the cortical pathway.

A traditional explanation is that neurons in the subcortical pathway respond to low SFs and are less sensitive to high SFs than those in the cortical pathway (e.g., see31,69,70,71). This predisposition towards low SF sensitivity will impede the ability of the subcortical pathway to analyze fine details of visual images and can itself result in the inaccurate processing of face images. However, some researchers dispute the dependence of the subcortical response on low SFs72,73. Our results show that low SF sensitivity, even if important, was not the only cause, because the models that retained the DoG filters but incorporated the narrow-pooling or add-layer modifications exhibited improved performance, despite the fact that our DoG filters were tuned to low SFs, with a preferred SF range between 0.17 and 3.4 cycles/degree. Note that we estimated this range under the assumption of an image size of 30.5°, based on our DoG parameters, the filter resolution, and the RF size of superior colliculus neurons representing the foveal region. This SF range of the DoG filters is slightly wider than that used in a recent model of the superior colliculus (0.36–1.88 cycles/degree based on our calculation)74.

Another explanation is that the small number of processing stages in the subcortical pathway hampers detailed analysis of visual inputs. However, a previous study75 showed that CNNs that had only two processing layers, with Gabor filters at the first stage, performed highly accurately in the discrimination of facial expressions. The performance of our SNNs incorporating the three subcortical properties was not this high. This was not due to inadequate training because the performance plateaued and stayed stable over numerous iterations in the training sessions (Fig. 3). This was further verified by showing that the correct performance remained unchanged even after excessive training with 3,000,000 iterations. Furthermore, replacing not only the small number of processing layers but also the filter type at the first processing layer and the width of the pooling window with the corresponding cortical properties improved the performance of the SNNs (Fig. 4). The three properties at least partially underlie the less accurate processing of facial images in the subcortical pathway and may be responsible for the low performance in affective blindsight.

Recognition of facial expressions by the SNNs, DNNs and patients

The SNNs had varying classification accuracies across facial expressions (Fig. 3A,B). They performed best for happy and surprised faces and worst for sad faces. This ordering was consistent across independently trained SNNs (Fig. 3C) and across SNNs trained with different databases (the KDEF/RaFD databases and the KRC database). It also matched the ordering of previously developed AlexNet-based DNNs57, which were trained to discriminate the seven expressions derived from either the KDEF database or the KRC database. Similar to the SNNs, these DNNs performed best for happy and surprised faces. This coincidence may simply suggest that, within each database, facial features are consistent across faces with happy or surprised expressions but more diverse across faces with sad or neutral expressions. However, within-database variation across examples of an expression is not the sole reason for the performance differences: neutral faces were classified poorly by the SNNs (Fig. 3B), whereas the DNNs of Inagaki et al.57 classified them with high correct rates. An alternative explanation is that the ease or difficulty of classification varies across expressions owing to differences in the conspicuousness of the component facial actions underlying each expression; the degree to which different neural networks show similar expression-specific performance may depend on these differences.

Another CNN, with a first layer of DoG filtering and average pooling, also had difficulty distinguishing sad from neutral faces74. This CNN was constructed to simulate facial processing in the superior colliculus and was trained on happy, sad, and neutral expressions. It performed best for happy faces and moderately for sad faces, but it classified neutral faces correctly at a rate of only 0.49, misclassifying them as sad at a rate of 0.39. The fact that this CNN and the SNNs in the present study showed this confusion, whereas the AlexNet-based DNNs and our add-layer models (correct rates of 0.52 for sad and 0.67 for neutral faces) did not, suggests that the convolution stages after the initial DoG filtering (in the add-layer models) or the convolution by Gabor filters (in the AlexNet-based DNNs) may be critical for distinguishing sad and neutral faces.

The expression-dependent performance of the SNNs also showed both similarities and dissimilarities to that of a V1-lesioned patient22. The patient classified happy and sad faces with higher correct rates than angry and fearful faces. Thus, both the SNNs and this patient classified happy faces well, whereas sad faces were classified poorly by the SNNs but well by the patient.

Our SNNs were trained solely on facial images taken under controlled laboratory conditions. A crucial direction for future research is to assess how well SNNs classify expressions in facial images taken under more natural and diverse conditions56. Additionally, although we used photographs of human faces as training and test images, our neural network models were constructed based on physiological findings from both monkeys and humans. This raises the question of how effectively SNNs trained with human faces classify monkey facial expressions in cross-species scenarios. It should be acknowledged, however, that facial expressions differ between the two species, and there is ongoing controversy regarding the interpretation of facial expressions in nonhuman primates (e.g., see76,77,78,79). Furthermore, the performance of deep neural networks does not necessarily generalize fully even to human faces from different countries57.

Reference frame of coding SF information and invariance of visual responses

The FC1 units of the SNNs consisted of two major groups with different SF processing properties (Fig. 5). One group responded best to the same object-based SFs (cycles/object) regardless of stimulus size. This size-invariant response indicates that these units represent SFs in object-based coordinates. Most of these units were tuned to low SFs (approximately one to two cycles/object). The other group showed a shift in the optimal object-based SF when tested with different stimulus sizes. The direction of the shift was consistent with the interpretation that these units were tuned to retina-based SFs (cycles/degree): for larger stimuli, they responded to higher object-based SFs corresponding to the same retina-based SFs. The DoG filters at the initial stage and the shallow architecture appear to be critical for preserving the SF representation in retina-based coordinates, because the number of FC1 units with retina-based SF sensitivity decreased when the first convolution layer was changed to Gabor filters or when the number of processing layers was increased (Fig. 6A, middle, bottom).
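The distinction between the two groups can be made concrete with a simple peak-shift measure, sketched below on hypothetical tuning data: an index of 0 indicates a unit whose preferred cycles/object is unchanged across sizes (object-based), whereas an index of 1 indicates a unit whose preferred cycles/object scales with image size so that its preferred cycles/degree stays constant (retina-based). The tuning curves, image sizes, and index definition here are illustrative and are not the exact analysis used in the study.

```python
import numpy as np

def peak_object_sf(responses, object_sfs):
    """Object-based SF (cycles/object) at which a unit responds most strongly."""
    return object_sfs[np.argmax(responses)]

def peak_shift_index(peak_small, peak_large, size_small, size_large):
    """0 for a purely object-based unit (same cycles/object at both sizes),
    1 for a purely retina-based unit (cycles/object scales with image size,
    keeping cycles/degree constant)."""
    return np.log(peak_large / peak_small) / np.log(size_large / size_small)

# Hypothetical tuning data for two units tested at two image sizes (in degrees).
object_sfs = np.array([1, 2, 4, 8, 16, 32], dtype=float)   # cycles/object
size_small, size_large = 7.6, 30.5                          # assumed image sizes

object_based_unit = {7.6:  np.array([0.2, 0.9, 0.4, 0.1, 0.0, 0.0]),
                     30.5: np.array([0.3, 1.0, 0.5, 0.1, 0.0, 0.0])}
retina_based_unit = {7.6:  np.array([0.1, 0.8, 0.3, 0.1, 0.0, 0.0]),
                     30.5: np.array([0.0, 0.1, 0.2, 0.9, 0.3, 0.1])}

for name, unit in [("object-based", object_based_unit),
                   ("retina-based", retina_based_unit)]:
    p_small = peak_object_sf(unit[size_small], object_sfs)
    p_large = peak_object_sf(unit[size_large], object_sfs)
    idx = peak_shift_index(p_small, p_large, size_small, size_large)
    print(f"{name} unit: peak-shift index = {idx:.2f}")
```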

A large proportion of the object-based SF units in the SNNs were tuned to low SFs (Fig. 5B). This curious bias of the object-based units towards low SFs likely resulted from the wide max pooling process. Lowpass-filtered facial images contain only coarse structures, such as solid blobs at the eye or mouth positions. The positional information of these blobs is initially captured by the DoG filters and encoded as response patterns across units in the convolution layer. Because the blobs appear at different positions and scales in images of different sizes, the response patterns vary between sizes. However, the max pooling operation makes the response patterns more similar across sizes because it reduces the sensitivity of units in the pooling layer to small changes in the spatial arrangement of local features. Indeed, the dissimilarity index for lowpass-filtered images decreased after max pooling (Fig. 7). This effect leads to object-based SF tuning for lowpass-filtered images, that is, preferential responses to a particular range of object-based SFs irrespective of image size. Wider pooling windows enhance this effect at the expense of losing fine details of the inputs. With narrower pooling windows, the effect would be incomplete, and units with intermediate peak-shift values would increase in number, as we found in the narrow-pooling models (Fig. 6A, top).
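This argument can be illustrated with a minimal numerical sketch: blob-like activation profiles representing the same coarse facial structure at slightly shifted positions become more similar after max pooling, and markedly more so with a wider pooling window. The one-dimensional profiles, the shift, and the window widths below are arbitrary choices made only to illustrate the tendency; they do not reproduce the dissimilarity index of Fig. 7.

```python
import numpy as np

def max_pool_1d(x, window):
    """Non-overlapping 1-D max pooling (stride equal to the window width)."""
    n = (len(x) // window) * window
    return x[:n].reshape(-1, window).max(axis=1)

def dissimilarity(a, b):
    """1 - Pearson correlation between two response patterns."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

# Blob-like activation profiles: the same two coarse "facial" blobs at slightly
# shifted positions, mimicking responses to the same face at different sizes.
x = np.arange(60, dtype=float)
sigma = 0.8
pattern_a = np.exp(-(x - 20)**2 / (2 * sigma**2)) + np.exp(-(x - 40)**2 / (2 * sigma**2))
pattern_b = np.exp(-(x - 23)**2 / (2 * sigma**2)) + np.exp(-(x - 43)**2 / (2 * sigma**2))

print("no pooling          :", round(dissimilarity(pattern_a, pattern_b), 2))
for window in (3, 5):      # narrow vs. wide pooling windows
    d = dissimilarity(max_pool_1d(pattern_a, window), max_pool_1d(pattern_b, window))
    print(f"{window}-wide max pooling :", round(d, 2))
```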

One may wonder why FC1 units of the SNNs maintained sensitivity to retina-based SFs, i.e., a size-dependent representation of SFs, despite the demand we imposed on the SNNs to classify facial expressions across the seven face image sizes. One plausible explanation is that the architecture of our SNNs cannot achieve a sufficiently object-based representation and remains suboptimal for the required task even after excessive training. This may be one reason for the modest classification performance of the SNNs. Indeed, replacing the subcortical processing properties with the cortical properties made the representation more object-based (Figs. 5B,C, 6) and improved the classification performance (Fig. 4). However, if object-based SF encoding were the only requirement for optimal performance under our training conditions, the models with the most object-based SF encoding should have had the highest correct rates, but this was not the case: the add-layer models and the narrow-pooling + add-layer models exhibited the most object-based encoding of SFs (Fig. 6A bottom, B bottom), yet they performed worse than the full-replacement model (Fig. 4A). The representation acquired for the classification therefore depended not only on the task demand of size-invariant classification of facial expressions but also on other, as yet unspecified, constraints probably deriving from the architecture of the models.

In the primate amygdala, the responses of many neurons are affected by retina-based SFs, and only a minority of neurons have perfect object-based SF sensitivity65. By contrast, many FC1 units of the SNNs tuned to low SFs exhibited object-based SF sensitivity. The paucity of evidence for object-based SF sensitivity in the amygdala may be related to the fact that the previous electrophysiological study65 did not present face images with very low SFs and may therefore have overlooked neurons with object-based SF sensitivity in this SF range.

Some inferior temporal cortex neurons exhibit responses that are invariant to changes in stimulus size80,81. The max pooling operation may help achieve such invariant responses by disregarding positional changes of inputs within each pooling window. Because a size change shifts edge positions without altering the topology of the image, stimuli before and after the change elicit similar responses as long as the shifts are small enough to be covered by each pooling window. The effects of wide pooling (Fig. 7) suggest that some aspects of the invariant responses of inferior temporal cortex neurons can be achieved simply by bypassing early cortical areas with high spatial resolution, such as V1. Such shortcut routes exist, including the projection from the pulvinar to V2 and then to the posterior inferior temporal cortex, and the projection from the pulvinar to V4 and then to the anterior inferior temporal cortex20.

Size invariance at low SFs is important for newborns. Newborns have blurred vision that relies on low SFs82,83, yet they respond to faces or face-like patterns irrespective of stimulus size or viewing distance28,84. These findings indicate that the ability to recognize faces in a size-invariant manner based on low SFs is innately implemented in our visual system. Convergence of inputs from the superficial layer of the superior colliculus onto the deep layer, which is already present in newborns40, may be part of the neural substrate supporting this aspect of size-invariant face recognition.

Concluding remarks

We have presented an original computational approach for investigating facial expression processing in the subcortical pathway of primates. This approach analyzes the performance of SNNs endowed with subcortical properties and the effects of replacing these properties with their cortical counterparts. We have demonstrated that the performance of the SNNs is constrained by three key subcortical properties: the shallowness of the processing stages, the DoG-type receptive fields at the initial stage, and spatial pooling over a wider visual field. These properties also affect the reference frame of the spatial frequency representation in the final layer of the SNNs. Our findings offer insights into the neural mechanisms and functions of the subcortical pathway, highlighting their distinction from those of the cortical pathway, and provide an explanation for the limited proficiency in facial expression recognition observed in individuals with impaired or undeveloped visual cortices.

An important direction for future research is to experimentally assess the validity of our SNNs as models of the biological neural networks of the subcortical pathway. One possible approach is to use activation maximization procedures85,86 to generate facial expression images that elicit the strongest responses from units of either the SNNs or the full-replacement models. These two types of images can then be used to evaluate how human observers classify the facial expressions. Furthermore, neural responses to these images can be measured by single-neuron recordings in non-human primates or by functional magnetic resonance imaging in human observers. Such approaches would enable us to evaluate the extent to which the SNNs are linked to biological neural systems.
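As an illustration of how such preferred images might be generated, the sketch below performs plain gradient ascent on the input to maximize the activation of a chosen unit. Here `model` and `unit_index` are placeholders for a trained SNN (or full-replacement model) and one of its output units, and the regularization terms typically used in activation maximization85,86 are omitted for brevity.

```python
import torch

def activation_maximization(model, unit_index, image_shape=(1, 1, 224, 224),
                            steps=200, lr=0.05):
    """Gradient ascent on the input image to maximize the activation of one
    output unit of `model` (a placeholder for an SNN or a full-replacement
    model). Regularizers commonly used in activation maximization are omitted."""
    image = torch.full(image_shape, 0.5, requires_grad=True)  # start from a uniform gray image
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = model(image)[0, unit_index]   # response of the target unit
        (-activation).backward()                   # minimizing the negative ascends the activation
        optimizer.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)                 # keep pixel values in a valid range
    return image.detach()
```

For example, calling `activation_maximization(snn, unit_index=3)` on a trained model would return an image that strongly drives the fourth output unit; the image shape should of course match the input size the model expects.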

Research interest in the role of subcortical structures in cognitive functions has recently surged, but physiological data are still much sparser for subcortical structures than for the cerebral cortex87. Computational approaches, such as the one we have presented here, are expected to partially compensate for this data scarcity and to guide future research.