Evidence for a deep, distributed and dynamic semantic code in human ventral anterior temporal cortex

Abstract: How does the human brain encode the meanings of words and objects? Most theories propose that local neural populations independently encode various semantic features. Alternatively, meanings may arise as distributed neural patterns that change radically in real time as a stimulus is processed. We introduce a technique for revealing a dynamically-changing distributed code in simulated neural data, then apply it to neural signals collected from human cortex while participants named line drawings of common items. The data reveal a dynamic semantic code along ventral temporal cortex possessing stable elements posteriorly and rapidly-changing elements anteriorly. The results challenge the established view of semantic representation, resolve conflicting findings from past research, and provide a new framework for understanding the time-course of distributed representation in the brain.

One Sentence Summary: The brain’s code for semantic information changes rapidly in real time as a stimulus is recognized.

Main Text: Semantic memory supports human understanding of language and experience: our remarkable ability to recognize new items and events, infer their unobserved properties, and comprehend and produce statements about them (1). These abilities arise from neural activity propagating in a broadly distributed cortical network, with different components encoding different varieties of information (perceptual, motor, linguistic, etc.) (2). The ventral anterior temporal lobes (vATL) form a hub in this network that coordinates activation amongst the various surface representations (3). In so doing the vATL acquires distributed representations that allow the whole network to express conceptual similarity structure, supporting inductive generalization of acquired knowledge across conceptually-related items (4).
We consider an unexamined consequence of this arrangement with important implications for theories of semantic representation and other forms of higher cognition. To serve their transmodal function, hub representations must interact with all of the connected surface representations. Many models therefore propose that neural activation flows bidirectionally in the network, producing dynamics in which early perceptual activation patterns change with feedback from downstream areas as the whole system settles toward an interpretation of the input (5-8). Consequently, the way that neural responses encode semantic information can change rapidly as a stimulus is processed.
Such dynamics have not been the focus of contemporary semantic theories, which often view semantic processing as involving the activation of locally-encoded semantic features (9)(10)(11).
Feature-based views license a straightforward interpretation of neurophysiological data: when a neural population is active, the corresponding feature has been detected, inferred, or called to mind; inactive populations imply no feature detection. The temporal behavior of the population directly indicates the time-course with which the represented feature is available to influence processing, and the mean neural activity over time indicates the strength with which the feature was activated in a particular trial or task condition. These ideas motivate efforts to determine which cortical regions encode which features and at which points in time by correlating local neural activity with the presence/absence of a semantic feature (such as a category label or property). If, however, semantic representations are distributed and dynamic, the contribution of a given population to the distributed code may change in real time, potentially in highly nonlinear ways that confound standard univariate analysis.

[...] (consistent with high spatial and low temporal resolution; e.g. fMRI) and for both populations together (low spatial and temporal resolution). The former suggests that population 2 plays no important role distinguishing the categories, while the latter suggests the two populations together selectively detect animals. These conclusions are incorrect from the distributed and dynamic perspective, under which the two populations always jointly differentiate the categories but the contribution of each changes over time (Figure 1A).
The interpretation of neurophysiological signals in the cortical semantic system thus depends critically upon whether the neuro-semantic code is feature-based or distributed and dynamic, but efforts to adjudicate the question face two significant hurdles. First, spatial and/or temporal averaging can obscure important signal if the code truly is distributed and dynamic. Discovery requires neural data with high temporal and spatial resolution, ruling out non-invasive brain imaging. Second, independent univariate analysis can mischaracterize information distributed across multiple sites. Discovery requires multivariate methods, but it is unclear which approaches can uncover a dynamically changing neural code.
We therefore combined computational modeling, multivariate pattern classification, and electrocorticography (ECoG) to assess whether semantic representations in human vATL are distributed and dynamic. We first showed that the nonlinear dynamics highlighted in our thought experiment arise in a well-studied neurocomputational model of semantic memory (12, 13). We next used pattern classifiers to establish the "signature" of a dynamically changing semantic code. We then applied this technique to ECoG data collected from the surface of human vATL while participants named line drawings of common items, and found the critical decoding signature, providing strong evidence that semantic structure is expressed in human vATL by a distributed code that changes dynamically with stimulus processing. The results challenge the view that semantic representations are encoded as activity patterns over independent feature detectors, and more generally offer a new framework for thinking about the time-course of representation in distributed neural systems.

Results
Simulation study. Simulations provide a formal basis for understanding the implications of the distributed and dynamic view, because a computer model's architecture, behavior, learning and testing patterns are fully known. Adapting prior models (7, 12, 13), we therefore assessed whether hub representations change dynamically with stimulus processing and how multivariate classification can uncover such a code.
The model was a fully continuous and recurrent neural network that learns cross-modal associations between distributed visual and verbal representations via three reciprocally connected hidden layers (Figure 2A). Given positive input to a subset of visual or verbal units, it learns to activate the corresponding item's unspecified visual and/or verbal attributes. Activity propagates in both directions, from surface representations to hub and back, so that the trained model settles to an attractor state representing any item specified by the input. We trained the model with patterns representing ninety items from three model conceptual domains (e.g., animals, objects, and plants), each organized into three categories containing ten items. Items from different categories in the same domain shared a few properties while those in the same category shared many. We trained the model for 30,000 epochs, attaining a mean accuracy of 99% (see Methods). We then presented the model with visual input for each item and recorded the resulting activation patterns over hub units for each tick of simulated time as the network settled.
To visualize how model internal representations changed during stimulus processing, we computed a 3D multi-dimensional scaling (MDS) of activation patterns at all timepoints during "perception" of each stimulus, then plotted the changing representation for each item as a line in this space. The result in Figure 2B shows a systematic but nonlinear elaboration of the conceptual structure latent in the stimulus patterns: the domains separate from one another early on, but each item follows a curved trajectory over time. Figure 2C shows these trajectories in the native activations of randomly-sampled unit pairs in one network run: they appear even more radically nonlinear. Consequently, independent analysis of each unit's behavior produces mixed results (Figure 2D), with some units behaving like tonic category detectors (green squares), some like transient detectors (blue), some appearing to flip their category preference (red) and others appearing not to code category information at all (gray).
Thus the full distributed pattern elaborates conceptual structure from early in processing, but the progression is nonlinear and only clearly discernible in a low-dimensional embedding of the space. Such an embedding can be computed for the model because we know which units are important and can apply the MDS to all and only those units. The same approach cannot be applied to ECoG data for two reasons. First, one cannot know a priori which channels record signals relevant for semantic representation and thus cannot simply compute a low-dimensional embedding of all data collected. Instead one must fit a statistical model that will selectively weight signals useful for discerning semantic structure. Second, whereas the model allows access to the entire network, a cortical surface sensor array only sparsely samples the neural responses contributing to a semantic representation. The problem requires a multivariate statistical model capable of revealing a dynamically changing neural code when fitted to sparsely-sampled neural data.
We therefore used multivariate pattern classifiers to decode semantic category information from hub activation patterns and assessed their behavior on simulated ECoG data. For each simulated participant we selected a sparse random subsample (15%) of all hub units and recorded their responses to each stimulus at every tick of time. We fitted a separate classifier at each timepoint to distinguish two semantic domains from the activation patterns elicited over the subsampled units. Figure 3A shows the cross-validated accuracy at each timepoint averaged over many network runs and subsamples. The classifiers performed well above chance as soon as input activation reached the hub units and throughout the time window.
To assess representational change over time we next adopted a temporal generalization approach (14), using the classifier fitted at one timepoint to decode the patterns observed at each other timepoint. Accuracy should remain high if the information a classifier exploits at the training time persists at other timepoints. The temporal generalization profile of each classifier thus indicates how the underlying neural code persists or changes over time. Classifiers fitted to earlier activation patterns generalized only to immediate temporal neighbors, while those fitted to later patterns generalized over a wider window but failed at decoding earlier states (Figure 3B). To better visualize these results, we clustered the rows of the matrix in Figure 3B and plotted the mean accuracy of the classifiers in each cluster across time (Figure 3C). The results exhibit an "overlapping waves" pattern: classifiers that work on early patterns quickly fail at later timepoints where a different classifier succeeds. As time progresses, the clusters include more classifiers and the breadth of time over which the classifiers perform well widens (Figure 3D). This pattern reflects the nonlinear trajectories apparent in the sparsely-sampled representational space. When trajectories curve, earlier classification planes fail later in time while later planes fail at earlier time-points (Figure 1A). If representations simply moved linearly from an initial to a final state, early classifiers would continue to perform well throughout processing, a pattern observed in further simulations with feature-based models, models with distributed representations that evolve linearly, and recurrent but shallow neural networks (see Supplementary Materials). In the deep network, the non-linear dynamic pattern was observed only in the hub layer; in more superficial layers, the code remained stable (Supplementary Materials).
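The temporal generalization logic can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-unit "rotating" code is a hypothetical geometry rather than the network described above, and a nearest-centroid rule is swapped in for logistic regression to keep the example dependency-free. It shows only the core effect (a decision rule fitted early fails once the trajectory has curved) and does not reproduce the widening generalization envelope.

```python
import math
import random

def make_frames(n_items, n_times, noise_sd, rng):
    """Toy two-unit code: category 0 sits at angle theta(t), category 1 opposite.

    theta sweeps from 0 to pi, so the final code mirrors the initial one
    (a hypothetical geometry chosen only to make trajectories curve).
    """
    frames = []
    for t in range(n_times):
        theta = math.pi * t / (n_times - 1)
        frame = []
        for i in range(n_items):
            label = i % 2
            sign = 1.0 if label == 0 else -1.0
            point = (sign * math.cos(theta) + rng.gauss(0.0, noise_sd),
                     sign * math.sin(theta) + rng.gauss(0.0, noise_sd))
            frame.append((point, label))
        frames.append(frame)
    return frames

def fit_centroids(frame):
    """Nearest-centroid classifier: a simple stand-in for logistic regression."""
    centroids = {}
    for label in (0, 1):
        pts = [p for p, lab in frame if lab == label]
        centroids[label] = (sum(x for x, _ in pts) / len(pts),
                            sum(y for _, y in pts) / len(pts))
    return centroids

def accuracy(centroids, frame):
    """Proportion of patterns assigned to the nearer (correct) centroid."""
    hits = 0
    for (x, y), label in frame:
        pred = min(centroids, key=lambda k: (x - centroids[k][0]) ** 2
                                            + (y - centroids[k][1]) ** 2)
        hits += int(pred == label)
    return hits / len(frame)

def temporal_generalization(train_frames, test_frames):
    """acc[i][j]: classifier fitted at time i, tested on held-out data at time j."""
    models = [fit_centroids(f) for f in train_frames]
    return [[accuracy(m, f) for f in test_frames] for m in models]

rng = random.Random(0)
train = make_frames(40, 9, 0.1, rng)   # patterns used to fit each classifier
test = make_frames(40, 9, 0.1, rng)    # independent patterns used for testing
acc = temporal_generalization(train, test)
print([round(a, 2) for a in acc[0]])   # profile of the classifier fitted at time 0
```

In this toy, each row of `acc` plays the role of one row of the decoding matrix: accuracy is high near the training time and falls below chance once the code has reversed.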
The simulations thus suggest that distributed and dynamic semantic representations can arise in deep layers of interactive networks and will elicit a particular "signature" when multivariate pattern classifiers are used to decode semantic structure from ECoG data.
Specifically, such classifiers will show:
(1) Constant decodability. Neural activity predicts stimulus category at every time point once activation reaches the vATL.
(2) Local temporal generalization. Classifiers generalize best to immediate temporal neighbors and worst to distal timepoints.
These are the characteristics we looked for in the ECoG study.
ECoG study. The dataset included local field potentials (LFPs) collected at 1000 Hz from 16-24 electrodes situated in the left ventral anterior temporal cortex of 8 patients awaiting surgery while they named line-drawings of common animate and inanimate items matched for a range of confounds (see Methods). We analyzed LFPs over the 1640 ms following stimulus onset using a 50 ms sliding-window approach in which separate classifiers were fitted for each window and the window advanced in 10 ms increments. The approach yielded 160 classifiers per subject, each decoding the LFPs across all vATL electrodes in a 50 ms window. Each classifier was then tested on all 160 time-windows. The classifiers were logistic regression models fitted with L1 regularization to encourage coefficients of 0 for many features (see Methods).
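The window count follows directly from the epoch length, window width and step size. A small sketch of the arithmetic, assuming half-open windows that must fit entirely inside the epoch (the function name and the boundary convention are illustrative assumptions):

```python
def sliding_windows(duration_ms=1640, width_ms=50, step_ms=10):
    """Return (start, end) pairs in ms for every full window inside the epoch."""
    last_start = duration_ms - width_ms
    return [(start, start + width_ms) for start in range(0, last_start + 1, step_ms)]

windows = sliding_windows()
print(len(windows), windows[0], windows[-1])
```

With the values given in the text this yields windows starting every 10 ms from 0 to 1590 ms, which matches the 160 classifiers per subject.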
Hold-out accuracy exceeded chance at about 200 ms post stimulus onset and remained statistically reliable throughout the time window (Figure 4A). By 200 ms classifiers generalized well to timepoints near the training window but poorly to more distal timepoints, with the generalization envelope widening as time progressed (Figure 4B). We again clustered the classifiers based on their temporal accuracy profile, then plotted mean profiles for each cluster (Figure 4C). The result was an "overlapping waves" pattern strikingly similar to the simulation: classifiers that performed well early in processing quickly declined in accuracy, replaced by a different well-performing set. Over time neighboring classifiers began to show similar temporal profiles, forming larger clusters that performed above chance for a broader temporal window (Figure 4D).
Finally, we considered whether and how the neuro-semantic code changed over time. For each time window we projected the classifier weights for all electrodes in all subjects to a cortical surface model, then animated the results (see movie S1). Figure 4E shows snapshots every 200 ms post stimulus onset. In mid-posterior regions the code was spatially and temporally stable: weights on the lateral aspect were positive while those on the medial aspect were negative. The anterior pattern differed, flipping from mainly positive at 200 ms to mainly negative by 800 ms. In other words, the "meaning" of a positive deflection in the LFP (whether it signaled animal or non-animal) stayed constant posteriorly but changed direction over time anteriorly, consistent with the deep, distributed, dynamic view (see Supplementary Materials).

Discussion.
We have presented evidence from computational modeling, multivariate pattern classification, and ECoG suggesting that semantic representations in human vATL are distributed and dynamic: neural populations throughout ventral temporal lobe jointly express semantic information from early in processing, but the distributed code changes over time, especially anteriorly. In simulation we showed that dynamic representational change arises in the deep layers of an interactive network, producing a characteristic decoding signature: classifiers perform well in the time-window when they were trained, but generalize over a narrow time envelope that widens as the system settles. These dynamics produced puzzling results when unit activations were analyzed independently, with some units behaving like tonic feature detectors, some like transient detectors, and some "flipping" the direction of their category preferences over time. Remarkably similar phenomena were observed in ECoG data collected from the surface of ventral temporal lobe while participants named line-drawings, supporting the proposal that semantic representations in vATL are deep, distributed and dynamic.

Are the results also consistent with feature-based theories? They certainly rule out the simple view that stimulus perception drives tonic activation of feature detectors down the ventral visual stream. Were this the case, classifiers that perform well early on should show continued good performance later (see Supplementary Materials). One could, perhaps, posit that feature detectors in different cortical regions become transiently active along different time-courses, with some engaging and disengaging early on and others only activating later. Under this view, each "wave" in Figure 3C could reflect the temporary activation of detectors in different cortical regions.
If so, the classifier weight maps should highlight different brain regions at different timepoints, but we found non-zero weights distributed across the entire field of view from the first moment that classification succeeds, a spatial distribution that changed hardly at all over time. What did change was the direction of the vATL semantic code: the "meaning" of a positive LFP flipped direction over time, contrasting with the stable positive-to-negative gradient observed more posteriorly. Several fMRI studies have reported a similar lateral-to-medial category-specific pattern in posterior fusiform (15, 16), lending external validity to our analysis and suggesting in turn that the shifting vATL pattern is not artifactual.
The dynamic semantic code in vATL also resolves a long-standing puzzle. Convergent methods have established the centrality of this region for semantic memory, including studies of semantic impairment (17-19), lesion-symptom mapping (20), functional (21, 22) and structural (23, 24) brain imaging, and transcranial magnetic stimulation (25). Yet multivariate approaches to discovering neuro-semantic representations almost never identify the vATL, instead revealing semantic structure in areas closer to the sensory periphery (26, 27). One prominent study suggested that semantic representations may tile the entire cortex except for the vATL (10).
Setting aside significant technical challenges of successful neuroimaging of this region (22), almost all such studies have employed non-invasive imaging techniques that sacrifice either temporal or spatial resolution, a compromise that will destroy signal in vATL if semantic representations there truly are distributed and dynamic, but will preserve signal in posterior regions where the code is more stable. Thus the widespread null result may arise precisely because semantic representations in vATL are distributed and dynamic.
Why should a dynamic code arise in the vATL? The area is situated at the top of the ventral visual stream, but also connects directly to core language areas (28) and, via middle temporal gyrus, to parietal areas involved in object-directed action (12). It receives direct input from smell and taste cortices (29), and is intimately connected with limbic structures involved in emotion, memory, and social cognition (30). Thus vATL anatomically forms the hub of a cross-

Model implementation details.
Model structure. The model implements the "distributed-plus-hub" theory of semantic representation developed in prior work to understand patterns of semantic impairment in patients with acquired neuropathology and patterns of functional activation observed in healthy participants during semantic task performance (3, 4, 7, 31).

Model environment. The model environment contained visual and verbal patterns for each of 90 simulated objects, conceived as belonging to 3 distinct domains (e.g. animals, objects, and plants). Each domain contained 10 items from each of 3 sub-categories; thus there were 30 "animals," 30 "objects" and 30 "plants." Visual patterns were constructed to represent each item by randomly flipping the bits of a binary category prototype vector in which items from the same domain shared a few properties and items from the same category shared many. The verbal patterns were constructed by giving each item a superordinate label true of all items within a given domain (animal, object, plant), a basic-level label true of all items within a category (e.g. "bird", "fish", "flower", etc.), and a subordinate label unique to the item (e.g. "robin", "salmon", "daisy", etc.). These procedures, adopted from prior work (7, 31, 33), generated model input/target vectors that approximate the hierarchical relations among natural concepts in a simplified manner that permits clear understanding and control of the relevant structure.
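The construction of the visual patterns can be sketched as follows. The 3 x 3 x 10 structure comes from the text; the number of domain-level and category-level bits, the flip probability, and all function names are hypothetical values chosen for illustration, since the text does not specify them.

```python
import random

N_DOMAINS, N_CATEGORIES, N_ITEMS = 3, 3, 10   # from the text: 3 x 3 x 10 = 90 items
DOMAIN_BITS, CATEGORY_BITS = 2, 6             # hypothetical: "a few" vs "many" shared properties
FLIP_PROB = 0.1                               # hypothetical bit-flip probability

def category_prototype(domain, category):
    """Binary prototype: domain-level bits plus bits specific to one category."""
    n_cats = N_DOMAINS * N_CATEGORIES
    proto = [0] * (N_DOMAINS * DOMAIN_BITS + n_cats * CATEGORY_BITS)
    d0 = domain * DOMAIN_BITS
    for i in range(d0, d0 + DOMAIN_BITS):
        proto[i] = 1
    c0 = N_DOMAINS * DOMAIN_BITS + (domain * N_CATEGORIES + category) * CATEGORY_BITS
    for i in range(c0, c0 + CATEGORY_BITS):
        proto[i] = 1
    return proto

def make_visual_patterns(seed=0):
    """One noisy copy of the category prototype per item, as (domain, category, item, bits)."""
    rng = random.Random(seed)
    items = []
    for d in range(N_DOMAINS):
        for c in range(N_CATEGORIES):
            proto = category_prototype(d, c)
            for k in range(N_ITEMS):
                bits = [bit ^ int(rng.random() < FLIP_PROB) for bit in proto]
                items.append((d, c, k, bits))
    return items

def shared_ones(a, b):
    """Number of properties that are active in both vectors."""
    return sum(x & y for x, y in zip(a, b))

items = make_visual_patterns()
print(len(items), len(items[0][3]))
```

Under these assumed sizes, two prototypes from the same domain but different categories share only the domain bits, while items generated from the same prototype share most of their active bits, which is the hierarchical similarity structure the text describes.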
Training. For each input, target patterns that fully specified the item's visual and verbal characteristics were applied throughout the duration of stimulus processing. The model was trained with backpropagation to minimize squared error loss. Half of the training patterns involved generating verbal outputs from visual inputs, while the other half involved generating visual outputs from verbal inputs. The model was initialized with small random weights sampled from a uniform distribution ranging from -1 to 1, then trained for 30,000 epochs in full batch mode with a learning rate of 0.002 and without weight decay. For each pattern, the settling process was halted after 20 activation updates, or when all Visual and Verbal units were within 0.2 of their target values, whichever came first.
For all reported simulations the model was trained 5 times with different random weight initializations. After training, all models generated correct output activations (i.e., on the correct side of the unit midpoint) for more than 99% of output units across all training runs. Each model was analyzed independently, and the final results were then averaged across the five runs.
Testing. The model was tested by presenting visual input for each of the 90 items in its environment and recording the resulting activations in the 25 hub units at each update as the model settled over the course of 32 updates (producing 33 time-points including the initial state). These activations were then distorted with uniform noise sampled from -0.005 to 0.005 to simulate measurement error in ECoG.
Analysis. All analyses were conducted using R version 3.6. To visualize the trajectory of hub representations through unit activation space, we computed a simultaneous 3-component multidimensional scaling of the unit activation patterns for all 90 items at all 33 timepoints. Pairwise Euclidean distances amongst all 2970 vectors were computed and subjected to a classical multi-dimensional scaling algorithm using the native R function cmdscale to extract three latent dimensions. The resulting coordinates for a given item at each point in time over the course of settling were plotted as lines in a 3D space using the scatterplot3d package in R. Figure 2B shows the result for one network training run. Figure 2C shows the same trajectories in the raw data (i.e., actual unit activation states rather than latent dimensions in an MDS) for randomly-sampled pairs of hub units.

To simulate decoding of ECoG data, we evaluated the ability of pattern decoders (i.e., binary classifiers) to determine the correct superordinate category from patterns of activity arising in the hub for each item at each timepoint. As explained in the main text, we assume that ECoG measures only a small proportion of all the neural populations that encode semantic information. We therefore sub-sampled the hub-unit activation patterns by selecting 3 units at random from the 25 hub units and using their activations to provide input to the decoder. We fitted three decoders to discriminate, respectively, animals from objects, animals from plants, and plants from objects. The decoders were fitted with logistic regression using the glm function and the binomial family in R. A separate decoder was fitted at each time-point, and unit activations were mean-centered independently at each time point prior to fitting the classifier.
We assessed decoder accuracy at the time-point where it was fitted using 90-fold leave-one-out cross-validation, and also assessed each decoder at every other time point by using it to predict the most likely stimulus category given the activation pattern at that time point and comparing the prediction to the true label. This process was repeated 10 times for each model with a different random sample of 3 hub units on each iteration. Thus for each trained neural network, decoding accuracy at a single timepoint reflected the mean accuracy across 30 decoders: animal vs object, animal vs plant, and object vs plant, each fitted to 10 independent sub-samples of 3 hub units. The reported results then show mean decoding accuracy averaged over the 5 independent network training runs, for decoders trained and tested at all 33 time points. The above procedure yielded the decoding accuracy matrix shown as a heat plot in Figure 3B.
Each row of this matrix shows the mean accuracy of decoders trained at a given timepoint, when those decoders are used to predict item domain at each possible timepoint. The diagonal shows hold-out accuracy for decoders at the same time point at which they were trained, while off-diagonal elements show how the decoders fare for earlier (below diagonal) or later (above) timepoints. Decoders that perform similarly over time likely exploit similar information in the underlying representation, and so can be grouped together and their accuracy profiles averaged to provide a clearer sense of when the decoders are performing well. To this end, we clustered the rows of the decoding accuracy matrix by computing the pairwise cosine distances between them and subjecting the resulting distances to a hierarchical clustering algorithm using the native hclust function in R with complete agglomeration. We cut the resulting tree to create 10 clusters, then averaged the corresponding rows of the decoding accuracy matrix to create a temporal decoding profile for each cluster (lines in Figure 3C). We selected 10 clusters because this was the highest number in which each cluster beyond the first yielded a mean classification accuracy higher than the others at some point in time. Similar results were obtained for all cluster-sizes examined, however.

Finally, to understand the time-window over which each cluster of decoders performs reliably better than chance, we computed a significance threshold using a one-tailed binomial probability distribution with Bonferroni correction. Each decoder discriminates two categories from 60 items, with probability 0.5 of membership in either category. We therefore adopted a significance threshold of 44 correct items out of 60, corresponding to a binomial probability of p < 0.03 with Bonferroni correction for 330 tests (10 clusters at each of 33 time points).
The barplot in Figure 3D shows the proportion of the full time window during which each decoding cluster showed accuracy above this threshold.
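The row-clustering step can be sketched in a few lines. The analysis described above used R's hclust; the version below is a plain-Python stand-in that stops agglomeration once the requested number of clusters remains, which is intended to give the same partition as cutting a complete-linkage tree at that number of clusters (ties aside). The small accuracy matrix is invented purely for illustration.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two accuracy profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def complete_linkage(rows, n_clusters):
    """Repeatedly merge the two clusters whose farthest members are closest."""
    n = len(rows)
    dist = [[cosine_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

def cluster_profiles(rows, clusters):
    """Mean accuracy over time for the decoders in each cluster."""
    n_times = len(rows[0])
    return [[sum(rows[i][t] for i in c) / len(c) for t in range(n_times)]
            for c in clusters]

# Invented toy matrix: rows are decoders (by training time), columns are test times.
toy_accuracy = [
    [0.9, 0.8, 0.5, 0.5, 0.5, 0.5],
    [0.8, 0.9, 0.6, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.6, 0.9, 0.9, 0.8],
    [0.5, 0.5, 0.5, 0.8, 0.9, 0.9],
]
toy_clusters = complete_linkage(toy_accuracy, 2)
toy_profiles = cluster_profiles(toy_accuracy, toy_clusters)
print(toy_clusters)
```

Plotting each entry of `toy_profiles` against time would give one "wave" per cluster, analogous to the lines in Figure 3C.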

ECoG methods and materials
Participants. Eight patients with intractable partial epilepsy (seven) or brain tumor (one) originating in the left hemisphere participated in this study. These include all left-hemisphere cases described in a previous study (34), and we will use the same case numbers reported in that work (specifically cases 1-5, 7, and 9-10). Background clinical information about each patient is summarized in Table S1.

Participants all gave written informed consent to participate in the study.
Stimuli and Procedure. One hundred line drawings (50 living and 50 nonliving items) were obtained from previous norming studies (35, 36). A complete list of all items can be found in (34). Living and nonliving stimuli were matched on age of acquisition, visual complexity, familiarity and word frequency. Independent-sample t-tests did not reveal any significant differences between living and nonliving items for any of these variables.
Participants were presented with stimuli on a PC screen and asked to name each item as quickly and accurately as possible. All stimuli were presented once in a random order in each session and repeated over four sessions in the entire experiment. Each trial was time-locked to the picture onset using in-house MATLAB scripts (version 2010a, Mathworks, Natick, MA). Stimuli were presented for 5 seconds each and each session lasted 8 minutes 20 seconds. Participants' mean naming time was 1190 ms. Responses and eye fixation were monitored by video recording.

Data preprocessing. Data preprocessing was performed in MATLAB. Raw data were recorded at a sampling rate of 1000 Hz for six patients and at 2000 Hz for two patients. The data recorded at the higher rate were down-sampled to 1000 Hz by averaging measurements from each successive pair of time-points. The raw data from the target subdural electrodes for the subsequent analysis were measured in reference to the electrode beneath the galea aponeurotica in 4 patients (Patients 4, 5, 7 and 10) and to the scalp electrode on the mastoid process contralateral to the side of electrode implantation in 4 patients (Patients 1-3 and 9). Data included, for each stimulus at each electrode, all measurements beginning at stimulus onset and continuing for 1640 ms. Baseline correction was performed by subtracting the mean pre-stimulus baseline amplitude (200 ms before picture onset) from all data points in the epochs. Trials with greater than +/-500 µV maximum amplitude were rejected as artifacts. Visual inspection of all raw trials was conducted to reject any further trials contaminated by artifacts, including canonical interictal epileptiform discharges. The mean waveform for each stimulus was computed across repetitions.
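The automated preprocessing steps can be sketched as small list operations. The original processing was done in MATLAB; the Python functions below are illustrative stand-ins with hypothetical names, and they assume that the amplitude criterion is applied to baseline-corrected epochs (the text lists the steps in that order but does not state it explicitly). Manual visual inspection is not represented.

```python
def downsample_by_pairs(samples):
    """Halve the sampling rate by averaging each successive pair of samples."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]

def baseline_correct(epoch, baseline):
    """Subtract the mean pre-stimulus amplitude from every point in the epoch."""
    mean = sum(baseline) / len(baseline)
    return [x - mean for x in epoch]

def is_artifact(epoch, limit_uv=500.0):
    """Flag a trial whose absolute amplitude exceeds the limit at any point."""
    return max(abs(x) for x in epoch) > limit_uv

def mean_waveform(trials):
    """Average the retained repetitions of one stimulus, point by point."""
    kept = [t for t in trials if not is_artifact(t)]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy usage with made-up numbers (microvolts).
print(downsample_by_pairs([1, 3, 5, 7]))
print(baseline_correct([12, 8], [9, 11]))
print(mean_waveform([[1, 2], [3, 4], [0, 900]]))
```

In the last call the third trial is dropped because one sample exceeds 500 µV, so the mean is taken over the first two repetitions only.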
Multivariate classification analysis.
The pre-processed data yielded, for each electrode in each patient, a time-series of local field potentials.
Classifier accuracy for a given time-window and subject was assessed using nested 10-fold cross-validation. In each outer fold, 10% of the data were held out, and the remaining 90% of the data were used with standard 9-fold cross-validation to search a range of values for the regularization parameter. Once the best regularization weight was selected, a model was fitted to all observations in the 90% training set and evaluated against the remaining 10% in the outer-loop hold-out set. This process was repeated 10 times with different final hold-outs, and classifier accuracy for each patient was taken as the mean hold-out accuracy across these folds. The means across patients are the data shown in Figure 4A and the diagonal of 4B in the main paper. A final classifier for the window was then fitted using all of the data and the best regularization parameter. This classifier was used to decode all other time-windows, yielding the off-diagonal accuracy values shown in Figure 4B.
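The nested cross-validation scheme can be sketched as follows. This is an illustration only: the L1-penalized logistic classifier is fitted here with a simple proximal-gradient (ISTA) routine chosen for self-containment, which is an assumption and not the solver used in the study, and the function names, the candidate grid `lams`, and the unstratified random folds are likewise hypothetical.

```python
import numpy as np

def fit_l1_logistic(X, y, lam, n_iter=300):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)   # 1 / Lipschitz bound
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w = w - step * (X.T @ (p - y)) / n
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
        b -= step * np.mean(p - y)
    return w, b

def accuracy(model, X, y):
    w, b = model
    return np.mean(((X @ w + b) > 0) == (y == 1))

def kfold(n, k, rng):
    """Split indices 0..n-1 into k random, roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def nested_cv(X, y, lams, k_outer=10, k_inner=9, seed=0):
    """Mean outer hold-out accuracy, with lam chosen by inner cross-validation."""
    rng = np.random.default_rng(seed)
    outer_acc = []
    for test_idx in kfold(len(y), k_outer, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        Xtr, ytr = X[train_idx], y[train_idx]
        inner = kfold(len(ytr), k_inner, rng)

        def inner_score(lam):
            scores = []
            for val in inner:
                fit = np.setdiff1d(np.arange(len(ytr)), val)
                model = fit_l1_logistic(Xtr[fit], ytr[fit], lam)
                scores.append(accuracy(model, Xtr[val], ytr[val]))
            return np.mean(scores)

        best = max(lams, key=inner_score)
        model = fit_l1_logistic(Xtr, ytr, best)      # refit on the full 90%
        outer_acc.append(accuracy(model, X[test_idx], y[test_idx]))
    return float(np.mean(outer_acc))
```

Here `X` would hold one row per stimulus and one column per electrode-by-timepoint measurement in the window, and `y` the animal/non-animal labels coded 1/0. Because the hold-out items in each outer fold never influence the choice of regularization weight, the resulting accuracy estimate is not biased by the parameter search.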
The above procedures produced a pattern classifier for each of 160 50 ms time-windows in every subject, with every classifier then tested at every time-window within each subject. Thus the classifier accuracy data were encoded in a 160x160-element decoding matrix in each subject. The matrices were averaged to create a single 160x160-element matrix indicating the mean decoding accuracy for each classifier at each point in time across subjects. This is the matrix shown in Figure 4B.
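The construction of a decoding matrix of this kind can be sketched as below. Several details are assumptions made for illustration: a 10 ms stride between window onsets is inferred from the fact that 160 windows of 50 ms span 1640 ms, a difference-of-class-means classifier stands in for the regularized logistic model, and no cross-validation is performed, whereas the diagonal reported in the paper uses hold-out accuracy.

```python
import numpy as np

def window_starts(n_samples=1640, width=50, n_windows=160):
    """Evenly spaced window onsets; with these defaults the stride is 10 samples."""
    stride = (n_samples - width) // (n_windows - 1)
    return np.arange(n_windows) * stride

def fit_centroid(X, y):
    """Stand-in linear classifier: difference of class means (not the paper's model)."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0
    return w, b

def generalization_matrix(data, y, starts, width=50):
    """data: (n_items, n_electrodes, n_samples); y: 0/1 labels per item.
    Entry [i, j] is the accuracy of the classifier fitted on window i
    when it is applied to the measurements in window j."""
    feats = [data[:, :, s:s + width].reshape(len(y), -1) for s in starts]
    acc = np.zeros((len(starts), len(starts)))
    for i, Xi in enumerate(feats):
        w, b = fit_centroid(Xi, y)
        for j, Xj in enumerate(feats):
            acc[i, j] = np.mean(((Xj @ w + b) > 0) == (y == 1))
    return acc
```

Averaging the per-subject outputs of `generalization_matrix` element-wise would correspond to the group-level matrix described above.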
To better visualize how the code exploited by each classifier changes over time, we clustered the rows using the same agglomerative hierarchical approach described for the simulations. We considered solutions ranging from 4 to 15 clusters and plotted the mean decoding accuracy over time across the classifiers within each cluster. All cluster sizes produced the overlapping-waves pattern. In the main paper we show the 10-cluster solution as it is the largest number in which each cluster after the first has a mean accuracy profile that is both statistically reliable and higher than every other cluster at some point in time.
To assess the breadth of time over which a cluster showed reliable above-chance classification accuracy, we again set Bonferroni-corrected significance thresholds using the binomial distribution. Stimuli included 100 items, with a .5 probability of each item depicting an animal. In the 1640 ms measurement period there are 32 independent (i.e., non-overlapping) 50 ms time windows, and we assessed the mean classifier performance for each of 10 clusters at every window. We therefore corrected for 320 multiple comparisons using a significance threshold of 68 correct (p < 0.0001 per comparison, p < 0.03 with correction).
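The arithmetic behind this threshold can be checked with an exact binomial calculation. The sketch below assumes that the criterion corresponds to the chance probability of exceeding 68 correct out of 100 items, since that reading reproduces the stated per-comparison and corrected values; the function name is hypothetical.

```python
from math import comb

def upper_tail(k, n=100):
    """P(X > k) for X ~ Binomial(n, 0.5), computed exactly with integer arithmetic."""
    return sum(comb(n, i) for i in range(k + 1, n + 1)) / 2 ** n

n_windows = 1640 // 50            # 32 non-overlapping 50 ms windows
n_tests = n_windows * 10          # 10 clusters per window -> 320 comparisons
p_single = upper_tail(68)         # chance probability of exceeding 68 correct
p_corrected = min(1.0, n_tests * p_single)   # Bonferroni correction
```

With these definitions `p_single` falls below 0.0001 and `p_corrected` below 0.03, in line with the values reported above.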
Visualizing solutions on surface plots.
Structural brain imaging and electrode localization. A magnetization-prepared rapid gradient-echo (MPRAGE) volumetric scan was performed before and after implantation of subdural electrodes as part of presurgical evaluations. In the volumetric scan taken after implantation, the location of each electrode was identified on the 2D slices using its signal void due to the property of platinum alloy (38). Electrodes were non-linearly coregistered to the patient MRI (MPRAGE) taken before implantation, and then to MNI standard space (ICBM-152) using FNIRT (www.fmrib.ox.ac.uk/fsl/fnirt/). The native coordinates of all the electrodes for all patients were morphed into MNI space and resampled into 2 mm isotropic voxels (39).
Projecting classifier coefficients to the surface. As described above, a separate logistic classifier was fitted to each 50 ms window in each subject. The classifier was specified as a set of regression coefficients, with one coefficient for each timepoint at each electrode in the patient, and many coefficients set to 0 due to L1 regularization. The sign of the classifier coefficient indicates the "meaning" of an LFP deflection in a particular direction: a positive coefficient indicates that animals are "signaled" by a positive deflection in the LFP, while a negative coefficient indicates that animals are signaled by a negative deflection. The magnitude of the coefficient indicates the "importance" of the measurement, in the context of all other LFPs appearing in the classifier. The distribution of coefficient directions and magnitudes across the cortex and over time thus provides an indication of how the underlying neuro-semantic code changes over time. We therefore analyzed the temporal and spatial distribution of mean coefficients across participants as follows.
For a single time window we computed, separately for each electrode in each participant, the magnitudes (sum of absolute values) of the classifier weights across the 50 time points in the window. The resulting data were exported from MATLAB to NIFTI volumes using the NIFTI toolbox (https://www.mathworks.com/matlabcentral/fileexchange/8797-tools-for-nifti-and-analyze-image) and projected from all electrodes and subjects onto the common cortical surface map using AFNI's 3dVol2Surf relative to the smooth white matter and pial surfaces of the ICBM 152 surface reconstructions shared by the AFNI team and the NIH (https://afni.nimh.nih.gov/pub/dist/tgz/suma_MNI152_2009.tgz). The space between corresponding nodes on the two surfaces was spanned by a line segment sub-divided at 10 equally spaced points. The value displayed on the surface is the average of the values where these 10 points intersect with the functional volume along that line segment. Once mapped to the surface, the results were spatially smoothed along the surface with a 6 mm full-width half-maximum (FWHM) Gaussian kernel using the SurfSmooth function in SUMA. We inclusively masked any surface point with a non-zero value in this surface projection. A separate mask was generated for each time window.
To visualize how the representational code changes over time within the surface mask we next carried out a similar procedure on the classifier coefficients themselves, without taking the absolute values. At each electrode in every subject we summed the classifier coefficients over the 50 ms time window, yielding a single positive or negative real-valued number at each electrode for each time window. These values were again projected onto a common brain surface and spatially smoothed with an 8 mm FWHM Gaussian blur along the surface. In the resulting maps, any colored point indicates a cortical region that received a non-zero value in the weight magnitude mask, while the hue indicates the direction of the classifier coefficient in the area, that is, whether a positive deflection of the LFP for nearby electrodes indicated that the stimulus was an animal (warm colors), a non-animal (cool colors), or showed no systematic direction (green). A separate map of this kind was generated for each of 160 time windows.
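The two per-electrode summaries that feed the surface maps can be sketched as below. The sketch covers only the electrode-level computation: the array layout and function name are assumptions, and the mask here is an electrode-level analog of the surface mask, which in the actual pipeline was defined after projection and smoothing with AFNI/SUMA.

```python
import numpy as np

def summarize_window(coefs):
    """coefs: (n_electrodes, 50) classifier coefficients for one 50 ms window.
    Returns, per electrode, the weight magnitude (sum of absolute values),
    the signed sum of the coefficients, and a mask of non-zero magnitude."""
    magnitude = np.abs(coefs).sum(axis=1)   # "importance", used for the mask
    signed = coefs.sum(axis=1)              # direction of the code, used for hue
    mask = magnitude > 0
    return magnitude, signed, mask
```

An electrode can have non-zero magnitude yet a signed sum near zero when positive and negative coefficients cancel within the window, which corresponds to the "no systematic direction" case in the maps.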
We animated the results to visualize how they change over time using the open-source ffmpeg software (https://ffmpeg.org/) with linear interpolation between successive frames. The animation is shown in Movie S1; snapshots of this visualization are shown in Figure 4E.
Supplementary Text: Comparison of deep model results to control models.
The main paper highlights four properties of the neural decoding results observed in both the deep neural network model and in the human ECoG data: constant decodability, local temporal generalization, a widening window of generalization, and change in neural code direction in the ATL hub. We suggested that these properties arise because semantic structure is encoded as distributed activation patterns that change in highly nonlinear ways due to their situation in the deep cross-modal hub of a dynamic cortical network. This argument implies that the signature pattern would not arise in models that adopt different kinds of representation and processing mechanisms, nor in the shallower layers of the deep model. In this section we assess this implication by comparing the main results with those observed in three alternative models of semantic representation.
Distributed versus feature-based representations.
By distributed representation, we mean that many neural populations or units can jointly contribute to representation of structure even if they do not each independently encode the structure. Deep neural network models are capable of acquiring distributed representations of this kind (40) and may be contrasted with models proposing that semantic representations are comprised of elements that each independently detect a particular semantic feature, such as membership in a particular conceptual domain or category. We therefore considered what the decoding signature would look like in such a feature-based model. Each of the 90 items was represented with a vector in which two elements were dedicated to each conceptual domain (animal, plant, object) and to each basic-level category (bird, fish, flower, etc.; total of 9 categories). For instance, a particular instance of flower would activate the two "plant" features and the two "flower" features; an instance of tree would activate the same two "plant" features and two "tree" features, etc. This yielded 24 elements total; to equate the number of features with the number of units in the deep network simulation, we added a 25th vector element that always adopted a low activation value.
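The layout of these feature vectors can be made concrete with a short sketch. The text fixes the totals (90 items, 3 domains, 9 categories, 25 elements); the even split into 3 categories per domain and 10 items per category, the ordering of elements, and the value 0.1 for the low-activation element are assumptions made here for illustration.

```python
import numpy as np

N_DOMAINS, CATS_PER_DOMAIN, ITEMS_PER_CAT = 3, 3, 10   # assumed even split
LOW = 0.1                                               # assumed "low activation value"

def feature_patterns():
    """Return a (90, 25) matrix: 6 domain features, 18 category features,
    and one filler element that is always weakly active."""
    n_cats = N_DOMAINS * CATS_PER_DOMAIN
    patterns = np.zeros((n_cats * ITEMS_PER_CAT, 25))
    for item in range(patterns.shape[0]):
        cat = item // ITEMS_PER_CAT
        dom = cat // CATS_PER_DOMAIN
        patterns[item, 2 * dom: 2 * dom + 2] = 1.0      # two domain features
        start = 2 * N_DOMAINS + 2 * cat
        patterns[item, start: start + 2] = 1.0          # two category features
    patterns[:, 24] = LOW                               # 25th filler element
    return patterns
```

In this scheme every item activates exactly four semantic features, and items from different categories within a domain share only their two domain features.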
We simulated the gradual activation of features over the course of processing by generating a 33-step time-series for each feature and each item presentation. All units began with an activation of 0, and features true of the stimulus would ramp up their activation according to a sigmoid function with a constant slope and a randomly sampled offset term determining when in the stimulus presentation window the feature would begin to activate. This procedure yielded a dataset analogous to the evolution of internal representations in the deep network, but with feature-based semantic representations in which features activated with randomly sampled time-courses.
Dynamic versus linear.
By dynamic processing, we mean that units can influence themselves via feedback from the other units to
Deep versus shallow.
By deep network, we mean a neural network that has multiple hidden layers interposing between Visual and Verbal representation units. Depth generally allows neural networks to discover and represent more complex statistical relations amongst connected units (41). It also allows for more complex temporal dynamics as mutual influences across distal network components take more processing time. We assessed the importance of network depth in two ways.
First we compared the behavior of the hub layer in the deep network to that of a shallow network employing just a single hidden layer containing 25 units reciprocally connected to Visual and Verbal units and parameterized identically to the hub units in the deep model. We trained the network for 30k epochs exactly as described for the deep network, using the same training and testing patterns and procedures. This procedure yielded a dataset in which internal representations were distributed and dynamic as in the deep model, but arose within a shallower network.
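Returning briefly to the feature-based control model, the sigmoid activation time-courses generated for its features can be sketched as follows. The slope value, the uniform distribution of offsets over the 33 steps, and the subtraction of each unit's initial value (so that every unit starts at exactly 0) are assumptions; the text specifies only a constant slope, a randomly sampled offset, and a starting activation of 0.

```python
import numpy as np

def ramp_timeseries(patterns, n_steps=33, slope=1.0, seed=0):
    """patterns: (n_items, n_features) target activations (0 for absent features).
    Returns (n_items, n_steps, n_features) activations that rise from 0
    along a sigmoid whose onset differs randomly across items and features."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)[None, :, None]
    offset = rng.uniform(0, n_steps - 1,
                         size=(patterns.shape[0], 1, patterns.shape[1]))
    rise = 1.0 / (1.0 + np.exp(-slope * (t - offset)))
    rise = rise - rise[:, :1, :]        # assumption: shift so step 0 is exactly 0
    return patterns[:, None, :] * rise
```

Features that are not true of an item remain at 0 throughout, while active features increase monotonically, so a unit's relation to category membership never reverses over time in this model.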
Second, we compared the behavior of the hub layer in the deep network to the patterns emerging across intermediate (visual hidden and verbal hidden) and shallow (visual representation and verbal representation) layers in the same model. For this comparison, we recorded the activation time-series produced in response to each visual stimulus, for every unit in the model. We then assessed the propensity for units in each layer to behave like individual feature detectors, unresponsive units, or units that appear to "switch" their category preference over time, taking the "switch" behavior as a marker of distributed and dynamic representation.
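One way such a three-way labeling of units could be implemented is sketched below. The decision rule is an assumption made for illustration: a unit is called a feature-detector if its activation correlates with animal membership in only one direction at any time step, a "switch" unit if it correlates in both directions at different steps, and unresponsive otherwise. The fixed correlation cut-off `r_crit` is a hypothetical stand-in for a proper significance test.

```python
import numpy as np

def classify_unit(act, is_animal, r_crit=0.3):
    """act: (n_items, n_steps) activations of one unit over processing time;
    is_animal: (n_items,) 0/1 labels. Returns one of three string labels."""
    y = is_animal - is_animal.mean()
    a = act - act.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (y ** 2).sum())
    # Pearson correlation with the label at each step; 0 where a unit is constant.
    r = np.divide(a.T @ y, denom, out=np.zeros(act.shape[1]), where=denom > 0)
    pos, neg = np.any(r > r_crit), np.any(r < -r_crit)
    if pos and neg:
        return "switch"
    return "feature-detector" if (pos or neg) else "unresponsive"
```

Applying `classify_unit` to every unit in a layer and tabulating the labels would give the proportions of each unit type for that layer.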

Results.
In the first analysis, we subjected each alternative model to the same analyses reported for the primary model and assessed whether it also showed the four signature properties identified in the main paper.
Constant decodability. All four models showed cross-validation accuracy reliably above chance and consistently high across the time-window once input signals reached the representation units; thus all models showed constant decodability.
Local temporal generalization. Figure S1A shows a 3D MDS of the trajectories for all items through the corresponding representation space in each model. For feature-based, linear, and shallow models, the trajectories are strictly or nearly linear; only the deep, distributed and dynamic model shows the nonlinearities discussed in the main paper. Consequently the models show qualitatively different patterns of generalization over time: feature-based, linear, and shallow models show a pattern in which all classifiers generalize poorly to earlier timepoints and well to later timepoints. Thus local temporal generalization, in which classifiers do well only for neighboring time points in both past and future, is only observed in the deep, distributed and dynamic model (S1B).
Widening generalization window. In contrast to the ECoG data and the deep, distributed and dynamic model, the alternative models all show a narrowing window of temporal generalization: models fitted early in processing show good performance over a wider window than those trained later.
Change in code direction. The deep, distributed and dynamic model acquired representations in which some single units, when analyzed independently, behaved like feature-detectors that change in direction over processing, with high activations initially predicting an animal stimulus, for instance, then later predicting a non-animal stimulus. A similar flipping of signal direction was also observed in the more anterior parts of the ventral temporal lobe, via the changing sign of the classifier coefficients identified in the ECoG data. We therefore considered whether a change in code direction was observed for single units considered independently in the alternative models.
Specifically, we classified each unit in each simulation as (1) a feature-detector if its activity correlated significantly with conceptual domain in only one direction over the time-course of processing, (2)
These control simulations establish that all three properties (distributed representation, dynamic processing, and network depth) conspire to yield the decoding signature observed in the ECoG data: local temporal generalization, a widening window of generalization, and neural populations whose code direction appears to change over time when considered independently.
For each model type, the proportion of units in the semantic representation that behave like feature-detectors (red), detectors that switch their category preference over time (green), and units that seem unresponsive to the semantic category (blue). Only the deep, distributed, dynamic model has units whose responses switch their category preference over time.
Black arrows indicate connectivity between layers. Only the hub layer of the network, the model analog to the ventral anterior temporal cortex, contained units whose responses switch their category preference over time.
Table S1. Patient demographics and clinical information.
Animation showing the spatial distribution and sign of classifier coefficients over time, averaged across subjects, spatially smoothed, and slowed by a factor of ~20. The dotted horizontal line divides the field of view into posterior regions (bottom) where the code is relatively stable and anterior regions (top) where it varies considerably over time.