An overview of functional alignment in artificial and biological neural networks: Current recommendations and open questions

Functional alignment is a method for finding similarity in functional representations of both biological and artificial neural networks. Although it is actively developed in cognitive neuroscience and deep learning, each field prefers its own terminology for and variants of this method. There is, therefore, relatively little cross talk between the two spaces. In this brief review, we highlight three functional alignment methods successfully used in both fields: canonical correlation analysis, Procrustes analysis, and shared response modelling. We consider the relative strengths of each method and highlight situations in which each may be most appropriate. We conclude with open questions in functional alignment that may serve as collaborative opportunities for cognitive neuroscience and deep learning.


Introduction
One of the fundamental challenges for cognitive neuroscience is to find similarity across neural diversity (Churchland, 1998); that is, to find shared or similar neural processes supporting the diversity of individual cognitive experience. This goal is not unique to cognitive neuroscience, however, and is in fact shared across biological and artificial neural networks. Indeed, it can be considered more generally as a problem of aligning functional representations. For the purposes of this work, we can define functional representations broadly as the parameterization of internal states of a neural system that carry informational content and thereby play a functional role (Bechtel, 1998). Practically, we can treat them as activation vectors within a high-dimensional space defined by, e.g., the neurons or voxels of the network (Churchland, 1998). In deep learning, multiple random initializations of the same neural network architecture trained on the same data set will yield different layerwise functional representations (Li, Yosinski, Clune, Lipson, & Hopcroft, 2015). In neuroscience, anatomical variability and poor structure-function correspondence across association cortex (Rodriguez-Vazquez et al., 2019) yield misaligned functional representations across subjects for an identical stimulus, even following state-of-the-art anatomical normalization. Despite the immediate potential of functional alignment methods, these tools are underutilized and often misunderstood within each field. Here, we review three methods used in functionally aligning both artificial and biological neural networks: Canonical Correlation Analysis (CCA), Procrustes analysis (also known in the neuroscience literature as hyperalignment), and Shared Response Modelling (SRM). Expanding on Barrett, Morcos, and Macke (2019), we argue that functional alignment is a promising direction for collaboration between deep learning and cognitive neuroscience.
We note open questions in current formulations of functional alignment and suggest future research directions that may benefit both fields.

Canonical correlation analysis
As proposed by Hotelling (1936), Canonical Correlation Analysis (CCA) was originally designed to deal with multi-view samples where we have two views on the same data; for example, audio and visual recordings of the same speaker.
Consider input matrices $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$, where $n$ is the number of samples (e.g., time points in fMRI), and $p_1$, $p_2$ are the number of units (e.g., neurons or voxels) for each network. Interestingly, the dimensionality of these matrices varies dramatically across fields, with neuroscience applications often considering $n$ and $p$ to be in the range of 100-1000, while deep learning applications consider $n$ and $p$ values in the range of 10,000-100,000.
When $p_1 \leq p_2$, CCA derives a vector of canonical correlation coefficients $\rho = (\rho_1, \rho_2, \ldots, \rho_{p_1})$. We can assume that the matrices have been pre-processed to center their columns.
For a given index $i$, then, $\rho_i$ can be defined as

$$
\rho_i = \max_{w_X^i,\, w_Y^i} \operatorname{corr}\left(X w_X^i,\; Y w_Y^i\right) \quad \text{subject to} \quad X w_X^i \perp X w_X^j, \;\; Y w_Y^i \perp Y w_Y^j \quad \forall\, j < i. \tag{1}
$$

We can also consider this maximizing of correlation as minimizing distance (Xu, Lorbert, Ramadge, Guntupalli, & Haxby, 2012), in which case we can re-write CCA as

$$
\min_{W_X,\, W_Y} \left\| X W_X - Y W_Y \right\|_F^2 \quad \text{subject to} \quad W_X^\top X^\top X W_X = W_Y^\top Y^\top Y W_Y = I, \tag{2}
$$

where the columns of $W_X \in \mathbb{R}^{p_1 \times p_1}$ and $W_Y \in \mathbb{R}^{p_2 \times p_1}$ collect the weight vectors $w_X^i$ and $w_Y^i$.
In functional alignment of biological or artificial neural networks, where $X$ and $Y$ are sub-sampled from two subjects (in the case of biological networks) or two initializations (in the case of artificial networks), some concerns emerge in using generic CCA. In particular, since CCA maximizes these canonical correlation coefficients, if the two data sources share correlated noise, we will learn a joint representation driven by noise rather than signal. This is especially a concern for functional magnetic resonance imaging (fMRI) data with its low signal-to-noise ratio. Variants such as projection-weighted CCA (Morcos, Raghu, & Bengio, 2018) and L2-regularized CCA (Bilenko & Gallant, 2016) are thus designed to reduce the influence of noise, though they adopt different strategies in doing so.
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0
In addition, Kornblith et al. (2019) show that for data sets with $p_2 \geq n$ (e.g., wide convolutional network layers with more neurons than examples in the training data set), all similarity indices that are invariant to invertible linear transformations give the same result for any pair of representations, and so cannot distinguish between them. This is a common situation in neuroimaging, where the number of voxels is often much greater than the number of available examples.
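To make the quantities in this section concrete, the following is a minimal NumPy sketch, not the implementation used in any of the cited works. It computes canonical correlations as the singular values of the product of orthonormal bases for the centered column spaces of X and Y. The function names, the rank tolerance `tol`, and the simulated data are all assumptions introduced for illustration. The second half illustrates the wide-data concern: when both networks have at least n units, two statistically unrelated random matrices still yield canonical correlations of one.

```python
import numpy as np


def centered_basis(A, tol=1e-10):
    """Orthonormal basis for the column space of column-centered A (n x p)."""
    A = A - A.mean(axis=0)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Keep only directions with non-negligible variance (handles rank deficiency).
    return U[:, s > tol * s.max()]


def canonical_correlations(X, Y):
    """Canonical correlations between X (n x p1) and Y (n x p2), largest first."""
    Qx, Qy = centered_basis(X), centered_basis(Y)
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two column spaces, i.e. the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


rng = np.random.default_rng(0)
n = 200

# Narrow case (p << n): two simulated networks driven by a shared
# 2-dimensional latent signal plus independent noise.
Z = rng.standard_normal((n, 2))
X = Z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((n, 5))
Y = Z @ rng.standard_normal((2, 8)) + 0.1 * rng.standard_normal((n, 8))
rho_shared = canonical_correlations(X, Y)  # one coefficient per unit of the narrower network

# Wide case (p >= n): statistically unrelated random matrices.
X_wide = rng.standard_normal((n, 300))
Y_wide = rng.standard_normal((n, 250))
rho_wide = canonical_correlations(X_wide, Y_wide)  # all coefficients are numerically 1
```

In the wide case both centered matrices span the same (n - 1)-dimensional subspace, so the correlations carry no information about whether the two networks are actually related.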

Procrustes analysis
Named for Procrustes, the ancient Greek innkeeper who stretched or cut off travellers' limbs so they would fit his bed, Procrustes analysis seeks to conform data sets through a series of rigid-body transformations (Schönemann, 1966). In the case where $p_1 = p_2 = p$, we can define an orthogonal rotation matrix $R_X \in \mathbb{R}^{p \times p}$ such that we can solve

$$
\min_{R_X} \left\| X R_X - Y \right\|_F^2 \quad \text{subject to} \quad R_X^\top R_X = I. \tag{3}
$$

Although this is only defined in the case of exactly two data sets, Procrustes analysis has been extended to a generalized framework (Gower, 1975) wherein two or more data sets of the same dimensionality can be compared by first aligning to a reference subject and then iterating on this alignment. It was this Generalized Procrustes Analysis which was introduced to the neuroscience literature as hyperalignment in Haxby et al. (2011).
Procrustes analysis has been used successfully for aligning both biological neural networks constructed from fMRI data (Haxby et al., 2011; Guntupalli et al., 2016) and artificial neural networks (Smith, Turban, Hamblin, & Hammerla, 2017). Two constraints emerge in applying Procrustes analysis to these data types, however. The first is that data sets must be of equivalent dimensionality. Thus, for example, convolutional neural network (CNN) hidden layers must have the same width to be aligned using Procrustes transformations. The second constraint is that each minimal unit (i.e., voxels in fMRI data or neurons in CNN hidden layers) is considered in the analysis, meaning that very large data sets often suffer from estimation problems. In particular, we need $n \geq p$ samples for the estimation to be well-posed; this is rarely the case in fMRI studies, where our sampled time points $n \ll p$. To date, investigators have circumvented this issue by performing functional alignment only in anatomically or functionally defined regions of interest.
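The two-data-set Procrustes problem has a well-known closed-form solution through the singular value decomposition, which can be sketched as follows. This is a minimal illustration under stated assumptions rather than the hyperalignment procedure of Haxby et al. (2011): the function name, the simulated matrices, and the noise level are hypothetical, and the generalized (multi-subject) iteration is not shown.

```python
import numpy as np


def orthogonal_procrustes(X, Y):
    """Orthogonal R minimizing ||X R - Y||_F for X and Y of shape (n, p)."""
    # The minimizer is R = U V^T, where U S V^T is the SVD of X^T Y.
    # Note that R may include a reflection as well as a rotation.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


rng = np.random.default_rng(0)
n, p = 100, 10  # n >= p, so the estimation problem is well-posed

X = rng.standard_normal((n, p))
# Hypothetical ground truth: a random orthogonal matrix from a QR decomposition.
R_true, _ = np.linalg.qr(rng.standard_normal((p, p)))
Y = X @ R_true + 0.01 * rng.standard_normal((n, p))

R_hat = orthogonal_procrustes(X, Y)
error_before = np.linalg.norm(X - Y)          # misalignment before the transform
error_after = np.linalg.norm(X @ R_hat - Y)   # residual after alignment
```

Because `X.T @ Y` is a p x p matrix, both data sets must share the same number of units, which is the first constraint described above.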

Shared response modelling
A more recently proposed method is Shared Response Modelling (SRM; P.-H. Chen et al., 2015). The intuition is that rather than aligning networks individually, we now want to develop a common basis set or coordinate system into which we can project additional networks.
Thus for $m$ subjects, we want to learn an individual transformation basis $W_i \in \mathbb{R}^{p \times k}$ for each subject and a common or shared time series $S \in \mathbb{R}^{k \times n}$, where $k$ is an experimenter-selected parameter to control the dimensionality of the model. As before, $p$ is the number of units (e.g., neurons or voxels) in the network and $n$ is the number of samples (e.g., time points in an fMRI analysis). Because all subjects are considered simultaneously in learning the shared response, the data matrix $X$ now contains sub-matrices for each subject $i$ such that $X_i \in \mathbb{R}^{p \times n}$. Note that since all subjects are included in $X$, there is thus no longer a need for the $Y$ matrix. For subject $i$, then, we want to learn

$$
\min_{W_i,\, S} \left\| X_i - W_i S \right\|_F^2 \quad \text{subject to} \quad W_i^\top W_i = I_k. \tag{4}
$$

For a fixed $S$, this formulation resembles (3) but with the transformation matrix ($R_X$ in (3), $W_i$ in (4)) now applied to the second term rather than the first. These two formulations are in fact equivalent when the experimenter-selected dimensionality $k$ is equal to the number of minimal units (i.e., voxels or neurons) $p_1$. However, in the case where $k < p_1$, applying the transformation matrix directly to the subject data $X_i$ leads to an uninformative shared response $S$ (P.-H. Chen et al., 2015).
SRM has been successfully used in aligning both fMRI data (J. Chen et al., 2017) and deep neural networks (Lu et al., 2018). Like CCA, SRM has the advantage that the layer width or number of voxels considered does not need to be equivalent across networks. As with SVCCA, a CCA variant developed by Raghu et al. (2017), selecting the hyperparameter $k$ also gives researchers an understanding of how many directions meaningfully contribute to the alignment. Nonetheless, hyperparameter selection requires cross-validation to assess its impact on the learned shared response, potentially requiring more data than are available in standard analyses.
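One way to fit a model of this form is to alternate between the per-subject bases and the shared response, as in the sketch below. This is a simplified deterministic variant written for illustration and should not be read as the estimation procedure of P.-H. Chen et al. (2015); the function name `fit_srm`, the random initialization, the fixed iteration count, and the simulated data are all assumptions. With S fixed, each basis update is an orthogonal Procrustes problem; with the bases fixed, the least-squares shared response is the average of the projected subject data.

```python
import numpy as np


def fit_srm(X_list, k, n_iter=20, seed=0):
    """Alternating least-squares fit of X_i ~ W_i S with orthonormal W_i columns.

    X_list holds one (p_i x n) array per subject or network; widths p_i may differ.
    Returns the bases W_i, the shared response S (k x n), and the loss per iteration.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal initialization of each basis (an arbitrary choice).
    W_list = [np.linalg.qr(rng.standard_normal((X.shape[0], k)))[0] for X in X_list]
    S = np.mean([W.T @ X for W, X in zip(W_list, X_list)], axis=0)
    history = []
    for _ in range(n_iter):
        # Basis update: with S fixed, W_i = U V^T from the thin SVD of X_i S^T.
        for i, X in enumerate(X_list):
            U, _, Vt = np.linalg.svd(X @ S.T, full_matrices=False)
            W_list[i] = U @ Vt
        # Shared-response update: average of each subject's projected data.
        S = np.mean([W.T @ X for W, X in zip(W_list, X_list)], axis=0)
        loss = sum(np.linalg.norm(X - W @ S) ** 2 for W, X in zip(W_list, X_list))
        history.append(loss)
    return W_list, S, history


rng = np.random.default_rng(0)
n, k_true = 150, 3
S_true = rng.standard_normal((k_true, n))  # simulated shared time series

X_list = []
for p in (30, 40, 50):  # three simulated networks of different widths
    W_true, _ = np.linalg.qr(rng.standard_normal((p, k_true)))
    X_list.append(W_true @ S_true + 0.1 * rng.standard_normal((p, n)))

W_list, S, history = fit_srm(X_list, k=3)
```

Re-running the fit over a range of `k` values on held-out samples is one possible way to approach the cross-validation step mentioned above.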

Current recommendations
Although each of the considered methods has been used in functionally aligning both artificial and biological neural networks, CCA shows higher popularity in deep learning than in neuroscience, while Procrustes analysis (under the name hyperalignment) and SRM are more consistently used in neuroscience research. Although this disparity is due in part to non-overlapping terminology between the two fields, there are also field-specific constraints which in part guide these decisions. For example, one of the advantages of CCA is that data set sizes do not need to match exactly. This is more likely to appeal to deep learning researchers as it enables comparison of layers with different widths. Neuroscience researchers, however, are more likely to work with regions-of-interest from functional or anatomical parcellations that are standardized to the same number of voxels. We hope, however, that this brief review will introduce researchers to the range of functional alignment methods available, enabling them to use those methods that best match their data set and research question. To this end, we have summarized some of the key features for each method in Table 1. Although these follow our understanding of functional alignment methods as they exist today, there are still several open questions which we draw attention to here.

Open questions and discussion
In considering current methods for functional alignment, at least two immediate questions arise. The first is what kind of similarity we should be assessing and what are the transformations to which these scores should be invariant; for example, whether we should allow for isotropic scaling of representations during alignment as in CCA, and therefore how to choose a similarity measure for a given use case. The second question is how to interpret calculated similarity. Deriving a "similarity score" could be useful for diagnosing network architecture and performance or for comparing experimental conditions; however, its interpretation after hyperparameter optimization is unclear. We review each of these questions in turn.

What kind of similarity metric should we use?
The question of what kind of similarity we should be examining is a fundamental one, with connections to many other mathematical fields such as clustering (Estivill-Castro, 2002). In their recent work, Kornblith et al. (2019) argue that similarity should not be invariant to invertible linear transformation. Besides the practical problem of data set size outlined above, choosing similarity metrics that are invariant to invertible linear transformation implies that the scale of activation space is irrelevant. That is, representations that are only similar on small eigenvalues should have the same similarity index as representations that are only similar on large eigenvalues. The success of deep learning methods such as style transfer suggests that these distances are meaningful, however (Dumoulin, Shlens, & Kudlur, 2016). Neuroscience has only begun to quantify the dimensionality supporting similar representations (Ahlheim & Love, 2018), but we argue that a similar case is likely to hold for this field as well.
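The practical meaning of this invariance can be illustrated with a small simulation. The sketch below, which uses hypothetical function names and simulated data, rescales the axes of one representation with an invertible but non-orthogonal matrix. Canonical correlations are unchanged by the rescaling, whereas the residual of an orthogonal Procrustes fit, which is invariant only to orthogonal transformations, changes.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations for full-column-rank X and Y with n > p."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)


def procrustes_residual(X, Y):
    """Minimum over orthogonal R of ||X R - Y||_F^2 for centered X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # For orthogonal R, ||X R - Y||^2 = ||X||^2 + ||Y||^2 - 2 tr(R^T X^T Y),
    # and the trace is maximized by the sum of singular values of X^T Y.
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2 - 2.0 * s.sum()


rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.standard_normal((n, p))
Y = X + 0.5 * rng.standard_normal((n, p))

# An invertible, non-orthogonal transform that rescales each axis of X.
M = np.diag([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])
X_rescaled = X @ M

cca_original = canonical_correlations(X, Y)
cca_rescaled = canonical_correlations(X_rescaled, Y)   # same as cca_original
dist_original = procrustes_residual(X, Y)
dist_rescaled = procrustes_residual(X_rescaled, Y)     # differs from dist_original
```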

Should we define or improve similarity?
Hyperparameter selection in SRM or regularized CCA (Bilenko & Gallant, 2016) significantly improves our ability to transfer functional representations between networks. Unfortunately, it also obscures the definition of similarity. For example, many deep learning researchers use functional alignment in order to gain insight into the development of functional representations across training. In this case, a summary statistic of similarity can be meaningfully used to learn how different training regimes such as freeze training impact learned representations. If similarity is not only calculated between two networks, however, but also optimized as in SRM, then the interpretation of such a metric and its use across data sets is unclear.
Although a future alignment method may develop which preserves interpretability while maximizing similarity, we argue that such an extension is unlikely. Instead, we suggest that researchers carefully consider what they hope to learn from functionally aligning their networks and choose a method which best meets their research goals, with a clear understanding of each method's specificities and differences.

Conclusions
Functional alignment methods are being actively developed in both cognitive neuroscience and deep learning, though to date these research programs have been pursued largely in parallel. We argue, however, that there is substantial overlap and opportunities for collaboration in exploring the alignment of biological and artificial neural networks. Indeed, future investigations directly aligning these two kinds of networks seem close at hand. We hope that developing a common language for and implementations of these methods will inspire scientists to bridge this gap.