Learning robust visual representations using data augmentation invariance

Deep convolutional neural networks trained for image object categorization have shown remarkable similarities with representations found across the primate ventral visual stream. Yet, artificial and biological networks still exhibit important differences. Here we investigate one such property: increasing invariance to identity-preserving image transformations found along the ventral stream. Despite theoretical evidence that invariance should emerge naturally from the optimization process, we present empirical evidence that the activations of convolutional neural networks trained for object categorization are not robust to identity-preserving image transformations commonly used in data augmentation. As a solution, we propose data augmentation invariance, an unsupervised learning objective which improves the robustness of the learned representations by promoting the similarity between the activations of augmented image samples. Our results show that this approach is a simple, yet effective and efficient (10 % increase in training time) way of increasing the invariance of the models while obtaining similar categorization performance.


Introduction
Deep artificial neural networks (DNNs) have borrowed much inspiration from neuroscience and are, at the same time, the current best model class for predicting neural responses across the visual system in the brain (Kietzmann et al., 2017; Kubilius et al., 2018). Yet, despite consensus about the benefits of a closer integration of deep learning and neuroscience (Bengio et al., 2015; Marblestone et al., 2016), important differences remain.
Here, we investigate a representational property that is well established in the neuroscience literature on the primate visual system: the increasing robustness of neural responses to identity-preserving image transformations. While early areas of the ventral stream are strongly affected by variation in, e.g., object size, position or illumination, later levels of processing are increasingly robust to such changes (Isik et al., 2013). The cascaded achievement of invariance to such identity-preserving transformations has been proposed as a key mechanism for obtaining robust object recognition (DiCarlo & Cox, 2007; Tacchetti et al., 2018).
Learning such invariant representations has been a desired objective since the early days of artificial neural networks (Simard et al., 1992). Accordingly, a myriad of techniques have been proposed to attempt to achieve tolerance to different types of transformations (see Cohen & Welling (2016) for a review). Interestingly, recent theoretical work has shown that invariance to "nuisance factors" should naturally emerge from the optimization process (Achille & Soatto, 2018).
Nevertheless, DNNs are still not robust to identity-preserving transformations, including simple image translations (Zhang, 2019), or more elaborate adversarial attacks (Szegedy et al., 2013), in which small changes, imperceptible to the human brain, can alter the classification output of the network. In this regard, there is growing evidence that DNNs may exploit highly discriminative features that do not match human perception (Ilyas et al., 2019). Extending this line of research, we apply image perturbations drawn from the data augmentation framework (Hernández-García & König, 2018) to show that DNNs, despite being trained on augmented data, are not sufficiently robust to such transformations.
Inspired by the increasing invariance observed along the primate ventral visual stream, we subsequently propose a simple, yet effective and efficient mechanism to improve the robustness of the representations: we include an additional term in the objective function that encourages the similarity between augmented examples within each batch.

Methods
This section presents the procedure to empirically measure the invariance of the representations of a convolutional neural network and our proposal to improve the invariance.

Model, data and training parameters
As a test bed for our hypotheses and proposal we use the all convolutional network, All-CNN (Springenberg et al., 2014), a well-known architecture which achieves good performance in spite of being much shallower than other architectures, and is thus faster to train and more convenient for the analysis. It consists of 9 convolutional layers, with a total of 1.3 million parameters. Our model is identical to All-CNN-C in the original paper, except that we remove the explicit regularizers (weight decay and dropout), following the conclusions from Hernández-García & König (2018). We also keep the original training hyperparameters: 350 epochs, initial learning rate of 0.01 and batch size of 128.
We train on CIFAR-10 (Krizhevsky & Hinton, 2009), a widely benchmarked data set for object recognition, and apply heavier data augmentation than in the original paper. Specifically, we use the heavier training and evaluation scheme described by Hernández-García & König (2018), which includes random affine transformations and contrast and brightness adjustment.

Evaluation of invariance
To measure the invariance of the learned features under the influence of identity-preserving image transformations we compare the activations of a given image with the activations of a data augmented version of the same image.
Consider the activations of an input image $x$ at layer $l$ of a neural network, which can be described by a function $f^{(l)}(x) \in \mathbb{R}^{D^{(l)}}$. We can define the distance between the activations of two input images $x_i$ and $x_j$ by their mean squared difference:
$$d^{(l)}(x_i, x_j) = \frac{1}{D^{(l)}} \sum_{k=1}^{D^{(l)}} \left( f_k^{(l)}(x_i) - f_k^{(l)}(x_j) \right)^2 \qquad (1)$$
Following this, we compute the mean squared difference between every $f^{(l)}(x_i)$ and a random transformation of $x_i$, that is $d^{(l)}(x_i, G(x_i))$. In this case, we define $G(x)$ as the data augmentation scheme that can take any of the extreme values of each transformation in the heavier scheme, after halving the parameter ranges. This is to ensure the same level of augmentation in all comparisons, while preventing too extreme transformations.
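The mean squared difference between two sets of activations described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name activation_distance is hypothetical, the activations are flattened to a vector, and the random arrays merely stand in for the feature maps of an image and of its augmented version.

```python
import numpy as np


def activation_distance(f_i, f_j):
    """Mean squared difference between two activation arrays of equal size."""
    # Flatten the feature maps so that every unit of the layer counts once.
    f_i = np.asarray(f_i, dtype=np.float64).ravel()
    f_j = np.asarray(f_j, dtype=np.float64).ravel()
    if f_i.shape != f_j.shape:
        raise ValueError("activations must have the same number of elements")
    return float(np.mean((f_i - f_j) ** 2))


# Toy stand-ins for f(x) and f(G(x)); shapes and noise level are illustrative only.
rng = np.random.default_rng(0)
f_x = rng.normal(size=(8, 4, 4))
f_gx = f_x + 0.1 * rng.normal(size=f_x.shape)
d_aug = activation_distance(f_x, f_gx)
```

Dividing by the number of units keeps the distances of layers with different dimensionality on a comparable scale.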
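The assessment of the similarity between the activations of an image $x_i$ and of its augmented version $G(x_i)$ is normalised by the similarity with the other, different images, reminiscent of an image identification problem. We define the invariance score $\sigma_i^{(l)}$ of the transformation $G(x_i)$ at layer $l$ of a model, with respect to a data set of size $N$, as follows:
$$\sigma_i^{(l)} = 1 - \frac{d^{(l)}(x_i, G(x_i))}{\frac{1}{N} \sum_{j=1}^{N} d^{(l)}(x_i, x_j)} \qquad (2)$$
The invariance $\sigma_i^{(l)}$ takes its maximum value of 1 when the activations of $x_i$ and $G(x_i)$ are identical, and the value 0 when their distance equals the average distance between $x_i$ and the images in the data set.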
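A normalised score of this kind can be sketched as follows. The sketch assumes that the score takes the form of one minus the distance to the augmented image divided by the mean distance to the N images of the data set; the function names, the (N, D) layout of flattened activations and the toy data are all hypothetical.

```python
import numpy as np


def pairwise_msd(acts):
    """Mean squared difference between every pair of rows of acts, shape (N, N)."""
    sq_norms = np.sum(acts ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (acts @ acts.T)
    # Clip tiny negative values caused by floating point rounding.
    return np.maximum(sq_dist, 0.0) / acts.shape[1]


def invariance_scores(acts, acts_aug):
    """Assumed score: 1 - d(x_i, G(x_i)) / mean_j d(x_i, x_j), one value per image."""
    acts = np.asarray(acts, dtype=np.float64)
    acts_aug = np.asarray(acts_aug, dtype=np.float64)
    d_aug = np.mean((acts - acts_aug) ** 2, axis=1)   # distance to the augmented image
    d_mean = pairwise_msd(acts).mean(axis=1)          # mean distance to the data set
    return 1.0 - d_aug / d_mean


# Toy activations for N = 20 images with D = 16 units; values are illustrative only.
rng = np.random.default_rng(0)
acts = rng.normal(size=(20, 16))
noise = rng.normal(size=acts.shape)
scores_small = invariance_scores(acts, acts + 0.01 * noise)
scores_large = invariance_scores(acts, acts + 1.0 * noise)
```

Under this assumed form, a weaker perturbation of the activations yields a score closer to 1, and a perturbation as large as the typical distance between different images yields a score near 0.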

Data augmentation invariance
Most CNNs trained for object categorization are optimized through mini-batch stochastic gradient descent (SGD), that is, the weights are updated iteratively by computing the loss on a batch $B$ of examples, instead of the whole data set at once. The models are typically trained for a number of epochs, $E$, where an epoch is a whole pass through the entire training data set of size $N$. That is, the weights are updated $K = N/|B|$ times each epoch. Data augmentation introduces variability into the process by performing a different, stochastic transformation of the data every time an example is fed into the network. However, with standard data augmentation, the model has no information about the identity of the images, that is, that different augmented examples, seen at different epochs, separated by $N/|B|$ iterations on average, correspond to the same seed data point. We believe this information may be valuable and useful to learn better representations in a self-supervised manner. For example, the high temporal correlation of the stimuli that reach the visual cortex may play a crucial role in the creation of robust connections (Wyss et al., 2006).
In order to make use of this information in an unsupervised way, we propose to perform data augmentation within the batches by constructing the batches to include $M$ transformations of each example (see Hoffer et al. (2019) for a similar idea). Additionally, we propose to modify the loss function to include an additional term that accounts for the invariance of the feature maps across multiple image samples. Considering the difference between the activations at layer $l$ of two images, $d^{(l)}(x_i, x_j)$, defined in Equation 1, we define the data augmentation invariance loss at layer $l$ for a given batch $B$ as follows:
$$\mathcal{L}_{inv}^{(l)} = \frac{\frac{M}{|B|} \sum_{k} \frac{1}{|S_k|^2} \sum_{x_i, x_j \in S_k} d^{(l)}(x_i, x_j)}{\frac{1}{|B|^2} \sum_{x_i, x_j \in B} d^{(l)}(x_i, x_j)} \qquad (3)$$
where $S_k$ is the set of samples in the batch $B$ that are augmented versions of the same seed sample $x_k$, and $|B|/M$ is the number of seed samples in the batch. This loss term intuitively represents the average difference of the activations between the sample pairs that correspond to the same source image, relative to the average difference of all pairs. A convenient property of this definition is that $\mathcal{L}_{inv}$ does not depend on the batch size nor the number of in-batch augmentations $M = |S_k|$. Furthermore, it can be efficiently implemented using matrix operations.
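The ratio described above, the average within-seed distance relative to the average distance over all pairs in the batch, can be sketched with matrix operations. This is a hedged illustration rather than the authors' implementation: the function name invariance_loss, the integer array seed_ids marking which seed image each sample comes from, and the toy batch are assumptions, and the within-seed term is averaged over seed images.

```python
import numpy as np


def invariance_loss(acts, seed_ids):
    """Mean within-seed distance divided by the mean distance over all pairs."""
    acts = np.asarray(acts, dtype=np.float64)
    acts = acts.reshape(len(acts), -1)               # (|B|, D) flattened activations
    seed_ids = np.asarray(seed_ids).ravel()

    # All pairwise mean squared differences in one matrix product.
    sq_norms = np.sum(acts ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (acts @ acts.T)
    dist = np.maximum(dist, 0.0) / acts.shape[1]
    np.fill_diagonal(dist, 0.0)                      # a sample is at distance 0 from itself

    # Weight 1 / |S_k|^2 for pairs from the same seed image, 0 otherwise.
    _, inverse, counts = np.unique(seed_ids, return_inverse=True, return_counts=True)
    sizes = counts[inverse]
    same_seed = seed_ids[:, None] == seed_ids[None, :]
    weights = same_seed / (sizes[:, None] * sizes[None, :])

    within = np.sum(weights * dist) / len(counts)    # averaged over seed images
    overall = dist.mean()                            # average over all |B|^2 pairs
    return float(within / overall)


# Toy batch: 4 seed images with M = 3 augmented versions each (illustrative values).
rng = np.random.default_rng(0)
seeds = rng.normal(size=(4, 16))
batch_ids = np.repeat(np.arange(4), 3)
batch_acts = np.repeat(seeds, 3, axis=0) + 0.1 * rng.normal(size=(12, 16))
loss = invariance_loss(batch_acts, batch_ids)
```

Because numerator and denominator are both averages of the same distance, the value is unaffected by rescaling the activations, and it equals 1 when all samples are treated as one group.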
Since we want to achieve image invariance at $L$ layers of the network and jointly train for object recognition, we define the total loss as follows:
$$\mathcal{L} = (1 - \alpha)\, \mathcal{L}_{obj} + \sum_{l=1}^{L} \alpha^{(l)} \mathcal{L}_{inv}^{(l)} \qquad (4)$$
where $\sum_{l=1}^{L} \alpha^{(l)} = \alpha$ and $\mathcal{L}_{obj}$ is the loss associated with the object recognition objective, typically the cross-entropy between the object labels and the output of a softmax layer. All the results we report in this paper have been obtained by setting $\alpha = 0.1$ and distributing the coefficients across the layers according to an exponential law, such that $\alpha^{(L)} = 10\,\alpha^{(1)}$. This aims at simulating a probable response along the ventral visual stream, where higher regions are more invariant than the early visual cortex.
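One way to read the exponential distribution of the coefficients is as a geometric progression over layers, rescaled so that the coefficients sum to α. The sketch below follows that reading and is not taken from the paper: the function names, the choice of 9 layers in the example, and the weighting of the categorization loss by 1 − α are assumptions.

```python
import numpy as np


def layer_coefficients(n_layers, alpha=0.1, ratio=10.0):
    """Geometric coefficients that sum to alpha, with last / first equal to ratio."""
    exponents = np.linspace(0.0, 1.0, n_layers)
    raw = ratio ** exponents                  # grows from 1 to ratio across layers
    return alpha * raw / raw.sum()


def total_loss(loss_obj, inv_losses, alpha=0.1, ratio=10.0):
    """Assumed combination: (1 - alpha) * L_obj + sum_l alpha_l * L_inv_l."""
    coeffs = layer_coefficients(len(inv_losses), alpha, ratio)
    return (1.0 - alpha) * loss_obj + float(np.dot(coeffs, inv_losses))


# Example with 9 layers (a hypothetical choice) and made-up loss values.
coeffs = layer_coefficients(9)
example = total_loss(1.0, [1.0] * 9)
```

With these settings the deepest layer receives ten times the weight of the first one, while the invariance terms together account for a fraction α of the objective.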

Results
One of the contributions of this paper is to empirically test to what extent convolutional neural networks produce invariant representations under the influence of identity-preserving transformations of the input images. Figure 1 shows the invariance scores, as defined in Equation 2, across network layers.
Despite the presence of data augmentation during training, which gives the network the opportunity to learn augmentation-invariant representations, the representations of the baseline model (red boxes) do not increase in invariance beyond the pixel space. As a solution, we have proposed a simple, unsupervised modification of the loss function to encourage the learning of data augmentation-invariant features. As can be seen in Figure 1 (blue boxes), our data augmentation invariance mechanism pushes network representations to become increasingly robust with network depth. One exception is the top, 'readout' layer, likely because its features are dominated by the categorization objective.
In order to better understand the effect of the data augmentation invariance, we plotted the learning dynamics of the invariance loss at each layer. In Figure 2, we can see that in the baseline model, the invariance loss keeps increasing over the course of training. In contrast, when the loss is added to the optimization objective, the loss drops for all but the last layer. Unexpectedly, the invariance loss increased during the first epochs and only then started to decrease. While further investigations are required, these two phases may correspond to the fitting (drift) and compression (diffusion) phases proposed by Shwartz-Ziv & Tishby (2017).
In terms of efficiency, adding terms to the objective function adds computational overhead. However, since the pairwise distances can be efficiently computed at each batch through matrix operations, the training time is only increased by about 10 %. Finally, the improved invariance comes at no cost in the categorization performance, as the network trained with data augmentation invariance achieves similar classification performance to the baseline model (test accuracy baseline: 91.5 %; test accuracy data augmentation invariance: 92.2 %).
Figure 2: Invariance loss $\mathcal{L}_{inv}^{(l)}$ during training. The axis of abscissas (epochs) is scaled quadratically to better appreciate the dynamics at the first epochs. The same random initialization was used for both models.

Conclusions
In this work we have empirically shown that the features learned by a prototypical convolutional neural network are not invariant to identity-preserving image transformations, despite these transformations being part of the training procedure. This property is fundamentally different from the primate ventral visual stream, where neural populations have been found to be increasingly robust to changes in view or lighting conditions of the same object (DiCarlo & Cox, 2007).
Taking inspiration from this property of the visual cortex, we have proposed an unsupervised objective to encourage learning more robust features, using data augmentation as the framework to perform identity-preserving transformations on the input data. We created mini-batches with $M$ augmented versions of each image and modified the loss function to maximize the similarity between the activations of augmented versions of the same seed image.
Data augmentation invariance effectively produces more robust representations, unlike standard models optimized only for object categorization, at no cost in classification performance. Future work will investigate whether this increased robustness also allows for better modelling of neural data.