Learning rotation equivalent scene representation from instance-level semantics: A novel top-down perspective
(UvA-DARE, Digital Academic Repository)

This paper focuses on rotation variant scene recognition. Different from existing rotation invariant recognition approaches, which learn from either rotated images or rotated convolutional filters in a bottom-up manner, a new top-down perspective is explored that learns from instance-level semantic representations. The goal is to eliminate the convolutional feature differences in bottom-up feature propagation caused by the rotation sensitive nature of the convolution operation. Our rotation equivalent convolutional neural network (RE-CNN) scheme consists of three components. First, a key instance selection module highlights the instances strongly related to the scene scheme regardless of their orientation. Second, a key instance aggregation module builds a scene representation invariant to the position changes of instances caused by rotation. Finally, a semantic fusion module organizes the framework as a whole and implements rotation regularization. Notably, our RE-CNN scheme can be adapted to existing CNNs in a plug-and-play manner. Extensive experiments on rotation variant scene recognition benchmarks from four domains demonstrate the state-of-the-art performance and generalization capability of the proposed RE-CNN.


Problem statement
Due to changes in imaging conditions, the appearance and shape of an object in a scene may vary drastically in terms of orientation; such scenes are usually termed rotation variant scenes. Typical examples are aerial images (Ding et al., 2019; Zheng et al., 2020; Bi et al., 2020b, 2021a; Xia et al., 2018; Cheng et al., 2018) and industrial scenes (Fernandes and Cardoso, 2017; Zhang et al., 2020; Iacovacci and Lacasa, 2020), in which the texture may have a different orientation (Kylberg, 2011; Li et al., 2015). Also, the pathological regions in medical scenes can appear in a variety of orientations (Li et al., 2019; Ilse et al., 2018; Wu et al., 2020). The orientation information in such rotation variant scenarios is far more abundant than in natural scenes (Quattoni and Torralba, 2009; Almakady et al., 2020; Zhang et al., 2013; Hanbay et al., 2015) (see Fig. 1 for an intuitive example). It can confuse computer vision algorithms that try to understand such scenes (Worrall et al., 2017; Cohen and Welling, 2016; Xia et al., 2017; Li et al., 2019; Zhang et al., 2020).
One may argue that the long-existing challenge of rotation variant scene recognition becomes trivial in the deep learning era, given rotation based data augmentation (Simonyan and Zisserman, 2015; He et al., 2016; Szegedy et al., 2015; Ding et al., 2019; Xia et al., 2018) and rotated convolution filters (Cohen and Welling, 2016; Worrall et al., 2017; Zhou et al., 2017).

Fig. 1. Different orientation information distribution between natural image scenes and rotation variant scenes, reported in sample proportion. The main direction angle has a range of [0, π). The statistics from generic image scenes are from the MIT dataset (Quattoni and Torralba, 2009) (blue). The statistics from rotation variant image scenes are from the AID (Xia et al., 2017) (green), LAG (Li et al., 2019) (orange) and KTD (Kylberg, 2011) (pink) datasets respectively. It can be clearly seen that the orientation information from rotation variant scenes is more abundant and more randomly distributed, while for natural scenes the orientation information is gathered more horizontally or vertically. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
As shown in Fig. 2, since these approaches work in a bottom-up manner, i.e., extracting rotation information from shallow to deep layers, the rotation sensitive nature of the convolution poses a bottleneck to learning a discriminative representation across different rotation angles. This hinders the understanding of rotation variant scenes and limits the generalization capability (Wu et al., 2020; Xia et al., 2017; Iacovacci and Lacasa, 2020).
To this end, in contrast to existing methods, we present a novel rotation equivalent scene representation learning scheme from a top-down perspective. In this scheme, it is not required to extract convolutional features from multiple rotated samples or from rotated convolution filters, which eliminates the drawback of existing bottom-up pipelines. Notably, no modification is required in the convolutional feature extraction process. We only start to build a rotation equivalent representation at the high level, which backward guides the learning process of the entire framework.
Specifically, our RE-CNN scheme introduces the classic multiple instance learning (MIL) formulation (Maron and Ratan, 1998). By describing each scene as a bag and each image patch in the scene as an instance, the relation between high-level feature maps and the scene scheme is built. The key instances determining the scene scheme are highlighted regardless of their orientation. Also, the permutation-invariant nature of the MIL aggregation function (Zaheer et al., 2017) allows the scene scheme prediction to be invariant to the position changes of image patches caused by rotation.

Contribution
Our contributions can be summarized as follows:
• We propose a rotation equivalent CNN (RE-CNN) scheme. To the best of our knowledge, it is the first work to learn rotation invariant deep features from a top-down perspective, reducing the negative influence of rotation-sensitive convolution operations in existing bottom-up pipelines. More importantly, it can be easily adapted to existing CNN backbones in a plug-and-play manner.
• We propose a rotation equivalent scene scheme learning strategy by adapting the classic MIL formulation. It allows the scene scheme to be invariant to changes in instance positions. It is realized by our key instance selection (KIS), key instance aggregation (KIA) and semantic fusion (SF) modules.
• Our proposed RE-CNN substantively improves the recognition performance on rotation variant scenes, i.e., up to 8.95% with only a 0.47% increase in the number of parameters and a 1.78% increase in prediction time. Extensive experiments demonstrate that our approach outperforms 24 state-of-the-art approaches in four recognition domains.
The remainder of this paper is organized as follows. Section 2 provides a detailed summary of the related work. Section 3 offers more background on multiple instance learning for a better understanding of our technical insight. Then, in Section 4, our proposed RE-CNN is introduced in detail. Section 5 reports and discusses the extensive experiments and ablation studies. Finally, the conclusion is drawn in Section 6.

Rotation variant scene recognition
Rotation variant scenes are common due to either the restriction of the view point (e.g., aerial imaging, arbitrarily-oriented handwritten digit recognition, etc.) or a unique orientation distribution (e.g., medical imaging, texture recognition, etc.). Among these tasks, arbitrarily-oriented handwritten digit recognition has been investigated for a relatively long time (Dmitry et al., 2016; Zhang et al., 2017; Worrall et al., 2017; Cohen and Welling, 2016; Zhou et al., 2017). Unfortunately, for other large-scale or real-world scenarios such as aerial, industrial and medical imaging, this challenge has not been well tackled and the recognition capability remains to be boosted.
To be specific, in the aerial scenario, as the imaging sensor carried by an airplane or satellite has a bird's-eye view, the objects in a scene are posed in arbitrary orientations. Recent works on aerial image understanding tend to highlight key local regions regardless of their orientation (Xia et al., 2017, 2018; Bi et al., 2020b, 2021a; Cheng et al., 2018; Bi et al., 2020a,c; Wang et al., 2021). Although such solutions usually lead to an obvious performance gain compared with the baselines and former works, rotation invariant scene representation has not been widely discussed in the aerial imaging community (Han et al., 2021).
Different from traditional medical imaging dealing with X-ray and ultrasound data, where the body and organs are presented in a fixed order, in fundus images the eyeball is circle-shaped, and the pathological regions can be posed in an arbitrary orientation (Ilse et al., 2018; Li et al., 2019; Ghamdi et al., 2019; Diaz-Pinto et al., 2019; Wu et al., 2020). However, as fundus disease recognition has only drawn attention in the past few years, the varied orientation of these fundus pathological regions has not been considered so far in the medical imaging community.
In industrial imaging, texture recognition is a typical task that demands rotation invariant scene representation, as the texture can be posed in arbitrary orientations. Before the deep learning era, texture recognition with rotation-invariant hand-crafted features was thoroughly investigated (Hanbay et al., 2016; Zhao et al., 2012; Sifre and Mallat, 2013; Schmidt and Roth, 2012; Takacs et al., 2010). However, the generalization capability of these hand-crafted features is still significantly inferior to features learnt by deep learning models (Zhang et al., 2020; Iacovacci and Lacasa, 2020).
To summarize, although the challenge of rotation variance has existed and been investigated for a long time, till now few works in the computer vision community have attempted to tackle the rotation variant challenge in such more complicated real-world large-scale scenarios.

Fig. 2 (caption fragment). … learning from rotated samples (Zhang et al., 2017; Cheng et al., 2019; Dmitry et al., 2016) or from rotated convolution filters (Cohen and Welling, 2016; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017); (c) our RE-CNN in a novel top-down pipeline. KIS: key instance selection module; KIA: key instance aggregation module; SF: semantic fusion module; A and B denote the correct and wrong category for intuitive illustration.

Rotation invariant features & Down-stream tasks
Existing CNN based methods have exploited rotation invariant feature representations for a relatively long time. Generally speaking, these approaches can be divided into two categories, that is, selecting representative feature responses from rotated samples (Dmitry et al., 2016; Zhang et al., 2017) and using rotated convolutions for feature extraction (Zhou et al., 2017; Worrall et al., 2017; Cohen and Welling, 2016; Marcos et al., 2017).
To be specific, Dmitry et al. extracted convolutional features from eight different rotation angles and selected the max point-wise response from these eight representations as the rotation invariant representation (Dmitry et al., 2016). Zhang et al. designed binary filters to generate the convolutional features from different angles, and then conducted a linear combination to generate the final rotation invariant representation (Zhang et al., 2017). On the other hand, to design rotated filters, circular harmonics transformations (Worrall et al., 2017), group theory (Cohen and Welling, 2016), active rotating learning (Zhou et al., 2017) and rotation equivariant vector fields (Marcos et al., 2017) have been investigated.
In summary, both strategies work in a bottom-up manner. Their flaw lies in that the rotation sensitive nature of convolution negatively affects the entire feature extraction process. Thus, it is hard for such methods to activate the RoIs properly when they are posed in arbitrary orientations, especially in more complicated scenarios.
Therefore, attention has shifted towards down-stream applications such as multi-orientation object detection and segmentation. Instead of learning rotation equivalent representations, the performance of such down-stream detection and segmentation tasks mainly relies on orientation sensitive region proposal strategies (Ding et al., 2019; Liao et al., 2018; Xu et al., 2020; Jiang et al., 2017; Yang et al., 2021; Han et al., 2021) and rotation-angle-aware loss functions (Cheng et al., 2019; Mou et al., 2019; Yang and Yan, 2020; Qian et al., 2021; Zheng et al., 2020).

Multiple instance learning
Multiple instance learning was initially designed to deal with weakly-annotated data, as it formulates an object as a bag, and the bag consists of a set of instances which do not have specific labels (Maron and Ratan, 1998; Saad and Mubarak, 2010; Wang et al., 2015, 2013a). Each instance is only labeled as either positive or negative. These weak annotations are utilized to compute the final bag category.
Classic MIL describes the relation between a bag and its instances. It allows us to compute more robust representations and has been applied to visual tasks such as image classification (Tang et al., 2017b), object detection (Wang et al., 2012; Tang et al., 2017a), tracking (Babenko et al., 2009) and saliency detection (Zhang et al., 2016; Wang et al., 2013b).
In the past few years, deep multiple instance learning (deep MIL) has been drawing increasing attention. Wang et al. used mean and max pooling operations as the instance aggregation function (Wang et al., 2016). Then, gated attention based (Ilse et al., 2018) and channel-spatial attention based (Bi et al., 2020b) deep MIL were studied. The insight of attention based deep MIL lies in that the aggregation of the instance representations is weighted by the attention weights. In this way, the generated bag probability distribution becomes more robust (Ilse et al., 2018; Bi et al., 2020b; Yu et al., 2021) than with mean or max pooling based deep MIL (Wang et al., 2016). Recently, a multi-scale form of deep MIL was proposed (Zhou et al., 2021; Bi et al., 2022).
Using deep MIL for rotation invariant representation learning is not trivial or straightforward. The rotation sensitive nature of convolutions leads to feature variances caused by different rotation angles. Thus, how to learn a robust instance representation from these varied convolutional features is an important question to be addressed.

Classic MIL formulation
In classic multiple instance learning, an object is formulated as a bag consisting of a set of instances. Assume a bag has label $Y$, and each instance $x_i$ of the bag has a weakly-annotated label $y_i$ ($y_i = 0, 1$). The bag label $Y$ is given by
$$Y = \begin{cases} 0, & \text{iff } \sum_i y_i = 0, \\ 1, & \text{otherwise.} \end{cases} \quad (1)$$
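The labeling rule of Eq. (1) can be sketched in a few lines of NumPy (the function name is ours, for illustration):

```python
import numpy as np

def bag_label(instance_labels):
    # Classic MIL rule of Eq. (1): the bag is negative (0) iff every instance
    # label y_i is 0, and positive (1) otherwise.
    y = np.asarray(instance_labels)
    return 0 if y.sum() == 0 else 1
```

A bag with a single positive instance is thus labeled positive, while only an all-negative bag is labeled negative.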

Probability distribution assumption
In classic MIL, the bag probability distribution is binary, i.e., either 0 (false) or 1 (true). In contrast, in deep MIL, the bag probability distribution $p$ is assumed to be continuous in $[0, 1]$ to circumvent the gradient vanishing problem (Ilse et al., 2018).

Deep MIL for multi-class recognition
We assume that the bag label $Y$ belongs to the $l$th bag category if and only if the bag probability of the $l$th bag category $p_l$ is the maximum among $p_1, p_2, \ldots, p_l, \ldots, p_L$, where $L$ denotes the number of bag categories. This is defined as follows:
$$Y = \arg\max_{l \in \{1, \ldots, L\}} p_l. \quad (2)$$
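The multi-class rule of Eq. (2) amounts to an argmax over the bag probabilities, as this minimal sketch shows:

```python
import numpy as np

def bag_category(bag_probs):
    # Deep MIL multi-class rule of Eq. (2): the bag is assigned the category
    # l whose bag probability p_l is maximal among p_1, ..., p_L.
    return int(np.argmax(np.asarray(bag_probs)))

category = bag_category([0.1, 0.7, 0.2])  # second category (0-indexed: 1)
```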

MIL aggregation function
In MIL, an aggregation function is needed to bridge the gap between the instance representation and the bag representation. We adopt the instance space paradigm of MIL so that the instance representations can be directly aggregated into the bag probability distribution. The construction of the bag-level probability distribution $p$ is a two-step process with transformations $f$ and $g$ given by
$$p(X) = g\big(f(x_1), f(x_2), \ldots, f(x_n)\big), \quad (3)$$
where $f$ refers to the transformation to the instance representation, and $g$ denotes the MIL aggregation function which directly obtains the bag probability $p$.
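A minimal sketch of the instance space paradigm of Eq. (3), assuming a hypothetical per-instance softmax for $f$ and mean pooling for $g$ (the paper's actual transformations are learned):

```python
import numpy as np

def f(instances):
    # Hypothetical instance-level transformation f: raw per-instance class
    # scores -> per-instance probabilities (softmax over classes).
    z = np.asarray(instances, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def g(instance_probs):
    # MIL aggregator g: mean pooling over instances gives the bag distribution.
    return np.asarray(instance_probs).mean(axis=0)

bag = np.array([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])  # 3 instances, 2 classes
p = g(f(bag))  # bag-level probability distribution p(X) = g(f(x_1), ..., f(x_n))
```

Because each instance distribution sums to one and $g$ averages them, the bag-level output is again a valid probability distribution.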

T-equivalent transformation
For a group of transformations $\mathcal{T}$, a function $f$ is T-equivalent (Maron et al., 2020; Han et al., 2021) if
$$f(\sigma(x)) = \sigma(f(x)) \quad \text{for all } \sigma \in \mathcal{T}. \quad (4)$$

Rotation equivalent bag scheme prediction
For our task, the bag (scene) scheme needs to be invariant to changes caused by the rotation transformation $\sigma$. In the formulation in Eq. (4), $\sigma$ is a rotation operation with a certain rotation angle in the transformation set $\mathcal{T}$. Our objective is to design such a transformation $f$ to predict the scene scheme invariant to the rotation angle.

Objective
Eq. (2) shows that the MIL aggregation function needs to meet the aforementioned T-equivalent requirement for the rotation transformation $\sigma$. Hence, the function $g$ needs to be permutation-invariant (Worrall et al., 2017; Maron et al., 2020).

Permutation-invariant MIL aggregator
The MIL aggregation function itself tolerates possible order changes of the instances, so that the bag scheme remains unchanged. It has been shown that the MIL aggregation function is permutation-invariant (Wang et al., 2016; Ilse et al., 2018; Zaheer et al., 2017), which is beneficial for generating a rotation equivalent bag scheme prediction.
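The permutation invariance of common MIL aggregators is easy to verify numerically; this sketch checks mean and max pooling on random instance representations:

```python
import numpy as np

rng = np.random.default_rng(0)
instances = rng.random((6, 4))   # 6 instance representations, 4-dim each
perm = rng.permutation(6)        # a position change, e.g. caused by rotation

# Mean- and max-pooling MIL aggregators give the same bag representation
# regardless of the instance order:
mean_invariant = np.allclose(instances.mean(axis=0), instances[perm].mean(axis=0))
max_invariant = np.allclose(instances.max(axis=0), instances[perm].max(axis=0))
```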

Framework overview
Fig. 3 demonstrates the framework of our proposed rotation equivalent convolutional neural network (RE-CNN). First, the convolutional feature maps from the backbone are flattened by a transitional layer and the multi-angle class confidence maps (MACCMs) are computed (Section 4.2). Then, for each CCM from a rotation angle, the key local regions relevant to the scene scheme are selected by our key instance selection (KIS) module (Section 4.3). Later on, the key instance aggregation (KIA) module fuses these instance representations in a rotation insensitive manner. This ensures that the scene scheme is invariant to changes in instance positions (Section 4.4). Lastly, our semantic fusion module (Section 4.5) and the corresponding loss function (Section 4.6) minimize the semantic variance from different rotation angles and allow the entire framework to be optimized as a whole.

Multi-angle class confidence representation
CCMs from multiple rotation angles contain abundant orientation information. A CCM rotated by a certain angle corresponds to the sample rotated by the same angle, due to the same receptive field of a CNN. Learning from rotated CCMs eliminates the weakness of existing bottom-up rotation invariant scene recognition pipelines, which are negatively influenced by the rotation sensitive nature of the convolution operation.
As shown in Fig. 3, a 1 × 1 convolutional layer with weight matrix $W_1$ and bias matrix $b_1$, also termed the transitional layer in our framework, is used to generate the CCM $M_1$ from the extracted convolutional feature $F$. Assume there are $L$ scene categories and $\otimes$ denotes the convolution operator; then $M_1$ also has $L$ channels, each corresponding to the feature response of a category, and is given by
$$M_1 = W_1 \otimes F + b_1. \quad (5)$$
Then, the CCM $M_1$ is rotated by multiple rotation angles $\theta_k$ with an interval of $\pi/4$. The set of multi-angle class confidence maps (MACCMs) $\{M_1^{\theta_k}\}$ is obtained by rotating $M_1$ by each $\theta_k = k\pi/4$, where $k = 0, 1, 2, \ldots, 8$.
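The transitional layer and the multi-angle rotation can be sketched as follows. Note two assumptions: the 1 × 1 convolution is written as a channel-mixing tensordot, and since `np.rot90` only supports right angles, the sketch uses 90-degree multiples rather than the paper's $\pi/4$ interval:

```python
import numpy as np

def transitional_layer(F, W1, b1):
    # 1x1 convolution of Eq. (5): maps a (C, H, W) feature F to an
    # (L, H, W) class confidence map M1, one channel per scene category.
    return np.tensordot(W1, F, axes=([1], [0])) + b1[:, None, None]

rng = np.random.default_rng(0)
F = rng.random((8, 7, 7))    # C = 8 feature channels on a 7x7 grid
W1 = rng.random((3, 8))      # L = 3 scene categories
b1 = rng.random(3)
M1 = transitional_layer(F, W1, b1)

# The paper rotates M1 at pi/4 intervals; np.rot90 only supports right
# angles, so this sketch builds the multi-angle set from 90-degree multiples.
maccms = [np.rot90(M1, k=k, axes=(1, 2)) for k in range(4)]
```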

Key instance selection module
To properly activate the key regions regardless of their orientation, as shown in Fig. 3, the key instance selection (KIS) module consists of two branches, i.e., one learns the key instance representation insensitive to rotation, and the other learns representation sensitive to rotation.
The rotation insensitive representation is expected to be robust to a shift of orientation, following (Maron et al., 2020; Dmitry et al., 2016). Overall, this branch helps to fully exploit the rotation insensitive representation for the scene scheme. The instance weight distribution $\{a^{\theta_k}_{w,h}\}$ provides a description of how each instance contributes to the scene scheme. Higher weights are assigned to instances which are relevant to the scene scheme and vice versa, given by
$$a^{\theta_k}_{w,h} = \mathrm{softmax}\big(W_2 \otimes M_1^{\theta_k} + b_2\big)_{(w,h)},$$
where $W_2$ and $b_2$ denote the weight and bias matrix of the 1 × 1 convolutional layer in this weight-sharing deep MIL module, softmax denotes the softmax function, and $(w, h)$ marks the position of a certain instance in the $L$-channel $W \times H$-sized instance representation.
The aim of the second branch is to learn the instance representation sensitive to rotation. Thus, the rotation features from all orientations need to be included. This is obtained by the sum of the instance representations from each rotation. Specifically, this objective is realized by a single spatial attention based deep MIL module, which extracts another instance spatial weight matrix $\{b_{w,h}\}$. The input of this branch $\tilde{M}_1$ is the sum of the $M_1^{\theta_k}$, calculated as
$$\tilde{M}_1 = \sum_{k} M_1^{\theta_k},$$
where $k = 0, 1, \ldots, 8$. The distribution of key regions in $\tilde{M}_1$ is more scattered, as the positions of many key regions change due to rotation. Thus, this branch is capable of perceiving the rotation sensitive representation while maintaining the scene scheme.
Then, the instance spatial weight matrix $\{b_{w,h}\}$, derived from the deep MIL module, is computed by
$$b_{w,h} = \mathrm{softmax}\big(W_3 \otimes \tilde{M}_1 + b_3\big)_{(w,h)},$$
where $W_3$ and $b_3$ are the weight and bias matrix of the 1 × 1 convolutional layer in this deep MIL module.
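A sketch of such a spatial deep-MIL weighting, under assumed shapes (the 1 × 1 convolution again written as a tensordot, reducing the map to one channel before a softmax over all spatial positions):

```python
import numpy as np

def spatial_attention_weights(M, W, b):
    # Spatial deep-MIL weighting sketch: a 1x1 convolution reduces the
    # (L, H, W) map M to a single (H, W) score map, and a softmax over all
    # spatial positions turns the scores into instance weights summing to 1.
    s = np.tensordot(W, M, axes=([1], [0]))[0] + b   # (H, W) score map
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M = rng.random((3, 7, 7))                 # 3-category map on a 7x7 grid
beta = spatial_attention_weights(M, rng.random((1, 3)), 0.0)
```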

Key instance aggregation module
Before generating the scene probability distribution, it is required to aggregate the above rotation insensitive and rotation sensitive instance representations. The key instance aggregation (KIA) module keeps the scene scheme from these aggregated representations invariant to the changes in instance positions caused by rotation.
First, the instance weight distribution $\{a^{\theta_k}_{w,h}\}$ from the above weight-sharing deep MIL module is applied as a point-wise product on the instance representations $\{M_1^{\theta_k}\}$, emphasizing the contribution of key instances in determining the scene scheme.
Specifically, assume $1 \le w \le W$, $1 \le h \le H$, and $c$ denotes the $c$th channel corresponding to the $c$th scene category for $L$ categories. Also assume $\cdot$ denotes the element-wise product. Then, the feature response of the $c$th dimension of an instance is accentuated by
$$M'^{\theta_k}_{1,(w,h,c)} = a^{\theta_k}_{w,h} \cdot M^{\theta_k}_{1,(w,h,c)}.$$
Similarly, for the rotation sensitive instance representation $\tilde{M}_{1,(w,h,c)}$, the feature response of the $c$th dimension of an instance is accentuated by
$$\tilde{M}'_{1,(w,h,c)} = b_{w,h} \cdot \tilde{M}_{1,(w,h,c)}.$$
Then, as demonstrated in Fig. 3, the accentuated instance representations are aggregated. In this way, both the summed rotation sensitive representation and the per-angle rotation insensitive representations contribute to the scene scheme prediction.
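The point-wise weighting and sum-based aggregation above can be sketched as follows (shapes and the final sum over positions are our assumptions for illustration); the sum over positions also makes the bag scores invariant to a joint shuffle of instance positions:

```python
import numpy as np

def aggregate(M, weights):
    # KIA sketch: the (H, W) instance weights multiply every channel of the
    # (L, H, W) representation point-wise, and a sum over positions yields a
    # single bag-level score per scene category.
    return (weights[None, :, :] * M).sum(axis=(1, 2))

rng = np.random.default_rng(2)
M = rng.random((3, 5, 5))                # 3 categories, 5x5 instances
w = rng.random((5, 5)); w /= w.sum()     # normalized instance weights
scores = aggregate(M, w)

# Shuffling instance positions jointly in M and w leaves the bag scores
# unchanged: the sum-based aggregation is permutation-invariant.
perm = rng.permutation(25)
M_perm = M.reshape(3, -1)[:, perm].reshape(3, 5, 5)
w_perm = w.reshape(-1)[perm].reshape(5, 5)
```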

Semantic fusion module
The aim of this module is two-fold: (1) convert the instance representations to the scene probability distribution in a rotation equivalent manner, and (2) guide the convolution parameter learning process to tolerate rotation variance. In this way, two scene probability distributions, i.e., $\{p^1_l\}$ and $\{p^2_l\}$, are considered. Note that this conversion is the MIL aggregator $g$ in Eq. (3). The MIL aggregator is permutation-invariant, and hence a change of instance position caused by rotation does not affect the scene scheme prediction.

Semantic fusion loss
The two-branch semantic fusion module is matched with a specific loss function $\mathcal{L}$ consisting of a classification term $\mathcal{L}_{CLS}$ and a rotation regularization term $\mathcal{L}_{ROT}$. The scene probability distribution $\{p^1_l\}$ is directly used by the classification loss $\mathcal{L}_{CLS}$, which is calculated as
$$\mathcal{L}_{CLS} = -\sum_{l=1}^{L} y_l \log p^1_l,$$
where $y_l$ is the true label of a scene. The term $\mathcal{L}_{ROT}$ describes the potential difference among the $p^2_{\theta_k}$ ($k = 1, \ldots, 8$). It regularizes the convolutional feature learning process despite the impact of different orientations. Finally, our semantic fusion loss function $\mathcal{L}$ is the combination of the $\mathcal{L}_{CLS}$ and $\mathcal{L}_{ROT}$ terms, calculated as
$$\mathcal{L} = \mathcal{L}_{CLS} + \lambda \mathcal{L}_{ROT},$$
where $\lambda$ is a hyper-parameter to balance the impact of the two terms. Empirically, we set $\lambda = 5 \times 10^{-4}$.
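A minimal sketch of such a combined loss. The cross-entropy classification term is standard; the paper's exact regularization term is not reproduced here, so the mean squared deviation of the per-angle distributions from their average stands in as one plausible difference measure:

```python
import numpy as np

def semantic_fusion_loss(p1, p2_per_angle, y, lam=5e-4):
    # Sketch of L = L_CLS + lambda * L_ROT. L_CLS is cross-entropy on the
    # bag distribution p1; L_ROT here is a stand-in difference measure over
    # the per-angle distributions, not the paper's exact formula.
    p1, y = np.asarray(p1), np.asarray(y)
    l_cls = -np.sum(y * np.log(p1 + 1e-12))
    p2 = np.asarray(p2_per_angle)          # shape (8, L): one row per angle
    l_rot = np.mean((p2 - p2.mean(axis=0)) ** 2)
    return l_cls + lam * l_rot

# With identical per-angle distributions the regularizer vanishes and the
# loss reduces to the cross-entropy term alone.
loss = semantic_fusion_loss([0.7, 0.2, 0.1],
                            np.tile([0.6, 0.3, 0.1], (8, 1)),
                            [1, 0, 0])
```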

Dataset
Five rotation variant recognition datasets from four different image domains are used to validate the effectiveness of our RE-CNN framework; they are summarized in Table 1.

Aerial Image Dataset (AID)
The bird's-eye view of aerial sensors results in aerial scenes posed in arbitrary orientations. The AID dataset is a large-scale aerial scene classification benchmark with 30 categories and 10,000 samples in total (Xia et al., 2017).

Large Age Gap (LAG)
Glaucoma pathological parts can appear at any position and in arbitrary orientations along the circle-shaped optic disc. LAG is a newly-released glaucoma recognition benchmark containing 1710 glaucoma and 3140 non-glaucoma samples (Li et al., 2019).

Kylberg Texture Dataset (KTD)
Texture is an important recognition cue in industrial applications and a major challenge for texture recognition is its arbitrary orientation.KTD is a 28-class texture recognition dataset with 160 samples per class (Kylberg, 2011).
The last two benchmarks (MNIST-rot and MNIST-rot-12k) are traditional small-sized standard benchmarks to validate rotation invariant representations, while the first three benchmarks (AID, LAG, KTD) cover more challenging real-world large-scale rotation variant scenes.
On AID, the evaluation protocol (Xia et al., 2017) randomly selects 50% of the samples as the training set and the remaining samples as the test set. The mean and standard deviation of the overall accuracy (denoted as OA) over ten independent runs are reported (Xia et al., 2017). Following the common evaluation protocol, both mean and variance are presented in two-decimal format (Xia et al., 2017).
On LAG and KTD, the evaluation protocols report on test accuracy (denoted as Acc) from five-fold cross-validation experiments (Li et al., 2019;Kylberg, 2011).
For the comparison with state-of-the-art methods in each community, following the existing protocols, the test samples are not rotated (Section 5.3). In contrast, to fully evaluate the performance of the former rotation invariant methods, the test samples are rotated under a variety of settings (Section 5.4).

Hyper-parameter settings
For fair evaluation, the baseline ResNet-50 on the AID, LAG and KTD datasets is implemented by ourselves under the same hyper-parameter settings as the proposed RE-CNN. The batch size of all our experiments is set to 64. The Adam optimizer is used. The initial learning rate is 5 × 10⁻⁵ and is divided by 10 every 20 epochs. The training process terminates after 60 epochs. To overcome potential over-fitting, ℓ2 regularization with a relative importance weight of 5 × 10⁻⁴ is used. Moreover, the dropout rate is set to 0.2 for all experiments.
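The step learning-rate schedule described above can be written down explicitly (the function is ours, for illustration):

```python
def learning_rate(epoch, base_lr=5e-5):
    # Step schedule from the settings above: the initial learning rate 5e-5
    # is divided by 10 every 20 epochs; training stops after 60 epochs.
    return base_lr / (10 ** (epoch // 20))
```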
The backbone on MNIST-rot and MNIST-rot-12k is a naive four-layer CNN. Its performance is directly cited from the corresponding references. All the hyper-parameter settings of our RE-CNN are the same as in Dmitry et al. (2016), Zhang et al. (2017), Zhou et al. (2017), Cohen and Welling (2016) and Worrall et al. (2017).

Parameter initialization
For all experiments except MNIST-rot and MNIST-rot-12k, the model pre-trained on ImageNet is used to initialize the parameters of the backbone. For the remaining parts of our RE-CNN, the weight parameters are randomly initialized with a standard deviation of 0.001. All bias parameters are initialized to zero. For the experiments on MNIST-rot and MNIST-rot-12k, the parameter initialization is the same as used in Dmitry et al. (2016).

Development environment
All the experiments are implemented on a workstation with an Intel ® Core™ i7-10700K CPU and 64 GB memory.Two GeForce RTX 2080 SUPER GPUs are utilized for acceleration.

Comparison with rotation variant recognition methods
This subsection reports the performance of our RE-CNN on the three large-scale rotation variant recognition benchmarks (AID, LAG and KTD) and compares it with current rotation variant scene recognition methods.

On AID
The performance of our RE-CNN and other state-of-the-art methods on the AID benchmark is listed in Table 2. It shows that the proposed RE-CNN outperforms all these methods by a large margin. The close performance of DCNN (Cheng et al., 2018) may be caused by the additional pair-wise supervision used by DCNN, which is stronger than the supervision of conventional deep learning pipelines and RE-CNN.
Although recent work in aerial imaging tends to highlight key regions in an aerial scene regardless of their orientation, these methods are still incapable of providing a rotation invariant scene representation. In contrast, our RE-CNN learns the rotation invariant representation in a top-down manner, and thus enhances the model's generalization capability.

On LAG
Table 3 shows the performance of our RE-CNN and current fundus disease recognition methods on the LAG dataset. It can be seen that our RE-CNN significantly outperforms existing state-of-the-art methods for fundus image disease recognition.
As research on high-resolution fundus image disease recognition has only intensified over the past few years, only a few methods consider the rotation variance problem in this domain. Our RE-CNN not only highlights small pathological regions, but also learns a rotation invariant scene representation.

Table 3
Classification accuracy of our proposed RE-CNN and other approaches on the LAG dataset. Results presented are five-fold cross-validation test accuracy (Li et al., 2019); metrics are presented in %. The ResNet-50 result is implemented under the same hyper-parameter settings as the RE-CNN. The performance of the state-of-the-art methods is directly cited from the corresponding references.


Per-category classification accuracy
Fig. 4 lists the per-category classification accuracy of the baseline and our RE-CNN on the AID, LAG and KTD benchmarks respectively. It can be seen that by learning a rotation invariant feature representation from these scenes, the per-category recognition performance is significantly increased compared to the CNN baseline.

Visualization
Fig. 5 shows a number of samples from the three large-scale benchmarks. The key instances and the key regions related to the scene scheme have higher feature responses after being processed by our RE-CNN, regardless of their orientation. This may be one of the reasons for its superior performance. Also, the interpretable feature maps indicate that the representation learnt by our pipeline has the potential to be transferred to down-stream detection and segmentation tasks for more rotation-robust feature representations.
To understand the impact of RE-CNN on the low-level convolutional features, Fig. 6 provides visualized low-level features from the first block of the ResNet-50 backbone. The cases with and without the proposed top-down rotation invariant learning scheme on the AID benchmark are provided. The low-level convolutional feature maps are resized and overlaid on the samples for clarity. Without the proposed scheme, the generic convolutional features tend to be randomly scattered over the entire image. In contrast, with the proposed scheme, the low-level convolutional features tend to highlight the corners or edges of the key objects in the scene, which contain more abundant rotation information. This observation may explain the performance gain from 91.72% (baseline) to 96.95% (RE-CNN), as rotation information is important for understanding rotation variant scenes.

Comparison with rotation invariant methods
This subsection compares and discusses the performance of our top-down RE-CNN and existing bottom-up rotation invariant scene representation learning methods, namely TI-pooling (Dmitry et al., 2016), RILBCNN (Zhang et al., 2017), ORN (Zhou et al., 2017), H-Net (Worrall et al., 2017), RotEqNet (Marcos et al., 2017) and P4CNN (Cohen and Welling, 2016). Moreover, the performance of the baselines (for details please refer to Table 1) and of two commonly-used data augmentation approaches (rotating samples to 45, 90 and 135 degrees, denoted as four-angle augmt.; rotating each sample to a random angle, denoted as random augmt.) is also reported for reference.
Note that: (1) existing recognition methods that theoretically generate a rotation invariant representation are only validated on small-sized standard benchmarks; (2) the performance on these two benchmarks is saturated. Hence, for a fair comparison: (1) on the three large-scale recognition benchmarks (AID, LAG and KTD), the above methods are re-implemented with their default settings under the same ResNet-50 backbone; (2) on MNIST-rot and MNIST-rot-12k, the same baseline as in former works (Dmitry et al., 2016; Zhang et al., 2017; Zhou et al., 2017; Worrall et al., 2017; Marcos et al., 2017; Cohen and Welling, 2016) is adopted.

On overall accuracy
Table 5 lists all the experimental results of these rotation invariant scene representation methods.Some interesting observations can be found.
• Our RE-CNN leads to a performance gain on all five benchmarks, indicating its effectiveness when applied to multiple image domains, especially large-scale scenes, compared to existing rotation invariant recognition methods that learn in a bottom-up manner.

Table 5 (caption fragment). … (Dmitry et al., 2016; Zhou et al., 2017; Zhang et al., 2017; Cohen and Welling, 2016; Worrall et al., 2017), and the results are directly cited from the corresponding references. For AID, LAG and KTD, the baseline is ResNet-50 implemented under the same hyper-parameter settings as the RE-CNN; '-' denotes not reported. Four-angle and random augmentation denote rotation based data augmentation where samples are rotated by 0, 45, 90 or 135 degrees and by random angles respectively. In (d), (e) and (f), BS: baseline, 4A: four-angle rotation based data augmentation, RA: random rotation based data augmentation, TI: TI-pooling (Dmitry et al., 2016), P4: P4CNN (Cohen and Welling, 2016), RI: RILBCNN (Zhang et al., 2017), OR: ORN (Zhou et al., 2017), RN: RotEqNet (Marcos et al., 2017), HN: H-Net (Worrall et al., 2017), RE: RE-CNN.

• Both our top-down pipeline and the existing bottom-up pipelines outperform the commonly-used rotation based data augmentation strategies in learning a rotation invariant representation. While four-angle augmentation slightly improves the overall recognition capability, random rotation augmentation decreases it.
The reason is that the convolution operation is sensitive to rotation, so the feature representations extracted at different angles vary. This makes it difficult for existing bottom-up methods to learn a more robust rotation invariant representation. Random rotation based data augmentation suffers from a similar issue: the features from a wide variety of rotation angles vary too much, and the overall recognition capability declines.
To show how the proposed top-down rotation invariant learning scheme outperforms the existing bottom-up schemes, a visualization is given in Fig. 8. The proposed RE-CNN learns the rotation invariant representation from the instance-level representation; for comparison, the last-layer feature maps of the other six bottom-up schemes are averaged and normalized. Although all six bottom-up methods respond differently to different rotation angles, five out of six produce scattered activations over the image and fail to properly activate the key local regions. In contrast, the proposed RE-CNN not only responds differently to different rotation angles, but also activates the key local regions properly despite the rotation.
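The averaged-and-normalized heatmap used for this comparison can be sketched in a few lines. The function below is illustrative, not the authors' exact visualization code:

```python
import numpy as np

def feature_heatmap(fmap):
    """Average a (C, H, W) feature map over channels and min-max
    normalize to [0, 1], as done for the Fig. 8 comparison."""
    heat = fmap.mean(axis=0)                # (H, W) channel average
    heat = heat - heat.min()
    rng = heat.max()
    return heat / rng if rng > 0 else heat  # guard against a flat map

# toy last-layer feature map: 4 channels, 8x8 spatial grid
fmap = np.random.rand(4, 8, 8)
heat = feature_heatmap(fmap)
assert heat.shape == (8, 8) and heat.min() == 0.0 and heat.max() == 1.0
```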
However, the overall classification performance alone is not sufficient to fully evaluate a model's capability of learning rotation invariant representations. Hence, the following two subsections consider the model's performance when all test samples are rotated by a specific angle and by random angles.

On specific rotation angle
Fig. 7(a), (b) and (c) demonstrate the recognition performance variation when all test samples from AID, LAG and KTD are rotated by a specific angle. For full testing, the rotation range is [0, π] with an interval of π∕12.
It is shown that our RE-CNN not only outperforms existing rotation invariant recognition approaches at every specific rotation angle, but also demonstrates more stable performance across all rotation angles. As the backbone is kept the same, the effectiveness of our RE-CNN may be attributed to the novel top-down rotation invariant representation learning scheme, which bypasses the feature variance caused by the convolution operation in the feature extraction stage.
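The per-angle evaluation protocol described above can be sketched as follows; `predict` and the toy data are illustrative placeholders, not the paper's models:

```python
import numpy as np
from scipy.ndimage import rotate

def accuracy_per_angle(images, labels, predict, n_steps=12):
    """Evaluate a classifier when every test image is rotated by each
    angle in [0, pi] with a pi/12 interval (15-degree steps).
    `predict` maps a batch of images to predicted labels."""
    accs = {}
    for k in range(n_steps + 1):            # 0, 15, ..., 180 degrees
        deg = 180.0 * k / n_steps
        rotated = np.stack([rotate(im, deg, reshape=False) for im in images])
        accs[deg] = float(np.mean(predict(rotated) == labels))
    return accs

# toy check with a trivially rotation-invariant dummy classifier
imgs = np.random.rand(6, 16, 16)
labels = np.zeros(6, dtype=int)
accs = accuracy_per_angle(imgs, labels,
                          predict=lambda x: np.zeros(len(x), dtype=int))
assert 0.0 in accs and 180.0 in accs and all(a == 1.0 for a in accs.values())
```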

On random rotation angle
Fig. 7(d), (e) and (f) demonstrate the recognition performance fluctuation when each sample from AID, LAG and KTD is rotated by a random angle, so that every sample carries a different orientation. To provide statistically significant and representative results, these observations are based on 20 independent runs.
Fig. 7 shows that our RE-CNN fluctuates the least among these rotation invariant recognition approaches. This is not difficult to explain: in existing bottom-up approaches, the differences between feature representations from multiple rotation angles accumulate during feature extraction, which lowers the generalization capability. In contrast, our top-down scheme bypasses this problem and thus generalizes better.
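The fluctuation statistic behind this comparison is simply the spread of accuracy over the independent runs; a minimal sketch (the run accuracies are made-up toy numbers):

```python
import numpy as np

def fluctuation(acc_runs):
    """Summarize accuracy over independent runs with randomly rotated
    test samples: mean and sample standard deviation, the kind of
    statistic underlying the Fig. 7(d)-(f) fluctuation comparison."""
    acc = np.asarray(acc_runs, dtype=float)
    return acc.mean(), acc.std(ddof=1)      # ddof=1: sample std over runs

mean, std = fluctuation([0.95, 0.96, 0.94, 0.95])
assert abs(mean - 0.95) < 1e-9 and std > 0
```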

Ablation studies
Our RE-CNN consists of a backbone, a transitional layer (TL), a key instance selection (KIS) module, a key instance aggregation (KIA) module and a semantic fusion (SF) module. The ablation studies investigate the performance gain of each component, and all results are listed in Table 6. Note that, for fair comparison, in all cases without SF the scene probability distribution is directly generated by global average pooling followed by a softmax function, and the cross-entropy loss function is utilized.

Effect of TL
The experiments on both the AID and LAG benchmarks indicate that simply using the TL to generate MACCMs only slightly improves the classification performance. This also demonstrates that more advanced solutions are needed to solve the rotation invariant problem than simply rotating samples as augmentation.

Effect of KIS
Two comparison pairs on AID indicate that utilizing KIS leads to performance gains of 1.48% and 1.83% respectively. Similarly, the improvements on LAG are 1.89% and 1.98% respectively. The effectiveness of KIS may be explained by the fact that our deep MIL module stresses the regions of interest (RoIs) in a scene regardless of the rotation angle. Hence, the representation becomes less sensitive to the changes caused by rotation.

Effect of KIA
Two comparison pairs on AID and LAG demonstrate that the performance gains of KIA are 1.86%, 1.69% and 1.89%, 1.79% respectively. KIA is important as it aggregates both the rotation sensitive and rotation insensitive representations from KIS in a rotation equivalent manner. Bear in mind that the permutation invariant nature of the MIL aggregation function makes the scene representation invariant to the position changes of instances caused by rotation.
For an intuitive understanding, some instance representations from different rotation angles are visualized in Fig. 9 (I), where the key instances are activated properly regardless of the orientation. In addition, heatmaps processed by only KIS or by both KIS and KIA are displayed in Fig. 9 (II), showing that our KIA helps activate the RoIs more accurately.
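The permutation-invariance argument can be illustrated with a toy attention-based MIL pooling. The scorer `w` and the bag below are made-up, and this is a generic MIL aggregator rather than the exact KIA module:

```python
import numpy as np

def mil_attention_pool(instances, w):
    """Permutation-invariant attention pooling over a bag of instance
    features (N, D): softmax attention scores weight a sum, so shuffling
    the instances (e.g., a position change caused by rotation) leaves
    the bag representation unchanged. `w` (D,) is an illustrative scorer."""
    scores = instances @ w                  # (N,) per-instance score
    a = np.exp(scores - scores.max())
    a = a / a.sum()                         # softmax attention weights
    return a @ instances                    # (D,) bag representation

rng = np.random.default_rng(0)
bag = rng.normal(size=(10, 4))
w = rng.normal(size=4)
z1 = mil_attention_pool(bag, w)
z2 = mil_attention_pool(bag[::-1], w)       # same bag, permuted instance order
assert np.allclose(z1, z2)                  # permutation invariance
```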

Effect of SF
Among the three comparison pairs with and without our SF module, the performance gain on the AID dataset varies from 1.38% to 1.73%. Similarly, the performance gain on the LAG dataset varies from 1.64% to 1.74%. Generally speaking, our SF not only stresses the contribution of key instances but also regularizes the entire learning process to be rotation tolerant.

Effect of regularization loss
The loss function of the proposed RE-CNN combines the conventional classification loss term and the regularization loss (Eq. (15)). The impact of the regularization loss, which is based on the difference between feature representations from different rotation angles, is also investigated. When the RE-CNN framework only has the classification loss, the performance on AID and LAG declines by 0.65% and 0.75% respectively. The regularization loss helps align the representations from different rotation angles to the same semantic label of the scene, and thus benefits the model's robustness to some extent.
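The combined objective can be sketched as a classification loss plus an 𝛼-weighted rotation-consistency term. The exact form of Eq. (15) may differ; the code below assumes a simple squared-difference regularizer for illustration:

```python
import numpy as np

def combined_loss(p, y, feat, feat_rot, alpha=1e-4):
    """Sketch of a classification-plus-regularization objective:
    cross-entropy on the scene prediction plus an alpha-weighted squared
    difference between the representations of the original and rotated
    inputs (an assumed stand-in for the paper's Eq. (15))."""
    ce = -np.log(p[y] + 1e-12)              # cross-entropy classification term
    reg = np.sum((feat - feat_rot) ** 2)    # rotation-consistency term
    return ce + alpha * reg

p = np.array([0.7, 0.2, 0.1])               # toy predicted class probabilities
feat = np.ones(8)
# identical representations => regularizer vanishes, only cross-entropy remains
assert combined_loss(p, 0, feat, feat) == combined_loss(p, 0, feat, feat, alpha=0.0)
```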

Generalization capability test
To validate the generalization capability of our RE-CNN scheme, we report its performance when embedded into three conventional CNN backbones, namely, VGGNet-16 (Simonyan and Zisserman, 2015), ResNet-50 (He et al., 2016) and Inception-V2 (Szegedy et al., 2015), and two recent backbones, namely, Swin-T (Liu et al., 2021) and ViTAEv2 (Wang et al., 2022) (denoted as VGG, ResNet, Inception, Swin-T and ViTAEv2 in Table 7), on the AID benchmark. Apart from the required overall accuracy metric (Xia et al., 2017), the parameter number and the frames per second are also reported to evaluate the impact on model size and inference time. From all the outcomes in Table 7, it is clearly seen that our RE-CNN framework significantly boosts the recognition performance on not only the three classic CNN backbones but also the two latest state-of-the-art backbones, while only slightly increasing the parameter number and prediction time. Hence, our RE-CNN scheme can be easily adapted to existing CNN backbones with a generic performance boost.

Influence of the sampling interval for 𝜃𝑖
The sampling interval for 𝜃𝑖 is 45 degrees by default in our framework. It is interesting to observe the influence of the sampling interval on the overall recognition performance: a larger interval leads to fewer MACCMs, while a smaller interval leads to more MACCMs. To investigate this, we test sampling intervals of 15, 30, 45, 60 and 75 degrees on the AID benchmark, while all the other default settings are kept the same. Table 8 lists all the results. Intervals from 15 to 60 degrees lead to very close outcomes (96.89%, 96.95% and 96.90% among them), indicating no significant difference within this range. When the sampling interval is too large, the feature responses may not be sufficient for the entire model in the learning phase, and thus the performance shows a slight decline.
Our RE-CNN only utilizes the instance representations rotated by 45, 90, 135, 180, 225, 270, 315 and 360 degrees respectively, yet it still demonstrates stable performance over a variety of rotation angles. The effectiveness is two-fold: (1) the permutation-invariant nature of the MIL aggregation function makes the scene representation invariant to the position changes of instances caused by rotation, so the specific rotation angle has little influence on the recognition performance;
(2) as the high-level feature maps are often small in size (e.g., 8 × 8 in ResNet), the rotated high-level feature maps are not that sensitive to specific rotation angles.
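Producing rotated copies of a small high-level feature map at the default 45-degree interval can be sketched as follows; this is a generic illustration, not the exact MACCM construction:

```python
import numpy as np
from scipy.ndimage import rotate

def rotated_feature_maps(fmap, interval_deg=45):
    """Generate rotated copies of a (C, H, W) high-level feature map at
    multiples of the sampling interval (45 degrees by default, giving
    eight copies up to 360 degrees), keeping the spatial size fixed."""
    angles = np.arange(interval_deg, 361, interval_deg)
    return [rotate(fmap, a, axes=(1, 2), reshape=False) for a in angles]

fmap = np.random.rand(4, 8, 8)              # small high-level map, e.g. 8x8
maps = rotated_feature_maps(fmap)
assert len(maps) == 8
assert all(m.shape == fmap.shape for m in maps)
assert np.allclose(maps[-1], fmap, atol=1e-6)   # a 360-degree rotation is the identity
```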

Influence of the hyper-parameter 𝛼
Our loss function has two terms, the classification loss and the regularization loss, which are balanced by a hyper-parameter 𝛼. Table 9 shows the impact of 𝛼 on the classification results.
It can be seen that when 𝛼 is set to 5×10⁻³, 10⁻⁴ or 10⁻⁵, the performance of our RE-CNN is relatively stable. However, when 𝛼 is either too large or too small (5×10⁻² or 5×10⁻⁶), the performance of our model degrades. A too-small 𝛼 indicates that the model does not fully learn the rotation robust features from the MACCMs, while a too-large 𝛼 may overwhelm the impact of the original scene representation.

Influence of network initialization
Two network initialization settings, namely, the optimizer and the weight initialization, may have an impact on the classification performance of RE-CNN. Multiple variations of these settings are studied on the KTD dataset (Kylberg, 2011).
Table 10 reports the five-fold test accuracy (denoted by Acc) when using stochastic gradient descent (SGD), stochastic gradient descent with momentum (SGDM) and the Adam optimizer. Generally, the performance of RE-CNN is not influenced by the choice of optimizer.
Table 11 reports the five-fold test accuracy (denoted by Acc) when the standard deviation of the weight initialization varies from 1×10⁻⁵ to 1×10⁻². It is shown that the weight initialization has very little influence on the performance of the proposed RE-CNN.
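The weight-initialization setting varied in Table 11 amounts to drawing zero-mean Gaussian weights with a chosen standard deviation; a minimal sketch:

```python
import numpy as np

def gaussian_init(shape, std, seed=0):
    """Zero-mean Gaussian weight initialization with a configurable
    standard deviation, as varied from 1e-5 to 1e-2 in Table 11."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=shape)

w = gaussian_init((64, 32), std=1e-2)       # one toy weight matrix
assert w.shape == (64, 32)
assert abs(w.std() - 1e-2) < 5e-3           # empirical std close to the target
```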
The curves of the training and test losses are shown in Fig. 10. After about 4000 iterations, both the training and test losses on the KTD dataset become stable, indicating no over-fitting in the learning process.

Conclusion
In this paper, we proposed the RE-CNN framework for rotation variant scene recognition. Compared with existing rotation invariant scene recognition methods that learn in a bottom-up manner, the RE-CNN scheme adopts the classic MIL formulation and learns in a novel top-down manner. It not only eliminates the problem caused by the rotation sensitive nature of convolution operations in existing bottom-up pipelines, but also accentuates the key regions in a scene regardless of their orientation. Furthermore, by exploiting the permutation-invariant characteristic of the MIL aggregation function, it makes the scene prediction invariant to the position changes of instances caused by rotation. Extensive experiments demonstrate that our RE-CNN outperforms 24 representative SOTA approaches on five rotation variant scene benchmarks.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 2 .
Fig. 2. (a) & (b): Existing rotation invariant scene recognition pipelines learn in a bottom-up manner from either rotated samples (Zhang et al., 2017; Cheng et al., 2019; Dmitry et al., 2016) or rotated convolution filters (Cohen and Welling, 2016; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017); (c) our RE-CNN pipeline learns in a novel top-down manner. KIS: key instance selection module; KIA: key instance aggregation module; SF: semantic fusion module; A and B denote the correct and wrong category for intuitive illustration.
…the tolerance of rotation is implemented by a weight-sharing feature extraction module whose inputs come from different rotations. The first branch only has a weight-sharing spatial attention based deep MIL module, aiming to compute an instance spatial weight matrix for each element in our MACCMs.

Fig. 5 .
Fig. 5. Visualized samples from our RE-CNN. (a) Samples from aerial, medical and industrial scenes; (b) instance-level semantic response; (c) heatmap based on instance response.

Fig. 6 .
Fig. 6. Comparison of low-level features from the backbone without (denoted as baseline) and with (denoted as RE-CNN) the proposed top-down rotation invariant learning scheme. The RE-CNN enforces the low-level convolution features to focus more on the edges and corners of the key objects, which contain more abundant rotation information.

Fig. 8 .
Fig. 8. Feature maps from different rotation angles learnt by the proposed top-down RE-CNN and the other six bottom-up rotation invariant learning methods. The instance-level top-down scheme is more effective at highlighting the key local regions regardless of the rotation angle.

Fig. 9 .
Fig. 9. (I) Instance-level semantics from multiple rotation angles are activated properly. (II) Samples (a) and the corresponding heat maps when processed only by the KIS module (b) or by both the KIS and KIA modules (c).

Table 1
Summary of the four rotation variant scenarios involved in our experiments, including brief descriptions, corresponding benchmarks, evaluation protocols, baselines, sample numbers, scene category numbers and input image sizes.

Table 2
Overall accuracy (Xia et al., 2017) of our proposed RE-CNN and other approaches on the AID dataset. Results are presented in the form of 'average ± deviation' from ten independent runs (Xia et al., 2017); metrics presented in %. The ResNet-50 result is implemented under the same hyper-parameter settings as the RE-CNN. The performance of the state-of-the-art methods is directly cited from the corresponding references.

Table 6
Ablation study of our RE-CNN on the AID and LAG dataset (OA: Overall Accuracy required by Xia et al., 2017; Acc: Five-fold test accuracy required by Li et al., 2019; Metrics presented in %; ResNet: Backbone ResNet-50; TL: Transitional layer for MACCMs; KIS: Key instance selection module; KIA: Key instance aggregation module; SF: semantic fusion module).

Table 7
Performance of our RE-CNN on different backbones on the AID dataset (OA: Overall Accuracy required by Xia et al., 2017; metric presented in %; Para. num.: parameter number, presented in millions; FPS: frames per second).

Table 8
Influence of the sampling interval (presented in degrees) for 𝜃𝑖 on the performance of our RE-CNN on the AID benchmark (OA: Overall Accuracy required by Xia et al., 2017; metrics presented in %).

Table 9
Influence of the hyper-parameter 𝛼 on the performance of our RE-CNN on the AID benchmark (OA: Overall Accuracy required by Xia et al., 2017; metrics presented in %).

Table 10
Influence of different optimizers on the performance of our RE-CNN on the KTD benchmark (Acc: five-fold test accuracy required by Kylberg, 2011; metrics presented in %).

Table 11
Influence of different standard deviations of weight initialization on the performance of our RE-CNN on the KTD benchmark (Acc: five-fold test accuracy required by Kylberg, 2011; metrics presented in %). The optimizer is fixed as Adam in all experiments.