Deep Tensor Capsule Network

Capsule networks are a promising model in computer vision. They have achieved excellent results on simple datasets such as MNIST, but their performance deteriorates as data become more complicated. To address this issue, we propose a deep capsule network in this paper. To deepen the capsule network, we present a new tensor-capsule-based routing algorithm and the corresponding convolution operation. Compared with vector capsules, tensor capsules can capture more instance-level information, and the associated convolution operation reduces the number of parameters in the routing process. Furthermore, we propose a dropout mechanism for vector and tensor capsules to alleviate the potential overfitting problem. Finally, we inject the multi-scale capsules of the middle layers into a multi-scale decoder to recover more details of an image and produce clearer reconstructions. Experimental results on CIFAR10, Fashion-MNIST, and SVHN demonstrate that the proposed deep tensor capsule network achieves very competitive performance compared with other state-of-the-art capsule networks.


I. INTRODUCTION
In recent years, the convolutional neural network (CNN) has achieved higher performance than conventional manually designed feature-driven models in many computer vision tasks. However, CNNs have several limitations, such as the lack of equivariance and the inability to maintain spatial hierarchies between features. To address these limitations, the capsule network (CapsNet) [1] was proposed. In contrast to a CNN, CapsNet replaces the scalar neurons in a network with capsules, which are vectors of neurons, and replaces max-pooling with dynamic routing. A capsule contains several instantiation parameters of different types and is sent from one layer to the next by a powerful dynamic routing mechanism. In this way, the capsule network can maintain spatial information from the top to the bottom of the network.
Although CapsNet works well on MNIST [2], it does not perform as well as CNNs on complex datasets such as CIFAR10 [3]. CapsNet may require higher-dimensional capsules when classifying complex images. Moreover, some traditional machine learning methods and neural networks can still surpass CapsNet in classification tasks with more than 10 classes [4]. Therefore, the performance of the capsule network in difficult classification tasks needs to be further improved.
One of the main reasons for the excellent performance of CNNs on image classification tasks is their deep architecture. Intuitively, deepening CapsNet is a promising direction to address its poor performance on complex data. According to [5], it is not practical to construct a deep capsule network using only the fully connected capsule layers proposed by Sabour et al. [1]. Firstly, the dynamic routing algorithm of the capsule network requires a lot of computation and memory. Secondly, stacking multiple fully connected capsule layers results in poor performance [6]; according to [5], too many capsules lead to coupling coefficients that are too small, which dampens the gradient flow. Thirdly, a symmetric input cannot be distinguished correctly by CapsNet, and the classification accuracy decreases as the number of network layers increases [7]. Fourthly, as the number of network layers increases, overfitting becomes more and more serious, so in addition to reconstruction, capsule networks also need other regularization methods.
In order to construct a deep capsule network and deal with the problems mentioned above, this paper proposes the following solutions.
The convolution operation in CNNs is one of the keys to implementing a deep architecture. It transforms tensors by kernels with shared parameters and outputs feature maps, which are also tensors. Because the dynamic routing of CapsNet is based on matrix multiplication, it incurs a large parameter cost. If the convolution operation can be introduced into dynamic routing, a large number of parameters can be saved. Inspired by this, Rajasegaran et al. [5] introduced 3D convolution into the capsule routing process. However, this operation shares the convolution weights among all the capsules in a layer, which makes it difficult for the convolution operation to distinguish different capsules. This reduces the performance of the capsule network, and a similar defect is found in the shared-weight transformation matrix of [8]. To overcome this shortcoming while still introducing the convolution operation, convolution-based tensor capsule routing is proposed. Tensor capsules are composed of numerous related vector capsules and can represent more complex instantiated entities. Each child tensor capsule predicts its possible parents by an independent 2D convolution, so each tensor capsule corresponds to a convolution operation that is analogous to a transformation matrix between vector capsules. Moreover, a traditional convolution layer can be regarded as a special case of one iteration of tensor capsule routing.
Because model complexity inherently increases with model depth, a deeper model is likely to overfit seriously. However, the current deep capsule architecture lacks effective regularization methods other than reconstruction. To deal with the overfitting of the deep capsule network, a dropout method suitable for tensor capsules, called capsule dropout, is proposed.
The decoder is a special part of CapsNet that reconstructs the input images. It not only provides regularization for CapsNet but can also be used for semantic segmentation. However, Nair et al. pointed out that this decoder fails on complex images (such as CIFAR10): the reconstruction is too blurred to recognize. This is because the decoder reconstructs an image from a single capsule, and a capsule (typically 16 or 32 dimensions) provides limited information for restoring the original image. The middle layers of the deep capsule architecture contain multi-scale information about the image. Inspired by this, the multi-scale decoder is proposed, which uses multi-scale capsules to help reconstruct clear images.
To this end, this paper proposes three novel algorithms and networks: tensor capsule routing, tensor capsule based dropout, and a multi-scale decoder. They are integrated into a residual capsule architecture, effectively improving the performance of the deep capsule network. More specifically, the paper makes the following contributions: • A novel tensor capsule routing algorithm is proposed.
It takes the tensor as the data structure of capsules and implements the transformation between capsules by a convolution operation. The algorithm is introduced into the deep capsule architecture, so the network is called DeepTensorCaps.
• A novel tensor capsule based dropout is proposed. This method can effectively improve the generalization performance of DeepTensorCaps.
• A novel multi-scale decoder is proposed, which is a new network architecture for image reconstruction. It not only provides a regularization effect for DeepTensorCaps, but also achieves clear image reconstruction on CIFAR10.
• This paper evaluates the performance of DeepTensorCaps on several benchmark datasets: CIFAR-10, Fashion-MNIST, and SVHN [9]. The classification accuracy on these datasets shows that tensor capsule routing and dropout effectively improve the performance of the deep capsule network, and the image reconstruction results on CIFAR10 demonstrate the effect of the multi-scale decoder. Besides, we experimentally demonstrate that weight decay is not suitable for the capsule model and that the warm restart learning rate is beneficial for training a deep architecture. To the best of our knowledge, these experiments show that DeepTensorCaps achieves state-of-the-art performance in the capsule network domain.

II. RELATED WORK
Since Sabour et al. [1] proposed CapsNet, many improved algorithms have continued to emerge. These works can be roughly divided into the following aspects. Firstly, regarding routing algorithm improvements, many novel algorithms have been proposed. Some of these methods are outlined below. HitNet [10] uses a specific capsule layer, called the Hit-or-Miss layer, for data augmentation. Wang and Liu [11] formulate the routing algorithm as an optimization problem and propose a novel routing algorithm similar to the agglomerative fuzzy K-Means algorithm, which achieves better performance. Ren et al. [8] introduce a parameter-sharing mechanism to improve the generalization ability of the capsule network. Peer et al. [7] prove a drawback of dynamic routing algorithms, namely the inability to distinguish symmetric inputs, and propose a reasonable solution: biases and a homogeneous representation of instantiation vectors. Do Rosario et al. [12] propose a novel model called the multi-lane capsule network, which is similar to a multi-network integration solution. Zhao et al. [13] use a scale-invariant Max-Min function to improve the performance of CapsNet during the routing process.
Secondly, the performance of capsule networks is improved by combining capsule networks with convolutional networks. The common architecture uses a convolutional neural network as a feature extractor and a capsule network as a classifier. Some of these methods are outlined below. Yin et al. [14] propose to use the combination of a pre-trained convolutional network and a capsule network to handle hyperspectral image classification. Phaye et al. [15] replace the standard convolutional layers in the capsule network with densely connected convolutions and achieve better performance than the classic capsule network. Hoogi et al. [16] propose a novel capsule network called SACN, which inserts a self-attention layer between the convolutional layer and the primary capsule layer to improve the performance on image classification. Ren et al. [8] stack multiple residual blocks as a feature extractor and use a capsule layer with shared transformation matrices as a classifier to implement image classification.
Thirdly, to improve the performance of the capsule network on complex data, a deeper architecture can also be used. However, there are few studies in this direction, and the capsule networks mentioned in most papers have only 2 or 3 capsule layers. Therefore, a deeper architecture may be a worthwhile and promising research direction. To the best of our knowledge, DeepCaps [5] is the only paper that discusses a deep capsule network, and this paper is mainly inspired by their work. As shown in Fig. 1, the blue cubes on the left represent capsule tensors in layer $l$, each of shape $(h_l, w_l, a_l)$. Each of them predicts a group $i$ of predicted capsules by a 3D convolution operation, which is performed along the depth direction (namely, the vertical direction of the blue cubes), and the number of elements of this group is $c_{l+1}$. The corresponding predicted capsules (e.g., a set of yellow cubes in the horizontal direction) are weighted by coupling coefficients $R_i$ and summed to obtain $c_{l+1}$ final predicted capsules. These predicted capsules produce the final outputs after several routing iterations. 3D convolution greatly reduces the number of parameters of the capsule transformation, but because it shares a convolutional kernel among multiple capsules, the transformation between different capsules is the same, which damages the spatial information of the features. In order to maintain the spatial information between related capsules, DeepTensorCaps regards the feature maps obtained by the convolution operation as independent capsules and performs an independent capsule transformation through 2D convolution. Furthermore, DeepCaps only uses image reconstruction for regularization, which does not reduce overfitting very well; therefore, a novel dropout method is added to DeepTensorCaps to further address this problem. Moreover, the reconstruction of the DeepCaps decoder on CIFAR10 fails, while the multi-scale decoder overcomes this issue well. Our work continues along the deep architecture direction to explore appropriate routing algorithms, effective regularizations, and efficient training methods.

III. DeepTensorCaps ARCHITECTURE
DeepCaps [5] is adopted as the basic architecture, and the layers proposed in this paper are added, as shown in Fig. 2. The basic architecture of the capsule network is not changed much, mainly so that it can be compared with DeepCaps and the effects of tensor capsule routing and the regularization methods can be highlighted. As shown in Fig. 2, input images of size (64,64,3) pass through a traditional 2D convolutional layer followed by four residual capsule blocks, where k is the size of the kernels, c is the number of kernels, and s is the strides. The outputs of residual blocks 3 and 4 are flattened into 1D capsule arrays and fused by concatenation. Digital capsules are output through a fully connected capsule layer, where c is the number of capsules, a is the number of atoms, and r is the number of routing iterations.
DeepTensorCaps is mainly composed of a 2D convolutional layer, residual capsule blocks, capsule dropout layers, and a fully connected capsule layer. Each part of the network is described in detail below. Firstly, an image of size (64,64,3) is input into the 2D convolutional layer, which has 128 convolutional kernels of size (3,3), strides (1,1), and a ReLU activation function. This convolutional layer extracts the low-level feature maps of the original image.
Secondly, the 128 feature maps can be seen as a tensor capsule of shape (64,64,128). This tensor is then input into the first residual capsule block, as shown in Fig. 2. DeepTensorCaps contains a total of four residual capsule blocks with the same internal architecture. As shown in Fig. 3, the first residual capsule block has three tensor capsule layers, which are implemented by the tensor capsule routing algorithm described in section IV. In a residual capsule block, the first tensor capsule layer has a convolutional kernel of size (3,3), strides of (2,2), 32 output tensor capsules, 4 atoms per capsule, and 1 routing iteration. The parameters of the second, third, and skip-connection layers are the same as those of the first tensor capsule layer, except that the strides are (1,1). Subsequently, the outputs of the third layer and the outputs of the skip-connection layer are fused by element-wise addition. Finally, a tensor capsule based dropout layer processes the block's outputs. The remaining three residual capsule blocks are the same as the first one, with only a few differences. In the second block, the tensor capsule layers have 8 atoms; other configurations remain unchanged. The dropout layers are removed from blocks 3 and 4, because their outputs are concatenated and fed together into the vector capsule based dropout layer outside the blocks. The skip-connection layer of block 4 uses 3 routing iterations, which corresponds to DeepCaps. The outputs of blocks 3 and 4 are not fused by addition. Instead, the output tensors of both are flattened into one-dimensional arrays of vector capsules, and the two arrays are concatenated. This concatenated capsule array passes through a vector capsule based dropout layer and serves as the input to the fully connected capsule layer.
Thirdly, the last layer of DeepTensorCaps is the fully connected capsule layer, whose outputs are 10 digital capsules with 32 atoms.
Finally, the length of each digital capsule is calculated to obtain the probability of the class to which the input image may belong.
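As a small illustration (a sketch assuming the (batch, 10, 32) layout of the digital capsules described above), the class scores are simply the capsule lengths:

```python
import tensorflow as tf

def class_probabilities(digit_capsules):
    # digit_capsules: (batch, 10, 32) output of the fully connected capsule layer;
    # the length (L2 norm) of each capsule is the predicted score of its class.
    return tf.norm(digit_capsules, axis=-1)   # (batch, 10)
```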

A. TENSOR CAPSULE ROUTING ALGORITHM
In this paper, a tensor capsule can be thought of as a cube of neurons. The tensor capsule $i$ of layer $l$ is therefore defined as $u_i^l \in \mathbb{R}^{h_l \times w_l \times a_l}$, where $h_l$ is the height of the tensor capsule, $w_l$ is the width, and $a_l$ is the number of atoms. The number of tensor capsules in layer $l$ is $c_l$. In this section, the tensor capsules of layer $l+1$ are produced by routing the capsules of layer $l$. The detailed steps are described as follows and are shown in Fig. 4. In this figure, each tensor capsule $u_i^l$ in layer $l$ is convolved to generate the feature maps $\hat{U}_i$, which form an initial predicted tensor, where $K_i^l$ is the corresponding convolutional kernel. Then, each $\hat{U}_i$ is reshaped into a group of $c_{l+1}$ tensor capsules, called $U_i$, so there are $c_l \times c_{l+1}$ predicted tensor capsules serving as the input to the routing process. These capsules finally produce $c_{l+1}$ tensor capsules in layer $l+1$ through multiple iterations of routing by agreement (with trainable bias terms added). The detailed algorithm steps are described below.
Firstly, the group of tensor capsules predicted from layer $l$ is defined as $\hat{u}_{j|i}^{l+1}$, where $j \in \{1,\dots,c_{l+1}\}$ and $i \in \{1,\dots,c_l\}$. In classical routing algorithms [1], this prediction is implemented by matrix multiplication. In this paper, due to the huge cost of matrix multiplication, the transformation between tensor capsules in adjacent layers is replaced by a convolution, defined as
$$\hat{U}_i = u_i^l * K_i^l, \tag{1}$$
where $*$ denotes the 2D convolution and $K_i^l$ is a convolutional kernel that performs a tensor transformation. The shape of this kernel is $(k_l, k_l, a_l, c_{l+1} \times a_{l+1})$, where $k_l$ is the kernel size in layer $l$. Then, the feature maps $\hat{U}_i$ of shape $(h_{l+1}, w_{l+1}, c_{l+1} \times a_{l+1})$ are reshaped into a tensor $U_i$. In the next step, the tensor $U_i$ is split along the last dimension with length $a_{l+1}$, resulting in $c_{l+1}$ predicted tensor capsules $\hat{u}_{j|i}^{l+1}$ of shape $(h_{l+1}, w_{l+1}, a_{l+1})$.
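To make the shapes in this prediction step concrete, the following TensorFlow sketch (not the authors' code) implements one possible reading of (1): each tensor capsule is convolved with its own private kernel, and the resulting feature maps are split into predicted capsules. The function name, argument layout, and the use of SAME padding are illustrative assumptions.

```python
import tensorflow as tf

def predict_tensor_capsules(u_l, kernels, a_next, strides=2):
    """Prediction step of tensor-capsule routing (illustrative sketch only).
    u_l:     (c_l, h_l, w_l, a_l) stack of input tensor capsules
    kernels: (c_l, k_l, k_l, a_l, c_next * a_next), one private kernel per capsule,
             so weights are NOT shared across capsules (unlike 3D-convolution routing)
    returns: (c_l, c_next, h_next, w_next, a_next) predicted tensor capsules
    """
    u_hats = []
    for i in range(u_l.shape[0]):
        # each capsule is transformed by its own 2D convolution, Eq. (1)
        feat = tf.nn.conv2d(u_l[i][tf.newaxis], kernels[i],
                            strides=strides, padding="SAME")[0]
        h, w = feat.shape[0], feat.shape[1]
        c_next = feat.shape[-1] // a_next
        # reshape/split the feature maps into c_next predicted capsules of a_next atoms
        u_hats.append(tf.transpose(tf.reshape(feat, (h, w, c_next, a_next)),
                                   (2, 0, 1, 3)))
    return tf.stack(u_hats)
```

With, for example, $a_l = 4$ and $a_{l+1} = 8$, each private kernel has the shape $(k_l, k_l, 4, c_{l+1} \times 8)$ described above.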
Secondly, these predicted tensor capsules are used in the routing process. However, these capsules are tensors, which can also be regarded as matrices of vector capsules. Therefore, two routing-by-agreement methods are proposed: the matrix method and the tensor method.

1) MATRIX METHOD
A tensor capsule is viewed as a matrix of multiple vector capsules, and the log prior probability tensor is defined as $B \in \mathbb{R}^{c_l \times c_{l+1} \times h_{l+1} \times w_{l+1}}$, initialized to 0. The coupling coefficients are obtained by a softmax over the output capsules:
$$r_{ijxy} = \frac{\exp(b_{ijxy})}{\sum_{k} \exp(b_{ikxy})}, \tag{2}$$
where $r_{ijxy}$ is an element of the coupling coefficient matrix $r_{ij}$, and $x \in \{1,\dots,h_{l+1}\}$, $y \in \{1,\dots,w_{l+1}\}$ are the coordinates of this matrix. The final predicted tensor $s_j$ is a weighted sum over all $\hat{u}_{j|i}^{l+1}$:
$$s_j = \sum_{i} r_{ij} \odot \hat{u}_{j|i}^{l+1}, \tag{3}$$
where $\odot$ is defined as the element-wise product over the first and second dimensions of the tensor, after broadcasting over the third dimension. This step is visualized in Fig. 5. First, the coupling coefficient matrix is copied $a_{l+1}$ times and arranged in depth to form a new coefficient tensor, which is the broadcasting process. Then, the predicted tensor capsules and the coefficient tensor are multiplied element by element to produce the weighted predicted tensor capsules. Finally, all the corresponding tensors (in the direction of $i$) are added to obtain the final predicted tensor $s_j$. Next, the output tensor capsules of one routing iteration are obtained by squashing $s_j$ with the squash function
$$v_{jxy} = \frac{\|s_{jxy}\|^2}{1 + \|s_{jxy}\|^2} \, \frac{s_{jxy}}{\|s_{jxy}\|}, \tag{4}$$
where $v_{jxy} \in \mathbb{R}^{a_{l+1}}$ is an element of a tensor capsule $v_j \in \mathbb{R}^{h_{l+1} \times w_{l+1} \times a_{l+1}}$, which is the output of one iteration, and $x \in \{1,\dots,h_{l+1}\}$, $y \in \{1,\dots,w_{l+1}\}$ are the width and height coordinates of $v_j$, respectively. According to [7], in order to overcome a limitation of routing by agreement, a bias term $d_{jxy}$ is added to $s_{jxy}$ in (4), where $d_{jxy} \in \mathbb{R}^{a_{l+1}}$ is an element of the tensor $d_j \in \mathbb{R}^{h_{l+1} \times w_{l+1} \times a_{l+1}}$. Finally, at the end of one routing iteration, $B$ is updated by the agreement between $v_j$ and $\hat{u}_{j|i}^{l+1}$:
$$b_{ij} \leftarrow b_{ij} + \sum_{a=1}^{a_{l+1}} \left( v_j \odot \hat{u}_{j|i}^{l+1} \right)_{:,:,a}, \tag{5}$$
where $\odot$ is defined as the element-wise product, and $b_{ij}$ is updated after summing over the atom dimension.
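The matrix-method iteration can be summarized in a short TensorFlow sketch. It is an illustration of Eqs. (2)-(5) under the shape conventions above, not the authors' implementation; in particular, adding the bias $d_j$ before the squash and the small epsilon for numerical stability are assumptions.

```python
import tensorflow as tf

def routing_matrix_method(u_hat, d, iterations=3):
    """Matrix-method routing sketch.
    u_hat: (c_l, c_next, h, w, a) predicted tensor capsules
    d:     (c_next, h, w, a) trainable bias tensors d_j
    """
    c_l, c_next, h, w, a = u_hat.shape
    b = tf.zeros((c_l, c_next, h, w))               # log prior probabilities B
    for _ in range(iterations):
        r = tf.nn.softmax(b, axis=1)                # coupling coefficient matrices, Eq. (2)
        # broadcast r over the atom dimension, weight and sum the predictions, Eq. (3)
        s = tf.reduce_sum(r[..., tf.newaxis] * u_hat, axis=0) + d
        # squash each vector capsule inside the tensor, Eq. (4)
        norm = tf.norm(s, axis=-1, keepdims=True)
        v = (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + 1e-8)
        # agreement update: sum over the atom dimension only, Eq. (5)
        b = b + tf.reduce_sum(u_hat * v[tf.newaxis], axis=-1)
    return v                                        # (c_next, h, w, a)
```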

2) TENSOR METHOD
Unlike the coefficients in matrix form mentioned above, each scalar coupling coefficient corresponds to a tensor capsule. So, the log prior probability matrix is defined as $B \in \mathbb{R}^{c_l \times c_{l+1}}$, initialized to 0, and $r_{ij}$ can be calculated by the softmax function
$$r_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}. \tag{6}$$
Before calculating the final predicted tensor $S_j$, each scalar coefficient is extended to a matrix of size $h_{l+1} \times w_{l+1}$ by replicating the scalar. Then $S_j$ is obtained using (3). The process is shown in Fig. 6: the scalar is first extended to a matrix, then broadcast along the depth direction, and multiplied element by element with the predicted tensor.
Subsequently, by using the squash function (4), the output tensor capsules $v_j$ of one routing iteration are obtained from $S_j$. Finally, the log prior probabilities need to be updated. Since $b_{ij}$ is a scalar, the process uses (7), which is different from (5):
$$b_{ij} \leftarrow b_{ij} + \frac{1}{N} \sum_{x=1}^{h_{l+1}} \sum_{y=1}^{w_{l+1}} \left( \sum_{a=1}^{a_{l+1}} \left( v_j \odot \hat{u}_{j|i}^{l+1} \right)_{xya} \right), \tag{7}$$
where $N = h_{l+1} \times w_{l+1}$. The part inside the parentheses gives a matrix as in (5), whose shape is $(h_{l+1}, w_{l+1})$; Equation (7) thus amounts to taking the mean of the elements of the matrix produced by (5).
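A parallel sketch for the tensor method shows how a single scalar coefficient per capsule pair is broadcast over the spatial and atom dimensions, and how the agreement in (7) is averaged over the $h \times w$ positions. Again, this is an illustrative reading rather than the published code.

```python
import tensorflow as tf

def routing_tensor_method(u_hat, d, iterations=3):
    """Tensor-method routing sketch: one scalar coupling coefficient per (i, j) pair.
    u_hat: (c_l, c_next, h, w, a) predicted tensor capsules
    d:     (c_next, h, w, a) trainable bias tensors
    """
    c_l, c_next, h, w, a = u_hat.shape
    b = tf.zeros((c_l, c_next))                     # scalar log priors
    n = float(h * w)                                # N = h_{l+1} * w_{l+1}
    for _ in range(iterations):
        r = tf.nn.softmax(b, axis=1)                # Eq. (6)
        # extend each scalar r_ij over the spatial and atom dimensions, then weight and sum
        s = tf.reduce_sum(r[:, :, tf.newaxis, tf.newaxis, tf.newaxis] * u_hat, axis=0) + d
        norm = tf.norm(s, axis=-1, keepdims=True)
        v = (norm ** 2 / (1.0 + norm ** 2)) * s / (norm + 1e-8)   # squash, Eq. (4)
        # Eq. (7): average the per-position agreement over the h x w positions
        agreement = tf.reduce_sum(u_hat * v[tf.newaxis], axis=-1)  # (c_l, c_next, h, w)
        b = b + tf.reduce_sum(agreement, axis=[2, 3]) / n
    return v
```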
After several iterations of this routing algorithm, the tensor capsules of layer $l+1$ are finally obtained. The number of routing iterations can be chosen freely, but to reduce the computational cost, most layers use 1 routing iteration and a few use 3.

B. REGULARIZATION
When the number of network layers and the number of neurons increase, a network can easily overfit. In order to reduce overfitting in capsule networks, the reconstruction loss [1] was proposed. This method effectively alleviates overfitting in shallow capsule networks, but for a deep capsule network, the reconstruction loss alone may not achieve the best regularization. This paper therefore adds other regularization methods to the network.

1) MULTI-SCALE DECODER
In [1], the reconstruction of the input images is achieved through fully connected layers, but this can only build a relatively shallow reconstruction network. In order to improve the regularization effect and reconstruction ability of the capsule network, the class-independent decoder network was proposed in [5]. In this decoder, transposed convolution layers replace the fully connected layers, and its input is only one digital capsule. The class-independent decoder can provide effective regularization, but its image reconstruction performance is limited. To solve this problem, the decoder is improved by introducing the multi-scale capsules from the middle layers of the deep capsule network into the decoder. The multi-scale decoder is shown in Fig. 7. The input to this network is a class capsule (digital capsule) that comes from the output of DeepTensorCaps. The dense layer is a fully connected network layer and Conv2DT is a transposed convolution layer. In a layer with convolution, k is the kernel size, c is the number of kernels, and s is the strides; the concatenation operation is also marked in Fig. 7. In Fig. 7, a capsule that represents an image category is decoded by a fully connected layer and then reshaped into feature maps. These feature maps are used to produce a reconstructed image through the transposed convolution layers. The output of each residual capsule block is a set of tensor capsules with shape $(c_l, h_l, w_l, a_l)$, which is converted to shape $(h_l, w_l, c_l \times a_l)$ by a reshape operation in order to match the tensor capsules with the feature maps. After the reshaped capsule tensors are concatenated with the feature maps of the corresponding scale, they are input into the next transposed convolutional layer. Since the pixel intensities of an image can generally be expressed in the interval from 0 to 1, the sigmoid activation function is adopted in the last layer of the multi-scale decoder, while the ReLU activation function is adopted in the other layers. The final output of the multi-scale decoder is the reconstructed image with shape (64,64,3).
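The decoder topology described above can be sketched with standard Keras layers. The filter counts, intermediate resolutions, and the two fused scales below are illustrative assumptions (the actual configuration is given in Fig. 7); only the overall pattern, a dense layer, a reshape, and transposed convolutions with middle-layer capsule tensors concatenated at matching scales before a sigmoid output, follows the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_multiscale_decoder(capsule_dim=32,
                             mid_shape_8=(8, 8, 256),      # reshaped capsules from a deeper block (assumed shape)
                             mid_shape_16=(16, 16, 256)):  # reshaped capsules from an earlier block (assumed shape)
    """Structural sketch of the multi-scale decoder."""
    class_capsule = layers.Input(shape=(capsule_dim,))
    mid_8 = layers.Input(shape=mid_shape_8)
    mid_16 = layers.Input(shape=mid_shape_16)

    x = layers.Dense(8 * 8 * 64, activation="relu")(class_capsule)
    x = layers.Reshape((8, 8, 64))(x)
    x = layers.Concatenate()([x, mid_8])                                                      # fuse at the 8x8 scale
    x = layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(x)       # -> 16x16
    x = layers.Concatenate()([x, mid_16])                                                     # fuse at the 16x16 scale
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)        # -> 32x32
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)        # -> 64x64
    recon = layers.Conv2DTranspose(3, 3, strides=1, padding="same", activation="sigmoid")(x)  # (64,64,3)
    return tf.keras.Model([class_capsule, mid_8, mid_16], recon)
```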

2) DROPOUT
Since Hinton et al. [17] proposed the dropout method, it has been very successful in deep networks, especially in computer vision.
Capsule networks also encounter overfitting problems. In the classic capsule network model [1] and deep capsule networks [5], image reconstruction is used to alleviate overfitting. We believe that reconstruction alone is not enough, so dropout should also be applied to the capsule network. The dropout in [17] is designed for scalar neuron networks and is not applicable to vector or tensor capsules. This paper modifies the original dropout and proposes three methods.

a: DROPOUT OF VECTOR CAPSULES
The fully connected capsule layer is similar to the dense layer of scalar neural networks, but the capsule neurons are in vector form. Therefore, the shape of the output tensor of the fully connected capsule layer $l$ is $(c_l, a_l)$, where $c_l$ is the number of capsules and $a_l$ is the number of atoms in each capsule. Dropout computes the drop probability along the first dimension of the capsules and keeps it the same along the second dimension, so the shape of the dropout mask can be expressed as $(c_l, 1)$. This dropout method is similar to [18], but that method is only applied in a shallow capsule network without convolutional capsule layers.
As shown in Fig. 8, the bars on the left represent the vector capsules, and the black bars on the right represent the dropped capsules. This dropout randomly erases several capsules output from layer $l$; there is no special relationship between the capsules.
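A minimal TensorFlow sketch of this vector capsule dropout, assuming a batch dimension in front of the $(c_l, a_l)$ capsule tensor and the usual inverted-dropout rescaling:

```python
import tensorflow as tf

def vector_capsule_dropout(caps, rate, training=True):
    """Drop whole vector capsules (mask shape (c_l, 1)); a sketch.
    caps: (batch, c_l, a_l) output of a fully connected capsule layer
    """
    if not training or rate == 0.0:
        return caps
    keep = tf.cast(tf.random.uniform(tf.shape(caps)[:2]) >= rate, caps.dtype)
    # broadcast the (batch, c_l) mask over the atom dimension and rescale
    return caps * keep[..., tf.newaxis] / (1.0 - rate)
```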

b: DROPOUT OF TENSOR CAPSULES
Unlike the fully connected capsule layer, the output of the tensor capsule layer is a set of capsules in the form of 3D tensors. Therefore, two dropout methods are proposed, which consider tensor capsules from different perspectives: independent tensor capsules based dropout and internal elements of tensor capsules based dropout.

Independent Tensor Capsules Based Dropout:
In general, the shape of the overall output of tensor capsule layer $l$ is $(c_l, h_l, w_l, a_l)$, which means $c_l$ tensor capsules of shape $(h_l, w_l, a_l)$. The basic unit of this dropout is a tensor; therefore, the shape of the dropout mask is $(c_l, 1, 1, 1)$.
As shown in Fig. 9, the cubes on the left of the figure represent the output tensors after routing. The cubes on the right are the tensors after dropout, where the black tensors are masked with 0. Fig. 9 shows that this dropout operates on whole tensors. Some tensor capsules are zeroed, just like scalar neurons in ordinary neural networks. The dropped capsules are no longer routed to the higher layer, as if part of the information of the higher layer were missing, forcing the network to learn more robust features.
Internal Elements of Tensor Capsules Based Dropout: Consider another view of the output of the tensor capsule layer $l$ mentioned above. A tensor capsule can be viewed as a matrix of vector capsules of length $a_l$, with width $w_l$ and height $h_l$. Unlike the vector capsules in the fully connected capsule layer, the capsules within a tensor are correlated with each other. So, the shape of this dropout mask is $(c_l, h_l, w_l, 1)$.
If a tensor capsule is compared to a multi-channel image, this method is similar to randomly removing multiple pixels from an image, as shown in Fig. 10. On the left are the tensor capsules of layer $l$. The cubes on the right represent the tensors after dropout, where the black squares in the cubes represent the vector capsules multiplied by 0. The number and position of these black squares in each tensor are random. This disturbs the inner values of a tensor capsule. Since the entity information represented by the tensor capsule is perturbed, the entities routed to the upper layer are not complete. In this way, the network is forced to learn more crucial information, which increases its ability to resist overfitting.
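Both tensor capsule dropout variants differ only in the mask shape, as the following sketch shows (the batch dimension, the function and mode names, and the inverted-dropout rescaling are assumptions):

```python
import tensorflow as tf

def tensor_capsule_dropout(caps, rate, mode="elements", training=True):
    """Two tensor-capsule dropout variants (a sketch, not the authors' code).
    caps: (batch, c_l, h_l, w_l, a_l) output of a tensor capsule layer
    mode: "capsules" zeroes whole tensor capsules (mask shape (c_l, 1, 1, 1));
          "elements" zeroes vector capsules inside each tensor (mask shape (c_l, h_l, w_l, 1)).
    """
    if not training or rate == 0.0:
        return caps
    shape = tf.shape(caps)
    if mode == "capsules":
        mask_shape = tf.stack([shape[0], shape[1], 1, 1, 1])
    else:
        mask_shape = tf.stack([shape[0], shape[1], shape[2], shape[3], 1])
    keep = tf.cast(tf.random.uniform(mask_shape) >= rate, caps.dtype)
    return caps * keep / (1.0 - rate)  # the mask broadcasts over the remaining dimensions
```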
In the experiments section, it is shown experimentally that all three dropout methods are effective, but the internal elements of tensor capsules based dropout is superior to the independent tensor capsules based dropout. However, the independent tensor capsules based dropout makes the intermediate data of the computation sparser, thus reducing the memory overhead during training.

3) WEIGHT DECAY
There are several common forms of weight decay: L1, L2, and their combination. In deep convolutional networks, weight decay is a successful regularization method, but whether it is suitable for a deep capsule network still needs experimental verification. In DeepTensorCaps, L1 and L2 regularization are applied to the kernels of the tensor capsule layers and the transformation matrix of the fully connected capsule layer. The experimental section below shows that weight decay does not provide a regularization effect and instead damages the performance of DeepTensorCaps.

C. LOSS FUNCTION
The loss function of DeepTensorCaps is divided into two parts: the margin loss [1] of the tensor capsule network and the mean square error of the decoder network.
In this paper, DeepTensorCaps adopts the margin loss as the objective function to be optimized. The outputs of the network are category capsules in vector form, and the length of each category capsule represents the existence probability of a certain class. This loss function increases the probability of the true class and suppresses the probabilities of the other classes, which guides the network to optimize in the right direction. The loss is shown in (8):
$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2, \tag{8}$$
where $T_k$ is 1 when the real class is $k$, and 0 otherwise; $v_k$ is the digital capsule of class $k$; $m^+$ and $m^-$ represent the lower bound for the correct class and the upper bound for the incorrect classes, respectively; and $\lambda$ is the equilibrium parameter between positive and negative examples, which we set to 0.5 in this paper. The second loss function is the mean square error of the pixel intensities between the reconstructed image and the real image, used for the decoder network. This is different from the binary cross entropy in [5], but the same as in [1]. The mathematical formula is
$$L_{rec} = \frac{1}{N_i} \sum_{x,y} \left( I_{xy} - \hat{I}_{xy} \right)^2, \tag{9}$$
where $N_i$ is the number of pixels in an image $i$, $I_{xy}$ refers to a pixel of the original image at coordinates $(x, y)$, and $\hat{I}_{xy}$ represents the corresponding pixel of the reconstructed image.
To sum up, the total loss function is
$$L = \sum_{k} L_k + \mu L_{rec}, \tag{10}$$
where $\mu$ is the parameter that balances the two error terms.
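The two loss terms and their combination can be written compactly; the sketch below assumes $m^+ = 0.9$ and $m^- = 0.1$ as in [1] and $\mu = 0.6$ as in the implementation section, and sums (8) over the classes before adding the reconstruction term.

```python
import tensorflow as tf

def margin_loss(v_lengths, t_onehot, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss, Eq. (8); v_lengths are the digital-capsule lengths ||v_k||."""
    pos = t_onehot * tf.square(tf.maximum(0.0, m_pos - v_lengths))
    neg = lam * (1.0 - t_onehot) * tf.square(tf.maximum(0.0, v_lengths - m_neg))
    return tf.reduce_sum(pos + neg, axis=-1)

def reconstruction_loss(image, recon):
    """Per-image mean squared error over pixel intensities, Eq. (9)."""
    return tf.reduce_mean(tf.square(image - recon), axis=[1, 2, 3])

def total_loss(v_lengths, t_onehot, image, recon, mu=0.6):
    """Eq. (10): margin loss plus mu-weighted reconstruction error."""
    return tf.reduce_mean(margin_loss(v_lengths, t_onehot)
                          + mu * reconstruction_loss(image, recon))
```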

D. WARM RESTART LEARNING RATE
Capsule networks are more complex than CNNs due to the introduction of capsules and routing. As a result, the training time of a capsule network is relatively long, and training is even more difficult for a deeper capsule network. An appropriate training strategy is one of the effective ways to improve the training process of capsule networks.
To address this challenge, Loshchilov and Hutter [19] proposed warm restarts, a learning rate strategy. Later, Marchisio et al. [20] applied this method to the capsule network, but their work only focuses on a shallow capsule network. We apply this approach to the deep tensor capsule network. Experiments show that the warm restart is more suitable for DeepTensorCaps than the strategy proposed in [5]. The method is formulated in (11):
$$lr = lr_{min} + \frac{1}{2}\left(lr_{max} - lr_{min}\right)\left(1 + \cos\left(\pi \frac{ts}{T}\right)\right), \tag{11}$$
where $lr_{max}$ and $lr_{min}$ are the upper and lower bounds of the learning rate, $ts$ is the training step (a batch index), and $T$ is the total number of training steps in an epoch. In this cosine annealing scheme, the learning rate cycles once per epoch.
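Equation (11) corresponds to a cosine schedule restarted at the start of every epoch; a minimal sketch (the bounds 0.001 and 0.0001 follow the implementation details given later):

```python
import math

def warm_restart_lr(step, steps_per_epoch, lr_max=0.001, lr_min=0.0001):
    """Cosine-annealing learning rate restarted every epoch, Eq. (11)."""
    ts = step % steps_per_epoch                      # batch index within the current epoch
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * ts / steps_per_epoch))
```

For example, with 500 steps per epoch the rate starts at 0.001, falls to about 0.00055 at mid-epoch, approaches 0.0001 at the end of the epoch, and then restarts.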

V. EXPERIMENTS
A. IMPLEMENTATION
For the training procedure, the Adam optimizer [21] is used, and the learning rate is annealed from 0.001 to 0.0001 within each epoch, taking the epoch as the cycle and the batch index as the step. The $\mu$ parameter of the loss function is 0.6 and $\lambda$ is 0.5. The dropout probability is 0.35. The experiments use data augmentation similar to DeepCaps, such as translation, rotation, etc. TensorFlow [22] is the framework used to implement DeepTensorCaps. In Table 1, the CapsNet [1] result on CIFAR-10 comes from an ensemble of 7 networks, and its result on FMNIST is taken from [5]; DeepTensorCaps is a single network and does not use ensembles. The configuration changes slightly from experiment to experiment, and the detailed parameters are introduced in the following sections.

B. CLASSIFICATION RESULTS
Because our focus is on complex image classification tasks, the experiments are conducted on three benchmark datasets: CIFAR-10 [3], Fashion-MNIST (FMNIST) [23], and SVHN [9]. These datasets are much more complex than MNIST [2], which is too simple to distinguish the performance of different networks. Among these datasets, CIFAR-10 is the most complex and the most important target dataset of the experiments. Therefore, the accuracy on CIFAR-10 is the main performance indicator, while FMNIST and SVHN are used as supplements to further investigate DeepTensorCaps. In order to deepen the layers of the capsule network and remain consistent with DeepCaps [5], we resize the images from (32,32) to (64,64). Besides, DeepTensorCaps adopts the internal elements of tensor capsules based dropout and reconstruction as regularization. As shown in Table 1, DeepTensorCaps is compared with other capsule networks and deep convolutional networks.
As shown in Table 1, DeepTensorCaps achieves competitive results compared with other capsule networks on CIFAR-10, FMNIST, and SVHN, while its performance is slightly below that of a few advanced convolutional networks. Compared with CapsNet [1], the performance of DeepTensorCaps on complex data is better: the improvement is 3.47% on CIFAR-10, 1.9% on FMNIST, and 1.71% on SVHN. Note that the CapsNet result on CIFAR-10 comes from an ensemble of 7 networks, whereas DeepTensorCaps is a single network. Compared with DeepCaps [5], the performance improves by 1.86%, 1.04%, and 0.25%, respectively, under similar network architectures. Therefore, within our known scope, DeepTensorCaps achieves the highest performance among capsule networks on these complex image classification tasks.
The first four networks in Table 1 (above the middle line) are scalar neural networks based on convolution, namely CNNs. DeepTensorCaps does not outperform all convolutional networks, such as ResNet [25] and DenseNet [24] (our network is slightly more accurate on FMNIST, by 0.1%). However, it should be noted that ResNet [25] and DenseNet [24] are at least dozens of layers deep, whereas DeepTensorCaps has only 18 layers (excluding layers without trainable parameters, such as dropout). DeepTensorCaps achieves competitive performance with fewer layers, which also suggests that deepening the capsule network is a promising way to further improve its performance.
HitNet [10], the network proposed by Zhao et al. [13], and Mlcn2 [12] improve network performance by modifying the routing algorithm, as mentioned in section II. However, as Table 1 shows, methods that only modify the routing are not as effective as methods that increase the depth, such as DeepCaps [5] and ours.
The architecture of DCNet++ [15] is the combination of a dense CNN and a capsule network. Moreover, in FC-SA and PS-SA [8], a CNN (with at least 6 residual blocks) is adopted to extract image features, and then a shallow capsule network is used as a classifier. These belong to the category of methods that combine CNNs and capsule networks. The experiments show that the performance of the deep capsule network is better than that of this combination approach.
To the best of our knowledge, DeepTensorCaps surpasses the results of all existing capsule network models and is only slightly below a few deeper convolutional networks on CIFAR-10, FMNIST, and SVHN. To sum up, deepening the capsule network is one of the most effective ways to improve it.

C. ABLATION EXPERIMENTS
To investigate the various parts of DeepTensorCaps, several ablation experiments are conducted. The results of capsule networks on CIFAR-10, FMNIST, and SVHN can well distinguish and evaluate the models, so they are used as the datasets for the following experiments.

1) TENSOR AND MATRIX METHOD BASED ROUTING BY AGREEMENT
In this section, the two routing methods for tensor capsules are compared on CIFAR-10, FMNIST, and SVHN. As shown in Table 2, both methods achieve good classification accuracy, but the matrix method is better. However, in terms of computation and storage cost, the tensor method consumes less, because its coupling coefficients are only scalars, while the matrix method requires coefficient matrices. So the tensor method can obtain competitive performance with fewer parameters.

2) REGULARIZATION
a: DROPOUT
The capsule dropout proposed in this paper can be divided into three variants: one vector capsule dropout and two tensor capsule dropout methods. The fully connected vector capsule dropout and the tensor capsule dropouts are compatible. However, the two tensor capsule methods should be used separately, and experiments are used to determine which one is better.
As shown in Table 3, all of these dropout methods are effective, and the combination of the vector capsule dropout and the tensor capsule element dropout works best. The independent tensor capsules dropout is slightly less effective. This method may require fine-tuning of the dropout probability, because a larger amount of data is dropped out at a time. The independent tensor capsule dropout, however, yields sparser intermediate data and therefore requires less memory. Compared with Table 1, DeepTensorCaps without dropout (shown in the first row of Table 3) also achieves competitive results.
b: WEIGHT DECAY
In this paper, in addition to the dropout methods, L1 and L2 regularization applied to the network weights are also studied. In this experiment, CIFAR-10 can clearly show these regularization effects. The training losses are compared on this dataset, as shown in Fig. 11.
According to Fig. 11, L1 and L2 regularization prevent DeepTensorCaps from converging, as shown in (a) and (b). In other words, L1 and L2 are detrimental to the training of the capsule network. This is because L1 and L2 regularization perturb the transformation matrices (or convolution kernels) between low-level and high-level capsules, which contain important spatial information. This also indicates that DeepTensorCaps differs from a CNN, whose weights carry no such spatial transformation and which is therefore not harmed by weight decay.

3) LEARNING RATE STRATEGY
The proper selection of the learning rate is very important for training a capsule network: it can not only accelerate the convergence of the network but also improve the performance slightly. When training a capsule network, the learning rate is usually reduced with the epochs. Marchisio et al. [20] used the warm restart learning rate in the classic capsule network [1] for the first time. In this section, experiments show that the warm restart is also suitable for DeepTensorCaps.
With the first method in Table 4, the learning rate is initialized to 0.001 and decayed by a factor of 0.9 every 10 epochs. In the second method, the learning rate is annealed from 0.001 to 0.0001 within each epoch, taking the epoch as the cycle and the batch index as the step. As shown in Table 4, the warm restart improves the accuracy by 0.7% on CIFAR-10. Besides, when training the network, the warm restart is more efficient and converges faster than the decayed learning rate, as shown in Figs. 12, 13, and 14.
In these figures, the capsule loss refers to the margin loss. In order to avoid the influence of overfitting on the loss curves, the test loss curves are drawn to verify the effect of the warm restart. The blue curve decreases faster, with larger curvature and a smoother surface, which shows that the warm restart makes the training process faster, better, and more stable.
In conclusion, according to table 4 and these figures, the warm restart learning rate is a more appropriate method for the deep capsule network.

4) RECONSTRUCTION
In this section, the reconstruction performance of the multi-scale decoder is investigated through an experiment on CIFAR10, which is difficult for many capsule networks to reconstruct clearly. Firstly, the advantages of the multi-scale decoder are examined intuitively by comparing the appearance of the reconstructed images. Secondly, by comparing the mean square error of the multi-scale decoder and the single-scale decoder (the class-independent decoder [5]), the performance of the two is investigated quantitatively.
As shown in Fig. 15, the images on the left are generated by the single-scale decoder, and the images on the right are generated by the multi-scale decoder. The images output by the single-scale decoder retain the outline information and color distribution of the objects, but lack details and are very fuzzy. When the multi-scale capsules are injected into the decoder, the appearance of the reconstructed images is greatly improved. This intuitively shows that the multi-scale decoder is effective and that a single capsule is insufficient to recover complex images. However, the experiments also show that although the multi-scale decoder can reconstruct clear images, the regularization effect it provides is similar to that of the single-scale decoder. Therefore, multi-scale capsules still need further study.
As shown in Fig. 16, the image loss of the multi-scale decoder is far lower than that of the single-scale decoder in the early stage of training. The mean square error of the single-scale reconstruction also declines, but much more slowly than that of the multi-scale decoder. Therefore, it can be concluded that using the multi-scale decoder to reconstruct the original image is very effective.
FIGURE 16. The mean square error of the decoders on the CIFAR10 test data.

VI. CONCLUSION
Deepening the capsule network is one of the effective ways to improve its performance, and convolution is a reasonable choice to reduce the cost of computation and storage. We take the outputs of convolutions as capsules and propose a convolution-based tensor capsule routing algorithm; the resulting capsule network, built on the residual capsule architecture of [5], is called DeepTensorCaps. Because the network is deeper, overfitting happens very easily. To meet this challenge, dropout methods suitable for the deep capsule network are proposed. To obtain clear reconstructed images, the multi-scale capsules of the middle layers are injected into a multi-scale decoder to enrich the details of the images. DeepTensorCaps with the proposed techniques is investigated on several datasets: CIFAR-10, Fashion-MNIST, and SVHN. Experiments show that these techniques can effectively improve the performance of the deep capsule network.
In this research, it is found that overfitting is still a challenge for DeepTensorCaps. Reconstruction and dropout may not be enough, and the experiments show that weight decay is not effective, so a better regularization method still needs to be found. Modifying batch normalization should be one of the solutions worth studying. Moreover, we made no major changes to the network architecture, but intuitively, increasing the number of routing iterations, the number of residual blocks, and the depth and width of the network should also increase the network capacity. These are all worth studying in the future.