Compressed Wavelet Tensor Attention Capsule Network

Texture classification plays an important role in various computer vision tasks. Owing to their powerful feature extraction capability, convolutional neural network (CNN)-based texture classification methods have attracted extensive attention. However, many challenges remain, such as the extraction of multilevel texture features and the exploration of multidirectional relationships. To address these problems, this paper proposes the compressed wavelet tensor attention capsule network (CWTACapsNet), which integrates multiscale wavelet decomposition, tensor attention blocks, and quantization techniques into the framework of the capsule neural network. Specifically, the multilevel wavelet decomposition is in charge of extracting multiscale spectral features in the frequency domain; the tensor attention blocks explore the multidimensional dependencies of convolutional feature channels; and the quantization techniques keep the computational and storage complexities within edge computing requirements. The proposed CWTACapsNet provides an efficient way to explore spatial domain features, frequency domain features, and their dependencies, which are useful for most texture classification tasks. Furthermore, CWTACapsNet benefits from quantization techniques and is suitable for edge computing applications. Experimental results on several texture datasets show that the proposed CWTACapsNet outperforms state-of-the-art texture classification methods not only in accuracy but also in robustness.


Introduction
Texture classification is crucial in pattern recognition and computer vision [1][2][3][4][5]. Since many very sophisticated classifiers already exist, the key challenge is the development of effective features to extract from a given textured image [6]. As an important research issue, many methods have been proposed to represent texture features [7,8]; about 51 different sets of texture features are summarized in [9]. These texture features are generally hand-crafted under some hypothesis about texture characteristics. Because different texture datasets contain different types of textures, the performance of hand-crafted features usually varies across datasets [4].
Recently, texture representation methods based on CNNs have achieved powerful representation capability [6,[10][11][12][13]. These CNN-based methods implement texture feature extraction in an end-to-end way that does not require a predefined representation formula. Therefore, the integration of the attention mechanism and capsule networks (CapsNets) has great potential to represent texture features and explore their relationships sufficiently. The key problem preventing the attention mechanism and CapsNets from being applied in the edge computing domain is that they both suffer from heavy computation and memory burdens. It is essential to consider quantization techniques for deploying models on edge devices [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33].

This paper proposes the compressed wavelet tensor attention capsule network (CWTACapsNet), which integrates multilevel wavelet decomposition, the tensor attention mechanism, and quantization techniques into the capsule network. The proposed CWTACapsNet involves several compressed multiscale tensor self-attention blocks that can capture multidirectional dependencies across different channels. Furthermore, CWTACapsNet utilizes the Nyström technique and proposes a quantized dynamic routing process to reduce resource requirements.

The main contributions of CWTACapsNet are threefold. First, it uses the multilevel wavelet transform to extract multiscale spectral features in the frequency domain, which further extends the texture representation capability. Second, it employs the tensor attention mechanism via matricization to explore the multidirectional dependencies of texture features at different scales. Third, it employs quantization techniques to reduce the computation and memory costs without sacrificing accuracy.

The rest of the paper is organized as follows. Section 2 presents the whole architecture and key parts of the proposed CWTACapsNet. Section 3 presents validation experiments and discusses the experimental results. The conclusion is drawn in Section 4.

Compressed Wavelet Tensor Attention Capsule Network
The proposed CWTACapsNet integrates multiscale wavelet decomposition and tensor self-attention blocks into the capsule network. The architecture of CWTACapsNet is shown in Figure 1. CWTACapsNet involves the wavelet feature extraction block, the compressed multiscale tensor self-attention block, and the quantized capsule network. The wavelet feature extraction block extracts multiscale spectral features with multilevel wavelet decomposition. The compressed tensor self-attention block captures the multidirectional relationships within each scale, and the primary capsules are generated based on the wavelet and tensor attentive information.

Multiscale Feature Extraction via Wavelet Decomposition.
Given an image x, we utilize the 2D discrete wavelet transform (DWT) [34] with four convolutional filters, i.e., the low-pass filter f_LL and the high-pass filters f_LH, f_HL, and f_HH, to decompose x into four subband images, i.e., x_LL, x_LH, x_HL, and x_HH. The convolutional stride is 2. The four (Haar) filters are defined by

f_LL = [[1, 1], [1, 1]], f_LH = [[−1, −1], [1, 1]], f_HL = [[−1, 1], [−1, 1]], f_HH = [[1, −1], [−1, 1]].   (1)

The four filters in equation (1) are orthogonal to each other and form a 4 × 4 invertible matrix.
The DWT operation is given by x_HL = f_HL * x ↓2, where * denotes the convolution operator and ↓2 denotes downsampling with stride 2. The (i, j)-th values of x_LL, x_LH, x_HL, and x_HH after the 2D Haar transform [19] are given by

x_LL(i, j) = x(2i−1, 2j−1) + x(2i−1, 2j) + x(2i, 2j−1) + x(2i, 2j),
x_LH(i, j) = −x(2i−1, 2j−1) − x(2i−1, 2j) + x(2i, 2j−1) + x(2i, 2j),
x_HL(i, j) = −x(2i−1, 2j−1) + x(2i−1, 2j) − x(2i, 2j−1) + x(2i, 2j),
x_HH(i, j) = x(2i−1, 2j−1) − x(2i−1, 2j) − x(2i, 2j−1) + x(2i, 2j).   (2)

Based on the multilevel wavelet packet transform [35], the subband image x_LL is recursively decomposed by the DWT. Because the downsampling stride is 2, the sizes of the extracted subband images are halved at each successive wavelet decomposition level. In addition, upsampling operations (with stride 2) are employed to guarantee the size consistency of the convolutional feature maps for tensor concatenation.
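For concreteness, the following is a minimal sketch of one level of this decomposition implemented as four stride-2 convolutions; the filter signs follow the standard (unnormalized) Haar convention assumed in equations (1)-(2), and the grouped-convolution layout is an illustrative choice rather than the paper's implementation.

```python
# One level of 2D Haar DWT as four stride-2 convolutions (sketch).
import torch
import torch.nn.functional as F

def haar_dwt_level(x):
    """x: (B, C, H, W) with even H, W. Returns x_LL, x_LH, x_HL, x_HH, each (B, C, H/2, W/2)."""
    f_ll = torch.tensor([[1., 1.], [1., 1.]])
    f_lh = torch.tensor([[-1., -1.], [1., 1.]])
    f_hl = torch.tensor([[-1., 1.], [-1., 1.]])
    f_hh = torch.tensor([[1., -1.], [-1., 1.]])
    filters = torch.stack([f_ll, f_lh, f_hl, f_hh]).unsqueeze(1)   # (4, 1, 2, 2)
    b, c, h, w = x.shape
    weight = filters.repeat(c, 1, 1, 1).to(x.dtype)                # same 4 filters per channel
    out = F.conv2d(x, weight, stride=2, groups=c)                  # (B, 4C, H/2, W/2)
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

# Recursively decompose x_LL, mimicking a 3-level wavelet-packet-style pyramid.
x = torch.randn(1, 3, 64, 64)
subbands, cur = [], x
for _ in range(3):
    ll, lh, hl, hh = haar_dwt_level(cur)
    subbands.append((ll, lh, hl, hh))
    cur = ll
```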

Compressed Tensor Self-Attention Block.
Inspired by [36], we design the compressed tensor self-attention block based on matricization and the Nyström technique. The matricization can capture interdependencies along all dimensions of the tensorized convolutional feature maps. To reduce the computational and storage requirements of attention computation, we use the Nyström technique to obtain an approximate solution, which relieves the resource burden of inference and speeds it up significantly. These tensorized convolutional feature maps are generated based on the wavelet-extracted features. The input 3rd-order tensor can be viewed as a combination of its three mode matricizations. Combining their outputs allows the compressed tensor self-attention block to make use of interchannel and intrachannel interdependencies. Moreover, the Nyström-based self-attention module involved in the compressed tensor self-attention block implements the self-attention computation along the corresponding mode in a more efficient way. The architectures of the compressed tensor self-attention block and the Nyström-based self-attention module are shown in Figure 2.
A mode-n fiber of the 3rd-order input tensor F_i, i ∈ {0, 1, 2, 3}, is the vector obtained by fixing all indices of F_i except the one of the nth dimension, n ∈ {1, 2, 3}; fibers can be seen as a generalization of a matrix's rows and columns. The mode-n matricization of the 3rd-order tensor F_i ∈ R^{I_1×I_2×I_3}, denoted X_i^{(n)}, arranges its mode-n fibers to be the columns of the resulting matrix.
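As a small illustration of the matricization used here, the sketch below unfolds a 3rd-order feature tensor along each mode; the unfold helper is a generic implementation of the standard definition, not code from the paper.

```python
# Mode-n matricization (unfolding) of a 3rd-order tensor.
import torch

def unfold(tensor, mode):
    """Return the mode-n matricization: shape (I_mode, product of remaining dims),
    with the mode-n fibers arranged as columns."""
    return tensor.movedim(mode, 0).reshape(tensor.shape[mode], -1)

F_tensor = torch.randn(8, 16, 16)   # e.g. a C x H x W feature map
X1 = unfold(F_tensor, 0)            # (8, 256)
X2 = unfold(F_tensor, 1)            # (16, 128)
X3 = unfold(F_tensor, 2)            # (16, 128)
```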
To simplify notations, we ignore the subscript. Let X ∈ R^{h×ℓ} be the input matrix of the self-attention module; it is projected using three matrices W_V ∈ R^{ℓ×υ}, W_K ∈ R^{ℓ×m}, and W_Q ∈ R^{ℓ×m} to extract the feature representations Q ∈ R^{h×m}, K ∈ R^{h×m}, and V ∈ R^{h×υ} as follows:

Q = X W_Q,  K = X W_K,   (3)
V = X W_V.   (4)

The output of the self-attention module is computed by

O = softmax(α Q K^T) V,   (5)

where α > 0 denotes the learnable coefficient and softmax(·) denotes a row-wise softmax normalization function. Then, the tensor Y is generated by reshaping O into tensor form. As shown in equation (5), the self-attention mechanism requires calculating h^2 similarity scores between each pair of vectors, resulting in a complexity of O(h^2) for both memory and time. Due to this quadratic dependence on the input length, the application of self-attention is limited to small matrices (e.g., h < 1000) on edge devices, so it is necessary to reduce the resource burden. Inspired by [37], we utilize the Nyström technique to build a resource-efficient self-attention module. We rewrite the softmax score matrix in equation (5) in block form as

S = softmax(α Q K^T) = [A_S  B_S; F_S  C_S],   (6)

where A_S ∈ R^{m×m} denotes the selected matrix generated by sampling m columns and m rows from the matrix S via some adaptive sampling strategy [38].
According to the Nyström method [37,38], S can be approximated by

S ≈ [A_S; F_S] A_S^† [A_S  B_S],   (7)

where A_S^† denotes the Moore-Penrose pseudoinverse of A_S. Let A_S = U_A Σ_A V_A^T be its singular value decomposition, where U_A and V_A are unitary matrices and Σ_A ∈ R^{m×m} denotes the diagonal matrix whose diagonal elements are the corresponding singular values of A_S. Then, the pseudoinverse A_S^† can be computed by

A_S^† = V_A Σ_A^† U_A^T.   (8)

Substituting equation (8) into (7), we obtain

S ≈ [A_S; F_S] V_A Σ_A^† U_A^T [A_S  B_S].   (9)

From equations (6)-(9), we can find that S requires all entries of QK^T due to the softmax function, even though the approximation only needs to access a subset of the columns and rows of S, e.g., [A_S; F_S] ∈ R^{h×m} corresponds to the first m columns of S (see equation (6)) and [A_S  B_S] ∈ R^{m×h} corresponds to the first m rows of S. An efficient way is to approximate these blocks using subsampled matrices instead of the whole h × h matrix QK^T. Let K̃^T ∈ R^{m×m} denote the matrix that consists of m columns of K^T ∈ R^{m×h}, and let Q̃ ∈ R^{m×m} denote the matrix that consists of m rows of Q ∈ R^{h×m}. Then, we compute the approximations as follows:

Ã_S = softmax(α Q̃ K̃^T),  F̃_S = softmax(α Q K̃^T),  B̃_S = softmax(α Q̃ K^T).   (10)

Figure 2: (a) Compressed tensor self-attention block; (b) Nyström-based self-attention module.

Based on equations (7)-(9), we can obtain the efficiently approximated S as follows:

Ŝ = F̃_S Ã_S^† B̃_S = softmax(α Q K̃^T) (softmax(α Q̃ K̃^T))^† softmax(α Q̃ K^T),   (11)

where Q̃ and K̃^T are selected before the softmax operation, which means that Ŝ can be computed using only small submatrices instead of the whole h × h matrix QK^T. The output of each single compressed tensor self-attention module is then computed by

O^{(n)} = Ŝ^{(n)} V^{(n)},  n ∈ {1, 2, 3}.   (12)

Then, the output of the compressed tensor self-attention block (Figure 2(a)) can be generated by

Y = Ψ_1(O^{(1)}) ⊕ Ψ_2(O^{(2)}) ⊕ Ψ_3(O^{(3)}),   (13)

where Ψ_n denotes a reshape function that rearranges the matrix O^{(n)} as a tensor of dimension C × H × W, n ∈ {1, 2, 3}, and ⊕ denotes the matrix concatenation operator.
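The following hedged sketch contrasts the dense attention of equation (5) with the Nyström-style approximation of equation (11); for simplicity, the first m rows of Q and K stand in for the adaptive sampling strategy of [38], so the numbers are only indicative.

```python
# Dense vs. Nyström-approximated self-attention (sketch).
import torch

def dense_attention(Q, K, V, alpha=0.1):
    return torch.softmax(alpha * Q @ K.T, dim=-1) @ V            # builds the h x h score matrix

def nystrom_attention(Q, K, V, alpha=0.1, m=32):
    Q_l, K_l = Q[:m], K[:m]                                      # landmark (sampled) rows
    F_S = torch.softmax(alpha * Q @ K_l.T, dim=-1)               # (h, m), ~ first m columns of S
    A_S = torch.softmax(alpha * Q_l @ K_l.T, dim=-1)             # (m, m)
    B_S = torch.softmax(alpha * Q_l @ K.T, dim=-1)               # (m, h), ~ first m rows of S
    return F_S @ torch.linalg.pinv(A_S) @ (B_S @ V)              # never materializes an h x h matrix

h, d, v = 2048, 32, 64
Q, K, V = torch.randn(h, d), torch.randn(h, d), torch.randn(h, v)
O_dense, O_approx = dense_attention(Q, K, V), nystrom_attention(Q, K, V)
print((O_dense - O_approx).abs().mean())                         # rough approximation error
```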

Quantized Capsule Network.
Aiming to overcome the deficiencies and shortcomings of convolutional neural networks, a novel neural network architecture called the capsule network was first introduced by Hinton and his colleagues [14]. A capsule is a set of neurons represented as a vector. The individual values capture the features of an object, while the length of the vector indicates the capsule's activation probability. The first layer of capsules comes from the output of a convolution. This output is rearranged into vectors of a previously specified dimension (and is shrunk using the squashing function), which are used to compute the outputs of the next layer of capsules. The algorithm by which the next-layer capsules are computed from the current-layer capsule outputs is called dynamic routing. It takes predictions from the current-layer capsules about the outputs of the next-layer capsules and computes the actual outputs according to an agreement metric between predictions.
It should be noted that the superiority of the capsule network comes at the cost of a heavy computation and storage burden. To address this problem and make the network easy to deploy on edge computing devices, we integrate a shared structure and quantization techniques into the capsule network and propose the quantized capsule network.
As shown in Figure 1, the input of the quantized capsule network is generated by concatenating the output tensors of multiple compressed tensor self-attention blocks through upsampling operations. From these concatenated tensors, the quantized convolutional layer extracts basic features. The primary capsule layer then explores more detailed patterns from the extracted basic features:

u_i = Reshape(QConv(·)),  i = 1, ..., M_P,   (14)

where u_i ∈ R^p denotes the output of capsule i in the primary capsule layer, p denotes the dimension of the primary layer capsule vectors (the capsule vector length), M_P denotes the number of capsules in the primary capsule layer, Reshape(·) denotes the function that reshapes the output tensors into capsule vectors (the detailed description is provided in [14]), and QConv(·) denotes the quantized convolution operator (the detailed derivation of quantized convolution is provided in [39]). Generally, the prediction vector generated by the primary layer capsule i, û_{j|i} ∈ R^c, indicates how much the primary layer capsule i contributes to the class layer capsule j. û_{j|i} is given by

û_{j|i} = W_{ji} u_i,   (15)

where W_{ji} ∈ R^{c×p} denotes the weight matrix between the primary layer capsule i and the class layer capsule j, c denotes the dimension of the class layer capsule vectors, p denotes the dimension of the primary layer capsule vectors, and M_C and M_P denote the numbers of capsules in the class capsule layer and the primary capsule layer, respectively. From equation (15), we can find that there are M_C × M_P weight matrices W_{ji}, which leads to a heavy computation and memory burden. To reduce the burden, we adopt two strategies. First, we utilize a shared structure of weight matrices (shown in Figure 3):

û_{j|i} = W_j u_i,   (16)

where W_j ∈ R^{c×p} denotes the transformation weight matrix corresponding to class layer capsule j (i.e., each class layer capsule shares its weight matrix across all primary layer capsules). Equation (16) indicates that the number of weight matrices is reduced from M_C × M_P to M_C. Second, we propose the quantized dynamic routing process, which implements dynamic routing in a more efficient way (shown in Figure 4). For simplicity, we assume that p is divisible by P (P < p) with no remainder and let p̄ = p/P. Let W_j = [ω_j^{(1)}, ω_j^{(2)}, ..., ω_j^{(P)}] ∈ R^{c×p}, where ω_j^{(τ)} ∈ R^{c×p̄} denotes the τ-th submatrix of W_j, τ = 1, ..., P, j = 1, ..., M_C. We train subcodebooks for the subspaces of the weight matrices as follows:

min_{D_j^{(τ)}, B_j^{(τ)}} ‖ω_j^{(τ)} − B_j^{(τ)} D_j^{(τ)}‖_F^2,   (17)

where D_j^{(τ)} ∈ R^{K×p̄} denotes the subcodebook consisting of K subcodewords for ω_j^{(τ)}, j = 1, ..., M_C, and B_j^{(τ)} denotes the indexing matrix; each row of B_j^{(τ)} has only one nonzero entry, which specifies the quantization relationship between a subvector and a subcodeword. An alternating optimization algorithm, such as k-means clustering, is employed for learning D_j^{(τ)} and B_j^{(τ)}. Similarly, let u_i^{(τ)} ∈ R^{p̄} denote the τ-th subvector of u_i, τ = 1, ..., P, i = 1, ..., M_P. We train subcodebooks for the subspaces of the primary layer capsule vectors as follows:

min_{F^{(τ)}, ν_i^{(τ)}} ‖u_i^{(τ)} − (F^{(τ)})^T ν_i^{(τ)}‖_2^2,   (18)

where F^{(τ)} ∈ R^{K×p̄} denotes the subcodebook consisting of K subcodewords for u_i^{(τ)}, i = 1, ..., M_P, and ν_i^{(τ)} denotes the index vector that has only one nonzero entry, which specifies the quantization relationship between a subvector and a subcodeword. The alternating optimization algorithm, such as k-means clustering, can likewise be employed for learning ν_i^{(τ)} and F^{(τ)}.
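As an illustration of how such a subcodebook might be learned, the sketch below runs a plain Lloyd's-style k-means over a set of subvectors; the function name, sizes, and initialization are illustrative assumptions rather than the paper's exact procedure.

```python
# Learning one subcodebook with a simple k-means loop (sketch).
import torch

def learn_subcodebook(vectors, K=16, iters=20):
    """vectors: (N, d) subvectors. Returns codebook (K, d) and one-hot assignments (N, K)."""
    codebook = vectors[torch.randperm(vectors.shape[0])[:K]].clone()   # init from samples
    for _ in range(iters):
        assign = torch.cdist(vectors, codebook).argmin(dim=1)          # nearest codeword
        for k in range(K):
            mask = assign == k
            if mask.any():
                codebook[k] = vectors[mask].mean(dim=0)                # update codeword
    one_hot = torch.zeros(vectors.shape[0], K)
    one_hot[torch.arange(vectors.shape[0]), assign] = 1.0
    return codebook, one_hot

# Example: quantize the tau-th subvectors of all primary-capsule outputs u_i (sizes illustrative).
U_sub = torch.randn(1152, 2)
F_codebook, nu = learn_subcodebook(U_sub)
```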
Combining equations (17) and (18), we can rewrite equation (16) as

û_{j|i} = W_j u_i ≈ Σ_{τ=1}^{P} B_j^{(τ)} D_j^{(τ)} (F^{(τ)})^T ν_i^{(τ)}.   (19)

It is obvious that, after the parameter quantization, there are many replicated elements in the products D_j^{(τ)} (F^{(τ)})^T. Therefore, it is unwise to compute the products one by one. Instead, we first compute the product D_j^{(τ)} (F^{(τ)})^T, i.e., construct the lookup table, as follows:

L_j^{(τ)} = D_j^{(τ)} (F^{(τ)})^T,   (20)

where L_j^{(τ)} ∈ R^{K×K}, j = 1, ..., M_C, τ = 1, ..., P. Then, at inference time, we can look up the precomputed table instead of repeatedly computing the products, which raises computational speed significantly. Hence, we can rewrite equation (19) as follows:

û_{j|i} ≈ Σ_{τ=1}^{P} B_j^{(τ)} L_j^{(τ)} ν_i^{(τ)},   (21)

where the product B_j^{(τ)} L_j^{(τ)} ν_i^{(τ)} can be computed by looking up the precomputed table L_j^{(τ)} instead of performing matrix multiplication.
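The lookup-table idea of equations (19)-(21) can be sketched as follows; the codebooks and index assignments are random placeholders, and only the precompute-then-index pattern is the point of the example.

```python
# Precomputed lookup tables replacing repeated matrix products (sketch).
import torch

K, P, c, p_bar = 16, 4, 16, 2
D = [torch.randn(K, p_bar) for _ in range(P)]      # weight subcodebooks D_j^(tau) for one class capsule
F_cb = [torch.randn(K, p_bar) for _ in range(P)]   # capsule subcodebooks F^(tau)

# Precompute lookup tables L_j^(tau) = D_j^(tau) (F^(tau))^T, each K x K.
L = [D[t] @ F_cb[t].T for t in range(P)]

# Codeword indices: one per row of omega_j^(tau), and one for the subvector u_i^(tau).
row_codes = [torch.randint(0, K, (c,)) for _ in range(P)]
u_codes = [torch.randint(0, K, ()).item() for _ in range(P)]

# The prediction vector is assembled by indexing the tables and summing over subspaces.
u_hat = sum(L[t][row_codes[t], u_codes[t]] for t in range(P))   # (c,)
```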
According to the mechanism of the capsule network, the input vector z_j of class layer capsule j can be computed by

z_j = Σ_{i=1}^{M_P} ϑ_{ij} û_{j|i},   (22)

where ϑ_{ij} denotes the coupling coefficient determined by the iterative dynamic routing process (see Table 1). The routing part is essentially a weighted sum of û_{j|i} with the coupling coefficients. The output vector of class layer capsule j is calculated by applying a nonlinear squashing function, which ensures that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below one:

ρ_j = (‖z_j‖^2 / (1 + ‖z_j‖^2)) (z_j / ‖z_j‖),   (23)

where ρ_j denotes the output vector of class layer capsule j.
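A minimal sketch of equations (22)-(23) follows, assuming fixed coupling coefficients for illustration (their iterative update is given in Table 1); the tensor sizes are placeholders.

```python
# Coupling-weighted sum of predictions followed by squashing (sketch).
import torch

def squash(z, eps=1e-8):
    norm_sq = (z * z).sum(dim=-1, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * z / torch.sqrt(norm_sq + eps)

u_hat = torch.randn(1152, 10, 16)                    # predictions u_{j|i}: (M_P, M_C, c)
theta = torch.softmax(torch.zeros(1152, 10), dim=1)  # coupling coefficients
z = (theta.unsqueeze(-1) * u_hat).sum(dim=0)         # equation (22): (M_C, c)
rho = squash(z)                                      # equation (23): lengths in (0, 1)
```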
Obviously, the capsule's activation function actually suppresses and redistributes vector lengths. Its output can be used as the probability of the entity represented by the current class capsule. The quantized dynamic routing algorithm is shown in Table 1.
We construct the whole loss function of the proposed CWTACapsNet by integrating the margin loss [14], the reconstruction loss [14], and the quantization loss as follows:

L = L_max + λ_re L_re + λ_qu L_qu,   (24)

where λ_re and λ_qu denote positive coefficients and L_max, L_re, and L_qu denote the margin loss function, the reconstruction loss function, and the quantization loss function, respectively. They are defined by equations (25)-(27) as follows:

L_max = Σ_c [T_c max(0, ε^+ − ‖ρ_c‖)^2 + η_mar (1 − T_c) max(0, ‖ρ_c‖ − ε^−)^2],   (25)

L_re = ‖x − x̂‖_2^2,   (26)

L_qu = Σ_{τ=1}^{P} [Σ_{j=1}^{M_C} ‖ω_j^{(τ)} − B_j^{(τ)} D_j^{(τ)}‖_F^2 + η_qu Σ_{i=1}^{M_P} ‖u_i^{(τ)} − (F^{(τ)})^T ν_i^{(τ)}‖_2^2],   (27)

where T_c = 1 if and only if the classification for class c is correct (and T_c = 0 otherwise), ε^+ = 0.9 and ε^− = 0.1, η_mar and η_qu denote positive coefficients, usually selected as 0.5, and x̂ denotes the reconstructed image.
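For reference, the margin loss of equation (25) can be sketched as follows, using the ε and η_mar values quoted above; the tensor shapes are illustrative assumptions.

```python
# Margin loss over class-capsule lengths (sketch).
import torch
import torch.nn.functional as F

def margin_loss(rho, labels, eps_pos=0.9, eps_neg=0.1, eta_mar=0.5):
    """rho: (B, M_C, c) class-capsule outputs; labels: (B,) integer class indices."""
    lengths = rho.norm(dim=-1)                                  # (B, M_C) activation probabilities
    T = F.one_hot(labels, rho.shape[1]).float()                 # T_c indicator
    pos = T * torch.clamp(eps_pos - lengths, min=0.0) ** 2
    neg = eta_mar * (1.0 - T) * torch.clamp(lengths - eps_neg, min=0.0) ** 2
    return (pos + neg).sum(dim=1).mean()

loss = margin_loss(torch.rand(8, 10, 16), torch.randint(0, 10, (8,)))
```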
Experiments and Discussion
Models in the experiments are trained under Ubuntu 16.04 with an i7-8700 CPU, 64 GB RAM, and a GeForce GTX Titan-XP GPU, and the proposed CWTACapsNet is deployed on a Jetson TX2. To provide a direct comparison with published results, the parameters of five state-of-the-art methods are set according to previous studies [1,8,11,13,44]. We use an exponential decay learning rate policy, with an initial learning rate of 0.001, 2000 decay steps, and a 0.96 decay rate. We employ the Adam optimizer to adjust the weights of CWTACapsNet in the training process. The batch size is set to 32. We implement data augmentation by rotating images with a random angle between 0° and 90°. We use 3 routing iterations to update capsule parameters in CWTACapsNet.
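A sketch of this training configuration under the stated hyperparameters is given below; the model is a placeholder, and the per-step exponential decay is an approximation of the 2000-step/0.96 schedule.

```python
# Adam + exponential learning rate decay, as described above (sketch).
import torch

model = torch.nn.Linear(10, 10)                      # placeholder standing in for CWTACapsNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# gamma per step so that the lr is multiplied by 0.96 every 2000 steps
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96 ** (1 / 2000))

for step in range(10):                               # training loop skeleton (batch size 32 in the paper)
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()  # stand-in loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```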
The number of wavelet levels in CWTACapsNet is selected according to the tradeoff between validation accuracy and the number of network parameters; we thus choose a 3-level wavelet decomposition. The learnable coefficient α is selected as 0.1, and λ_re and λ_qu are selected as 0.001 and 0.0013, respectively. Table 2 reports the classification accuracies and standard deviations of the six methods. Table 2 indicates that CWTACapsNet achieves the best performance and is more stable than the other methods. The tensor attention block enables CWTACapsNet to capture multidirectional dependencies, which the other methods cannot. FV-CNN performs better than CapsNets; both deal with the pooling operation, but FV-CNN has a specific design for capturing texture information. CNN-based texture classification methods tend to be limited by the lack of diversity of convolution filters. The multilevel wavelet decomposition extends both spatial and frequency features, which raises the diversity of the convolution filters and improves performance.
We add 10% white noise to the texture datasets to evaluate robustness. Table 3 shows the performance on the noisy datasets. Figure 5 shows the accuracy for pure and noisy data, and Figure 6 shows the accuracy standard deviations (std) for pure and noisy data.
From Table 3 and Figures 5 and 6, we can find that CWTACapsNet achieves the best accuracy and robustness. Although CapsNets and CWTACapsNet are both based on capsule layers, CWTACapsNet significantly outperforms CapsNets. The memory requirement of CapsNets in the experiments is about 272 M, while our proposed CWTACapsNet only requires 23.2 M with about a 10× speed-up. CWTACapsNet can be deployed and run on the Jetson TX2, while CapsNets requires more resources than the Jetson TX2 can support. The superiority of CWTACapsNet relies on three factors: the multiscale wavelet feature extraction, the tensor attention mechanism, and the quantized dynamic routing.

Table 1: Quantized dynamic routing algorithm.
(1) Input: prediction vectors û_{j|i} computed by equation (21) and the number of routing iterations t
(2) For all capsule i in the primary layer and capsule j in the class layer: b_ij ← 0
(3) For t iterations do
(4) For all capsule i in the primary layer and capsule j in the class layer: ϑ_ij ← softmax(b_ij)
(5) For all capsule j in the class layer: compute the input vector z_j using equation (22)
(6) For all capsule j in the class layer: compute the output vector ρ_j using equation (23)
(7) For all capsule i in the primary layer and capsule j in the class layer: b_ij ← b_ij + ⟨û_{j|i}, ρ_j⟩
(8) Return ρ_j
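A runnable sketch of the routing procedure in Table 1 is given below; the table-lookup computation of û_{j|i} (equation (21)) is abstracted into a precomputed prediction tensor, which is an assumption made for illustration.

```python
# Dynamic routing loop following Table 1 (sketch).
import torch

def dynamic_routing(u_hat, iters=3):
    """u_hat: (M_P, M_C, c) prediction vectors. Returns class-capsule outputs rho: (M_C, c)."""
    def squash(z):
        n2 = (z * z).sum(dim=-1, keepdim=True)
        return (n2 / (1 + n2)) * z / torch.sqrt(n2 + 1e-8)
    b = torch.zeros(u_hat.shape[0], u_hat.shape[1])          # routing logits b_ij, step (2)
    for _ in range(iters):                                   # step (3)
        theta = torch.softmax(b, dim=1)                      # step (4): coupling coefficients
        z = (theta.unsqueeze(-1) * u_hat).sum(dim=0)         # step (5): equation (22)
        rho = squash(z)                                      # step (6): equation (23)
        b = b + (u_hat * rho.unsqueeze(0)).sum(dim=-1)       # step (7): agreement update
    return rho                                               # step (8)

rho = dynamic_routing(torch.randn(1152, 10, 16))
```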

Conclusion
In order to make the capsule network efficiently explore spatial and spectral features and capture multidirectional channel dependencies, this paper proposes a novel capsule network named the compressed wavelet tensor attention capsule network (CWTACapsNet). In CWTACapsNet, the multiscale wavelet transform is designed to extract multiscale spectral features in the frequency domain; the tensor attention blocks utilize matricization to capture multidirectional dependencies across convolutional channels at each scale; furthermore, we propose a quantized dynamic routing process for speed-up and storage reduction. Experimental studies have shown that the proposed CWTACapsNet provides the best performance in both classification accuracy and antinoise robustness; moreover, CWTACapsNet significantly reduces the computational and storage complexities. In the future, we will incorporate parallel computation methods into CWTACapsNet to further improve efficiency.
Data Availability
The authors confirm that the data used to support the findings of this study are publicly available. The datasets can be obtained from the links provided by [40][41][42]. The CUReT dataset is available at https://www.cs.columbia.edu/CAVE/software/curet/html/download.h