Analyzing the performances of squash functions in capsnets on complex images

Abstract Classical Convolutional Neural Networks (CNNs) have been the benchmark for most object classification and face recognition tasks despite major shortcomings, including their inability to capture spatial co-location and their preference for invariance over equivariance. To overcome these shortcomings, the hierarchical, routing-based layered architecture of CapsNets was developed. Capsules replace the average or maximum pooling used in CNNs with dynamic routing between lower-level and higher-level neural units. CapsNets also introduce regularization mechanisms for dealing with equivariance properties in reconstruction, improving hierarchical data representation. Because capsules can overcome these limitations, they can serve as potential benchmarks for detecting, segmenting, and reconstructing objects. On the fundamental MNIST handwritten digit dataset, CapsNets demonstrated state-of-the-art results. Using two fundamental datasets, MNIST and CIFAR-10, we investigated a number of squash functions in order to further enhance this distinction. Compared to Sabour's and Edgar's models, the optimized squash function performs marginally better and yields fewer test errors. The optimized squash function also converges faster than both of the other squash functions; it is efficient and scalable and can be trained on any neural network.


Introduction
For years, convolutional neural networks (CNNs) have been the tools of choice for solving computer vision problems. Due to their feature extraction approach, CNNs are the most widely used algorithm for learning meaningful and hierarchical information (Ayidzoe, Yongbin, Kwabena, et al., 2021). When CNN features are applied to images and videos, spatial localization is very useful; however, these networks have their own limitations. Convolutional layers require kernels to learn how to identify all relevant features in the input data. Their performance, however, is highly dependent on the availability of a large volume of data in different variations (LaLonde & Bagci, 2018). It is thus important to augment the training dataset with transformations such as rotations and occlusions. Nevertheless, the burden of learning both the visual features and their transformed versions can be great for a traditional CNN. Another issue with CNNs is that they require pooling to maintain translational invariance and to regulate the number of parameters. However, they do not explicitly represent the relationship between the positions of features (Xiong et al., 2019). Because two identical objects in different orientations are not represented identically, unlike in human perception, a vast amount of training data, augmentation operations, and network bandwidth are required. Pooling also leads to information loss during the forward pass, making it more difficult to locate smaller objects in localization and segmentation tasks.
An innovative class of neural networks was proposed in (Sara et al., 2017) using the concept of a "capsule". According to (Sara et al., 2017), a capsule is a group of neurons that represents the existence of a feature as well as the parameters related to its instantiation. These capsule vectors provide a richer representation of information in the network than the scalar activations of kernels in a traditional CNN. Capsules should therefore be able to encode both the existence of a visual feature and its transformation within the application in which it is used. Despite the potential of capsule networks, there are still many uncertainties surrounding how they function (Nair et al., 2021). All classes of neural networks can be compared to "black boxes", not only those with capsules. As neural networks have always been difficult to interpret, it is difficult to evaluate the benefits of capsules without comparing them with CNNs. Typical architectures of a CapsNet with the encoder and decoder parts are shown in Figures 1 and 2, respectively.
Using a deep visualization technique, this study generates images that visually represent the information contained in capsules in an effort to clarify them. By comparing these images to other CapsNets research works done in a similar fashion, visual evidence can be provided to support the hypothesized benefits of capsule networks. The visual impact of modifying a capsule's value is also examined for a more accurate assessment. In addition, the reconstruction network and dynamic routing are examined as part of the study of the original capsule network architecture proposed in (Sara et al., 2017).

In summary, this paper makes the following contributions:
• Two benchmark datasets are examined to see how the squash function affects CapsNets.
• An extensive analysis of the performance of the squash functions is performed using visualizations.
In the remaining sections, we introduce relevant works in Section 2. The methodology and squash functions are described in Section 3, followed by the experimental setup and results in Section 4, and finally the conclusion is presented in Section 5.

Related work
Hinton's 2017 paper (Sara et al., 2017) presents capsule vectors within a convolutional architecture, based on the concept of a capsule neural network. It proposes an alternative to traditional down-sampling methods such as max pooling that selectively links units within a capsule together. In 2018, Hinton published a follow-up article (Hinton et al., 2018) that extended capsules to matrix form and further developed the routing scheme; however, our study primarily focuses on the architecture discussed in the baseline study (Sara et al., 2017), and we perform experiments using the dynamic routing algorithm (see Algorithm 1) in parallel with those in (Sara et al., 2017).
Several other modifications to the original architecture have also been proposed, such as in (Edgar et al., 2017; Yaw et al., 2022a), where the number of layers and the capsule size are increased and changes to the activation function are made. Although the dynamic routing procedure proposed by (Sara et al., 2017) is effective, there is no standard formalization of the heuristic.
According to (Wang & Liu, 2018), the routing strategy proposed by (Sara et al., 2017) can be partially expressed as an optimization problem that minimizes a clustering-like loss and a KL regularization term between the current coupling distribution and its last state. In addition, the authors introduce another simple routing method that exhibits a number of interesting features. As described in (Rawlinson et al., 2018), capsules without masking may be more generalizable than those with masking. According to (Martins et al., 2019), multi-lane capsule networks (MLCNs) are a resource-efficient way to organize capsule networks (CapsNets) for parallel processing and high accuracy at low cost. MLCNs consist of several distinct parallel lanes, each contributing to a dimension of the result. On both the Fashion-MNIST and CIFAR-10 datasets, their results indicate similar accuracy with reduced parameter costs. In addition, when using a proposed novel configuration for the lanes, the MLCN outperforms the original CapsNet. Furthermore, MLCN trains and performs inference more than twice as fast as CapsNet on the same accelerator. By combining pairwise inputs with the capsule architecture, the authors in (Neill, 2018) construct a Siamese capsule network. Siamese Capsule Networks outperform strong baselines on two pairwise learning datasets, exhibiting the greatest performance in the few-shot learning setting where pairwise images contain unseen subjects.
A wide range of applications have also been found for capsule networks. For instance, CapsNets are well suited for predicting traffic speed because of the spatiotemporal character of traffic data expressed in images (Kim et al., n.d.). The work of (Steur & Schwenker, 2021) contributes to the development of CapsNets for text classification on six selected datasets. Based on empirical results, the authors demonstrate the robustness of CapsNets with routing-by-agreement for a wide variety of net architectures, datasets, and text classification problems. There have been good results with CapsNets in other areas, such as hyperspectral image classification (Using & Training, n.d.; Ding et al., 2021), where labelled data is harder to obtain. Agricultural (Kwabena et al., 2020) and health (Afriyie, 2021; Ayidzoe, Yongbin, Kwabena, et al., 2021; Yaw et al., 2022c) applications of CapsNets have been widely explored.
Though these capsule networks demonstrate great potential, the reasons they perform so well are less clear. As (Sara et al., 2017) indicates, capsules have several potential advantages, such as encoding feature transformations and enhancing information aggregation through dynamic routing. While impressive, the results of the experiments cannot prove that these characteristics are present in the capsules. A number of experiments carried out in (Mukhometzianov & Carrillo, n.d.; Lian et al., 2023; Marchisio et al., 2020; Zhang et al., 2017) suggest that certain object features may be controlled via capsule manipulation, but this is not fully explored. This methodology is again limited in scope, with the authors in (Sun et al., 2021) taking a more concerted approach to explainability by varying output capsules in multiple dimensions. To analyze the advantages of a capsule network over a traditional CNN, activation functions must be applied to the capsule network. Because capsule networks have not been thoroughly explored at the feature level, understanding capsules is crucial before adopting them in the field. In the next section, we present activation functions employed by various researchers, including a baseline activation function developed by (Sara et al., 2017).
A comprehensive review of CapsNet-based methods was presented by (Goceri, 2020), followed by the design of a new CapsNet topology, the application of the proposed topology to three types of tumours, and a comparative evaluation against the results obtained by other methods. According to the numerical results presented by the author, the proposed approach efficiently achieves 92.65% accuracy on tumor classification. According to the comparative evaluations, the proposed network is more accurate at classifying images than other approaches. Using the capsule network, (Tiwari, 2021) proposes a deep learning-based approach for detecting melanoma. Based on a comparison of a multi-layer perceptron and a convolutional network with a capsule network model, the author reported a classification accuracy of 98.9%. The study found that a CapsNet model with fewer learning parameters was more generalizable and performed better in detecting skin cancer. According to (Tiwari & Jain, 2021), an image-based decision support system using X-rays can be used to detect the presence of COVID-19. The visual geometry group capsule network (VGG-CapsNet) is described in their paper as a CapsNet-based diagnostic system for COVID-19. VGG-CapsNet performs better for COVID-19 diagnosis than CNN-CapsNet, according to simulation results.

Methodology
As a first step in our study, a deep visualization technique of activation maps is applied to trained CapsNets. We then compare the results obtained on the datasets using several squash functions in order to distinguish different feature representations in the CapsNets and gain insight into their potential. To determine whether capsule vectors represent transformation parameters directly, the second experiment further scrutinizes capsule features. This section describes the capsule network architecture and presents the various squash functions. Detailed results and experimental details for the different squash functions follow.

Capsule network architecture
As first described by (Sara et al., 2017), whole vectors are used to represent the internal properties (also referred to as instantiation parameters, including pose) of entities within an image, and each capsule represents one instance of an entity within the image. CNNs, which use single scalar outputs, rely on pooling as a crude way to route outputs. Pooling performs subsampling so that neurons are invariant to viewpoint changes; capsules, on the other hand, seek to preserve the information in order to achieve equivariance. To achieve translation equivariance, the outputs of lower-level capsules (such as the nose, ears, etc.) are sent as input to parent capsules (such as the face) representing part-whole relationships through linear transformations. Thus, pooling is replaced with dynamic routing. The underlying theory originates in computer graphics, where images are rendered from their internal hierarchical representations; it proposes that the brain solves an inverse graphics problem by deconstructing a presented image into its latent hierarchical properties. The CapsNets proposed by (Sara et al., 2017) use dynamic routing (DR) and a CNN to classify the MNIST dataset (images of 28 × 28 pixels). In the architecture, the first capsule layer uses two convolutional layers as the input representations, which are then routed to the final class capsule layer. Because of the initial convolutional layers, learned knowledge from local feature representations can be reused and replicated in other parts of the receptive field.
An iterative dynamic routing algorithm determines the capsule inputs. A transformation $W_{ij}$ is applied to the output vector $u_i$ of the capsule $C^{K}_{i}$. An object's state (e.g. orientation, position, relationship with the upper capsule) is indicated by the direction of the vector $u_i$, while its length represents the probability that the lower-level capsule detected the object. A prediction vector $\hat{u}_{j|i}$ is created from the output vector $u_i$, where $\hat{u}_{j|i} = W_{ij} u_i$. In the next step, the log prior probabilities $b_{ij}$ (obtained from a sigmoid function) are normalized by $\sum_k e^{b_{ik}}$ to give the coupling coefficients $c_{ij} = e^{b_{ij}} / \sum_k e^{b_{ik}}$, that is, a softmax is applied. When $\hat{u}^{L}_{j|i}$ is multiplied by $u^{L+1}_{j}$ and the two vectors agree, the resulting scalar product increases. The coupling coefficient $c_{ij}$ is likewise increased, while the coupling coefficients of the remaining potential parent capsules are decreased. Routing by agreement is then carried out via coincidence filtering to find clusters of predictions that are close to each other. A nonlinear normalization (also known as the squash function) then uses the length of an entity's output vector to represent the probability that the entity is present.
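The routing step described above can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes the softmax form of the coupling coefficients and the baseline squash of (Sara et al., 2017), three routing iterations, and hypothetical toy shapes (6 lower-level capsules of dimension 8, 3 parent capsules of dimension 16). The names `dynamic_routing`, `u_hat` and `W` are illustrative only.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Baseline squash (assumed form): keeps direction, maps length into [0, 1).
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    # Numerically stable softmax.
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """Route predictions u_hat of shape (num_lower, num_upper, dim_upper)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))            # log priors b_ij
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                      # coupling c_ij over parents j
        s = np.sum(c[:, :, None] * u_hat, axis=0)   # weighted sum per parent
        v = squash(s)                               # parent outputs
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # agreement update
    return v, c

rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))             # outputs u_i of 6 lower capsules
W = rng.normal(size=(6, 3, 16, 8))      # one matrix W_ij per (i, j) pair
u_hat = np.einsum("ijdk,ik->ijd", W, u) # predictions u_hat_{j|i} = W_ij u_i
v, c = dynamic_routing(u_hat)
print(v.shape, c.shape)
```

Each row of `c` sums to one, so every lower-level capsule distributes its output across the candidate parents, and the parents whose outputs agree with a prediction receive a larger share in the next iteration.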

Squash functions
In capsule networks, a non-linear activation function called the squashing function is used after the iterative routing procedure. It was proposed by (Sara et al., 2017) in their work on dynamic routing between capsules. The squashing function transforms the length of the output vector into the probability that the entity represented by the capsule is present. It shrinks long output vectors to a length slightly below one and short vectors to a length close to zero. This study therefore analyses the performance of different squash functions on complex image datasets. To this end, the squash functions listed in Table 1 are tested in terms of their performance on complex images.
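As a concrete illustration of this shrinking behaviour, the sketch below implements only the baseline squash of (Sara et al., 2017), assuming its usual form: the vector is scaled by ||s||² / (1 + ||s||²) along its own direction. The function name `squash` and the small constant `eps` are illustrative choices, and the other squash functions of Table 1 are not reproduced here.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Baseline squash (assumed form): v = ||s||^2 / (1 + ||s||^2) * s / ||s||."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)   # eps avoids division by zero for s = 0
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

short = squash(np.array([0.01, 0.0]))   # short vector: length shrinks toward 0
long_ = squash(np.array([30.0, 40.0]))  # long vector: length approaches 1
print(np.linalg.norm(short), np.linalg.norm(long_))
```

The direction of the input is preserved, so only the length, which is read as a presence probability, is changed.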

Loss function
For the image classification task, we used a separate margin loss (Sara et al., 2017) for each image capsule to identify whether a given image category is present within a capsule. For image capsule $s$, the margin loss $L_s$ is given by

$$L_s = T_s \max(0, m^{+} - \lVert v_s \rVert)^2 + \lambda (1 - T_s) \max(0, \lVert v_s \rVert - m^{-})^2 \qquad (1)$$

Here, $v_s$ is the output vector of image capsule $s$, and $T_s = 1$ if the image category exists within the image capsule; otherwise it is set to 0. $m^{+}$ and $m^{-}$ are set to 0.9 and 0.1, respectively. The down-weighting $\lambda$ is set to 0.5, which gave the optimal performance.
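A minimal numerical sketch of this loss is given below, assuming the margin-loss form of (Sara et al., 2017): the term for a present category penalizes capsule lengths below m+ and the term for an absent category, down-weighted by λ, penalizes lengths above m−. The function name `margin_loss` and the example capsule lengths are hypothetical.

```python
import numpy as np

def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum the per-capsule margin loss over the image capsules of one sample.

    lengths: capsule output lengths ||v_s||, one per category
    targets: T_s, 1 for the true category and 0 otherwise
    """
    lengths = np.asarray(lengths, dtype=float)
    targets = np.asarray(targets, dtype=float)
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent, axis=-1)

# The true class capsule is long enough (0.95 > 0.9); only the third capsule,
# whose length 0.30 exceeds m_minus, contributes to the loss.
loss = margin_loss([0.95, 0.05, 0.30], [1, 0, 0])
print(loss)
```

With these values, only the wrongly active third capsule is penalized, and its contribution is halved by the down-weighting factor.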

Datasets
Experiments were conducted on three benchmark datasets purposely for image classification.The details of each dataset are shown in Table 2.

Implementation
For the experimental analysis, we utilized Keras for the front-end and TensorFlow for the back-end.
Our Python code was run on a machine with an NVIDIA GeForce GTX 1050 GPU, 16 GB of RAM, a Windows OS, and an 8th-generation Intel Core i5 CPU @ 3.70 GHz. The proposed optimized CapsNets were trained with the default parameters and with 100 batches, running for 200 epochs on FMNIST and 100 epochs on CIFAR-10. As in the dynamic routing algorithm proposed by (Sara et al., 2017), three routing iterations are performed. The learning rate was further adjusted to 0.001 during training and the learning rate decay to 0.9 during testing. The margin loss function (see Equation 1) was then employed to train the models. In our experiments, we applied standardization to each image, and we trained all the networks from scratch. Only the best model is saved during training, which is controlled by patience, an early stopping hyperparameter set to 10. In the primary capsule layer, 8-dimensional vectors were instantiated for each capsule, and 16-dimensional vectors for each convolutional and image capsule. Within the image capsules, the length of each capsule indicates the existence of a specific image category, which is then used to identify the image categories within the dataset.
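The effect of the patience hyperparameter can be sketched in plain Python. This is an illustration under assumptions, not the training code: it assumes that the monitored quantity is a validation loss (the text does not state which metric is monitored), and the function name `best_epoch_with_patience` and the loss values are hypothetical.

```python
def best_epoch_with_patience(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses.

    Training stops once `patience` consecutive epochs bring no improvement
    over the best loss seen so far; the model from best_epoch is the one kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # the best model would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1       # ran through all epochs

# Loss improves for three epochs and then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
best, stopped = best_epoch_with_patience(losses, patience=10)
print(best, stopped)
```

In this example the best model comes from epoch 2, and training halts ten epochs later because none of the following epochs improves on it.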

Experimental results and discussion
Our experiments were evaluated in terms of accuracy, following related research in the same domain (Harilal & Patil, 2022). A summary of the experimental results on the two benchmark datasets is shown in Table 3.
A graphical representation of the performance of the various squash functions used in this study is presented in Figure 3. Based on the comparison of performance, the (Sara et al., 2017) and (Edgar et al., 2017) squash functions produce large activations even for small values of $s_j$ compared to the optimized squash function (Yaw et al., 2022b), resulting in faster initial growth of the function. Therefore, the optimized squash function outperforms the (Sara et al., 2017) and (Edgar et al., 2017) squash functions. Moreover, the optimized squash function (Yaw et al., 2022b) can compress short vectors to almost zero and long vectors to just below one. This shows that the optimized squash function produces better sparsity, preventing capsules from holding on to high activation values. Sparsity is used to select and optimize highly discriminative capsules so that information can be captured from images with varied backgrounds. Figure 3 illustrates the performance improvement achieved by the optimized squash function.
Several methods are available for determining the effectiveness of a proposed classification model. To evaluate the performance of the various squash functions on standard datasets, the following evaluation parameters were used:
• Accuracy: The proportion of test images assigned to the correct category out of all test images. For all experiments, we quote the overall accuracy.
• Loss: This metric measures how far the model's predictions differ from the true labels. These experiments use margin loss as the measure of loss.
• Clustering: We derive and analyze the clustering achieved by the class capsule layer. The routing algorithm proves to be effective on the datasets in this instance.
• Area Under Curve (AUC): To analyze the performance of the model on imbalanced datasets, the receiver operating characteristic (ROC) and precision-recall (PR) curves are calculated.
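For a multi-class problem, the area under the ROC curve is commonly computed one class at a time (one-vs-rest). The sketch below is one way to do this in plain Python, assuming that the class capsule lengths serve as scores; it uses the rank interpretation of AUC (the probability that a positive sample is scored above a negative one, with ties counted as one half). The function names and the toy scores are hypothetical and do not come from the experiments.

```python
def roc_auc(scores, labels):
    """Binary AUC: probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(class_scores, true_classes, num_classes):
    """One AUC per class, treating class k as positive and all others as negative."""
    return [
        roc_auc([row[k] for row in class_scores],
                [1 if y == k else 0 for y in true_classes])
        for k in range(num_classes)
    ]

# Toy binary example: one of the four positive/negative pairs is ranked wrongly.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])

# Toy two-class example with capsule lengths as scores.
per_class = one_vs_rest_auc([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]], [0, 1, 0], 2)
print(auc, per_class)
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot, so values above 0.5 indicate that the scores separate the class from the rest better than chance.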
Because of computational constraints, the performance of the different models was compared at the 200-epoch mark rather than after training each model until convergence, as depicted in Figure 4. On FMNIST, the optimized squash function (Yaw et al., 2022b) achieved 92.80% accuracy, while Edgar's model (Edgar et al., 2017) and Sabour's model (Sara et al., 2017) achieved 92.78% and 92.49% accuracy, respectively. With the FMNIST dataset as a training set (see Figure 5), the optimized squash function, Edgar's model, and Sabour's model showed similar classification error rates of 7.20%, 7.22%, and 7.51%, respectively.
Since CIFAR-10 is complex and our computational resources were constrained, we trained the models on it for 100 epochs. A detailed analysis of the performance assessment for the CIFAR-10 dataset can be found in Table 4. A model's performance on imbalanced datasets does not depend on the class in which the data is distributed. On CIFAR-10, the optimized squash function (Yaw et al., 2022b) achieved the highest accuracy of 86.79%, compared to 85.63% for the squash function of Edgar et al. (Edgar et al., 2017) and 84.57% for the original squash function of Sabour et al. (Sara et al., 2017) (see Figure 6).
According to our analysis, the optimized squash function (Yaw et al., 2022b), the model of Edgar et al. (Edgar et al., 2017), and Sabour's model (Sara et al., 2017) had classification error rates of 13.21%, 14.37%, and 15.43%, respectively, when trained on the CIFAR-10 dataset. The optimized squash function performed marginally better than Edgar's model and Sabour's model on the same dataset, with accuracies of 86.79%, 85.63%, and 84.57%, respectively. All the models are also assessed on the individual classes on the basis of per-class accuracy. Using a different dataset or different hyperparameter settings may lead to better results and greater accuracy. Due to the imbalanced nature of the datasets, we generated and analyzed receiver operating characteristic (ROC) and precision-recall (PR) curves for all the models on CIFAR-10 and FMNIST to determine how effectively the models distinguish between the different classes. Accuracy alone can be misleading when datasets have highly imbalanced classes, since classes with many samples tend to overshadow smaller classes (Zhao & Cen, 2014). Since the area under the curve (AUC) reflects the sensitivity and specificity of the model's predictions across thresholds (Hajian-Tilaki, 2013), we use it to summarize the model's performance. As shown in Figure 7, the ROC curves for the two imbalanced datasets all lie above the diagonal, which indicates that the optimized model is effective at discriminating between categories. Edgar's and Sabour's models show weaker discriminative power than the optimized squash function.
The PR curves shown in Figure 8 are also appropriate for evaluating highly imbalanced datasets. Even with the class imbalance, the optimized model was able to discriminate between the different categories effectively.
Similar ROC and PR curves on the CIFAR-10 dataset for all three models are shown in Figure 9, with the optimized model discriminating better among the different classes.
To analyze the separability of the clusters formed at the class capsule layer, we used t-distributed stochastic neighbor embedding (t-SNE) (García-Alonso et al., 2014). The formation of distinct clusters indicates that the model is able to separate the test images by class. Figure 10 shows the clusters for all three models at the secondary capsule layer on the FMNIST and CIFAR-10 datasets. In contrast to the original and Edgar's CapsNet models, the optimized model forms distinct clusters for the datasets (although they overlap). A few outliers can be observed for each model's clusters; however, they are not too far from their respective clusters. Based on these results, the optimized model has good discriminative ability in comparison with the other models.

Conclusion
Our paper is unique in two respects: 1) We evaluated the performance of a variety of squash functions on complex images using CapsNets. 2) A comparison of the optimized squash function with other squash functions showed that the optimized squash function performed better, significantly reduced the number of parameters, and introduced interesting changes to the CapsNet model. Using two standard datasets with complex backgrounds, we tested three different squash functions: the optimized squash function, Edgar's squash function, and Sabour's squash function. Based on the comparison of these squash functions in CapsNets, the optimized squash function is superior to the others, since the entities in the images are well preserved. The optimized squash function also improves CapsNet performance by preventing information sensitivity in addition to shrinking vectors. The sigmoid function was chosen instead of the softmax function for all dynamic routing models in order to achieve better normalization of the coupling coefficient. The optimized squash function also employs feature extraction so that images can be classified better based on their feature information. Using the feature extraction technique in the encoder, more discriminable feature representations could be created when dealing with complex background data. The optimized squash function achieves state-of-the-art results on the standard datasets, demonstrating its effectiveness. The optimized squash is a new method and an important implementation idea for alleviating the information sensitivity problem of CapsNets. In the future, we hope to study and modify more squash functions so that they perform better in classifying complex images.

Figure 1. A typical architecture of CapsNet encoder with an image from MNIST dataset.

Figure 2. A typical architecture of CapsNet decoder with an image from MNIST dataset.
Figure 3. Comparison between different squash functions.

Figure 9. Multi-class Receiver Operating Characteristic (ROC) curves and Precision-Recall curves for CIFAR10. Panels (a), (b) and (c) show the ROC curves for (a) the Afriyie et al. (Yaw et al., 2022b) model, (b) the Sabour et al. (Sara et al., 2017) model and (c) the Edgar et al. (Edgar et al., 2017) model; panels (d), (e) and (f) show the Precision-Recall curves of the respective models.

Figure 10. Visualization of the clusters formed at the: (a) FMNIST-caps-amp layer of the optimized model (b) FMNIST-caps-amp layer of Sabour's model (c) FMNIST-caps-amp layer of Edgar's model (d) CIFAR10-caps-amp layer of the optimized model (e) CIFAR10-caps-amp layer of Sabour's model (f) CIFAR10-caps-amp layer of Edgar's model.

Table 2. Properties of datasets