A Hybrid Deep Learning Architecture for the Classification of Superhero Fashion Products: An Application for Medical-Tech Classification

Abstract: Comic character detection is an exciting and growing research area in the domain of machine learning. Recently, many methods have been proposed that provide adequate performance. However, most of these methods were evaluated on custom datasets containing a few hundred images and few classes, without comparison against any standard dataset. This article takes advantage of a standard publicly available dataset taken from a competition, and proposes a generic data balancing technique for imbalanced datasets to enable in-depth training of a CNN. In addition, to classify the superheroes efficiently, a custom 17-layer deep convolutional neural network is also proposed. The computed results achieved an overall classification accuracy of 97.9%, which is significantly superior to the accuracy of the competition's winner.


Introduction
Graphical comic arts were introduced in the mid-19th century to illustrate stories, characters, events, or particular buildings [1]. These comic arts were initially printed on paper but evolved into digital characters with the passage of time. At the end of the 20th century, these characters were introduced as superheroes in many animated movies, and the fame of these superheroes increased exponentially [2]. Nowadays, superheroes appear everywhere, whether on comic books, fashion accessories, school bags, or room walls, as their popularity grows rapidly. With the recent success of machine learning in many fields such as video surveillance [3,4], biometrics [5-7], medicine [8,9], agriculture [10-12], social networks [13], and a few others [14,15], researchers have turned their attention towards understanding and learning the visual features of comic characters to better identify and classify superheroes. Classification methods simply train on a few images before assigning an image to one of the predefined classes, while identification methods extract different features, i.e., statistical [16], color [17], geometrical, and shape, to identify and locate a character.
Faces in Japanese comics, called mangas, were identified using the Viola-Jones framework [61] and then detected sufficiently [62,63]. The assumption that prior techniques for the detection and recognition of human faces transfer directly to comics proved wrong, as the sizes, organ positions, and color shades of humans clearly differ from those of comic characters. Thus, an improved, comic-specific face detection method was proposed that utilized skin edges and color regions [64]. Another technique utilized color attributes to detect comic characters [65]. Graph theory was used to detect comic characters by representing color regions as nodes and panels as attributed adjacency graphs [66].
The same idea was implemented using SIFT features with redundant values to accurately classify repeated multiple objects [67]. Approximate nearest neighbors (ANN) search and local feature extraction were used for character retrieval [68]. A query-by-example (QBE) model was implemented using a Frequent Subgraph Mining (FSM) technique for comic detection [69].

Motivation
The use of superheroes on fashion accessories is increasing worldwide, as the demand for items featuring a superhero keeps growing. Since there exist so many superheroes, one can simply not recognize or memorize them all; thus, there is a need for an automated system that can classify any product bearing a superhero image and help the consumer identify and recognize the product, which is the primary motivation of this study. Another motivation of this work is to exploit standard datasets for comic classification to set a baseline for other researchers.

Applications
Although the proposed system is tested on product images, it is not limited to this domain and can solve many problems in other areas. One of the main applications of this work is to identify all comic medical images involving a medical character, i.e., a doctor, nurse, or paramedic, and then classify those images according to these comic characters. This can help to categorize medical images into specific folders. Another application may involve the detection of medical comic characters in a movie or video, to identify, locate, and describe a medical comic character. This can help to recognize and discriminate a medical comic character in a video.

Objectives and Contribution
Classifying superheroes across multiple types of products is a challenging task because the placement and size of the images vary widely. The task becomes even more difficult when the dataset is extremely imbalanced and the image sizes are extremely small. The main purpose of this article is to propose an automated system that not only overcomes these issues but also performs the tasks of training, classification, and prediction efficiently. The fundamental contributions of this article are:
(1) A general data balancing algorithm is proposed, which is not limited to the selected dataset. It calculates the difference between the majority and minority classes and populates the dataset with augmented images, obtained by image flipping, gamma correction, and Gaussian noise injection. This improves the training, and ultimately the performance, of the proposed model.
(2) A 17-layer deep CNN model is proposed, which contains six convolutional layers with attached ReLU and max-pooling layers and two fully connected layers. The settings of the proposed CNN model were adopted after intensive experiments, such as increasing and decreasing the number of convolutional layers and applying max and average pooling.

Materials and Proposed Model
This section describes in detail the selected dataset, the preprocessing steps involved, data augmentation, the CNN, the network architecture, and the training settings.

Dataset and Pre-processing
The primary objective of the selected dataset is to classify 12 superheroes, i.e., Antman, Aquaman, Avengers, Batman, Black Panther, Captain America, Catwoman, Ghostrider, Hulk, Ironman, Spiderman, and Superman, from product images. The dataset is already split into training and testing portions of 5433 and 3375 images, respectively. A two-step preprocessing method, comprising data augmentation and image resizing to 100 × 100 × 3, is adopted in this research work. Fig. 1 illustrates one image from each of the 12 classes.

Data Augmentation
With too few training images, over-fitting may occur: the training dataset contains five classes with fewer than 250 images, while the remaining classes contain 400 or more images, the largest class holding 1144 images. The aim of data augmentation is to grow all minority classes to the size of the majority class for fair and sufficient training of the proposed network. To achieve this, all images of the minority classes are first flipped horizontally, which balances classes such as Batman, Captain America, Iron Man, and Superman. In the second step, the remaining minority classes are augmented using gamma correction with a fixed gamma value g of 0.8. This step further balances classes such as Black Panther and Hulk. In the third step, Gaussian noise with a variance of 0.02 is applied to all minority-class images. This completes the data augmentation method, as all classes now contain more images than the original majority class. A total of 1144 images are then selected from each class to train the network. A detailed overview of each class is given in Tab. 1 along with the augmentation results. The original dataset contains 6,527 images, while the augmented dataset contains 20,000 images. The data augmentation method is further explained and illustrated in Fig. 2. Initially, all classes are extracted from the dataset, and the class with the maximum number of images is selected as the threshold value H. The difference between each remaining class and the threshold H is calculated and used to identify the non-balanced classes (nBC). These nBC are then forwarded for further processing to obtain up to three different images from each original image under certain conditions. The arrangement of these images in the relevant folders after augmentation is presented in Fig. 3.
The purpose of this arrangement is to keep the original images at the start, with the augmented images following them, so that images discarded from the end have a low overall impact on training, as the original images are always selected. After augmentation, the first 1100 images of each class are sequentially selected to train the proposed network.
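The balancing procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the representation of images as float arrays in [0, 1] and the helper name `balance_classes` are assumptions.

```python
import numpy as np

def balance_classes(classes, rng=None):
    """Balance an imbalanced dataset by augmenting minority classes.

    `classes` maps a class name to a list of images (H x W x 3 float
    arrays in [0, 1]).  For each minority class, augmented copies are
    appended after the originals -- horizontal flip, gamma correction
    (gamma = 0.8), then Gaussian noise (variance = 0.02) -- until the
    class reaches the majority-class count H.
    """
    rng = rng or np.random.default_rng(0)
    H = max(len(imgs) for imgs in classes.values())  # threshold value H
    for name, imgs in classes.items():
        originals = list(imgs)  # originals always stay at the start
        ops = [
            lambda im: im[:, ::-1, :],                # horizontal flip
            lambda im: np.clip(im ** 0.8, 0.0, 1.0),  # gamma correction, g = 0.8
            lambda im: np.clip(                       # Gaussian noise, var = 0.02
                im + rng.normal(0.0, 0.02 ** 0.5, im.shape), 0.0, 1.0),
        ]
        for op in ops:
            if len(imgs) >= H:
                break
            for im in originals:
                if len(imgs) >= H:
                    break
                imgs.append(op(im))
    return classes
```

Because each operation is applied only until the class reaches the threshold, up to three augmented images are generated per original, mirroring the three-step procedure above.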

Convolutional Neural Network (CNN)
Textual and non-textual classifications are mostly performed using CNNs, as these networks have achieved tremendous results due to their deep structures [70]. Parameter sharing, sparse interactions, and equivariance make these networks advantageous over traditional shallow networks. A typical CNN is composed of different layers, such as convolutional, rectified linear unit (ReLU), and pooling layers. The number and arrangement of these layers vary from network to network.

Convolutional Layer
Three-dimensional inputs and filters are convolved using the convolutional layer. Suppose an input image of size I_w × I_h × I_c is convolved along the width and height using a filter of size F_w × F_h × F_c, where w, h, and c denote the width, height, and channels of the input image and filter, respectively. The channel size of the input and filter must be the same to perform the convolution. If the stride of the filter is x and the padding is φ, then the width O_w and height O_h of the convolved output are calculated as:

O_w = (I_w − F_w + 2φ) / x + 1
O_h = (I_h − F_h + 2φ) / x + 1

The convolved output is also three-dimensional, with its width, height, and depth given by the output width, output height, and number of filters. Every convolutional layer has a ReLU layer as a nonlinear activation function to rectify the output of the convolutional layer. The ReLU function for an output r is defined as:

ReLU(r) = max(0, r)
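The output-size formula and the ReLU activation can be checked with a small helper; this is an illustrative sketch and the function names are ours, not from the paper.

```python
def conv_output_size(i_w, i_h, f_w, f_h, stride, padding):
    """Convolution output size: O = (I - F + 2 * padding) / stride + 1,
    computed separately for the width and the height."""
    o_w = (i_w - f_w + 2 * padding) // stride + 1
    o_h = (i_h - f_h + 2 * padding) // stride + 1
    return o_w, o_h

def relu(r):
    """ReLU activation: max(0, r)."""
    return max(0.0, r)
```

For the 100 × 100 input used in this work, a 3 × 3 filter with stride 1 and no padding yields a 98 × 98 output.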

Pooling Layer
The pooling operation updates the output of the ReLU activation function by calculating statistical measures over nearby output parameters. Performing pooling not only reduces the computational burden by reducing the number of parameters but also makes the representation invariant to small translations of the input. Suppose an activation set Z has a pooling region P_r; then max-pooling over this region is defined as:

Pooling_max = max(Z)

while average-pooling is defined as:

Pooling_avg = ( Σ_{p ∈ P_r} p ) / |P_r|

where |P_r| denotes the number of elements in the pooling region.
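The two pooling variants can be illustrated with a short NumPy sketch over a 2-D activation map; this is for illustration only (the paper's network applies pooling inside MATLAB), and the function name is an assumption.

```python
import numpy as np

def pool(z, size=2, stride=2, mode="max"):
    """Apply max- or average-pooling to a 2-D activation map z."""
    h = (z.shape[0] - size) // stride + 1
    w = (z.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # pooling region P_r for this output position
            window = z[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out
```

Max-pooling keeps only the strongest activation in each region, while average-pooling divides the region's sum by |P_r|, exactly as in the definitions above.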

Fully Connected Layer
The main purpose of fully-connected layers is to combine the learned features of the different convolutional kernels in such a way that they form a global representation of the overall image. The neurons of fully-connected layers fire only when the relevant convolutional features are present in the previous layers. Linear and non-linear transformations are performed on the input data. The linear transformation is represented as:

y = w · i + b

Here, w denotes the weights, i denotes the input from the previous layer, and b denotes the bias. For a non-linear transformation, a sigmoid function with values between 0 and 1 is used; non-linear transformations are applied when the data is binary. As we are dealing with more than two classes, linear transformations are used for the fully-connected layers.
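The linear transformation above amounts to a single matrix-vector product; the following one-line sketch (with an assumed helper name) makes that concrete.

```python
import numpy as np

def fully_connected(x, w, b):
    """Linear transformation of a fully-connected layer: y = W x + b,
    where W holds the weights and b the bias vector."""
    return w @ x + b
```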

Network Architecture
In the past decade, many pre-trained CNN models have been proposed to tackle multiple issues. These networks have performed well in many research areas; however, in the case of comic classification, none of these CNN models worked effectively because of their complex structures and extensive layers. These extensive layers decrease efficiency and increase the training time of the algorithm. To cope with this problem, a DCNN with 17 deep layers is proposed in this work. The input layer forwards the image to the connected convolutional layers, which have attached ReLU and max-pooling layers. The filter sizes, strides, and numbers of filters were set by performing multiple experiments. Each fully connected layer adds a bias vector to the product of a weight matrix and its input. The final fully connected layer feeds a softmax layer, which generalizes logistic regression. The configuration of the proposed CNN is presented in Fig. 4.
If Prob(m) denotes the prior probability of class m, Prob(n|m) denotes the conditional probability of the n-th sample given class m, and I denotes the total number of classes in the dataset, then the posterior probability for a sample is calculated as:

Prob(m|n) = Prob(n|m) Prob(m) / Σ_{k=1}^{I} Prob(n|k) Prob(k)

The output provides the predicted class label for every input based on the training; the class labels are defined as the class names while training the model. The proposed CNN contains 17 layers, where the input layer accepts an RGB image of size 100 × 100 × 3. There are a total of six convolutional layers, each followed by a ReLU layer and with a different number of filters, and each of these six convolutional layers is attached to a max-pooling layer.
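The softmax layer mentioned above can be sketched as follows; this is a numerically stable illustration, not the authors' MATLAB code.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of class scores into probabilities that sum to 1."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

The predicted label is then simply the class with the largest probability.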

Training Settings
The CNN model is trained on an NVIDIA GeForce GTX 1080 with a compute capability of 6.1, a 1607-1733 MHz clock rate, and 7 multiprocessors, using MATLAB 2018a. Stochastic Gradient Descent with momentum (SGDM) is used for training with a minibatch size of 64. The learning rate is initially fixed at 0.01 and decreased by a factor of 5 after every 5 epochs. The momentum is set to 0.7 and the maximum number of epochs to 150. Cross-Entropy [71] is used as the loss function, as it has performed reasonably for many multiclass problems. To extract features, the FC1 layer is utilized, which extracts 4000 features for a single image.
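The learning-rate schedule described above (initial rate 0.01, divided by 5 every 5 epochs) can be expressed as a simple step-decay function; the helper name is ours, and the original work configured this inside MATLAB rather than in code like the following.

```python
def learning_rate(epoch, initial_lr=0.01, drop_every=5, factor=5):
    """Step decay: the rate is divided by `factor` every `drop_every` epochs."""
    return initial_lr / (factor ** (epoch // drop_every))
```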
These parameter settings were selected by performing intensive experiments, considering the minibatch size of 64, the maximum of 150 epochs, and the 13,200 training images from the augmented dataset. The overall training accuracy and loss are illustrated in Figs. 5a and 5b, respectively.
A training accuracy of 93.5% is achieved with the proposed network, while the training loss is reduced to less than 1%. The training loss shows that the CNN model is well-trained on the training and validation sets. The training accuracy of the proposed model is determined once all the parameters of the model have been learned and no further learning is due. The trained network is then used to extract the features of the test data, which are later classified to obtain the classification accuracy.

Illustration of Data Augmentation
The proposed data augmentation technique first finds the majority class and calculates the difference between each class and the majority class. Based on this difference, different operations, i.e., image flipping, gamma correction, and Gaussian noise injection, are performed to generate new images. These operations generate up to three new images per original image to enlarge the training dataset. This also helps the deep learning algorithm to acquire more consistent features than the original dataset allows. Fig. 6 demonstrates the results of the data augmentation process on two different images from each of the five minority classes with fewer than 250 images. In Fig. 6, (a) is the original image, (b) the flipped image, (c) the image after gamma correction, and (d) the image after Gaussian noise injection.

Classification Results
The proposed 17-layer network is used to classify the test data using the trained model. The arrangement of CNN layers, such as the convolutional and max-pooling layers, plays a vital part in training the model to achieve maximum results. For this purpose, multiple experiments were performed to search for the ideal combination by increasing the depth of the network. The network was tested using 5, 6, 7, 8, and 9 convolutional layers along with ReLU and pooling layers. The highest results were obtained using the network with 6 convolutional layers. A comparison of these combinations is shown graphically in Fig. 7.
To evaluate the authenticity of the proposed network, evaluation parameters such as sensitivity, precision, specificity, and accuracy were obtained per class on the test data, as displayed in Tab. 3. The purpose of extracting class-wise classification results is to monitor the performance of the model on this dataset, which contains considerable inter- and intra-class similarity that decreases the overall efficiency of the model. It can clearly be seen that the second class (Aquaman), seventh class (Catwoman), and tenth class (Iron Man) are perfectly identified, with a sensitivity of 100.0%. The performance of the proposed network is compared with 7 classifiers, where ESD performs best by achieving an overall accuracy of 97.9%. These results are obtained on both the augmented and original datasets to verify the impact of data augmentation. The minimum training time is recorded for weighted-KNN at 69.0 seconds, while the maximum is recorded for LDA at 448.7 seconds. The lowest FNR is 2.1% for the ESD classifier and the highest is 9.7% for weighted KNN. In terms of sensitivity, the highest, 95.3%, is recorded for ESD, while 90.1% is recorded for LDA. The highest precision, 94.3%, is recorded for ESD, and the lowest, 90.5%, for cubic SVM. The average prediction time is 0.09 seconds, while the minimum prediction time is 0.04 seconds. Detailed classification results are shown in Tab. 4.
During testing of the proposed method on the selected dataset, a few images are incorrectly classified, which ultimately degrades the accuracy. These images are shown with their incorrectly predicted labels on the image on a yellow background and their correct labels beneath the image on a black background. Correctly and incorrectly predicted images are shown in Figs. 8 and 9, respectively.
The max-pooling layers of the proposed network are also compared with average-pooling. Max-pooling yields an accuracy of 97.9%, while the network achieves 96.3% with average-pooling, a degradation of 1.6% in overall classification accuracy. This downfall occurs because average-pooling considers all the elements inside the filter window when producing an output, while max-pooling selects only the highest value. In the future, other pooling techniques can also be tested to further enhance the results.

Discussion
In the relevant literature, researchers have focused on extracting hand-crafted features from comic panels or pages. Although previously proposed techniques achieved remarkable results, most of these methods were tested on very few images collected from Google or other sources. This research work utilized a standard publicly available dataset, which can be used for comparison to validate methods in this domain. The selected dataset was presented as a challenge to identify superheroes in fashion product images. Five winners were selected as a result of this challenge, achieving the highest classification accuracies. The leaderboard on the challenge page lists accuracy scores for around 108 contestants, with 94.31%, 93.96%, and 93.86% as the first, second, and third positions. The computed results of the proposed model, with 97.9% accuracy, outperform these previous results. This improvement in the classification score proves the authenticity of the proposed model.

Conclusion
This article proposed a 17-layer deep CNN to classify superheroes in different fashion product images. A publicly available dataset consisting of 12 classes and 8808 images is used to validate the performance of the proposed model. The dataset is balanced through a proposed augmentation technique, which performs operations such as image flipping, gamma correction, and Gaussian noise injection. The 17-layer deep CNN model contains six convolutional layers with connected ReLU and max-pooling layers and two fully connected layers. Different combinations of convolutional layers and their overall efficiency are also compared, along with the effect of max and average pooling. The experiments show that six convolutional layers with integrated max-pooling provide the best results: 97.9% classification accuracy with an average prediction time of 0.09 seconds. In the future, this network, combined with a few hand-crafted features, can be utilized to further enhance classification results. This model can also be applied to other domains to check the validity of the integrated layers and the depth of the proposed CNN model.