Two-Way Feature Extraction Using Sequential and Multimodal Approach for Hateful Meme Classification

Introduction
In the present era, social media is one of the most important activities that directly or indirectly affect people [1]. Social media is a great platform for the masses to develop skills, reach experts, and express talent, and it has helped many people gain success by sharing and promoting their work around the globe over the Internet. Sharing of memes on social media is increasing rapidly. On the positive side, memes spread humour. However, technology is both a boon and a bane: on the negative side, these memes can hurt a group or an individual. Internet memes can most commonly be defined as still images with text that spread rapidly among people and become a craze. They attempt to make us laugh at the expense of a theme or a person, and they often carry a deeper meaning. Memes can be made by anyone. One section of the audience may find them funny while another may find them offensive. Memes are widely spread on social media sites such as Quora, Instagram, Twitter, Facebook, Snapchat, and WhatsApp. Memes are a great tool to spread humour; however, some people use them to target an individual or a group and to offend them in a polite and sarcastic way. Such memes spread hatred, and their excess may lead to depression. Nowadays, memes are made on countless topics such as politics, movies, games, college life, and comic book characters.
In this work, we address a real-world problem using multiple deep learning techniques. Most research targets a particular domain, viz., text recognition [2], image classification [3], object detection [4], or natural language processing [5]. As the problem has a direct impact on society, we try to provide the best possible solution by combining the aforementioned techniques so that it is beneficial for society. Although a vast amount of research is related to sentiment analysis [6,7], combining it with image analysis [3] makes the problem novel. Memes can be great content for a laugh or two. However, content that is hilarious to one person can be loathsome to another. Some people also deliberately create memes whose main purpose is to spread hate towards a community or a person. Since the reach of content on social media is practically unlimited [1], memes have also gained the attention of political parties as a way to promote their agenda. Some political parties use memes to spread false information about their opponents, which indirectly affects elections. Moreover, many people share distasteful memes and promote their ideas on social media sites. Such memes try to make fun of a target individual or group. The ideas and statements of such memes should be banned before it is too late, because many people read them and may come to regard those ideas as acceptable. Data analysts from different parts of the world are trying to solve the problem of identifying such memes. Millions of memes are created and shared every day on social media platforms, so it is not possible to remove hateful memes manually. In this research, we propose an algorithm that identifies such memes so that social media platforms like Facebook or Quora can delete them.
Our research contributions can be summarized as follows:
(i) A two-way analysis is proposed covering the textual as well as the image component of the memes.
(ii) Data cleaning, preprocessing, transformation, and text extraction from images are performed over the dataset to improve generalization.
(iii) Instead of focusing on one domain, an ensemble of techniques from multiple domains is proposed.
(iv) Feature extraction over the dataset is performed considering both textual and image component features of the memes.
(v) Two novel approaches, sequential and multimodal, are proposed, and a comparative analysis of both is carried out. In the sequential approach, the image modality is converted to the text modality using image captioning, and classification is then done using textual features. In the multimodal approach, image features and textual features are extracted and combined to classify memes.
(vi) Results show that the proposed approach outperforms the ground truth remarkably.
This work is organized as follows. Section 2 presents an extensive literature survey of DL techniques for the hateful meme classification problem. In Section 3, we propose our two-way model for classifying memes. Section 4 presents a detailed comparative analysis and a discussion of the results. Finally, Section 5 concludes the work.

Related Works
Image captioning is an active research field. It is very difficult to extract the context of a particular image just by looking at it. You et al. [8] proposed a solution to such problems using encoder and decoder architectures of DL.
Anderson et al. [9] proposed a similar method to handle the image captioning problem using a combination of top-down and bottom-up attention mechanisms. This enabled the salient image regions to be computed. They used Faster R-CNN on image regions, each associated with its feature vector, which helps determine the appropriate feature weights as required by the architecture.
Optical character recognition is one of the most studied fields in AI and DL. Many researchers have proposed various model architectures, but none of them generalizes, since performance depends on the dataset being used; nevertheless, they give a broad understanding of how similar problems can be tackled. In [10], the authors explained one approach to dealing with such problems efficiently. They described in detail how they took the pretrained weights of Google_Inception_V3. The model was trained on a random set of more than 54k noisy character images, which gave an overall 21.5% reduction in error rate compared with the existing OCR models.
Similarly, the authors of [11] used a more rigorous approach rather than relying only on the pretrained model. They developed a custom CNN model by fine-tuning the pretrained model weights with additional LSTM and DNN layers to achieve better results. However, training such a large network requires a very powerful GPU and a large amount of VRAM. The authors managed this and reported an error rate of only 0.11% on the well-known UW3 dataset.
Later, a few studies proposed a "multimodal approach", which is very handy for problems that depend on multiple modules in a hybrid architecture. In [12], the authors used such an approach for emotion recognition in video: they extracted image features by capturing the visual information of the detected human faces, extracted the audio stream for the same moment, and converted both into similar feature vectors before solving the problem jointly. The results obtained were 15% better than those of the sequential approach discussed in [13], where the author converted both the audio stream and the faces in the video stream to text using an encoder-decoder model and then treated the problem as sentiment analysis [14].
Image classification is one of the most prominent research fields, and many advanced research studies related to this domain have been published in the past few years. Generally, the areas that handle this type of research are computer vision, image processing, and ML. Krishna et al. A new and interesting approach to image classification is presented in [16], where the authors show the implementation of transfer learning in the field of DL and computer vision. Lin et al. [16] discussed how a pretrained model (Google_Inception_V3) can be used for classification on a custom dataset and explained how to change the last layer of the architecture so that the output layer matches the required number of classes.
Object detection [17] is one of the most popular fields, and its research results have advanced dramatically since 2012. The authors of [18] explained the use of an alternative to CNN, namely YOLO. The proposed algorithm is very fast with respect to CNN, running at 155 FPS, and its mAP [18] can reach up to 78.6%, both of which are well ahead of F-RCNN.
The amount of text data on the web has been rising exponentially over the past few decades. In today's world, every data user wants very precise results. Nevertheless, retrieval of relevant information from a given text has always been a challenge for AI. Therefore, the authors of [19] discussed a tokenization approach that makes inference faster and more accurate. Most of the proposed research works specifically targeted either image-based methods or textual methods. In the present work, we propose a two-way approach for addressing the problem of hateful meme classification.

Methodology
To predict whether a given meme is hateful or not, the proposed ML model utilizes information from both the image of the meme and the text written on it. This is a binary classification problem. The present work explores two different approaches to identifying hateful memes. The first approach uses sentiment analysis based on image captioning and the text written on the meme. The second approach extracts and combines features from the different modalities. To solve this problem, we have used the Facebook Hateful Memes Challenge dataset [20], containing approximately 8500 meme images. Both approaches are tested on the same validation set, and the results are quite acceptable for both models.

Data Preprocessing.
We have used the Facebook Hateful Memes dataset, containing 8500 images, each with a unique id and a label of 0 or 1 (0: not toxic; 1: toxic). In the dataset, we found that the images were all of different sizes. Thus, we applied a transformation to resize all the given images to 224 × 224 × 3, where 224 is the height and width and 3 is the number of RGB channels. Further, we normalized the images and converted them into a vectorized form.
After transforming each image, we extracted the text from it. For this, we used a third-party OCR tool to pull out the required text from the given image. Furthermore, to pass the text data into a given neural network model, some separate preprocessing is required to bring it into a suitable format. In the multimodal approach, we have used FastText [21], a library developed by Facebook, which covers all the text preprocessing steps discussed in approach 2 and produces the feature vector that is passed into our final NN classifier. In the image caption-based approach, we have embedded the text using the GloVe embedding algorithm [22], developed at Stanford. It is an unsupervised learning algorithm whose pretrained vectors cover a vocabulary of 400 thousand words. Using GloVe, we obtain a feature vector that is passed into the final NN-based sentiment analysis model, as discussed in detail in approach 1.
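The resizing, normalization, and vectorization steps can be sketched without any dependencies as follows. This is only an illustration under stated assumptions: the paper does not name the interpolation method or the normalization constants, so nearest-neighbour sampling and the commonly used ImageNet channel statistics are assumed here, and the function names are hypothetical.

```python
def resize_nearest(image, out_h=224, out_w=224):
    """Resize an H x W x 3 image (nested lists) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Scale 0-255 pixels to [0, 1], then standardize each RGB channel.

    The mean/std defaults are the usual ImageNet statistics (an assumption).
    """
    return [
        [[(px[ch] / 255.0 - mean[ch]) / std[ch] for ch in range(3)] for px in row]
        for row in image
    ]


def to_vector(image):
    """Flatten an H x W x 3 image into a single list of floats."""
    return [value for row in image for px in row for value in px]


# Synthetic 500 x 300 RGB image standing in for a meme of arbitrary size.
raw = [[[(r + c) % 256, 128, 255] for c in range(300)] for r in range(500)]
resized = resize_nearest(raw)
vector = to_vector(normalize(resized))
```

In a PyTorch pipeline the same steps would more likely be expressed with `torchvision.transforms` (for example `Resize` and `Normalize`); the pure-Python version above only makes the arithmetic explicit.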

Sequential Approach.
In this model, the basic procedure is to first find the semantic meaning of the meme image in textual form using image captioning. This is done by an encoder-decoder model. In the encoder, the image is passed through a pretrained ResNet-152 [23,24] model (trained on the ImageNet dataset), from which we take the last layer (dimension 2048) as the output vector. This vector is passed through a linear layer whose input dimension is the same as the ResNet-152 output dimension and whose output dimension equals the embedding input dimension of the LSTM component in the decoder [25,26]. In the decoder, the start token is given along with the image feature vector to predict the first word. The word with the highest probability of being the first word is then used along with the image features to predict the second word. This process continues until the LSTM emits the finish token [27,28].
After that, we perform some basic image processing on the image and then use the Tesseract API [29,30] developed by Google to extract the text written on the image. This extracted OCR text is concatenated with the sentence generated by the image caption model [21,31]. The resulting text is embedded using GloVe and then passed through an NN-based sentiment analysis model. This NN model consists of convolution layers, max pooling layers, a global max pooling layer, a fully connected layer, and a sigmoid layer. If the sigmoid output is greater than the threshold value (0.5), we classify the meme as hateful; otherwise, we classify it as not hateful. Figure 1 describes in detail the step-by-step methodology for the sequential approach.
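The word-by-word decoding loop described above amounts to greedy decoding. The sketch below illustrates it with a toy lookup table standing in for the LSTM; `greedy_decode`, `toy_step`, the token strings, and the `max_len` safeguard are hypothetical names and choices made for illustration, not the authors' implementation.

```python
def greedy_decode(image_features, step_fn, start_token="<start>",
                  end_token="<end>", max_len=20):
    """Generate a caption by always picking the most probable next word.

    step_fn(image_features, previous_token, state) must return a pair
    (probabilities: dict[word, float], new_state).
    """
    words = []
    token, state = start_token, None
    for _ in range(max_len):  # safeguard in case the end token never appears
        probs, state = step_fn(image_features, token, state)
        token = max(probs, key=probs.get)  # highest-probability word
        if token == end_token:
            break
        words.append(token)
    return " ".join(words)


def toy_step(image_features, token, state):
    """Toy stand-in for one LSTM decoder step (ignores the image features)."""
    table = {
        "<start>": {"a": 0.7, "dog": 0.2, "<end>": 0.1},
        "a": {"dog": 0.6, "a": 0.1, "<end>": 0.3},
        "dog": {"<end>": 0.9, "a": 0.1},
    }
    return table[token], state


caption = greedy_decode([0.0] * 2048, toy_step)
print(caption)  # a dog
```

Greedy decoding keeps only the single best word at each step; alternatives such as beam search keep several candidates, but the text above describes the greedy variant.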
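The last two steps of the pipeline, joining the OCR text with the generated caption and thresholding the sigmoid output at 0.5, can be sketched as below. The helper names are hypothetical, and the `logit` argument stands in for the raw output of the sentiment analysis network, which is not reproduced here.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def build_model_input(ocr_text, caption):
    """Concatenate the OCR text and the generated caption into one string."""
    return f"{ocr_text.strip()} {caption.strip()}".strip()


def classify(logit, threshold=0.5):
    """Label a meme as hateful when the sigmoid score exceeds the threshold."""
    score = sigmoid(logit)
    label = "hateful" if score > threshold else "not hateful"
    return label, score


text = build_model_input("  when you see it ", "a dog sitting on a couch")
label, score = classify(1.2)  # 1.2 is an arbitrary example logit
```

Because the comparison is strict, a score of exactly 0.5 is labelled "not hateful", matching the "greater than the threshold" rule in the text.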

Multimodal Approach.
In this model, we approached the problem differently, rather than converting the image into text and then solving a sentiment analysis problem. Here, we first perform some preprocessing, and then pass the image vector through the pretrained ResNet-152 model, taking the last layer as the output vector for our feature representation. However, the output feature vector of ResNet-152 has dimension 2048. Hence, one more linear layer is added whose input dimension is the same as the ResNet-152 output dimension and whose output dimension matches our language feature dimension. For the text data, rather than fine-tuning manually, we directly used the FastText library to extract the desired features by adding an additional embedding layer whose embedding is kept fixed for simplicity. The output of the embedding layer is then passed through a trainable linear layer as a way of fine-tuning the feature vector.
Finally, these features received from our vision as well as from our language model are concatenated and transformed into another single feature vector. Later these extracted features are passed to one fully connected layer for classification. Figure 2 depicts the complete model architecture for multimodal approach.
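The fusion step can be illustrated with a small dependency-free sketch: each modality is projected by its own linear layer, the two projections are concatenated, and one fully connected layer followed by a sigmoid produces the prediction. The text feature size (300), the projection size (8), the random initialization, and all function names are assumptions for illustration; only the 2048-dimensional image feature comes from the text.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(n_out, n_in, rng):
    """Create small random weights and zero biases (untrained placeholders)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
IMG_DIM, TXT_DIM, HIDDEN = 2048, 300, 8  # 300 and 8 are assumed sizes

img_proj = init_linear(HIDDEN, IMG_DIM, rng)      # image projection layer
txt_proj = init_linear(HIDDEN, TXT_DIM, rng)      # trainable text layer
classifier = init_linear(1, 2 * HIDDEN, rng)      # final fully connected layer


def fuse(image_features, text_features):
    """Project both modalities to the same size and concatenate them."""
    v = linear(image_features, *img_proj)
    t = linear(text_features, *txt_proj)
    return v + t  # list concatenation, not element-wise addition


def predict(image_features, text_features):
    """Return the probability that the meme is hateful."""
    logit = linear(fuse(image_features, text_features), *classifier)[0]
    return 1.0 / (1.0 + math.exp(-logit))


fused = fuse([0.01] * IMG_DIM, [0.02] * TXT_DIM)
p = predict([0.01] * IMG_DIM, [0.02] * TXT_DIM)
```

Concatenation keeps the two modalities in separate coordinates of the fused vector, so the classifier can weight image and text evidence independently.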

Results and Discussion
Both proposed approaches are tested on the same validation set, and the results are found to be quite acceptable for both models. Detailed results are described further in this section.

Multimodal Approach.
We first address this problem using only the OCR meme text as input, which gave the results illustrated in Table 1. This was done by training an NN-based sentiment analysis model on the Wikipedia Toxic Comments dataset. The Adamax optimizer gave the best validation accuracy of 0.55. Table 1 shows some of the best results obtained in each category after rigorous training, testing, and hyperparameter optimization. We have used lr_scheduler [32] to automatically determine the learning rate according to the number of epochs. The advantage of using a scheduler is that the step size shrinks as training approaches a minimum: initially the scheduler takes large steps with a higher learning rate, and as training gets closer to the minimum, the step size is reduced. Thanks to PyTorch [32], we were able to use this feature to find an appropriate learning rate for each iteration.
We have also used "early stopping" as one of our hyperparameters, so that training stops when the loss starts to rise; this keeps the model near the best point found and avoids the high-variance problem that leads to overfitting. Figure 3 illustrates the ROC curve with an AUC score of 56.83%, and Figure 4 depicts the confusion matrix for the same score.
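The behaviour of such a scheduler can be illustrated with a step-decay rule. The text does not say which scheduler from `lr_scheduler` was used, so the rule below, which mirrors the `step_size` and `gamma` parameters of PyTorch's `StepLR`, and the values 0.01, 5, and 0.5 are assumptions chosen only to show the shape of the schedule.

```python
def step_decay_lr(initial_lr, epoch, step_size=5, gamma=0.5):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate for each of 15 epochs, starting from an assumed 0.01.
schedule = [step_decay_lr(0.01, epoch) for epoch in range(15)]
for epoch in (0, 5, 10):
    print(epoch, schedule[epoch])
```

Early epochs take large steps with the full learning rate, and each later block of epochs takes smaller steps, which is the behaviour described in the paragraph above.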
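A common way to implement early stopping is to track the best validation loss and stop after a fixed number of epochs without improvement. The class below is a minimal sketch of that idea; the `patience` and `min_delta` parameters and the loss values are hypothetical, since the paper does not report its exact stopping criterion.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement until epoch 2, then a steady rise.
val_losses = [0.70, 0.66, 0.64, 0.65, 0.67, 0.69, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

In this run training halts at epoch 5, and the weights saved at epoch 2 (the lowest validation loss seen) would be the ones to keep.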

Sequential Approach.
This section shows the results obtained with the sequential approach after training and validation on the Facebook Hateful Memes dataset. We have optimized the hyperparameters to obtain the best possible results from this approach. We have used a learning rate scheduler (functionality available in the Keras library) to dynamically adjust the learning rate. In other words, when the weights are far from the optimum, the learning rate is high, and when they are close to the optimum, the learning rate is low. We have also used early stopping to avoid the high-variance problem that leads to overfitting of the model. Table 2 shows the validation accuracy for a variety of optimizers used for training and validation.
We have used dropout regularization of 20% to avoid overfitting. Dropout regularization reduces the sensitivity of the model weights to noisy data during training. Figure 5 shows the validation set ROC curves obtained for a variety of optimizers when using the two-way approach that combines the OCR text and the image caption text.
From Figure 6, it is quite clear that the sequential model (using the Adamax optimizer) performed better on the validation set than the multimodal model (using the Adam optimizer). One possible reason is that the multimodal model takes the image input features as they are, and these contain a lot of noise. In contrast, the sequential model uses the image input features to form a sentence based on the highest sequential probability, which filters out the noise. However, the sequential model also has a limitation: its caption generation is only about 70% accurate, which means that it may produce a wrong output sentence, leading to a wrong prediction when this sentence is concatenated with the OCR text and passed through the sentiment analysis model. Finally, for the hateful meme classifier, the comparative analysis of accuracies is depicted in Table 3. The best model obtained from the sequential approach had a validation accuracy of 0.64, and the best model obtained from the multimodal approach had a validation accuracy of 0.59, as seen from Tables 1 and 2. Human accuracy on this prediction problem is approximately 80%, as reported by Facebook for the Hateful Memes Challenge and dataset [20]. We achieved a decent accuracy on this problem when compared with human accuracy.

Conclusion and Future Works
Memes on social media are one of the most popular ways to send false and hateful information to the masses. This work classifies hateful memes, which are targeted at a particular audience so as to modify its opinions on certain issues. The dataset of memes is provided by Facebook in the open challenge. In the present work, we have proposed a sequential approach and a multimodal one to extract information from image captions as well as from the text in the memes. Thus, a two-way feature extraction was performed, and deep learning models including a combination of OCR, GloVe, and an encoder-decoder architecture were applied, in addition to tools like the Tesseract API, for training. Furthermore, the two approaches are compared over benchmarks, as well as with the dataset collected from other sources. For this work, the results obtained were found to be quite comparable to human accuracy. In future, we plan to extend this work to other multimodal feature extraction methods so as to improve the training over the given dataset. Further, social media trends and patterns change fast, so there is a need for real-time capturing of memes with respect to a particular domain so as to find the influential entities. This work can be extended to capture such real-time data and train deep learning models for identifying hateful memes.

AI: Artificial intelligence
CNN: Convolutional neural network
CV: Computer vision
DCNN: Deep convolutional neural network
DL: Deep learning
F-RCNN: Faster R-CNN
DNN: Deep neural network
GPU: Graphics processing unit
ML: Machine learning
NLP: Natural language processing
NN: Neural network
LSTM: Long short-term memory
OCR: Optical character recognition
R-CNN: Region-based convolutional neural network
RGB: Red green blue
YOLO: You only look once
Data Availability
The authors participated in the Hateful Meme Classification challenge hosted by Facebook. While registering for this competition, they agreed not to redistribute the dataset, because it may contain content restricted from the general public; it may be used only for research purposes to provide a solution to the given problem.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.