Image Hashtag Recommendations Using a Voting Deep Neural Network and Associative Rules Mining Approach

Hashtag-based image descriptions are a popular approach for labeling images on social media platforms. In practice, images are often described by more than one hashtag. Due to the rapid development of deep neural networks specialized in image embedding and classification, it is now possible to generate such descriptions automatically. In this paper we propose a novel Voting Deep Neural Network with Associative Rules Mining (VDNN-ARM) algorithm that can be used to solve multi-label hashtag recommendation problems. VDNN-ARM is a machine learning approach that utilizes an ensemble of deep neural networks to generate image features, which are then classified against a set of potential hashtags. The proposed hashtags are then filtered by a voting schema. The remaining hashtags might be included in the final set of recommended hashtags by application of associative rules mining, which explores dependencies within certain hashtag groups. Our approach is evaluated on the HARRISON benchmark dataset as a multi-label classification problem. The highest values of our evaluation parameters, including precision, recall, and accuracy, were obtained for VDNN-ARM with a confidence threshold of 0.95. VDNN-ARM outperforms state-of-the-art algorithms, improving on the precision of VGG-Object + VGG-Scene by 17.91% as well as on the recall of Ensemble-FFNN (intersection) by 32.33% and its accuracy by 27.00%. Both the dataset and all source code we implemented for this research are available for download, and our results can be reproduced.


Introduction
The number of social media users has continuously increased. Platforms such as Facebook, Instagram, Twitter, or Flickr are very popular tools for sharing news, keeping in touch with friends, and business promotion. With the aid of Natural Language Processing (NLP), researchers improve methods that might teach Artificial Intelligence (AI) to understand the meaning of messages published on a network. This is still a very challenging task, and algorithms are not perfect at capturing language flexibility, such as the sentiment or context of a sentence. Many users include additional information in their posts that classifies the context of the message using hashtags. Hashtags are words preceded by the '#' symbol and are used to label not only text data but also images, which is crucial in image-oriented social networks [1]. Hashtags might describe the content of a picture (for example "cat", "mum"), location ("downtown", "beach"), mood (for example "sad", "happy"), or other, even abstract, topics (for example "weather", "future", etc.). Users are also able to use different forms of words ("day", "days"), upper- and lowercase letters, slang-inspired words such as "luvu" (which means "love you"), or marketing slogans. The proper choice of hashtags is crucial for correctly categorizing image content and makes an image potentially easier for viewers to find. In this

Motivation of This Paper
As can be seen in the previous section, multi-label hashtag recommendation for image data is challenging to model but is a very promising area of research with important applications in industry, especially on social media platforms. In this research we propose and evaluate a novel machine learning approach that utilizes an ensemble of deep neural networks to generate image features, which are then classified against a given set of potential hashtags. The proposed hashtags are then filtered by a voting schema. The remaining hashtags might be included in the final set of recommended hashtags by application of associative rules mining, which explores dependencies within certain hashtag groups. This method is called Voting Deep Neural Network with Associative Rules Mining (VDNN-ARM). We evaluated VDNN-ARM on the HARRISON dataset, which contains 57,383 images in 997 classes (one image can be assigned to more than one class). We implemented and trained other state-of-the-art approaches, namely those of References [1,9], and our method outperformed those algorithms. Both the dataset and all source code we implemented for this research are available for download, and our results can be reproduced.
The most important contribution of this paper is the proposal and evaluation of a novel computer method that recommends hashtags from image data. The main novelty of this paper is employing an ensemble of deep neural networks whose classification is enhanced with additional information about dependencies between certain hashtag groups discovered by associative rules mining. To the best of our knowledge, deep learning and rule discovery have not previously been combined into a single voting and recommendation schema for the task of hashtag recommendation. The learning step of our proposed algorithm requires only a sufficiently large training dataset of images and information about the hashtags associated with them.

Materials and Methods
In this section we describe the dataset on which we evaluated our method and the schema of our multi-label classifier.

Dataset
In this research we utilized a real-world photo dataset, HARRISON [1]. This dataset is composed of 57,383 photos from Instagram. The authors of the dataset processed it by filtering out less frequent hashtags. Finally, each image in the dataset is described by one to ten hashtags. The total number of hashtags is 997, and there are on average 4.5 associated hashtags per photo. The task of assigning hashtags to a photo is defined here as a multi-label problem because each image might have one or more classes (hashtags) assigned to it. The efficiency of the evaluated method is measured using the precision calculated on the first suggested hashtag (precision(1)), the recall of the first five suggested hashtags (recall(5)), and the accuracy of the first five suggested hashtags (accuracy(5)), defined as follows:

precision(k) = |result(k) ∩ GT| / k
recall(k) = |result(k) ∩ GT| / |GT|
accuracy(k) = 1 if |result(k) ∩ GT| ≥ 1, and 0 otherwise

where k is the number of first k "best" (top) hashtags we want to consider, result(k) corresponds to the set of top k hashtags the algorithm predicted, and GT is the set of ground truth hashtags. As can be seen in this evaluation setup, obtaining 100% precision and recall is virtually impossible.
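The per-image evaluation metrics above can be sketched in a few lines of plain Python; the function names are ours, `result` is the ordered list of predicted hashtags, and `gt` is the ground-truth hashtag set for one image:

```python
def precision_at_k(result, gt, k):
    """Fraction of the k top suggested hashtags that appear in the ground truth."""
    return len(set(result[:k]) & set(gt)) / k

def recall_at_k(result, gt, k):
    """Fraction of ground-truth hashtags recovered among the k top suggestions."""
    return len(set(result[:k]) & set(gt)) / len(gt)

def accuracy_at_k(result, gt, k):
    """1 if at least one of the k top suggestions is a ground-truth hashtag, else 0."""
    return 1 if set(result[:k]) & set(gt) else 0
```

Dataset-level scores are then obtained by averaging these values over all validation images, with precision(1), recall(5), and accuracy(5) corresponding to k = 1 and k = 5.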

Classifier Architecture
The architecture of our solution was inspired by previous research in this field. Researchers quickly discovered that a single CNN might not be enough to extract all valuable features from an image. In Reference [1], the authors used two deep feature extractors, both with the VGG16 architecture [20]; however, the first one was trained on the ImageNet dataset [21] and the second one was trained on the Places dataset [22,23]. Researchers have used the very popular transfer learning approach, in which network weights are imported from pre-trained models to extract deep features and the final classification layers are re-trained using the actual classes present in the dataset [24,25]. In Reference [1], transfer learning is conducted on the HARRISON dataset. The classification network is composed of two fully connected (dense) layers with ReLU activation and an output layer with a sigmoid activation function. This is a typical network setup for multi-label classification problems.
The solution proposed in Reference [9] also uses an ensemble of DNNs pretrained on ImageNet, namely VGG16, InceptionV3 [26], and ResNet [27]. Contrary to Reference [1], the transfer learning for each DNN is performed separately on the HARRISON dataset. The solution proposed in Reference [9] is, however, simpler than the one in Reference [1] because the authors treated it as a single-label problem, with the output layer of the DNN classification part using softmax. The authors experimented with various ensemble schemas such as voting, union, intersection, and so forth.
In this paper we propose an approach, called VDNN-ARM (Voting-based Deep Neural Network architecture with Associative Rules Mining), that incorporates ideas from the above papers. Figure 1 presents an overview of this method. It consists of several CNN-based feature extractors, namely Xception [28], DenseNet201 [29], InceptionResNetV2 [27], VGG16, NASNetLarge [30], InceptionV3, and MobileNetV2 [31]. Each of these seven networks is pretrained on ImageNet; an eighth network, a VGG16 pretrained on the Places dataset (the same setup as in Reference [1]), completes the ensemble. Each of these networks accepts input images with dimensions of 224 × 224 in RGB color space. The output of each CNN is processed by a Global Average Pooling 2D layer and then propagated to classification layers. Each of the eight networks has the same classification architecture, consisting of two dense layers of 2048 neurons with ReLU activation functions and an output layer with sigmoid activation. The role of the last layer is to perform multi-label classification. Similar to Reference [9], we performed separate transfer learning for each of the eight networks on the HARRISON dataset. For a given input image, each of the eight DNNs generates a sigmoidal output. In Reference [1], the authors classified input images into x class labels (they assigned x hashtags to an image), which correspond to the x top values generated by the output sigmoidal layer. The number of classes/recommendations x is arbitrarily decided by the user of the algorithm. VDNN-ARM takes a different approach to recommending hashtags. Because each DNN generates separate recommendations, we can apply various ensemble techniques to them, similar to Reference [9]. However, besides using only image data during training, we can also utilize information about the dependencies between hashtags that is available in the dataset. We can do this, for example, by applying an associative rules mining framework.
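To illustrate the classification head described above, here is a minimal NumPy sketch of its forward pass (global average pooling, two ReLU dense layers, sigmoid output). The layer widths are reduced from the 2048 units used in the paper, and all weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def classification_head(feature_map, n_classes, hidden=8):
    """Forward pass: GlobalAveragePooling2D -> Dense(ReLU) -> Dense(ReLU) -> Dense(sigmoid)."""
    # Global average pooling collapses the H x W spatial grid to one value per channel.
    x = feature_map.mean(axis=(0, 1))
    # Placeholder weights; in the actual method these are learned via transfer learning.
    w1 = rng.normal(size=(x.size, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(size=(hidden, hidden)); b2 = np.zeros(hidden)
    w3 = rng.normal(size=(hidden, n_classes)); b3 = np.zeros(n_classes)
    x = np.maximum(x @ w1 + b1, 0)                 # dense layer 1, ReLU
    x = np.maximum(x @ w2 + b2, 0)                 # dense layer 2, ReLU
    return 1.0 / (1.0 + np.exp(-(x @ w3 + b3)))    # sigmoid: independent per-class scores

# A toy 7 x 7 x 16 CNN feature map scored against 10 hashtag classes.
scores = classification_head(rng.normal(size=(7, 7, 16)), n_classes=10)
top5 = np.argsort(scores)[::-1][:5]  # indices of the 5 most probable hashtag classes
```

Because the output is sigmoid rather than softmax, each class score is an independent probability, which is what makes the head suitable for multi-label prediction.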
Let T be the set of all transactions in the given dataset; A and B are itemsets, and A → B is an association rule [32]. We define the support of itemset A as the fraction of transactions in T that contain A, in other words the frequency of A in the dataset:

support(A) = |{t ∈ T : A ⊆ t}| / |T|
The confidence of association rule A → B is the conditional probability of B given A:

confidence(A → B) = support(A ∪ B) / support(A)

In our case, a transaction is the set of hashtags that describes a given image in our dataset. Because of this, as described in Section 2.1, each transaction in our dataset contains 1 to 10 items. We want to investigate potential associations between hashtags with reasonable support and confidence. In order to extract frequent itemsets, we used the Apriori algorithm [33].
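To make the two definitions concrete, here is a small self-contained sketch (plain Python, not the mlxtend implementation used in the experiments) that computes support and rule confidence over toy hashtag transactions; the example hashtags are invented for illustration:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A -> B) = support(A ∪ B) / support(A)."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)

# Toy transactions: each set holds the hashtags of one image.
images = [frozenset(t) for t in (
    {"beach", "sun", "summer"},
    {"beach", "sun"},
    {"beach", "summer"},
    {"cat", "cute"},
)]
```

Here support({"beach"}) is 3/4, and confidence("beach" → "sun") is (2/4)/(3/4) = 2/3: two of the three beach images also carry "sun".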
After training the DNNs and mining rules from the training dataset, the VDNN-ARM algorithm, described in the following section, is applied.

VDNN-ARM Algorithm
Let us assume there are l DNN classifiers. For an input image I, each CNN f_j generates a feature vector that is used as the input to a fully connected NN with a sigmoid output layer. For the prediction P_{j,[1..k]}(I) we take the k classes that correspond to the top values of the NN output layer.
In the next step we compose a single vector P, which contains the predictions of all l DNN classifiers.
Using P we generate two vectors: C, which contains the unique elements of P, and C_fr, which contains the counts in P of the hashtag class labels from C,
where #h_1 ≥ #h_2 ≥ ... ≥ #h_n, h_i is a hashtag class label, and #h_i is the count of the hashtag class label h_i in P.
Then we perform thresholding of C and create two vectors: C_1, which contains the classes that appeared in more than one classifier output (#h_m ≥ 2), and C_2, which contains those appearing in only one output. C_1 is then ordered by descending hashtag count; it contains the classes for which at least two classifiers have voted. Then we apply associative rules mining to C_1 and generate ARM(C_1), which is the set of all conclusions supported by the association rules, where ARM denotes reasoning with the rules we have previously discovered. Then we take the intersection of ARM(C_1) with C_2: in hashtag set C_3 we have only the classes that appeared as conclusions of rules supported by C_1 and that are also present in C_2.
Finally, the classes present in C_3 are appended to the end of vector C_1. S is the resulting vector that contains the suggested hashtags for image I, ordered from those most frequently proposed by the classifiers to those that appeared only once but were supported by ARM.
Now we can take x first elements from S to generate x top hashtag suggestions for a given image.
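The voting and ARM-filtering steps above can be sketched as follows; this is a minimal pure-Python illustration in which the per-classifier predictions and the rule set are toy placeholders (rules are simplified to single-hashtag antecedents), not the paper's exact data structures:

```python
from collections import Counter

def vdnn_arm_vote(predictions, rules):
    """predictions: list of top-k hashtag lists, one per DNN classifier.
    rules: association rules as {antecedent_hashtag: set of conclusion hashtags}.
    Returns S, the ordered vector of suggested hashtags."""
    counts = Counter(h for p in predictions for h in p)      # builds C and C_fr
    c1 = [h for h, n in counts.most_common() if n >= 2]      # voted by >= 2 classifiers
    c2 = {h for h, n in counts.items() if n == 1}            # voted by exactly one
    # ARM(C1): all conclusions of rules whose antecedents are supported by C1.
    arm_c1 = set().union(*(rules.get(h, set()) for h in c1))
    c3 = [h for h in arm_c1 if h in c2]                      # C3 = ARM(C1) ∩ C2
    return c1 + c3                                           # S: C3 appended after C1

# Toy example: three classifiers, each returning its top hashtags.
preds = [["beach", "sun"], ["beach", "summer"], ["beach", "sun", "sea"]]
rules = {"sun": {"summer"}, "beach": {"sea"}}
s = vdnn_arm_vote(preds, rules)
```

In this example "beach" (3 votes) and "sun" (2 votes) form C_1, while "summer" and "sea" each received a single vote but are kept because they are conclusions of rules whose antecedents lie in C_1.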

Results
We implemented our approach in Python 3.6. Among the most important packages we used were Keras 2.4.3 and TensorFlow 2.3.1 for DNN implementation and GPU-accelerated tensor calculations. We used pre-trained CNN network weights from Keras-Applications 1.0.8 that were trained on the ImageNet dataset [21]. We also used VGG16 network weights [23] that were trained on the Places dataset [22]. For associative rules mining we utilized the mlxtend 0.17.3 package [34]. To evaluate the proposed method we used the HARRISON dataset [1] described in Section 2.1. We used 52,383 randomly chosen objects for the training set and 5000 for the validation set.
All classifiers have been trained using a first-order gradient-based Adam optimizer [35] with a binary cross-entropy loss function.
In order to generate associative rules we set the minimal support threshold in the Apriori algorithm to 0.0001. We filtered out all rules with confidence below 0.001.
We also implemented and trained the algorithms proposed in References [1,9]. In the case of Reference [9], we replaced the softmax layer with a dense layer with a sigmoid activation function to make this classifier applicable to multi-label problems. All source code we implemented in our research can be downloaded from GitHub (https://github.com/browarsoftware/VDNN-ARM). Calculations were performed on a PC with an Intel i7-9700 3.00 GHz CPU and 64 GB RAM running the Windows 10 OS. We used an NVIDIA GeForce RTX 2060 GPU.
In Figure 2 we present the accuracy (5) tests on each DNN network in the form of a graph. We generated this visualization using Gephi 0.9.2 software [36]. Graph layout was generated using the ForceAtlas2 algorithm [37].
Each node (vertex) represents an image from the validation dataset. A colored edge connecting an image to the node representing a particular DNN network means that at least one of the top five hashtags generated by that DNN is among the hashtags describing this image. An image might have connections to several vertices if, and only if, it correctly passed the accuracy(5) test in more than one network. If a node is isolated, the image has not been correctly classified by any network. As can be seen, there is a group of images that are not correctly classified by any network; they are visible in the top part of the graph as isolated grey points. This clearly shows that there are some limitations of algorithms based on applying DNNs to hashtag discovery that cannot be overcome. In addition, the subsets of images covered by the various DNNs are not identical. This means that applying an ensemble of several DNNs of various types might give better results than using only a single DNN. In Table 1 and in Figure 3 we present detailed results of the proposed algorithm with various threshold values of confidence for ARM. It is also possible that our proposed method will not generate hashtag recommendations. This might happen when C1 = ∅ (see Equation (9)). The column "#Recommended hashtags" means that we evaluate precision, recall, and accuracy for no more than the x top generated hashtags, where x is not more than the length of vector S (see (11)). "No restrictions" means that we calculate all evaluation parameters over the whole vector S. The bold font indicates the parameters with the highest values in the table. The best results were obtained for VDNN-ARM with threshold = 0.95 if we take into account a limited number of hashtags. When there is no restriction on the number of hashtags, the highest recall and accuracy were obtained for VDNN-ARM with threshold = 0.2. In Equation (7) we use the parameter k = 5, the same as in References [1,9].
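The ensemble-coverage argument in the paragraph above can be quantified with a short sketch; the network names and per-network result sets below are toy placeholders, not our measured data. Any single network covers fewer images than the union of all networks, while some images remain uncovered by every network:

```python
def coverage_report(correct_by_network):
    """correct_by_network maps a network name to the set of image ids that
    passed the accuracy(5) test for that network."""
    covered = set().union(*correct_by_network.values())
    per_net = {name: len(ids) for name, ids in correct_by_network.items()}
    return per_net, covered

# Toy validation results for three hypothetical networks over images 0..9.
results = {
    "net_a": {0, 1, 2, 3},
    "net_b": {2, 3, 4, 5},
    "net_c": {5, 6},
}
per_net, covered = coverage_report(results)
isolated = set(range(10)) - covered  # images no network classified correctly
```

In this toy setting the best single network covers 4 images, the union covers 7, and 3 images stay isolated, mirroring the grey points in Figure 2.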
In all cases, when the number of proposed hashtags increases, the precision decreases while the recall and accuracy become higher. This is expected behavior. At the beginning of vector S there are the most voted (most probable) hashtags. When the number of considered hashtags increases, the denominator in the precision equation also increases, and progressively less probable hashtags are included in the evaluation. In the case of recall and accuracy, the higher number of hashtags increases the numerator, which increases recall and accuracy. When we do not limit the number of hashtags to 5 ("No restrictions"), the recall and accuracy achieve their highest values. Table 2 presents a comparison of the proposed method to state-of-the-art approaches. The highest values for all coefficients were obtained for VDNN-ARM with a confidence threshold of 0.95. VDNN-ARM outperforms the precision(1) of VGG-Object + VGG-Scene [9] by 17.91%; in the case of Ensemble-FFNN (intersection) [9], the recall(5) increased by 32.33% and the accuracy(5) by 27.00%.

Figure 3. Plot (a) shows the precision obtained for various numbers of hashtags and confidence levels of ARM. Plot (b) visualizes recall and (c) accuracy.

Discussion
As can be seen in Section 4, the proposed method outperformed state-of-the-art approaches. Due to its voting schema, this method incorporates the benefits of both the union and intersection schemas. The intersection schema is responsible for aggregating and counting the recommended hashtag labels from each sub-DNN network, while the union schema does not exclude less frequent hashtags from the final recommendation. The application of associative rules mining utilizes additional knowledge about conditional dependencies between hashtags. As can be seen in Table 2, in the case of the algorithm of Reference [1], image content data generated solely by a DNN is not enough to overcome the baseline results. The transfer learning approach utilized CNN features for successful classification of image content.
Typically, in the up-to-date literature, authors use methods that suggest the x most probable hashtags from an output sigmoidal layer, where x is arbitrarily chosen by the user. VDNN-ARM also allows one to manually choose the number of hashtag recommendations; however, by applying ARM and its confidence thresholding schema, it can additionally include less frequent hashtags that are recommended by ARM. The highest values of the evaluation parameters, including precision(1), recall(5), and accuracy(5), were obtained for VDNN-ARM with a confidence threshold of 0.95. All parameters decreased as the confidence threshold decreased (see Table 1). This result suggests that increasing the confidence of the ARM rules increases classifier performance: applying more confident rules leads to a higher number of "matching" hashtags generated by the approach.
Our results suggest the proposed algorithm is a promising approach that can be successfully applied in practice. Another important finding concerns the limitations of DNN-based hashtag discovery algorithms, which we discussed above and visualized in Figure 2. In order to improve the evaluation results, we need to improve parts of the algorithm other than the CNN-based image feature extractors. According to Reference [38], there is a certain category of image hashtags that its authors named "stophashtags". This name is inspired by the term "stopwords", which is used in the field of computational linguistics to refer to common and non-descriptive words found in almost every text document. The authors of that research have shown that, contrary to descriptive hashtags (hashtags relevant to the subject of an image), "stophashtags" are characterized by a high normalized hashtag frequency in irrelevant subject categories. Because in this research we used a third-party benchmark dataset, which had already been preprocessed by its creators and used in other research, we did not filter out potential "stophashtags". It is possible that filtering "stophashtags" might improve the results of our method; however, the algorithm described in Reference [38] should operate on all acquired hashtags, not the subset present in the HARRISON dataset. In our future research in the field of hashtag recommendations, we plan to acquire an even larger dataset than HARRISON and apply "stophashtags" filtering to it. We believe this operation might lead to even more interesting and valuable results.

Conclusions
The proposed VDNN-ARM hashtag recommendation algorithm is an efficient approach that can be applied to any type of social media image data. As can be seen, the precision(1) coefficient is still relatively low; the first top hashtag is correct for only about one-fifth of the validation data. In the case of accuracy(5), over 55% of the validation data has at least one correctly assigned hashtag. This is because the multi-label classification problem is difficult to solve correctly: not only are there 997 classes, but the ground-truth class labels might not match the objects, scenes, and places that are present in the images. Real-life hashtag descriptions often convey the state of mind, sentiment, or some abstract context that the person had at the moment of taking or publishing a photo. Each photo might have between 1 and 10 different hashtags, and that number varies between images. Therefore, when we do not have additional knowledge about the context of the picture (the so-called "story behind the photo"), we might not be able to mine/learn the rules that govern certain phenomena. The knowledge of those rules is also not fully available to a person who might try to manually assign hashtags. Because of this, it is very improbable that any algorithm based on the HARRISON data will obtain perfect or even nearly perfect accuracy. Contrary to already published methods, our algorithm is capable of limiting the number of proposed hashtags by applying the ARM approach and its thresholding schema. Thanks to this, VDNN-ARM in no-restriction mode can easily trade off between precision and accuracy/recall. We believe our algorithm is not limited to hashtag recommendation; it can be applied to any type of multi-label image classification data.
In our opinion, the next step in research should be developing methods that utilize additional information, such as the context of the photo, which can be extracted from discussions about this photo on social media, geopositioning information, and so forth. These additional data, besides image data and baseline hashtag information, seem to be crucial to increase the efficiency of multi-label hashtag recommendations above a certain level limited by image-oriented DNNs. The limitations of image-oriented DNNs are clearly visible in Figure 2.
Author Contributions: T.H. was responsible for conceptualization, proposed methodology, software implementation, and co-writing of original draft; J.M. was responsible for data curation and co-writing of original draft. All authors have read and agreed to the published version of the manuscript.