Article

Image-Captioning Model Compression

Viktar Atliha and Dmitrij Šešok
Department of Information Technologies, Vilnius Gediminas Technical University, Saulėtekio Al. 11, LT-10223 Vilnius, Lithuania
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(3), 1638; https://doi.org/10.3390/app12031638
Submission received: 8 January 2022 / Revised: 1 February 2022 / Accepted: 2 February 2022 / Published: 4 February 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Image captioning is an important task at the intersection of natural language processing (NLP) and computer vision (CV). The current quality of captioning models allows them to be used for practical tasks, but they require both large computational power and considerable storage space. Despite the practical importance of the image-captioning problem, only a few papers have investigated model compression with the aim of preparing such models for use on mobile devices. Furthermore, these works usually investigate only decoder compression in a typical encoder–decoder architecture, although the encoder traditionally occupies most of the space. We applied the most efficient model-compression techniques, such as architectural changes, pruning and quantization, to several state-of-the-art image-captioning architectures. As a result, all of these models were compressed by no less than 91% in terms of memory (including the encoder), while losing no more than 2% and 4.5% in the CIDEr and SPICE metrics, respectively. At the same time, the best model reached 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB, which sets a strong baseline for image-captioning model compression and could be used in practical applications.

1. Introduction

One of the most significant tasks combining two different domains, CV and NLP, is the image-captioning task [1]. Its goal is to automatically generate a caption describing an input image. The description should not only list the objects in the image, but also take into account their attributes, the interactions between them, etc., so that the description is as humanlike as possible. This is an important task for many practical applications in everyday life, such as human–computer interaction, assistance for visually impaired people and image search [2,3].
Typically, image-captioning models are based on an encoder–decoder architecture. Some models, such as [4,5,6,7,8], use a standard convolutional neural network (CNN) as the encoder; however, using object detectors for this purpose, as in [9,10], has been gaining popularity in recent years. The decoder is a text generator, most often a recurrent neural network (RNN) [11] or a long short-term memory network (LSTM) [12] with attention [13]. More complex models based on the transformer architecture [14], which is the state of the art in a variety of NLP problems, have also been created, applying transformers both to sentences [15,16,17,18] and to images [19].
Unsurprisingly, the rising interest in image captioning has significantly improved the quality of the models in a very short time. However, these improvements come not only from more complex neural network architectures, but also from an increase in the number of parameters and, accordingly, in the size of the captioning model. Moreover, it is not known for certain whether the gains in quality come from a more successful architecture and training method or simply from a larger number of parameters. At the same time, deep neural network models are increasingly deployed on mobile devices, which are limited both in computing power and in storage. As a result, current state-of-the-art models such as [9,10] cannot be used on mobile devices, as they typically occupy 500 MB or more. The problem of model compression for image captioning is very poorly studied; even the existing papers [20,21,22] usually investigate only decoder compression, although the encoder is an integral part of the model when applying it to new data.
In contrast, model-compression techniques in related areas of CV and NLP are well studied. For example, various architectures of image object detectors (which are often used as encoders for image captioning) [23,24,25] have long been investigated with the aim of significantly reducing their size and increasing their speed without losing quality (or even improving it). On the other hand, methods for compressing NLP models have been investigated for various tasks, from text classification [26] to machine translation [27], which is close in nature to the image-captioning problem.
In this article, we fill this gap and undertake a comprehensive application and analysis of deep neural network (DNN) model-compression techniques [28] for image-captioning models. We investigated various encoder architectures specifically tailored to object detection in low-resource environments. We also applied compression methods to the decoder: reducing the architecture of the model itself, as well as applying various pruning [29] and quantization [30] methods. As a result, we were able to significantly reduce the storage footprint of two state-of-the-art models, Up-Down [9] and AoANet [10], while losing little in terms of the important metrics. Using our approach, we reduced the size of the classic Up-Down model from 661.4 MB to 58.5 MB (including the encoder), that is, by 91.2%, while the main image-captioning metrics CIDEr [31] and SPICE [32] decreased from 120.1 and 21.4 to 118.4 and 20.8, respectively (by 1.4% and 2.8%). For the AoANet model, the size reduction was 95.6%, from 791.8 MB to 34.8 MB, while CIDEr and SPICE fell by 1.8% and 4.5%, from 129.8 to 127.4 and from 22.4 to 21.4, respectively. Thus, the best resulting model reaches 127.4 CIDEr and 21.4 SPICE at a size of only 34.8 MB.
The main contributions of this paper are as follows:
  • We have proposed the use of modern methods of model compression in relation to the image-captioning task, both for the encoder and for the decoder in order to reduce the overall size of the model;
  • We compared different options for such a reduction, such as different encoder models, different decoder architecture variations, as well as different pruning options, and conducted a study of the effect of quantization;
  • The proposed methods allowed us to greatly reduce the size of the models without a significant loss of quality;
  • The methods worked universally on two different models, which suggests that they can be successfully applied to other image-captioning architectures.
The paper is organized as follows:
  • In Section 2 we review work related to this paper, including image-captioning tasks in general, neural network model-compression techniques and their application to image-captioning tasks;
  • In Section 3 we describe our methodology, firstly describing encoder and then decoder compression approaches;
  • In Section 4 we report the design of our experiments as well as their numerical results and their analysis;
  • In Section 5 we state the main achievements and a discussion of the results;
  • In Section 6 we conclude the work and name future work directions.

2. Related Work

2.1. Image Captioning

The automatic generation of a caption describing an image is an important task for smoother human–machine interaction. One of the earliest successful methods of such generation, which laid the foundation for modern image captioning, is [4]. It has a typical encoder–decoder architecture, which first generates a representation of the image and then, based on this representation, generates text, most often word by word. Most modern image-captioning architectures follow this scheme.
Two key works that have significantly improved the performance of image-captioning models are [8,9]. The first suggested using the REINFORCE algorithm to directly optimize the discrete quality metric CIDEr, thereby directly improving its value. The second suggested using an object detector as the encoder (the article itself uses Faster R-CNN [24]) and then generating the caption based on the detected objects and their attributes, instead of trying to compress the entire image into one vector and generating the caption from such a representation.
In addition, starting with [5], the attention mechanism has been actively used for image captioning; it allows the model to adaptively increase its attention to the objects or regions that are most important for generating a textual description of the current image. It is employed by strong works such as [10,33,34] and others.
Recently, transformers [14] have been gaining more and more popularity, becoming state-of-the-art models for many tasks from the field of NLP. Their use improves the quality of image-captioning models as well, which is investigated in [15,16,18].
Research in the field of image captioning is also moving towards unifying captioning models with other vision–language understanding and generation tasks. Works such as [35,36] use networks pretrained on giant datasets and obtain a model capable of solving a number of vision–language problems. However, because such pretraining requires huge computational resources, it is difficult for researchers to improve such models and study them broadly.

2.2. Neural Network Models Compression

2.2.1. Effective Architectures

One of the most effective ways to reduce model size is to look for a more efficient architecture that contains fewer parameters but still performs well. The search for such architectures is especially active for tasks that, on the one hand, are very useful to solve in real time on mobile devices, and, on the other hand, tend to have cumbersome architectures. One of these is the task of detecting objects in an image, together with their classification and the determination of their attributes. For example, Faster R-CNN [24] uses a Region Proposal Network (RPN) and combines it with Fast R-CNN [23], with a convolutional neural network (VGG-16 [37]) as its backbone. SSD [38] first creates a set of default boxes over different aspect ratios and scales, and then determines whether they contain objects of interest. RetinaNet [39] is a small, dense detector that performs well thanks to training with the focal loss.
As a rule, a convolutional neural network backbone is one of the important components of a detector, so selecting an efficient convolutional network architecture is also on the agenda. Over time, various ideas have appeared for reducing the number of parameters and, accordingly, the space occupied by the model. One of the most popular, MobileNetV3 [40], was found by a neural architecture search building on previous MobileNet versions. Another widely used convolutional architecture is EfficientNet [41], which resulted from a study focused on how the number of parameters influences model quality. This network is also used in EfficientDet [25], which generalizes these ideas by transferring them to object detection.

2.2.2. Pruning

Pruning is one of the most popular and effective methods for reducing the size of models [42,43,44]. Its idea is to remove model parameters that are not useful or meaningful. Deleting such parameters does not cause much damage to the final quality, while saving storage space and computing resources.
In general, all pruning methods can be divided into two categories: structured pruning [45,46] and unstructured pruning. In structured pruning, entire rows, columns or channels of layers are removed. Such techniques are used both for CNN-based models and for models using RNNs.
The other category is unstructured pruning [47,48]. In these methods, the removal of weights is not tied to a specific structure of the model; any weights can be removed depending on a criterion of their “importance”. Such methods are actively used in different variations for various tasks. For our initial research, we focused on relatively simple methods that can achieve a good result.

2.2.3. Quantization

Another method of compressing a model without losing quality is quantization [49]. Its essence is to use lower-precision numbers for storing the model in order to reduce the occupied space without losing quality. Indeed, most practical applications do not require the precision that standard floating-point data types provide.
Quantization approaches could be divided into two groups based on the type of compression range defined: static and dynamic quantization. In static quantization [50,51], the clipping range is calculated before inference and remains the same during the model’s runtime. In the other group of methods, called dynamic quantization [52,53], the clipping range is dynamically calculated for each activation map during model application.

2.3. Model Size Reducing for Image Captioning

The area of model compression for image captioning is underexplored. Only a few works offer their own approaches to reducing the size of image-captioning models. In [54], the authors propose to use SqueezeNet [55] as an encoder and LightRNN [56] as a decoder. The authors of [57] propose a way to reduce model size by tackling the problem of huge embedding matrices, which do not scale to bigger vocabularies. In [20,21], novel pruning approaches for the decoder of image-captioning models are proposed.
However, all of these works investigate narrow aspects, for example, only reducing the size of the encoder or decoder or using only one method of compression. In contrast, our article is intended to provide a comprehensive understanding of the methods for compressing models for image captioning, using various combinations of approaches in order to reduce the final model size as much as possible without much loss of quality.

3. Methodology

This section describes the methods we used to reduce the size of the models: both the methods shared by the two investigated architectures (Up-Down and AoANet) and methods specific to each particular model. It is important to note that we did not aim to compare the Up-Down and AoANet models with each other, but rather to compare compression methods across several models. A comparison of the original models can be found in [10].

3.1. Encoder Compression

As mentioned above, the overwhelming majority of state-of-the-art image-captioning architectures use an image object detector as an encoder. That is, typical models are built according to the following scheme: let I be the input image, E be a detection model, and D be a decoder. Firstly, the method generates k image features $E(I) = V = \{v_1, v_2, \ldots, v_k\}$, $v_i \in \mathbb{R}^d$, such that each image feature encodes a region of the image. Then, the decoder generates the caption $D(V)$ based on these features.
We keep this general scheme, but as the encoder we take models that are better suited to our task. One option is to retain the Faster R-CNN detection model but replace its backbone (since the backbone contains most of the parameters). As such a backbone we propose MobileNetV3, which performs well under memory constraints. For comparison, we also take a detector that was designed from the start for a small memory footprint while maintaining good quality: EfficientDet.
Thus, as E we use the Faster R-CNN ResNet101 from the original Up-Down model, as well as Faster R-CNN MobileNetV3 and EfficientDet. Because the latter detectors are already small, we have not considered other methods of encoder compression.
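As a rough illustration of the size difference such an encoder swap brings, the sketch below compares the parameter footprint of two off-the-shelf torchvision detectors. Note that this is only indicative: torchvision ships a ResNet50-FPN Faster R-CNN rather than the ResNet101 variant used in the original Up-Down model, and the printed sizes count raw 32-bit parameters only.
```python
import torchvision

# Two built-in torchvision detectors: a heavier Faster R-CNN and a MobileNetV3-based one.
# (torchvision provides a ResNet50-FPN backbone, not the ResNet101 used in the original model.)
heavy_encoder = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=91)
light_encoder = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(num_classes=91)

def model_size_mb(model):
    """Approximate on-disk size, assuming 32-bit floats for every parameter."""
    return sum(p.numel() for p in model.parameters()) * 4 / 2**20

print(f"Faster R-CNN ResNet50-FPN:    {model_size_mb(heavy_encoder):.1f} MB")
print(f"Faster R-CNN MobileNetV3-FPN: {model_size_mb(light_encoder):.1f} MB")
```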

3.2. Decoder Compression

Assuming that current models contain more parameters than necessary to achieve the same quality, we first investigate whether the number of parameters can be reduced without changing the nature of the architecture. The specific way of changing the number of parameters depends on the model under study. Pruning methods are then applied to the best architecture, and the best-pruned model is finally quantized.

3.2.1. Architecture Changes

Following [9], the decoder model has three important logical parts:
  • Embeddings Calculation
    In this part, one-hot encoded vectors representing words are transformed into word vectors. Let $\Pi$ be the one-hot encoding of the input word and $W_e \in \mathbb{R}^{E \times |\Sigma|}$ be the word-embedding matrix for a vocabulary $\Sigma$. Then, the word vector $\pi$ is obtained by the following equation:
    $\pi = W_e \Pi$
    Here, the size of the parameter matrix depends on embedding dimension E and the size of the vocabulary | Σ | . As the size of the vocabulary is a fixed value based on the preprocessing of the dataset, we experimented with changing E and explored its influence on the overall model performance as well as on the model size.
  • Top-Down Attention LSTM and Language LSTM
    As these two modules are similar in terms of the number of parameters and inner representations, we treat them as one logical part. Both modules are LSTMs, so their operation over a single time step is the following:
    $h_t = \mathrm{LSTM}(x_t, h_{t-1})$,
    where $h_t \in \mathbb{R}^M$. The sizes of the LSTM parameter matrices strongly depend on the parameter M. We manipulated this parameter, trying to reduce it without causing large losses in the model's quality.
  • Attention Module
    The third important module of the Up-Down architecture is the Attention Module. It is used to calculate attention weights $\alpha_t$ for the encoder's features $v_i$. It works in the following way:
    $a_{i,t} = w_a^{T} \tanh(W_{va} v_i + W_{ha} h_{t-1})$
    $\alpha_t = \mathrm{softmax}(a_t)$,
    where $W_{va} \in \mathbb{R}^{H \times V}$, $W_{ha} \in \mathbb{R}^{H \times M}$. The main constants that influence the number of weights in this module are H, V and M. V is the size of the encoder vectors, which is fixed by the choice of encoder; M has been discussed previously; and H is the dimension of the inner attention representation. We investigated how changing H, along with the other parameters, influenced both the model size and its performance.
Thus, in trying to reach a smaller model size without harming its performance metrics, we manipulated the model parameters E, M and H.
The AoANet model described in [10], similarly to Up-Down, has both an embedding-calculation part and an LSTM part, so the reasoning for the parameters E and M is the same as above. However, the other parts of its architecture are different, so we do not experiment with them in this paper.
We reduce the values of E, M and H for Up-Down (and E and M for AoANet) used in the original papers by multiplying each of them by the same scale factor γ.
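As a minimal sketch of this uniform scaling, the snippet below applies a single factor γ to the decoder dimensions; the baseline values are illustrative assumptions rather than the exact settings of the original papers.
```python
from dataclasses import dataclass, replace

@dataclass
class DecoderConfig:
    # Illustrative baseline dimensions (assumed, not the exact values from the original papers).
    embed_dim: int = 1000    # E: word-embedding dimension
    hidden_dim: int = 1000   # M: LSTM hidden-state dimension
    attn_dim: int = 512      # H: attention projection dimension (Up-Down only)

def scale_config(cfg: DecoderConfig, gamma: float) -> DecoderConfig:
    """Multiply E, M and H by the same scale factor gamma (gamma = 0.5 in our final models)."""
    return replace(
        cfg,
        embed_dim=int(cfg.embed_dim * gamma),
        hidden_dim=int(cfg.hidden_dim * gamma),
        attn_dim=int(cfg.attn_dim * gamma),
    )

half_cfg = scale_config(DecoderConfig(), gamma=0.5)  # E = M = 500, H = 256
```
Since the LSTM weight matrices grow roughly quadratically with these dimensions while the embedding matrix grows linearly, γ = 0.5 shrinks the decoder to roughly a third of its original size, in line with the results reported in Table 2.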

3.2.2. Decoder Pruning

Because most of the layers of the decoders we considered are linear or LSTM layers, the most suitable pruning method is unstructured pruning. We carry it out after training, with the model weights fixed.
The decoder has two parts: calculating embeddings and processing them. The embedding part is treated separately because it is responsible for selecting the vectors that represent the words. In order to determine the effect of pruning the embeddings, as the main semantic part, on the quality of a model, we consider two options: pruning the entire model and pruning everything except the embeddings.
Let the model to prune be $M(W_1, W_2)$, where $W_1 \in \mathbb{R}^{m_1}$ is a vector of model parameters to which the pruning algorithm is not applied, and $W_2 \in \mathbb{R}^{m_2}$ is a vector of model parameters to which the pruning algorithm is applied, sorted by increasing $l_1$ norm; let $\alpha \in [0, 1]$ be the pruning coefficient, and $A(W, \alpha)$ be the pruning algorithm, generating the mask of weights to be pruned. The final model after pruning is then $M(W_1, W_2 \cdot A(W_2, \alpha))$, where $\cdot$ denotes elementwise multiplication.
We concentrate on two methods of choosing weights to prune:
  • random pruning: $A(W)$ generates a mask $M \in \{0, 1\}^{|W|}$, where $M_i = 0$ with probability $\alpha$;
  • $l_1$ pruning: $A(W)$ generates a mask $M \in \{0, 1\}^{|W|}$, where $M_i = 0$ if $i < \alpha |W|$ (weights being indexed in order of increasing $l_1$ norm).
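A minimal sketch of these two mask generators and of post-training pruning is given below, assuming PyTorch tensors and a decoder whose embedding layer can be recognized by an "embed" substring in its parameter names (a naming assumption for illustration only).
```python
import torch

def l1_mask(W: torch.Tensor, alpha: float) -> torch.Tensor:
    """Zero out the fraction alpha of weights with the smallest absolute (l1) magnitude."""
    mask = torch.ones_like(W)
    k = int(alpha * W.numel())
    if k > 0:
        smallest = torch.topk(W.abs().flatten(), k, largest=False).indices
        mask.view(-1)[smallest] = 0.0
    return mask

def random_mask(W: torch.Tensor, alpha: float) -> torch.Tensor:
    """Zero out each weight independently with probability alpha."""
    return (torch.rand_like(W) >= alpha).float()

def prune_decoder(decoder: torch.nn.Module, alpha: float,
                  method=l1_mask, prune_embeddings: bool = True) -> None:
    """Post-training unstructured pruning: multiply each parameter tensor by its mask in place."""
    with torch.no_grad():
        for name, param in decoder.named_parameters():
            if not prune_embeddings and "embed" in name:  # assumed naming of the embedding layer
                continue
            param.mul_(method(param, alpha))
```
PyTorch's built-in torch.nn.utils.prune utilities (l1_unstructured, random_unstructured) offer equivalent per-layer functionality.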

3.2.3. Decoder Quantization

Having fixed the model, we use post-training dynamic quantization. When converting from floating point to integers, each value is multiplied by a scaling constant and the result is rounded so that it fits into an integer type. This constant can be defined in different ways. In dynamic quantization, the model weights are quantized ahead of time and stored in compressed form, while the constants used to scale the activations are computed at run time, depending on the input data. This approach allows us to maintain maximum accuracy while storing the model in compressed form. We chose this quantization method because we are primarily interested in the accuracy and size of the model rather than in its inference speed.
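A minimal sketch of this step in PyTorch, assuming the decoder's parameter-heavy layers are nn.Linear and nn.LSTM modules:
```python
import torch

def quantize_decoder(decoder: torch.nn.Module) -> torch.nn.Module:
    """Post-training dynamic quantization: weights are stored as int8, while
    activation scaling constants are computed on the fly for each input."""
    return torch.quantization.quantize_dynamic(
        decoder,
        {torch.nn.Linear, torch.nn.LSTM},  # layer types whose weights are quantized
        dtype=torch.qint8,
    )
```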

4. Experiments

4.1. Experimental Setup

To compare the effectiveness of different model-compression methods, we used MSCOCO [58], the most common image-captioning benchmark dataset. It consists of 82,783 training images and 40,504 validation images, with five different captions for each image. We used the standard Karpathy split from [7] for offline evaluation. As a result, the final dataset consists of 113,287 images for training, 5000 images for validation and 5000 images for testing.
We also preprocessed the dataset by replacing all words that occur fewer than five times with a special token <UNK>. Following commonly used procedures, we also truncated captions to a maximum length of 16 words.
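A small sketch of this preprocessing, with hypothetical helper names:
```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep only words that occur at least min_count times in the training captions."""
    counts = Counter(word for caption in captions for word in caption.split())
    return {word for word, n in counts.items() if n >= min_count}

def preprocess_caption(caption, vocab, max_len=16):
    """Map rare words to <UNK> and truncate the caption to max_len tokens."""
    tokens = [w if w in vocab else "<UNK>" for w in caption.split()]
    return tokens[:max_len]
```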
Metrics BLEU [59], METEOR [60], ROUGE-L [61], CIDEr [31] and SPICE [32] were used to compare the results. BLEU came from a machine translation task. It uses n-grams precision to calculate a similarity score between reference and generated sentences. METEOR uses synonym matching functions along with n-grams matching. ROUGE-L is based on the longest common subsequence statistics. Two of the most important metrics for image captioning are CIDEr and SPICE, as they are human consensus metrics. CIDEr is based on n-grams and performs TF-IDF weighting on each of them. SPICE uses semantic graph parsing in order to determine how well the model has captured the attributes of objects and their relationships.
For our experiments, we took models based on the https://github.com/ruotianluo/ImageCaptioning.pytorch (accessed on 4 November 2021) public repository. The implementation framework was PyTorch. For encoders, the built-in PyTorch Faster R-CNN was used, as well as the https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch (accessed on 4 November 2021) repository for EfficientDet. All models were trained for 30 epochs in the usual way, and then for 10 more epochs in a self-critical style following [8]. The models were trained and tested on Google Cloud Platform using servers with eight CPU cores, 32 GB of RAM and one Tesla V100 GPU accelerator.

4.2. Encoder Compression

In Table 1, we present a comparison of the Up-Down and AoANet models with different encoders. The variants with the Faster R-CNN ResNet101 encoder correspond to the original models from the respective papers. It can be observed that all of the proposed variations, both for Up-Down and for AoANet, have quality metrics very close to each other. This aligns well with the fact that, despite their small size, the object-detection models used as encoders in these experiments perform very well on the standalone object-detection task.
Furthermore, using smaller encoders designed to run on mobile devices reduces the encoder's size from 468.4 MB to 15.1 MB (a 96.8% reduction), while losing only 0.6 CIDEr and 0.3 SPICE points. We choose EfficientDet-D0 as the most suitable encoder among those tested and use it in all further experiments.

4.3. Decoder Architecture Changes

In Table 2 we evaluate models whose decoders were reduced using different scale factors. A value of γ equal to 1 corresponds to the original models, but, as noted above, with the EfficientDet-D0 encoder. For the Up-Down model as well as for AoANet, decoder reduction with a scale factor of 0.5 shows results comparable to the unreduced model, but with approximately three times less memory and three times fewer parameters. Reducing the model further leads to a greater loss in quality: for example, the Up-Down model's CIDEr score decreases by only 0.3 points when the scale factor of 0.5 is used, but by a further 4.9 points when the scale factor of 0.25 is used, which is a much bigger loss. A similar CIDEr drop is seen for the AoANet model. Thus, a scale factor of γ = 0.5 acts as a compromise that both reduces the model size and keeps the quality metrics at the same level. We use decoders reduced with a scale factor of 0.5 in all experiments for the rest of the paper.

4.4. Decoder Pruning

The results of pruning using different methods and pruning coefficients are presented in Figure 1 and Figure 2. The blue and orange lines in both figures correspond to $l_1$ pruning, while the red and green lines correspond to random pruning. “True” denotes pruning the embedding layer and “False” denotes leaving it unpruned. NNZ stands for “number of non-zero parameters”, which is the common measure for comparing pruning techniques.
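As a side note, NNZ can be computed directly from a pruned PyTorch model with a one-line helper (a simple sketch):
```python
import torch

def count_nnz(model: torch.nn.Module) -> int:
    """Number of non-zero parameters (NNZ) of a (possibly pruned) model."""
    return sum(int((p != 0).sum()) for p in model.parameters())
```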
As can be seen, random pruning is strictly worse than $l_1$ pruning. This is expected, since removing the least valuable weights of the model affects its quality metrics less than removing randomly chosen parameters.
Another observation is that, although not pruning the embedding layer helps to maintain higher quality at large pruning coefficients for the Up-Down model, it makes almost no difference for the AoANet model. Additionally, a sharp decline in metrics is observed for the Up-Down model beyond α = 0.1, and for AoANet beyond α = 0.5. Taking this into account, we are in any case not interested in the region of Up-Down pruning coefficients where the choice of pruning or not pruning the embeddings plays a role.
As stated before, the largest pruning coefficients (which lead to greater model compression) at which the models still show good quality are 0.1 and 0.5 for Up-Down and AoANet, respectively; increasing these values leads to a drop in metrics. More detailed comparisons of model results near this drop, along with the unpruned model quality, are reported in Table 3. The table confirms the observations obtained from the figures. The chosen boundary values of the pruning coefficients reduce the decoder size by about 10% while losing only 0.8 CIDEr points for the Up-Down model, and by about 50% while losing only 1.5 CIDEr points for the AoANet model. Such a large reduction for AoANet could indicate that the overall model is bigger and has more parameters than needed to achieve comparable results with its architecture. Furthermore, this model's weights could be quite sparse, with many of them close to 0; in this case, such extensive pruning would not lead to a big loss in performance.

4.5. Decoder Quantization

For the quantization experiments, the best models obtained in the previous section were used: the Up-Down model pruned with α = 0.1 and the AoANet model pruned with α = 0.5. The results are given in Table 4. Quantization reduces the model size with almost no loss of quality. The resulting decoder size is 43.4 MB for the Up-Down model and 19.7 MB for the AoANet model.

5. Discussion

The final models, compressed using all of the proposed methods, are compared with the original ones in Table 5. It can be observed that our methods significantly reduce model size with a comparatively small change in performance metrics. The Up-Down model's size was reduced from 661.4 MB to 58.5 MB, with a loss of 1.7 points in CIDEr and 0.6 points in SPICE. The AoANet model's size was reduced from 791.8 MB to 34.8 MB, with a loss of 2.4 points in CIDEr and 1 point in SPICE. The proposed methods worked well for both models, which is a good indicator of their generalizability. Sizes of 58.5 MB and 34.8 MB are already small enough for use on mobile devices in real-world applications.

6. Conclusions

In this work we proposed the use of different neural network compression techniques for models trained to solve image-captioning tasks. Our extensive experiments showed that all of the proposed methods help to reduce model size with an insignificant loss in quality. The best obtained model reaches 127.4 CIDEr and 21.4 SPICE while occupying only 34.8 MB, which sets a strong baseline for future studies on this topic.
In future work, other compression methods such as knowledge distillation could be investigated in order to reduce the models' sizes even further, and more sophisticated pruning or quantization techniques could be applied. Another direction for further research is encoder compression, using not only different architectures but also other methods such as pruning and quantization. Reducing the size of larger and more complex models such as [35,36] is also a promising direction of study.

Author Contributions

Conceptualization, methodology, software, writing—original draft preparation, visualization, investigation, editing V.A.; writing—review, supervision, project administration, funding acquisition D.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Staniūtė, R.; Šešok, D. A Systematic Literature Review on Image Captioning. Appl. Sci. 2019, 9, 2024. [Google Scholar] [CrossRef] [Green Version]
  2. Zafar, B.; Ashraf, R.; Ali, N.; Iqbal, M.K.; Sajid, M.; Dar, S.H.; Ratyal, N.I. A novel discriminating and relative global spatial image representation with applications in CBIR. Appl. Sci. 2018, 8, 2242. [Google Scholar] [CrossRef] [Green Version]
  3. Belalia, A.; Belloulata, K.; Kpalma, K. Region-based image retrieval in the compressed domain using shape-adaptive DCT. Multimed. Tools Appl. 2016, 75, 10175–10199. [Google Scholar] [CrossRef]
  4. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  5. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 June 2015; pp. 2048–2057. [Google Scholar]
  6. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7219–7228. [Google Scholar]
  7. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  8. Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
  9. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
  10. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
  11. Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; Khudanpur, S. Recurrent neural network based language model. Interspeech 2010, 2, 1045–1048. [Google Scholar]
  12. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  15. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10578–10587. [Google Scholar]
  16. Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled Transformer for Image Captioning. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8928–8937. [Google Scholar]
  17. Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4467–4480. [Google Scholar] [CrossRef] [Green Version]
  18. Zhu, X.; Li, L.; Liu, J.; Peng, H.; Niu, X. Captioning transformer with stacked attention modules. Appl. Sci. 2018, 8, 739. [Google Scholar] [CrossRef] [Green Version]
  19. He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image captioning through image transformer. In Proceedings of the Asian Conference on Computer Vision, Online, 30 November–4 December 2020. [Google Scholar]
  20. Tan, J.H.; Chan, C.S.; Chuah, J.H. End-to-End Supermask Pruning: Learning to Prune Image Captioning Models. Pattern Recognit. 2022, 122, 108366. [Google Scholar] [CrossRef]
  21. Tan, J.H.; Chan, C.S.; Chuah, J.H. Image Captioning with Sparse Recurrent Neural Network. arXiv 2019, arXiv:1908.10797. [Google Scholar]
  22. Dai, X.; Yin, H.; Jha, N.K. Grow and prune compact, fast, and accurate LSTMs. IEEE Trans. Comput. 2019, 69, 441–452. [Google Scholar] [CrossRef] [Green Version]
  23. Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  26. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. Fasttext. zip: Compressing text classification models. arXiv 2016, arXiv:1612.03651. [Google Scholar]
  27. See, A.; Luong, M.T.; Manning, C.D. Compression of Neural Machine Translation Models via Pruning. arXiv 2016, arXiv:1606.09274. [Google Scholar]
  28. Choudhary, T.; Mishra, V.; Goswami, A.; Sarangapani, J. A comprehensive survey on model compression and acceleration. Artif. Intell. Rev. 2020, 53, 5113–5155. [Google Scholar] [CrossRef]
  29. Reed, R. Pruning algorithms-a survey. IEEE Trans. Neural Netw. 1993, 4, 740–747. [Google Scholar] [CrossRef]
  30. Guo, Y. A Survey on Methods and Theories of Quantized Neural Networks. arXiv 2018, arXiv:1808.04752. [Google Scholar]
  31. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  32. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 382–398. [Google Scholar]
  33. Huang, L.; Wang, W.; Xia, Y.; Chen, J. Adaptively Aligned Image Captioning via Adaptive Attention Time. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8940–8949. [Google Scholar]
  34. Wang, W.; Chen, Z.; Hu, H. Hierarchical attention network for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8957–8964. [Google Scholar]
  35. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 121–137. [Google Scholar]
  36. Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 5579–5588. [Google Scholar]
  37. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002, 2980–2988. [Google Scholar]
  40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  41. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  42. Hoefler, T.; Alistarh, D.; Ben-Nun, T.; Dryden, N.; Peste, A. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. arXiv 2021, arXiv:2102.00554. [Google Scholar]
  43. He, Y.; Ding, Y.; Liu, P.; Zhu, L.; Zhang, H.; Yang, Y. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 2009–2018. [Google Scholar]
  44. Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. arXiv 2020, arXiv:2006.05467. [Google Scholar]
  45. Anwar, S.; Hwang, K.; Sung, W. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. (JETC) 2017, 13, 1–18. [Google Scholar] [CrossRef] [Green Version]
  46. He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; Han, S. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 784–800. [Google Scholar]
  47. Lee, N.; Ajanthan, T.; Torr, P. Snip: Single-Shot Network Pruning Based on Connection Sensitivity. arXiv 2018, arXiv:1810.02340. [Google Scholar]
  48. Xiao, X.; Wang, Z. Autoprune: Automatic network pruning by regularizing auxiliary parameters. In Proceedings of the Advances in Neural Information Processing Systems 32, (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  49. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv 2021, arXiv:2103.13630. [Google Scholar]
  50. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar]
  51. Yao, Z.; Dong, Z.; Zheng, Z.; Gholami, A.; Yu, J.; Tan, E.; Wang, L.; Huang, Q.; Wang, Y.; Mahoney, M.; et al. Hawq-v3: Dyadic neural network quantization. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11875–11886. [Google Scholar]
  52. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized clipping activation for quantized neural networks. arXiv 2018, arXiv:1805.06085. [Google Scholar]
  53. Li, R.; Wang, Y.; Liang, F.; Qin, H.; Yan, J.; Fan, R. Fully quantized network for object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–25 June 2019; pp. 2810–2819. [Google Scholar]
  54. Parameswaran, S.N. Exploring memory and time efficient neural networks for image captioning. In National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics; Springer: Singapore, 2017; pp. 338–347. [Google Scholar]
  55. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  56. Li, X.; Qin, T.; Yang, J.; Liu, T.Y. LightRNN: Memory and computation-efficient recurrent neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4392–4400. [Google Scholar]
  57. Tan, J.H.; Chan, C.S.; Chuah, J.H. Comic: Toward a compact image captioning model with attention. IEEE Trans. Multimed. 2019, 21, 2686–2696. [Google Scholar] [CrossRef]
  58. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
  59. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–7 July 2002; pp. 311–318. [Google Scholar]
  60. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Association for Computational Linguistics Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 9 June 2005; pp. 65–72. [Google Scholar]
  61. Lin, C.Y.; Och, F.J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004; p. 605. [Google Scholar]
Figure 1. CIDEr and SPICE validation scores as a function of the pruning coefficient α for different pruning methods applied to the Up-Down model. “L1” indicates that $l_1$ pruning was used, whilst “Random” indicates that random pruning was used. “True” indicates pruning of the embedding layer and “False” indicates no pruning.
Figure 2. CIDEr and SPICE validation scores as a function of the pruning coefficient α for different pruning methods applied to the AoANet model. “L1” indicates that $l_1$ pruning was used, whilst “Random” indicates that random pruning was used. “True” indicates pruning of the embedding layer and “False” indicates no pruning.
Table 1. Evaluation results of all tested models with different encoders. Both the “Size” and “#Params” columns refer to the encoder.

Encoder Type | B@1 | B@4 | M | R | C | S | Size | #Params
Up-Down
Faster R-CNN ResNet101 [9] | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 | 468.4 MB | 122.6 M
Faster R-CNN MobileNetV3 | 79.3 | 35.6 | 27.4 | 56.8 | 119.4 | 21 | 74.2 MB | 19.4 M
EfficientDet-D0 | 79.4 | 35.8 | 27.3 | 56.7 | 119.5 | 21.1 | 15.1 MB | 3.9 M
EfficientDet-D1 | 79.7 | 36.2 | 27.5 | 56.9 | 120 | 21.3 | 25.7 MB | 6.6 M
AoANet
Faster R-CNN ResNet101 [10] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 | 468.4 MB | 122.6 M
Faster R-CNN MobileNetV3 | 79.4 | 38 | 29.5 | 58.2 | 127.2 | 21.9 | 74.2 MB | 19.4 M
EfficientDet-D0 | 79.8 | 38.3 | 29.6 | 58.3 | 129.2 | 22.1 | 15.1 MB | 3.9 M
EfficientDet-D1 | 80.1 | 38.4 | 29.8 | 58.4 | 129.5 | 22.1 | 25.7 MB | 6.6 M
Table 2. Evaluation results of all tested models with decoders reduced using different scale factors. Both the “Size” and “#Params” columns refer to the decoder.

Decoder Scale Factor | B@1 | B@4 | M | R | C | S | Size | #Params
Up-Down
γ = 1 | 79.4 | 35.8 | 27.3 | 56.7 | 119.5 | 21.1 | 193 MB | 50.1 M
γ = 0.5 | 79.4 | 35.2 | 27.1 | 56.6 | 119.2 | 20.9 | 68.8 MB | 18 M
γ = 0.25 | 78.3 | 33.2 | 26.2 | 55.5 | 114.3 | 19.8 | 27.5 MB | 7.2 M
AoANet
γ = 1 | 79.8 | 38.3 | 29.6 | 58.3 | 129.2 | 22.1 | 323.4 MB | 84.8 M
γ = 0.5 | 79.5 | 38.5 | 29.5 | 58.3 | 129.1 | 22.1 | 100.7 MB | 26.4 M
γ = 0.25 | 78.5 | 36.4 | 28.6 | 57.4 | 122.9 | 20.9 | 35.2 MB | 9.2 M
Table 3. Evaluation results of all tested models with decoders pruned using different pruning coefficients. Both the “Size” and “NNZ” columns refer to the decoder.

Pruning Coefficient | B@1 | B@4 | M | R | C | S | Size | NNZ
Up-Down
α = 0 | 79.4 | 35.2 | 27.1 | 56.6 | 119.2 | 20.9 | 68.8 MB | 18 M
α = 0.1 | 79.2 | 35.1 | 26.9 | 56.3 | 118.4 | 20.8 | 62 MB | 16.2 M
α = 0.3 | 77.5 | 32.6 | 25.7 | 54.7 | 109.4 | 19.5 | 48.2 MB | 12.6 M
AoANet
α = 0 | 79.5 | 38.5 | 29.5 | 58.3 | 129.1 | 22.1 | 100.7 MB | 26.4 M
α = 0.5 | 79.3 | 38.2 | 29.1 | 58.3 | 127.6 | 21.5 | 50.4 MB | 13.2 M
α = 0.7 | 75.5 | 34.1 | 26.6 | 56.3 | 112 | 19 | 30.2 MB | 7.9 M
Table 4. Evaluation results of all tested models’ decoders with and without quantization. The “Size” column refers to the decoder.

Model Variant | B@1 | B@4 | M | R | C | S | Size
Up-Down
Without quantization | 79.2 | 35.1 | 26.9 | 56.3 | 118.4 | 20.8 | 62 MB
With quantization | 79.2 | 35.3 | 26.9 | 56.3 | 118.4 | 20.8 | 43.4 MB
AoANet
Without quantization | 79.5 | 38.2 | 29.1 | 58.3 | 127.6 | 21.5 | 50.4 MB
With quantization | 79.4 | 38.3 | 29.0 | 58.3 | 127.4 | 21.4 | 19.7 MB
Table 5. Evaluation results of all tested models, both original and compressed using the methods from this paper. Both the “Size” and “NNZ” columns refer to the whole model.

Model Variant | B@1 | B@4 | M | R | C | S | Size | NNZ
Up-Down
Original model | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 | 661.4 MB | 172.7 M
Compressed model | 79.2 | 35.3 | 26.9 | 56.3 | 118.4 | 20.8 | 58.5 MB | 20.1 M
AoANet
Original model | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 | 791.8 MB | 207.4 M
Compressed model | 79.4 | 38.3 | 29.0 | 58.3 | 127.4 | 21.4 | 34.8 MB | 17.1 M
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
