1 Introduction

The image captioning task aims at describing the contents of an image in natural language (Mishra et al. 2021) and is accomplished by combining Computer Vision techniques with Natural Language Processing (NLP) methods. The general idea of an image captioning system is to encode the input image into a vector using computer vision techniques and then decode that vector into words using a decoder drawn from NLP language models. An example of image captioning is illustrated in Fig. 1: images are the input of the system and captions are the output. Benchmark image captioning datasets for English include Flickr8K (Hodosh et al. 2013), NOCAPS (Agrawal et al. 2019) and MSCOCO (Lin et al. 2014). Since natural language generation is a key part of a captioning system, the BLEU score is the common evaluation metric (Papineni et al. 2002).

Fig. 1 Introductory examples of image captioning (source: https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2)

The applications of this task are wide and varied, including but not limited to: assisting visually impaired individuals in browsing the web (Makav and Kılıç 2019; Fisch et al. 2020; Liu et al. 2020), enhancing image search with semantic information (Lindh et al. 2020), navigating video scenes (Wang et al. 2020; Zhou et al. 2020a), and enabling AI-driven cars to better understand their environment (Kim et al. 2018; Xu et al. 2015; Zhou et al. 2020b).

Attention mechanisms have driven much of the recent progress in image captioning. Inspired by the soft attention of Bahdanau et al. (2015), Xu et al. (2015) proposed a visual-attention model, trained deterministically with standard back-propagation, that learns to attend to objects as well as non-objects (semantics) while generating each output token, achieving state-of-the-art performance on Flickr8k, Flickr30k and MS COCO (Young et al. 2014). Aneja et al. (2018) later matched this performance with a purely convolutional architecture that replaces the LSTM with masked feed-forward convolutions restricted to past-word information, and Huang et al. (2019) proposed an "attention on attention" (AoA) module, which extends conventional attention mechanisms to determine the relevance between attention results and the current context; applying AoA to both the encoder and the decoder of an image captioning model achieved new state-of-the-art (SOTA) results (Wang et al. 2022).

1.1 Research objectives and our contributions

Urdu is an Indo-Aryan language that has borrowed a large percentage of its vocabulary from other languages such as Arabic and Persian (Amjad et al. 2020). Ethnologue, a well-known reference source that publishes statistics on living languages, ranked Urdu as the \(11^{th}\) most spoken language in the world in 2020. It is also widely acknowledged as a major South Asian language, with 490 million native speakers worldwide (Shaik and Venkatramaphanikumar 2021). It is an official language of five Indian states, including Bihar, Uttar Pradesh, and Jharkhand, and it is the national language of Pakistan, which has a population of about 220 million people. According to the 2011 linguistic census conducted by the Indian government, India had 50,772,631 Urdu speakers. Urdu speakers can also be found in the United Kingdom, the United States, Canada, Australia, the Middle East, and Europe.

Urdu uses the Arabic script in a cursive format (Nastaliq style) with a segmental writing system. Specifically, the Urdu language is based on an "abjad" system in which the long vowels and consonants must be written while the short vowels (diacritics) are optional. It is a bidirectional language: numerals are written from left to right, while characters are written from right to left. When characters are joined to form words, they take different shapes depending on the context. Specifically, a character can have a maximum of four shape variants, known as initial, medial, final and isolated. Characters that can take all four shapes are known as joiners, while characters that have only two shapes (final and isolated) are known as non-joiners (Kanwal et al. 2020).

Unlike in English, a white-space character is not a reliable word-boundary indicator in Urdu; that is, Urdu does not have consistent word-boundary markings. For example, a writer may insert a space within a word

figure a

(respectable) in order to make it visually correct, where the character . represents the ASCII space character. If the writer omits the space, it leads to an incorrect visual form

figure b

of the same word. Conversely, the writer may omit the space between two words

figure c

(Urdu language) because the shape of the characters remains the same with or without the space. That is, Urdu words ending with non-joiner characters exhibit the correct shape even without a space, so a writer may omit the space between such words.

Urdu is often regarded as a low-resource language owing to the lack or inadequacy of various critical resources, such as gold-standard datasets and fundamental natural language processing (NLP) toolkits, including reliable tokenizers and stemmers (Shaik and Venkatramaphanikumar 2021). Our discussion, however, focuses on the limitations of Urdu with respect to the image captioning task. Some key limitations are as follows.

  • Lack of attention. The image captioning task has been extensively investigated for resource-rich languages such as English. However, to the best of our knowledge, no published work exists in the realm of neural image caption generation for Urdu, a low-resource language that is morphologically more complex than English (Mahmood et al. 2020; Malik et al. 2021).

  • Unavailability of resources. As mentioned earlier, this is the first study on generative image captioning in Urdu, and no existing corpus is available for the task. We therefore introduce a new corpus in this paper.

Our contributions. The contributions of this work are as follows:

  • We present a new dataset for Urdu image captioning which can be accessed via GitHub.Footnote 1

  • We also discuss different types of attention-based architectures for image captioning in Urdu. These attention mechanisms are new for Urdu, as they have never been applied to the Urdu image captioning task.

  • Further, we present a quantitative and qualitative analysis of the results, studying the impact of different model architectures on the image caption generation task in Urdu.

  • Finally, we show that the best model achieves a BLEU-1 score of 72.5, BLEU-2 of 56.9, BLEU-3 of 42.8, and BLEU-4 of 31.6 on the Urdu image caption generation task.

The rest of the paper is organized as follows. Section 2 reviews the existing image captioning techniques. Section 3 discusses methodology and experimental setup. Section 4 presents the experimental results. Section 5 presents the conclusions and future work directions.

2 Literature review

Image captioning techniques can be organized into extractive and generative approaches. More details on extractive and generative captioning are provided in the following subsections.

2.1 Extractive captioning

The earliest approaches relied on hand-engineered features for visual elements and rule-based systems for language modelling. Some progress was reported using human-engineered templates and by piecing together phrases containing detected objects. Hodosh et al. (2013) treated sentence-based image annotation as a ranking problem over a given pool of captions, while several studies formulated the task as a retrieval problem and proposed solutions that embed images and text in the same space (Gong et al. 2014; Li et al. 2020; Zhou et al. 2020a). Socher et al. (2014) used deep learning to co-embed images and sentences, and Karpathy et al. (2014) jointly embedded image sub-regions and sub-sentences. Regional attributes have been used in many image captioning methods to alleviate the issues with predetermined caption templates. Farhadi et al. (2010) used detections to infer a triplet of image regions and returned suitable text by filling in a textual template. Li et al. (2011) used object detections and then pieced together a final description from phrases containing detected objects, modifiers and locations using web-scale n-grams. Yao et al. (2010) introduced a web-ontology-language based semantic representation produced by parsing images, which is then converted to human-readable text. Kulkarni et al. (2013) went beyond triplets in their detections but still relied on template-based text generation. The advantage of template-based methods is that the resulting captions tend to be grammatically correct; however, they use hard-coded visual concepts and therefore struggle to produce variety in the output. Kuznetsova et al. (2014) extracted images similar to the query image, extracted noun, verb and prepositional phrases from the captions of those images, and then ran an object detector on the query image, composing captions by pairing the detected objects with the relevant phrases from the retrieved captions.

2.2 Generative captioning via deep learning

In contrast to the aforementioned dual-stage methods, the recent trend for image-to-text generation is to use deep learning based encoder-decoder architectures that connect a CNN to an RNN and learn the mapping from images to sentences without any rules or human-engineered features. For example, Mao et al. (2014) proposed a multimodal RNN (m-RNN) that estimates the probability distribution of the next token given the previous tokens and the deep CNN feature of the image at each time step. Similarly, Kiros et al. (2014) constructed a joint embedding space using a deep CNN to encode the image and a long short-term memory (LSTM) model to encode the text. Karpathy et al. (2014) also proposed a multimodal RNN generative model, but in contrast to Mao et al. (2014), their RNN is conditioned on the image information only at the first time step. The first landmark paper that reported tangible results was by Vinyals et al. (2015), who combined a deep CNN for image classification with an LSTM for sequence modelling to create a single network that generates descriptions of images. Chen and Lawrence Zitnick (2015) learned a bi-directional mapping between images and their sentence-based descriptions, which additionally enables reconstruction of visual features when a caption is given as input. Tanti et al. (2017, 2018) conjectured that in a CNN-RNN setting for image caption generation, the image information can be fed to the network either directly in the RNN, i.e. conditioning the language model (LM) by 'injecting', or in a layer following the RNN, i.e. conditioning the LM by 'merging' image features; the latter allows the RNN's hidden state vector to shrink in size by up to four times. Their results suggest that the visual and linguistic modalities need not be jointly encoded by the RNN, since doing so yields large, memory-intensive models with few tangible advantages in performance; rather, the multimodal integration should be delayed to a subsequent stage.

2.3 Attention driven generative captioning

Bahdanau et al. (2015) proposed the soft attention mechanism for machine translation, which produced revolutionary results by generating target-language tokens while conditioning the LM on the previous prediction and learning to shift attention over parts of the source-sentence representation. Inspired by this work, Xu et al. (2015) proposed a model based on visual attention, trained in a deterministic manner using standard back-propagation and additionally learning to softly attend to objects as well as non-objects (semantics) while generating the corresponding tokens in the output sequence. Their model produced state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO (Young et al. 2014). Later, Aneja et al. (2018) achieved a similar score using a purely convolutional architecture, replacing the LSTM with feed-forward masked convolutions that restrict the convolution operations to use only past-word information. Huang et al. (2019) proposed an "attention on attention" (AoA) module, which extends conventional attention mechanisms to determine the relevance between attention results and the current context; applying AoA to both the encoder and the decoder of an image captioning model achieved new state-of-the-art results (Table 1).

Table 1 Summary of recent image captioning models for English

3 Methodology and experimental setup

We chose ResNet-101 (He et al. 2016) as the encoder and an LSTM as the decoder. We used two encoder-decoder architectures: (i) the Merge Model (Tanti et al. 2018) as a baseline and (ii) the attention-driven, context-based model (Xu et al. 2015) as our main model, as shown in Fig. 2.

3.1 Dataset

To prepare the image-mapped Urdu dataset, we use the Flickr8K dataset (Hodosh et al. 2013) for cross-reference; it is a standard dataset that is widely used by the research community for image caption generation in English. The Flickr8K dataset comprises 8000 images, each presented with 5 English captions. We selected a subset of 1800 images from Flickr8K and had the five English captions of each manually translated into Urdu by a native speaker, followed by several rounds of quality control involving another native speaker of Urdu, producing 9000 Urdu captions in total. We call this dataset Dogs Flickr8K (see the Appendix for more details).

Fig. 2 Caption prediction using the attention-driven Inject model

3.2 Model training

The data is randomized and split into 1440 training images, 180 validation images and 180 test images. Each image has five captions, resulting in a corresponding split of 7200 training, 900 validation and 900 test captions.

For the encoder of our baseline model, we remove the final classification layer 'FC' and harness the image feature vector from the penultimate layer. For our main model, which is based on attended annotation vectors, we make use of spatial context instead: we strip off the trailing layers after the convolutions, i.e. the pooling and fully connected (dense) layers, and obtain a 3D image feature tensor by adaptively average-pooling the output of the last convolutional layer. This 14x14x2048 feature tensor is flattened into a 2D representation of 196 annotation vectors, each of size 2048, which the attention mechanism weights according to relevance.
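For illustration, the following sketch shows one way to obtain the 196 annotation vectors with PyTorch under the dimensions stated above. It is our own minimal reconstruction, not the authors' released code; the class name and the use of torchvision are assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class AnnotationEncoder(nn.Module):
    """Minimal sketch: ResNet-101 truncated before its pooling and FC layers."""
    def __init__(self, grid_size=14):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        # Keep all convolutional blocks; drop the average pooling and FC layers.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Adaptive average pooling fixes the spatial grid at 14 x 14.
        self.pool = nn.AdaptiveAvgPool2d((grid_size, grid_size))

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.pool(self.backbone(images))    # (B, 2048, 14, 14)
        return feats.flatten(2).permute(0, 2, 1)    # (B, 196, 2048) annotation vectors

encoder = AnnotationEncoder()
annotations = encoder(torch.randn(1, 3, 224, 224))  # torch.Size([1, 196, 2048])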

To initialize the language model (LSTM), the annotation vectors are first averaged to produce a single 2048-dimensional vector, which is projected by two independent fully connected layers to the cell-state size (512) and hidden-state size (512). Soft attention is a deterministic, differentiable function composed of MLPs. This dense neural network is learnt as part of the training process and conditionally decides the amount of soft attention to be applied to each annotation vector \(a_i\) based on the decoder's previous hidden state \(h_{t-1}\). The attention network therefore takes two inputs: the flattened image feature annotations and the latest hidden state of the LSTM. The image feature vectors are projected to a 512-dimensional feature space by a fully connected layer, while a separate fully connected layer does the same for \(h_{t-1}\). The projected hidden state is added to each of the projected annotation vectors, producing a ReLU-activated output of shape (196, 512). A scoring layer followed by a Softmax converts this tensor into a probabilistic attention vector of dimension (196, 1), which is used to weight the (196, 2048) annotation vectors and finally yields the context-vector representation of the image features.
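The attention network described above can be sketched as follows. The dimensions follow the text (196 annotation vectors of size 2048, hidden size 512), while the module and variable names, as well as the final linear scoring layer used to reduce the (196, 512) tensor to (196, 1), are our own illustrative choices.

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Deterministic soft attention over the 196 annotation vectors."""
    def __init__(self, d_ann=2048, d_hid=512, d_att=512):
        super().__init__()
        self.ann_proj = nn.Linear(d_ann, d_att)   # project annotation vectors to 512-d
        self.hid_proj = nn.Linear(d_hid, d_att)   # project previous hidden state to 512-d
        self.score = nn.Linear(d_att, 1)          # reduce (196, 512) to (196, 1) scores

    def forward(self, ann, h_prev):               # ann: (B, 196, 2048), h_prev: (B, 512)
        e = torch.relu(self.ann_proj(ann) + self.hid_proj(h_prev).unsqueeze(1))
        alpha = torch.softmax(self.score(e).squeeze(2), dim=1)   # (B, 196) attention weights
        context = (ann * alpha.unsqueeze(2)).sum(dim=1)          # (B, 2048) context vector
        return context, alpha

# LSTM state initialisation from the mean annotation vector (2048-d), as described above.
ann = torch.randn(1, 196, 2048)                   # dummy annotation vectors for illustration
init_h, init_c = nn.Linear(2048, 512), nn.Linear(2048, 512)
h0, c0 = init_h(ann.mean(dim=1)), init_c(ann.mean(dim=1))
context, alpha = SoftAttention()(ann, h0)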

RNNs require fixed-length sequences, but sentences are intrinsically of varying length. To make them uniform, we fixed the maximum caption length at 39 tokens. This does not correspond to the longest sentence in the dataset; it was chosen by a percentile analysis that discards outliers and covers 95% of the captions. Longer captions are clipped to comply with the maximum allowed length, and shorter captions are padded with \(<pad>\) tokens so that every caption has the same length. We substituted words with a frequency of occurrence of less than 3 with an \(<unk>\) token. This models the probability of unknown words that may appear in the validation and test captions but are not present in the training set.
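A small sketch of this pre-processing (vocabulary thresholding, clipping and padding) is given below. The function names are ours, and captions are assumed to already be tokenized into lists of words.

from collections import Counter

MAX_LEN, MIN_FREQ = 39, 3

def build_vocab(train_captions):
    """Keep words seen at least MIN_FREQ times; everything else maps to <unk>."""
    freq = Counter(w for cap in train_captions for w in cap)
    kept = [w for w, c in freq.items() if c >= MIN_FREQ]
    return {w: i for i, w in enumerate(['<pad>', '<unk>'] + kept)}

def encode(caption, vocab):
    ids = [vocab.get(w, vocab['<unk>']) for w in caption[:MAX_LEN]]  # clip long captions
    ids += [vocab['<pad>']] * (MAX_LEN - len(ids))                   # pad short captions
    return ids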

We introduced a custom embedding layer of size 512, which learns a fixed-length continuous representation of each word during training; this is the final word representation consumed by the LSTM decoder. The LSTM uses a hidden state of size 512. To predict the next word, the updated hidden state is up-sampled by a fully connected layer that projects the 512-dimensional vector to the vocabulary space, followed by a Softmax for word prediction. Multi-class cross-entropy loss is used for back-propagation of gradients.

For the baseline model, only the word embedding (512) of the last prediction \(S_{t-1}\) is used as input to the next time step, and the hidden state \(h_t\) is updated cyclically in the LSTM. For the attention-driven main model, the context vector is concatenated with the previous prediction's word embedding \(S_{t-1}\), and the combined vector is fed to the LSTM decoder to generate the next word.
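The decoding step described above (embedding, LSTM and vocabulary projection, plus the context-vector concatenation for the main model) can be sketched as follows. The sizes follow the text; the vocabulary size and the use of a padding index in the loss are illustrative assumptions rather than reported choices.

import torch
import torch.nn as nn

V, D_EMB, D_HID, D_CTX = 5000, 512, 512, 2048     # vocabulary size is illustrative

embed = nn.Embedding(V, D_EMB)                    # learned 512-d word embeddings
lstm_base = nn.LSTMCell(D_EMB, D_HID)             # baseline: word embedding only
lstm_attn = nn.LSTMCell(D_EMB + D_CTX, D_HID)     # main model: embedding + context vector
to_vocab = nn.Linear(D_HID, V)                    # project hidden state to vocabulary space
criterion = nn.CrossEntropyLoss(ignore_index=0)   # assuming index 0 is <pad>

def step_baseline(prev_word, state):
    h, c = lstm_base(embed(prev_word), state)
    return to_vocab(h), (h, c)                    # logits for Softmax / cross-entropy

def step_attention(prev_word, context, state):
    x = torch.cat([embed(prev_word), context], dim=1)  # concatenate S_{t-1} embedding and context
    h, c = lstm_attn(x, state)
    return to_vocab(h), (h, c)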

The Adam optimizer is used with a learning rate of \(4\times 10^{-4}\). The BLEU-4 metric is tracked on the validation set throughout training. An adaptive learning rate with a decay of 20% is applied if there is no improvement in BLEU for 8 consecutive epochs. Dropout of 0.5 is employed, with teacher forcing applied in 50% of the training epochs chosen at random. Training runs for a maximum of 100 epochs with mini-batches of 32, and early stopping based on the BLEU score is triggered after 20 epochs without improvement.
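As an illustration, this schedule might be configured roughly as in the sketch below. It is a minimal sketch that assumes PyTorch's ReduceLROnPlateau scheduler and NLTK's corpus_bleu, neither of which is named in the paper, and uses dummy data in place of the validation captions and training loop.

import torch
from nltk.translate.bleu_score import corpus_bleu

model = torch.nn.Linear(8, 8)                     # stand-in for the captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
# Multiply the learning rate by 0.8 (a 20% decay) after 8 epochs without BLEU improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.8, patience=8)

best_bleu, stalled = 0.0, 0
for epoch in range(100):                          # at most 100 epochs, mini-batches of 32
    # ... one training epoch with dropout 0.5 and teacher forcing in 50% of epochs ...
    refs = [[['a', 'dog', 'runs', 'on', 'grass']]]        # dummy validation references
    hyps = [['a', 'dog', 'runs', 'on', 'grass']]          # dummy hypotheses
    bleu4 = corpus_bleu(refs, hyps)               # default weights give BLEU-4
    scheduler.step(bleu4)
    if bleu4 > best_bleu:
        best_bleu, stalled = bleu4, 0
    else:
        stalled += 1
        if stalled >= 20:                         # early stopping after 20 stalled epochs
            break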

Cross-entropy loss, top-5 accuracy and BLEU scores were tracked. We observed that an improvement in the BLEU score does not always correspond to a reduction in loss, so we stopped the training process early based on BLEU-4. A further improvement in BLEU-4 is obtained once the stabilized img2seq model is tuned to enhance the encoder's adaptability, i.e. by retraining the image encoder. Initially, transfer learning was leveraged: the encoder weights were kept frozen and only the decoder was trained. This phase lasted 31 epochs, with the BLEU-4 score peaking at about 21.56 at the 11th epoch. We then fine-tuned the encoder, restarting training from the parameters of the 11th checkpoint with a reduced batch size and a reduced learning rate, because the trainable model is now larger and additionally incorporates the computation and back-propagation of the encoder's gradients. For ResNet, we fine-tune only convolutional blocks 2 through 4 and keep the initial block intact, because the first convolutional block typically learns low-level features fundamental to image processing, such as lines, edges and curves, and we do not want to disturb these foundations. This fine-tuning improved the BLEU-4 score to a new high of 23.05 after 4 epochs.
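A sketch of the layer-freezing policy described above is given below. The torchvision attribute names (layer2 to layer4) are assumed to correspond to convolutional blocks 2 through 4, and the reduced fine-tuning learning rate is an assumption, since the paper does not report the exact value.

import torch
import torchvision.models as models

resnet = models.resnet101(pretrained=True)

# Freeze everything first, then unfreeze only convolutional blocks 2 through 4.
for p in resnet.parameters():
    p.requires_grad = False
for block in (resnet.layer2, resnet.layer3, resnet.layer4):
    for p in block.parameters():
        p.requires_grad = True

# Restart training from the best checkpoint with the now-trainable encoder parameters.
fine_tune_params = [p for p in resnet.parameters() if p.requires_grad]
encoder_optimizer = torch.optim.Adam(fine_tune_params, lr=1e-4)  # reduced LR; exact value assumed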

Fig. 3 Impact of early stopping via BLEU versus LASER

4 Experimental results and discussions

The image-to-natural-language connection jointly tunes the encoder on top of the trained decoder to bridge the contextual gap between the visual and linguistic components. This allows the loss feedback to flow back to the image encoder, improving the compatibility of the visual component with the language model. Gains in all BLEU-1 to BLEU-4 scores are recorded in Table 2. Table 3 shows the results on Urdu alongside those of relevant papers and the state of the art for English.

We also tested a multilingual BERT model that covers Urdu and is available in Hugging Face. The model consists of 110M parameters and is about 0.7 GB in size. We configured the main model to integrate with the BERT encoder: the embedding layer was frozen and the LSTM cells were configured with a layer size of 768, matching the dimensionality of the word embeddings extracted from BERT. The BERT model uses the sentence context in its entirety to generate the embedding and is very effective at encoding semantics; nevertheless, for Urdu the best strategy was to learn the embeddings from scratch as part of the training process rather than rely on pre-trained embeddings.

This study reports results using the BLEU score as the quantitative metric of goodness of fit, and BLEU is also maximised during training. BLEU is based on the sequential conformance of n-grams, whereas natural language involves much more flexible constructs in which alternative words or combinations of words may convey the same semantic sense. METEOR and CIDEr are also used by recent papers, but they lack the necessary resources for Urdu. In pursuit of better metrics for Urdu, we therefore leveraged two additional candidates for sentence semantics: (i) the BERT-F1 score (Zhang et al. 2019), which uses the BERT transformer to extract word features from multiple layers, forms semantic representation pools from the words of the reference and hypothesis sentences, and computes precision and recall to give an F1 score for the hypothesis; and (ii) LASER, introduced by Facebook Research (Artetxe and Schwenk 2019) to generate multilingual sentence embeddings for zero-shot cross-lingual transfer. For the languages covered by its training, LASER maps a sentence into a joint space of language-independent vectors. To use these embeddings as a qualitative measure, several options exist, such as the L1 and L2 norms and cosine similarity; since the first two are subject to certain biases across dimensions, we used the cosine similarity of each hypothesis against its 5 reference captions and computed macro and micro averages to cover the whole evaluation set.

We also leveraged the LASER and BERT-F1 scores to govern model training via early stopping. We observed that they do not always correlate with the BLEU score: the training process stops at a different point, which yields a lower BLEU metric but maximizes LASER (see Table 4 and Fig. 3). Final results on the evaluation set are listed in Tables 5, 6 and 7 for reference, organized into good, average and bad predictions, respectively.
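For reference, the macro/micro cosine-similarity averaging over LASER embeddings might look like the following sketch. The embeddings themselves (1024-dimensional LASER sentence vectors) are assumed to be pre-computed, the random arrays merely stand in for them, and the averaging scheme is one plausible reading of the text rather than the authors' exact evaluation script.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def laser_similarity(hyp_embs, ref_embs):
    """hyp_embs: (N, d) hypothesis embeddings; ref_embs: (N, 5, d) reference embeddings.
    Returns (macro, micro) cosine-similarity averages over the evaluation set."""
    per_image = [np.mean([cosine(h, r) for r in refs])
                 for h, refs in zip(hyp_embs, ref_embs)]
    per_pair = [cosine(h, r) for h, refs in zip(hyp_embs, ref_embs) for r in refs]
    return float(np.mean(per_image)), float(np.mean(per_pair))

# Dummy vectors standing in for pre-computed 1024-d LASER sentence embeddings.
rng = np.random.default_rng(0)
macro, micro = laser_similarity(rng.normal(size=(10, 1024)),
                                rng.normal(size=(10, 5, 1024)))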

Table 2 Language model trained
Table 3 Performance of our model and state-of-the-art
Table 4 Early stopping, BLEU versus LASER
Table 5 Samples of good predictions
Table 6 Samples of average predictions
Table 7 Samples of bad predictions

5 Conclusions and future work

This is the first study on generative image captioning in Urdu. We present a new dataset for Urdu image captioning, together with its annotation treatment and generalization guidelines that make visio-lingual deep learning models effective and applicable to a modest-sized dataset. We highlight the shortcomings of standard evaluation metrics for Urdu and show that semantics-driven techniques such as BERT-F1 and LASER may be more appropriate for evaluating this task in Urdu. Using a transformer as the decoder to enhance the language modelling ability of the captioning system is left as future work.