Learning to Generate Popular Headlines

Headlines are essential not only for summarizing news articles but also for grabbing users' attention. Headline generation is a type of text summarization that can employ either an extractive or abstractive approach, with the latter being more prevalent among deep learning models. However, creating a popular headline that captures readers' attention is challenging. To address this issue, we propose a hybrid headline generation approach that utilizes state-of-the-art transformer models to generate several headline variations for an article. Additionally, we use a headline popularity prediction model to choose the most popular headline from the generated ones. We also create a new dataset for predicting headline popularity by scraping the Twitter accounts of news media. Our evaluation shows that fine-tuning summarization models for the headline generation task can significantly improve their performance. We also demonstrate that our proposed method generates more popular headlines than baseline methods that do not incorporate popularity prediction. For this evaluation, we create a popularity benchmark to automatically assess the effectiveness of our proposed headline generation approach in generating popular headlines.


I. INTRODUCTION
Headlines are the most important part of news articles, as they not only provide a concise one-sentence summary of the entire article but also serve to catch people's attention. In online news portals, good headlines can increase web traffic (e.g., by bringing users from social media to the news portal), which can potentially lead to an increase in news media's revenue. Headline generation can be considered as a type of text summarization where the goal is to create a high-quality headline for a news article. As with text summarization, there are abstractive and extractive approaches. In the extractive approach, a headline generator produces a headline that contains only words or phrases mentioned in the news article. However, in the abstractive approach, words that do not appear in the article may also appear in the generated headline. Most of the recent generative methods, especially those that use deep learning models, belong to the abstractive approach.
The associate editor coordinating the review of this manuscript and approving it for publication was Cheng Chin .
Although the goal of both summarization and headline generation is to produce a short, concise summary of the original document (i.e., a news article), there are some key differences that make the task of headline generation more challenging. First, in the headline generation task, the output summary should consist of only one sentence, whereas in text summarization, multiple sentences can be generated. This condition makes the headline generation task more difficult than the text summarization task, especially when using extractive approaches. Second, the generated headline should be interesting to catch the readers' attention; otherwise, a good article may be ignored by readers. There are some works on headline generation [1], [2], [3], [4], [5] that employed generative models to generate headlines. However, to the best of our knowledge, none of the previous works have taken into account the popularity of headlines when generating or evaluating them.
To address these issues, we propose a novel method for generating popular headlines for news articles. We develop two models: one for generating headlines for a news article and the other for predicting the popularity of a headline. In the proposed method, we first fine-tune a state-of-the-art open-source transformer model for the task of headline generation and use it to produce multiple headlines for each article. Next, we employ our headline popularity prediction model to predict the popularity of each generated headline and choose the headline with the highest predicted score. In addition, we create a new dataset of headline popularity (HP) and use it to train our headline popularity prediction model. The main contributions of our work are summarized as follows:
1) We propose an approach to generating popular headlines using a combination of state-of-the-art open-source transformer models and a headline popularity prediction model.
2) We create a new dataset (HP) for headline popularity prediction by crawling the Twitter accounts of news media. Our dataset is not based on clickstream data collected from users' logs, as clicking on a headline is not a good indicator of user interest (discussed in Section IV).
3) We propose a headline popularity prediction model trained using the HP dataset.
4) We propose a method that utilizes a popularity benchmark to automatically assess the capability of generative models to generate popular headlines.
5) We evaluate the proposed approach on real datasets using three state-of-the-art transformers as our headline generation model. Our results show that the proposed approach improves baseline models in terms of popular headline generation.
The remainder of this paper is organized as follows.
In Section II, we review relevant work on generating headlines in the fields of psychology and computer science. In Section III, we propose a novel approach to generating popular headlines. Next, in Section IV, we introduce our Headline Popularity (HP) dataset, which we use to train the headline popularity prediction model and to create the popularity benchmark. Then, we evaluate our proposed headline generation approach in Section V. Finally, we conclude the study in Section VI.

II. RELATED WORKS
Recent advancements in generative neural network models have led to a significant trend in automatically generating high-quality headlines for news articles. This section provides a brief overview of the recent research conducted on headline generation.
The Washington Post uses its in-house software, Headliner, to automatically generate and suggest headlines to its authors [1]. Their strategy is to have several variations of headlines for each news article and test them to see which is best to use. One of the main reasons why The Washington Post uses more than one headline for its news articles is that news stories usually cover different aspects of events, which makes it difficult for a single headline to cover all of them while capturing users' attention.
In [2], a model based on the BERT transformer model [6] was used to generate headlines. Instead of building a dictionary for abstractive text generation, the authors composed headlines by extracting the necessary words from the articles themselves. The proposed model was designed so that it can only generate tokens that are either available in the article or defined in its dictionary. In their evaluation, human evaluators found it difficult to recognize that the generated headlines were machine-generated. However, they also noted that the generated headlines differed from the ground truths. As a result, their proposed method would likely receive low scores on popular evaluation metrics such as ROUGE [7], and the authors did not include such an evaluation.
In [3], well-known phrases were inserted into news headlines to make them more attractive from the users' point of view. The proposed model picks a well-known phrase (e.g., a famous proverb) that matches the news article and inserts it into the headline. The approach was evaluated on 1000 samples, each of which was scored by three different human evaluators. However, the method may insert phrases that are not related to the main body of the article, and thus each generated headline is screened by a human evaluator to determine whether it is good to use.
In [4], transformer encoder models were employed to generate personalized headlines based on the preferences of individual users. The authors published a new dataset called PENS (Personalized News Headlines) in which the training set consists of news content and news readers' click streams from June 14 to July 12, 2019. For the test set, they hired 103 native students to browse news content and write their desirable headlines for the articles that seemed interesting to them. Also, they proposed a novel deep-learning model to encode both news content and users' behavior to generate personalized headlines. In the end, the generated headlines are compared to the headlines written by the human evaluators. However, the headlines written by human evaluators are less accurate in comparison with the original headlines written by professional journalists.
In [5], a headline generator model using an attention-based encoder-decoder framework for machine translation [8] was proposed. The authors introduced a convolutional attention-based architecture and demonstrated that their proposed modification outperformed the original framework in the headline generation task. However, instead of using the entire article, they only used the first sentence of the article to generate the headlines.
To the best of our knowledge, none of the existing methods for generating headlines consider popularity as a factor. In this paper, the proposed method considers the popularity of headlines in headline generation.

III. GENERATING POPULAR HEADLINES
In this section, we present our novel approach to generating news headlines. Our proposed approach aims not only to generate a concise headline that captures the essence of the article but also to produce the most popular headline possible.

A. OVERVIEW
An overview of our proposed headline generation approach is shown in Figure 1. The approach consists of four main components: Tokenization, Fine-tuning, Generation, and Selection. The first three components pertain to a transformer generator model, which is responsible for generating multiple candidate headlines for a given input article, while the last component, a customized transformer encoder for regression, is used to select the best headline among the generated candidates. We provide a detailed description of the first three components in subsection III-B and the fourth component in subsection III-C.

B. GENERATION PHASE
1) TOKENIZATION
In this part, we process the headlines and bodies of the articles in the training, validation, and test parts of the Newsroom dataset (Section IV-A), using the appropriate tokenizer for the transformer model. The tokenizer converts the input texts (headlines or bodies) into token ids. We define a new task to train the models for headline generation by appending ''headline:'' at the beginning of each article's body before tokenization. Inserting ''headline:'' at the beginning of the input allows the models to recognize that this is a different task from the summarization task for which they have already been trained.
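As a minimal sketch of this prefixing step, the function below prepends the task prefix to an article body. The function name and the whitespace tokenization are illustrative stand-ins; the actual pipeline would pass the prefixed strings to each transformer's own Hugging Face tokenizer.

```python
def prepare_example(body: str, headline: str, prefix: str = "headline: "):
    """Prepend the task prefix to the article body so the model can
    distinguish headline generation from the summarization task it was
    pre-trained on. Whitespace splitting stands in for a real subword
    tokenizer here (hypothetical simplification)."""
    source = prefix + body.strip()
    return source.split(), headline.strip().split()

src_tokens, tgt_tokens = prepare_example(
    "The city council approved the new transit budget on Monday.",
    "Council approves transit budget",
)
assert src_tokens[0] == "headline:"
```

In practice, the prefixed source and the ground-truth headline form one training pair for the sequence-to-sequence model.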

2) FINE-TUNING
In this phase, we fine-tune all layers of the transformer models to generate headlines that resemble the ground truth data (i.e., the actual headlines of the articles). We can use any kind of generative transformer model in our approach, but we use the state-of-the-art models, namely BART [9], ProphetNet [10], and T5 [11] as these models demonstrated outstanding performance in summarization benchmarks. For fine-tuning, we use a greedy approach to generate each token, compare each generated token to its ground truth counterpart, and calculate the loss using the cross-entropy loss function. At the end of each epoch during model fine-tuning, the model is evaluated on the validation dataset and the model with the lowest loss value is kept as the best model.
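The two pieces of fine-tuning logic described above, per-token cross-entropy against the ground truth and keeping the checkpoint with the lowest validation loss, can be sketched in miniature. The function names, the toy three-token vocabulary, and the loss values are all hypothetical.

```python
import math

def token_cross_entropy(predicted_probs, target_ids):
    """Average cross-entropy over a sequence: at each position, take the
    negative log of the probability assigned to the ground-truth token."""
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

def best_checkpoint(val_losses):
    """Index of the epoch with the lowest validation loss -- the snapshot
    that is kept as the best model."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Toy vocabulary of 3 tokens; per-step distributions and gold token ids.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = token_cross_entropy(probs, [0, 1])
assert abs(loss - (-(math.log(0.7) + math.log(0.8)) / 2)) < 1e-12

# Epoch 1 (index 1) has the lowest validation loss, so it is kept.
assert best_checkpoint([0.9, 0.4, 0.55, 0.6]) == 1
```

A real fine-tuning run would compute this loss over batches of tokenized articles and update all transformer layers with backpropagation.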

3) GENERATION
In this phase, we first use a fine-tuned headline generation model from the previous stage (i.e., Fine-tuning component) to generate 10 different candidate headlines for each input article. The reason for generating multiple headlines is that the generator was fine-tuned to only generate a headline that can concisely describe the article's body. Depending on the training data set, the generated headline may not be appealing to readers. Therefore, we use the generator to produce multiple headlines that accurately describe the article and then choose the most attractive one from them.
We can use beam search to generate multiple headlines. However, the top headlines generated by beam search are similar to each other, as shown in Figure 2 (identical headlines are shown with the same color). To avoid generating identical or similar headlines, we replace the greedy approach used in beam search with the multinomial sampling technique.
In this method, the transformer model provides a probability distribution over the vocabulary for each generated position, conditioned on the input sequence (i.e., an article's body) B and the previously generated tokens w_{1:t−1}, according to Formula 1:

P(w_t | w_{1:t−1}, B) (1)

The model chooses the next token w_t by randomly sampling from this probability distribution using the multinomial sampling technique, so that tokens with higher probabilities in the distribution are more likely to be drawn. We also limit the length of the generated output (i.e., a headline) to a maximum of 20 tokens. Figure 2 shows an example of using greedy search versus the multinomial sampling technique in beam search, with repeatedly generated headlines highlighted; as the figure shows, beam search with multinomial sampling generates far fewer duplicated headlines.
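A minimal sketch of multinomial sampling with a length cap is shown below, using a toy conditional distribution in place of a transformer. The vocabulary and probabilities are invented for illustration only.

```python
import random

# Toy conditional distribution P(next | previous); each row sums to 1.
# This hypothetical bigram table stands in for a transformer's output.
VOCAB = ["mayor", "wins", "vote", "<eos>"]
P = {
    "<bos>": [0.5, 0.3, 0.2, 0.0],
    "mayor": [0.0, 0.6, 0.3, 0.1],
    "wins":  [0.0, 0.0, 0.6, 0.4],
    "vote":  [0.1, 0.1, 0.0, 0.8],
}

def sample_headline(rng, max_tokens=20):
    """Multinomial sampling: draw each next token from the full
    distribution instead of always taking the arg-max (greedy) choice;
    stop at <eos> or at the hard length cap of 20 tokens."""
    prev, out = "<bos>", []
    while len(out) < max_tokens:
        tok = rng.choices(VOCAB, weights=P[prev])[0]
        if tok == "<eos>":
            break
        out.append(tok)
        prev = tok
    return " ".join(out)

rng = random.Random(0)
candidates = [sample_headline(rng) for _ in range(10)]
# Sampling yields several distinct candidates, unlike pure greedy
# decoding, which would return the same single headline ten times.
assert len(set(candidates)) > 1
```

With a Hugging Face model, the same effect would typically be obtained by enabling sampling during generation rather than pure greedy/beam decoding.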

C. SELECTION PHASE
This part of the model uses a headline popularity prediction model to determine the popularity score for the candidate headlines and then select the one that achieves the highest score. The architecture of the headline popularity prediction model is shown in Figure 3.
Given a news headline (denoted as x_1, x_2, ..., x_m, where x_i is a token), the prediction model first uses a transformer encoder to convert the input headline (x_1, ..., x_m) into a semantic representation H. We use the pre-trained all-mpnet-base-v2 model from Hugging Face (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) because it performs best in the Sentence-Transformers benchmark (https://www.sbert.net/docs/pretrained_models.html). We then extract the embedding representation of the CLS token from the last hidden layer of the model (i.e., H_0^last in Formula 3) and feed it into a fully-connected layer (FC) whose input and output have the same size as the embedding vector. We use tanh as the activation function. On top of these layers, a regressor, which is a perceptron with a linear activation function, predicts a popularity score. This process can be expressed as:

H = Encoder(x_1, ..., x_m) (2)
v = tanh(FC(H_0^last)) (3)
score = Regressor(v) (4)

A major challenge in developing the popularity prediction model is obtaining a labeled training data set. Since no such training data are publicly available (discussed in more detail in Section IV-B), we create a new headline popularity dataset (HP).
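The selection head can be sketched as follows. The weights are random stand-ins for trained parameters and the hidden size is reduced from the encoder's 768 dimensions, so only the wiring (CLS embedding, FC layer with tanh, single-unit linear regressor) reflects the described architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # the real all-mpnet-base-v2 encoder uses 768 dimensions

# Randomly initialised weights stand in for trained parameters.
W_fc, b_fc = rng.normal(size=(HIDDEN, HIDDEN)), np.zeros(HIDDEN)
w_out, b_out = rng.normal(size=HIDDEN), 0.0

def popularity_score(cls_embedding):
    """FC layer (same input/output size) with tanh activation, followed
    by a single-unit linear regressor producing the popularity score."""
    h = np.tanh(W_fc @ cls_embedding + b_fc)
    return float(w_out @ h + b_out)

cls = rng.normal(size=HIDDEN)  # stand-in for the encoder's CLS vector
score = popularity_score(cls)
assert isinstance(score, float)
```

In the actual model, the CLS vector would come from the transformer encoder's last hidden layer and the weights would be learned by minimizing MSE against the normalized favorite counts.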

IV. DATASETS
In this research, we use the Newsroom dataset [12] to train our headline generation models. We also introduce a new dataset to train our headline popularity prediction model. To aid understanding of our method, we describe these datasets in this section, although our methods can be applied to other datasets that contain the same types of information.

A. THE NEWSROOM DATASET
To train our headline generator models, we use the Newsroom dataset, which contains 1.3 million news articles from 38 major news media between 1999 and 2017. Nearly all previous works that have utilized this dataset have made use of the main body and summary of news articles.
However, this is the first time the headlines from this dataset have been used. We clean the dataset by removing articles with short headlines (fewer than 4 tokens) or long headlines (more than 20 tokens). We found that most headlines with fewer than 4 tokens are just the name of the editor or author of the article. Also, since long headlines are rare in news media, we remove them from our data to prevent models from learning to generate long ones. Moreover, we observe that most of the long headlines in our data are not the actual headlines of the articles but summaries of them. We also remove short articles (fewer than 40 tokens) and long articles (more than 1000 tokens). We found that most short articles are not the full version of the article or contain only URL addresses. Also, articles longer than 1000 tokens are rare in our dataset. After applying these filters, the size of the dataset is reduced from 1,212,740 to 950,071 articles (Table 1). Moreover, we find that the majority of the articles in the dataset were published between 2015 and 2017, totaling 414,601 articles.
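The cleaning rules above can be expressed as a simple predicate. Whitespace tokens stand in for the model tokenizer's subword tokens here, so the counts are only illustrative.

```python
def keep_example(headline: str, body: str) -> bool:
    """Apply the cleaning rules: drop headlines shorter than 4 or longer
    than 20 tokens, and articles shorter than 40 or longer than 1000
    tokens. Whitespace splitting is a stand-in tokenizer."""
    h, b = len(headline.split()), len(body.split())
    return 4 <= h <= 20 and 40 <= b <= 1000

body_ok = "word " * 100  # a 100-token stand-in article body
assert not keep_example("John Smith", body_ok)                  # headline too short
assert keep_example("City approves new transit budget plan", body_ok)
assert not keep_example("City approves new transit budget plan", "word " * 20)  # article too short
```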

B. HEADLINE POPULARITY DATASET
Most of the existing headline popularity datasets [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] used users' log history on a news portal to calculate the popularity of headlines with measures such as the number of clicks and views. However, the view and click counts of a headline do not depend only on the attractiveness of the headline itself. They can depend heavily on other factors that are usually not shared in the existing headline popularity datasets, such as how and for how long each headline was shown on the main page, the position of the headline on the main page, or how often each headline was recommended to users by the news media's recommendation engine.
To eliminate the influence of the factors mentioned above, we define the popularity of a headline in terms of the number of likes it received on Twitter. Nearly all news media have established a Twitter account, through which they disseminate the majority of their headlines. We crawl social media statistics using the tweet IDs from the clickbait detection dataset. The crawled information regarding the tweets is shown in Table 2. The distribution of the data by posting date is shown in Table 3. As we can see, the majority of the news in our dataset was published in 2016. Also, the distribution of published news by news media is depicted in Figure 4.
Since the headlines come from different news media (Figure 4), and news media have varying numbers of followers, we compare the number of likes of each headline with the number of likes of other headlines from the same news media. This is done using the following steps:
1) We remove the articles that belong to the four least frequent news media in Figure 4.
2) We remove the news articles whose body is not available.
3) To detect outlier headlines in terms of popularity, we compare the popularity of each headline with that of other headlines from the same publisher (i.e., news media). To do so, we calculate the 90th percentile of the favorite count per news media. Headlines whose favorite count is greater than this percentile are considered outliers. Instead of removing the outliers, we clip their favorite counts to the 90th-percentile value. Figures 5 and 6 show the distribution of favorite counts for each news media before and after handling the outliers, respectively.
4) We then apply min-max normalization to the favorite count values for each news media (Figure 7).
5) Finally, we divide the data into training, validation, and test sets with percentages of 90%, 5%, and 5%, respectively (Table 4).
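The outlier clipping and normalization steps can be sketched as below; the function name and the toy counts are hypothetical.

```python
import numpy as np

def normalize_per_publisher(favorites_by_media):
    """For each news medium: clip favorite counts above the medium's 90th
    percentile down to that percentile, then min-max normalize the
    clipped counts into [0, 1]."""
    scores = {}
    for media, counts in favorites_by_media.items():
        arr = np.asarray(counts, dtype=float)
        q90 = np.percentile(arr, 90)
        clipped = np.minimum(arr, q90)  # clip outliers instead of dropping them
        lo, hi = clipped.min(), clipped.max()
        scores[media] = (clipped - lo) / (hi - lo) if hi > lo else np.zeros_like(clipped)
    return scores

# A 1000-like outlier is clipped before normalization, so it no longer
# dominates the scale for this (hypothetical) outlet.
scores = normalize_per_publisher({"outlet_a": [0, 5, 10, 20, 1000]})
assert scores["outlet_a"].min() == 0.0 and scores["outlet_a"].max() == 1.0
```

Normalizing per publisher makes scores comparable across outlets with very different follower counts.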

V. EVALUATION
This section consists of two parts. In the first part, we evaluate headlines generated by state-of-the-art summarization models and then fine-tune these models on the headline generation task using the training set of the Newsroom dataset. After each epoch of fine-tuning, the model's performance is evaluated on the validation set, and the model with the lowest loss is kept as the best model. In the second part, we generate headlines using the proposed headline generation approach described in Section III. We also create a popularity benchmark to investigate whether the proposed approach leads to the generation of more popular headlines.

A. EVALUATION OF PRE-TRAINED AND FINE-TUNED MODELS ON HEADLINE GENERATION
First, we conduct our experiment using the pre-trained versions of the BART [9], ProphetNet [10], and T5 [11] models that were trained specifically for the summarization task. These models are described briefly as follows:

1) T5
The T5 transformer model, a powerful natural language processing (NLP) tool developed by Google, is based on the standard encoder-decoder transformer architecture proposed in [23], which employs self-attention mechanisms to process and generate text. It has achieved state-of-the-art performance on various NLP tasks, such as question answering, text classification, and summarization. During pre-training, 15% of the tokens are randomly removed and replaced with sentinel tokens. The decoder's target is then the dropped-out tokens delimited by their respective sentinel tokens.
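The sentinel-token objective can be illustrated with a small sketch. Two simplifications are assumed here: the positions to drop are given explicitly rather than sampled at a 15% rate, and the final closing sentinel that T5 appends to the target is omitted.

```python
def span_corrupt(tokens, drop_positions):
    """Build a T5-style pre-training pair: dropped tokens are replaced in
    the input by sentinel tokens, and the target lists the dropped tokens
    delimited by the same sentinels. Consecutive dropped positions share
    one sentinel, as in span corruption."""
    inp, tgt, sentinel = [], [], 0
    prev_dropped = False
    for i, tok in enumerate(tokens):
        if i in drop_positions:
            if not prev_dropped:
                inp.append(f"<extra_id_{sentinel}>")
                tgt.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            tgt.append(tok)
            prev_dropped = True
        else:
            inp.append(tok)
            prev_dropped = False
    return inp, tgt

inp, tgt = span_corrupt("Thank you for inviting me".split(), {1, 2})
assert inp == ["Thank", "<extra_id_0>", "inviting", "me"]
assert tgt == ["<extra_id_0>", "you", "for"]
```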

2) BART
The BART model is another powerful transformer model for NLP tasks that is developed by Facebook. This model is based on the transformer architecture having a bidirectional encoder similar to BERT and a left-to-right decoder like GPT [24]. BART was pre-trained on large amounts of text data, which makes it very effective for a wide range of NLP tasks such as text summarization, question answering, and language translation. The pre-training tasks for BART include token masking, text infilling, token deletion, document rotation, and sentence permutation.

3) ProphetNet
ProphetNet, a transformer model developed by Microsoft, has achieved impressive results in NLP tasks. It modifies the original transformer so that instead of predicting only the next token, the model predicts the next n tokens simultaneously. The encoder part of the model is similar to the original transformer [23], but a multi-head n-stream self-attention mechanism is added to the transformer decoder. We use the pre-trained versions of these models to generate headlines for the test set of the Newsroom dataset.
The generated headlines are compared to the ground truths using ROUGE [7], BLEU [25], and METEOR [26] scores. Then, we fine-tune the above models on the training portion of the Newsroom dataset using the new training task described in Section III. We use a batch size of 16 and the AdamW [27] optimizer with a learning rate of 2 × 10−5 and a weight decay of 0.01. We also use half-precision mode (i.e., 16-bit floating point) to speed up training. The results of the models on the test set of the Newsroom dataset are shown in Table 5, where the highest value for each measure is highlighted in bold. In Table 5, R1, R2, RL, B1, B2, B3, and B4 stand for ROUGE-1, ROUGE-2, ROUGE-L, BLEU-1, BLEU-2, BLEU-3, and BLEU-4, respectively.
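As a rough illustration of what these metrics measure, a toy ROUGE-1 F1 (unigram overlap) can be computed as follows. The reported results use standard metric implementations, not this sketch.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Illustrative ROUGE-1 F1: harmonic mean of unigram precision and
    recall between a generated headline and the ground truth."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# 3 of 4 candidate unigrams match 3 of 4 reference unigrams: F1 = 0.75.
score = rouge1_f1("council approves transit budget", "city council approves budget")
assert abs(score - 0.75) < 1e-12
```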
As we can see in Table 5, the fine-tuned models (headline generation models) drastically outperform the pre-trained versions (summarization models). This supports our earlier discussion in Section I regarding the differences between the headline generation and summarization tasks. Even though the pre-trained BART, ProphetNet, and T5 models perform well on the summarization task [9], [10], [11], they under-perform on the headline generation task. However, when we fine-tune them specifically on the headline generation task, their results improve significantly on all the performance measures listed in Table 5.
In terms of BLEU and METEOR scores, the fine-tuned version of BART is the best model for the headline generation task, whereas in terms of ROUGE scores, the fine-tuned version of ProphetNet is the best. Interestingly, a similar pattern holds for the pre-trained versions of the models: the pre-trained ProphetNet performs by far the best among the three in terms of ROUGE scores, while the pre-trained BART is slightly better than the pre-trained ProphetNet in terms of BLEU scores. However, unlike the fine-tuned models, the pre-trained ProphetNet also outperforms the other two pre-trained models in terms of the METEOR score.
In the next section, we use these fine-tuned models as headline generation models and evaluate our proposed approach to generating popular headlines.

B. EVALUATING THE PROPOSED APPROACH ON THE POPULARITY OF GENERATED HEADLINES
In this part, we use our proposed headline generation approach from Section III to generate popular headlines. Its main components are the generation component and the selection component. For the selection component, as described in Section III, we use a headline popularity prediction model that includes a transformer encoder, a fully-connected layer, and a linear regression layer. We train the headline popularity prediction model on the training set of the HP dataset (Section IV-B) for 40 epochs and keep the best snapshot of the model, which achieves the lowest MSE (0.065101) on the validation set. We use a batch size of 64, the AdamW optimizer with a learning rate of 2 × 10−5 and a weight decay of 1 × 10−5, and MSE as the loss function.
For the generation component, we conduct our experiment with each fine-tuned model as the generation component of the proposed approach. We also run the experiment using each pre-trained model as the generation component of the proposed approach. Our aim is to find out whether the proposed approach can improve the result of the pre-trained and fine-tuned models in terms of popularity or not.
We design the following method to automatically assess the ability of our model to generate popular headlines by creating a popularity benchmark dataset. The benchmark contains popular headlines (and articles) from the HP dataset we created earlier. If our model performs better than the baseline models in terms of text generation metrics (e.g., BLEU, ROUGE, and METEOR) on popular articles (that is, the headlines we generate match the popular headlines better than the ones generated by the baselines), then our model is better than the baselines at generating popular headlines.
TABLE 5. A comparison between pre-trained summarization models and their fine-tuned versions for the headline generation task.
TABLE 6. The comparison between headline generator models with and without using the proposed approach on the popularity benchmark. The methods marked with a * use our popularity prediction model to select the most popular headline from 10 candidates generated by our proposed strategy for generating multiple headlines.
To create the popularity benchmark, we select the headlines with a popularity score above 0.8 from the test and validation sets of our HP dataset, leaving 1084 popular news articles. However, because an article in the benchmark may also occur in the training set of the Newsroom dataset, the next step is to remove from the popularity benchmark any article whose headline is similar to at least one headline from the Newsroom training set. To do this, we convert all headlines in the benchmark and Newsroom datasets into semantic embedding representations using SBERT. Then, we calculate the cosine similarity between each headline in the benchmark and all the headlines in the Newsroom dataset, keeping the maximum value as the similarity score of that benchmark headline with respect to the Newsroom dataset. We remove a headline from the benchmark if its similarity score is above 0.6. After this step, 677 headlines remain. Again, the idea behind the popularity benchmark is that if the proposed model improves on the baselines in terms of the text generation metrics, we can conclude that the proposed architecture enables the headline generator models to generate more popular headlines, since all the ground truth headlines in the benchmark are highly popular. In this way, we can assess how good the generated headlines are in terms of popularity.
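The similarity-based filtering can be sketched with plain vectors standing in for the SBERT embeddings; the function name is hypothetical.

```python
import numpy as np

def filter_similar(bench_emb, train_emb, threshold=0.6):
    """Keep only the benchmark headlines whose maximum cosine similarity
    to any training-set headline is at or below the threshold; higher
    values indicate near-duplicates and are removed."""
    b = bench_emb / np.linalg.norm(bench_emb, axis=1, keepdims=True)
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    max_sim = (b @ t.T).max(axis=1)  # best match per benchmark headline
    return np.where(max_sim <= threshold)[0]

bench = np.array([[1.0, 0.0], [0.0, 1.0]])
train = np.array([[0.9, 0.1]])  # near-duplicate of the first headline
kept = filter_similar(bench, train)
assert kept.tolist() == [1]  # only the dissimilar headline survives
```

In the actual pipeline, `bench_emb` and `train_emb` would be 768-dimensional SBERT sentence embeddings rather than these toy 2-dimensional vectors.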
After creating the popularity benchmark, we use the fine-tuned and pre-trained models mentioned in Table 5 to generate headlines for the articles in the popularity benchmark. Then, we employ each model in our proposed architecture and investigate whether the proposed approach improves the baselines in terms of the evaluation metrics. The results of the proposed approach are indicated by appending * to the model name. The results of this experiment on the popularity benchmark are shown in Table 6. As we can see, regardless of the model used (i.e., T5, BART, or ProphetNet), and whether it is fine-tuned or pre-trained, our proposed approach improves the results on all evaluation metrics in the popularity benchmark. For example, comparing BART:fine-tuned to BART:fine-tuned*, the latter outperforms the former on all evaluation metrics used; the same applies to all fine-tuned and pre-trained versions of the other models. Another interesting result, similar to the one shown earlier, is that the fine-tuned versions of ProphetNet perform better than the fine-tuned versions of BART in terms of ROUGE scores, while BART performs better on the other metrics. However, unlike the previous results, this pattern no longer holds for the pre-trained models, as the pre-trained versions of BART outperform the pre-trained versions of ProphetNet on all the evaluation metrics used.
We also conducted a similar experiment, but instead of using the popular headlines, we used the least popular ones (i.e., those with a popularity score below 0.2). After removing unpopular headlines that are similar to the Newsroom training set (i.e., with a similarity score above 0.6), 1402 unpopular headlines remain. We then run the same experiment that we ran on the popularity benchmark with the unpopular headlines. The result of this experiment is shown in Table 7.
TABLE 7. The comparison between headline generator models with and without using the proposed approach on the unpopular headlines. The methods marked with a * use our popularity prediction model to select the most popular headline from 10 candidates generated by our proposed strategy for generating multiple headlines.
As we can see, the best results in terms of ROUGE scores, BLEU-1, and BLEU-2 no longer belong to the proposed approach. Also, the effect of the proposed approach varies between models, sometimes improving the outcome (e.g., BART:pre-trained vs. BART:pre-trained*) and sometimes worsening it (e.g., ProphetNet:fine-tuned vs. ProphetNet:fine-tuned*). This contrasts with the popularity benchmark, on which the proposed approach always improved the results.

VI. CONCLUSION
We proposed a novel approach to generating popular headlines by combining transformer-based generative models with a headline popularity prediction model. To generate the headlines, we first fine-tuned state-of-the-art generative transformer models on the Newsroom dataset and used the fine-tuned models to generate multiple candidate headlines for an input article. Then, a headline popularity prediction model was used to select the headline with the highest predicted popularity. The popularity prediction model consists of a few neural network layers: a transformer-based encoder, a fully-connected layer, and a regression layer. We created a new dataset, the Headline Popularity (HP) dataset, to train the headline popularity prediction model. To automatically evaluate the effectiveness of our proposed approach for generating popular headlines, we developed an evaluation method based on a popularity benchmark. Our results demonstrated that our proposed method improved state-of-the-art text generation models in terms of the metrics for generation quality (that is, BLEU, ROUGE, and METEOR) on the popularity benchmark of popular articles, indicating that the proposed method can better generate popular headlines.
One limitation of the proposed approach is that we trained both the popularity prediction and headline generation models only on English news articles. Future research can focus on training multilingual models that can be used for articles written in any language. The other limitation is that we did not investigate the effectiveness of Large Language Models (LLMs) in the headline generation task. Future research can investigate the potential of major LLMs such as GPT-4 in headline generation. However, a major bottleneck in evaluating the effectiveness of LLMs on this task is ensuring that the LLMs have not already seen the evaluation dataset during pre-training, as the news article benchmarks (e.g., Newsroom) are publicly available on the Internet.