Impact of lexical and sentiment factors on the popularity of scientific papers

We investigate how textual properties of scientific papers relate to the number of citations they receive. Our main finding is that correlations are nonlinear and affect the most cited and typical papers differently. For instance, we find that, in most journals, short titles correlate positively with citations only for the most cited papers, whereas for typical papers the correlation is usually negative. Our analysis of six different factors, calculated both at the title and abstract level of 4.3 million papers in over 1500 journals, reveals the number of authors, and the length and complexity of the abstract, as having the strongest (positive) influence on the number of citations.


I. INTRODUCTION
The number of citations an article receives can be considered a proxy for the attention or popularity the article achieved in the scientific community. Citations play a crucial role both in the evolution of science [1][2][3][4][5] and in the bibliometric evaluation of scientists and institutions, in which case the number of citations is often tacitly taken as a measure of quality. Understanding which factors in a paper contribute to or correlate with citations has been the subject of a number of investigations (see Refs. [6][7][8] for reviews). Diversity in the affiliation of authors, multinationality, multidisciplinarity, and the number of references, figures, or tables have all been identified as factors that positively correlate with citations.
Here we perform a more systematic investigation of how different textual properties of scientific papers affect the number of citations they acquire (see Appendix A for data description). A classical result, which motivates our more general analysis, is the negative correlation between title length and citations (i.e., the shorter the title, the more citations) [9][10][11][12]. In our analysis we additionally consider the complexity and the sentiment of the text, both in the title and in the abstract (see table I). Lexical complexity is usually considered proportional to the effort needed (by non-experts) to understand the text. The three measures of text complexity we use (see table I) take into account the number of different words in the text (normalized by the length) and the length of these words in syllables (see Appendix B for details).
Several previous studies have used sentiment analysis, i.e., the quantification of the emotional content of the examined texts or messages. In general, psychologists are able to specify several dimensions of emotions, reaching as many as 12 [13]. However, two of them, valence and arousal, are probably the best recognized and the most frequently used. Valence reflects the emotional sign of the message (negative, neutral, positive), while arousal describes the level of activation (low, medium, high). Pairs of valence and arousal can indicate a specific emotion type [14], e.g., fear (negative and aroused) or sadness (negative and not aroused); however, they can also be used as independent variables. For example, valence as a standalone dimension has successfully been used to detect collective states of online users [15], to indicate the end of online discussions [16], and to predict the dynamics of Twitter users during the Olympic Games in London [17]. Lately this kind of analysis has also been introduced to assess the role of negative citations [18] and citation bias [19], and to check what boosts the diffusion of scientific content [20]. Here we quantify arousal and valence through a dictionary classifier (see Appendix C).
TABLE I: List of textual factors whose relation to citations we investigate in our paper. Whenever possible, factors are obtained on the title and abstract of a paper. See Appendices B and C for exact definitions. Additionally, we consider the number of authors (motivated by previous studies, e.g., [7,21]).
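To make the kinds of factors listed in table I concrete, the following minimal Python sketch computes two of them for a single title. It assumes the standard definition of Herdan's C as log(V)/log(N), with V distinct words and N total words, and it scores sentiment as the mean dictionary value of the words found in a lexicon. The tokenizer, the function names, and the toy lexicon with its made-up scores are illustrative assumptions; the exact definitions used in the paper are those of Appendices B and C.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def herdan_c(text):
    """Herdan's C = log(V) / log(N): V distinct words, N total words."""
    tokens = tokenize(text)
    n_total, n_unique = len(tokens), len(set(tokens))
    if n_total < 2:
        raise ValueError("need at least two words")
    return math.log(n_unique) / math.log(n_total)

def mean_score(text, lexicon):
    """Average lexicon score over the words the lexicon covers (None if none)."""
    scores = [lexicon[w] for w in tokenize(text) if w in lexicon]
    return sum(scores) / len(scores) if scores else None

# Hypothetical toy lexicon with made-up valence scores;
# this is NOT the dictionary referred to in Appendix C.
TOY_VALENCE = {"success": 8.0, "novel": 6.5, "failure": 2.0, "cancer": 1.5}

title = "A novel marker of treatment failure in cancer"
print(herdan_c(title))                 # every word is distinct, so C = 1
print(mean_score(title, TOY_VALENCE))  # mean over "novel", "failure", "cancer"
```

Because C depends on the text length N, values for short titles and long abstracts are not directly comparable.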

II. RESULTS
We are interested in quantifying the relationship between X, a real number that quantifies for each paper one of the textual factors listed in table I, and the logarithm of the number of citations, Y ≡ ln(citations + 1). We standardize X in order to be able to compare the different factors (see Appendix D), and we use the citations provided by Web of Science at the end of 2014 for papers published in 1995-2004 [38]. Example results of the X vs. Y relationship for two factors in two journals are shown in the left part of figure 1. The broad scattering of the points shows that visual inspection fails even to detect whether the relation between X and Y is positive or negative. The simplest (and widely used) approach is to perform an ordinary least-squares linear regression Y = α† + β†X, in which case β† is related to the Pearson correlation coefficient r as β† = r σ_Y/σ_X (in fact, due to the standardization of X, in our case β† is simply cov(X, Y)). For the data in figure 1, this yields β† = 0.020 ± 0.011 with p > 0.05 for title length in Science and β† = −0.21 ± 0.03 with p < 0.001 for valence in Nature Genetics. In other words, the second example shows a negative correlation between valence and citations, while the first shows no clear correlation between the number of characters and citations (we cannot reject the null hypothesis of no linear dependence at the 5% significance level). We note that the analysis of Ref. [12], which identified a negative correlation between title length and citations, was restricted to the most cited papers. This difference in the conclusion regarding the role of title length, together with the large variability shown in the data, motivates us to go beyond the above computation of linear correlations, which relies on the assumption of uniform errors in the whole dataset (homoscedasticity).
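The linear-regression baseline described above can be sketched in a few lines of Python. The sketch assumes that standardization means a plain z-score (the actual procedure is the one of Appendix D) and uses made-up toy data; the function names are illustrative.

```python
import math

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def ols_fit(x, y):
    """Least-squares intercept and slope of y = alpha + beta * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    beta = cov_xy / var_x
    return my - beta * mx, beta

# Made-up toy data: title lengths (characters) and citation counts.
title_lengths = [40, 55, 62, 78, 90, 110]
citations = [12, 30, 8, 25, 3, 15]

x = standardize(title_lengths)            # standardized factor X
y = [math.log(c + 1) for c in citations]  # Y = ln(citations + 1)
alpha, beta = ols_fit(x, y)
print(alpha, beta)
```

Since the standardized X has zero mean and unit variance, the fitted slope coincides with cov(X, Y) and the intercept with the mean of Y.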
A. Quantile regression (QR)
Quantile regression [22] makes it possible to track the relation between variables for different parts of the dataset. The simple question it addresses is: what are the coefficients α and β of a linear relation Y = α(τ) + β(τ)X that divides the dataset so that a fraction τ of the points lies below the line and the remaining fraction (1 − τ) lies above it (a precise formulation of QR is given in Appendix E)? We thus obtain a sequence of values β(τ), which can be thought of as quantifying the relation between X and Y at the quantile τ. QR is widely used in different fields [23] and has lately been applied to predict future citations of papers based on their previous history, i.e., early citations, as well as on the Impact Factor [24].
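As an illustration of the idea (not the implementation used in this paper, whose formulation is given in Appendix E), the sketch below fits a quantile line by minimizing the standard check (pinball) loss. For a single predictor, 0 < τ < 1, and at least two distinct X values, a minimizer exists that passes through two data points, so a brute-force search over all pairs of points is exact, although only practical for small samples. The function names and the toy data are assumptions made for this example.

```python
from itertools import combinations

def pinball_loss(residual, tau):
    """Check loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def quantile_fit(x, y, tau):
    """Brute-force quantile regression of y = alpha + beta * x.

    Tries every line through two data points and keeps the one with the
    smallest total pinball loss. Cost grows as n**3: small samples only.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, slope undefined
        beta = (y[j] - y[i]) / (x[j] - x[i])
        alpha = y[i] - beta * x[i]
        loss = sum(pinball_loss(yk - alpha - beta * xk, tau)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, alpha, beta)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]

# Made-up heteroscedastic data: the spread of y grows with x.
toy_x = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_y = [1, 1, 1, 0, 1, 2, -1, 1, 3]

for tau in (0.5, 0.9):
    print(tau, quantile_fit(toy_x, toy_y, tau))
```

In this toy dataset the median line is flat while the upper-quantile line rises, which is the kind of τ dependence of β that a single least-squares slope cannot reveal. For real datasets a dedicated solver, such as the QuantReg class in statsmodels, would be the more practical choice.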
The results in the center panels of figure 1 show a clear τ dependence of β, a signature of the nonlinearity of correlations. For instance, the top panel shows that for low values of τ there is a positive correlation between the number of characters in the title and citations, while for high τ the correlation is reversed. This shows the limitations of the popularized message [25,26], following Ref. [12], that shorter titles lead to more citations. This only holds if you know in advance that your paper will be among the top-cited papers (longer titles seem to be better, e.g., in order to avoid being among the least cited papers). A similar dependence (with the opposite trend) is observed in the lower panel for valence, the emotional polarity, contained in the abstracts of Nature Genetics articles. These examples show that even simple textual variables can have a mixed relation to the number of citations acquired by the papers of a given journal. We repeated the QR analysis for all factors in more than 1500 journals [39]. In the discussion of our findings below we focus on three characteristic values of β, which represent the low-cited (β_low ≡ β(τ = 0.02)), typical (β_half ≡ β(τ = 0.5)), and top-cited (β_top ≡ β(τ = 0.98)) papers (graphically represented in the central and left panels of figure 1 by a summary pointer, i.e., a red arrow with a circle).

B. Strength of factors
In order to compare the strength of the effect of a factor on the number of citations, we focus on the distribution of β_half (typical papers) across different journals. The linear relationship Y = ln(citations) = α + βX and the fact that X is standardized imply that β quantifies how much growth in citations should be expected from a variation of one standard deviation in one factor (e.g., β = ln 2 ≈ 0.69 means that the number of citations doubles when X moves by one standard deviation). Figure 2 summarizes the results and presents the factors ordered according to the median of the β_half distributions. The influence of the factors is overall rather weak, as seen by the fact that for most journals |β| < 0.5. Factors in the title are considerably weaker than those in the abstract or the number of authors. The variation across journals is in general high, but higher in the title than in the abstract (possibly because the estimates of X are more robust in the abstract due to the larger amount of text). The strongest factors observed are: (i) the number of words in the abstract, (ii) the number of authors, and (iii) the z-index in the abstract. For those factors, over 75% of journals (equivalently, the whole box) are placed above zero. The negative value of Herdan's C can be attributed to its anticorrelation with the number of words (see Appendix B); when C is corrected for this and presented in the form of the z-index, the value is positive. This means that for a typical paper and for most journals a more variable vocabulary (more unique words) translates into more citations. Similarly, the number of words in the abstract and the number of authors are positively correlated with the number of citations in almost all journals.
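The interpretation of β as a growth factor can be made explicit with a small Python sketch. Since Y is the logarithm of the number of citations and X is standardized, moving X by ΔX standard deviations multiplies the expected number of citations by exp(βΔX). The function names are illustrative; the two β_half values in the example are those quoted in this paper for title length in Lancet and Nature.

```python
import math

def citation_gain(beta, delta_x=1.0):
    """Multiplicative change in citations for a shift of delta_x std devs."""
    return math.exp(beta * delta_x)

def percent_gain(beta, delta_x=1.0):
    """The same change expressed as a percentage gain (negative = loss)."""
    return 100.0 * (citation_gain(beta, delta_x) - 1.0)

print(citation_gain(math.log(2)))  # beta = ln 2 corresponds to a doubling
print(percent_gain(0.33))          # title length, Lancet (beta_half = 0.33)
print(percent_gain(0.038))         # title length, Nature (beta_half = 0.038)
```

Strictly speaking, with Y ≡ ln(citations + 1) the factor applies to citations + 1, which makes little difference for well-cited papers.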
a. Quantile dependence. Now we quantify to what extent the influence of factors (β) varies across papers with different numbers of citations (the quantile τ). We are particularly interested in the cases in which the effect of a given factor on the most successful papers is significantly different from its effect on typical papers. To quantify how typical this is, we count the number of journals for which β_top ≠ β_half is observed beyond the estimated uncertainties σ_{β_top} and σ_{β_half}, i.e., |β_top − β_half| > (σ_{β_top} + σ_{β_half}). The results shown in table II reveal that overall this happens in about 1/3 of the cases (it is more typical for text length and less typical for sentiment factors). Table II also reveals the factors for which β_top ≠ β_half because β(τ) grows in most journals (and thus β_top > β_half, as in the case of valence in the abstract), decays in most journals (and thus β_top < β_half, as in the case of title length), or shows a mixed behavior in different journals (as in the case of arousal).
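The counting rule above can be written down directly. The Python sketch below labels each journal according to the criterion |β_top − β_half| > σ_{β_top} + σ_{β_half} and reports the percentage of journals per label, which corresponds to the kind of summary given in table II. The label strings, function names, and all numbers in the example are made up for illustration.

```python
from collections import Counter

def quantile_trend(beta_top, sigma_top, beta_half, sigma_half):
    """Label a journal by how beta changes from typical to top-cited papers."""
    diff = beta_top - beta_half
    if abs(diff) <= sigma_top + sigma_half:
        return "no significant difference"
    return "grows" if diff > 0 else "decays"

def trend_percentages(journals):
    """Percentage of journals carrying each label.

    journals: iterable of (beta_top, sigma_top, beta_half, sigma_half).
    """
    labels = [quantile_trend(*row) for row in journals]
    counts = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in counts.items()}

# Made-up coefficients for four hypothetical journals.
toy_journals = [
    (-0.10, 0.03, 0.08, 0.02),  # beta_top well below beta_half
    (0.30, 0.05, 0.10, 0.04),   # beta_top well above beta_half
    (0.12, 0.06, 0.10, 0.05),   # difference within the uncertainties
    (0.02, 0.04, 0.05, 0.03),   # difference within the uncertainties
]
print(trend_percentages(toy_journals))
```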
The next question we investigate is the extent to which the quantile dependence leads to a reversal of the effect of factors, i.e., when β(τ) crosses 0. Table III shows the percentage of journals with positive β_low, β_half, and β_top coefficients for each factor. It shows that, except for singular cases (marked by an asterisk), the observations tend to be significantly different from chance (50%). The variation across the different β's (quantiles) quantifies the number of journals for which β(τ) crosses zero. Such a behavior has already been discussed for title length in Science (see figure 1), and table III confirms the generality of this observation (for title length it shows 72% of journals with positive β_low as compared to nearly 75% with negative β_top). In the case of three factors (title length, Herdan's C in the abstract, and valence in the abstract), we observe that moving from β_low to β_top we cross 50%, which indicates that for a certain range of quantiles τ the factor in question increases citations for most journals, while for other quantiles the opposite effect is typical across journals.
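One simple way to compare a percentage of positive coefficients with chance (50%) is an exact two-sided binomial sign test, sketched below in Python. This is an assumption made for illustration only: the test actually used in this paper is described in Appendix F and may differ. The function names and the example counts are hypothetical.

```python
from math import comb

def percent_positive(betas):
    """Percentage of journals whose coefficient is strictly positive."""
    return 100.0 * sum(1 for b in betas if b > 0) / len(betas)

def sign_test_p_value(n_positive, n_total):
    """Exact two-sided binomial p-value against a 50% chance of beta > 0.

    Sums the probabilities of all outcomes at least as far from n_total / 2
    as the observed count (distances are doubled to stay in integers).
    """
    distance = abs(2 * n_positive - n_total)
    tail = sum(comb(n_total, i) for i in range(n_total + 1)
               if abs(2 * i - n_total) >= distance)
    return tail / 2 ** n_total

# Hypothetical example: 72 of 100 journals with a positive coefficient.
print(sign_test_p_value(72, 100))
```

The sketch treats journals as independent draws and ignores the uncertainty of each individual β, so it is a rough check rather than a full analysis.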
The combination of the results of these two tables allows for a more complete picture of the τ dependence of β for different factors. For instance, the number of authors and the number of characters in the title can be identified as the factors that exhibit the strongest systematic trend of decaying β(τ) (in about 40% of journals, as shown in table II). However, only for the number of authors are the majority of the values above zero (see table III), i.e., the value of β for top papers is less than for typical ones but it still stays positive. On the other hand, in the case of the number of characters, not only is β smaller for top papers than for typical ones, but it changes its sign as well. Sentiment factors (except for the valence in the abstract) bring no overall information about the trend: the numbers of upward and downward occurrences are similar. Notably, there is a strong coincidence between the z-index and the fog index in the abstract, suggesting that although those two quantities have different definitions, both indicate an increase of the correlation between abstract complexity and citations.
Variability across journals. The large variability across journals apparent in all our analysis can have different origins. One possibility is that certain journals

III. DISCUSSION AND CONCLUSION
In this paper we investigate the importance of factors of scientific papers for the popularity they acquire. As factors we consider the number of authors of the paper and also text-related properties that quantify the length of title and abstract, the complexity of the vocabulary, and sentiment based on the words used. These factors capture different stylistic dimensions of scientific writing and were also selected based on previous works that indicated a correlation with the number of citations. We found that the factors with the strongest (positive) effect on citations are the number of authors and the length of the abstract. Text complexity is positively correlated with citations at the level of the abstract, while we could not detect a strong effect within the title. The agreement of two factors designed to quantify text complexity, the z-index and the Gunning fog index, supports this conclusion (the opposite result is obtained if Herdan's C measure is used, but we attribute this to the negative correlation of this measure with text length). In terms of the sentiment factors, the level of arousal a title or abstract evokes is poorly correlated with citations. This result should be examined more carefully, as there are controversies as to the relation between text polarity and the information contained therein (see [28,29] and the following discussion). Also, the vocabulary on which we rely in this study [30] has been obtained by evaluating the common reception of words. This fact can strongly affect the value of valence, e.g., through a highly negative word such as "cancer" in medical papers.
The discussion above, and the fact that a statistically significant effect is present for most factors, should not hide that the effect is typically weak (|β| < 0.5 for most factors, quantiles τ, and journals) and that there are strong fluctuations across papers and journals. For instance, a positive correlation between the number of characters and citations is measured for all quantiles in the New England Journal of Medicine, while a negative correlation is observed in the overwhelming majority of other journals. One of the main findings of our paper is that the effect of the factors also varies strongly depending on whether the analysis uses all papers or only the most cited ones. We quantified this effect by the dependence of β on the quantile τ in a quantile regression analysis. One example in which this effect is particularly strong is the role of title length in figure 1. In the public media [25,26], the message behind the finding [12] of a negative correlation between title length and citations was that authors should write shorter titles to achieve more citations. While this simple message is appealing and agrees with some stylistic recommendations, our results show that for most journals this is wrong (even if one assumes that there is a causal relation behind the correlations). The negative correlation is found only for the most cited papers; for typical papers the correlation is positive (longer titles are better). This suggests that papers with short titles show a larger variation in the number of citations and can be very well cited or very poorly cited. A similar behavior is observed for other factors, and a significant dependence on τ is seen on average in 1/3 of the journals.
Altogether, our results indicate that textual properties of title and abstract have non-trivial effects in the processes leading to the attribution of citations. In particular, the effect varies significantly between papers with a typical number of citations and papers with a large number of citations. This finding is even more important considering that the number of citations varies dramatically across papers. The weak signal we detect can also be considered a sign that the quantities we measure carry limited information, e.g., expressing the impact of publications by a single number (the number of citations) can be misleading and incomplete (a point that has been previously raised, e.g., in Ref. [31]). The overall estimates (calculated over a set of journals or categories) may blur the clear picture one obtains when observing a specific journal. For authors interested in how to write the title and abstract of their paper, we recommend looking at the values of β_half and β_top of the different factors for the specific journals of interest.

FIG. 2: Strength of factors calculated over all journals. Box-plots (see definition on the right) summarize the distribution of β_half values across different journals. Influential factors are identified as those for which |β| is large for almost all journals (e.g., when the box does not contain the β_half = 0 line, this implies that in at least 75% of the journals the value of β_half is above or below zero).
FIG. 3: Summary pointers showing β_low, β_half, and β_top for two factors: number of title characters (top) and valence in abstract (bottom) (see figure 1 for the definition of summary pointers). Journals are grouped according to the OECD bibliographic categories [27]. The 8 journals with the highest Impact Factor in each category are shown (6 for Other natural sciences). The categories are sorted with respect to the number of positive β_half values. Testing the null hypothesis that categories are randomly attributed to journals (we compare the average standard deviation within categories with a random attribution of categories to journals) yields p = 0.002 for title length and p < 10^−8 for valence in abstract. The same procedure performed for Impact Factor (by creating 12 categories according to decreasing IF) gives p = 0.02 for title length and p < 10^−5 for valence in abstract, suggesting that the scientific category has the stronger impact.

TABLE II: Factors often affect top and typical papers differently. The percentage of journals for which β_top ≠ β_half is reported. The right column, β_top ≠ β_half, is the sum of the two others.
By calculating exp(βΔX) one can directly estimate how much gain in citations is obtained on average by a move of ΔX standard deviations in the variable X (e.g., for title length in the journal Lancet β_half = 0.33, and thus extending the length of the title by one standard deviation gives almost 40% gain in citations; for Nature, β_half = 0.038, and thus one obtains less than 4% gain).
TABLE III: Percentage of journals with positive β_low, β_half, and β_top for each factor. All values are statistically significant (p < 0.001) except for those marked with an asterisk * (see Appendix F).
comparing with a random attribution of IF, popularity proves to be statistically significant, although to a much lesser extent than scientific discipline (see caption for figure 3). Figure 3 also allows for a straightforward comparison of the strength of the title length and abstract valence factors in different journals.