Linear Regression Analysis of Title Word Count and Article Time Cited using R

There is a common idea that title variables, such as article title length, influence article citations. The aim of the present study was to investigate the possible relationship between the size of an article title and the number of article citations while minimizing scientometric variable biases. A dataset containing ~100,000 virology articles published from 1997 to 2016 was obtained from the Web of Science InCite™ database. The variables title size (TWC), year of publication (YoP), source (JS), and publisher were selected. In addition, the number of times each article was cited, 'Time Cited' (TC), was retrieved from Web of Science InCite™. Linear regression analysis between these variables and TC was performed using R to find a possible prediction model for TC. The results showed that a linear regression with corrected (robust) standard errors had only 30.6% predictive power. Furthermore, it was found that TWC, YoP, and JS have meaningful potential in the linear model. Moreover, TC is negatively correlated with YoP and JS, and positively correlated with TWC. As a result, the size of the article title, the years passed since publication, and the journal in which the article was accepted are good but not sufficient predictors of article citations. In addition, an article is a multi-characteristic subject, and other predictors can be supposed. However, we think that finding an efficient statistical linear prediction model for TC becomes overwhelming as article citations increase.


INTRODUCTION
Effects of article title length on article impact are controversial. Studies have shown that article title length may have a positive, negative, or neutral influence on article citations. [1,2,3] However, many factors may affect a study's outcome. Importantly, sample size, statistical methods, the journals or topics from which articles are retrieved, and time spans are major factors. For example, popular journals that receive more attention may cause a bias toward journals in which article titles follow a format pre-defined by the journal. [1] Studies on this subject have had different results due to different data selection and statistical analysis strategies. In a study of more than 9000 articles from 22 different journals, the authors concluded that articles in journals with higher impact factors tend to have larger title word counts and get more citations. [5] In a later study, the authors chose more variables from the titles of 423 articles divided into two separate groups, results-describing and methods-describing titles. Using different statistical analyses as well as logistic regression, they showed that titles with fewer characters bring more citations. [12] Results similar to those of Paiva et al were obtained in a study of seven journals from the publisher PLoS. It was found that, although each journal had a different scope, short titles had more downloads and citations. [2]
These studies clearly show that the design of a study may lead to either a positive or a negative correlation between title size and citations. Therefore, to minimize bias due to different data sources, in this paper we selected a dataset from a uniform topic and research area within a certain time period. R is a free statistical tool with over 2,000 cutting-edge, user-contributed packages available on CRAN. We preferred R to other statistical tools because, in addition to its availability, it gives access to routinely updated advanced packages incorporating recent developments in mathematics, making it a comprehensive tool for carrying out different types of data analysis; it offers data presentation packages; and it is capable of incorporating and analyzing various data formats.
As a result, it was found that article title length has a potential impact on article citations. In addition, it was concluded that, along with title size, other scientometric variables influence article citations.

Data Retrieval and Article Title Word/Character Counting
About one hundred thousand scientific article records in the virology research area (SU=Virology), published from 1997 to 2016, were retrieved from the Web of Science InCite™ database as of September 27, 2016. The data were then merged into a CSV file using the Publish or Perish 4.0 software package (Harzing, A.W. 2007, http://www.harzing.com/pop.htm). Microsoft (MS) Excel formulae were used for data manipulation and title word counting.
The articles database was cleaned of any duplicates, and articles with missing data on any of the selected variables listed under Variable Selection were deleted.
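The cleaning and word-counting steps described above can be sketched in a few lines. The study itself used Excel formulae; the Python version below is only an illustration, and the column names, the rule for detecting duplicates (same title, year, and source), and the sample rows are all assumptions that do not come from the study.

```python
import csv
import io

# Hypothetical column names; the exported file used in the study may differ.
REQUIRED = ("title", "year", "source", "publisher", "times_cited")


def title_word_count(title):
    """Count whitespace-separated words in an article title (TWC)."""
    return len(title.split())


def clean_records(rows, required=REQUIRED):
    """Drop records with missing values, drop duplicates, and add a 'twc' field."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip records with an empty or absent value in any required column.
        if any(not (row.get(col) or "").strip() for col in required):
            continue
        # Assumption: records sharing title, year, and source are duplicates.
        key = (row["title"].strip().lower(),
               row["year"].strip(),
               row["source"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        record = dict(row)
        record["twc"] = title_word_count(row["title"])
        cleaned.append(record)
    return cleaned


# Made-up rows standing in for the exported CSV file.
SAMPLE_CSV = """title,year,source,publisher,times_cited
A novel virus in bats,2010,Journal A,Publisher X,12
A novel virus in bats,2010,Journal A,Publisher X,12
Viral entry mechanisms revisited,2012,Journal B,Publisher Y,
Structure of a viral capsid protein,2015,Journal B,Publisher Y,3
"""

records = clean_records(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for r in records:
    print(r["twc"], r["times_cited"], r["title"])
```

In this sample the second row is dropped as a duplicate and the third because its citation count is missing, leaving two records with their title word counts attached.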

Variable Selection
Information on the following variables was then tabulated from the above articles database: article title word count (TWC), year of publication (YoP), publisher, journal source (JS) in which the article was published, and number of article citations (TC) retrieved from Web of Science InCite™ as of September 27, 2016. Journals (JS) were grouped into high or low impact factor using Web of Science Journal Citation Reports®. Data mining was performed in less than one hour.

RESULTS
After data retrieval and trimming, only 99,838 articles were left with the desired information. As shown in Figure 1, the articles were from 37 publishers and 56 journals. The mean citations and total number of articles by source and publisher are shown in Tables 1 and 2. The sum of citations across the 99,838 articles was 2,542,056. Figure 1a shows that the American Society for Microbiology, with the largest number of papers, had the most citations (40.55 ± 49.950) and TWCs, with the source AIDS Research and Human Retroviruses having the largest total citations in the years 1997-2016.
Linear regression resulted in a model with a prediction ability of 13.68% (y-intercept = 62.325, slope = 0.545, adjusted R-squared = 0.1368, and p-value < 2.2e-16). All predictors had the same significant p-value (< 2.2e-16) in the regression model, except TWC, whose p-value was 1.12e-06. This model, with the three predictors TWC, YoP, and Publisher, was better than the models based on each variable alone. Adjusted R-squared was 0.1228 with the Year predictor, 0 with the Journal Source variable, 0.021 with the Publisher predictor, and 0 with TWC. However, the predictors TWC (p-value = 0.012), Year (p-value < 2.2e-16), and Publisher (p-value < 2.2e-16) have a synergistic potential for the prediction of TC (Adj. R² = 0.1368). Removing Journal Source from the model changed the predictive power only negligibly (Adj. R² = 0.1357, p-value < 2.2e-16).
Figure 3 shows diagnostic plots of the fitted model. In addition, there was no multicollinearity among the predictors. Moreover, heteroscedasticity was evaluated using the studentized Breusch-Pagan test. The result showed unacceptable heteroscedasticity (BP = 527.89 [df: 4], p-value < 2.2e-16). Standard errors were corrected to account for this. The correction was found to change the R-squared value to 0.306 (y-intercept: 4.19, S.E.: 0.02, p-value < 0.0001).
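To make the quantities reported above concrete, the following is a minimal sketch of a least-squares fit, adjusted R-squared, a studentized Breusch-Pagan statistic, and a heteroscedasticity-robust standard error. It is written in Python rather than R, handles a single predictor only (the reported model used several), and uses made-up numbers. The White (HC0) estimator is an assumption: the text does not say which standard error correction was applied.

```python
import math


def ols_simple(x, y):
    """Fit y = a + b*x by least squares; return (intercept, slope, residuals)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, resid


def r_squared(y, resid):
    """R-squared = 1 - SSE / SST."""
    my = sum(y) / len(y)
    sst = sum((yi - my) ** 2 for yi in y)
    sse = sum(e ** 2 for e in resid)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def breusch_pagan_studentized(x, resid):
    """Studentized (Koenker) BP statistic: n * R^2 of squared residuals regressed on x."""
    squared = [e ** 2 for e in resid]
    _, _, aux_resid = ols_simple(x, squared)
    return len(x) * r_squared(squared, aux_resid)


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def hc0_slope_se(x, resid):
    """White (HC0) robust standard error of the slope in a one-predictor model."""
    mx = sum(x) / len(x)
    sxx = sum((xi - mx) ** 2 for xi in x)
    meat = sum(((xi - mx) ** 2) * e ** 2 for xi, e in zip(x, resid))
    return math.sqrt(meat) / sxx


# Made-up illustration data: title word counts (x) and citation counts (y).
twc = [8, 10, 12, 14, 16, 18, 20, 22]
tc = [5, 9, 4, 20, 7, 30, 2, 41]

intercept, slope, resid = ols_simple(twc, tc)
r2 = r_squared(tc, resid)
bp = breusch_pagan_studentized(twc, resid)
print("intercept:", intercept, "slope:", slope)
print("adjusted R-squared:", adjusted_r_squared(r2, len(twc), 1))
print("BP statistic:", bp, "p-value (df = 1):", chi2_sf_df1(bp))
print("HC0 slope standard error:", hc0_slope_se(twc, resid))
```

With one predictor the Breusch-Pagan statistic has one degree of freedom; a model with more predictors would need a matrix-based fit and a chi-square distribution with correspondingly more degrees of freedom.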
Figure 4 shows observed vs. predicted values. Based on journal source, R² was significantly higher, but only for those journals with few articles. The prediction equations obtained for these journal sources were not able to predict the observed citations (data not shown). We hypothesized that the significant negative correlation between citations and Publisher could be due to the inclusion of a large number of journals with low impact factors (IF) in the dataset. To answer this question, the data were split into categories by journal IF (Cat. 1-4). As illustrated in Figure 2, high impact journals have more citations, as expected. Contrary to our hypothesis, large portions of the articles were in categories with higher IF.
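The grouping by impact factor can be sketched as follows, using the cut-offs reported for the four categories (IF less than 1.5, 1.501-2.45, 2.451-4.1, and greater than 4.11). Treating each cut-off as an inclusive upper bound is an assumption made here to close the small gaps between the reported ranges, and the function names and sample values are hypothetical.

```python
from collections import defaultdict

# Upper bounds of Cat. 1-3; anything above the last bound falls in Cat. 4.
# Assumption: bounds are inclusive, which closes the gaps between reported ranges.
IF_UPPER_BOUNDS = (1.5, 2.45, 4.1)


def if_category(impact_factor):
    """Return the impact factor category (1-4) for a journal."""
    for category, upper in enumerate(IF_UPPER_BOUNDS, start=1):
        if impact_factor <= upper:
            return category
    return len(IF_UPPER_BOUNDS) + 1


def mean_citations_by_category(records):
    """Average citations per category from (impact_factor, times_cited) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [citation sum, article count]
    for impact_factor, times_cited in records:
        cell = totals[if_category(impact_factor)]
        cell[0] += times_cited
        cell[1] += 1
    return {cat: total / count for cat, (total, count) in sorted(totals.items())}


# Made-up (impact factor, times cited) pairs for illustration only.
sample = [(1.0, 4), (1.2, 8), (2.0, 10), (3.5, 18), (5.0, 30)]
print(mean_citations_by_category(sample))
```

Comparing the per-category means in this way mirrors the comparison summarized in Figure 2.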

DISCUSSION AND CONCLUSION
A p-value less than 0.05 is considered sufficient for including a variable in a predictive linear model. The linear regression results obtained here also indicate an effect of TWC on the response variable, TC. However, in this paper we have examined in detail whether a TWC-based linear model for predicting the response variable TC is reliable.
We conducted a linear regression analysis on a database of virology papers. Interestingly, using the TWC variable, we found that a linear model can be fitted for low TC in data sets containing a small number of articles (Table 2). However, the results do not show a reliable linear model for predicting TC irrespective of the number of articles and for high TC. It is likely that, for articles that receive a higher number of citations, readers pay attention to many more variables than simply TWC, making it harder to fit a regression model.
Having checked the relationship between TWC and TC and found no reliable linear relation (only 30.6% predictive ability with standard error corrections), we then incorporated, in addition to TWC (article title word count), YoP (year of publication) and JS (journal source), and searched for meaningful predictors of TC (article time cited). We find that TC is negatively correlated with YoP and JS (Publisher) and positively correlated with TWC (P<0.05). The negative correlation of JS and TC is shown in Figure 2; thus, the TC of articles in high impact factor journals during the years 1997-2016 is less predictable.
We note that scientometric and bibliometric studies employ varied methods of data collection and analysis. However, a scientific paper also has descriptive and reflective contents. [1] Falahati et al observed that title length and subject of an article are both relevant to article citations, but they did not find a correlation between title length and citations, [3] implying that other bibliometric factors may be involved. Article citations may be influenced by research area, topic, word count, characters, punctuation, etc. Also, some topics may attract more interest than other subjects in a certain time period. Therefore, analysis based on different time period segments may minimize biases arising from either the variety of published articles or the time variable itself. For this, other methods or data retrieval strategies need to be adopted.
From a scientific point of view, the number of times an article is cited is a major measure of the article's impact. Therefore, finding the factors that influence article citations needs further research. Accordingly, the factors with a high impact on article time cited can be used to construct statistical predictive model(s).

Figure 1: Schematic representation of TWC and TC by publisher and journal source. a) shows total TC for each publisher, b) illustrates TWCs within publishers, c) shows TC by journal source, and d) shows the TWC used in each journal source.

Figure 2: Journal sources with their respective scientometric information. Journal sources in Cat 1 had mean TC, number of articles, and TWC of 6.5, 424.67, and 13.59, respectively. In Cat 2, these values were 10.45 for TC, 1907.36 for number of articles, and 14.68 for TWC. In Cat 3 and Cat 4, respectively, mean citations were 17.67 and 27.19, mean number of articles 2843.70 and 4596.20, and mean TWC 15.30 and 13.55.

Figure 3: Diagnostic plots of the adjusted linear model. The upper left plot shows residuals versus fitted values; the data follow the regression line, as the red line lies on the dotted line. The normal Q-Q plot shows that the residuals are normally distributed; outlying data with higher citations are shown as well. The lower left plot shows the square root of the standardized residuals against the fitted values; as the red line is flat, it is assumed that the variance of the residuals does not change across the fitted values. The fourth plot shows how each data point influences the regression; as shown, outliers with high leverage and large residuals (points beyond a Cook's distance of 0.5) may affect the linear regression fit; as the red line does not leave the dashed line, it indicates a good regression fit.

Figure 5: Scatter plot matrix of correlations between the data variables.

Table 1: Information of data based on publishers.
a Year in which data are published. b Adjusted R² of linear regression analysis of TC and TWC based on publisher.
Data were split into two categories, one with four quartiles (Cat. 1-4) of journals with IF less than 1.5, 1.501-2.45, and 2.451-4.1, and another of journals with IF greater than 4.11. Moreover, data were split uniformly in each IF category based on the source of publication.

Table 2: Information of data based on publishers.
a Year in which data are published. b Adjusted R² of linear regression analysis of TC and TWC based on publisher.