Predicting substantive biomedical citations without full text

Significance Citation networks document knowledge flow across the biomedical literature, and insights from these networks are increasingly used to inform science policy decisions. However, many citations are known not to be substantively related to the critical early stages of the citing study, which adds noise to the insights derived from these networks. Here, we train a machine learning model that generates prediction scores associated with substantive citations. We use this model to show that government funding is linked to a disproportionate amount of knowledge transfer from basic to clinical research that is likely to be substantive in nature. This result raises the possibility that federal funding for biomedical research may be a straightforward lever for translating basic research knowledge into clinical discoveries.


Feature descriptions
• specter_cosine_sim: SPECTER cosine similarity using the title and abstract text of the citing paper and that of the referenced paper (a sketch of these pairwise similarity computations follows this list)
• pub_ref_reflist_sd: Standard deviation of the pairwise SPECTER cosine similarities using the title and abstract text of the referenced paper and that of all other papers in the citing article's reference list
• pub_ref_reflist_mean: Mean of the pairwise SPECTER cosine similarities using the title and abstract text of the referenced paper and that of all other papers in the citing article's reference list
• pmid_reflist_sd: Standard deviation of the pairwise SPECTER cosine similarities using the title and abstract text of the citing paper and that of all other papers in the citing article's reference list
• pmid_reflist_mean: Mean of the pairwise SPECTER cosine similarities using the title and abstract text of the citing paper and that of all other papers in the citing article's reference list
• reflist_reflist_sd: Standard deviation of the pairwise SPECTER cosine similarities using the title and abstract text of all papers in the citing article's reference list to one another
• reflist_reflist_mean: Mean of the pairwise SPECTER cosine similarities using the title and abstract text of all papers in the citing article's reference list to one another
• ref_year: Publication year of the referenced paper
• ref_rcr: Relative Citation Ratio of the referenced paper
• pub_rcr: Relative Citation Ratio of the citing paper
• cocited_by_ref: Count of the number of other papers in the citing paper's reference list that also cited the referenced paper
• pub_pctile: Percentile of the citing paper's Relative Citation Ratio
• ref_mc: Molecular/Cellular Biology score of the referenced paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Molecular/Cellular category
• pub_mc: Molecular/Cellular Biology score of the citing paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Molecular/Cellular category
• ref_a: Animal score of the referenced paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Animal category
• pub_a: Animal score of the citing paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Animal category
• pub_year: Publication year of the citing paper
• ref_pctile: Percentile of the referenced paper's Relative Citation Ratio
• ref_h: Human score of the referenced paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Human category
• pub_h: Human score of the citing paper. This is the average of relevant Medical Subject Heading terms attached to this paper that fall into the Human category
• in_lcc: Binary flag for whether the referenced paper falls into the largest connected component of the local citation network of the papers in the citing article's reference list
• direct_and_cocitation: Binary flag for whether the direct citation between the citing and referenced paper is also a co-citation; in other words, has another article been published after the citing paper that references both the citing and cited papers?
• ref_is_research: Binary flag for whether the referenced paper is a primary research article
• same_journal: Binary flag for whether the citing and cited papers are both published in the same journal
• pub_is_research: Binary flag for whether the citing paper is a primary research article
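The following is a minimal sketch of how the SPECTER-based similarity features can be computed, assuming each paper's title and abstract have already been embedded into a fixed-length vector (e.g., with the allenai/specter model); the function and variable names here are illustrative, not the authors' code.

```python
# Illustrative computation of the SPECTER similarity features for one
# citation linkage, given precomputed embedding vectors.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_features(citing_vec, ref_vec, reflist_vecs):
    """citing_vec: embedding of the citing paper; ref_vec: embedding of
    the referenced paper; reflist_vecs: embeddings of all other papers
    in the citing article's reference list."""
    ref_to_list = [cosine_sim(ref_vec, v) for v in reflist_vecs]
    citing_to_list = [cosine_sim(citing_vec, v) for v in reflist_vecs]
    list_to_list = [cosine_sim(reflist_vecs[i], reflist_vecs[j])
                    for i in range(len(reflist_vecs))
                    for j in range(i + 1, len(reflist_vecs))]
    return {
        "specter_cosine_sim": cosine_sim(citing_vec, ref_vec),
        "pub_ref_reflist_mean": np.mean(ref_to_list),
        "pub_ref_reflist_sd": np.std(ref_to_list),
        "pmid_reflist_mean": np.mean(citing_to_list),
        "pmid_reflist_sd": np.std(citing_to_list),
        "reflist_reflist_mean": np.mean(list_to_list),
        "reflist_reflist_sd": np.std(list_to_list),
    }
```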

Machine learning & accuracy testing
Our outcome measure for prediction was a binary flag indicating whether a given reference had been added in peer review (True) or was found in the original preprint and carried over to the published version (False). The training data contained 440,000 instances of balanced positive and negative examples, and accuracy statistics were calculated from a smaller holdout test set. XGBoost was used for the final model (1), although we also tested random forests, support vector machines, and logistic regression, each of which performed worse. The model achieved an F1 score of 0.7. Training and testing on the complete balanced dataset with 10-fold cross-validation yielded similar results. Our final test set comprised 3,482 citation linkages that were out-of-sample for training the model.
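A minimal sketch of this training and evaluation setup is shown below; the file name, column names, and hyperparameters are assumptions for illustration, not the authors' reported configuration.

```python
# Train an XGBoost classifier on the citation-linkage features and
# report F1 on a holdout set, as described above.
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical feature table: one row per citation linkage, with a
# label that is True if the reference was added in peer review.
df = pd.read_csv("citation_features.csv")
X = df.drop(columns=["added_in_review"])
y = df["added_in_review"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=500, eval_metric="logloss")
model.fit(X_train, y_train)

print("F1:", f1_score(y_test, model.predict(X_test)))
```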
References that were found in the preprint version but not in the published version were omitted because of a difference in how such references are indexed. bioRxiv and medRxiv host both the primary reference list and references found only in the supplemental material, while many publishers do not deposit references from supplemental material into structured citation indices. As a result, we cannot easily distinguish whether a reference appears in the preprint but not the published version because it was genuinely dropped, or because it was part of the supplemental material that the publisher did not deposit. We therefore omitted such references from the training data.

External validation & analysis
To identify earlier stage clinical trials cited by later stage clinical trials for the same drug, we used PubChem to match drugs and PubMed Publication Type terms to identify the phase (I-IV) of each citing/referenced trial (2). To identify those trials studying the same disease, we matched trials by the disease studied.

Some external validations required examining the location of the citation within the citing paper. Such citation contexts are available from OpenCitations and Colil (3, 4); we used context data from the latter service. Only citations in sections whose headers contained the case-insensitive string "methods" were used for matching citations in the Methods section (e.g., "Methods", "Materials and Methods", or "Methods and Results"); a sketch of this filter follows. Additional citations to the same paper in other sections did not exclude a reference from consideration, as long as it was also cited in the Methods section.
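The sketch below shows the section-header filter under the assumption that citation contexts are available as (section header, cited PMID) pairs; the record shape and names are hypothetical, and Colil data would need to be mapped into this form.

```python
# Keep only references cited at least once in a Methods-like section.
import re

METHODS_RE = re.compile("methods", re.IGNORECASE)

def cited_in_methods(contexts):
    """contexts: iterable of (section_header, cited_pmid) pairs for one
    citing paper. Returns PMIDs cited in any section whose header
    contains 'methods' (case-insensitive)."""
    return {pmid for header, pmid in contexts if METHODS_RE.search(header)}

# 'Materials and Methods' matches; a repeat citation in the Discussion
# does not disqualify the reference.
contexts = [("Materials and Methods", "111"),
            ("Discussion", "111"),
            ("Results", "222")]
assert cited_in_methods(contexts) == {"111"}
```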

Supplemental Results
Feature importance scores
Feature importance scores are shown in Supplemental Table 1. Out-of-sample positives received a higher mean prediction score of 0.63, while out-of-sample negatives received a mean prediction score of 0.37 (p < 0.01, Wilcoxon rank sum test; Figure 3d-e). This distinction is important because a prediction system can have predictive power mainly for identifying positive values while performing closer to chance on negatives. If that were the case here, it would indicate that the model had learned some predictive properties of references added in peer review, but little else. The low prediction scores for the out-of-sample negative samples indicate that the model is also learning patterns characteristic of early-stage citations.
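A sketch of this score comparison is shown below; the arrays here are synthetic placeholders with roughly the reported means, standing in for the model's actual out-of-sample prediction scores.

```python
# Compare positive vs. negative prediction-score distributions with a
# Wilcoxon rank sum test, as described above.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
pos_scores = rng.beta(2.5, 1.5, size=1741)  # placeholder positives, mean ~0.63
neg_scores = rng.beta(1.5, 2.5, size=1741)  # placeholder negatives, mean ~0.37

stat, p = ranksums(pos_scores, neg_scores)
print(f"mean pos={pos_scores.mean():.2f}, "
      f"mean neg={neg_scores.mean():.2f}, p={p:.2g}")
```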

Feature space analysis
We began with features extracted from three databases: PubMed, iCite, and Semantic Scholar. This initial model used the features described in the Methods: Feature Descriptions section (which derived data from iCite and PubMed), as well as four additional features derived from Semantic Scholar (Supplemental Table 2). In addition to having a larger feature space, this model requires ingesting a third database, which makes its upstream data processing more complex. The four features derived from Semantic Scholar data (background, methods, results, authors_pmid_ref_overlap) each received feature importance scores near the bottom of the importance ranking (Supplemental Table 3), so we explored creating a model without them. Both models received similar F1 scores: 0.7 for the model including Semantic Scholar data, and 0.7 for the model using only PubMed and iCite data (see Results). A heatmap of the feature correlation matrix is shown in Supplemental Figure 1; a sketch of how such a heatmap can be produced follows. Because these two scores were similar, we used the model without Semantic Scholar data to simplify the model and its data processing pipeline.
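A feature correlation heatmap like Supplemental Figure 1 could be produced as sketched below; the file and column names are assumptions carried over from the earlier sketches.

```python
# Render the training-feature correlation matrix as a heatmap.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

features = pd.read_csv("citation_features.csv").drop(columns=["added_in_review"])
sns.heatmap(features.corr(), cmap="vlag", center=0)
plt.tight_layout()
plt.savefig("feature_correlation_heatmap.png", dpi=300)
```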

Feature reduction analysis
To further test whether the model's complexity could be reduced without sacrificing accuracy or external validity, we collapsed pub_year and ref_year into a composite variable, dyear, by subtracting ref_year from pub_year. Based on feature importance scores, we also removed two of the three local network structure features (direct_and_cocitation and in_lcc). In addition, we removed the Human, Animal, and Molecular/Cellular scores, and retained only the percentiled RCR scores (pub_pctile and ref_pctile) while omitting the linear RCR scores. Finally, we removed the same_journal, pub_is_research, and ref_is_research flags. We retained all the SPECTER features because they received among the highest feature importance scores. A sketch of this reduction step appears below.
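The sketch below applies this feature reduction to the hypothetical feature table from the earlier sketches; column names follow the Feature descriptions section.

```python
# Build the reduced feature set described above.
import pandas as pd

df = pd.read_csv("citation_features.csv")  # hypothetical file

# Collapse the two year features into a single citation-age feature.
df["dyear"] = df["pub_year"] - df["ref_year"]

reduced = df.drop(columns=[
    "pub_year", "ref_year",                  # replaced by dyear
    "direct_and_cocitation", "in_lcc",       # local network structure
    "pub_h", "ref_h",                        # Human scores
    "pub_a", "ref_a",                        # Animal scores
    "pub_mc", "ref_mc",                      # Molecular/Cellular scores
    "pub_rcr", "ref_rcr",                    # linear RCR (percentiles kept)
    "same_journal", "pub_is_research", "ref_is_research",
])
```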
Supplemental Figure 1. Feature correlation matrix displayed as a heatmap, based on the training data.
Our reduced feature set generated a model with a similar level of accuracy (F1 = 0.69), indicating that raw accuracy was not much affected by the reduced feature space. However, when we checked its external validity on our five examples of known substantive knowledge transfer (clinical drug progression, clinical disease progression, iPSC, XFP, and CRISPR), only four of the five examples showed the expected pattern of low prediction scores. This suggests that some external validity was traded off by reducing the feature set in this way. For this reason, we continued our experiments with the main model as described in the main text (full iCite + PubMed features, but no Semantic Scholar features).