Journal Pre-proof In a pilot study, automated real-time systematic review updates were feasible, accurate, and work-saving

Objective To describe and pilot a novel method for continuously identifying newly published trials relevant to a systematic review, enabled by combining artificial intelligence (AI) with human expertise.

• The semi-automated system found 100% of the relevant studies found by a conventional manual update, during a pilot, when updating a systematic review of COVID-19 vaccination trials.

What this adds to what is known
• Living systematic reviews have been proposed as a new model for keeping evidence syntheses updated. Most current living reviews rely on repeated manual update searches, which are time consuming and laborious.
• We show that using a hybrid AI/expert model could lead to lower latency updates, potentially reducing workload and improving the currency of systematic reviews.

What is the implication, what should change now
• Systems which use AI to automatically notify systematic review authors of new evidence ('push' updates) are feasible, and should be piloted on a wider range of systematic reviews.
• Future research should examine how best to adapt these technologies for use in more complex reviews (particularly reviews of non-trial evidence, and those with complex inclusion criteria).
• Journal publishers should investigate models for rapid updating, to enable automated live updates of review status to be published.

Background
For many health conditions and treatments, evidence accumulates rapidly. [1,2] Systematic reviews identify, appraise and synthesise all empirical evidence on healthcare topics, and are therefore invaluable for making clinical decisions and informing policy. However, most reviews are static publications, which can quickly become out of date as new primary research is published. [3] For the reader, it is currently impossible to determine whether any particular systematic review is up to date, or whether important new research was published after the searches were conducted. For authors, it is unclear whether it is worth the effort of updating their review, given uncertainty about whether new evidence exists which might change their conclusions. [4] For commissioners and policy makers, it is unclear when and whether to fund updates of systematic reviews.
As an example, consider the topic of COVID-19 treatments, or vaccines. New studies are being rapidly conducted and published on these topics. A 'static' systematic review on either topic, with a search date of six months ago (from the time of this writing), is likely to have missed critical new findings, and so to fail to provide an account of the current science. Given the pace of new published trial evidence in COVID-19, a conventional systematic review would likely become outdated before it was ever published.
Living systematic reviews have been proposed as one model for keeping rigorous syntheses current with evolving evidence. [5,6] The idea is to update syntheses as new evidence emerges, ideally with low latency. For COVID-19 specifically, a number of living reviews are currently being maintained on both treatments and vaccines. [7,8] To date, living systematic reviews have been achieved by repeating a conventional systematic review update on a frequent basis (updating searches, say, monthly or weekly), screening the results and extracting data. [9] This process still depends upon review teams having to actively run searches and find new studies (a "pull" model), and results in some lag between manual search and identification of relevant studies. In addition, conventional database searching can yield large numbers of abstracts which require screening.
Conducting the search and screening the results to identify potentially relevant abstracts makes up a large proportion of the work of conducting a systematic review. The findings of this work (whether or not new studies are identified) are important to readers and policy makers. The main mechanism for providing this information to users is to publish a 'full' update.
This process, particularly for 'empty updates', is time consuming. There is a need to identify new evidence relevant to existing systematic reviews in a more efficient and less manual way. In addition, there is a need for a formal way to represent the currency of existing systematic reviews, based on whether all relevant evidence has been incorporated.
There has been much recent research attention on how to use artificial intelligence (AI) systems to automate (or semi-automate, where AI systems are combined with human experts) living updates. [10,11] The most advanced technology in this respect is the use of machine learning (ML) to prioritise studies for screening, which has been found to be accurate and efficient in a number of methodological studies, [12][13][14] and is available at the time of writing in several systematic review authoring tools. [15,16]
Here, we describe a hybrid system that integrates machine learning and natural language processing (NLP) methods with human expertise to translate static systematic reviews into living reviews. The system automatically monitors research databases for new research relevant to a systematic review, and notifies the review authors. This "push" model differs fundamentally from the standard approach to updating reviews, which depends on review authors taking the initiative to periodically search for newly published evidence. We present a formative evaluation of the system, comparing the reliability of (semi-)automatic systematic

Machine learning inclusion decisions
Our goal in this step is to automatically filter out the vast majority of irrelevant articles. We have found previously that machine learning models with sufficient recall for systematic reviews (which aim to retrieve all research fulfilling their inclusion criteria) will, even in the best case, retrieve a high fraction of false positives. We therefore aim to develop a model with near 100% recall, but add a later screening step by a human expert to remove false positives. A lower precision is therefore acceptable so long as the volume of articles for manual screening is manageable. To achieve this, we train a classification layer on top of 'BERT'-based [18]

In the case of our example topic (which was subject to particularly high rates of research and publication during the study period), the system identified on average three potentially eligible abstracts per week which were then pushed to the review's lead author. Review authors can screen the new studies by signing on to the website (Figures 3 and 4).
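The trade-off described in this section (near 100% recall, with lower precision tolerated because an expert screens afterwards) can be illustrated with a minimal Python sketch. It assumes a classifier that outputs a relevance score per abstract and a labelled validation set; the function names, scores and labels are invented for illustration and are not taken from the system itself.

```python
import math


def threshold_for_recall(scores, labels, target_recall=0.99):
    """Return the highest cut-off that keeps >= target_recall of relevant items."""
    relevant = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not relevant:
        raise ValueError("validation set contains no relevant articles")
    # Number of relevant articles that must score at or above the cut-off.
    needed = max(1, math.ceil(target_recall * len(relevant)))
    return relevant[needed - 1]


def precision_recall(scores, labels, threshold):
    """Precision and recall when every article scoring >= threshold is kept."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    true_pos = sum(kept)
    total_relevant = sum(labels)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented validation scores and gold labels (1 = relevant, 0 = irrelevant).
val_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
val_labels = [1, 1, 0, 1, 0, 0]

cutoff = threshold_for_recall(val_scores, val_labels, target_recall=1.0)
precision, recall = precision_recall(val_scores, val_labels, cutoff)
print(cutoff, precision, recall)  # 0.4 0.75 1.0
```

In this toy example the cut-off has to drop to 0.4 to keep every relevant abstract, which lets one false positive through. That false positive is the kind of item the later expert screening step is meant to remove.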

Publication of live status update
We automatically publish a live update, which makes use of the latest information from both the automated and manual evidence screening (see example in Figure 2). This text is designed to be displayed as an additional section in the structured abstract, with the header "Automatic updates". We display the full abstract including the live update section on our website, and also make this available via a REST API so that external journal publishers could opt to display a live, updated version of the abstract as part of the primary research article in future.
We provide meta-data about new studies (including numbers of studies screened, and how many were deemed relevant by the topic expert, and numbers of trial participants). This numerical meta-data is collected from our screening records, and from the structured data in the Trialstreamer database (which has been automatically extracted using NLP models), and displayed following a template.
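As a rough illustration of template-based display, the Python sketch below fills a fixed sentence with screening counts. Only the "Automatic updates" header comes from the description above; the data class, field names, sentence wording and example numbers are hypothetical and do not reproduce the template actually used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class UpdateStats:
    """Hypothetical container for the counts shown in the live update."""
    last_checked: date
    abstracts_screened: int
    studies_included: int
    participants: int


def render_live_update(stats: UpdateStats) -> str:
    """Fill a fixed template with screening counts (wording is illustrative)."""
    if stats.studies_included == 0:
        finding = "none were judged relevant by the topic expert"
    else:
        single = stats.studies_included == 1
        noun, verb = ("study", "was") if single else ("studies", "were")
        finding = (
            f"{stats.studies_included} new {noun} "
            f"({stats.participants} participants) "
            f"{verb} judged relevant by the topic expert"
        )
    return (
        "Automatic updates\n"
        f"As of {stats.last_checked.isoformat()}, "
        f"{stats.abstracts_screened} new abstracts were screened; {finding}."
    )


# Invented numbers, for illustration only.
example = UpdateStats(date(2021, 8, 1), abstracts_screened=10,
                      studies_included=2, participants=500)
print(render_live_update(example))
```

Keeping the numbers in a structured record and the wording in one function means the same counts could be rendered on a website or returned as a text field by an API without duplicating the template.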
As part of this step, we have also explored the use of automatic narrative summaries of newly included studies. We aimed to produce a brief summary of the new studies' findings to be presented alongside the templated metadata described in the main paper. We provide further details about this method and results in the Appendix.

Evaluation: Prospective case study with a COVID-19 vaccination review
We evaluated the system prospectively in comparison to a conventional manually updated living systematic review on COVID-19 vaccination evidence. The baseline full systematic manual searches for this review were completed and screened on February 9th, 2021.
We ran our comparative evaluation from February 9th 2021 to August 1st 2021. During this period, the review authors performed conventional manual update searches, and we ran the semi-automated system in parallel. We calculated recall with respect to the combined set of included articles from the manual and automatic update systems. Screening of the abstracts found by RobotReviewer LIVE was done by an independent member of the review team, who was not involved in the screening of the manual update searches.
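The recall calculation described above, in which the reference set is the union of everything included by either approach, can be sketched in Python as follows. The function name and the article identifiers are made up for illustration.

```python
def recall_against_combined(found, manual_included, automated_included):
    """Recall of `found` relative to all articles included by either approach."""
    reference = set(manual_included) | set(automated_included)
    if not reference:
        raise ValueError("no included articles to compare against")
    return len(set(found) & reference) / len(reference)


# Invented identifiers: each approach misses one article the other finds.
manual = {"a1", "a2", "a3"}
automated = {"a2", "a3", "a4"}

print(recall_against_combined(manual, manual, automated))     # 0.75
print(recall_against_combined(automated, manual, automated))  # 0.75
```

Using the combined set as the denominator means neither approach is treated as a perfect gold standard: an article found only by the automated system counts against the manual search, and vice versa.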
Due to the time taken to screen abstracts in the manual update, the last manual update search during the evaluation period was run on July 1st 2021. The "push" model used by our automated system, in which smaller numbers of abstracts were sent for screening on the day of publication, meant that there was close to no lag between abstract publication and screening, and the live status updates included abstracts published up to and including August 1st 2021. To allow a fair comparison, we present results separately until July 1st (a direct comparison of automatic update search performance versus conventional manual update searches at intervals), and from July 1st until August 1st 2021 (which evaluates any advantage in screening efficiency with automation).

Results
We present a PRISMA flow diagram comparing the screening approaches in Figure 5. The baseline (manual) version of the review search was conducted in February 2021. This yielded 4493 abstracts, of which 38 both met eligibility criteria and were reports of RCTs. Manual update searches retrieved 135 abstracts; by contrast, the automated system retrieved 56. Both strategies resulted in the same 31 included abstracts after screening.
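The screening burden implied by these counts can be worked through with a short Python calculation. The three input counts are those reported above; the precision and screening-reduction figures are derived here by simple arithmetic and are offered as an illustration rather than as additional reported results.

```python
def precision(included, retrieved):
    """Fraction of retrieved abstracts that were included after screening."""
    return included / retrieved


manual_retrieved = 135     # abstracts from manual update searches
automated_retrieved = 56   # abstracts from the automated system
included = 31              # included by both strategies after screening

manual_precision = precision(included, manual_retrieved)
automated_precision = precision(included, automated_retrieved)
# Share of manual screening workload avoided by the automated search.
screening_reduction = 1 - automated_retrieved / manual_retrieved

print(f"manual precision:    {manual_precision:.1%}")     # 23.0%
print(f"automated precision: {automated_precision:.1%}")  # 55.4%
print(f"screening reduction: {screening_reduction:.1%}")  # 58.5%
```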

Discussion
We have presented a system for identifying new evidence to include in systematic reviews, and for producing live abstract updates on the currency of systematic reviews.
RobotReviewer Live combines AI (ML/NLP) with human expertise, and allows new studies to be incorporated in published review reports quickly after publication. We have made the software, ML models, and data needed to implement the system freely available as open source software. We also provide a prototype of RobotReviewer Live that features a simple user interface, which should allow systematic review authors to produce live updates for their existing "static" systematic reviews. This prototype is also available as open-source code.
We provide an easy-to-use interface to allow experts to validate the automatic search results, potentially providing substantial efficiencies in the updating process while still providing the assurances afforded by expert verification. In practice, converting a new conventional systematic review to a "living" equivalent using the system could be done in a matter of minutes. We make the technology available as open source, together with a REST API to enable live updates to be used inline in published journal articles, embedded in the websites of third party publishers. Even where a review is not actively kept up to date, this may allow interested individuals to see estimates of the amount of relevant evidence published since the time said review was completed. In the future, this platform may also permit "crowdsourced" maintenance of systematic reviews.

Related systems have been developed and evaluated, notably the Cochrane 'Evidence Pipeline' and Centralised Search Service. [23][24][25] These projects also monitor research databases (using a combination of machine learning identification of RCTs and crowdsourcing) and notify Cochrane review groups (who each typically manage tens of systematic reviews on a common clinical theme) when new research relevant to their theme is published. By contrast, our system is designed to manage updates for individual systematic reviews.
In our prospective case study, the automated method identified all of the includable abstracts found manually. We continued to run the automated system for an additional month after the evaluation period (until August 1st), since the review team conducted the manual update search earlier than we had expected. In this month, the automated system found 12 additional abstracts which were deemed includable. This illustrates the advantage of the low latency "push" screening model, especially for topics with rapid publication rates, such as COVID-19 vaccination.
One previous criticism of systematic review automation tools is that they are often found as discrete, scattered pieces of academic code which require substantial technical expertise to use in practice. [11,26] To overcome this problem, we have produced an easy-to-use web interface which should allow users to create a "living" version of a systematic review with minimal effort (Figures 3 and 4).
This technology is still emerging, and users should be aware of important limitations.
Although the performance on this case study is strong, the evaluation review is ideal for such technology. The review question is precise, and concerns a well-defined intervention and health condition, both of which are easy to capture in the structured vocabularies used in the Trialstreamer database. In the midst of a pandemic, there are also large numbers of eligible studies being published (whereas precision is likely to reduce in any search as the prevalence of eligible studies decreases, no matter whether manual or automated). We have presented a single case study, and it is likely that performance will vary, particularly for more complex reviews.

Currently, we make use of the Trialstreamer database, which at present is limited to articles describing RCTs. We intend to make additional article types available in future; at present the system is limited to systematic reviews of intervention trials due to the data sources used. At present, we make use of articles from PubMed only; we are unable to access additional proprietary databases such as EMBASE, which might (modestly) harm the recall of the system. [27] Overall, although the individual components of the system have been extensively validated, this report describes the only validation using a conventional systematic review as a comparator. The reliability of the system in general (particularly for reviews that deviate substantially from the format of the current evaluation) requires further study.

Conclusion
Manually updating systematic reviews is time consuming and laborious, meaning many conventionally produced reviews quickly become out of date. We hope that further evaluation and development of the ideas and methods presented here will bring the goal of dynamic publication of live evidence synthesis updates a step closer to practice.