Securing the future of research computing in the biosciences

Author summary

Improvements in technology often drive scientific discovery. Therefore, research requires sustained investment in the latest equipment and training for the researchers who are going to use it. Prioritising and administering infrastructure investment is challenging because future needs are difficult to predict. In the past, highly computationally demanding research was associated primarily with particle physics and astronomy experiments. However, as biology becomes more quantitative and bioscientists generate more and more data, their computational requirements may ultimately exceed those of physical scientists. Computation has always been central to bioinformatics, but now imaging experiments have rapidly growing data processing and storage requirements. There is also an urgent need for new modelling and simulation tools to provide insight and understanding of these biophysical experiments. Bioscience communities must work together to provide the software and skills training needed in their areas. Research-active institutions need to recognise that computation is now vital in many more areas of discovery and create an environment where it can be embraced. The public must also become aware of both the power and limitations of computing, particularly with respect to their health and personal data.

This is the absolute upper limit for the amount of data that can be collected. Typically, the Titan Krios generates between 5 and 8TB per day, depending on the grid type, automation software and data collection parameters. Data streams of even this smaller size require significant upgrades to the networking infrastructure typically found in universities. For example, at the University of Leeds, each Titan Krios, and each computational resource used for EM image processing, is connected by uncontested 10Gb Ethernet to a dedicated GPFS filestore. The size of these datasets, and the resulting difficulties in moving them around, are a key impediment to the potential adoption of cloud compute resources in this area. Future detector upgrades (see below) will require adoption of even faster networking technologies.
For archival purposes, the size of a dataset can be reduced by a factor of ∼40× if only drift-corrected and dose-weighted averaged images are retained. In practice, researchers store raw frame data for a nominal period of time, before deciding that a reduced dataset is sufficient, which is then archived for at least a 10-year period, at the behest of research funders. A more pragmatic approach may actually be to discard the data, but place the cryo-EM grid that generated the images into long-term storage under liquid nitrogen. A new dataset could subsequently be collected much more cheaply than storing the original images. Even if the frame data were almost immediately discarded after corrections were applied, this would still result in a staggering data volume of around 10TB per day, all of which would need to be archived.
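The typical acquisition figures quoted above can be turned into a rough back-of-envelope estimate. The Python sketch below uses only the 5 to 8TB per day range and the ∼40× archival reduction given in the text. The decimal definition of a terabyte and the assumption that acquisition is spread evenly over 24 hours are ours, and the function names are purely illustrative, so the results are indicative rather than measured.

```python
# Back-of-envelope data rates for a single microscope.
# Inputs from the text: 5-8 TB/day and a ~40x archival reduction.
# Assumptions (ours): decimal terabytes, data spread evenly over 24 hours.
TB = 10**12                      # bytes per terabyte (decimal convention)
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_gbit_per_s(tb_per_day):
    """Average rate in Gbit/s needed to move tb_per_day of data in one day."""
    bits_per_day = tb_per_day * TB * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def archived_tb_per_day(tb_per_day, reduction=40):
    """Daily archive volume if only corrected, averaged images are kept."""
    return tb_per_day / reduction


for daily_tb in (5, 8):
    rate = sustained_gbit_per_s(daily_tb)
    kept = archived_tb_per_day(daily_tb)
    print(f"{daily_tb} TB/day: {rate:.2f} Gbit/s sustained, {kept:.3f} TB/day archived")
```

Under these assumptions the average rate is roughly 0.46 to 0.74 Gbit/s per instrument, and the reduced archive is 0.125 to 0.2TB per day. An average hides bursts during acquisition and the concurrent reads made by image processing, so peak demand on the network is likely to be higher than this figure suggests.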
While we are still very far from overcoming the experimental barriers necessary to achieve this goal, one ambition of structural biology is to generate an atomic resolution structure of a cell. To estimate the volume of imaging data this would require: Assume each voxel is 1Å in size (which comfortably provides a resolution ∼ 3Å). Assume the volume of a typical eukaryotic cell is ∼ $(5\,\mu\mathrm{m})^3 = 125\,\mu\mathrm{m}^3$. At 1Å per voxel, i.e. $10^{12}$ voxels per $\mu\mathrm{m}^3$, we require $1.25 \times 10^{14}$ voxels. As each voxel requires 4 bytes, a 3D reconstruction of the cell at atomic resolution would require 500TB of storage. This is only two orders of magnitude smaller than the whole dataset curated by the EBI [3]. Good statistical averages would then require thousands of measurements. Given these data, biologists will look for differences between different cell types from the molecular level upwards, and compare diseased and healthy states. However, biological time-scales span many orders of magnitude, from nanoseconds for atomic-scale thermal fluctuations, to milliseconds for dynamic molecular processes such as transcription, to years for amyloid formation, and the associated onset of neurodegenerative disease. Consequently, the growing requirement for data storage in the biosciences is unlikely to reach saturation, as the complexity of molecular biology looks set to remain far greater than our knowledge for the foreseeable future.
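The storage estimate for a single atomic-resolution cell map given earlier in this paragraph can be reproduced in a few lines. The Python sketch below treats the cell as a cube of side 5µm, which is the reading of the quoted cell size that reproduces the stated voxel count. The function name, its parameters and the decimal terabyte are our own illustrative choices.

```python
# Storage needed for one 3D reconstruction of a cell at 1 Å per voxel.
# Assumptions (ours): the cell is a cube of side 5 µm; 1 TB = 10**12 bytes.
ANGSTROM_PER_MICROMETRE = 10**4


def cell_map_storage(cell_side_um=5, voxel_side_angstrom=1, bytes_per_voxel=4):
    """Return (number of voxels, number of bytes) for a cubic cell volume."""
    voxels_per_side = cell_side_um * ANGSTROM_PER_MICROMETRE / voxel_side_angstrom
    n_voxels = voxels_per_side ** 3
    n_bytes = n_voxels * bytes_per_voxel
    return n_voxels, n_bytes


n_voxels, n_bytes = cell_map_storage()
print(f"{n_voxels:.3g} voxels, {n_bytes / 1e12:.0f} TB")
```

With the default arguments this gives 1.25 × 10^14 voxels and 500TB, matching the figures in the text. Because the voxel count scales with the cube of the linear dimensions, halving the voxel size or doubling the cell side multiplies the storage by eight.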
Going below the atomistic level with XFELs: Biology is powered by chemical reactions. Therefore, to gain a full mechanistic understanding of molecular biology we need to probe the time-scales (fs-ps) and length-scales (Å) associated with electron transfer during enzyme-catalyzed reactions. X-ray Free Electron Lasers (XFELs) provide particularly high peak brilliance and improved beam coherence compared to synchrotron-generated X-rays, and can be generated as short pulses in the 10-100fs regime, which enables them to probe biochemical reactions [4]. XFELs have been used to determine the structural dynamics of photoisomerisation following photon capture by photoactive yellow protein microcrystals over fs to ps time-scales [5], and to monitor changes in protein structure and dynamics in the carbonmonoxy myoglobin complex on photolysis of the Fe-CO bond [6]. In both cases, complementary computer simulations at the quantum mechanical level were used to interpret the experimental data. XFELs bring their own new set of computational challenges. Firstly, the data output is vast: 50GB per second [7] (equivalent to ∼ 200 times the current output of cryo-EM). Moreover, the interpretation of the data is non-trivial, and requires bespoke software implementing new quantum physics algorithms [8].
Case Study 2. Data storage sizes for an atomistic map of C. elegans

C. elegans is a model organism for eukaryotic species. It contains 2000 cells. Here we estimate the data storage requirements for tracking the position of every atom in C. elegans at 1µs intervals throughout the lifetime of the worm. In principle this could either be provided by MD simulations (assuming perfect performance of MD forcefields and many orders of magnitude improvement in the computational efficiency of MD codes), or by a hypothetical future imaging device capable of sampling at these speeds and resolutions. An atomistic MD simulation would provide a slightly larger dataset, because each voxel in the cryo-EM image is only 4 bytes, whereas storing atomistic coordinates needs around 10 bytes.
These enormous data sizes will be required to turn measurements at this level of detail into insight and understanding of the molecular life cycle of the worm. A robust test of physical understanding of a system is whether we are able to reproduce its behaviour through computer modelling. A "smart simulation" of C. elegans requires considerable coarse-graining from atomistic resolution, and/or multi-scale switching (34). For example, if protein diffusion, docking and enzymatic function can be adequately described by a "block-translation-rotation" approach, then for most of the computation a protein can be represented as a collection of one "coarse-grained atom" per domain. The reduction in the number of effective atoms in the system is typically of order $10^3$. If individual proteins are considered as the irreducible units in the model, further simplifications are possible, and more still if elements of sub-cellular architecture, such as microtubules, and then the cytoskeleton, can be represented as individual entities. The question is then: how do we construct a series of models at different spatial resolutions that correctly capture the relevant biophysics at each length-scale? How do we then couple these models together, so that information can flow between the various length-scales?
Coarse-graining in length also permits coarse-graining in time. If a diffusive dispersion relation is assumed, so that $\tau \sim r^2$, then the temporal coarse-graining gives another $10^2$ in data reduction for each $10^3$ reduction in the number of units considered in space (a $10^3$ reduction in the number of units corresponds to a tenfold increase in the length-scale $r$, and hence a $10^2$ increase in $\tau$). We can obtain a more substantial reduction in data sizes if we assume that a new computational regime occurs whenever the number of atoms changes by $10^3$. If adequate sampling requires two of these regimes to be explored, then sampling of ns dynamics requires ms time-scales, and so on. Therefore, sampling at each regime requires $10^6$ snapshots. Such a coarse-graining strategy reduces the required dataset from $10^{18}$ TB per cell to $10^8$ TB (see Table S1), which is still a staggeringly large number.

Research Computing is the innovative use of computer hardware and software to enhance research by providing computational implementations of scientific ideas, models and procedures. It complements theoretical and experimental approaches, providing insight from modelling, such as in silico drug screening, molecular dynamics (MD) simulations of proteins, or the analysis of protein-protein interaction networks. It is becoming increasingly integral to the experimental biosciences, particularly in bioinformatics, but also increasingly for cryo-EM and other imaging techniques. The nature of research requires flexibility and agility, and also the ability to fail without catastrophic consequences. Such bespoke solutions can be challenging to implement within administratively heavy IT service management frameworks (e.g. ITIL [9]), while many of the computational requirements of the biosciences may not be sufficiently novel in their computational procedures to qualify as computer science research. Therefore, while it is currently convenient for institutions to place Research Computing into existing organizational IT services structures, it needs to be recognized as performing a distinct function [10].
Research Computing needs to intersect constructively both with academic computer science, so that novel methods from that field can be rapidly integrated into the biosciences, and with technology service providers, so that core IT infrastructure is robustly maintained. This integration requires a holistic understanding of computer science, of the management of computing systems and of the relevant technical issues within the biosciences. Agile software development practices, such as DevOps [11], have led to the emerging practices of ResOps and SciOps when applied to scientific computing and research. These approaches encourage intimate collaboration between operational teams and research/product development teams. New ideas, including automation, continuous testing and continual requirements re-prioritization, are engendering a culture that is highly effective in research, including in the biosciences (e.g. ResOps@EBI [12]).
Research Computing compared to Enterprise IT: Given the growing role of computing in bioscience research, and the increasing scale of the facilities employed, it is instructive to compare Research Computing with the computing systems and software deployed to support organisations generally, which is known as "Enterprise IT", and which is treated as an operational cost (see Table B).

Table B. Research Computing comparison with Enterprise IT

| | Research Computing | Enterprise IT |
|---|---|---|
| Key mission | To accelerate research and improve consistency and repeatability by making use of the scientific method to define the resource mix needed to solve scientific problems. | To efficiently support operational activity in any enterprise or organisation. |
| Objective of computational facility | To provide scientific insight as part of the scientific method, often through intensive computing (e.g. HPC), requiring data analysis, simulation and modelling across a wide range of domains. | Transactional and "systems of record", supporting all key enterprise activities and processes, including business decision making. |
| Computer platforms | Diverse computer platforms, including specialist HPC and visualization tools, where software and hardware may be tightly integrated. | Standard and virtualised platforms with software which is largely platform independent. |
| Software and development | Research software is mostly developed by research students, postdocs or RSEs, and is driven by the interests of the academic team. Open source software is considered best practice. | Often closed source software, with an emphasis on "buy, not build". |
| Activity life cycles | Oriented around fixed length research project cycles, often with project usage limits and allocations. | Oriented around business cycles, e.g. financial years or operational activities. |
| Client devices and platforms | Highly diverse, including mobile devices (e.g. for clinical trials), laboratory equipment, sensors, wearable technologies and now Internet of Things (IoT). | Covers full range of client devices, printers, networks, WiFi etc. |
| Staff knowledge and skills | Requires IT or other professionals, e.g. HPC or RSE experts, to have good levels of understanding of research domains and disciplines, e.g. physical sciences, biosciences, social sciences. | Requires IT professionals and business relationship managers to have a high level of understanding across client disciplines, e.g. finance, HR, operations, to fulfill project/service requirements. |
| Data curation and protection | Subject to data protection, patient confidentiality, ethics committee scrutiny and funding body requirements, including the Open Science agenda. | Subject to data protection and wider legal requirements. |

The goals of Enterprise IT are normally centered around cost-efficiency, targeting consistently high service levels through systematic and repeatable delivery processes. Research Computing has to be effective, and this often requires the use of innovative, flexible and adaptive approaches to yield new (and sometimes unexpected) insights. Research-oriented institutions must be able to support both Research Computing and Enterprise IT working alongside each other (so-called bimodal operation). In addition to operational efficiency, Research Computing support relies on other metrics (such as publication and citation data and its impact) to demonstrate the value it adds. While much bioscience software is developed within academic teams or embedded in national facilities, in some institutions HPC service teams and, increasingly, software engineers may be located in IT services [13]. Teaching Research Computing skills (see the section on Building computational skills for the biosciences) is another opportunity for IT service providers and academics to collaborate closely and exchange ideas. Undergraduate teaching opportunities also provide a route to make a wholly academic career path (e.g. lectureships) for Research Software Engineers (RSEs) viable at universities, because the need for undergraduate teaching provides a long-term financial future for such appointments.