Copernicus, a hybrid dataflow and peer-to-peer scientific computing platform for efficient large-scale ensemble sampling

https://doi.org/10.1016/j.future.2016.11.004
Open access under a Creative Commons license

Highlights

  • Hybrid dataflow and peer-to-peer computing for fully automated ensemble sampling.

  • The platform automatically distributes workloads and manages them resiliently.

  • Problems are defined as workflows by reusing existing software and scripts.

  • Portability across networks where parts are behind firewalls.

Abstract

Compute-intensive applications have gradually shifted focus from massively parallel supercomputers to compute capacity obtained on demand. This is particularly true for the large-scale adoption of cloud computing and MapReduce in industry, while traditional high-performance computing (HPC) usage in scientific and engineering computing has found it difficult to exploit this type of resource. However, with the strong trend of increasing parallelism rather than faster processors, a growing number of applications target parallelism already at the algorithm level with loosely coupled approaches based on sampling and ensembles. While these cannot trivially be formulated as MapReduce, they are highly amenable to throughput computing. There are many general and powerful frameworks, but for sampling-based algorithms in scientific computing in particular there are clear advantages to a platform and scheduler that are highly aware of the underlying physical problem. Here, we present how these challenges are addressed by combining dataflow programming with peer-to-peer techniques and networks in the Copernicus platform. This allows automation of sampling-focused workflows, task generation, dependency tracking, and, not least, distribution of these tasks to a diverse set of compute resources ranging from supercomputers to clouds and distributed computing (across firewalls and fragile networks). Workflows are defined from modules using existing programs, which makes them reusable without programming requirements. The system achieves resiliency by handling node failures transparently with minimal loss of computing time thanks to checkpointing, and a single server can manage hundreds of thousands of cores, e.g. for computational chemistry applications.
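
The dataflow idea summarized above can be illustrated with a minimal, hypothetical sketch: tasks declare named inputs and outputs, the scheduler runs a task as soon as its inputs are available, and every completed output is checkpointed so that a restart does not redo finished work. All names below (Task, run_dataflow, checkpoint.json) are illustrative assumptions, not the Copernicus API.

import json
import os

class Task:
    """A unit of work that declares named inputs and outputs."""
    def __init__(self, name, inputs, outputs, func):
        self.name, self.inputs, self.outputs, self.func = name, inputs, outputs, func

def run_dataflow(tasks, checkpoint_path="checkpoint.json"):
    """Run tasks as soon as their inputs exist; checkpoint every completed output."""
    values = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            values = json.load(f)          # resume: reload previously computed outputs
    pending = list(tasks)
    while pending:
        # A task becomes runnable once all of its declared inputs have values.
        runnable = [t for t in pending if all(i in values for i in t.inputs)]
        if not runnable:
            raise RuntimeError("unmet dependencies: %s" % [t.name for t in pending])
        for t in runnable:
            if not all(o in values for o in t.outputs):      # skip work already checkpointed
                results = t.func(*[values[i] for i in t.inputs])
                values.update(zip(t.outputs, results))
                with open(checkpoint_path, "w") as f:
                    json.dump(values, f)                     # checkpoint after each task
            pending.remove(t)
    return values

# Toy ensemble: one seed task fans out to two independent runs, then a reduction.
tasks = [
    Task("seed",   [],         ["x"],      lambda: (3,)),
    Task("run_a",  ["x"],      ["a"],      lambda x: (x * 2,)),
    Task("run_b",  ["x"],      ["b"],      lambda x: (x + 10,)),
    Task("reduce", ["a", "b"], ["result"], lambda a, b: (a + b,)),
]
print(run_dataflow(tasks))     # {'x': 3, 'a': 6, 'b': 13, 'result': 19}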

Keywords

Peer-to-peer
Distributed computing
Dataflow programming
Scientific computing
Job resiliency


Iman Pouya holds a master’s degree in Computer Science and Engineering from Chalmers University of Technology. He is currently a Ph.D. student at the Royal Institute of Technology. His research is focused on distributed computing and ensemble methods for molecular simulations. Iman Pouya is the core developer of Copernicus.

Sander Pronk has a background in molecular simulation, focusing on method development, in particular large-scale free energy and pathway sampling methods, and micro- and meso-scale behaviour of fluids. He did a Ph.D. with Daan Frenkel in Amsterdam and postdoctoral research in Berkeley before moving to Stockholm as a researcher with Erik Lindahl. He is the main designer of Copernicus’ compute architecture.

Magnus Lundborg obtained his Ph.D. from Stockholm University in 2011 and performed postdoctoral work at the University of Cambridge before moving to the Royal Institute of Technology and later Stockholm University. His research is focused on MD simulations of small molecules, computer-assisted NMR analysis, and lately lipid systems.

Erik Lindahl obtained his Ph.D. from the Royal Institute of Technology in 2001. He first started working on the molecular dynamics package GROMACS together with colleagues in Groningen in the 1990s, and later performed postdoctoral work with Michael Levitt at Stanford University. Since 2004 he has been Professor of Biophysics at Stockholm University, with a second appointment at the Royal Institute of Technology. The lab’s research is focused on understanding structure and function in complex biological systems (in particular membrane proteins) through a combination of computational and experimental methods.