Patterns for High Performance Multiscale Computing

Following a short review of Multiscale Computing Patterns, this paper introduces the Multiscale Computing Patterns Software, which consists of description, optimisation and execution components. First, the description component translates the task graph, representing a multiscale simulation, to a particular type of multiscale computing pattern. Second, the optimisation component selects and applies algorithms to find the most suitable mapping between submodels and available HPC resources. Third, the execution component, which is a middleware layer, maps submodels to the number and type of physical resources based on the suggestions emanating from the optimisation part together with infrastructure-specific metrics such as queueing time and resource availability. The main purpose of the Multiscale Computing Patterns software is to leverage the Multiscale Computing Patterns to simplify and automate the execution of complex multiscale simulations on high performance computers, and to provide both application-specific and pattern-specific performance optimisation. We test the performance and the resource usage for three multiscale models, which are expressed in terms of two Multiscale Computing Patterns. In doing so, we demonstrate how the software automates resource selection and load balancing, and delivers performance benefits from both the end-user and the HPC system level perspectives.


Introduction
Multiscale modelling & simulation has become a well-established way to study complex phenomena that encompass multiple space and time scales [1]. In this approach, a multiscale model is constructed by combining, or coupling, a collection of single-scale submodels, each of which captures processes on a distinct space and time scale; see e.g. [2-4]. Multiscale modelling is widely used in most areas of science and engineering [5], such as biomedicine [6-8], fusion [9,10], material science [10,11], energy [12] and engineering [10,13]. It is self-evident that any high-fidelity multiscale model must employ substantial high performance computing resources, since the individual single scale models comprising it themselves have to run on such machines.
https://doi.org/10.1016/j.future.2018.08.045 0167-739X/© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
In addition to specific multiscale applications, a number of tools and frameworks which assist in multiscale computing have been established. These range from domain-specific frameworks such as AMUSE [14] and OASIS-MCT [15] to solver-specific frameworks such as the MOOSE framework for finite-element codes [16] and fully generic frameworks [1,3,17-19] encompassing related coupling tools such as the Multiscale Coupling Library and Environment 2 (MUSCLE2) [20].
We have previously developed the Multiscale Modelling and Simulation Framework (MMSF) [3,17-19], which provides a theoretical and methodological framework for constructing multiscale simulations in four main stages. First, we model multiscale phenomena as collections of single-scale submodels, then decide which models interact with each other and how. Single scale models and couplings are presented within a Scale Separation Map, allowing us to describe and compare multiscale models on a conceptual level. Second, we specify the single scale models, their couplings and interactions using the Multiscale Modelling Language [1,18]. Third, we convert these definitions to a fully implemented multiscale model, currently relying on MUSCLE2 [20] (although the concepts of the MMSF can also be applied to other coupling environments such as AMUSE [14]). An important property of MUSCLE2 is the separation of concerns that it affords. Submodels are not aware of any other components. Moreover, only minimal adaptations are required at the level of a submodel in order for it to be incorporated into a multiscale model implemented with MUSCLE2. Fourth, we deploy and execute the multiscale application on a set of computational resources. Developers and users can run different submodels on different machines [4], using for example the QCG middleware [21], a paradigm that we call Distributed Multiscale Computing [4].
Knap et al. [22] have previously proposed a distributed multiscale computing framework that supports the on-demand execution of microscale models coupled to a macroscale model (very similar to one of the computing patterns we proposed in [23]), using large scale supercomputing resources. Although their framework has, to our knowledge, not yet been applied outside the domain of materials science for which it was originally created, the authors do propose a general conceptual framework that could be adopted for use in other disciplines. This resonates with our vision of generic multiscale computing environments, where a separation of concerns is achieved between multiscale modelling & simulation on the one hand, and deploying and executing a multiscale simulation in a given HPC environment on the other.
Our "Lego-based" philosophy for the construction and execution of multiscale applications relies on single scale submodels and their interactions, and results in more degrees of freedom for both programming and executing a multiscale simulation. To efficiently execute multiscale applications on high-end HPC machines, a number of challenges have to be addressed, such as load balance (providing resources to each of the single scale models), fault tolerance (sometimes instantiations of single scale models may fail) and energy awareness (depending on properties of single scale models, potentially also in combination with load balancing, energy aware optimisation). Our intention is that these challenges are handled in a generic way, as far as possible avoiding the imposition of that burden on the developers of multiscale applications. Those developers should take care of the scale bridging mechanisms and the efficiency of the single scale models, while the challenges of execution within a High Performance Computing (HPC) environment should be addressed through a generic layer added to MMSF that we call Multiscale Computing Patterns (MCPs) [23].
We defined MCPs as "high-level call sequences that exploit the functional decomposition of multiscale models in terms of single scale models" [23], and distinguished three patterns: Extreme Scaling, Heterogeneous Multiscale Computing and Replica Computing. Each of these patterns is described using a generic task graph that aids in understanding how to best map these patterns to HPC resources. In addition to the generic task graph, an MCP contains performance information about single scale models, an XML-based specification of the multiscale application named xMML [20], and a set of algorithms and heuristics used to combine this into input files for the execution environment. In this paper, we report on the design and implementation of the MCP software, and present the first results of executing multiscale simulations using MCPs, including discussions on the added value of using such solutions for High Performance Multiscale Computing. Here, we mainly integrate these MCPs with MMSF to increase the effectiveness with which we can develop, deploy and execute multiscale simulations on existing petascale and emerging exascale resources [23].
The MCP software architecture consists of a description component, an optimisation component and an execution component. In the description component, the software uses the task graph of the specific multiscale model, in combination with auxiliary information (e.g., definitions of single-scale models), to identify the type of pattern and create input definitions for the optimisation component. In the optimisation component, the software selects and applies a set of optimisation algorithms to identify a range of efficient mappings of the submodels in the application to specific HPC resources. Lastly, the execution component is a middleware layer which identifies the optimal mapping of submodels to the available resources, additionally taking into account queueing times and resource occupancy. Moreover, the execution component deploys and executes the application, with all of its submodels, on the target resources. Three examples of using Multiscale Computing Patterns software are illustrated and examples of cost functions are worked out, showing that a wide range of variables for Multi-Objective Optimisation algorithms can be chosen. The idea is that the Multiscale Computing Patterns software will automatically detect which cost functions and algorithms to select based on the type of pattern and user requirements.
The structure of our paper is as follows. We describe the MCPs in Section 2, and introduce the Multiscale Computing Patterns software and its components in Section 3. In Section 4, we provide three examples of the use of the Multiscale Computing Patterns software as a proof of concept. Finally, we provide a discussion and conclusion in Section 5.

Multiscale computing patterns and high performance multiscale computing
In this section, we discuss the concept of Multiscale Computing Patterns and express the MCPs as generic task graphs. For full details, we refer to Alowayyed et al. [23]. Fig. 1 shows the generic task graphs for all three computing patterns.
The Extreme Scaling (ES) pattern represents a type of multiscale model where one primary single-scale model is coupled to a set of serial and/or parallel auxiliary models on any scale, as shown in Fig. 1(a) and (b). The primary model is compute intensive, energy hungry, and highly scalable, whereas the auxiliary models are not. Therefore, the efficiency of this type of multiscale model is highly dependent on the efficiency of the primary model and the primary-auxiliary interactions. Assuming that developers have implemented the primary model efficiently, the main aim is to minimise interference between the primary and the auxiliaries. This can be done using load balancing while ensuring minimal communication between primary and auxiliaries. A serial auxiliary model can give rise to strong underutilisation of available resources (e.g. if it does not scale to a large number of cores), and special mechanisms to handle such a situation are required.
The Heterogeneous Multiscale Computing (HMC) pattern (Fig. 1(c)) represents the typical form of macro-micro coupling, where the numerical solver at the macro-scale level requires input from multiple micro-scale model instantiations (for instance to compute a spatially varying quantity, such as a constitutive equation, say a viscosity in a flow problem). Thus, the number of micro-scale models is dynamic and largely dependent on the dynamic evolution of the macro-model. The HMC manager has some control over the number of micro-scale models: it prevents redundant calculations (by storing results of previous micro-scale simulations in an HMC database and, where possible, extracting results from the database, e.g. by interpolating between results obtained earlier) and spawns extra micro-scale models when necessary. Typically, the number of micro-scale models will be very large and a single micro-scale run may require substantial computing resources and, hence, dominate computing and energy cost.
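The reuse of earlier micro-scale results can be pictured with a minimal Python sketch. It is not the HMC manager described in [23]: the class name `HMCDatabase`, the scalar macro-scale input, the `max_gap` threshold and the linear interpolation rule are all assumptions chosen for illustration.

```python
from bisect import bisect_left, insort


class HMCDatabase:
    """Hypothetical store of micro-scale results keyed by a scalar macro-scale input."""

    def __init__(self, max_gap):
        self.max_gap = max_gap  # widest bracket we are willing to interpolate across
        self.keys = []          # sorted inputs seen so far
        self.values = {}        # input -> micro-scale result

    def store(self, x, y):
        if x not in self.values:
            insort(self.keys, x)
        self.values[x] = y

    def lookup(self, x):
        """Return a stored or interpolated result, or None if a new micro run is needed."""
        if x in self.values:
            return self.values[x]
        i = bisect_left(self.keys, x)
        if i == 0 or i == len(self.keys):
            return None         # outside the sampled range: no extrapolation
        lo, hi = self.keys[i - 1], self.keys[i]
        if hi - lo > self.max_gap:
            return None         # neighbours too far apart to trust interpolation
        w = (x - lo) / (hi - lo)
        return (1 - w) * self.values[lo] + w * self.values[hi]


def micro_result(db, x, run_micro_model):
    """Use the database when possible, otherwise spawn a micro-scale run and store it."""
    y = db.lookup(x)
    if y is None:
        y = run_micro_model(x)
        db.store(x, y)
    return y
```

In this sketch a new micro-scale run is spawned only when the requested input lies outside the sampled range or between two stored inputs that are further apart than `max_gap`; a production manager would need a problem-specific error criterion instead.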
In the Replica Computing (RC) pattern, a large number of copies of tera- and/or peta-scale simulations are needed to produce statistically robust results. These replicas are not part of an overarching structure like HMC, but are spawned in the initial step. In this step, the parameter space for parameter sweeping is set. Then, simulations and data processing per replica take place. Both static and dynamic flavours of RC are considered in Fig. 1(d) and (e). All replicas execute independently of each other. If a replica (i.e. a simulation) fails, the RC pattern affords some level of fault tolerance, while taking care to maintain the overall statistics. This is the main difference with HMC. On the other hand, HMC and RC are similar in terms of load balance issues.

Design of multiscale computing patterns software
The Multiscale Computing Patterns software consists of three parts: (1) the Description Part, where the user describes the multiscale application, (2) the Optimisation Part, where the software predicts and optimises the application performance, and (3) the Execution Part, where the application is deployed using an underlying resource allocation service (in our case the QCG middleware). We present the components of the Multiscale Computing Patterns software, and their interrelations, in Fig. 2.
The logical description and the complete set of requirements of a multiscale application are collected in the description component. This part relies on concepts from the Multiscale Modelling and Simulation Framework. It facilitates the work of the end user by providing a single input mechanism for all multiscale applications, and it detects the type of MCP. The MMSF xMML description file was extended for Replica Computing to accommodate the notation of the number of replicas. The optimisation component determines which MCP optimisation applies, collects required performance figures, and calculates the relevant metrics (e.g. parallel efficiency, throughput, energy usage). Based on these results, a constrained optimisation is performed, resulting in a small set of the most suitable execution scenarios, which are passed to the execution component. The role of the execution component is to select the best allocation plan, based on the availability of the requested resources and cost criteria (time to complete, energy usage), and to start and monitor the execution.

Description component
The Description component (top layer in Fig. 2) contains an architecture-agnostic definition of the multiscale application, and its main requirements. It builds on concepts from the Multiscale Modelling and Simulation Framework. The description component consists of a task graph, submodel definitions, simulation and middleware parameters and user information, all feeding into the Translation Service.
The task graph is expressed in the form of a highly adaptable textual description (xMML, see [20]) which is used to detect motifs (repetitive submodels and dependencies). The task graph is also needed to observe workflow related issues such as the expected frequency of communication between submodels.
Submodel definitions contain all the information required for a single-scale model to run. This includes information on submodel-specific dependencies, and the resource requirements for each submodel (e.g., mandatory use of GPU-architectures, or a minimal memory requirement per core). The description component may rely on previously developed tools such as MAD/MaMe [24], and can already leverage existing configuration information from the FabSim automation environment [25].
All the simulation and middleware parameters are collected in a separate component. This includes input parameters, the required environment modules and resource limits for the overall simulation (all submodels and coupling library). Also, this component holds all information needed to compose the multiscale simulation (e.g. using MUSCLE2) and to execute the simulation (e.g. using QCG Middleware [26]) using (distributed) HPC infrastructures such as the Experiment Execution Environment (EEE). This component is designed such that existing known machine configurations from FabSim (machines.yml) can be directly reused in this context. Similarly, user-specific information can be directly reused from existing FabSim configurations (machines_user.yml). We present an example of the reuse of FabSim information within this context as part of the Binding Affinity application described in Section 4.2.
These three pieces of information are then supplied to a translation service, which merges and converts them into a format suitable for the optimisation component. Currently, the translation tool is application specific, and produces two XML files as output. One file, matrix.xml, is shown in Listing 1 in Appendix A and contains templated information from the submodel definitions. The other file, multiscale.xml, shown in Listing 2 (Appendix A), has information from the simulation and middleware parameters.
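The merge performed by the translation service can be sketched as follows. This is a hedged illustration only: the function `translate`, the dictionary layout of its inputs and every XML element and attribute name are hypothetical, and do not reproduce the actual schemas of matrix.xml or multiscale.xml given in Appendix A.

```python
import xml.etree.ElementTree as ET


def translate(task_graph, submodel_defs, sim_params):
    """Merge the three description inputs into two XML documents (hypothetical layout)."""
    # One <submodel> entry per node of the task graph, filled from its definition.
    matrix = ET.Element("matrix")
    for name in task_graph["submodels"]:
        sub = ET.SubElement(matrix, "submodel", name=name)
        for key, value in submodel_defs.get(name, {}).items():
            ET.SubElement(sub, key).text = str(value)

    # Couplings and global simulation/middleware parameters of the coupled application.
    multiscale = ET.Element("multiscale")
    for src, dst in task_graph["couplings"]:
        ET.SubElement(multiscale, "coupling", source=src, target=dst)
    params = ET.SubElement(multiscale, "parameters")
    for key, value in sim_params.items():
        ET.SubElement(params, key).text = str(value)

    return (ET.tostring(matrix, encoding="unicode"),
            ET.tostring(multiscale, encoding="unicode"))
```

Submodels without a definition yield an empty entry, which mirrors the idea of a template that the application designer completes later.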

Optimisation component
The main software tool within the Optimisation component is the Pattern-Driven Planner. This tool requires input from both the Translation service and the Node Description List. The node description list is updated regularly to reflect the current status of available nodes in the targeted supercomputers, and contains information on node types. A single node type represents a set of nearly identical nodes in terms of hardware configuration (e.g. processor type). The node types should be defined based on knowledge gathered a priori by the middleware from the infrastructure provider. Table 1 shows an example of node types.
The second layer contains the Pattern-Driven Planner and the Performance Estimator components. The Pattern-Driven Planner component collects measurements and/or predictions of performance of submodels, under various execution scenarios, from the Performance Estimator. Then, it uses this information to compute required cost functions (e.g. efficiency, throughput, energy usage, or a combination) on available resources. Given the specific MCP and all other available information, the Pattern-Driven Planner performs a constrained optimisation against these cost functions, and provides a selection of particularly suitable execution scenarios to the Execution component. The Execution component will then select the optimal execution plan based on the specified cost criteria (time to completion, energy consumption), by taking into account additional information only available to the middleware (e.g. estimated queueing time, live information on availability of resources, etc.).
The Measurements and Architecture Modelling components respectively store and calculate submodel performance information as a function of the chosen number of cores/nodes and problem size. The Single Scale Performance Model captures the scalability of submodels with respect to problem sizes and number of processors. The Performance Estimator, in turn, relies, among other things, on the Single Scale Performance Model to obtain, interpolate and/or calculate performance results for the multiscale model. For example, this could be achieved by interpolating between performance results of adjacent problem sizes in the multiscale model and/or by relying on performance models to predict the performance at a core count for which no measured values have yet been obtained.
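Both estimation routes mentioned above can be illustrated with a short Python sketch. It is not the Performance Estimator itself: linear interpolation and Amdahl's law are stand-ins chosen here for simplicity, and the function names and data layout are assumptions.

```python
def interpolate_cost(measurements, problem_size):
    """Linearly interpolate run time between the two nearest measured problem sizes.

    measurements: dict mapping problem size -> measured time (s) at a fixed core count.
    """
    if problem_size in measurements:
        return measurements[problem_size]
    sizes = sorted(measurements)
    if not sizes[0] < problem_size < sizes[-1]:
        raise ValueError("problem size outside the measured range")
    upper = next(s for s in sizes if s > problem_size)
    lower = sizes[sizes.index(upper) - 1]
    w = (problem_size - lower) / (upper - lower)
    return (1 - w) * measurements[lower] + w * measurements[upper]


def amdahl_time(t1, serial_fraction, cores):
    """Predict run time on `cores` cores from a one-core time using Amdahl's law."""
    return t1 * (serial_fraction + (1 - serial_fraction) / cores)
```

A real single-scale performance model would typically be fitted to the measured scaling data per node type rather than assume a fixed serial fraction.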
In the Measurements component, the overall cost of the submodel as a function of problem size is measured for 1 to n cores on the first node in a specific node type, and for 2 to N nodes assuming full occupancy on each node, for a given number of iterations. In the baseline case, the cost is represented as execution time, but these calculations can also be done for other cost metrics such as energy. The actual measurements can be obtained from tools designed specifically for performance profiling, such as Allinea MAP [27], the tool of choice in our research. A template of measurements is shown in Listing 3 in Appendix A; this measurement listing might contain specially marked values (i.e. NA) for node types where a specific single scale model is not supported.
The Architecture Modelling software provides predictions not only for existing machines, but also for emerging exascale configurations that do not yet exist. This allows users to assess how MCPs could optimally benefit from such hypothetical machines, or contribute to the co-evolution of such new architectures.
Based on performance results from the Performance Estimator, the Pattern-Driven Planner groups types of nodes into classes depending on the similarity of performance figures, type of computing pattern and cost criteria (e.g. efficiency, makespan, energy usage, resource usage, ...) computed as cost functions. Then, using multi-objective optimisation, the tool will generate a small number of alternative execution scenarios. The importance of the alternatives here is to give the Execution component the freedom to choose from a set of resources with comparable performance per submodel depending on the availability of these resources as well as on the variation in queue times. We will enhance this component to extend the patterns with capabilities to also consider issues related to energy awareness and fault-tolerance. All in all, for each pattern we will formulate constrained optimisation problems that deliver alternative execution profiles as output to the Execution component.
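One simple way to obtain a small set of alternatives from several cost functions is to keep only the non-dominated (Pareto-optimal) candidates. The sketch below illustrates that idea for two costs; it is an assumption about how such a shortlist could be built, not the algorithm used by the Pattern-Driven Planner, and the field names `time` and `energy` are hypothetical.

```python
def pareto_front(plans):
    """Keep plans that no other plan beats on both time and energy (lower is better)."""
    front = []
    for p in plans:
        dominated = any(
            q["time"] <= p["time"] and q["energy"] <= p["energy"]
            and (q["time"] < p["time"] or q["energy"] < p["energy"])
            for q in plans
        )
        if not dominated:
            front.append(p)
    return front


def shortlist(plans, k=3):
    """Return at most k non-dominated plans, fastest first."""
    return sorted(pareto_front(plans), key=lambda p: p["time"])[:k]
```

Handing over several non-dominated plans rather than a single optimum leaves the final choice to the component that knows the current queue and availability situation.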
The exact output of the Pattern-Driven Planner to the Execution component will be several allocation plans and other requirements, as described in the next section, needed to run the multiscale application. Here, the output file holds information about the kernels and corresponding helpers, the classes of node types and a set of allocation plans. An allocation plan is a specific mapping of the multiscale model to resources. Listing 4 in Appendix A shows the template of the node classes and allocation plans parts.

Resource allocation and execution component
The responsibility of the execution component is twofold. First, it needs to select the best allocation plan from the plans provided by the optimisation component. The selection pertains to the mapping of computational kernels to a specific set of physical resources of defined types, taking into account the (sometimes conflicting) requirements of users and resource providers. Second, once an optimal plan has been selected, this component needs to ensure the efficient and reliable execution of the application within the distributed heterogeneous infrastructure.
The execution component is mainly provisioned using the QCG environment [24], a mature middleware system deployed in many HPC centres across Europe. QCG delivers a set of ready to use components that can be installed and managed at each site, irrespective of the internal policies or local queueing systems. To fulfil expectations and objectives of both users and resource providers, QCG features an extendable brokering service which allows for customised brokering algorithms and strategies. In addition, QCG provides support for advance reservation, co-allocation and workflows, enabling the execution management of multi-kernel applications, with both cyclic and acyclic dependencies, on a distributed e-Infrastructure [20,21].
Deploying multiscale simulations on production e-infrastructures gives rise to a number of challenges that are difficult to anticipate before the execution component is reached. For example, the user objective for an immediate job start, e.g. by means of advance reservation, needs to be harmonised with the provider's objective for high resource utilisation. In addition, the Pattern-Driven Planner provides plans that are likely to be optimal from a user perspective, but have not yet incorporated the constraints imposed by the presence of other workloads in the e-infrastructure environment.
The QCG Pattern-aware Scheduler (which is part of QCG Broker) calculates which of the plans provided by the Pattern-Driven Planner is optimal with respect to the objectives of all involved stakeholders. In doing so it takes into account the current and historical load on e-infrastructure resources, including both the occupancy of the actual resources and the queue lengths. The Pattern-aware Scheduler can perform this optimisation with respect to required cost criteria, either a single time to completion criterion or a combination of two criteria, total energy expenditure and time to completion. Here, the time to completion is calculated by adding the predicted queueing time (supplied to QCG by the Queue Time Prediction Service) and the execution time (provided by the optimisation component). In the energy optimisation case, the QCG Scheduler first selects the set of candidate plans which, according to the QCG time to completion prediction, finish within the requested period of time, and then selects the plan with the minimal total energy expenditure (calculated and given by the optimisation component). Through its direct integration with the QCG environment, the Pattern-aware Scheduler accounts for the multi-kernel nature of multiscale applications and the fact that each kernel may behave differently in terms of performance and energy usage when executed on different resource types [26,28,29].
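The two selection modes described above can be written down compactly. The following Python sketch follows the stated rules (time to completion as queueing time plus execution time; in the energy case, filter by deadline and then minimise energy), but it is not QCG code: the function `select_plan`, the plan dictionary layout and the parameter names are hypothetical.

```python
def select_plan(plans, queue_time, deadline=None):
    """Pick one allocation plan.

    plans: list of dicts with 'resource', 'exec_time' (s) and 'energy' (J),
           standing in for what the optimisation component provides.
    queue_time: dict mapping resource -> predicted queueing time (s).
    deadline: if None, minimise time to completion; otherwise minimise energy
              among the plans predicted to finish within `deadline` seconds.
    """
    def time_to_completion(plan):
        return queue_time[plan["resource"]] + plan["exec_time"]

    if deadline is None:
        return min(plans, key=time_to_completion)
    candidates = [p for p in plans if time_to_completion(p) <= deadline]
    if not candidates:
        return None  # no plan meets the requested completion time
    return min(candidates, key=lambda p: p["energy"])
```

Note that a plan with the shortest execution time can still lose in either mode if its target resource has a long predicted queue.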
The QCG Pattern-aware Scheduler relies on a plugin-like architecture to gather all required information (see dashed boxes in the Execution component of Fig. 2). For example, the scheduler uses the Queue time metrics plug-in to get precise knowledge about the expected queue time on available resources. This plug-in is integrated with external resource-level components, in this case the Queue-time Estimation service. Similarly, we are planning to implement an Energy metrics plug-in and combine it with the QCG Pattern-aware Scheduler.
As the new brokering module uses new types of input parameters to specify the requirements of the MCPs, we have extended the job description interface and revised several internal schemas used to exchange information between the components in the QCG-Broker service. We present several key fragments of this extended description in Appendix F. Here, all jobs described using a patternTopology element will be processed using the new scheduling engine.
Based on the result of the QCG Pattern-aware Scheduler, the QCG Job Controller module prepares the execution environment by transferring input data and starting the job submission to one or more distributed resources. The resources in our e-infrastructure are made accessible to QCG Job Controller using services implementing the Basic Execution Service (BES) interface [30].
QCG Job Controller contains a set of specific adaptations to address the requirements for efficiently executing high performance multiscale simulations using high-end e-Infrastructures. Both the QCG-Broker interface and the core capabilities of the service components have been extended to support a range of pattern-based multiscale jobs. Specifically, to allow efficient execution of the Replica Computing Pattern applications, we have incorporated two additional QCG mechanisms: workflows and job arrays. We incorporated a modified version of existing workflow mechanisms [31], eliminating the need to transfer data between subsequent tasks executed on the same resource, and simplifying the execution of workflows in parameter sweep tasks. The job arrays functionality allows a set of independent tasks to be run on a resource and be considered as a single QCG task. In the Replica Computing Pattern these sets of subtasks can be scheduled by the middleware to be executed on various clusters, thereby balancing the overall load on the infrastructure. Job arrays not only help to reduce the management complexity of all tasks executed separately, but also increase the overall throughput of the system and decrease the total time to completion of a simulation.
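The job array idea can be illustrated independently of QCG. In the Python sketch below, replicas are grouped into fixed-size arrays and each array is placed on the currently least-loaded cluster; the greedy rule, the function names and the cluster labels are assumptions for illustration and do not describe the QCG scheduling policy or its API.

```python
def make_job_arrays(replica_ids, array_size):
    """Group independent replicas into fixed-size job arrays (the last may be shorter)."""
    return [replica_ids[i:i + array_size]
            for i in range(0, len(replica_ids), array_size)]


def assign_arrays(arrays, clusters):
    """Greedily place each array on the cluster holding the fewest replicas so far."""
    load = {c: 0 for c in clusters}
    placement = {c: [] for c in clusters}
    for array in arrays:
        target = min(clusters, key=lambda c: load[c])  # ties go to the first cluster
        placement[target].append(array)
        load[target] += len(array)
    return placement
```

Treating each array as one schedulable unit reduces the number of objects to be managed from the number of replicas to the number of arrays.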

Applications of the multiscale computing patterns software
In this section, we present three exemplar applications from different scientific domains (one from Fusion research and two from biomedicine, being cell based blood flow modelling and the Binding Affinity Calculator BAC) to demonstrate the capabilities and practical usage of the MCP software, and the benefits in terms of application performance. Our applications are mapped to two different computing patterns, with Fusion and cell based blood flow modelling mapped to the Extreme Scaling (ES) pattern and BAC to the Replica Computing (RC) pattern. Applications for HMC are currently under development. In addition, details of the required steps and various code snippets at each level of the software stack are presented from the perspective of the application developer. All performance figures presented were measured on the two supercomputers that participated in the studies, namely SuperMUC [32] (Tier-0 HPC from Leibniz-Rechenzentrum, Germany), and Eagle [33] (Polish national grid clusters from Poznan Supercomputing and Network Center, Poland). Further details are listed in Table 2.

Extreme scaling
In ES, the ultimate goal is to ensure minimal interference between the primary model and the auxiliaries. It may happen (as in the example of the cell based blood flow simulation) that the auxiliary models induce large waiting times for the primary model, thus potentially wasting resources and reducing resource usage. The Multiscale Computing Patterns software detects this situation automatically, and then interleaves two multiscale simulations, executing both at the same time [23]. This mechanism increases the resource usage efficiency. For ES, the efficiency of the multiscale model ϵ_M and the resource usage efficiency R are calculated as in [23], in terms of the following quantities: P_i is the number of cores used for submodel i, T_i is the execution time of submodel i excluding waiting times, T is the total execution time including waiting times, and ϵ_P is the efficiency of the primary model.
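As a rough illustration of how a resource usage figure can be computed from these quantities, the sketch below assumes that R is the fraction of allocated core-seconds spent computing, i.e. sum(P_i * T_i) / (T * sum(P_i)). This form is an assumption made for the example; the authoritative definitions of R and ϵ_M are those in [23]. The function name and the example numbers are hypothetical.

```python
def resource_usage(cores, busy_times, total_time):
    """Fraction of allocated core-seconds spent computing (assumed definition).

    cores:      P_i, cores allocated to each submodel
    busy_times: T_i, execution time of each submodel excluding waiting (s)
    total_time: T, total execution time including waiting (s)
    """
    used = sum(p * t for p, t in zip(cores, busy_times))
    allocated = total_time * sum(cores)
    return used / allocated


# Hypothetical example: a 1024-core primary model busy for 90 s and a
# 16-core auxiliary model busy for 10 s, within a 100 s run.
example = resource_usage([1024, 16], [90.0, 10.0], 100.0)
```

Under this assumed definition, long waiting times of the primary model lower R directly, which is the situation that interleaving two simulations is meant to improve.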

Fusion application
Nuclear fusion has the potential to produce clean and carbon-free energy, as physicists hope to demonstrate with ITER, which is a tokamak device that uses magnetic fields to confine plasma. However, the grand challenge of long-term plasma confinement requires an understanding of the interactions between very small-scale turbulence and large scale plasma behaviour [9,34]. Therefore, having a robust multiscale computing scheme to study this interaction has become a vital goal in the fusion community. Our targeted fusion application simulates the time evolution of a plasma's 1D profiles (for instance electron temperature) in the tokamak core with a transport code, while under the influence of anomalous transport coefficients computed by a 3D turbulence code and periodic 2D equilibrium reconstruction [34]. The transport, turbulence, and equilibrium codes are submodels developed separately and are well-benchmarked. These submodels share a common data interface and are embedded into MUSCLE2 as kernels, which allows for straightforward coupling through a simple and configurable script as described in [9]. Such a simulation is essentially multiscale in time, and corresponds to the ES computing pattern depicted in Fig. 1(b). The turbulence code is the primary submodel in this application because it requires the vast majority of the computational power compared to the other submodels.
The starting point in the description component of the software is to compose a task graph in xMML format. This text file (shown in Appendix B) contains the list of submodels involved, the time and space scales and input/output data of each submodel, and the coupling between submodel pairs through their respective input/output data. If desired, the user can deploy the jMML tool [20] to generate the task graph from the xMML [2] (displayed in Appendix D). Besides visual representations, the jMML tool can turn the content of a task graph into a skeleton configuration for MUSCLE2. The designer of the coupled application can implement submodels as MUSCLE2 kernels and add other simulation parameters (either global or specific to a kernel) to the MUSCLE2 configuration file [20]. An example of the fusion application's configuration file, which is written as a Ruby script, is displayed in Appendix C. Note that at this stage, the user can directly connect to a cluster where all executables, libraries and input data are present, and write an ad-hoc script to be submitted to the local batch queue system. However, the burden of manually adjusting the configuration and selecting the optimal cluster then lies on the user every time they want to run a simulation. The MCP software has features that relieve the user of these burdens by automatically selecting the best configuration for a given performance metric, as described in further detail in the remainder of this subsection.
The task graph is submitted and parsed by the Translation service along with other specifics provided by the developer, such as additional submodel definitions, details on middleware, simulation parameters, and user information. In the current implementation, the Translation service is a Python script which, as a result, creates two template XML files: matrix.xml and multiscale.xml. Matrix.xml contains information related to single scale submodels, such as measurements of their performance. An example of benchmark data on scaling of the primary submodel for two types of nodes is shown in Fig. 3. Multiscale.xml contains information related to the coupled application. These two templates are prefilled with information from the xMML file and can be completed by the application designer. An example is given in Appendix E.

Table 3
Performance for ES applications, namely Fusion and cell based blood flow modelling (RBC). T is the execution time in seconds (excluding waiting times) for the primary (Pr) and auxiliaries (aux) on $P_{\mathrm{Pr}}$ and $P_{\mathrm{aux}}$ cores, ϵ is the efficiency for the primary (Pr) and the multiscale model (M), and R is the resource usage.
Next, the outputs of the Translation service (matrix.xml and multiscale.xml) are passed on to the Pattern-Driven Planner, which in turn generates an XML job script for the selected middleware (the QCG). Currently, the Pattern-Driven Planner proposes three optimal plans that minimise the cost; an example of such plans is shown in Appendix F. At present, these plans are drafted based on the measurements of runs performed manually by the application designer. The next stage will be to enhance the Performance Estimator such that it can benchmark on-the-fly and interpolate to settings for which no performance data is available.
Finally, the job script from the Pattern-Driven Planner is submitted to the QCG. QCG selects one of the three plans and starts the simulation on the system(s) involved. For the fusion application, and in general for most ES applications, it is more sensible to select a plan in which all submodels run on a single site, because the auxiliaries do not require many resources. In that case, we should only care about serialisation due to serial auxiliary models and how that could impact the execution [23]. Also note that the speedup of the primary model is super-linear because the efficiency was calculated relative to 64 cores instead of one core, which may lead to latency hiding. For fusion, the time that the primary model spends waiting for the auxiliaries is not that large, as shown in Table 3, so no additional actions are required and, therefore, the resource usage is high.
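To make the remark about the 64-core baseline concrete, the following is a minimal sketch of a relative parallel efficiency. The formula is the standard definition of efficiency with respect to a reference core count and is an assumption here, since the text does not spell out the definition it uses; the function name and the timings in the usage note are hypothetical.

```python
def relative_efficiency(p_ref, t_ref, p, t):
    """Parallel efficiency of a run on p cores relative to a p_ref-core baseline.

    Assumed definition (not taken from the paper):
        efficiency = (p_ref * t_ref) / (p * t)
    A value above 1.0 means the scaling is super-linear with respect to the
    chosen baseline, which is possible when the baseline is not a single core.
    """
    if min(p_ref, p) <= 0 or min(t_ref, t) <= 0:
        raise ValueError("core counts and timings must be positive")
    return (p_ref * t_ref) / (p * t)
```

For example, with made-up timings, `relative_efficiency(64, 100.0, 128, 45.0)` is larger than 1 because doubling the cores more than halves the runtime relative to the 64-core reference.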
In particular, QCG selects the thin nodes in SuperMUC to run the fusion simulation (see Table 3). A production run with 1000 iterations using 2048 cores was completed successfully using the software scheme described earlier. The entire run was completed in approximately 11.1 h, or 22733 core hours. Among the three submodels, the primary submodel (a turbulence code based on gyrofluid theory) took about 17 s per iteration, while the transport and equilibrium auxiliary submodels took 1 and 3 s, respectively. However, the fusion plasma in this particular example needs approximately 4000 iterations to reach an equilibrium state. Therefore, improving efficiency becomes essential as future simulations require more computing time. The current simulation couples the submodels in series. One way to speed up the simulation is to run auxiliary submodels in parallel when possible, which is theoretically the case for the timescale-less equilibrium submodel. This idea is preliminary and its validation is necessary before such a transformation is added as a possible optimisation technique in the Pattern-Driven Planner.
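The core-hour figure quoted above follows directly from the wall-clock time and the core count, and the same arithmetic gives a rough idea of what a 4000-iteration run would cost. The sketch below is illustrative only: the function names are hypothetical, and the projection assumes that the cost per iteration stays constant, which the text does not claim.

```python
def core_hours(wall_hours, cores):
    """Core hours consumed by a job that holds `cores` cores for `wall_hours`."""
    return wall_hours * cores


def projected_core_hours(wall_hours, cores, iterations_done, iterations_needed):
    """Linear extrapolation of the cost to a longer run.

    Assumption: every iteration costs the same as in the measured run.
    """
    return core_hours(wall_hours, cores) * iterations_needed / iterations_done


measured = core_hours(11.1, 2048)                        # about 22733 core hours
projected = projected_core_hours(11.1, 2048, 1000, 4000)  # four times the measured cost
```

Under the constant-cost assumption, reaching equilibrium would therefore take roughly four times the resources of the reported production run.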
The ultimate goal for the fusion application is to use a more sophisticated turbulence model, namely replacing the gyrofluid model with a gyrokinetic model, to simulate plasma in the core of a tokamak. In addition, the ability to simulate plasma of a much larger volume would be necessary to understand possible instabilities that could destroy plasma confinement in the ITER tokamak. These goals require an extensive amount of computing resources, as well as intelligent and highly optimised coupling approaches. The Multiscale Computing Patterns software has demonstrated initial success with a smaller-scale problem. With further improvements, we envision that these patterns can efficiently handle future exascale calculations involving ITER and gyrokinetic simulations.

Cell based blood flow simulation
In this application we couple continuous blood flow simulations implemented in Palabos [35] (a fully parallelised open source Lattice Boltzmann Model) to cell based blood flow simulations implemented in the HemoCell suspension simulation framework (an Immersed Boundary Lattice Boltzmann Model (IB-LBM)) [36][37][38][39]. Specifically, we couple two continuous fluid fields ($C_L$ and $C_R$, which are serial auxiliary models) to the inlet and outlet of HemoCell, in order to provide the correct in- and outflow conditions to the more expensive suspension simulation (P, the primary model in this application), and to keep the domain for the cell based blood flow simulation as small as possible. This application has also been coupled using MUSCLE2. The performance measurements are shown in Fig. 4. In this case the auxiliary models ($C_L$ and $C_R$) require only a small amount of computing and execute on a small core count. The primary model, HemoCell, is compute hungry, but at the same time scales very well to a much larger core count, to the point that the execution time of the primary becomes comparable to the execution time of the auxiliary submodels. This situation was analysed in [23] and calls for a more advanced scheduling of the pattern, basically interleaving two instantiations in order to make best use of the available computing resources. The MCP software is able to orchestrate such more advanced scheduling of multiscale applications.
Table 3 shows that the resource usage for running this application is 0.35. This is due to the large waiting times of both the primary and the serial auxiliaries in the naive scheduling, which means wasting 1122 core hours (1.08 h per core) for the primary and 1755 core hours (12.5 h per core) for the auxiliary models by doing nothing but waiting. To solve this, we interleave two different instantiations with each other, as proposed in [23]. This mechanism was coordinated using wait/notify semantics [10]. By doing so, we doubled the resource usage efficiency by reducing the wasted core hours to 887 and 152 for the primary and auxiliary models, respectively. By implementing more advanced load balancing algorithms and selecting the right number of cores for the primary and auxiliary models, we can increase the resource usage efficiency even more. We are currently realising such more advanced features of the MCP software.
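The interleaving idea can be illustrated with a small, self-contained sketch of wait/notify coordination. It is not the MCP or MUSCLE2 implementation: the class and function names are hypothetical, the two "core partitions" are plain Python objects, and the submodel work is left as a placeholder. The point is only that an instance waits while a partition is occupied and is notified when it is released, so one instance can use the primary cores while the other uses the auxiliary cores.

```python
import threading


class CorePartitions:
    """Two core partitions, each usable by one instance at a time."""

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = {"primary": None, "auxiliary": None}
        self.log = []  # (partition, instance, event) tuples, appended under the lock

    def acquire(self, partition, instance):
        with self._cond:
            while self._owner[partition] is not None:
                self._cond.wait()  # sleep until a release() notifies us
            self._owner[partition] = instance
            self.log.append((partition, instance, "start"))

    def release(self, partition, instance):
        with self._cond:
            self._owner[partition] = None
            self.log.append((partition, instance, "end"))
            self._cond.notify_all()  # wake instances waiting for these cores


def run_instance(name, partitions, iterations):
    """One instantiation: alternate the primary and the serial auxiliary phase."""
    for _ in range(iterations):
        for phase in ("primary", "auxiliary"):
            partitions.acquire(phase, name)
            # The submodel(s) of this phase would execute here.
            partitions.release(phase, name)


def run_interleaved(iterations=3):
    """Run two instantiations that share the same two core partitions."""
    partitions = CorePartitions()
    threads = [
        threading.Thread(target=run_instance, args=(name, partitions, iterations))
        for name in ("instance_1", "instance_2")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return partitions
```

The log returned by `run_interleaved()` can be inspected to confirm that, on each partition, every "start" is followed by the matching "end" of the same instance, meaning that the two instantiations never occupy the same cores simultaneously.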

Replica computing (binding affinity calculator)
The procedure for replica computing is similar to that for Extreme Scaling (Section 4.1). The starting point for all RC pattern applications is the task graph, via an xMML textual description. A "multiplicity" tag in the "instance" node of the xMML description indicates that multiple instances (replicas) are required for that submodel. The Translation service detects the presence of this tag, identifies that the RC pattern is required and that the associated cost function in the Multiscale Computing Patterns software should be invoked.
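As a rough illustration of this detection step, the sketch below scans an xMML-like document for instances that carry a multiplicity. It is not the Translation service itself: the example XML, the element layout, the function names, and the choice to accept the multiplicity either as an attribute or as a child element are all assumptions made for illustration, and XML namespaces are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified xMML-like fragment (not the real BAC description).
EXAMPLE_XMML = """
<model id="bac">
  <topology>
    <instance id="namd" submodel="NAMD" multiplicity="25"/>
    <instance id="amber" submodel="AmberTools"/>
  </topology>
</model>
"""


def replica_counts(xmml_text):
    """Map instance id -> number of replicas for instances with a multiplicity."""
    root = ET.fromstring(xmml_text)
    counts = {}
    for instance in root.iter("instance"):
        value = instance.get("multiplicity")      # multiplicity as an attribute
        if value is None:
            child = instance.find("multiplicity")  # or as a child element
            value = child.text if child is not None else None
        if value is not None:
            counts[instance.get("id")] = int(value)
    return counts


def select_pattern(xmml_text):
    """Return "RC" when any instance requests replicas, otherwise "other"."""
    return "RC" if replica_counts(xmml_text) else "other"
```

With the fragment above, `replica_counts(EXAMPLE_XMML)` reports 25 replicas for the `namd` instance, and `select_pattern` therefore returns `"RC"`.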
The Translation service uses submodel definitions in separate files. To illustrate this, we describe the process for the Binding Affinity Calculator (BAC) [40], an automated molecular simulation based free energy calculation workflow tool, which we use to calculate ligand-protein binding affinities. Rapid and accurate calculation of binding free energies is of major concern in drug discovery and personalised medicine. The underlying computational method is based on classical molecular dynamics (MD). These MD simulations are coupled to the molecular mechanics Poisson-Boltzmann surface area (MMPBSA) method to calculate the binding free energies. For purposes of reliability, ensembles of replica MD calculations are performed for each method, and we have found that about 25 of these are required per MD simulation in order to guarantee reproducibility of predictions. This is due to the intrinsic sensitivity of MD to the initial conditions, since the dynamics is chaotic. Therefore, BAC is an ideal example of the replica computing pattern. BAC consists of a workflow where, within each replica, the output from one submodel (NAMD) is used as input to the next submodel (AmberTools). For more information, we refer to [40,41].
BAC previously used the FabSim [25] tool extensively to perform simulation runs and, therefore, we have added an option to the Translation service to allow the matrix.xml and multiscale.xml files to be completed (as much as possible) by reading FabSim configuration files. This demonstrates the potential versatility of our MCP approach, which should enable relatively straightforward integration with existing multiscale execution environments such as, in this case, FabSim. For example, the Translation service uses the machines.yml configuration file from FabSim, which lists the configuration settings of submodels on remote resources (e.g., location of libraries and required execution flags). Additional information specific to the Translation service (and not required by FabSim) can also be added to this file, including restrictions on the submodel (GPU/CPU compatibility, max/min number of cores, etc.). This allows submodel information to be reused if it is required for different multiscale applications. Then, FabSim compatible YAML files (shown in Appendix G) are used to assist the completion of matrix.xml and multiscale.xml. BAC currently does not use a coupling library (such as MUSCLE), so no additional files are required. However, in the future we foresee hybrid MCPs, where each replica could for instance be a full-blown ES by itself, and then such additional information would be needed.
Following the procedure outlined for the ES pattern, the user passes matrix.xml and multiscale.xml to the Optimisation component. Unlike the ES pattern, the user does not need to specify the required number of cores for the overall simulation. This is decided by the Performance Estimator by calculating the cost function.
Finding a cost function for RC that will generate resource allocation plans is different from that for ES. First, there is an obvious tradeoff between the number of replicas that must be executed, the minimum number of cores that a single replica needs, and the total number of cores available for the overall job. The performance data for RC uses the minimum time per replica for different node types in different hosts, as shown in Fig. 5 for a single BAC replica. This data is collected in the Single Scale Performance Model. In the simplest cost function, where we consider only time to solution, all replicas would be run concurrently on the node with the shortest running time per replica. However, there are several constraints that the Performance Estimator must also consider, such as queue constraints (number of concurrent jobs, time limitations, node availability and queueing time).
Most supercomputers have a limit on the number of jobs per user that can be run or queued at any moment in time. For example, on the SuperMUC machine, the maximum number of jobs that can be run concurrently on the thin nodes in the "general" queue is 8, while there are no restrictions on the Eagle machine.
As an example, if we have two RC applications which require 40 and 80 replicas respectively, the Pattern-Driven Planner needs to calculate which is faster: running all replicas on a single supercomputer, SuperMUC (while taking into account the constraint of 8 concurrently running jobs per user), or distributing the jobs among different hosts, for example across SuperMUC and Eagle, using the functionality in QCG to run across multiple resources. To illustrate how this could be coordinated, let us take the hypothetical situation that there is also a limit of 12 concurrent jobs on Eagle, to mimic the workload. In Fig. 6, we show the time to completion as a function of the number of "batches" running on SuperMUC, where a "batch" is defined as a set of 8 concurrently running jobs on SuperMUC. The remainder of the replicas are run on Eagle (again in "batches" of up to 12 jobs).
Fig. 6 shows our estimate that, for 40 replicas, the shortest time to completion is obtained when 2 "batches" are run on SuperMUC, while for 80 replicas, the minimum time to completion is obtained when 4 "batches" are run on SuperMUC. This assumes that all replicas take the same time (the shortest time-to-completion from our benchmarking), that all the replicas are independent (no communication) and that each "batch" runs directly after the other. It is clear there is a limitation to this model; it will only be realistic if the time spent in the queue is very short. Otherwise, the time to completion could be very different from that predicted in Fig. 6, and we could envisage the most efficient split in replicas across resources being completely changed if the queueing times are very different across the resources. The estimation of queueing time will be investigated and incorporated into the middleware in the future (as described in Section 3.3).
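The batch model described above can be written down in a few lines. The sketch below is a simplified reading of it, not the Pattern-Driven Planner's cost function: the function names are hypothetical, the per-replica times `t_a` and `t_b` are placeholders for benchmark values that are not reproduced here, and queueing time is ignored, exactly as in the stated assumptions (identical, independent replicas and batches that run back to back).

```python
from math import ceil


def completion_time(n_replicas, batches_a, t_a, t_b, limit_a=8, limit_b=12):
    """Time to completion when `batches_a` batches run on machine A.

    limit_a / limit_b: concurrent-job limits on machines A and B
    t_a / t_b: time for one batch (one replica's runtime) on each machine
    The two machines work in parallel, so the slower one sets the total time.
    """
    on_a = min(n_replicas, batches_a * limit_a)
    on_b = n_replicas - on_a
    time_a = ceil(on_a / limit_a) * t_a
    time_b = ceil(on_b / limit_b) * t_b
    return max(time_a, time_b)


def best_split(n_replicas, t_a, t_b, limit_a=8, limit_b=12):
    """Return (batches on A, time) minimising the modelled completion time."""
    max_batches = ceil(n_replicas / limit_a)
    times = {
        b: completion_time(n_replicas, b, t_a, t_b, limit_a, limit_b)
        for b in range(max_batches + 1)
    }
    best = min(times, key=times.get)
    return best, times[best]
```

With purely illustrative, equal per-replica times on both machines (`t_a = t_b = 1.0`), `best_split(40, 1.0, 1.0)` returns `(2, 2.0)` and `best_split(80, 1.0, 1.0)` returns `(4, 4.0)`. Different benchmark times, or a queueing-time term added to each batch, would shift the optimum.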
The output file description to the execution component is unified among all computing patterns, as described in Section 3.3. QCG also has the ability to distribute replicas to the intended machines and gather the results in one place. Fig. 7 shows timings of test BAC runs. In these studies, we run 10 replicas across two supercomputers, SuperMUC and Eagle. By running 8 replicas on SuperMUC and the rest on Eagle we reach the shortest time-to-completion.
To quantify this speedup [42], we compare the best timing of distributing the replicas, $T_{\mathrm{distr}}$, with the best timing of running them all on SuperMUC (with batch time), $T_{\mathrm{local}}$. The speedup is calculated as $S = T_{\mathrm{local}} / T_{\mathrm{distr}}$, and the speedup is 1.2 for our set of studies. This means that at the moment of running this set of replicas, we would gain a speedup due to the varied queueing time. The queueing time will be predicted and hosts will be automatically selected by QCG based on the knowledge of run time and queueing time, as stated before. Table 4 summarises the results from the BAC application for the time-to-completion runs in Fig. 7.

Table 4
Performance model for the RC application, namely BAC. N is the total number of replicas, $P_R$ is the number of cores per replica, $T_R$ is the time per replica in seconds, $T_{\mathrm{local}}$ and $T_{\mathrm{distr}}$ are the shortest total simulation times (including queueing times) for a local and a distributed run, and S is the speedup.

Discussion and conclusions
We have introduced and described the Multiscale Computing Patterns software, which extends the Multiscale Modelling and Simulation Framework to enable high performance multiscale computing based on three generic patterns. We demonstrated its usage and added value for three different types of multiscale applications: fusion and cell based blood flow simulation, both as examples of Extreme Scaling, and binding affinity calculation as an example of Replica Computing. In addition, these multiscale models are based on different coupling approaches, including MUSCLE2, as well as coupling using scripts and the FabSim automation toolkit.
We implemented and demonstrated the Extreme Scaling and Replica Computing patterns. The third computing pattern, Heterogeneous Multiscale Computing, will be implemented, discussed and demonstrated in future work. In the current implementation, each of the demonstrated applications highlights specific strengths of our software approach. For the fusion application, the software abstracts away the complications of HPC and chooses the most appropriate number of cores to meet the required cost criterion (i.e. time to completion). For blood flow, our approach enabled the use of double the resources otherwise accessible. Lastly, for binding affinity calculations, our approach serves to abstract away the choice as to whether the replicas should all run on one and the same computer or be distributed across multiple computers. This automated scheduling approach, which recommends execution across two resources, delivers a time-to-completion speedup of 1.2 compared to the scheduling of all replicas on a single resource.
The Multiscale Computing Patterns software maintains a separation of concerns in three areas. The top layer, the Description component, represents the logical description of the multiscale model. This is the part that is most 'visible' to the application users and developers. The Optimisation component is focused on performance aspects, and provides a number of optimisation criteria based on the type of the multiscale computing pattern and the required criteria. Finally, the Execution component integrates a range of functionalities from the underlying e-infrastructure, and uses the information from the Description and Optimisation components to create and run execution scenarios, each optimised either for minimal time to completion, or minimal energy consumption (given a fixed time to completion requirement). This modular implementation helps multiscale model developers to concentrate on optimising the single scale models of which the application is comprised, without needing to go into details about the HPC machines. The developer can choose the optimisation criteria required.
In this work, we have assembled a range of powerful functionalities for optimising and deploying multiscale applications on large scale HPC infrastructures operating at the multi-petascale, and presented an application-agnostic approach which reduces the development effort required for these purposes. We plan to release the software described here shortly. Generic approaches to High Performance Multiscale Computing are highly sought after across scientific disciplines, and indeed we have already begun propagating the first elements of our approach to other application domains such as astrophysics and materials modelling.

Fig. 1. Generic task graphs for the Extreme Scaling computing pattern (a,b), the Heterogeneous Multiscale Computing pattern (c) and the Replica Computing pattern (d,e). (a) shows the case where the auxiliaries $B_P$ are running in parallel with the primary model A, while in both (a) and (b) auxiliaries $B_S$ are in series with the primary model. In (c) multiple instances of the cost critical micro submodel (A) are called on-demand by the macro submodel (B). The macro-scale solver requires input from micro-scale solvers at every time step. (d) shows the case where multiple instances of submodels interact in phases, while in (e) the same operation is shown with the addition of a self-relaunch mechanism.

Fig. 2. Architecture of the Multiscale Computing Patterns software. The dashed-line boxes represent external components (which either exist separately or are under development).

Fig. 3. Runtime on different resources for one iteration of the primary submodel in the fusion application.

Fig. 4. Total runtime of cell based blood flow modelling submodels on Eagle [33] haswell_128 nodes. Note the difference in scale between the primary suspension model (left) and the auxiliary continuous flow models (right).

Fig. 5. Time and efficiency per replica (NAMD kernel) for different numbers of cores on different node types.

Fig. 6. Theoretical time of running multi-replica simulations across 2 resources (SuperMUC and Eagle), as a function of the number of "batches" (i.e. sets of 8 concurrent jobs) on SuperMUC. The remainder of the replicas are run as batches of up to 12 concurrent jobs on Eagle.

Fig. 7. Time to completion of multi-replica simulations across 2 resources (SuperMUC and Eagle), as a function of the number of replicas, ten in total, distributed on SuperMUC and Eagle.

Table 2
Resources used for the measurements in Sections 4.1 and 4.2.