Stride: a flexible platform for high-performance ultrasound computed tomography

Advanced ultrasound computed tomography techniques like full-waveform inversion are mathematically challenging and orders of magnitude more computationally expensive than conventional ultrasound imaging methods. This computational and algorithmic complexity, and a lack of open-source libraries in this field, represent a barrier preventing the generalised adoption of these techniques, slowing the pace of research and hindering reproducibility. Consequently, we have developed Stride, an open-source Python library for the solution of large-scale ultrasound tomography problems. On one hand, Stride provides high-level interfaces and tools for expressing the types of optimisation problems encountered in medical ultrasound tomography. On the other, these high-level abstractions seamlessly integrate with high-performance wave-equation solvers and with scalable parallelisation routines. The wave-equation solvers are generated automatically using Devito, a domain specific language, and the parallelisation routines are provided through the custom actor-based library Mosaic. Through a series of examples, we show how Stride can handle realistic tomographic problems, in 2D and 3D, providing intuitive and flexible interfaces that scale from a local multi-processing environment to a multi-node high-performance cluster.


Introduction
Ultrasound computed tomography techniques such as full-waveform inversion (FWI) have the potential to produce high-resolution, 3D reconstructions of tissues such as the breast [41,34], the limbs [40], or the adult human brain [16].However, generalised adoption of these techniques is hindered by the fact that tomography algorithms are computationally demanding and algorithmically complex, while existing medical tomography codes are, as far as we are aware, closed source, difficult to maintain, and slow to adapt to new research.
FWI is a technique, originally developed in the field of geophysics, that produces reconstructions of tissue properties by solving an associated inverse problem.FWI is computationally expensive because, for realistic 3D problems, it requires the solution of thousands of partial-differential equations (PDEs) and the storage of hundreds of gigabytes ORCID(s): of memory at every iteration in order to estimate billions of parameters.At the same time, FWI is algorithmically challenging due to the non-linear, non-convex nature of the inverse problem being solved.Therefore, any software for solving FWI problems has to address its computational and algorithmic needs, but should also emphasise the high-level, problem-specific abstractions that are necessary to ease the adoption of these tomographic techniques.
In the fields of geophysics and seismic exploration, different approaches have been taken by open-source libraries to solve these issues.On one hand, libraries like Madagascar [14], SimPEG [9] and PySIT [18], have managed to provide flexibility and high-level abstractions, but have done so at the expense of performance.On the other hand, libraries like SAVA [22] and JavaSeis [17] have focused on performance at the expense of flexibility and extensibility.As a way to bridge the gap between these two extremes, libraries such as SeisFlows and Pyatoa [28,8], jInv [33], and Waveform [12] provide flexible interfaces in high-abstraction languages like Python, Julia or MATLAB that interface with highperformance, hand-tuned solvers.LASIF and Inversionson [23,37] have followed a similar approach by providing  (1), based on which the tomographic problem is expressed (2).The tomographic problem becomes fully defined when appropriate physical modelling is introduced (3).The execution of Stride is parallelised using the custom library Mosaic (4), and tools are provided to save and load its details (5).modular seismic work-flow management that wraps a highperformance tomography solver.
Recently JUDI [42], written in Julia, has gone a step further by providing high-level abstractions in a modern language together with high-performance solvers that are automatically generated by the domain-specific language (DSL) Devito [26,27].Automatic code generation for solvers is increasingly important with an ever growing number of specialised architectures, from traditional central processing units (CPUs) to graphical processing units (GPUs) and fieldprogrammable gate arrays (FPGAs), as well as associated parallel programming languages (Cuda, OpenACC, etc.).Fine tuning codes for each of them by hand would be a daunting task for most researchers, whereas DSLs like Devito can generate code that is automatically tuned for each target architecture and parallel language.In doing so, DSLs also increase productivity by simplifying the implementation of new types of physics and discretisations.
The high computational complexity of FWI also requires, for realistic problems, that codes can be deployed to specialised high-performance computing (HPC) systems like multi-node clusters or cloud computing services.This represents a further barrier for domain scientists, who are generally not proficient in the use of HPC systems.Of the reviewed geophysical and seismic libraries, only some of them, such as LASIF and Inversionson, and SeisFlows and Pyatoa, have been designed with HPC deployment and scaling in mind.
Here, we present Stride, an open-source Python library for medical ultrasound tomography that emphasises flexibility and modularity, high performance, and scalability.It achieves this, firstly, through high-level, domain-specific abstractions and heuristics.Secondly, by integrating with the automatic code generation library Devito.Finally, we introduce a parallelisation library for seamless HPC deployment and scaling.Stride is available on GitHub1 and a complete documentation of its interfaces is available online 2 .
The remaining of this paper is structured as follows: in Sec. 2, we will present an overview of the structure of Stride, followed by a more detailed exploration of each of its components with accompanying examples; in Sec. 3, we will assess the accuracy of the wave-equation solvers provided by Stride, and we will present examples of tomographic reconstructions in 2D and 3D, using both numerical and experimental data; finally, we will present our discussion and proceed to our conclusions in Sec. 4 and Sec. 5, respectively.

Software structure
Stride has been designed to address the computational and algorithmic complexity of tomographic imaging by providing high-level interfaces that are modular and extensible, and that closely match the mental framework of domain specialists.It has been implemented in Python, a high-level, interpreted programming language that provides characteristics such as portability, ease of use, and dynamic typing.We have chosen Python because it is the de facto language for scientific computing and machine learning, with a large community and package ecosystem.
The high-level interfaces provided by Stride are aimed at addressing five fundamental aspects in high-performance ultrasound computed tomography (Fig. 1): 1. first, abstractions and tools are provided for the solution of optimisation problems, which are the basis for most tomographic imaging algorithms; 2. based on these, a series of classes encapsulate the definition of the tomographic problem being solved, e.g. the transducers employed or the signals used to excite them; 3. the relevant physical processes, such as acoustic or elastic wave propagation, are then modelled by using appropriate solvers that execute high-performance code through DSLs like Devito; 4. scaling of these algorithms, from a local workstation to HPC clusters, is achieved by using an integrated parallelisation library called Mosaic; 5. finally, tools are provided for saving and loading the different components of the problem using a standardised file format.
Each of these will be presented in detail in the following five sections.

Abstractions for solving optimisation problems
Techniques such as ultrasound computed tomography, optoacoustic tomography [3], or even ultrasound calibration techniques like spatial response identification [10,11], are commonly formulated as mathematical optimisation problems, which are solved numerically by using local methods like gradient descent.Therefore, a fundamental necessity when implementing these techniques is the availability of abstractions that allow us to pose our optimisation problems, calculate gradients of those problems with respect to the relevant parameters, and then apply these gradients through some local optimisation algorithm.In the next paragraphs, we introduce the abstractions that, being at the core of Stride, enable the solution of such inverse problems.
Consider a continuously differentiable function ( ), which can be expressed as ( ) = ⟨ ̂ ( ), 1 ⟩ with some adequate function ̂ ( ) and some bilinear form ⟨ , ⟩.We know that the derivative of ( ) with respect to is, where ∇ ( ) represents the derivative of an operator ( ) in the direction , and the derivative is by definition linear in the differentiation direction.Consider now that = ( ) is another continuously differentiable function.Then the derivative of ( ) with respect to is, by virtue of the product rule.At this point, we introduce the concept of the adjoint of an operator.Given an operator ⋅, its adjoint is * ⋅, defined so that ⟨ , ⟩ = ⟨ , * ⟩.Then, we can rewrite the expression as, That is, the derivative of function ( ) with respect to can be calculated by finding the derivative of ̂ ( ) with respect to its input and then applying the adjoint of the Jacobian of ( ) on the result.In the discrete case, this is equivalent to the Jacobian-vector product.
Similarly, if we added a third function = ( ), then the same result could be obtained for the derivative of ( ) with respect to , and the same procedure could be followed for any arbitrary chain of functions for whose inputs we wanted to calculate a derivative.This procedure, known as the adjoint method or backpropagation in the field of machine learning, is effectively the reverse mode that automatic differentiation libraries provide to calculate derivatives, albeit in the continuous limit.This is the core abstraction used in Stride.
Stride considers all components in the optimisation problem, from PDEs to objective functions, as mathematical functions that can be arbitrarily composed, and whose derivative can be automatically calculated through the procedure presented above.In Stride, each of these functions is a stride.Operator object, where their inputs and outputs are stride.Variable objects (Listing 1).
Listing 1: Example calculation of the gradient of a chain of functions using Stride.Note the use of the await syntax that is needed for compatibility with the Mosaic parallelisation library.from s t r i d e import V a r i a b l e x = V a r i a b l e ( name=" x " , n e e d s _ g r a d=T r u e ) z = a w a i t h ( x ) y = a w a i t g ( z ) w = a w a i t f ( y ) a w a i t w .a d j o i n t ( ) # T h e g r a d i e n t i s n o w i n " x .g r a d " When each stride.Operator is called, it is immediately applied on its inputs to generate some outputs.At the same time, these outputs keep a record of the chain of calls that have led to them within a directed acyclic graph.When w.adjoint() is called, this graph is traversed from the root w to the leaf x, calculating the gradient in the process.Only the leaves for which the flag needs_grad is set to True will have their gradient computed, which will be stored in the internal buffer of the variable x.grad.Now, we proceed to apply these general abstractions to find the gradient of a more practical optimisation problem.Consider the PDE-constrained optimisation problem, * = argmin ( , given some scalar objective function or loss function ( , ) and some PDE ( , ) = , for some vector of state variables and a vector of design variables .Considering ( , ) to be an adequate, continuously differentiable function in some neighbourhood of , we can apply the implicit function theorem.Then ( , ) = has a unique continuously differentiable solution ( ) and its derivative is given by the solution of, We can then define a reduced objective ( ) = ( ( ), ) = ⟨ ̂ ( ( ), ), 1 ⟩ , and we can take its derivative with respect to by using the previously introduced procedure, Substituting expression 6 into expression 7 we obtain, where ( ) is the solution of the adjoint PDE, In this optimisation problem, both ( , ) and ( , ) would be stride.Operator objects.Adding new functions to Stride requires defining a new stride.Operator subclass that implements two methods, forward and adjoint (Listing 2).
The abstractions presented allow us to intuitively pose optimisation problems and calculate derivatives of an objective function with respect to the parameters of interest.However, in order to solve the problem, we have to apply this derivative to update our guess of the parameters and repeat the procedure iteratively until we are satisfied with the final result.
Stride provides local optimisers of type stride.LocalOptimiser that determine how parameters should be updated given an available derivative.For our previous example, we can follow the procedure in Listing 3 to apply a step of gradient descent in the direction of our calculated derivative.Writing new, user-defined optimisers only requires the creation of a stride.LocalOptimiser subclass that takes the stride.Variable being optimised when the class is instantiated and that defines the method step(), which executes a single step in the optimisation process.
In order to iterate through the optimisation procedure, we could use a standard Python for loop.However, we also provide in Stride a stride.OptimisationLoop to use in these cases, which will help structure and keep track of the optimisation process.
Iterations in Stride are grouped together in blocks, with the stride.OptimisationLoop containing multiple blocks and each block containing multiple iterations.Partitioning the inversion in this way allows us to divide the optimisation more easily into logical units that share some characteristics.For instance, in FWI it is common to gradually introduce frequency information into the inversion to better condition the optimisation.In this case, it would make sense to assign one block to each frequency band, and run that band for some desired number of iterations.Listing 4 adds a stride.OptimisationLoop around our previous example.
Listing 3: Once a gradient has been calculated, a step in the optimisation algorithm can be taken by using a

Problem definition
In addition to providing abstractions for solving optimisation problems, Stride introduces a series of utilities for users to specify the characteristics of the problem being solved, such as the physical properties of the medium or the sequence in which transducers are used.
In Stride, the problem is first defined over a spatiotemporal grid, which determines the spatial and temporal bounds of the problem and their discretisation (Listing 5).Currently, we support discretisations over rectangular grids, but other types of meshes could be introduced in the future.On this spatiotemporal mesh, we define a series of gridaware data containers, which include scalar and vector fields, and time traces.These data containers are subclasses of stride.Variable.Based on this, we can define a medium, a stride.Medium object, a collection of fields that determine the physical properties in the region of interest.For instance, the medium could be defined by two stride.ScalarField objects containing the spatial distribution of longitudinal speed of sound and density, as in Listing 6.
Next, we can define the transducers, the computational representation of the physical devices that are used to emit and receive sound, characterised by aspects such as their geometry and impulse response.These transducers are then located within the spatial grid by defining a series of locations in a stride.Geometry.In Listing 7 we instantiate Finally, we can specify an acquisition sequence within a stride.Acquisitions object (Listing 8).The acquisition sequence is composed of shots (stride.Shot objects), where each shot determines which transducers at which locations act as sources and/or receivers at any given time during the acquisition process.The shots also contain information about the wavelets used to excite the sources and the data observed by the corresponding receivers if this information is available.
All components of the problem definition can be stored in a stride.Problem object, which structures them in a single, common entity.

Physical modelling
Physical modelling is defined in Stride through stride.Operator objects that represent specific implementations of a numerical solver applied to a PDE.Stride does not prescribe a specific solver or numerical method, and different codes and implementations can be integrated with it as long as they conform to the stride.Operator interface.By default, Stride integrates with the Devito library, a domain-specific language that generates highly optimised finite-difference code from high-level symbolic differential equations [26,27].Using Devito, we provide an out-of-thebox implementation of the second-order isotropic acoustic wave equation, for which Devito automatically generates code that can be readily executed in parallel on CPUs using Open Multi-Processing (OpenMP), and on GPUs using both OpenMP and OpenACC.
Acoustic modelling in Stride is governed by the equation, where ( , ) is the pressure, ( ) is the longitudinal speed of sound, ( ) is the mass density, = −2 0 −1 , and 0 ( ) is the absorption coefficient.The implementation of the acoustic wave equation is fourth-order accurate in time and tenth-order accurate in space.This results in a stability region with Courant-Friedrichs-Lewy (CFL) constant of 0.81 in 2D and 0.66 in 3D [2], as well as the requirement of a minimum of 3 points per wavelength (PPW) to minimise numerical dispersion.Our solver includes options for both constant and variable density and attenuation.Attenuation follows a power law, with frequency dependence controlled by the parameter in the equation, which can take values 0 and 2. In these cases the implemented derivative is not fractional.
In terms of boundary conditions, Stride includes options for a sponge absorbing boundary [43] or a perfectly matched layer [15].In all cases, sources and receivers can be defined in locations off the grid, with both bi-/tri-linear interpolation and high-order sinc interpolation [19].It is important to note that current, out-of-the-box implementations of the adjoints of our PDE solvers consider domains to be unbounded, as these represent the most common scenario in ultrasound imaging.However, alternative boundary conditions can be readily accounted for through user-level extensions of the PDE operators.Although physical modelling in Stride is currently focused on finite-difference methods, future releases could include integration with pseudospectral-element DSLs such as Dedalus [6] or finite-element DSLs like FEniCS/Firedrake [24,32].

Parallelism
In practice, derivatives of the optimisation problem are not calculated one data point at a time, but in batches, and the result is averaged to obtain an estimate of the gradient for that iteration.Because, in most cases, each of these data points is fully independent, this can be exploited so that they are calculated in parallel.For some simple problems, this can be done within a single workstation.However, in most practical problems, compute and memory demands require that these computations are mapped across different interconnected sets of hardware, such as multi-GPU systems and CPU clusters, running locally, remotely, or on the cloud.
The most important limiting factor when scaling reallife FWI workloads in parallel environments is memory a w a i t t a s k _ 1 allocation, management, and communication, with potentially hundreds of gigabytes being stored and transferred during the optimisation process.Therefore, a parallelisation framework is required that offers fine-grained control of the computational workload allocation and memory management for code developers, while also providing the end user with a high level of abstraction that integrates tightly with the optimisation constructs provided by Stride.We have developed Mosaic to facilitate the expression of parallelism in Stride in an intuitive manner.
Mosaic is an actor-based parallelisation library based on asynchronous, zero-copy message passing through Ze-roMQ sockets [44].Actors in Mosaic are called tessera, and can be generated by decorating any Python class using @mosaic.tessera.When instantiating a class that has been decorated, Mosaic will start a remote instance of that class in one of the workers.At this point, remote method calls to that tessera can be executed and the attributes of that remote object can be accessed.An example of how Mosaic is used can be found in Listing 9.
In Mosaic, subsequent method calls to a remote object are guaranteed to be executed in order, but calls to different remote objects are not.However, if there are explicit dependencies between two or more remote method calls, Mosaic will ensure that these are executed in the right order (Listing 10).
The structure of the Mosaic runtime, which can be seen in Fig. 2, is composed by a series of processing units, which could be located in a single, local workstation or distributed across a remote network.The first of such units contains a monitor process, a warehouse process, and a head process.The monitor process collects information about the Mosaic network, including occupation rate, resource use and connection state.The warehouse process acts as a centralised key-value storage location that is accessible from across the whole Mosaic network.The head process is the place where the main user code is executed.In each of the remaining processing units, a node monitor and one or more workers are allocated.The node monitor keeps track of the runtime status of its local processing unit and oversees the life cycle of each of the workers in its unit.Finally, the workers act as containers for tessera actors, whose methods can be executed remotely.All processing units in the Mosaic network are directly interconnected to each other, creating a decentralised communication mesh.
Mosaic can be run in interactive mode in a Jupyter notebook, or from a terminal window using the mrun command.The Mosaic runtime can be used without any code changes in a local multi-processing environment or a multi-node cluster.Therefore, Mosaic gives us the flexibility to parallelise work across multiple CPUs within a single compute node, as well as across multiple interconnected nodes, with the distribution topology related to the specific application at hand.Additionally, our Devito solvers can parallelise the execution of the wave equation across multiple CPU cores by using thread-level parallelism.

File input and output
As the popularity of ultrasound tomography increases, the number and size of datasets are also growing, but no standard format exists for their exchange.This slows algorithm development and limits research reproducibility.In order to address this, we have introduced with Stride a standardised file specification and a set of tools to interact with it.
In the setup of ultrasound tomography workflows, there are usually a number of intermediate files that are generated describing aspects such as medium properties, transducer impulse responses or data recorded during laboratory experiments.In Stride, we use the Hierarchical Data Format (HDF5) [36] for saving and loading these datasets and provide a series of tools to conveniently interact with them.Fig. 3 shows the basic file specification proposed in Stride for the different components of a standard tomographic workflow.

Modelling accuracy
We have validated the accuracy of the acoustic solver by comparing it against an analytical solution of the wave equation for a homogeneous medium [30].The comparison was performed, in 2D and 3D, by transmitting a threecycle tone burst centred at 500 kHz into a medium with constant speed of sound of 1500 m/s, constant density, and no attenuation.The employed grid was sampled at 0.250 mm in space (minimum of 8 PPW) and 0.060 µs in time (maximum CFL constant of 0.36).The resulting acoustic The runtime is divided into several logical processing units, which could represent, for instance, processes in a local multiprocessing environment or different machines in a multi-node cluster.In the first processing unit, the user code is executed in the head, while the monitor tracks the status of the runtime and the warehouse acts as a central storage unit.In the remaining processing units, a node monitor is allocated to track the status of that local unit and communicate this to the global monitor, and one or more workers are also created to execute tessera calls.All endpoints in the Mosaic runtime are interconnected to each other.
wave was then recorded at 51 equispaced points, starting at the transmission location and increasing in distance up to a maximum separation of 300 mm.
Results for the comparison are shown in Fig. 4, both for the 2D (Fig. 4-A) and the 3D cases (Fig. 4-B), where errors with respect to the analytical solution were calculated using the normalised root-mean-square error.We can see how the Stride numerical solutions closely match the analytical ones, remaining accurate at a significant distance from the transmission site.
We have performed a further validation of the Stride acoustic solvers on a more complex medium with inhomogeneous speed of sound, density, and attenuation of order zero, for which an analytical solution is not available, by comparing it against kWave [38], a state-of-the-art ultrasound modelling library written in MATLAB and based on pseudospectral element methods.The comparison was performed using a human skull section, seen in Fig. 5-A, sampled at 0.125 mm (minimum of 24 PPW), and illuminated by a bowl ultrasound transducer with a 64 mm radius of curvature and a 64 mm aperture diameter.The transducer surface was discretised using 20,000 point sources, evenly distributed using Fibonacci spirals [39].This example forms part of a transcranial ultrasound simulation benchmarking and intercomparison exercise organised by the ITRUSST (International Transcranial Ultrasonic Stimulation Safety and Standards) planning group [4].The transducer was excited by a continuous sinusoidal wave at 500 kHz and the simulation was run with a step size of 0.016 µs (maximum CFL constant of 0.36) until steady state was reached.The magnitude of the pressure field at the excitation frequency was then extracted after Fourier transform.Fig. 5-B and C show a 2D slice through the resulting 3D wavefield, from which we can observe the good agreement between both solutions.A similar conclusion can be extracted from the 1D profiles, seen in Fig. 5-D and E. The agreement between both solvers is quantitatively confirmed by a relative error of 1.64%, calculated over the entire 3D volume.Existing differences between the results of both solvers are likely due to the use of different numerical methods to solve the wave equation, as well as differences in source injection routines and boundary condition implementation.It is important to note that implementation differences cannot be fully eliminated, even in the limit where both numerical methods converge, due to the fact that Stride and kWave are solving fundamentally different equations in order to model acoustic wave propagation: Stride solves the secondorder linear acoustic wave equation, whereas kWave solves three coupled equations that are equivalent to a generalized Westervelt equation.

Imaging in 2D
For our first imaging experiment, we extract a 2D slice from a numerical breast model as seen in Fig. 6-A.The Listing 11: To image the spatial distribution of speed of sound, we create a stride.ScalarField(..., needs_grad=True) and set the starting distribution to be 1500 m/s everywhere.We also create a stride.GradientDescent optimiser to update the variable at every iteration.To make use of the gradient-calculation capabilities of Stride, we instantiate our speed-of-sound field with needs_grad=True, and set the starting model to a constant sound speed of 1500 m/s (Fig. 7-B).We also instantiate a Listing 12: We create the necessary operators for the reconstruction.The keyword argument len=num_workers controls the amount of copies of the operators to be instantiated by Mosaic in each remote worker.gradient descent optimiser to update our variable (Listing 11).We can see in Listing 11 how the stride.ScalarField has been instantiated by calling parameter().Using this method will ensure that, as the field is sent across the Mosaic network, a reference to the original object will always be maintained.This will allow us to calculate the gradient in different workers and then propagate the results back to the local runtime.
Then, we can instantiate our operators remotely, creating one copy for each available worker (Listing 12).In this case, we use an operator for the PDE and another one for the objective function, and we also create pre-processing operators for our source wavelets and our output time traces.
We perform the inversion by gradually introducing frequencies, starting at 300 kHz and going up to 600 kHz.
We do this by running the optimisation loop in blocks, with each block using a different frequency band.At each block, we complete 8 iterations, randomly selecting 16 shots without replacement in each of them.That is, each shot is used once at every frequency band.We run the function in Listing 13 for every iteration of the reconstruction loop in Listing 14.We run this inversion on a local multi-processing environment, within a Jupyter notebook, by simply adding the command mosaic.interactive("on")at the beginning of our notebook.This workstation is equipped with 64 GB of memory and 6 physical cores (Intel i7-8700K, 6 cores, 3.70 GHz).The acoustic Devito PDE was compiled using the GNU gcc compiler version 7.5, and was executed on the Jupyter notebook using 3 Mosaic workers and OpenMP thread-level parallelism with 2 threads for each worker.Each of the Mosaic workers calculates the gradient for a single shot at a time, which entails one forward propagation and one adjoint propagation of the acoustic solver, before combining the gradients for all shots at each iteration.With this configuration, each shot gradient calculation took a total of 2.99 ± 0.30 s.
Once the optimisation loop runs through all frequency bands, a final reconstruction is obtained (Fig. 7-C).We calculate the mean of the absolute value of the difference between the final reconstruction and the original model, which is displayed in Fig. 7 with the symbol .We can see how the reconstruction closely matches the ground-truth model, both qualitatively and quantitatively.As expected, inaccuracies can be observed in the reconstruction, which can be explained through a number of factors.First, limited sampling of the wavefield is performed at the boundaries of Listing 13: At every iteration, a subset of the available shots are selected randomly to calculate a gradient.The calculated gradient is then used to update the speed of sound distribution.

a s y n c d e f r u n _ i t e r ( f_max ) : # S e l e c t s o m e s h o t s f o r t h i s i t e r a t i o n
s h o t _ i d s = a c q u i s i t i o n s .the model because a finite number of receivers is used.Second, the available frequency bandwidth is also necessarily finite, which will limit resolution and prevent high-contrast interfaces from being perfectly recovered.Next, we apply the same imaging script that we have just introduced to now image an experimental tissue-mimicking phantom.A polyvinyl alcohol (PVA) cryogel phantom was constructed with two layers of different speed of sound values and an inner cavity filled with water [7].The dimensions of the phantom are, approximately, 57.4 mm in width, 70.4 mm in height, and 130 mm in depth.Speed-of-sound values for the different layers of the phantom were experimentally measured using time of flight to be 1521 ± 3 m/s for the outer layer and 1502 ± 4 m/s for the inner layer.A photograph of the cross section of the phantom can can be seen in Fig. 8-A.Data were then acquired using two P4-1 transducers (ATL, USA), each of which contains 96 transmitting and receiving elements.The two P4-1 transducers were independently attached to two rotary motors, allowing them to move around the phantom for full illumination.Data were acquired by transmitting with a centre frequency of 1.4 MHz.
The inversion was performed over 120 µs, in steps of 0.048 µs (maximum CFL constant of 0.37), using a spatial sampling of 0.200 mm (minimum of 4.67 PPW) and a grid size of 890×890.Imaging was carried out using a single block and a single frequency band with an upper limit of 700 kHz across a total of 152 iterations.During each iteration, 10 shots were selected randomly without replacement so that each shot was used four times at the end of the block.A single frequency band is sufficient in this example because, for this particular experiment, the starting point of our inversion is close enough to the minimum of the optimisation to ensure convergence.Simultaneously, the resolution offered by this frequency band (with a half-wavelength of approximately 1 mm in water) is sufficient, given the size and level of detail of the phantom, to recover a high-resolution reconstruction.
Using a starting model that contained homogeneous water (Fig. 8-B), a high-resolution reconstruction of the phantom is obtained (Fig. 8-C).Stride can successfully recover the two layers of speed of sound, as well as the internal water cavity.The reconstruction shows high contrast between layers, and the correct recovery of the complex details at the interface between them.We can also see how, at some points, the two layers of the phantom seem to gradually dissolve into one another instead of presenting sharp interfaces.This could be an imaging artefact due to errors in the calibration of the data acquisition setup, but could also be due to the natural degradation of the phantom, which could have led to the two layers merging at these locations.
We run this inversion on the same workstation as the previous example, using the same 3 Mosaic workers, so that each shot gradient calculation took a total of 28.32 ± 4.46 s.Adding a single argument to the PDE call, pde(..., platform="nvidia-acc"), is sufficient to run the same inversion on an available GPU instead of the CPU.In this case, the Devito-generated OpenACC solver is compiled using the PGI pgc++ compiler version 21.2.Then, using the same workstation, equipped with an NVIDIA GeForce RTX 2080 Ti with 11 GB of memory, and a single Mosaic worker, each shot gradient calculation took a total of 5.63 ± 0.07 s.

Imaging in 3D
Although relevant when imaging structurally simple, soft tissues such as the breast, 2D imaging on its own is of limited applicability in realistic tomographic reconstructions, where 3D modelling and inversion is needed to account for the full physics of wave propagation in the human body.At the same time, it is in these 3D problems where the computational cost of FWI is most apparent and where tomography codes are required to scale robustly.In order to showcase the scaling capabilities of Stride, we choose for our second experiment a numerical 3D model of the adult human head (Fig. 6-B).The model is based on the MIDA model [20], to which acoustic properties were assigned as described by Guasch et al. [16].Three slices through this numerical model can be seen in Fig. 9-A to C. The model is sampled with a spacing of 0.750 mm (minimum of 3.22 PPW), resulting in a grid of size 367×411×340 and more than 51 million unknown parameters to be estimated.A total of 1024 transducers were located around the head as seen in Fig. 6-B, with all transducers acting both as sources and receivers.Imaging was performed with a three-cycle tone burst centred at 500 kHz.Modelling was carried out over 300 µs, with time steps of 0.150 µs (maximum CFL constant of 0.60).
Stride has been designed to seamlessly scale from 2D to 3D, and moving from one to the other only requires changing three lines of the code when defining the spatial grid.The remaining code can be run without any changes.In this case, the reconstruction is performed in the frequency range between 100 kHz and 600 kHz, starting from a model that only contains the skull (Fig. 9-D to F).Each frequency band in the reconstruction is run for 8 iterations, and 128 shots are randomly selected without replacement for each of them.
Due to the higher computational requirements in 3D, we run this reconstruction in an HPC cluster environment.Except for removing the mosaic.interactive("on")command, no changes are required to the code when scaling from the local to the cluster environment.Each compute node in the cluster is equipped with 256 GB of memory and 128 cores (2xAMD Zen2 EPYC 7742, 64 cores, 2.25 Ghz).Nodes are connected using an HPE Slingshot interconnect with 200 Gb/s signalling.The Devito solver is compiled using the GNU gcc compiler version 7.5, and is executed using OpenMP thread-level parallelism across 32 threads.
Each of the nodes calculates the gradient for a single shot at a time, which once more entails one forward propagation and one adjoint propagation of the acoustic solver, before combining the gradients for all shots at each iteration.Work distribution across the different nodes is managed by the Mosaic runtime, with the time taken to allocate this work generally dominated by the serialisation, communication, and processing of the data associated with the execution of each shot.However, serialisation in Mosaic has a negligible impact due to its zero-copy implementation.Communication overheads could have an impact on performance, but these are minimised by high-speed interconnects and by the asynchronous nature of Mosaic and its underlying ZeroMQ sockets.This means that user code is not slowed down by the actual time taken to send messages across the network by allowing the overlap of computation and communication: the head process dispatches all shots almost instantaneously, and independent worker processes across the network start computing as soon as the first message arrives.Message processing, on the contrary, will have an impact on performance due to the intrinsic single-threaded nature of Python.This could be alleviated by offloading some of this processing to lower-abstraction interfaces in C. With all this in mind, each shot gradient calculation took 5.82 ± 0.36 min, including time spent in work distribution.
The high accuracy of the final reconstruction obtained using Stride can be seen in Fig. 9-G to I. Also in this case, we have calculated a corresponding quantitative error measure for the full 3D model, shown in Fig. 9 with the symbol .Errors in the reconstruction can in this case be attributed to similar reasons to the previous numerical 2D case, with the added factor of limited illumination in certain regions of the model.We can see, for example, how the regions close to the neck and around the sinuses are more poorly resolved due to the location of sources and receivers around the head.We can also see how resolution is degraded as we move closer to the upper regions of the skull due to lower ray density in these areas.
At this point, we explore the scaling capabilities of the Mosaic parallelisation layer by running a fixed number of individual shot gradient calculations, 128, while increasing the number of compute nodes used in the HPC cluster.The achieved acceleration is calculated by comparing the amount of time taken to complete all gradient calculations using a certain number of nodes with respect to the time taken using a single node.Under ideal circumstances, this means that, for example, an acceleration of 128 times is expected when using 128 compute nodes.This test is repeated five times, and the final acceleration is taken as the average over all repetitions.
Results for this strong scaling test can be seen in Fig. 10, where we can observe that Mosaic achieves nearly ideal scaling up to 128 compute nodes.For the largest number of nodes, we can see how the obtained acceleration deviates slightly from the ideal curve.This corresponds, approximately, to a 2% loss in performance, which can be attributed to the effective single-threaded nature of Python programs that we have previously discussed.

Discussion
We have shown that Stride provides an intuitive framework for the solution of ultrasound tomography problems, seamlessly switching between 2D and 3D applications, and between a local workstation and a multi-node cluster.
Implementations of ultrasound tomography methods like FWI have to address their computational and algorithmic complexity.To do this, Stride has been designed to provide tailored optimisation routines, high-performance PDE solvers, and scalability to HPC systems, while simultaneously offering a high level of abstraction to ensure flexibility, productivity, and modularity.
From the point of view of the optimisation, we have seen how Stride closely matches the mathematical formulation of the inverse problem, for which gradients can be intuitively calculated using the adjoint method.Our approach here resembles that taken by machine learning libraries like Acceleration is calculated as the amount of time taken to complete all gradient calculations using a certain number of nodes with respect to the time taken using a single node, averaged over 5 experiments.
PyTorch [31], which have been highly successful at broadening the reach of these technologies beyond computational experts.This serves the double purpose of easing adoption by users, some of which might already be familiar with some of these libraries, and facilitating integration with these machine learning tools.
We have to note that gradients for Stride problems are calculated at a high level by treating the PDE or the loss functions as differentiable primitives, but no differentiation is happening through their internal mathematical operations.This is the subject of ongoing research and will be introduced in future versions of Stride.
From the point of view of the PDE solver, Stride faces the performance-flexibility dichotomy in a similar manner to the geophysical library JUDI [42]: we provide intuitive interfaces in a high-abstraction language, while using a DSL like Devito under the hood.From a symbolic specification of the PDE, Devito automatically generates architecturespecific C code that matches the performance of handtuned implementations [26,27].This offers a high degree of flexibility, allowing the inclusion of new physical models with minimal effort and without hindering performance.It is this flexibility that allows us to run the same wave equation solver on a CPU multi-threaded environment or a GPU with effectively no code changes.
Currently, Stride problems can only be defined on rectangular grids, on which finite-difference methods can be applied using Devito.Nonetheless, Stride does not prescribe any of these, and future work will explore the inclusion of different discretisation approaches and integration with other DSLs like FEniCS/Firedrake for finite-element methods [24,32] or Dedalus for spectral methods [6].
Other open-source libraries exist for numerical modelling in ultrasound medical imaging, such as the previously mentioned kWave [38], based on pseudospectral element methods; Field II [21], which uses a linear scattering approximation; or Bempp-cl [5], which employs a boundary element method, among others.These libraries have been tailored to accurately model sound propagation in biological tissues and generally provide hand-tuned implementations that can achieve high performance.Stride is agnostic to the underlying solver employed and any of these could be readily integrated with it.However, that would diminish the flexibility that is achieved by using a DSL that can obtain comparable performance for both the physical models currently available and any new ones that could be introduced.
Stride has been designed to tackle the problem of intuitively scaling to HPC systems in a similar spirit as for the solver: high-level interfaces hide from the user the complexity of deploying the algorithms to target systems, allowing imaging scientists to focus on the reconstruction algorithms rather than the low-level details.We provide for this the custom parallelisation library Mosaic.
Traditional HPC workloads usually rely on the message passing interface (MPI) standard to express parallelism in applications.However, originally designed in the 1990s, MPI has so far no capacity for fault tolerance and its interfaces are too cumbersome and low level for most nonspecialists.Other Python libraries exist for writing parallel applications, most notably Dask [13], PyCOMPSs [35], and Ray [29].Dask expresses parallelism as a series of stateless tasks that form a computational graph, which can be executed in parallel.PyCOMPSs uses tasks similarly to express parallelism, although these do not have to be stateless.However, PyCOMPSs employs a Java-based runtime that requires the serialisation of objects to file in order to communicate with Python, incurring in a performance penalty.Contrarily, the Ray parallel framework is primarily based on the actor model.We have chosen to design Mosaic using an actor-based model because, much like objectoriented programming, we consider that it better matches the world view and the mental framework of domain specialists.It also allows us to keep objects and their allocated memory warm within a specific compute node or associated accelerator, incidentally making it more intuitive for end users to manipulate remote memory.We have chosen to implement a custom parallelisation library for Stride due to a need for fine-grained control of the computational workload allocation and memory management that existing libraries are unable to provide.
Through the examples presented, we have seen that switching from a local multi-processing environment to an HPC cluster with Mosaic is straightforward and requires no significant code changes.We have also seen through our 3D experiments that realistic Stride reconstructions could be potentially scaled across hundreds of compute nodes thanks to the zero-copy, asynchronous work allocation of the Mosaic library.However, work is still needed to fully understand and exploit the scaling capabilities of Mosaic across large on-premises and cloud computing clusters, with particular interest in minimising data transfers across the network by exploiting caching mechanisms to detect redundant communications.
Additionally, while Mosaic offers the capacity to parallelise across elements of an iteration batch, the integration with Devito offers another degree of freedom to parallelise within PDE solves through MPI-based domain decomposition.Domain decomposition, whose use in Stride is being actively explored, allows a user to distribute the computation of the PDE solution.This will be of importance when solving large problems whose size exceeds memory available in any single node or memory available in a particular accelerator such as a GPU.It will also allow for increased computational performance by splitting PDE solves in a single node across available CPU sockets, thus enforcing data locality.
There are two distinct applications for which Stride has been designed: wave propagation modelling and tomographic imaging.In terms of modelling wave phenomena, a number of other libraries are openly available to users, some of which include the already mentioned Field II [21], Bempp-cl [5], or kWave [38], among others.The choice of one library over another will be down to the aims and requirements of a specific modelling exercise.For example, Field II should be chosen when modelling accuracy can be traded off for shorter computational times, whereas the boundary element method in Bempp-cl will provide accurate modelling that remains computationally efficient when the number of tissue interfaces in the model is low.As we have shown here, Stride and kWave can achieve similar levels of modelling accuracy for complex tissue geometries.Nonetheless, finite-difference solvers in Stride will be more computationally efficient, whereas kWave will display smaller numerical dispersion for a similar discretisation grid thanks to its pseudospectral formulation.These differences will, however, become irrelevant as other numerical methods are integrated into Stride: a different method will be chosen depending on the specific application.
In terms of tomographic imaging, it is important to distinguish between full-wave methods, such as FWI, and others, such as time-of-flight tomography and diffraction tomography.Stride is, at the time of this writing, the only openly available library for full-wave tomographic imaging in the medical context.Stride, however, does not currently provide solvers for other types of ultrasound tomography and other tools should be used in these cases [1].
In terms of compatibility, Stride can be installed on Unix operating systems, and is compatible with Windows through the Windows Subsystem for Linux and through Docker containers.
Through these design decisions, Stride achieves flexibility and modularity, allowing each of its components to be modified independently or entirely substituted.At the same time, importance has been placed on ensuring that lowerlevel interfaces can be used to provide users with increasingly fine-grained control over the problem and its execution.Although we have designed Stride with ultrasound tomography in mind, the formulation of the physics-constrained optimisation problem is related to other imaging techniques, like optoacoustic tomography, and even calibration methods like spatial response identification.This makes Stride readily applicable to a number of medical ultrasound problems.

Conclusions
Advances in ultrasound-based imaging methodologies such as ultrasound computed tomography and optoacoustic tomography rely on increasingly complex mathematical and computational models.This puts a strain on researchers to both develop novel imaging algorithms and translate them into high-performance and scalable code, thus slowing scientific progress.
To bridge the gap between flexible development and reallife application, we have designed and developed Stride, an open-source Python library that is both intuitive and efficient.Stride allows algorithms to be written for a 2D model and be easily scaled up to 3D, and allows code to be tested on a local workstation and readily deployed to an HPC cluster.We achieve this by combining modular interfaces written in a high-abstraction language with automatically-generated, high-performance solvers, and with tailored parallelisation routines.
By providing high-level interfaces that intuitively match the representation of problems posed by domain specialists, and which are efficient and scalable out of the box, Stride has the potential to dramatically increase the productivity of imaging researchers.This will have a significant impact by accelerating the development of new ultrasound-based imaging technology and its translation from bench to bedside.Furthermore, other imaging applications where the efficient solution of physics-constrained optimisation problems is needed could also benefit from the general abstractions provided by Stride, such as non-destructive testing, aeronautics, or experimental fluid mechanics.
The authors would like to acknowledge the ITRUSST (International Transcranial Ultrasonic Stimulation Safety and Standards) planning group who developed the benchmark example used in Sec.3.1 and provided the kWave simulation results.

Figure 1 :
Figure 1: Schematic representation of the Stride software structure.A series of basic abstractions for solving optimisation problems are provided (1), based on which the tomographic problem is expressed (2).The tomographic problem becomes fully defined when appropriate physical modelling is introduced (3).The execution of Stride is parallelised using the custom library Mosaic (4), and tools are provided to save and load its details(5).

Listing 2 :# C o m p u t e w a v e e q u a t i o n s o l u t i o n r e t u r n u d e f a
Example of gradient calculation for a PDEconstrained optimisation problem like the one solved in FWI. from s t r i d e import O p e r a t o r , V a r i a b l e c l a s s L ( O p e r a t o r ) : d e f f o r w a r d ( s e l f , m ) : d j o i n t ( s e l f , g r a d _ u , m ) : # C a l c u l a t e d e r i v a t i v e w r t t o m # a p p l y i n g a d j o i n t o n g r a d _ u r e t u r n grad_m c l a s s J ( O p e r a t o r ) : d e f f o r w a r d ( s e l f , u , m ) : # C a l c u l a t e l o s s v a l u e r e t u r n l o s s d e f a d j o i n t ( s e l f , g r a d _ l o s s , u , m ) : # C a l c u l a t e t h e d e r i v a t i v e w r t u # C a l c u l a t e t h e d e r i v a t i v e w r t m r e t u r n g r a d _ u , grad_m # C r e a t e t h e d e s i g n p a r a m e t e r s m = V a r i a b l e ( name="m" ) m. n e e d s _ g r a d = T r u e # I n s t a n t i a t e t h e o p e r a t o r s l = L ( ) j = J ( ) # A p p l y t o c a l c u l a t e g r a d i e n t u = a w a i t l (m) l o s s = a w a i t j ( u , m) a w a i t l o s s .a d j o i n t ( ) # T h e g r a d i e n t i s n o w i n " m .g r a d " o r b l o c k i n o p t _ l o o p .b l o c k s ( n u m _ b l o c k s ) : f o r i t e r a t i o n i n \ b l o c k .i t e r a t i o n s ( n u m _ i t e r s ) : m. c l e a r _ g r a d ( ) u = a w a i t l (m) l o s s = a w a i t j ( u , m) a w a i t l o s s .a d j o i n t ( ) a w a i t o p t i m i s e r .s t e p ( )

Listing 5 :
Example spatiotemporal grid.from s t r i d e import Space , Time , G r i d s p a c e = S p a c e ( s h a p e , s p a c i n g ) t i m e = Time ( s t a r t , s t e p , num ) g r i d = G r i d ( s p a c e , t i m e )

Listing 6 :
Example stride.Medium containing the spatial distribution of longitudinal speed of sound and density.from s t r i d e import Medium , S c a l a r F i e l d medium = Medium ( g r i d = g r i d ) medium .add ( S c a l a r F i e l d ( name=" vp " , g r i d = g r i d ) ) medium .add ( S c a l a r F i e l d ( name=" r h o " , g r i d = g r i d ) ) medium .vp .f i l l ( 1 5 0 0 .) medium .r h o .f i l l ( 1 0 0 0 .) some stride.Transducer objects and then add them to a corresponding stride.Geometry.Listing 7: Example geometry with its associated transducers.from s t r i d e import P o i n t T r a n s d u c e r , \ T r a n s d u c e r s , Geometry t r a n s d u c e r s = T r a n s d u c e r s ( g r i d = g r i d ) t r a n s _ 0 = P o i n t T r a n s d u c e r ( 0 , g r i d = g r i d ) t r a n s _ 1 = P o i n t T r a n s d u c e r ( 1 , g r i d = g r i d ) t r a n s d u c e r s .add ( t r a n s _ 0 ) t r a n s d u c e r s .add ( t r a n s _ 1 ) g e o m e t r y = Geometry ( t r a n s d u c e r s = t r a n s d u c e r s , g r i d = g r i d ) g e o m e t r y .add ( 0 , t r a n s d u c e r =t r a n s _ 0 , c o o r d i n a t e s = [ . . .] ) g e o m e t r y .add ( 1 , t r a n s d u c e r =t r a n s _ 1 , c o o r d i n a t e s = [ . . .] )

Listing 8 :
Example acquisition containing only one shot.from s t r i d e import Shot , A c q u i s i t i o n s l o c _ 0 = g e o m e t r y .g e t ( 0 ) l o c _ 1 = g e o m e t r y .g e t ( 1 ) a c q u i s i t i o n s = A c q u i s i t i o n s ( g e o m e t r y=g e o m e t r y , g r i d = g r i d ) s h o t = S h o t ( 0 , s o u r c e s =[ l o c _ 0 ] , r e c e i v e r s =[ l o c _ 0 , l o c _ 1 ] , g e o m e t r y=g e o m e t r y , g r i d = g r i d ) a c q u i s i t i o n s .add ( s h o t )

Listing 9 :
Basic usage of Mosaic to create remote objects, call their methods and access their attributes.from m o s a i c import t e s s e r a @ t e s s e r a c l a s s Remote : d e f _ _ i n i t _ _ ( s e l f ) : s e l f .v a l u e = 0 d e f add ( s e l f , v a l u e ) : s e l f .v a l u e += v a l u e r e t u r n s e l f .v a l u e # C r e a t e a r e m o t e i n s t a n c e r e m o t e _ o b j = Remote .r e m o t e ( ) # C h e c k t h e c u r r e n t v a l u e o f t h e a t t r i b u t e p r i n t ( a w a i t r e m o t e _ o b j .v a l u e ) # A d d a n e w v a l u e t a s k = r e m o t e _ o b j .add ( 5 ) # T h i s w i l l r e t u r n i m m e d i a t e l y a n d # w e c a n d o o t h e r w o r k i n t h e m e a n t i m e # W h e n r e a d y , w e c a n w a i t f o r t h e # r e m o t e m e t h o d c a l l t o f i n i s h a w a i t t a s k # T h e r e t u r n v a l u e o f t h e m e t h o d c a l l # i s s t o r e d i n t h e r e m o t e w o r k e r , a n d # t o a c c e s s t h e m w e w i l l n e e d t o d o # t h i s e x p l i c i t l y p r i n t ( a w a i t t a s k .r e s u l t ( ) ) # C h e c k t h e n e w v a l u e o f t h e a t t r i b u t e p r i n t ( a w a i t r e m o t e _ o b j .v a l u e )

Listing 10 :
Expressing parallelism and dependencies in Mosaic.# C r e a t e a r e m o t e i n s t a n c e s r e m o t e _ o b j _ 0 = Remote .r e m o t e ( ) r e m o t e _ o b j _ 1 = Remote .r e m o t e ( ) # T h e s e c o u l d b e e x e c u t e d i n p a r a l l e l t a s k _ 0 = r e m o t e _ o b j _ 0 .add ( 5 ) t a s k _ 1 = r e m o t e _ o b j _ 1 .add ( 1 ) # a n d h a v e t o b e a w a i t e d s e p a r a t e l y a w a i t t a s k _ 0 a w a i t t a s k _ 1 # A n e x p l i c i t d e p e n d e n c e w i l l m a k e # t h e m e x e c u t e i n s e r i e s t a s k _ 0 = r e m o t e _ o b j _ 0 .add ( 1 ) t a s k _ 1 = r e m o t e _ o b j _ 1 .add ( t a s k _ 0 ) # a n d o n l y t h e l a t t e r n e e d s t o b e a w a i t e d p r i n t ( a w a i t t a s k _ 1 .r e s u l t ( ) ) # w i l l p r i n t 7 # D e p e n d e n c i e s c a n a l s o b e i n t r o d u c e d a s t a s k _ 0 = r e m o t e _ o b j _ 0 .add ( 1 ) t a s k _ 1 = r e m o t e _ o b j _ 1 .add ( t a s k _ 0 .done , 2 )

Figure 2 :
Figure 2: Schematic representation of the Mosaic runtime.The runtime is divided into several logical processing units, which could represent, for instance, processes in a local multiprocessing environment or different machines in a multi-node cluster.In the first processing unit, the user code is executed in the head, while the monitor tracks the status of the runtime and the warehouse acts as a central storage unit.In the remaining processing units, a node monitor is allocated to track the status of that local unit and communicate this to the global monitor, and one or more workers are also created to execute tessera calls.All endpoints in the Mosaic runtime are interconnected to each other.

Figure 3 :
Figure 3: Specification of the Stride file format.The definition of the spatiotemporal grid is the basis upon which different types of data containers and the various components of the problem are then specified.

Figure 4 :
Figure 4: Accuracy of the acoustic wave equation solver against analytical solution.The numerical solution of the acoustic wave equation calculated by Stride is compared to the analytical solution for a medium with homogeneous speed of sound.The comparison is performed in 2D (A) and 3D (B), at a distance to the emitter ranging from 0 to 300 mm.Error is calculated as the normalised root-mean-square error with respect to the analytical solution.

#
P r e p a r e s t a r t i n g m o d e l vp = S c a l a r F i e l d .p a r a m e t e r ( name=" vp " , g r i d =g r i d , n e e d s _ g r a d=T r u e ) vp .f i l l ( 1 5 0 0 .) medium .add ( vp ) # P r e p a r e o p t i m i s e r o p t i m i s e r = G r a d i e n t D e s c e n t ( vp , s t e p _ s i z e = s t e p _ s i z e ) resulting 2D model can be seen in Fig. 7-A.The model has been obtained from an open database [25], and has been adapted by populating it with acoustic tissue properties and by adding a tumour.From here onwards, all examples were run with constant density and no attenuation.The model, sampled with a spacing of 0.500 mm (minimum of 3.73 PPW), has a size of 456×485.The model is surrounded by 128 point transducers, seen as blue dots in Fig. 6-A, all of which act as sources and receivers.Imaging is performed using a three-cycle tone burst centred at 500 kHz, and is carried out over 200 µs in steps of 0.080 µs (maximum CFL constant of 0.26).Both temporal and spatial sampling are kept constant during modelling and inversion, for this and all subsequent examples.However, this is not required and users could exploit different dispersion and stability conditions by changing the discretisation across different imaging blocks.

#Figure 5 :
Figure5: Accuracy of the acoustic wave equation solver against state-of-the-art solver.The 3D numerical model (A) contains a human skull section (blue) and a bowl ultrasound transducer (red).We compare a 2D slice through the resulting steady-state wavefield for the state-of-the-art solver kWave (B) and for Stride (C).Additionally, we compare two 1D profiles through the centre of the transducer focus (D-E).

Figure 6 :
Figure 6: Setup used in the numerical experiments.For the 2D experiment (A), a slice is taken across a numerical 3D model of the breast and 128 point transducer, which can be seen as blue dots, are distributed around it.For the 3D experiment (B), a numerical head model is imaged by surrounding it with 1024 transducers (also visible as blue dots).The scale shown at the bottom of the numerical models applies equally in all spatial directions.

# C a l c u l a t e d e r i v a t i v e a w a i t f u n . a d j o i n t ( ) # W a i t f o r l o o p t o e n d a w a i t l o o p # U p d a t e v p a w a i t o
s e l e c t _ s h o t _ i d s ( num= s h o t s _ p e r _ i t e r , r a n d o m l y=T r u e ) # C l e a r t h e g r a d i e n t vp .c l e a r _ g r a d ( ) # A s y n c l o o p o v e r s e l e c t e d s h o t s @runtime .a s y n c _ f o r ( s h o t _ i d s ) a s y n c d e f l o o p ( worker , s h o t _ i d ) : # F e t c h o n e d a t a p o i n t s u b _ p r o b l e m = p r o b l e m .s u b _ p r o b l e m ( s h o t _ i d ) w a v e l e t s = s u b _ p r o b l e m .s h o t .w a v e l e t s o b s e r v e d = s u b _ p r o b l e m .s h o t .o b s e r v e d # P r e -p r o c e s s t h e w a v e l e t s w a v e l e t s = p _ w a v e l e t s ( w a v e l e t s , r u n t i m e=w o r k e r ) # E x e c u t e t h e P D E m o d e l l e d = pde ( w a v e l e t s , vp , p r o b l e m=s u b _ p r o b l e m , r u n t i m e=w o r k e r ) # P r e -p r o c e s s t r a c e s t r a c e s = p _ t r a c e s ( m o d e l l e d , o b s e r v e d , f_max=f_max , r u n t i m e=w o r k e r ) # C a l c u l a t e l o s s f u n = a w a i t l o s s ( t r a c e s .o u t p u t s [ 0 ] , t r a c e s .o u t p u t s [ 1 ] , p r o b l e m=s u b _ p r o b l e m , r u n t i m e=w o r k e r ) .r e s u l t ( ) p t i m i s e r .s t e p ( )

Listing 14 :
The inversion is performed by selecting subsequent frequency bands and, in each band, a certain number of iterations are run to calculate a gradient.o p t _ l o o p = O p t i m i s a t i o n L o o p ( ) # S t a r t o p t i m i s a t i o n f o r b l o c k , f_max i n \ o p t _ l o o p .b l o c k s ( num_blocks , f r e q s ) : # E v e r y i t e r a t i o n i n t h e b l o c k f o r i t e r a t i o n i n \ b l o c k .i t e r a t i o n s ( n u m _ i t e r s ) : a w a i t r u n _ i t e r ( f_max )

Figure 7 :
Figure 7: Stride reconstruction in 2D.A 2D acoustic breast model (A) is imaged starting from a homogeneous distribution of speed of sound (B).Stride manages to accurately reconstruct the target model (C).The mean of the absolute value of the difference between the ground-truth model and the inversion is displayed here as .

Figure 8 :
Figure 8: Experimental Stride reconstruction in 2D.A tissuemimicking phantom (A) is imaged starting from a homogeneous distribution of speed of sound (B).The Stride reconstruction (C) closely matches the target phantom, is able to recover the different layers of speed of sound and the complex interface between those layers.We can also see the signal used experimentally for imaging (D) and its corresponding magnitude spectrum (E).

Figure 9 :
Figure 9: Stride reconstruction in 3D.A 3D acoustic head model (top row) is imaged starting from a model that contains only the skull and is homogeneous otherwise (middle row).Stride manages to accurately reconstruct the target model (bottom row).The mean of the absolute value of the difference between the ground-truth model and the inversion is displayed here as .

Figure 10 :
Figure 10: Mosaic strong scaling for the 3D head model.Scaling obtained with Mosaic (red, dashed line) is compared to the ideal scaling scenario (black, continuous line).Scaling is analysed by running 128 shot gradient calculations for the 3D head model across an increasing number of compute nodes.Acceleration is calculated as the amount of time taken to complete all gradient calculations using a certain number of nodes with respect to the time taken using a single node, averaged over 5 experiments.