Characterising particulate random media from near-surface backscattering: a machine learning approach to predict particle size and concentration

To what extent can particulate random media be characterised using direct wave backscattering from a single receiver/source? Here, in a two dimensional setting, we show using a machine learning approach that both the particle radius and concentration can be accurately measured when the boundary condition on the particles is of Dirichlet type. Although the methods we introduce could be applied to any particle type. In general backscattering is challenging to interpret for a wide range of particle concentrations, because multiple scattering cannot be ignored, except in the very dilute range. Across the concentration range from 1% to 20% we find that the mean backscattered wave field is sufficient to accurately determine the concentration of particles. However, to accurately determine the particle radius, the second moment, or average intensity, of the backscattering is necessary. We are also able to determine what is the ideal frequency range to measure a broad range of particles sizes. To get rigorous results with supervised machine learning requires a large, highly precise, dataset of backscattered waves from an infinite half-space filled with particles. We are able to create this dataset by introducing a numerical approach which accurately approximates the backscattering from an infinite half-space.

Under close inspection, many materials are composed of small randomly distributed particles or inclusions. So it is no surprise that the need to measure particle properties, such as their average size and concentration, spans many physical disciplines. For quick non-invasive measurements, waves, either mechanical, electromagnetic or quantum, are the preferred choice. However, measuring a broad range of particle concentrations and sizes is still an open challenge. For high concentrations the wave undergoes multiple scattering, which requires specialised methods to compute and interpret. And further, measuring a wide range of particle sizes means a wide range of frequencies needs to be considered.
The type of wave used depends on the type of particle: acoustic waves are used to measure liquid emulsions [1], sediment on the ocean floor [2] and polycrystalline materials [3]. Microwaves are vital in remote sensing of ice [4]; optics for aerosols [5] and cellular components, both micrometer [6] and nanoscale [7] structures, among many other applications. In all these applications, there are cases when transmission experiments are impractical, because either the material is too opaque or, for example, has an unknown depth. The next natural choice is to use reflected, or backscattered, waves.
Here we ask: can one source/receiver measure the properties of a random particulate medium? And is it possible to do so without measuring the backscattering for a range of scattering angles, and without knowing the depth of the medium? Figure 1 illustrates a backscattered wave in time measured at one point in space. We consider only elastic scattering, and scattered waves that have the same frequency as the incident wave. We show that, with this simple setup, it is possible to recover a wide range of concentra-54001-p1 wave amplitude x R backscattering Fig. 1: (Colour online) The snapshot above is of a plane-wave pulse being backscattered by the grey particles, in the region x > 0, after time t = 20 (non-dimensional). The incident pulse originated at the line x = xR, then travelled towards the particles and was then backscattered. The blue line graph shows the amplitude measured at (xR, 0) over time, where around time t = 25 the backscattered waves begin to arrive. The particles occupy 10% of the volume and around 150 particles were used for these simulations.
tions and particle radiuses, even including particles with a sub-wavelength radius. We also identify which part of the backscattered signal is sensitive to the concentration and particle radius. To achieve these goals, we use learning curves from supervised machine learning, and, in doing so, we also show how to accurately predict particle radius and concentration from backscattered waves. Supervised machine learning in similar contexts has already shown great promise [8,9]. See [9] for a summary of machine learning applications in remote sensing.
The long-term goal is to develop a device, as simple as possible and with little prior information, that can determine the statistical properties of the particles for a broad range of random media. To do so will require theoretical predictions, experiments and simulations of backscattered waves. A supervised learning approach can then easily combine data from these different sources to produce an algorithm that predicts particle statistics. Here we take the first step towards this goal, by using simulated data, as it is the most accurate for a broad range of media.
The most common approach to determine particle properties from backscattering, to date, is to adjust the parameters of a mathematical model until it fits the measured backscattering [2,10]. Ideally, these two approaches could be combined to produce an accurate method valid for a large range of parameters.
Simulating near-surface backscattering. -We consider a simple setting with Dirichlet boundary conditions, that is, where the scalar wave-field w = 0 on the boundary of the particles, which for acoustics corresponds to zero pressure, for elasticity corresponds to zero displacement and for electromagnetism corresponds to zero electric or magnetic susceptibility, depending on the polarisation. This case is particularly challenging for many of the current theoretical approaches, as they can lead to unphysical results, even for low frequency and low concentration, as we demonstrate below. We restricted ourselves to two dimensions to lighten the computational load, which is qualitatively similar to three dimensions [11]. In the conclusion we discuss extensions to three dimensions.
Consider an incident plane wave e ik(x−xR−t) , where k is the wave number of the background medium, and we have non-dimensionalised by taking the phase velocity of the background to be 1. We non-dimensionalise because the theory applies to many different applications. The total wave w = e ik(x−xR−t) + u satisfies the two-dimensional scalar wave equation, where u is the backscattered wave from the particles within the half-space x > 0 and (x R , 0) is the receiver position, such as shown in fig. 1. If the receiver is close to the particles, then near-field effects will dominate and many realisations will be needed to calculate the statistical moments. To avoid this, we choose x R = −10. We use a for particle radius, n for number of particles per unit area (concentration) and φ = a 2 πn for volume fraction, and consider a wide range of media: for instance, these values are typically used in emulsions, suspensions, and for atmospheric aerosols.
For random media it is convenient to use the moments of u. That is, if Λ represents one configuration of particles, then u = u(Λ) depends on Λ and its ensemble average is u = u(Λ)p(Λ)dΛ, where p(Λ) is the probability of the particles being in the configuration Λ, then the central moments are We will now associate each medium with a fixed particle radius, concentration and set of moments u and (2). There are many specialised methods to determine these moments [12][13][14][15][16]. Those that accurately calculate u j for a broad frequency range require u , and a common approximation of u is to assume that u ≈ e ik * x inside the random media, for some effective wave number k * . For small volume fraction φ and direct backscattering [17,18] this approximation leads to u ≈ −iφ(πa 2 k 2 ) −1 n J n (ka)/H n (ka)e i(nπ−kx) , where J n and H n are a Bessel and Hankel function of the first kind. However, this approximation diverges when a → 0, while φ is fixed, and leads to the unphysical result | u | > 1. Even rigorous methods [12], deduced for moderate volume fraction, present the same problem. This problem is  2 from particles of radius a = 0.2 occupying φ = 20% of the volume. Left: backscattering from five different configurations. Right: the moments of 756 configurations. The height of the black line is u the mean response, while the total thickness of the green and red regions are the second u 2 (standard deviation) and fourth u 4 (kurtosis) moment. a result of strong scatterers, with w = 0 on their boundaries, completely reflecting waves at low frequencies for any particle volume fraction. This can be seen by investigating the effective properties [19,20]. The consequence is that series expansions of u for small volume fractions do not converge for scatterers with w = 0 on their boundaries. For other strong scatterers with w ≈ 0 on their boundaries, this series converges very slowly.
It may be possible to accurately describe backscattering from strong scatterers with integral methods that are valid for any volume fraction [21,22]. Though, we note, that methods derived from Lippmann-Schwinger-type equations are not formally valid for scatterers with discontinuous material properties [23], such as strong acoustic scatterers.
To accurately determine all the moments over the range (1), we use a numerical approach based on the multipole method [24] to calculate u(Λ) for each configuration Λ, from which we determine the u n with a Monte Carlo method 1 . In all our convergence tests, truncation errors and benchmarks were within 1% accuracy for each simulation. In the supplementary material Supplementarymaterial.pdf (SM) we explain how to reproduce our results, including high-performance software to simulate the backscattering [26] and implement the machine learning. The data used in this paper is also publicly available [27].
Approximating the backscattering from an infinite halfspace, with a limited computational domain, is challenging. To overcome this challenge we calculate the backscattering of the incident time pulse e −0.1(x−xR−t) 2 , which, for wave numbers 0 ≤ k ≤ 1, results in less than 1% Gibbs phenomena, and receive the backscattering at (x R , 0). By only receiving the signal for t ≤ 98, we can exclude from the simulation all particles that would take more than t = 100 for their first scattered wave to arrive at the receiver (x R , 0). That is, we need only simulate particles that are near the surface, which is why we call this near-surface backscattering. See fig. 2 for the incident time 1 The alternative would be to piece together different theoretical methods, whose range of validity is not clear [25]. pulse and for u , u 1 and u 4 , where we include u 4 as it is known to be sensitive the micro-structure [16,28].
In total we simulated the moments of 205 different media, evenly sampled from (1), which required 83000 backscattering simulations -each corresponding to one configuration Λ. For the larger simulations up to 7600 particles were used. To estimate the quality of the calculated moments, we used the standard error of the mean. See fig. 3 for an overview of the simulated moments.
Learning from backscattering. -With a high-quality data set of backscattered waves, we can now use supervised machine learning to generate a model that best fits the radius and a separate model that best fits the concentration. To test these models, we use them to predict the concentration and the radius of yet unseen media using only backscattered waves as input. Our supervised machine learning method of choice is kernel ridge regression [29,30], because when using continuous kernels it can fit any continuous function [31]. This allows us to establish whether the radius or the concentration are continuous functions of u or u 2 or both. In other words, we can determine which moments are needed to predict the radius and concentration. We present the results for concentration, instead of volume fraction, because it can be accurately predicted from just u .
Our training set consists of the simulated backscattered moments of 205 different media. Using this training set we train a model, that is to say, we use kernel ridge regression applied to the training set to generate a model. The hyperparameters of the ridge regression were selected using a 7-fold cross-validation. To determine the predictive power of our model, we generate a test set with 81 randomly chosen media with radius 0.2 ≤ a ≤ 2.0 and volume fraction 1% ≤ φ ≤ 21%. Every medium of the test set is distinct Fig. 4: (Colour online) This figure shows that to accurately predict concentration requires only u , but to accurately predict the particle radius requires also second moment u 2. The top two models were trained using only the mean u , while the bottom two were trained using the mean u and second moment u 2. The best prediction for the concentration gives R 2 = 0.98, which results from using low wave numbers, discussed later.
from the training set. To measure the goodness of fit, we use the R 2 coefficient with respect to the mean of the test set. If R 2 = 0, then the model has the same predictive power as the mean of the test set, while R 2 = 1 shows that the model has perfect prediction. Finally we tested two continuous kernels, the Gaussian (or radial basis) and the Ornstein-Uhlenbeck kernel. Both kernels gave similar scores through cross-validation, though the Ornstein-Uhlenbeck kernel had a slightly better R 2 coefficient on the test set, so we only report these results.
Results. -We train two models using only u , one to predict the concentration and one to predict the radius, see the top graphs of fig. 4. The top left and top right graphs show the scatter plot of the concentration and radius of the test set against the predicted concentration and radius, respectively. The prediction for the concentration is almost perfect, with R 2 = 0.96. On the other hand, the prediction for the radius is almost meaningless with R 2 = 0.53. The failure of the first moment u alone to predict the radius is significant, as it indicates that the radius is not a continuous function of u .
To accurately predict the radius, the second moment was necessary. Indeed, training a model on the first and second moment resulted in an accurate prediction of the radius with R 2 = 0.93, see the bottom right panel of fig. 4.
To show that our results, such as the top right panel of fig. 4, are not due to insufficient data, and likely extend beyond our data set, we examine the learning curves. A learning curve shows the R 2 coefficient as the quality of the data is increased. For example, if it was possible to predict the radius from only u , then the model's R 2 coefficient would increase when improving the training data's quality. Contrary to this, if the R 2 coefficient does  not increase, or if there is no clear trend, then the model cannot predict the radius, no matter the quality of the training data.
We vary the quality of the training data by changing: the number of media, the number of simulations for each medium, and by limiting the maximum wave number k of the incident wave. For every change in the training data we re-train the model of the radius and the model of the concentration. The resulting learning curves are shown in figs. 5 and 6. The graphs on the left of all these figures are the result of using a model trained only on u , and from them we see that the R 2 of the radius model does not tend to 1 when increasing the training data quality. The simplest explanation for this is that u does not by itself carry information about the radius. On the other hand, the graphs on the right of figs. 5 and 6 are models trained on u and u 2 , and clearly their R 2 for the radius 54001-p4 converges to R 2 = 1. In contrast, the concentration is accurately predicted from u even when using either 30% of the number of training media, having large standard errors of the mean, or using only wave numbers k ≤ 0.1. In fact limiting 0 ≤ k ≤ 0.1 leads to an R 2 = 0.98 for the concentration.
Finally, from fig. 5, we see that the learning curve saturates around a maximum wave number of 0.8. This indicates that 0 < k < 0.8 is the ideal range to measure particles in the range 0 < a < 2.
Conclusions. -Our results indicate that the first direct backscattered moment u does not carry information about a broad range of particle radiuses, for strong scatterers. However, the second moment u 2 does carry this information. On the other hand, the particle concentration can be accurately predicted from just u . We also demonstrated that only incident wave numbers 0 < k < 0.8 are needed to accurately measure particles with radius 0 < a < 2. This implies that neither theory, simulation or experiments need go beyond ka = 1.6, at least for strong scatterers. This also means that we are able to accurately recover radiuses that are 20 times smaller than the smallest incident wavelength.
In this study we did not consider limitations in spatial and temporal resolution, which, of course, are important in practice. However, before specialising to one particular scenario, i.e., typical acoustics and light scattering experiments, we need to know what is possible to measure or not in an ideal setting. Studies like these are therefore a vital first step. Another important step is to quantify how uncertainties in the measurements affect the prediction of the particle properties. This can be achieved by using Guassian process regression [32], which is, in a sense, a Bayesian version of kernel ridge regression.
Ultimately, our machine learning model could be embedded into a device to predict particulate properties. Though our model is initially trained on simulated data, our training procedure is simple enough that the model can be updated using real data. This step of adapting models trained on simulated data to real applications has been applied to challenging problems such as robotic grasping [33], facial recognition [34], 3D pose inference [35] and optical flow estimation [36] to name a few. These applications have advanced in strides by using simulated data, and we see a similar potential for characterising random media, such as this work.
Both the simulation (near-surface backscattering) and machine learning approach we have presented could be applied to characterise any type of particulate material from wave backscattering. To extend our approach, to 3D and other types of particles, computational efficiency is important. Simulating the backscattered moments would be faster if the multilevel Monte Carlo methods [37] and fast multipole methods [38] were used. For instance, it may be possible to measure the physical properties of the particles, as well as the size and concentration. Another avenue to create more backscattering data is to piece together different theoretical models, which could then be validated with the numeric approach we introduced: near-surface backscattering in time. * * * ALG, WJP and IDA are grateful for the funding provided by EPSRC (EP/M026205/1, EP/L018039/1). RMG is grateful for funding provided by the FSMP at the INRIA -SIERRA project-team. JD would like to acknowledge the receipt of an EPSRC CASE studentship from the School of Mathematics and Thales UK.