Feasibility study of range verification based on proton-induced acoustic signals and recurrent neural network

Range verification in proton therapy is a critical quality assurance task. We studied the feasibility of online range verification based on proton-induced acoustic signals, using a bidirectional long-short-term-memory recurrent neural network and various signal processing techniques. Dose distribution of 1D pencil proton beams inside a CT image-based phantom was analytically calculated. The propagation of acoustic signal inside the phantom was modeled using the k-Wave toolbox. For signal processing, five methods were investigated: down-sampling (DS), DS + HT (Hilbert transform), Wavelet decomposition (Wavedec db1, db4 and db20). The performances were quantitatively evaluated in terms of mean absolute error, mean relative error (MRE) and the Bragg peak localization error ( ΔBP). In addition, the study analyzed the impact of noise levels, the number of sensors, as well as the location of sensors. For the noiseless case (32 sensors), the Wavedec db1 method demonstrates the best performance: ΔBP is less than one pixel and the dose accuracy over the region adjacent to the Bragg peak (MRE50) is ∼3.04%. With the presence of noise, the Wavedec db1 method demonstrates the best noise immunity, achieving ΔBP less than 1 mm and an MRE50 of ∼12%. The proposed machine learning framework may become a useful tool allowing for online range verification in proton therapy.


Introduction
One unique advantage of proton therapy is its capability to deliver maximal dose to the location of a tumor (also known as Bragg peak), while minimizing dose exposure to normal tissues (Wilson 1946). However, range uncertainties could adversely affect treatment outcome, resulting in either under-dosing or over-dosing problems (Knopf and Lomax 2013a). There are several approaches for range verification. Firstly, implantable markers with wireless read-out, have been used for verifications in photon, electron and proton treatments (Lu 2008, Lu et al 2010, Knopf and Lomax 2013a, Cho et al 2013, La Rosa et al 2014, 2016, Carlier et al 2020. One major limitation of this approach is a very limited spatial coverage and being invasive. The second approach is based on proton-induced nuclear signal: prompt gamma (PG) or positron emitters, both of which face challenges in high cost of system development and poor counting statistics (Min et al 2012, Gueth et al 2013, Nishio et al 2006, Helmbrecht et al 2012a, Liu et al 2019, Hu et al 2020. In particular, long acquisition time is required because of the half-life time of a positron emitter. Furthermore, magnetic resonance imaging (MRI) has also been utilized to evaluate radiation-induced changes in human tissues, but it only allows for a retrospective check (Gensheimer et al 2010, Yuan et al 2013, Uh et al 2018.
In addition to the aforementioned approaches, utilizing proton-induced acoustic signal is another promising approach for online verification. Since the initial proposal (Askaryan 1957), many research groups have extended the concept on a number of aspects including both numerical simulations (Jones et al 2014, Ahmad et al 2015, Kipergil et al 2017, Patch et al 2018, Forghani et al 2019 and experimental measurements (Sulak et al 1979, Albul et al 2004, De Bonis 2009, Assmann et al 2015, Jones et al 2015, Lehrack et al 2017, Haffa et al 2018. For instance, one study detected acoustic signals induced by proton beams in the fluid media and developed a thermal expansion model for this phenomenon (Sulak et al 1979). Other studies reported the detection of proton-induced acoustic signal in both animal tissues and humans (Hayakawa et al 1989, 1995, Tada et al 1991. Several recent works further examined the time-of-flight (TOF) method in a uniform water medium (Jones et al 2016a, Nie et al 2018, Hickling et al 2018. However, all of them cannot be readily applied to heterogeneous tissues. Recently, our group reported a feasibility study of applying the time reversal-based reconstruction method in both 2D and 3D heterogeneous tissues for dose verification . Compared to the TOF method, the time reversal method intended to address two limitations. First, the extraction of arrival time in the TOF method is complicated due to complex pulse shapes and the interference between positive and negative pressure waves. For instance, a characteristic bi-polar waveform may get distorted if there are lots of air/soft tissue or bone/soft tissue interfaces along the propagation path. This will consequently comprise the accuracy of the information of arrival time. Second, the time reversal method is able to provide information with respect to both range and dose profile, while the TOF method is only able to locate the position of a Bragg peak. Nevertheless, while the computation task of the TOF method is pretty straightforward and fast, the time reversal method takes up to several minutes for a 3D phantom even after GPU acceleration . Such a time requirement makes it less likely to be utilized for real-time dose verification. Therefore, we propose a new framework aiming to combine the strengths of both the TOF method and the time reversal method, leveraging the power of machine learning. The framework is expected to be not only work for heterogeneous tissues, but also speeds up the verification process. Instead of traditional image reconstruction techniques, the prediction of dose distribution is to be made using a bidirectional long-short-term-memory recurrent neural network (Bi-LSTM RNN). In our view, such a machine-learning based framework is more efficient and easier to use compared to other models described above. To find an effective way of feature extraction and signal compression, three signal processing techniques tested are: direct down-sampling (DS), Hilbert transform (HT) and Wavelet decomposition (Wavedec, with different basis functions). The paper consists of three steps. The first step is to simulate the physical process of dose deposition and wave propagation in a 3D phantom. The second step is to develop/evaluate the Bi-LSTM RNN model, and compare the performances of different signal processing techniques. The third step is to examine the framework with respect to the signal-to-noise ratio (SNR) of acoustic signals, the number and spatial distribution of acoustic sensors.

Proton-induced acoustic signal
We simulated the propagation of acoustic signal using the k-Wave toolbox, as detailed in our previous work . To simplify the generation of dose distribution, 1D dose profile within a CT image-based phantom was analytically obtained by calculating the stopping power (SP) using the continuous-slowing-down-approximation (CSDA) model. More details can be found in the supplementary material (available online at stacks.iop.org/PMB/65/215017/mmedia). Good consistence between the results from the analytical calculation and from Monte-Carlo was confirmed. Next, the dose deposition was transformed into the pressure signal through an adiabatic process (Xu andWang 2006, Yu et al 2019): where ρ δ (r, 0) [unit: Pa] is the pressure distribution at a given position r and time t = 0. ρ [unit: kg m −3 ] is the mass density, Γ = α V c 2 /C p is the dimensionless Grüneisen coefficient dependent on the volumetric thermal expansion coefficient, α V [unit: K −1 ], the speed of sound, c [unit: m s − ], and the isobaric specific heat capacity, C p [unit: J/(K kg)]. D δ (r, 0) [unit: J kg − ] is the normalized original dose distribution (induced by a delta-function proton pulse) at a given position r and time t = 0. The information of heat capacity and sound velocity of medium were derived from the HU values in CT image, as described in our previous work.  To simulate the clinical setting, the total dose was set to be 2 Gy (i.e. 1 Gray = 1 J kg −1 ). Depending on the beam energy used, nearly 80% of the total dose concentrated over the Bragg peak and resulted an initial pressure of ∼170 Pa according to the formula (1). As a starting point for signal detection, 32 ideal point sensors (no directionality modeled) were attached to the skin as shown in figure 1(b). The parameters of the simulation were summarized in table 1 and a number of signal waveforms are provided in figure 1(d). Each raw signal waveform has a total of 20 000 sample points over a sampling duration of 200 µs (sampling rate = 100 MHz).  Detailed information and parameters of the original data (including CT image, proton beam) and k-wave setting (which is also the characteristic of the original acoustic signal).

Data preprocessing
Prior to being used as inputs of the Bi-LSTM RNN model, the raw signal waveforms were preprocessed to extract features, as well as reducing the burden of data handling. For instance, a sequential signal of 20 000 steps would not only require a huge number of hyper-parameters of the model to be trained, but also make it challenging for the model to capture long-range correlation. In other words, it is necessary to reduce the size of data while extracting important features at the same time. To accomplish that, three scenarios were investigated and elaborated below: down-sampling (DS), DS + HT, Wavedec (db1, db4 and db20). The first scenario is direct DS. We took only the first 15 000 data points of the raw signal waveform, since the amplitude of those points afterwards was negligible. Fast Fourier transform (FFT) was then utilized to analyze the frequency spectra of the signals, and a cut-off frequency of around 1 MHz was observed as shown in the figure 2. According to the Nyquist sampling theorem, the sampling rate can thus be reduced down to 2 MHz (twice the cut-off bandwidth) without causing aliasing. The DS scenario, hereafter referred to as down-sampling acoustic signal (DSAS), compressed the number of sample points in each waveform from 15 000 down to 300.
HT is a powerful signal processing tool in many areas (  amplitude envelope through the construction of an analytic signal. The amplitude envelope information of the DSAS after HT (hereafter referred as DS + HT) was expected to maintain important features to be captured for machine learning. In addition, all data points being positive in an envelope might also be beneficial for prediction accuracy and noise immunity (figure 2).
Wavelet decomposition (Wavedec) is another useful tool for analyzing time-series signals (Arivazhagan and Ganesan 2004, Mallat 2009, Chang et al 2010, Martis et al 2012. When compared to Short-Time Fourier Transform, wavelet transform (WT) is able to provide high resolution in both the frequency-and the time-domain. More precisely speaking, our purpose is to decompose the raw signals using a set of wavelet basis functions to implement feature extraction and data compression. In our study, the discrete WT was implemented as a filter-bank through a cascade of high-pass and low-pass filters (i.e. Quadrature Mirror Filters), a technique widely deployed in multi-resolution analysis and sub-band coding. During each step, a signal is decomposed into low-frequency information (represented by approximation coefficients) and high-frequency information (represented by detail coefficients), as elucidated in figure 3.
To be specific, we decomposed the raw signal waveforms and used the output at the 6th level as the input for machine learning. dbN represents the Daubechies wavelets, a family of orthogonal wavelets characterized by different number of vanishing moments N (1, 4, 20 in our study). cA6 represented the 6th level approximation coefficients, the inputs of the BiLSTM RNN model. The frequency components in each sub-band are shown in figure 3(a). The cA6 values (6th level approximation coefficients) correspond to frequencies from zero up to 0.78125 MHz and the cD6 values (6th level approximation coefficients) correspond to frequencies from 0.78125 MHz to 1.5625 MHz. Figure 3 also demonstrate the outputs of Wavedec db1 without and with the presence of noise (SNR = 5), for the signal detected by a selected sensor when irradiated by a proton beam (100 MeV). The results clearly indicate that the cA6 values indeed embody multiple important features of the raw signal waveform while helping remove high-frequency noise components. In addition to db1, two different Daubechies wavelets (db4 and db20) were tested in the same manner.

RNN model and machine learning
For machine learning, a bidirectional long short-term memory (BiLSTM) was selected in order to identity the correlation between signal waveforms and associated dose profiles (see details in section 2.1). Compared  (Wavedec,db1) in the form of sub-band coding (i.e. filter bank). (b) Results of raw signal and cD1 (1st level detail coefficient), cD2, cD3, cD4, cD5, cD6 and cA6 (6th level approximation coefficient), for both SNR = 5 and noiseless cases. Note that the horizontal axis is commonly represented in order instead of time, and the number of data points is reduced a factor of 2 after every level of decomposition.
to convolutional neural networks (CNN), RNN specializes in handling sequential data and requires less trainable parameters (Hochreiter and Schmidhuber 1997). It accepts one sequence as input and produce one sequence as output. At each time step, an RNN combines an external input with the internal hidden states (and gates) to produce an output. Said differently, it does not rely on the input alone at the current step but use recurrent connections to keep track of the context hidden in a sequence. As a result, we believe it is a good candidate for mapping an acoustic signal waveform to a dose profile. Since a basic RNN often fails to work for long input/output sequence due to the problem of vanishing gradient (Bengio et al 1994), the long short-term memory architecture (LSTM) was adopted to facilitate gradient propagation. The intention of introducing the bi-directional scheme is to help reinforce the correlation within an input sequence by propagating signals both forward and backward.
In this study, a single CT slice in the abdomen was selected and the training dataset was generated by adjusting beam energies between 80 and 160 MeV (table 1). Due to the long simulation time of 3D wave propagation using the k-Wave toolbox, only a single beam path (1D) was included in this study for all beam energies. Different beam paths will be the included in the next step of our research. The complete dataset (161 waveform-dose pairs) was split for training (140) and testing (21). The Bi-LSTM RNN model contained three layers (the first layer: 100 hidden units, the second layer: 20 hidden units, and the third layer: 1 unit). The loss function was defined to be a weighted average of L 2 loss and L 1 loss. The ADAM algorithm (learning rate = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1 × 10 −8 ) was used for optimization (Kingma and Ba 2014). As shown in figure 4, X i is the input after normalization and preprocessing and y i is the output of 1D dose distribution to be predicted. For multiple sensors, their signals were concatenated as the input per time step.

Performance evaluation
For quantitative evaluation, MAE (mean absolute error) and MRE (mean relative error) defined below were used for dose verification: where n is the total number of the testing datasets, m is the length of the dose profile in a given dataset,ŷ i is the predicted dose at a given depth, y i is the actual dose at a given depth and y max is the maximum of the dose profiles at a given depth. For a more comprehensive evaluation, MAE and MRE were also analyzed for different regions. For instance, MAE20 represents the MAE over the region from the depth corresponding to the 20% of the peak amplitude up to the peak position. MAE50 represents the MAE over the region from the depth corresponding to the 50% of the peak amplitude up to the peak position. In addition, the range uncertainty was also assessed, defined as the absolute difference between the peak location in a predicted dose profile and that in a raw dose profile.

SNR dependency
A well-known challenge is relatively small amplitude of proton-induced signals, especially after the propagation of acoustic waves through tissues in three dimensions. On top of that, noise may result from both an acoustic sensor itself and readout electronics such as low-noise amplifiers. Therefore, the study of SNR dependency is an important task. To simply the modeling, random noise was added to the down-sampled signal of each sensor with different SNR levels: where I is the intensity of a single time point in the acoustic waveform, I p is the maximum amplitude of a given acoustic signal waveform and randn is a normal distribution function to generate Gaussian random noises, a reasonable assumption for acoustic sensors in our belief. To select a reasonable SNR, the results from published literatures were selected as the reference. In our work, the pressure at the Bragg peak was around 170 Pa for a dose of 2 Gy and the average pressure detected by a single sensor was about 0.5 Pa. From the experimental studies (Hayakawa et al 1988, 1995, Jones et al 2016a, the noise level was measured to be around 20 mPa (dependent on hydrophone design and readout electronics), which was indeed consistent with the preliminary hydrophone experiments in our lab. In a number of simulation studies (

Number and distribution of sensors
Two aspects related to practical application were investigated. First, based upon the numbering scheme in section 2.1 ( figure 1(b)), we chose different numbers of sensors (32,24,16,8) to repeat the study, which involved the re-training of the BiLSTM RNN model. For example, a case of 16 sensors implies the use of only sensors numbered from 0 to 15. This task was conducted with two models (DS and Wavedec db1). One accompanying issue is how to optimize the distribution of sensors, besides the total number. By fixing the total number of sensors at 8, we studied four different position schemes. For example, the scheme A includes sensors [0,2,4,6,8,10,12,14] (the upper left region in figure 1(b)). The scheme B includes sensors [1,3,5,7,9,11,13,15] (the lower left region in figure 1(b)). The scheme C includes sensors [1,2,5,6,9,10,13,14]. The scheme D includes sensors [0,7,8,15,16,23,24,31]. The last two schemes have a larger spacing between two adjacent sensors and thus a larger spatial coverage.

Machine learning and signal processing techniques
Using the Bi-LSTM RNN model, the results of five cases are presented below: DS, DS + HT, Wavedec db1, Wavedec db4 and Wavedec db20. Figure 5 shows the cA6 results of Wavedec db1 and the predicted dose profiles for different energies. Three important patterns are observed. First, the cA6 results indeed carry some meaningful information (or features) of the dose profiles. Second, the predicted dose profiles align with the raw dose profiles with good accuracy (MRE: 0.19-0.25%), especially over the region of the Bragg peak. Third, the dose profiles are quite smooth, in spite of the cA6 results exhibiting a large degree of fluctuation. Figure 6 illustrates the training and cross-validation loss of three cases (DS, DS + HT, Wavedec db1), all of which are able to converge smoothly after 20 000 epochs. No overfitting is observed and the same trend is observed for both Wavedec db4 and db20. Figure 7 shows the prediction performance of all samples in terms of MRE and range uncertainty (Wavedec db4 and db20 not shown). Take the Wavedec db1 case for an example, the MRE and range uncertainty are less than 1% and 0.6 mm, respectively. Table 2 summarizes the quantitative results of all five cases without the presence of noise. Several patterns are observed. First, DS + HT shows the worst performance in all metrics, suggesting that the HT may have discarded some useful features by keeping only the information of amplitude envelope. Second, DS and

SNR dependency
The impact of SNR on performance is shown in figures 8 and 9. Figure 8 shows an example of the noisy acoustic signals of three selected sensors, Wavedec db1 cA6 signals and prediction results, under three SNRs (20, 10 and 5). Compared with the noisy acoustic signals, the Wavedec db1 cA6 signals exhibit the same overall shapes, while being considerably smoother. The predicted dose profiles maintain good overlapping with the raw dose profiles. As the SNR decreases down to 5, the discrepancy between the two profiles become more noticeable over the peak region. Figure 9 presents the results of MRE, MRE20, MRE50 and range uncertainty, as a function of SNR. As the SNR decreases, all four merits deteriorate accordingly with varying degrees. The Wavedec db1 case outperforms others with a range uncertainty below 2 mm and MRE below 2%, even at a low SNR of 5. Range uncertainty should be of higher priority relative to other merits, because the MRE will increase drastically even given a very small offset in the range or peak position. Figure 9(a) also suggests that the DS + HT case demonstrates better noise immunity relative to the DS case. However, the DS + HT model may only be considered as an alternative at a SNR exceeding 15, where the range uncertainty is comparable to that of the Wavedec db1 case. Figure 9(b) shows that for three Wavedec cases, no significant difference exists when the SNR is above 15, all obtaining an MRE of ∼1% and an MRE20 of ∼2%.
The other finding in figure 9 is that the MRE50 is consistently larger than the MRE20, likely due to the following two reasons. The first one is due to the intrinsic shape of a dose profile, a relatively flat region on the left side followed by a region of steep increase towards the peak on the right side. A rapid change of amplitude thus makes it difficult for the machine learning model to capture the underlying correlation. The second reason is that the number of data points selected for the calculation of MRE50 is much less than that of MRE 20 and MRE (formula 3), thus causing the MRE50 to be more sensitive to the presence of noise. Figure 10 illustrates the predicted dose profiles for DS and Wavedec db1 with different number of sensors and table 3 summarizes the quantitative results. Two interesting patterns are observed. First, the DS case fails to work with 8 sensors ( figure 10(a), a large shift to the left) but Wavedec db1 still succeeds ( figure 10(b)), corresponding to a range uncertainty of 43.93 mm and 0.06 mm (the average result for all samples), respectively. Second, for the DS case, a larger number of sensors apparently helps improve the accuracy of prediction, while such trend is not found for the Wavedec db1 case. Table 4 summaries the quantitative results of four position schemes with 8 sensors. All four schemes show good accuracy in range prediction with range uncertainty less than 0.1 mm (the average result for all samples). The position scheme A demonstrates slightly superior performance over others, yielding an MRE50 of 2.06%. One possible explanation is that for the scheme A, the region across which the acoustic waves transverse exhibits a relatively uniform tissue composition (the upper left region in figure 1(b)), compared to the other three schemes.

RNN modeling
As previously mentioned, a time-of-flight approach can only verify the range in a uniform medium. It is much more demanding to identify the correlation between acoustic signals and dose delivery in  heterogeneous tissues, since the propagation of acoustic wave is a complicated process and depends on many factors, such as acoustic properties of tissues (e.g. sound speed), beam path and position of sensors. These  Table 4. Performance comparison of different distributions of 8 sensors (see figure 1 for the numbering scheme). Scheme A: sensors [#0, 2, 4, 6, 8, 10, 12, 14]. Scheme B: sensors [#1, 3, 5, 7, 9, 11, 13, 15]. Scheme C: sensors [#1, 2, 5, 6, 9, 10, 13, 14]. factors interleave with each other and further confound the task. To address this challenge, the RNN-based machine learning framework was selected due to its unique strength in extracting sequential correlation, as well as requiring a small number of hyper-parameters (Liu et al 2019). As illustrated in figure 5, every single dose profile actually bears a close correlation with a unique map comprising multi-channel acoustic waveforms. In essence, the BiLSTM RNN model analyzes the characteristics of those time-series signals in order to extract meaningful features. For example, one feature of particular importance might be the relative time offset of amplitude peaks among individual channels. Compared to the time-reversal approach in our previous work , the BiLSTM model is more efficient requiring much shorter computational time and less sensors. We acknowledge that to treat multi-channel signals all at once using a deep CNN might also be a viable solution, but two potential concerns make it less appealing. First, a CNN would require a much larger number of hyper-parameters to be trained as well as more data samples in the training dataset. Second, a CNN may not provide the same strength in identifying temporal correlation as an RNN model. To study whether the BiLSTM RNN model is able to work for a more realistic beam profile in three dimensions is being carried out in our group. We believe the answer is positive, albeit a slightly inferior accuracy is to be expected. For a proton beam of a finite cross-sectional area, the essential correlation among multiple sensors should remain the same as that identified for an ideal pencil beam, from the aspect of physics. Note that in our study, although the dose profile was only 1D, the propagation of sound wave was indeed modeled in three dimensions using the k-Wave toolbox. How the wave equations are solved in 2D and 3D inside the k-Wave toolbox was discussed in our previous publication . Therefore, the remaining challenge of an extension to 3D stays largely within the implementation of a 3D RNN.
Besides the small size of the training dataset in our study, two other limitations are mentioned below. First, the BiLSTM RNN model provides neither the absolute coordinate of a Bragg peak nor the angle of a beam path. Such a limitation is linked with our initial assumption that the entry position and beam angle are two parameters we can accurately derive from a treatment/delivery plan. If desired, these two parameters can be learnt by including them as two additional outputs of the BiLSTM RNN model. Second, our current solution is devised to perform verification for a specific beam delivery. Whether a more generalized model through the incorporation of anatomical information of tissues can be developed to alleviate the need of re-training for different deliveries, awaits further investigation. This can somehow borrow the concept in our previous endeavor with proton-induced nuclear signals (Liu et al 2019, Hu et al 2020, but will be a more demanding task since the wave propagation and detection rely heavily on anatomical structures.

Signal processing techniques
To facilitate machine learning, three preprocessing methods were attempted out of the following considerations: (1) data compression to reduce the length of input signals, (2) maintaining important features associated with the correlation between dose profiles and acoustic waveforms, and (3) filtering out noise to obtain a higher SNR. To be specific, the DS method down-sampled the original data according to the Nyquist theorem, reducing the number of time steps from 15 000 to 300. The HT method detected the amplitude envelope of the down-sampled acoustic signals. The Wavedec method focused on the decomposition of raw signals into different frequency bands.
The implication of the DS results is twofold. First, due to the physical process of acoustic wave generation within the tissue, in particular the time constant of stress confinement, the raw waveforms exhibit a maximum frequency no higher than 1 MHz ( figure 2(b)). This opens room for DS without aliasing and information loss. Second, it reduces the challenge on hardware development, such as the sampling requirement of an analog-to-digital converter (ADC). However, DS has a limited role in noise suppression.
The HT method is found to be more immune to noise when compared to the DS method (figure 9), although the DS method outperforms in the noiseless case (table 2). Such a comparison may be interpreted from the following two perspectives. On one hand, HT constructs an analytical signal from which the amplitude envelope or phase information is to be extracted. In our study, only the amplitude envelope was used as input. This may inevitably result in information loss and deteriorates the accuracy of prediction, relative to the DS method. It is worth mentioning that the phase information was also examined (not shown in the paper), and its performance is found to be comparable, if not inferior to the results achieved when using the amplitude envelope as input. We speculate this can be attributed to the bursting nature of proton induced acoustic signals. A region comprising the maximum amplitude and major bipolar pulses typically has a very short duration of ∼20-40 µs. On the other hand, by keeping only amplitude envelopes as features, the HT method makes itself more immune to noise and accurately pinpoints the peak positions in the waveforms, which in our belief bear the closest relationship with dose profiles.
For the wavelet decomposition method, the raw signals were decomposed through different stages and the number of data points was reduced by a factor of two at each subsequent stage. As a result, the raw signals can be compressed to the same extent as in the DS method. In our study, the maximum decomposition level was selected to be 6, beyond which the prediction became less accurate. Wavedec db1 demonstrates the best performance and noise immunity in our study (table 2). This proves the effectiveness of wavelet decomposition with respect to both feature extraction and noise immunity. To be specific, while utilizing the HT method discards some information as mentioned above, the wavelet decomposition effectively extracts features through approximation coefficients (cA6 in figure 5), yielding similar if not better performances to that of the DS method. In addition, the results in figure 9(b) and table 2 indicate that the wavelet basis functions (db1, db4, and db20) do not perform equally well, and db20 demonstrates the most inferior performance at low SNRs. A definitive explanation for this is not clear and more in-depth study about optimizing the selection of a wavelet basis is to be carried out. From a different perspective, the results also underscore the flexibility offered by the wavelet transform, allowing us to fine-tune the selection of wavelet basis for desired feature extraction.

SNR and sensor configurations
One challenge in dose verification based upon proton induced acoustic signals is low initial pressure and low SNR. The noise components of sensors and readout circuits further decrease SNR (Ahmad et al 2015). In order to take all of them into account, the study of noise immunity of the proposed BiLSTM RNN model was included in our study. The performance comparison in figure 9 highlights the importance of maintaining a high SNR to achieve good accuracy for the DS and DS + HT methods. Otherwise, these two models fail to work. To achieve a range uncertainty within 1 mm, the SNR needs to be above 22. It is understandable that the SNR dependency relies on different signal processing techniques such as HT and Wavedec, but the BiLSTM RNN model also helps with noise immunity and robustness, due to both the capture of sequential correlation and bidirectional signal flow.
By contrast, the Wavedec db1 case is more immune to noise. Even at a low SNR of 5, the range uncertainty is below 1 mm and the MRE is about 2%, while the MRE50 amounts up to ∼20%. This implies that the prediction over the region adjacent to the Bragg peak is more demanding than other regions. As previously explained, two reasons for this are the rapid change of amplitude and small number of data points. Taking a dose profile of 100 MeV beam energy as an example, the calculation of MRE50 includes only 4 points, while the calculation of MRE and MRE20 includes 300 points and 25 points, respectively. One subtle point related to this is about the generation of dose profiles. In our study, all 1D dose profiles were analytically calculated using the CSDA model (Supplementary Material). Compared to Monte-Carlo simulation, the analytical model generates a sharper Bragg peak because it cannot model physical processes such as energy straggling and range straggling. For this reason, we believe that MRE50 can be further reduced when dose profiles are generated using Monte-Carlo simulation.
Besides beam delivery (i.e. dose and spill time), SNR also depends on the performance of various acoustic sensors being used. Several experiments (Hayakawa et al 1988, 1995, Jones et al 2016a and simulation studies (Ahmad et al 2015, Jones et al 2016a, Riva et al 2018 suggest that the signal and noise is about 0.1-0.5 Pa and 20 mPa, respectively. Riva et al (2018) examined SNR as a function of the distance between the detector and the Bragg peak given a delivery of 2 Gy, and reported a SNR of 15 for a distance of 35 cm. Based on these previous studies, a range between 5 and 40 was chosen in our study. The results will be useful in guiding the design of hydrophones and the selection of beam delivery parameters. Furthermore, proton FLASH is an emerging radiotherapy scheme that offers ultra-high dose rate delivery and ultra-fast irradiation (>40 Gy s −1 ) (Beyreuther et al 2019, Bourhis et al 2019. FLASH has great potential to boost the amplitude of acoustic signals and SNR compared to a conventional dose rate (2 Gy min −1 ). We are optimistic that either FLASH or hyperfraction will help increase the applicability of an acoustic-based verification solution in proton therapy.
With regard to sensor configuration, a number of conclusions can be drawn from tables 3 and 4. First, for the DS method, more sensors are desired for a larger coverage, which helps establish the correlation among multiple channels. For the Wavedec db1 method, no considerable degradation is found as the total number of sensors reduces from 32 down to 8 (table 3). This not only suggests that the correlation among multiple channels can be successfully established by using fewer sensors, but also reveals the BiLSTM RNN model's insensitivity to the interference of alpha and gamma waves, a serious challenge encountered in the TOF method (Jones et al 2016a, Nie et al 2018. Using less sensors has one side benefit, that is, the BiLSTM RNN model would have less inputs and thus hyper-parameters to be optimized. Furthermore, having 8 sensors in different locations achieves comparable performance, for the Wavedec db1 method (table 4). If our proposed framework can be tailed to work with 8 sensors, or even less, it would be quite a feat compared to the time reversal reconstruction method . One simple heuristic example might be that three sensors are sufficient to locate a point (three dimensions) within a uniform medium.

Considerations towards practical application
To fully exploit the advantage of our proposed approach, a number of practical challenges need to be tackled. For instance, the location of acoustic sensors cannot cause any interfere with proton beams. The size and directionality of an acoustic sensor (e.g. currently assumed as an ideal point in our work), as well as the coupling between a sensor and patient, need to be further investigated.
The delivery mode of proton beam is also a highly relevant issue. For a pencil beam scanning (PBS) mode, mono-energetic pencil beams are guided towards different locations with the use of steering magnets. Therefore, a verification can be implemented on the basis of per beam or position, given no overlapping of signal detection between two consecutive beams. For a passive scattering (PS) mode, it is common to form a spread-out Bragg peak through the use of a range modulation wheel. Due to the finite rotation speed (e.g. 400-600 rpm), a beam is delivered in the order of milli-seconds (per energy/position), much longer than the time scale of proton-induced acoustic signals (micro-seconds). A reduced dose rate due to the use of scatterers poses an additional challenge (i.e. weaker proton-induced acoustic signals). Therefore, our proposed framework is unlikely to be applicable for a standard PS delivery mode.
How to accurately obtain acoustic parameters of different tissues should be paid due attention as well. To be specific, the propagation of acoustic waves and the signal waveforms strongly rely on the set of physical parameters configured in the k-Wave toolbox. Detailed derivation of those parameters using linear interpolation and HU information from CT images can be found in our previous publication . One concern here is regarding the robustness of our proposed approach. Both input and output for machine learning were so far originated from simulation in this paper. In order to apply the approach to clinical settings, the accuracy of simulation needs to be carefully examined. The accuracy of the k-Wave simulation toolbox itself is less a concern. However, the selection of acoustic parameters (e.g. sound speed in heterogeneous tissue), will undoubtedly impact waveform shapes and feature extraction. To what extent the model depends on the acoustic parameters await further investigation, possibly through experimental measurements once a prototype system is built. Besides CT images, other imaging modalities such as ultrasound imaging and photoacoustic tomography may also be leveraged for answering the same question.

Conclusion
We demonstrated the feasibility of range verification in proton therapy using proton-induced acoustic signals and a machine learning framework. Three signal processing techniques were investigated for feature extraction: direct down-sampling, Hilbert transform, and Wavelet transform. The proposed framework successfully identified the correlation between acoustic waveforms and dose profiles, offering a number of advantages such as incorporating heterogeneous anatomical information, predicting both the location of Bragg peak and dose distribution, and being more computationally efficient. We are currently working towards the extension of the framework to a fully 3D model.