Compensating for visibility artefacts in photoacoustic imaging with a deep learning approach providing prediction uncertainties

Conventional photoacoustic imaging may suffer from the limited view and bandwidth of ultrasound transducers. A deep learning approach is proposed to handle these problems and is demonstrated both in simulations and in experiments on a multi-scale model of leaf skeleton. We employed an experimental approach to build the training and the test sets using photographs of the samples as ground truth images. Reconstructions produced by the neural network show a greatly improved image quality as compared to conventional approaches. In addition, this work aimed at quantifying the reliability of the neural network predictions. To achieve this, the dropout Monte-Carlo procedure is applied to estimate a pixel-wise degree of confidence on each predicted picture. Last, we address the possibility to use transfer learning with simulated data in order to drastically limit the size of the experimental dataset.


Introduction
Photoacoustic (PA) imaging is an emerging biomedical modality based on the generation of acoustic waves by light absorption. This modality is promising, as it enables imaging at large depths with high spatial and temporal resolution, and can provide images of the optical absorption [1] with specific molecular contrast which can be enhanced by spectroscopy.
In conventional PA imaging, a short nanosecond laser pulse is sent into the medium and the emitted ultrasonic waves are collected by a conventional ultrasound (US) probe. At the US propagation time scale, the object illumination is quasi-instantaneous as the speed of light is several orders of magnitude higher than the speed of sound, resulting in the emission of strongly coherent acoustic waves [2]. These waves interfere constructively or destructively depending on the structure of the object, often leading to two well-known artefacts on the reconstructed image: the limited bandwidth and the limited view artefacts [3]. With a resonant detection bandwidth, when an object larger than the acoustic central wavelength of the transducer is illuminated, the strong low frequency component of the PA signals is filtered out. With a limited view (limited detection aperture), for a structure elongated along the axis of the probe, the PA waves interfere constructively perpendicularly to the probe but mostly destructively throughout the elongation. As a result, very few signals are collected in this case by linear or matrix array probes due to their limited angular view. Both types of artefacts will further be referred to as the visibility problem in this paper.
The limited view problem has been addressed in several studies. The most intuitive approach is to either rotate the object relative to the probe [4] (or vice versa [5]) or use ring-shaped transducer arrays [3,6] in order to cover all angles. However, a clinical implementation would benefit from a handheld real-time system as currently used in ultrasound imaging. Other approaches rely on the introduction of a spatial modulation of optical absorption of the sample, either using injection of sparse absorbing particles [7], by a modulation of physical properties [8], or by computing statistical properties of the PA signal generated by fluctuating sources in the medium [9,10]. Fluctuations of the PA signals can be produced by random optical speckle pattern illuminations [9] or by flowing red blood cells [10], naturally present in the blood vessels. Nevertheless, these methods require long acquisition times in order to get significant statistical properties, and therefore have a poor temporal resolution.
In this work, a deep learning approach is proposed to overcome the visibility problem and improve the image quality in a real-time single shot configuration. A neural network can be viewed as an algorithm composed of many parameters, called weights, designed to transform input data into a desired form of output [11]. This algorithm is trained over multiple examples to obtain the best representation of the studied phenomenon. After training, the network transforms an input raw image into an output image that is expected to resemble the (unknown) ground truth. The training set consists of multiple raw data/ground truth pairs that will be used to optimize the weights of the network. Convolutional neural networks (CNN) are amongst the most popular categories of deep learning algorithms (DLA) [12], and have reached state-of-the-art performance in several imaging problems including segmentation [13], classification [14], artefact removal [15] and denoising [16]. CNNs have been introduced recently in biomedical imaging, showing impressive results in various tasks [17][18][19]. Over the past two years, a few groups started investigating deep learning applied to PA imaging for several purposes including direct reconstruction of the initial pressure [20], handling artefacts coming from sparse data [21][22][23], reflection artefact removal [24,25], point source localization [26,27] and quantitative measurements [28,29]. The correction of the limited bandwidth problem was also investigated on very simple objects [30]. Some of these studies [22,23,31] showed that deep learning can also reduce the limited view artefacts, although results were either numerical or obtained with non-conventional imaging devices. A linear array was used in experiments [32] but a ground truth was missing to assess the success of the approach. Finally, in most of the cited studies, experimental results were predicted from models trained only on simulation data.
In this work, we focus on the correction of the whole visibility problem, induced both by the limited view and limited bandwidth of a conventional linear US probe. The originality of our approach resides in the design of a dedicated model object and a method to create an experimental training dataset. The method is used to assess the capacity of a neural network to remove these artefacts on experimental images that were not used during the training. In this study, a ground truth is known also for the test set, which consisted of some of those unseen images. Thus, evaluation of the quality of the reconstruction can be performed. We point out that this study is not designed to produce quantitative PA images, as our ground truth does not directly represent the optical absorption, and focuses on providing morphological images. As a consequence, it can currently not be applied for quantitative imaging including multispectral investigations. Moreover, our study is limited to a given class of objects, and an investigation of the ability of the network to generalize to other classes of objects was out of the scope of this work. A preliminary discussion on generalization is provided in the supplementary materials.
Despite the impressive performances of DLA to reconstruct PA images, errors can be made by the algorithm which may misinterpret the data. This is one of the main limitations of neural network approaches in the medical field: the lack of confidence in the results. In this work, we estimate the uncertainty in our prediction through a Bayesian machine learning framework. We followed the approach proposed by Ghahramani and Gal [33], referred to as Monte Carlo dropout (MC dropout), which has been recently applied for phase imaging [34]. Uncertainty estimation using a Bayesian framework has already been studied in PA imaging to reduce artefacts induced by approximation of reconstruction parameters [35,36]. A specific deep learning approach has also been developed to estimate uncertainty of the optical parameter estimation in quantitative PA imaging [37]. Our case is different, as we do not want to take into account approximation of some parameters, but estimate the uncertainty linked to the deep learning process. To do so, our CNN is converted into a Bayesian neural network to introduce randomness in the prediction process, which makes the prediction no longer deterministic: the model will predict different outputs for the same input. Then, for a given input, several outputs are generated and are interpreted as samples of a probabilistic distribution, from which parameters can be estimated, such as the mean value and the confidence measure. The uncertainty estimation provides positions of invented and poorly reconstructed structures. This estimation is very useful for real-time navigation as a feedback for the user, who may eventually choose to display only the reliable parts of the images. However, we point out that this estimation does not provide information for structures missing in the reconstruction, and is therefore not a measurement of the fidelity.
We also study the DLA performance over different input data types. Usually, a conventionally reconstructed image is used. This prior reconstruction is obtained by applying delays and summation (DAS) on the Hilbert transform of the radiofrequency (RF) signals. This operation produces a complex image whose modulus is computed to form the demodulated beamformed (dmBF) image, which is the one displayed for the end user. While the input of the DLA for PA image reconstruction is usually the dmBF image, we choose to train our network with the modulated beamformed (mBF) image. The mBF image (Fig. 1a) is obtained by applying DAS directly on the real-valued RF time signals. Consequently, the mBF image is modulated by the impulse response of the transducer, resulting in axial oscillations. While the mBF image represents the object less faithfully than the dmBF image (because of the probe oscillations), we show that it carries more information that the DLA can exploit. The two approaches are compared in Supplementary Materials.
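The difference between the mBF and dmBF images can be illustrated with a minimal NumPy sketch. It is not the reconstruction code used in this work: the array geometry, sampling rate, pulse shape and the helper names `analytic_signal` and `delay_and_sum` are assumptions chosen for illustration, and a single synthetic point absorber stands in for the sample.

```python
import numpy as np

def analytic_signal(rf):
    """FFT-based analytic signal (Hilbert transform) along the last (time) axis."""
    n = rf.shape[-1]
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep Nyquist
        gain[1:n // 2] = 2.0           # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(rf, axis=-1) * gain, axis=-1)

def delay_and_sum(signals, elem_x, grid_x, grid_z, fs, c=1500.0):
    """One-way DAS. signals: (n_elements, n_samples); returns an (nz, nx) image."""
    xx, zz = np.meshgrid(grid_x, grid_z)
    image = np.zeros(xx.shape, dtype=signals.dtype)
    for ch, xe in enumerate(elem_x):
        # time of flight from every pixel to this element, as a sample index
        idx = np.rint(np.sqrt((xx - xe) ** 2 + zz ** 2) / c * fs).astype(int)
        valid = idx < signals.shape[1]
        image[valid] += signals[ch, idx[valid]]
    return image

# Synthetic point absorber at (x, z) = (0, 8 mm); all values are illustrative.
fs, c, f0 = 250e6, 1500.0, 15.6e6
elem_x = np.linspace(-6.4e-3, 6.4e-3, 64)
t = np.arange(4096) / fs
tof = np.sqrt(elem_x ** 2 + (8e-3) ** 2) / c
tau = t[None, :] - tof[:, None]
rf = np.exp(-(tau * f0) ** 2) * np.cos(2 * np.pi * f0 * tau)  # band-limited pulse

grid_x = np.linspace(-2.56e-3, 2.56e-3, 65)
grid_z = np.linspace(5.44e-3, 10.56e-3, 129)
# mBF: DAS on the real RF signals, keeps the axial oscillations (signed image)
mbf = delay_and_sum(rf, elem_x, grid_x, grid_z, fs, c)
# dmBF: modulus of DAS applied to the analytic signal (envelope image)
dmbf = np.abs(delay_and_sum(analytic_signal(rf), elem_x, grid_x, grid_z, fs, c))
```

In this toy case the dmBF image is non-negative and peaks at the absorber position, whereas the mBF image oscillates in sign along the depth axis, which is the extra phase information the network is given.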
Finally, we investigate the design of the training sets. Indeed, processing experimental data with CNN trained solely on simulated data seems to produce poor reconstruction [23] which we confirm here (Supplementary Materials), while constructing a large experimental dataset is complex and time consuming. We varied the relative sizes of the combined experimental and simulated datasets and observed its impact on the reconstruction performances.

Conventional reconstruction methods
For comparison purposes, conventional DAS images (dmBF) and L2 deconvolution images are provided. DAS is fast and robust whereas deconvolution methods are more computationally demanding and more complex to implement, since knowledge of the point spread function of the system and regularization are necessary. Here, image deconvolution is achieved using a least-squares minimization approach with an L2-regularization penalty term. It was performed by a fast iterative shrinkage thresholding algorithm (FISTA) [38]. The inversion is defined as:
$$\hat{x} = \underset{x}{\arg\min}\; \lVert y - A x \rVert_2^2 + \lambda \lVert x \rVert_2^2$$
where $\hat{x}$ is the expected reconstructed object, $x$ the object at each iteration, $y$ represents the RF signals and $A$ is the propagation matrix, containing the imaging system response at each point of the reconstruction grid. $\lambda$ is the regularization parameter, tuned heuristically by visually comparing the reconstructed image with the ground truth.
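As an illustration of this kind of inversion, the sketch below minimizes ||A x − y||² + λ||x||² with a FISTA iteration on a small random problem and compares the result with the closed-form ridge solution. The function name `fista_l2`, the step-size choice and the toy matrix are assumptions made for the example; the actual propagation matrix and parameter values of this work are not reproduced here.

```python
import numpy as np

def fista_l2(A, y, lam, n_iter=1000):
    """Minimise ||A x - y||_2^2 + lam * ||x||_2^2 with FISTA."""
    lipschitz = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the data-term gradient
    step = 1.0 / lipschitz
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ z - y)            # gradient of the least-squares term
        # proximal operator of step * lam * ||.||^2: a uniform shrinkage towards zero
        x_new = (z - step * grad) / (1.0 + 2.0 * step * lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum (extrapolation) step
        x, t = x_new, t_new
    return x

# Toy problem: a random matrix stands in for the propagation matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = rng.standard_normal(20)
y = A @ x_true + 0.01 * rng.standard_normal(40)
lam = 0.5

x_fista = fista_l2(A, y, lam)
# For an L2 penalty the minimiser is also available in closed form.
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ y)
```

For a realistic imaging matrix the closed-form solve is usually impractical, which is one reason an iterative scheme is attractive; the comparison here only serves as a sanity check.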

Creation of the experimental dataset
A model of leaf skeleton was chosen as imaging sample (see Fig. 1a). This model has been used in previous studies as it provides a branching structure qualitatively similar to that of a vascular network, and produces conventional images with similar visibility artefacts [39]. To obtain a sufficient photoacoustic signal, the leaf veins were stained with black ink and the leaf blades were dissolved by chemical treatment. The smallest veins of the leaves were finally cut out manually to remove unresolvable details. Each pair of the dataset consisted of an mBF image (input of the network) and the corresponding photograph (ground truth) of a 5.12 × 5.12 mm² patch of the considered leaf.
As shown in Fig. 1a, the leaf is held in a horizontal plane inside an agarose gel, which stands within a tank filled with degassed and deionized water. Through a side window composed of a frame tightening a Mylar membrane, an ultrasonic transducer array (15.6 MHz central frequency, L22-8v, Verasonics, USA), connected to a multichannel acquisition system (Vantage 256 High Frequency, Verasonics, USA), is coupled to the water tank with echographic gel. Thus, the leaf is in the imaging plane of the probe. It is illuminated from the top with 5 ns laser pulses at 10 Hz repetition rate (λ = 532 nm), produced by a frequency-doubled Nd:YAG laser (Surelite, Continuum, USA). For each laser shot, PA signals are acquired and mBF images are reconstructed using DAS assuming a homogeneous medium with a speed of sound of 1500 m s⁻¹, neglecting the presence of the agarose gel. To obtain several independent samples from the same leaf, the leaf is mechanically translated relative to the probe and the light source.
Fig. 1. a, Creation of the experimental training set. A linear probe is coupled to a water tank containing the leaf, through a window composed of a tight Mylar membrane. The leaf is in the imaging plane of the probe. The laser beam is shined from the top and the RF signals are acquired. A photograph of the leaf was previously taken. The mBF PA image of the ROI is reconstructed and the photograph is processed to extract the same area. b, Uncertainty prediction: Several images are generated using the same input. The mean and the standard deviation (std) of these samples are estimated pixel by pixel. The prediction is unstable in the marked area, resulting in a high std.
We define our "ground truth" images as photographs of the leaf taken with a CMOS camera (X-E2, Fujifilm, Japan). These photographs are converted to grey scale (8 bits) and pixels below a threshold are set to 0 to suppress background noise. A registration between the PA image and the corresponding photograph is needed. The parameters of the transformation to apply to co-register the two images (decomposed as rotation, translation and scaling) were found automatically by maximizing the correlation coefficient between the PA image (the reference) and the transformed photograph. 593 pairs of images from two leaves are obtained, split between the experimental training set and the experimental validation set with respectively 500 and 93 pairs. The validation set is used during the training to assess when the optimization process is complete. An experimental test set of 15 pairs is then built from two different leaves. It is used to evaluate the performance of our approach.
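The thresholding and correlation-based registration steps can be sketched as follows. This is a deliberately simplified illustration: only integer translations are searched (the registration described above also includes rotation and scaling), shifts are circular, and the function names, threshold value and search range are assumptions.

```python
import numpy as np

def suppress_background(gray, threshold):
    """Set pixels below the threshold to 0 (gray is an 8-bit image)."""
    return np.where(gray < threshold, 0, gray).astype(gray.dtype)

def correlation_coefficient(a, b):
    """Pearson correlation coefficient between two images of equal shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def register_translation(reference, moving, max_shift=5):
    """Exhaustive search for the integer shift of `moving` that best matches `reference`."""
    best = (-np.inf, 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(moving, (dy, dx), axis=(0, 1))   # circular shift (simplification)
            score = correlation_coefficient(reference, shifted)
            if score > best[0]:
                best = (score, dy, dx)
    return best

# Toy check: the "photograph" is the reference displaced by a known amount.
rng = np.random.default_rng(1)
reference = rng.random((32, 32))
moving = np.roll(reference, (-3, 2), axis=(0, 1))
score, dy, dx = register_translation(reference, moving)

cleaned = suppress_background(np.array([[10, 200]], dtype=np.uint8), 50)
```

A full similarity transform would replace the two nested loops by an optimizer over rotation angle, scale and translation, with the same correlation coefficient as objective.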

Creation of the simulation dataset
The two photographs used as ground truth for the experimental training set were also used to create the simulation dataset. Data augmentation is applied on those photographs to increase the dataset size by applying rotations, mirror transformations, horizontal or vertical shears, and centre expansions or compressions. Then, 1 × 1 cm² images are extracted to compute their PA signals.
The method used to simulate PA imaging is described in our previous work [40]. In brief, the imaging system response is experimentally measured for a single source at one spatial location and the synthesis of the RF signals coming from a whole object is obtained by summing the contributions of each pixel of the object. The object is then reconstructed with DAS as for the experimental data. The medium is assumed to be homogeneous with a constant speed of sound of 1500 m s⁻¹. DAS is then applied to reconstruct an mBF image of 5.12 × 5.12 mm² area and the photograph is cropped to match the dimensions. Propagating PA waves from a larger area (1 × 1 cm²) than the one viewed by the network (5.12 × 5.12 mm²) makes it possible to take into account the presence of surrounding structures, which can affect reconstruction during experiments. A series of 1400 pairs of images is obtained. Around five days are needed to compute the dataset in MATLAB on a desktop computer.
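The principle of synthesizing RF signals by summing per-pixel contributions can be sketched as below. This is a hedged illustration rather than the simulation code of [40]: it assumes that one pulse response can be reused unchanged for every pixel and element (only delayed), ignores amplitude decay and element directivity, and uses a two-sample stand-in for the measured response. The name `synthesize_rf` and all numerical values are hypothetical.

```python
import numpy as np

def synthesize_rf(obj, pix_x, pix_z, elem_x, pulse, fs, n_samples, c=1500.0):
    """Sum delayed copies of a single pulse response over all absorbing pixels."""
    zi, xi = np.nonzero(obj)
    amp = obj[zi, xi]
    rf = np.zeros((len(elem_x), n_samples))
    for ch, xe in enumerate(elem_x):
        # pixel-to-element distance converted to an arrival sample index
        dist = np.sqrt((pix_x[xi] - xe) ** 2 + pix_z[zi] ** 2)
        idx = np.rint(dist / c * fs).astype(int)
        keep = idx < n_samples
        spikes = np.zeros(n_samples)
        np.add.at(spikes, idx[keep], amp[keep])          # one weighted spike per pixel
        rf[ch] = np.convolve(spikes, pulse)[:n_samples]  # apply the system response
    return rf

# Illustrative geometry: a 1 x 1 cm object grid and an 8-element array.
fs = 62.5e6
pix_x = np.linspace(-5e-3, 5e-3, 21)
pix_z = np.linspace(5e-3, 15e-3, 21)
elem_x = np.linspace(-6.4e-3, 6.4e-3, 8)
pulse = np.array([1.0, -0.5])          # stand-in for the measured response

obj_a = np.zeros((21, 21)); obj_a[10, 10] = 1.0
obj_b = np.zeros((21, 21)); obj_b[5, 15] = 2.0
rf_a = synthesize_rf(obj_a, pix_x, pix_z, elem_x, pulse, fs, 2048)
rf_b = synthesize_rf(obj_b, pix_x, pix_z, elem_x, pulse, fs, 2048)
rf_ab = synthesize_rf(obj_a + obj_b, pix_x, pix_z, elem_x, pulse, fs, 2048)
```

Because the model is linear, the signals of a composite object equal the sum of the signals of its parts, which is what makes the per-pixel summation valid.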

Deep learning framework
U-Net [41] is a widely used CNN in the medical field. A slightly modified architecture, presented in the supplementary materials, is implemented with the open source libraries TensorFlow and Keras. Dropout [42] and batch normalization [43] layers are added to limit overfitting of the model. The last layer contains only one filter instead of two in the original version, as the expected output is a single image. The last activation function is also removed as the prediction is no longer a binary image. It is worth mentioning that several modifications intended to improve the result, including skip connections between input and output [17], residual blocks [44] and fully densely connected blocks [31], have been tested without significant improvement of the prediction. The cost function is the classical mean squared error, and an Adam optimizer is used with a learning rate of 5 × 10⁻⁴ and momentum of 8 [45] with batch sizes of 8 images. An early stopping approach based on the validation loss was chosen to limit under- and over-fitting [46]. The prediction phase must be random to model uncertainty. In the MC dropout approach, noise is injected in the model by activation of the dropout layers (dropout rate of 50%) both during training and prediction. In this study, 20 inferences are generated from forward passes through the model, each with a different dropout mask. The resulting predictions are used to estimate the distribution mean and its standard deviation, which gives a map of uncertainty (see Fig. 1b). The training and the evaluation of the network, composed of around 30,000,000 neurons, are performed on an NVIDIA Quadro P2000. Around 50 minutes are needed for the training on the simulation set and 30 minutes on the experimental set.
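The MC dropout prediction step can be illustrated with a toy example. The sketch below does not use the U-Net of this study: a tiny two-layer network with fixed random weights stands in for the trained model, and the names `stochastic_forward` and `mc_dropout` are hypothetical. Only the 50% dropout rate and the 20 forward passes are taken from the text.

```python
import numpy as np

def stochastic_forward(x, w1, w2, rng, rate=0.5):
    """One forward pass with dropout kept active on the hidden layer."""
    hidden = np.maximum(w1 @ x, 0.0)              # ReLU layer
    mask = rng.random(hidden.shape) >= rate       # keep each unit with probability 1 - rate
    hidden = hidden * mask / (1.0 - rate)         # inverted-dropout rescaling
    return w2 @ hidden

def mc_dropout(x, w1, w2, n_passes=20, rate=0.5, seed=0):
    """Pixel-wise mean and standard deviation over several stochastic passes."""
    rng = np.random.default_rng(seed)
    samples = np.stack([stochastic_forward(x, w1, w2, rng, rate)
                        for _ in range(n_passes)])
    return samples.mean(axis=0), samples.std(axis=0)

# Toy "image" of 8 x 8 pixels, flattened; weights are random stand-ins.
rng = np.random.default_rng(42)
x = rng.random(64)
w1 = rng.standard_normal((128, 64)) / 8.0
w2 = rng.standard_normal((64, 128)) / 11.0

mean_img, std_img = mc_dropout(x, w1, w2)
mean_map, std_map = mean_img.reshape(8, 8), std_img.reshape(8, 8)

# With dropout disabled the model is deterministic and the spread vanishes.
_, std_off = mc_dropout(x, w1, w2, rate=0.0)
```

In Keras, one common way to obtain the same behaviour is to call the dropout layers with `training=True` at inference time, so that a new mask is drawn for every forward pass.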

Quantitative assessment of the network performance
As mentioned previously, the same leaves are used to create the simulation and the experimental dataset. It means that from the same object (i.e. an area of the leaf), we will be able to obtain either the experimental RF signals or the simulated one. For comparison purpose, the reconstructed objects shown in the figures are the same for both simulations and experiments. A third example is used for the MC dropout results, the estimated uncertainties of the two previous examples being described in the supplementary materials. All images are normalized by their maximum and represented with a colorbar from 0 to 1, since quantitative information is not expected.
To evaluate the accuracy of a trained neural network, the normalized 2D cross-correlation [47] (NCC) and the scaled and shifted structural similarity index (sSSIM) are computed between each output and ground truth. The first one uses local sums to normalize the cross-correlation for feature matching. SSIM [48] is a widely used metric to evaluate the perceived quality of an image. It is computed over several small windows of the image, quantifying the structure, contrast and intensity similarities. The sSSIM [49] is used to obtain a scaled and unbiased score that does not disadvantage the other reconstruction methods. For an overall performance estimation of the network, the mean and standard deviation values over the test set are presented. On each result, we computed an uncertainty map with the MC dropout method, which only requires a single-shot acquisition. For comparison purposes, as an alternative way to assess the local variability in the reconstructions, we also computed an uncertainty map from the standard deviations of the reconstructions of 50 acquisitions of the same sample. In this case, the variability comes from the experimental noise while the CNN remains deterministic. We also computed the absolute true error of the reconstruction, defined as the absolute difference between the ground truth and the predicted image.

Simulation results
Fig. 2 shows a comparison between the image reconstructed from simulated data by the DLA (d,h) (trained with simulated data), the dmBF image (DAS, b,f), an L2-regularized deconvolution (c,g) and the photograph of the object (ground truth, a,e). L2 minimization and DAS clearly do not recover vertical structures, i.e. the structures elongated along the axis of the probe. Veins having inclinations beyond the detection aperture (typically beyond 45 degrees) are missing due to the limited view problem. The inside of the thicker vessel is also missing and the thickness of the thinner vessel is underestimated for DAS reconstruction and overestimated for the L2 minimization (arrows). In contrast, the deep learning reconstruction yields an almost artefact-free reconstruction with errors located only on the smallest appendages resulting from the manual cutting, and on a few vertical structures which are not completely recovered (stars on the images).
The performances of the different reconstruction methods and their standard deviations, evaluated over the 15 pairs of the simulation test set with the metrics described in the previous section, are shown in Table 1. These numbers clearly confirm the qualitative visual impression: when the DLA is used, the NCC and sSSIM are about three times higher compared to the simple DAS. Scores for the deconvolution method, not shown here, are of the same order as those of the DAS.

Experimental results
Fig. 3 shows a comparison between the image reconstructed from the experimental data by the DLA (d,h) (trained with experimental data), the conventional DAS reconstruction (b,f), an L2-regularized deconvolution (c,g) and the photograph of the object (ground truth, a,e). Similarly, the DAS approach and the L2 minimization both fail to recover vertical structures as well as to provide a good rendering of the vein thicknesses by filling the inside of the thicker ones. In contrast, the DLA trained on the experimental data yields a reconstruction with most of the vertical structures recovered and a correct thickness of the veins (arrows). A few structures are again not recovered, and some mistakes occurred, especially for the reconstruction of vertical veins (stars). The orientation is not always perfectly respected and, in some places, veins appear where there are none in reality. Quantitative performances are shown in Table 1 where, as for simulations, we observe a large improvement for the deep learning approach compared to the DAS, although lower than in the simulation results. Both the sSSIM (0.76) and the NCC (0.8) are significantly enhanced. It may also be noted that the DAS performs better on experimental data than on simulation data.
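To make the two scores more concrete, the sketch below gives simplified versions of both metrics. These are assumptions rather than the exact implementations of [47] to [49]: the NCC is evaluated at zero lag only, the SSIM is computed over a single global window instead of several small ones, and the "scaled and shifted" step is read here as a least-squares affine fit of the prediction to the ground truth before computing SSIM.

```python
import numpy as np

def ncc_zero_lag(pred, truth):
    """Normalised cross-correlation at zero lag (Pearson correlation)."""
    p, t = pred - pred.mean(), truth - truth.mean()
    return float((p * t).sum() / np.sqrt((p * p).sum() * (t * t).sum()))

def scale_and_shift(pred, truth):
    """Least-squares a, b such that a * pred + b best matches truth."""
    a, b = np.polyfit(pred.ravel(), truth.ravel(), 1)
    return a * pred + b

def global_ssim(x, y, data_range=1.0):
    """SSIM over one global window, with the usual stabilising constants."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def sssim(pred, truth):
    """SSIM after the prediction has been affinely matched to the ground truth."""
    return global_ssim(scale_and_shift(pred, truth), truth)

rng = np.random.default_rng(3)
truth = rng.random((64, 64))
pred_good = 0.5 * truth + 0.2          # correct structure, wrong scale and offset
pred_bad = rng.random((64, 64))        # unrelated image

ncc_good = ncc_zero_lag(pred_good, truth)
sssim_good = sssim(pred_good, truth)
sssim_bad = sssim(pred_bad, truth)
```

The affine fit removes differences in overall scale and offset, so a method is not penalized merely because its output lives on a different intensity scale than the photograph.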

Uncertainty estimation
Results of the MC dropout procedure are presented in Fig. 4. A low standard deviation indicates a good robustness of the technique:  the prediction remains stable over several realizations. Areas with high value are similar in the estimated uncertainty map (Fig. 4e), the absolute error map (Fig. 4d) and in the map of standard deviation from experimental noise (Fig. 4f). Most of the errors in the prediction are captured like the small vessel at the bottom, that was not fully recovered, or the central one which was reconstructed but in a curved shape instead of a straight one (arrows).

Impact of the pretraining and the size of the training set on the performance
In this part, the efficacy of a pretraining session with a simulation dataset is investigated as a means to improve the general performance and to reduce the size of the experimental training set. The uncertainty prediction was not studied in this configuration. We increase the training set size with unseen parts of the two leaves used for the test set. The DLA was trained with experimental datasets of different sizes, from 10 to 550 pairs (the entire dataset). For each size, the training is repeated 30 times with, for each repetition, a training and validation set composed of different pairs randomly chosen. This is needed to limit the influence of individual pairs when studying the effect of the training set size (for example, it is likely that 10 examples very close to the test set will provide a better prediction than 20 very distant ones), especially for small set sizes. The displayed sSSIM values are therefore an average over all the test sets from the 30 different realizations. To evaluate pretraining, we repeated this procedure with weights initialized by those obtained at the end of a training session on a simulation dataset composed of 1400 pairs.
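The repeated random sub-sampling can be sketched as follows. Only the 30 repetitions and the range of dataset sizes come from the text; the pool size, the validation fraction and the function name `random_splits` are assumptions for illustration, and the training itself is left out.

```python
import numpy as np

def random_splits(n_pairs, subset_size, n_repeats=30, val_fraction=0.2, seed=0):
    """Draw n_repeats random (train, validation) index sets of a given total size."""
    rng = np.random.default_rng(seed)
    n_val = max(1, int(round(val_fraction * subset_size)))
    splits = []
    for _ in range(n_repeats):
        chosen = rng.permutation(n_pairs)[:subset_size]   # random subset without repetition
        splits.append((chosen[n_val:], chosen[:n_val]))   # (train indices, validation indices)
    return splits

# Hypothetical pool of 550 pairs, sub-sampled to 50 pairs, 30 times.
splits = random_splits(n_pairs=550, subset_size=50)
train_idx, val_idx = splits[0]
```

Each split would then be used for one training run, with or without pretrained initial weights, and the test-set sSSIM averaged over the 30 runs.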
The results are shown in Fig. 5. As expected, the performance increases with the experimental dataset size. Below 200 pairs, errors remain present and the vein thickness is not always faithfully represented. From 200 pairs, the image quality seems visually stable, although the sSSIM value still slowly increases. With pretraining (blue curve), convergence is faster. When the full dataset is used, pretraining only slightly increases the performance of the network (sSSIM improved from 0.76 to 0.78). For a smaller size such as 50 pairs, the score improvement is larger (from 0.63 to 0.72). A reconstructed image comparable to the one obtained with the total experimental training set is almost reached with this experimental dataset of 50 pairs.
In this situation, a pretraining session therefore makes it possible to decrease the size of the experimental training set by a factor of 4.

Discussion
The algorithm trained with simulated data is able to produce images that are free from visibility artefacts when applied to simulated test data. When trained with simulated data, however, the algorithm fails to provide images of good quality when applied to experimental data, as illustrated in the Supplementary Materials (see Fig. S3). When both training and prediction are made with experimental data, while a few errors may remain, most vessels are well recovered. A fundamental difference between simulation and experiment is the nature of the ground truth. In simulation, the ground truth exactly represents the distribution of absorbed light. In the experiments, the ground truth is a photographic 2D projection of a three-dimensional absorption distribution. It then represents the integration of the optical absorption through the sample thickness (which may vary among leaf veins). Consequently, the photograph is not a quantitative representation of the sample absorption, and therefore cannot be a quantitative ground truth. Thus, our method is not supposed to provide a quantitative reconstruction of the absorbed light, as the network is forced to learn from a 2D representation of a 3D object of finite and varying thickness. Given the nature of our ground truth, the purpose here was to demonstrate that the morphology of the sample could be retrieved free of visibility artefacts. Quantitative ground truth would be required to provide quantitative information on the absorption coefficient, as needed for instance for spectroscopic applications.
The use of the mBF image as input of the network improves the performance both quantitatively and qualitatively compared to the dmBF image input (see supplementary materials, Fig. S2). In fact, the mBF image, although affected by the oscillatory impulse response of the ultrasound probe, turns out to carry more information to be captured by the network.
The results show that errors are often located at the edges of the reconstructed image. Indeed, in these areas, less information about the surrounding structures is available. One way to limit these artefacts could be to reconstruct on a larger area and crop the edges. Most of the errors remaining on experimental images are located where the manual cutting was performed, resulting in small appendages. These structures, which do not belong to the initial object, turned out to mislead the network, which seeks to elongate them to join all the veins together. It is reasonable to think that the number of errors would have been lower on a more regular object. More broadly speaking, these results can be enhanced by improving the quality of the training set.
Nevertheless, potential errors in the reconstructed image, especially the invented structures, are problematic for end users (clinicians, biologists, etc.). The MC dropout approach proposed in this article helps locate most of them. Importantly, the estimated uncertainty remains low where reconstruction is correct, leading to a clear distinction of the suspicious areas: false alarms, which could mislead the clinician, are rare. It is worth mentioning that while the method helps locate invented structures or incorrect reconstructions (true positives), it is less efficient at capturing missed structures (false negatives), as illustrated in the supplementary materials, Fig. S4. This is understandable, as these errors are mostly related to a lack of information in the data. The map of the standard deviation over experimental realizations was computed to compare our result with the uncertainty map based on making the CNN stochastic. The areas showing high values in the two maps are co-located, showing that the MC dropout method with a single-shot acquisition provides information similar to that resulting from noise-induced variability. This feature is promising in the context of moving tissue imaging and real-time navigation. We explored other methods that can provide uncertainty maps, such as Deep ensemble [50] and Dropout ensemble [34]. In our study, the estimation of the mean image turned out to be superior for MC dropout according to the metrics presented previously. This difference could be explained by the required modification of the loss function for the other methods, involving a decrease of the overall performance.
To obtain the results of our study, our model was trained on an experimental dataset. However, in a clinical context, large experimental datasets may be complex to build. Using only simulated data to train a model and reconstruct experimental images would be ideal, as simulation data can easily be produced. In our study, this approach turned out to produce unsatisfactory results, as we illustrate in supplementary materials (Fig. S3): predictions on experimental data provided by a DLA trained on either simulated or experimental data are shown, and many artefacts remain on the experimental images reconstructed by a DLA trained with simulation data. These results are also in agreement with observations made by Davoudi et al. [23]. Although simulations are supposed to correctly model experiments, the nature of the ground truth is very different for simulations and experiments in our case, which is likely the reason why training from experimental data provides much better results as compared to training with simulation data. Nonetheless, as shown in the previous section, the incorporation of simulated data through a transfer learning approach allows the size of the experimental dataset to be reduced significantly. The algorithm only needs to update its parameters with the difference between the simulations and the experiments, which is easier than learning the overall procedure. In the medical field, such a pretraining session could be useful for reducing the number of patients necessary to create a training set.
While our objective was limited to a proof-of-concept demonstration, several challenges must be taken into account in order to apply our method in a practical context.
The ground truth used for our proof of concept has to be replaced in practice by a ground truth that can be measured in a realistic environment (including deep inside tissue). Such a ground truth measurement could come from another imaging modality such as X-ray computed tomography (X-ray CT) or magnetic resonance imaging (MRI), which can accurately retrieve morphological information. They would however not provide a quantitative ground truth for the optical absorption, which would prevent the use of the method for spectroscopic approaches. A quantitative estimate of the ground truth can be obtained with a more sophisticated PA imaging device [23], or any method providing a visibility-artefact-free image proportional to the optical absorption. The training could be done with such a device free of visibility artefacts, to train a DLA to be applied on a simpler device. Quantitative reconstruction has been obtained with such an approach in the context of simulations [51].
The influence of noise on RF signals should also be studied to assess the validity of our approach in a noisier environment. In our work, the signal to noise ratio (SNR) on RF signals is about 60. This value however represents the SNR of signals produced by horizontal structures, while this work mainly focuses on the reconstruction of vessels affected by the visibility problem, for which the signal is almost nonexistent. In addition, the background of our DAS images is polluted by clutter, an artefact located around the object originating from the lack of information for the reconstruction. The amplitude of the clutter is often higher than that of vertical structures. The model situation considered in this work thus remains significantly challenging.
Finally, the quality of the prediction is strongly influenced by the class of the object to reconstruct. The relative homogeneity of the studied dataset is one of the reasons the DLA performs well: while very good predictions were made for leaves that were never seen by the network during training, the unseen leaves were from the same species as the leaves used for training. This is, however, an inherent limitation of deep learning approaches. In the supplementary information, we provide a preliminary investigation of the ability of our approach to generalize to objects that do not belong to the class used for training: we tried to reconstruct the vessel structure available in the k-Wave package (http://www.k-wave.org/) with the algorithm trained on the simulated dataset built from leaves. The predicted image, shown in the supplementary materials (Fig. S5), is well reconstructed and free of artefacts, suggesting little or no overfitting in our approach, and that the network may generalize well. In a more general context, the capacity of a network to generalize is crucial and must be investigated for each particular situation.
Aside from increasing the quality of the reconstructed image, the DLA offers other interesting features. Only 10 ms is needed to make a prediction using a regular graphics card, which is much shorter than the reconstruction time of the deconvolution method. Real-time reconstruction during user navigation could therefore be achieved. Besides, once trained, the network does not require the user to set any parameters, unlike the deconvolution approach, in which the regularization parameter has to be chosen carefully and rather subjectively.

Conclusion
The possibility of removing visibility artefacts with a neural network has been demonstrated both in simulations and in experiments on a model class of complex objects. Vertical parts of objects and the inside of large structures, which are missing from conventional reconstructions, are recovered. These qualitative assessments are confirmed by quantitative metrics, which are far better for the DLA than for conventional reconstruction methods. However, some errors may still be present in the reconstructed images, such as invented or poorly reconstructed structures as well as missing structures, although their number might be reduced by improving the experimental protocol. An MC dropout approach was proposed and successfully applied to identify invented and poorly reconstructed structures: high values on the generated uncertainty map agree with high values on the true error map. Besides, it has been shown that pretraining the network with simulated data makes it possible to reduce the size of the experimental training set by a factor of 4 while maintaining a similar quality of reconstruction.
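For readers unfamiliar with the MC dropout procedure mentioned above, the sketch below shows its principle on a toy two-layer network. It is not the architecture used in this work: the weights are random stand-ins for trained parameters, and the sizes, the dropout rate and the number of passes are assumed values chosen for illustration. Dropout is kept active at prediction time, the same input is passed through the network several times, and the pixel-wise mean and standard deviation of the outputs give the prediction and its uncertainty map.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p_drop, n_passes, rng):
    """Run `n_passes` stochastic forward passes with dropout kept active.

    Returns the pixel-wise mean (prediction) and standard deviation
    (uncertainty map) over the passes.
    """
    outputs = []
    for _ in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)            # ReLU hidden layer
        keep = rng.random(hidden.shape) >= p_drop   # random dropout mask
        hidden = hidden * keep / (1.0 - p_drop)     # inverted dropout scaling
        outputs.append(hidden @ w2)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.std(axis=0)

rng = np.random.default_rng(0)
n_pixels, n_hidden = 64, 128                             # toy sizes
w1 = rng.normal(scale=0.1, size=(n_pixels, n_hidden))    # stand-in for trained weights
w2 = rng.normal(scale=0.1, size=(n_hidden, n_pixels))
x = rng.random(n_pixels)                                 # stand-in for a flattened input image

prediction, uncertainty = mc_dropout_predict(
    x, w1, w2, p_drop=0.2, n_passes=50, rng=rng
)
```

With `p_drop=0.0` all passes are identical and the uncertainty map vanishes, which is a convenient sanity check. Whether high uncertainty coincides with high error, as reported in this work, can only be verified on a trained network with known ground truth.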

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding information
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 681514-COHERENCE).