Deep speckle correlation: a deep learning approach towards scalable imaging through scattering media

Imaging through scattering is an important, yet challenging problem. Tremendous progress has been made by exploiting the deterministic input-output "transmission matrix" for a fixed medium. However, this "one-to-one" mapping is highly susceptible to speckle decorrelations: small perturbations to the scattering medium lead to model errors and severe degradation of the imaging performance. Our goal here is to develop a new framework that is highly scalable to both medium perturbations and measurement requirements. To do so, we propose a statistical "one-to-all" deep learning technique that encapsulates a wide range of statistical variations for the model to be resilient to speckle decorrelations. Specifically, we develop a convolutional neural network (CNN) that is able to learn the statistical information contained in the speckle intensity patterns captured on a set of diffusers having the same macroscopic parameter. We then show for the first time, to the best of our knowledge, that the trained CNN is able to generalize and make high-quality object predictions through an entirely different set of diffusers of the same class. Our work paves the way to a highly scalable deep learning approach for imaging through scattering media.


Introduction
Light scattering in complex media is a pervasive problem across many areas, such as deep tissue imaging [37], imaging in degraded environment [43], and wavefront shaping [35,45,54]. To date, there is no simple solution for inverting scattering because of the many possible optical paths between the object and the detector. The output of a coherent light scattered from a complex medium exhibits a seemingly random speckle pattern [14]. The speckle's spatial distribution is a complex function of both the microscopic arrangement of the scatterers and the wavefront of the incident field. Thus, a comprehensive deterministic characterization of the scattering process is often difficult, requiring large-scale measurements.
Major progress has been made by using the transmission matrix (TM) framework [24,38,54] that characterizes the 'one-to-one' input-output relation of a fixed scattering medium as a linear shift-variant matrix. Due to the many underlying degrees of freedom, the TM is inevitably large; its size generally grows quadratically with the number of transferred pixels, i.e. the system's space-bandwidth-product (SBP). This makes the approach highly measurement- and data-demanding for high-SBP applications. Under special conditions, simplification can be made using the memory effect [13], which approximates the system to be shift-invariant. However, the SBP of this method is still small due to the limited memory effect range [13,46], finite sensor dynamic range [22], imaging geometry [28,34,52], and trade-offs between illumination coherence, speckle contrast, and measurement requirement [6,11,22,26].
A major limitation of these existing approaches is their high susceptibility to model errors. The phase-sensitive TM is inherently intolerant to speckle decorrelations [15,19,31,39]. Slight changes of the medium can lead to much reduced correlations between the speckles measured before and after. This indicates the breakdown of the previous input-output relation, and results in rapid degradation of the transferred images. In other words, a new TM is needed once the speckle patterns become decorrelated, e.g. Pearson correlation coefficient (PCC) < 1/e, making these methods challenging to scale for applications involving dynamic scatterers. Current solutions focus on developing hardware with higher speed than the medium's decorrelation time [7,9,31,32,56]; still, they are often limited by the memory effect.
Our goal here is to develop a highly scalable imaging through scattering framework by overcoming the existing limitations in susceptibility to speckle decorrelation and SBP. The main approach is to build a 'one-to-all' model that possesses two essential statistical properties. First, 'one' model sufficiently encompasses the statistical variations across 'all' scattering media with different scatterer microstructures but within the same class. Second, the model can distill the statistically invariant information encoded in the speckle patterns (correlated or decorrelated). Together, they allow the single model to be generalizable to various objects/media having the same statistical characteristics.
The proposed model is built on a deep learning (DL) framework. To satisfy the desired statistical properties, we do not train a convolutional neural network (CNN) to learn the TM of a single scattering medium. Instead, we build a CNN to learn a 'one-to-all' mapping by training on multiple scattering media with different microstructures while having the same macroscopic parameter. Specifically, we show that our CNN model trained on a few diffusers can sufficiently support the statistical information of all diffusers having the same mean characteristics (e.g. 'grits' [4]). We then experimentally demonstrate that the CNN is able to 'invert' speckles captured from entirely different diffusers to make high-quality object predictions, as outlined in Fig. 1.
DL has been shown to be powerful in solving complex imaging problems, providing state-of-the-art performance in super-resolution [41,57], holography [40,42], and phase recovery [36,47]. Instead of building an explicit model, DL takes a data-driven approach that seeks solutions by learning from large-scale datasets. The major benefit is the flexibility and adaptability in solving complex problems, in which a parametric model is hard to derive and/or prone to errors. Closely related to our work are the learning-based techniques for imaging/focusing through diffusers [16,17,29,33,53]. Unfortunately, all existing networks are only trained and tested on the same diffuser, so the model may still be susceptible to speckle decorrelation. Indeed, as tested in our experiment, a CNN trained on a single diffuser does not capture sufficient statistical variations to interpret speckle patterns from other diffusers. Another closely related line of work uses DL to image through multi-mode fibers (MMF) [8,12]. Image transfer through an MMF also results in speckle patterns due to spatial mode mixing. CNNs have been designed to capture sufficient statistical variations of the setup so as to provide superior robustness against random variations.
We demonstrate our technique under shift-variant scattering by placing a diffuser at a defocused plane [28,29,34]. This geometry provides a limited isoplanatic region (≈ speckle size) [28,34], as verified experimentally in Fig. 2. The objects extend well beyond the isoplanatic region (∼300×300 speckle size). Our task is further complicated by the intensity-only measurement under coherent illumination; the mapping between the object and speckle intensity is nonlinear [14]. The training step in our DL method is conceptually similar to the TM calibration, in which a series of patterns are input to the diffuser and the output is measured. In TM calibration, interferometric measurements are often required [24,38]; additional phase-retrieval procedures are needed when intensity-only data are used [10]. Here, the proposed CNN learns to interpret the 'phaseless' measurements using its nonlinear, multilayer structure.
We experimentally achieve ∼256×256-pixel SBP using up to 2400 training pairs. Importantly, our training data were collected on multiple diffusers. Distinct from the TM approach, our trained CNN is able to predict objects through 'unseen diffusers' that were never used during training. We experimentally quantify the CNN performance trained with 1, 2, or 4 diffusers and demonstrate the superior robustness over speckle decorrelation of our technique. We further demonstrate that the trained CNN is able to generalize over new object types through unseen diffusers.
Although it is hard to give an explicit expression of our CNN model (a common challenge in DL), we attempt to provide some insights by performing both CNN visualization and statistical analysis on our data across multiple objects and diffusers. The basic mechanism of DL is to identify statistical invariance across large datasets [27]. We first visualize the activation maps of our CNN when inputting speckle patterns obtained from the same object but through different diffusers. By quantifying the correlations between the corresponding activation maps, we show that our CNN indeed gradually learns the invariance across these speckle patterns. Next, we visualize speckle intensity correlations and show that physical invariance does exist across seemingly decorrelated speckle patterns taken through different diffusers. Such information would be hard to be directly utilized using existing models. Our CNN model is able to discover and exploit these 'hidden' invariant features owing to its higher representation power.
We demonstrate a promising DL framework towards highly scalable imaging through scattering media. Our method significantly improves the system's information throughput and adaptability as compared to existing approaches, by improving both the SBP and the robustness to speckle decorrelations.

Experimental setup
We use a spatial light modulator (SLM) (Holoeye NIR-011, pixel size 8µm) as a programmable amplitude-only object with two orthogonally oriented polarizers before and after [ Fig. 2(a)], similar to [29]. It is coherently illuminated by a collimated beam from a HeNe laser (632nm, Thorlabs HNL210L). The SLM is relayed onto the camera (Thorlabs Quantalux, pixel size 5.04µm) by a 4F system. Two lenses with focal lengths 200mm (L1), and 125mm (L2) are used to provide a 0.625 magnification. This design approximately produces the same effective pixel size for the object and the image, which is convenient for the CNN implementation since the same number of pixels can be used for the input and output without resizing [29]. Precise pixel-wise alignment was not performed nor needed. A ∼9mm iris is placed at the pupil plane of the 4F system to control the speckle size. The theoretical average speckle size is ∼8.8µm, or equivalently ∼14µm on the object plane, as set by λ/2NA (NA denotes the numerical aperture of the 4F system) [14]. This is experimentally verified by taking the autocorrelation of a speckle pattern through a diffuser and measuring the full-width at half-maximum [14], which reads ∼16µm, as shown in Fig. 2(b) (for ease of comparison, all length measurements are converted to the object side).
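The numbers quoted for the magnification, effective pixel size, and theoretical speckle size can be reproduced with a few lines of arithmetic. The sketch below is only a consistency check on the values given in the text; it assumes a paraxial image-side NA equal to the iris radius divided by the focal length of L2, which is an assumption made here and not stated in the text.

```python
# Values quoted in the text
wavelength_um = 0.632            # HeNe wavelength as quoted (632 nm)
f1_mm, f2_mm = 200.0, 125.0      # focal lengths of L1 and L2
iris_diameter_mm = 9.0           # approximate iris diameter at the pupil plane
slm_pixel_um = 8.0
camera_pixel_um = 5.04

# 4F magnification from object (SLM) to image (camera)
magnification = f2_mm / f1_mm

# Assumption: paraxial image-side NA = iris radius / f2
na_image = (iris_diameter_mm / 2.0) / f2_mm

# Average speckle size lambda / (2 NA), on the image and object sides
speckle_image_um = wavelength_um / (2.0 * na_image)
speckle_object_um = speckle_image_um / magnification

# SLM pixel size as seen on the camera
slm_pixel_on_camera_um = slm_pixel_um * magnification

print(f"magnification          : {magnification:.3f}")
print(f"speckle size (image)   : {speckle_image_um:.2f} um")
print(f"speckle size (object)  : {speckle_object_um:.2f} um")
print(f"SLM pixel on camera    : {slm_pixel_on_camera_um:.2f} um "
      f"(camera pixel {camera_pixel_um} um)")
```

Under these assumptions, the image-side and object-side speckle sizes come out close to the ∼8.8µm and ∼14µm quoted above, and the demagnified SLM pixel nearly matches the camera pixel.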
The spatially variant scattering is generated by placing a thin glass diffuser (Thorlabs, 220 grits, DG10-220) between the SLM and the 4F system's first lens. This system theoretically provides a small isoplanatic region that is limited to a single speckle since the diffuser is placed at a defocus position [34]. We quantify the isoplanatism by measuring the intensity speckle correlations [22]. A 3×3 pixel 'point-object' is scanned linearly across the SLM pixel-by-pixel (8µm). The isoplanatic range is then found by calculating the PCC between the speckle pattern from the central point and the one from each shifted point. Rapid speckle decorrelation beyond a single speckle range is observed in Fig. 2(c). The correlation coefficient plateaus around 0.3, close to the value in the speckle intensity autocorrelation curve [Fig. 2(b)]. The smallest object (24µm) was limited by the signal-to-background ratio of the experiment due to imperfect polarizer extinction power producing non-negligible background at low-light levels. The same procedure was repeated on different object sizes; nearly identical curves are obtained (see the supplementary material). The same behavior was numerically predicted in [29,34]. Speckle measurements are repeated on several diffusers having the same macroscopic parameter (220 grits). All glass (BK-7) diffusers are manufactured by the same process (Thorlabs), in which the top surface is first polished, and the bottom surface is then ground with the specified (220) grit. 220-grit provides an average 63µm feature size on the glass surface. When imaged in our setup, speckles generated by all diffusers possess similar statistical properties, including the average speckle size and the background correlation (0.3) (see Figs. 2 and 11).
Figure 2 (caption, partial): (b) The speckle size is ∼16µm, characterized by the speckle's intensity autocorrelation. (c) The isoplanatic range is ∼1 speckle size, characterized by the cross-correlation coefficients between speckle patterns of shifted point objects.
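The isoplanatism measurement described above amounts to computing the PCC between a reference speckle pattern and each pattern recorded for a shifted point object. A minimal sketch follows, assuming the speckle frames are available as a stack of 2D NumPy arrays; the function names and the synthetic test stack (Gaussian patterns sharing a common component chosen to give a background correlation near 0.3) are illustrative only and are not a physical speckle simulation.

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two equally sized images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

def isoplanatic_curve(stack, center_index, step_um=8.0):
    """PCC between the central frame and every frame of a scan.

    stack        : sequence of 2D speckle images, one per point-object position
    center_index : index of the frame taken at the central position
    step_um      : scan step (one SLM pixel, 8 um, in the text)
    Returns (shifts in um, PCC values).
    """
    stack = np.asarray(stack, dtype=float)
    reference = stack[center_index]
    shifts = (np.arange(len(stack)) - center_index) * step_um
    values = np.array([pcc(reference, frame) for frame in stack])
    return shifts, values

# Synthetic stand-in data: each frame mixes a shared pattern with an
# independent one so that distinct frames correlate at roughly 0.3.
rng = np.random.default_rng(0)
shared = rng.standard_normal((64, 64))
frames = [np.sqrt(0.3) * shared + np.sqrt(0.7) * rng.standard_normal((64, 64))
          for _ in range(9)]
shifts_um, curve = isoplanatic_curve(frames, center_index=4)
```

With measured data, the width of the central peak of `curve` would indicate the isoplanatic range, and the off-peak plateau the background correlation.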

Data acquisition
The central 512×512 SLM pixels are used as the object; the corresponding central 512×512 camera pixels are used as the speckle intensity for CNN training and testing. Considering the system's resolution (measured by the speckle size), the SBP is ∼300×300 pixels with a field-of-view (FOV) of ∼4×4 mm², which is well beyond the isoplanatic patch. The objects displayed on the SLM are 8
Figure 3: The proposed CNN architecture to learn the statistical relationship between speckle patterns and unscattered objects. It takes the general encoder-decoder Unet structure (the layer indices are marked in blue). Starting with a high-resolution input speckle pattern, the encoder gradually condenses the lateral spatial information (size marked in black) into high-level feature maps with growing depths (size marked in purple); the decoder reverses the process by recombining the information into feature maps with gradually increased lateral details; the output consists of a two-channel (object and background) pixel-wise prediction.
In total we take speckle patterns using 9 different diffusers. We use data from up to 4 diffusers to train our CNN; the data from the other 5 diffusers are never seen by the CNN during training and are only used for testing. The training objects are only taken from the handwritten digit and letter databases. The Quickdraw objects are only used for testing. The data were taken over a span of ∼8 weeks, demonstrating the robustness of our approach to possible random variations during the experiment.
To collect the training data, we use 600 objects in total (300 digits and 300 letters). For each training diffuser, we take 600 speckle images, giving up to 2400 training pairs in total.
Our testing data are purposely designed to have four groups for characterizing our CNN's generalization capability evaluated from different perspectives: Group 1 tests the CNN generalization over speckle decorrelation due to the change of diffusers. It consists of 3000 'seen objects through unseen diffusers' collected from the same 600 objects used in the training, but through the 5 unseen testing diffusers. Group 2 tests the CNN over the change of diffusers and unseen objects of the same type (as the training objects). It consists of 200 'unseen objects of the same type through unseen diffusers' from previously unused 200 objects during the training and of the same class (100 digits + 100 letters), and through a randomly selected unseen testing diffuser. Group 3 tests the CNN over the change of diffusers and new object types. It consists of 800 'unseen objects of new types through unseen diffusers', and through the 5 unseen testing diffusers. The objects are taken from the Quickdraw database. Group 4 benchmarks the CNN performance trained on a single diffuser. It consists of 28 'unseen objects through the same diffuser' from previously unused 28 objects of the same type (9 digits + 19 letters) during training, and through a randomly selected seen training diffuser.

Data preprocessing
Due to computational limitations, all input and output images are first downsampled from 512×512 pixels to 256×256 pixels by taking the average within each 2×2 neighboring pixels (i.e. 2×2 binning). The downsampling reduces both the number of network parameters (which grows with the input size) and the required data size for training without overfitting (which grows with the network parameters). However, two artifacts may result. First, our system images each speckle with approximately two pixels; after downsampling, each binned image pixel contains intensities from several speckle grains, effectively reducing the contrast of the input patterns [14]. Second, each binned object pixel may combine pixels from both the object and background regions, introducing incorrect (noisy) ground-truth. Robust training using noisy ground-truth has been shown in other CNN tasks [59]. In essence, the CNN learns the invariants and filters out the random noise. Our results suggest that the downsampling has little effect on the final results. Next, for both training and testing, the input speckles are normalized between 0 and 1 by dividing each image by its maximum.
Our CNN is designed to perform two types of tasks. First, the binary detection task outputs a two-channel binary estimate of the object and background. Accordingly, during the training, each grayscale object is thresholded by setting all non-zero valued pixels to 1 to give the ground-truth object; the ground-truth background is the complement. Second, the grayscale object reconstruction task outputs a two-channel grayscale estimate of the object and background. The ground-truth object is the grayscale image displayed on the SLM, processed with 2×2 pixel binning and normalized between 0 and 1; the ground-truth background is defined by subtracting the ground-truth object from 1.
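The preprocessing steps above (2×2 binning, max-normalization of the speckle, and construction of the two-channel ground truth) can be sketched in NumPy as follows. The function names are illustrative. Two details are assumptions made for this sketch and are not stated in the text: the binary threshold is applied after binning, and the grayscale object is normalized by dividing by 255 (8-bit full scale).

```python
import numpy as np

def bin2x2(img):
    """Downsample by averaging each non-overlapping 2x2 block."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    if h % 2 or w % 2:
        raise ValueError("image dimensions must be even for 2x2 binning")
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def normalize_speckle(img):
    """Scale a speckle image to [0, 1] by dividing by its maximum."""
    img = np.asarray(img, dtype=float)
    peak = img.max()
    if peak <= 0:
        raise ValueError("speckle image has no positive values")
    return img / peak

def binary_ground_truth(obj):
    """Two-channel (object, background) binary target.

    Assumption: thresholding (non-zero -> 1) is applied after binning.
    """
    fg = (bin2x2(obj) > 0).astype(float)
    return np.stack([fg, 1.0 - fg], axis=-1)

def grayscale_ground_truth(obj_8bit):
    """Two-channel (object, background) grayscale target.

    Assumption: the 8-bit object is normalized by its full scale (255).
    """
    fg = bin2x2(obj_8bit) / 255.0
    return np.stack([fg, 1.0 - fg], axis=-1)
```

In both cases the two channels sum to one at every pixel, which matches the complementary object/background definition given above.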

CNN implementation
We build a CNN to learn a statistical model relating the speckle patterns and the unscattered objects. Importantly, the goal is to make predictions through previously unseen diffusers.
The overall structure of the proposed CNN (Fig. 3) follows the encoder-decoder 'Unet' architecture [44] with modifications of replacing each convolutional layer with a dense block [18] to improve the training efficiency [29]. The input to the CNN is a preprocessed 256×256 speckle pattern. Next, the input goes through the 'encoder' path, which consists of 4 dense blocks connected by max pooling layer for downsampling. The intermediate output from the encoder has small lateral dimensions (16×16), but encodes rich information along the 'depth' (having 1088 activation maps). Each dense block contains multiple layers, in which each layer consists of batch-normalization (BN), the rectified linear unit (ReLU) nonlinear activation, and convolution (conv) with 16 filters. Next, the low-resolution activation maps go through the 'decoder' path, which consists of 4 additional dense blocks connected by up-sampling convolutional (up conv) layers. The information across different spatial scales are tunneled through the encoder-decoder paths by skip connections to preserve high-frequency information. After the decoder path, an additional convolutional layer followed by the last layer produces the network output. The design of this last layer requires careful consideration of the desired imaging task.
Our CNN is designed to image sparse objects. Widely used loss functions, including mean squared error (MSE) and mean absolute error (MAE), cannot promote sparsity since they assume the underlying signals follow Gaussian and Laplace statistics, respectively [23]. In a recent work [29], the negative PCC is shown to promote sparse predictions. Here, we propose an alternative method. First, we use a softmax layer to produce a pair of mutually complementary object and background channels. We then use the averaged cross-entropy [44] as the loss function $L$, which has been shown to promote sparsity [50], and is given by
$$L = -\frac{1}{2N}\sum_{c}\sum_{x} g_c(x)\,\log p_c(x),$$
where $g$ is the ground-truth pixel value and $p$ represents the prediction; the average is over all $N$ pixels $x$ across both channels $c$. Both $g$ and $p$ can take binary or continuous values. Importantly, our design allows making both binary and grayscale predictions. First, we consider the pixel-wise binary detection problem: the CNN predicts if the object is present or not pixel-by-pixel. In this case, both the ground-truth and predictions take binary values. The intermediate output from the softmax layer is often interpreted as the probabilities of each pixel belonging to the object and background classes. Second, we consider the grayscale object reconstruction problem: the CNN predicts continuous-valued intensity in each object pixel. In this case, both the ground-truth and predictions take grayscale values. The predictions are directly from the softmax layer. Since our objects are generated with an 8-bit SLM, the CNN predictions are set to the same bit-level.
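To make the last-layer design concrete, the following NumPy sketch applies a softmax across two channels and evaluates a cross-entropy averaged over all pixels and both channels. It is a plain-NumPy illustration, not the Keras/TensorFlow code used for training; the function names, the `eps` clipping constant, and the choice to average over all `2N` entries are assumptions made here.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the channel axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def averaged_cross_entropy(g, p, eps=1e-7):
    """Mean of -g*log(p) over all pixels and both channels.

    g, p : arrays of shape (H, W, 2) holding the ground truth and the
           softmax output; values may be binary or continuous in [0, 1].
    eps  : clipping constant to avoid log(0) (an implementation choice).
    """
    p = np.clip(p, eps, 1.0)
    return float(-(g * np.log(p)).mean())

# Tiny example: a 4x4 sparse binary object and two candidate predictions.
obj = np.zeros((4, 4))
obj[1, 2] = 1.0
g = np.stack([obj, 1.0 - obj], axis=-1)           # object / background

uninformative = softmax(np.zeros((4, 4, 2)))      # p = 0.5 everywhere
confident = softmax(10.0 * (2.0 * g - 1.0))       # strongly favors the truth

loss_uninformative = averaged_cross_entropy(g, uninformative)
loss_confident = averaged_cross_entropy(g, confident)
```

Because the two softmax channels sum to one, a prediction is fully described by its object channel; the background channel is carried along so that errors on the (dominant) background pixels also contribute to the loss.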
The CNN training was performed on the BU SCC with one GPU (NVIDIA Tesla P100) using Keras/Tensorflow. Each CNN is trained for 500 epochs by the Adam optimizer for up to 44 hours. A learning rate of $10^{-4}$ is used for the first 300 epochs, $10^{-5}$ for the next 100 epochs, and $10^{-6}$ for the final 100 epochs. Once the CNN is trained, each prediction is made in real-time. More details of the CNN architecture, parameter optimization, and training procedures are provided in the supplementary material. We also provide open source code of our CNN model along with pre-trained weights and sample data in [1].
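The piecewise-constant learning rate described above can be written as a simple function of the epoch index. This is a minimal sketch assuming 0-indexed epochs; the function name is illustrative.

```python
def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch in a 500-epoch run.

    Epochs 0-299 use 1e-4, epochs 300-399 use 1e-5, epochs 400-499 use 1e-6.
    """
    if not 0 <= epoch < 500:
        raise ValueError("epoch must be in [0, 500)")
    if epoch < 300:
        return 1e-4
    if epoch < 400:
        return 1e-5
    return 1e-6

# Number of epochs spent at each rate
schedule = [learning_rate(e) for e in range(500)]
counts = {rate: schedule.count(rate) for rate in (1e-4, 1e-5, 1e-6)}
```

A function of this form could presumably be handed to a framework's per-epoch learning-rate scheduling hook; how the original training code implemented the schedule is not specified in the text.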

Results
We present our results from four types of experiments, in line with the acquired data described in Sec. 2.2 (Data acquisition). The results from the first three experiments are all from the CNN trained with 4 training diffusers and tested on 5 testing diffusers. The last experiment compares the 4-training-diffuser results against those from the CNN trained on a single diffuser. Although our CNN is able to make both binary and grayscale predictions, we here only show binary images. The grayscale network provides similar performance, as detailed in the supplementary material. This is probably because our CNN is designed to image sparse objects. Imaging non-sparse objects becomes more challenging [29], which will be considered in our future work.
In the first experiment, we test our CNN to predict 'seen objects through unseen diffusers' (Task 1). Notably, our CNN demonstrates superior generalization in predicting objects through previously unseen diffusers. Representative examples of the speckle and prediction pairs are shown in Fig. 4. More results are given in the supplementary material. For the same object, although the speckle patterns through different diffusers appear notably different, the CNN consistently makes high-quality predictions. Later, we quantify the differences between these speckle patterns by speckle decorrelation analysis in Sec. 4. The prediction results present slight variations since our CNN makes pixel-wise predictions, rather than the whole-image classification [25]. Our pixel-wise prediction task is considerably more difficult since the network needs to effectively learn the per-pixel input-output relation. In addition, since our CNN adapts to all diffusers of the same class, the learned relation needs also to be adaptable to all possible statistical variations. The variations of the predictions for this task using our binary CNN are quantified later in Fig. 8. Representative examples and statistical analysis on the grayscale CNN predictions are provided in the supplementary material.
In the second experiment, we test our CNN on a more difficult task of predicting 'unseen objects of the same type through unseen diffusers' (Task 2). The set of objects have never been used in the training. They, however, belong to the same object class to the training data, i.e. handwritten digits and letters. A quantitative comparison between Task 1 and Task 2 measured by the speckle decorrelation is presented in Sec. 4. Representative examples are shown in Fig. 5, demonstrating that the CNN is able to make high-quality binary predictions of these unseen objects from the same class, while through unseen diffusers. The corresponding grayscale predictions are shown in the supplementary material.
In the third experiment, we further test our CNN on predicting 'unseen objects of new types through unseen diffusers' (Task 3). The set of objects has never been used in the training and belongs to a different object class (Quickdraw). Representative examples are shown in Fig. 6, demonstrating that our CNN is still able to make high-quality predictions of these unseen new types of objects through unseen diffusers. The quality of the binary predictions for this task is quantified in Fig. 9 across different object types. The corresponding grayscale predictions are evaluated in the supplementary material.
In the fourth experiment, we compare our '4-training-diffuser' results against those from the CNN trained on a single diffuser. The results are presented in Fig. 7, which consists of two tasks. Task 4 makes predictions on unseen objects by the CNN that is trained and tested on the same diffuser. Successful demonstrations of accomplishing this task via machine learning have been reported [16,29,33]. Task 5 makes predictions on unseen objects through a different unseen diffuser by the CNN trained on a single diffuser. The goals of this experiment are twofold. First, given the different choices of CNN architectures and loss functions, we validate that our design can indeed reliably perform Task 4, as shown in Fig. 7. The results from our CNN are further quantified in Fig. 8, which match the state-of-the-art performance with an average PCC of 0.626 [29]. Second, we verify that a CNN trained on only a single diffuser cannot be reliably generalized to other diffusers (shown in Fig. 7), since the CNN is tuned to only fit the model of a specific diffuser.
Next, we quantify the performance on the 'seen objects through unseen diffusers' task. We expand the comparisons across 6 CNNs trained on 1, 2, or 4 diffusers with 3 training dataset sizes (in total: 800, 1600, and 2400 pairs). We use two metrics: the Jaccard index (JI) and the PCC. Both metrics are useful to measure the similarity between image pairs [61]; they provide slightly different scores due to the differences in error-counting. Each CNN is tested under the same condition, using the same 1000 speckle patterns (the same as Fig. 4).
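For reference, both metrics are straightforward to compute for a pair of binary images. The sketch below uses the standard definitions (JI as intersection over union, PCC via `numpy.corrcoef`) and also returns the true-positive, false-positive, and false-negative counts that underlie the JI. Function names are illustrative, and returning 1.0 for two empty masks is a convention chosen here.

```python
import numpy as np

def jaccard_index(pred, truth):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(np.logical_and(pred, truth).sum() / union)

def error_breakdown(pred, truth):
    """Counts of true-positive, false-positive, and false-negative pixels."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.logical_and(pred, truth).sum())
    fp = int(np.logical_and(pred, ~truth).sum())
    fn = int(np.logical_and(~pred, truth).sum())
    return tp, fp, fn

def pcc(a, b):
    """Pearson correlation coefficient between two images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

# Toy 2x2 example in which the two metrics give different scores.
truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
ji = jaccard_index(pred, truth)
tp, fp, fn = error_breakdown(pred, truth)
r = pcc(pred, truth)
```

The JI ignores true-negative (background) pixels, whereas the PCC is computed over all pixels after mean subtraction, which is one way the two scores can differ for the same image pair.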
We first present the JI scores. In the top figure of Fig. 8, the JI of each CNN tested on each individual testing diffuser is shown as a circle. Results from all 5 unseen diffusers are clustered together, regardless of the CNN being used, demonstrating the consistency of the CNN prediction against object and diffuser variations. In addition, we make two observations. First, the performance improves as more training diffusers are used. This is evident by comparing the results for the same training dataset size of 800 while increasing the number of training diffusers (similarly for the 1600 case). Second, the performance further improves by increasing the size of the training dataset. This is seen by comparing the results for the same number of 4 training diffusers while increasing the training dataset size (similarly for the 2-diffuser case). To provide an intuitive visualization of the JI score, the bottom figure of Fig. 8 shows a few representative examples. In the first row, the result is further broken down into the true-positive (white), the false-positive (green), and the false-negative (purple). Next, we provide the alternative evaluation using the PCC score. The mean PCC of each CNN is given in Table 1. The general observations remain the same as in the JI evaluation. In addition, we observe that the performance from '4 diffusers, 800 dataset' is slightly better than that from '2 diffusers, 1600 dataset' (i.e. more diffusers and less data), further demonstrating the effectiveness of training using multiple diffusers.
Finally, we quantify the performance on the 'unseen objects of new types through unseen diffusers' task in Fig. 9. The results are from the CNN trained with 4 training diffusers and 2400 training pairs (the condition for Fig. 6). In general, our trained CNN is able to make high-quality predictions, albeit with reduced JI scores as compared to the 'seen object' case. The performance also varies with the specific object types. In total, we tested 6 different types, whose performance is quantified by the mean and standard deviation of the JI. These results suggest that the quality of the CNN model is also influenced by the object types used during training. A larger training dataset covering additional object types may further improve our results.

Analysis
To provide some insights into our CNN model, we perform analysis on both the network and the speckle patterns. The main principle of DL is to learn statistically invariant information across large datasets [27]. Thus, our goal is to look for any meaningful invariant features among speckles taken through different diffusers. If any are found, this suggests that it is plausible to establish a statistical mapping relating these speckles by the CNN model.
First, we visualize the intermediate activation maps [60] from each layer of our CNN when inputting speckle patterns from the same object but through different testing diffusers. Starting with a pair of visually distinct speckle patterns, the activation maps gradually become similar as the data flow through the encoder-decoder paths, as shown in Fig. 10(a). To quantify the learned invariance, we compute the pair-wise PCCs of each corresponding layer (across all channels) from the same object for all possible combinations of the 5 testing diffusers. The PCC generally grows with the depth of the CNN layer; PCC curves from different objects follow a similar trend, as shown in Fig. 10(b).
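The layer-wise invariance measure described above can be sketched as follows, assuming the activation maps have already been extracted from the network and are stored as NumPy arrays. The nested-list layout, the function name, and the choice to flatten all channels of a layer into one vector before computing the PCC are assumptions of this sketch. The demo data are synthetic stand-ins, not CNN outputs.

```python
import itertools
import numpy as np

def layerwise_pairwise_pcc(activations):
    """PCC between corresponding layers for every pair of diffusers.

    activations[d][l] is the activation array (e.g. H x W x C) of layer l
    when the speckle from diffuser d is the input; the object is the same
    for all d. Returns (pairs, pcc) with pcc of shape (n_pairs, n_layers).
    """
    n_layers = len(activations[0])
    pairs = list(itertools.combinations(range(len(activations)), 2))
    out = np.empty((len(pairs), n_layers))
    for i, (d1, d2) in enumerate(pairs):
        for layer in range(n_layers):
            a = np.ravel(activations[d1][layer]).astype(float)
            b = np.ravel(activations[d2][layer]).astype(float)
            out[i, layer] = np.corrcoef(a, b)[0, 1]
    return pairs, out

# Synthetic stand-in: 5 "diffusers", 3 "layers"; the component shared
# between diffusers is made stronger in deeper layers.
rng = np.random.default_rng(1)
weights = [0.1, 0.5, 0.9]
shared = [rng.standard_normal((8, 8, 4)) for _ in weights]
activations = [
    [w * s + (1.0 - w) * rng.standard_normal((8, 8, 4))
     for w, s in zip(weights, shared)]
    for _ in range(5)
]
pairs, pcc_table = layerwise_pairwise_pcc(activations)
mean_per_layer = pcc_table.mean(axis=0)
```

For 5 testing diffusers this yields 10 diffuser pairs per layer; averaging over the pairs gives one curve per object as a function of layer index.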
Next, we perform speckle correlation analysis. Our findings are summarized in Fig. 11. First, we quantify speckle decorrelation in our measurement using the classical PCC metric [15,19,31]. Figure 11(a) presents the PCC histograms under various tasks (defined in Sec. 3), each from 400 randomly chosen speckle patterns. We describe the results in order of increasing decorrelation (hence the difficulty of the task). First, Task 4 (Fig. 7) is evaluated by $A_{D1} \star B_{D1}$, which correlates speckles from different objects through the same diffuser. Most of the speckle patterns are decorrelated and the mean coefficient is 0.307, which is consistent with the values found in both the isoplanatism and speckle size characterization plots in Fig. 2. Second, Task 1 (Fig. 4) is evaluated by $A_{D1} \star A_{D2}$, which is for the same object through different diffusers. The speckle patterns are further decorrelated to a mean value of 0.221. Third, Tasks 2, 3, and 5 (Figs. 5, 6, and 7) are evaluated by $A_{D1} \star B_{D2}$, which is for different objects through different diffusers. This gives the lowest correlation of around 0.207.
A single-valued metric does not sufficiently capture the rich information encoded in the speckle patterns. Inspired by speckle correlography [26] and its variants [6,11,22], we next investigate the speckle intensity correlation function for different speckle pairs. Representative examples from our main findings are presented in Fig. 11(b). Importantly, taking the speckle intensity autocorrelation as the reference, the speckle intensity cross-correlation from the same object but through two different diffusers (e.g. the first for training, and the second for testing) resembles the reference pattern. These correlation patterns do not follow the simple relation exploited in [6,11,22,26]. Nevertheless, the invariance maintained across speckle patterns from training and testing diffusers does suggest that there exist learnable and generalizable features. That is, if the CNN is trained and tested with the same object but through different diffusers (e.g. in Fig. 4), physically meaningful invariance exists in these speckle intensity correlation patterns. Our CNN model is able to discover and exploit this 'hidden' information although these speckle pairs are considered 'decorrelated' based on the PCC. Next, correlation patterns from visually similar objects are shown to present notable differences, which demonstrates the sensitivity of these features. Overall, we speculate that these invariant correlation patterns/features could contribute to the scalability of our CNN with respect to speckle decorrelations. Furthermore, our results on unseen objects through unseen diffusers (Figs. 5 and 6) suggest that these learned invariances are generalizable to a broader range of speckle measurements.
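A speckle intensity correlation map of the kind discussed above can be computed with FFTs. The sketch below is one common way to do this (mean-subtracted, circular, normalized so that the zero-shift value equals the PCC); it is not necessarily the exact procedure used to produce Fig. 11(b), and the function name is illustrative.

```python
import numpy as np

def intensity_correlation(a, b):
    """Normalized circular cross-correlation map of two intensity images.

    Both images are mean-subtracted; the map is normalized so that its
    zero-shift value (placed at the array center by fftshift) equals the
    PCC of the two images. Passing the same image twice gives the
    autocorrelation.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    spectrum = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    corr = np.fft.ifft2(spectrum).real
    corr /= np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return np.fft.fftshift(corr)

# Synthetic stand-in patterns (exponential intensity statistics).
rng = np.random.default_rng(2)
pattern = rng.exponential(size=(32, 32))
other = rng.exponential(size=(32, 32))

auto = intensity_correlation(pattern, pattern)
cross = intensity_correlation(pattern, other)
shifted = intensity_correlation(pattern, np.roll(pattern, (3, 5), axis=(0, 1)))
```

The full 2D map retains structure away from zero shift that a single PCC value discards, which is the kind of information the correlation analysis above examines.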

Conclusion and discussion
We have demonstrated a deep learning framework to significantly improve the scalability of imaging through scattering. Traditional techniques suffer from the 'one-to-one' limitation, in which one model only works for one fixed scattering medium. Here, we take an entirely different 'one-to-all' strategy, in which one model fits to all scattering media within the same class. In practice, this leads to significantly improved resilience to speckle decorrelations and improved space-bandwidth-product. Our approach promises highly scalable, large information-throughput imaging through complex scattering media.
We envision that our technique can be useful in imaging biological samples. Several macroscopic parameters [58], such as absorption and scattering coefficients, and (transport) mean-free-path, are routinely used to characterize a sample's scattering properties, as well as to make phantoms with controlled optical properties. One may train, classify, and image through these biological samples by adapting our technique.
We have demonstrated our technique to image through shift-variant scattering induced by a thin diffuser. This condition closely resembles those involving aberrations induced by a single scattering layer [20,28]. Our technique opens up the opportunity to compensate for these aberrations in real-time without expensive hardware, and to provide expanded fields-of-view and improved tolerance to changes in aberrations. The ultimate challenge for imaging through scattering is to deal with volumetric multiple scattering. Several learning-based approaches have been reported recently [21,30,48,49,51,55]. Future work could adapt our approach to handle these more challenging scenarios.

Figure 6: Testing results of 'unseen objects of new types through unseen diffusers'. The CNN trained with four 'training diffusers' is used to make predictions using speckles from new types of objects through unseen testing diffusers. The testing objects are taken from a new class (Quickdraw) that has never been used during training.

When tested on speckles from unseen objects (during training) through the same diffuser, the CNN is able to make high-quality predictions. However, it fails on speckles from a different unseen diffuser, demonstrating the importance of the proposed DL strategy involving multiple diffusers.

Figure 8: We compare the performance of multiple CNNs trained on 1, 2, and 4 diffusers using different dataset sizes (800 in blue, 1600 in orange, 2400 in green) by the Jaccard index (JI). Each CNN is tested under the same condition, using the same 1000 speckle patterns from seen objects through 5 unseen diffusers. Each circle represents the average JI on all objects through each testing diffuser. The mean JI of each CNN is marked by black horizontal bars. The bottom figure shows representative example predictions from the CNN trained on 1, 2, and 4 diffusers, respectively (panel labels: Ground truth; Trained with 4 diffusers; Trained with 2 diffusers; Trained with 1 diffuser). To visualize the result, the first row shows the CNN prediction that is overlaid with the true-positive (white), the false-positive (green), and the false-negative (purple).