Deep Learning Enabled Real Time Speckle Recognition and Hyperspectral Imaging using a Multimode Fiber Array

We demonstrate the use of deep learning for fast spectral deconstruction of speckle patterns. The artificial neural network can be effectively trained using numerically constructed multispectral datasets taken from a measured spectral transmission matrix. Optimized neural networks trained on these datasets achieve reliable reconstruction of both discrete and continuous spectra from a monochromatic camera image. Deep learning is compared to analytical inversion methods as well as to a compressive sensing algorithm and shows favourable characteristics both in the oversampling and in the sparse undersampling (compressive) regimes. The deep learning approach offers significant advantages in robustness to drift or noise and in reconstruction speed. In a proof-of-principle demonstrator we achieve real time recovery of hyperspectral information using a multi-core, multi-mode fiber array as a random scattering medium.


INTRODUCTION
Motivated by the need for imaging in complex environments and through opaque media, new techniques for characterizing and controlling multiple scattering are currently seeing a tremendous development [1]. This new toolbox opens up directions for controlling light in random media and exploiting it for applications such as imaging and optical information processing. At the same time, improved understanding allows us to retrieve more information from seemingly random scattering fields to see around corners and through opaque media [2,3]. Control over light scattering has led to exciting new applications such as programmable multiport optical elements for classical and quantum states [4][5][6][7], quantum secure keys [8] and compressive sampling imaging systems [9].
The multispectral characteristics of speckle fields have been used successfully in a range of studies to realize speckle spectrometers [10][11][12][13][14][15]. By exploiting wavelength-dependent speckle patterns from a multimode fiber, a spectral resolution of picometers in the near-infrared and nanometers in the visible region has been demonstrated [16,17]. Next to single-channel spectrometers, multiplexing of spectrally resolved speckle fields into hyperspectral imaging systems is of great interest. Compared to traditional approaches such as Integral Field Spectrometers [18,19], complex media can offer new opportunities for combining broadband transmission with high spectral resolution [20][21][22]. In the spatial domain, multimode fibers as well as multi-core fiber bundles are a topic of study for a variety of imaging applications such as remote sensing and endoscopy [23][24][25][26].
Recently, a multicore multimode fiber bundle has been used as a frequency characterization element in a high-throughput imaging spectrometer for snapshot spatial and spectral measurements with sub-nanometer spectral resolution [27]. A compressive sensing (CS) algorithm was successfully employed to retrieve spectral information. Convex regularization techniques such as CS provide a suitable solution for a number of problems in computational imaging. However, in addition to their high computational cost, their applicability depends on a sparsity assumption, and performance is indeed reduced for dense data. The need for faster data processing and for robustness with respect to noise and drift could be met by an entirely different approach based on artificial neural networks [28][29][30][31][32][33][34]. Computational methods using neural networks that are trainable for specific problems have recently been shown to be highly efficient and fast [35][36][37]. This approach has also been used in various applications utilizing speckle patterns, such as image reconstruction, object classification and recognition [38][39][40][41][42][43][44].
Here, we demonstrate the successful application of Deep Learning (DL) neural networks to the retrieval of spectral information from speckle images. Using a multi-mode, multicore fiber array as a multiplexed speckle spectrometer, we achieve real-time spectral imaging over several thousands of individual fiber cores. Besides showing that DL is orders of magnitude faster than CS-based techniques, we investigate the robustness of DL to noise as well as to image shifts that could originate from thermal expansion or vibrations in the imaging system. We show that DL can be adapted to such conditions, with good performance achieved by appropriate training. Results for DL are compared with CS and with analytical regularized inversion approaches. We find that DL performs well both in the compressive and oversampling regimes, combining a good balance in characteristics with fast reconstruction speeds and massively parallel operation over many fiber cores.

METHOD
We use a multi-core, multimode fiber (MCMMF) as the complex scattering medium (Edmund Optics, Fiber optic image conduit). Mode mixing in each individual fiber results in a characteristic speckle pattern with a wavelength dependence determined by the fiber length and the angle of incidence of the light. For a direct comparison with previous results, our first calibration experiments used the setup described in Ref. [27]. For the projection of animations we developed a new setup, shown in Fig. 1a, based on an acousto-optic tunable filter (AOTF) and a spatial light modulator (SLM).
In short, a supercontinuum light source (Fianium SC400) was spectrally filtered using an AOTF with a resolution of 5 nm. The filtered light was projected onto a single-mode fiber to ensure stability of the illumination beam with wavelength and to eliminate any other forms of spectral drift in the setup. The fiber output was reflected off the liquid crystal spatial light modulator (SLM, Holoeye Pluto) and was projected onto the MCMMF at an incident angle of 4° with an image demagnification of 5:1. The MCMMF consisted of 3012 fibers with individual core diameters of 50 µm. Fibers of different lengths can be used for different applications depending on the bandwidth required [27]. After transmission through the fiber array, the output facet of the MCMMF was imaged onto the focal plane array of a 12-bit, 5 MPixel monochrome CMOS camera with a pixel size of 2.2 µm x 2.2 µm (AVT Guppy) using a 1:1 imaging system. Collected images were transferred to a PC via IEEE 1394a and saved in uncompressed 12-bit TIF format at a resolution of 2592 (H) × 1944 (V).
Figure 1b illustrates the typical information obtained at the exit surface of the MCMMF in the form of wavelength dependent speckle patterns obtained from the individual fiber cores. Each pattern corresponds to a superposition of higher order fiber modes that is dependent on the wavelength, the length of the fiber and the angle of incidence. All fiber cores are slightly different and local variations in the material properties, strain, impurities and other random structural elements give rise to an individual set of speckle patterns for each fiber core in the array. The speckle patterns for every wavelength are stored into a multispectral transmission matrix for every core, which in principle allows retrieval of spectral information from arbitrary superposition states using a number of different techniques.
Spectra consisting of more than one wavelength component, as well as continuous spectra, result in a superposition of many speckle patterns. Analytical inversion techniques like Moore-Penrose pseudoinversion can be employed to reconstruct the spectra from these superpositions, but their performance is strongly dependent on noise and appropriate regularization is needed. In this work we compare our DL approach with analytical inversion using Tikhonov regularization (TR) [9]. Moreover, analytical inversion is limited to the oversampled regime, and its breakdown is observed at the Shannon-Nyquist sampling limit [22]. Compressive sensing (CS) extends reconstruction into the undersampling regime under conditions of sparsity. In our work, CS was implemented for comparison to DL using the Python package "cvxpy" [45].
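As an illustration of the regularized analytical inversion mentioned above, the following is a minimal NumPy sketch of Tikhonov-regularized least squares in closed form. It is not the implementation used in this work: the random matrix standing in for a measured transmission matrix, the function name tikhonov_reconstruct and the value of alpha are all assumptions made for the example.

```python
import numpy as np

def tikhonov_reconstruct(T, y, alpha):
    """Solve min_s ||T s - y||^2 + alpha * ||s||^2 in closed form.

    T     : (n_pixels, n_wavelengths) matrix, one flattened speckle per column
    y     : (n_pixels,) flattened speckle image of the unknown spectrum
    alpha : regularization strength (free parameter, chosen empirically)
    """
    n_wavelengths = T.shape[1]
    gram = T.T @ T + alpha * np.eye(n_wavelengths)
    return np.linalg.solve(gram, T.T @ y)

# Hypothetical stand-in for a measured transmission matrix:
# a 20x20 pixel ROI (400 pixels) and 43 wavelength channels.
rng = np.random.default_rng(0)
T = rng.random((400, 43))

# Sparse test spectrum with three non-zero wavelength components.
spectrum = np.zeros(43)
spectrum[[5, 17, 30]] = [1.0, 0.5, 0.8]

image = T @ spectrum                      # noiseless superposition of speckles
recovered = tikhonov_reconstruct(T, image, alpha=1e-6)
```

With more pixels than wavelength channels and no noise, the recovered spectrum matches the input closely. With fewer pixels than channels the linear system is underdetermined, which is where the text reports the breakdown of analytical inversion.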
Spectral reconstruction via DL was implemented using a convolutional neural network (CNN), composed of a series of convolutional layers followed by two fully connected layers of 512 and 256 nodes with dropout regularization using a keep probability of 70%. The final, dense output layer of 43 neurons represents the spectrum, where each neuron corresponds to a discrete wavelength channel. The size of the CNN was manually optimized for each tested sampling condition. For a region of interest (ROI) size of 5x5 pixels, the best performing network consists of two convolutional layers, each with a 2x2 kernel (CNN (i) in Fig. 2, yellow). On 20x20 pixels, a three-layer CNN with kernel size 3x3 throughout the network was found to perform best (CNN (ii) in Fig. 2, blue). Each convolution is followed by batch normalization and a leaky ReLU activation layer, and all layers use valid padding. We found that any type of pooling consistently reduced the reconstruction quality, so no pooling was performed. The networks were implemented in Python using Keras as frontend for TensorFlow [46].
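Because all layers use valid padding and no pooling, the feature map shrinks with every convolution, which constrains how many layers a small ROI can support. The short sketch below works through this arithmetic for the two networks described above. It assumes a stride of 1, which is not stated in the text, and the helper name valid_conv_output is hypothetical.

```python
def valid_conv_output(size, kernel, n_layers):
    """Side length of a square feature map after n_layers 'valid' convolutions.

    Assumes stride 1: each layer reduces the side length by (kernel - 1).
    """
    for _ in range(n_layers):
        size = size - kernel + 1
    return size

# CNN (i): 5x5 pixel ROI, two layers with 2x2 kernels.
small = valid_conv_output(5, kernel=2, n_layers=2)

# CNN (ii): 20x20 pixel ROI, three layers with 3x3 kernels.
large = valid_conv_output(20, kernel=3, n_layers=3)

# Three 3x3 layers would not fit a 5x5 ROI: the result is not positive.
too_deep = valid_conv_output(5, kernel=3, n_layers=3)

print(small, large, too_deep)
```

Under this assumption the 5x5 network ends on a 3x3 feature map and the 20x20 network on a 14x14 feature map before the fully connected layers, which is one plausible reason for the smaller kernels in the small-ROI network.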
To test the performance of the different approaches, multiple patterns were summed digitally to simulate a real signal made of a given number, Nλ, of nonzero wavelength components with randomly varying intensities. The images of the speckle patterns were cropped to various ROI sizes to achieve different regimes of oversampling and undersampling, as given by the ratio of the total number of calibrated wavelengths, Y, to the total number of pixels of the ROI, X. For each multimode fiber, a total of 30000 images was generated as training dataset, of which 29000 were used to train the neural network and the remaining 1000 served for validation. For the final evaluation, additional data were used.
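The construction of the synthetic dataset described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: the random matrix replaces the measured transmission matrix, the intensity range of 0.1 to 1.0 is arbitrary, and the names make_dataset and sampling_ratio are hypothetical. The helper sampling_ratio reflects the observation that the quoted values (9.30 for 20x20 pixels, 0.58 for 5x5 pixels) are reproduced numerically by dividing the number of ROI pixels by the 43 wavelength channels.

```python
import numpy as np

def make_dataset(T, n_samples, n_nonzero, rng):
    """Build synthetic speckle superpositions from a transmission matrix.

    T : (n_pixels, n_wavelengths), each column is the speckle of one wavelength.
    Returns images (n_samples, n_pixels) and spectra (n_samples, n_wavelengths).
    """
    n_wavelengths = T.shape[1]
    spectra = np.zeros((n_samples, n_wavelengths))
    for row in spectra:  # each row is a view, so assignment edits spectra in place
        active = rng.choice(n_wavelengths, size=n_nonzero, replace=False)
        row[active] = rng.uniform(0.1, 1.0, size=n_nonzero)  # assumed intensity range
    images = spectra @ T.T  # intensities add linearly across wavelengths
    return images, spectra

def sampling_ratio(n_pixels, n_wavelengths=43):
    """Numerical value matching the sampling ratios quoted in the text."""
    return n_pixels / n_wavelengths

rng = np.random.default_rng(1)
T = rng.random((25, 43))  # hypothetical 5x5 ROI, 43 wavelength channels

images, spectra = make_dataset(T, n_samples=30000, n_nonzero=10, rng=rng)
train_x, val_x = images[:29000], images[29000:]
train_y, val_y = spectra[:29000], spectra[29000:]

ratio_20 = round(sampling_ratio(20 * 20), 2)
ratio_5 = round(sampling_ratio(5 * 5), 2)
```

The split of 29000 training and 1000 validation samples follows the numbers given in the text; any held-out evaluation data would be generated separately in the same way.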

RESULTS AND DISCUSSION
A direct numerical illustration of the reconstruction capabilities of the DL approach is shown in Fig. 3. In Fig. 3a,b the performance is shown for different sampling regimes ranging from oversampling (Y/X=9.30) down to deep undersampling (Y/X=0.21). The cartoons in Fig. 3a illustrate the quality of the reconstruction, where a smiley emoticon was used as the ground truth in a single wavelength channel. For a single non-zero wavelength Nλ=1, this is the only information contained in the spectrum. For the case Nλ=10, nine other wavelengths in the spectrum are filled with a crossed-out symbol in the form of a capital "X". We see that in both cases, the reconstruction of the target is very good in the oversampling regime, while the quality becomes poorer below the sampling limit, Y/X < 1. For Nλ=10 we can see the appearance of the cross in the image, indicating the presence of significant cross-talk between the spectral channels.
A quantitative analysis of this dependence of the reconstruction quality on the sampling rate is presented in Fig. 3b, where the cross-correlation of the reconstructed spectrum with the input spectrum (ground truth) is plotted versus the sampling ratio Y/X. We can clearly see the main trends identified in the illustration, namely a good quality of reconstruction in the oversampling regime and a degradation in performance for Y/X<1. In the undersampling regime, we see that the reconstruction improves for a lower number of nonzero wavelengths, with reasonably good performance (defined as correlation > 0.5) for sampling rates as low as Y/X=0.21 for just a single non-zero wavelength. Clearly, DL is able to infer meaningful results under conditions in which the information density is sparse and where analytical inversion techniques show a complete breakdown [22]. In other words, DL is able to cross over far into the compressive sensing regime and therefore exhibits similar characteristics to a CS-based approach.
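The exact definition of the cross-correlation figure of merit is not spelled out in the text. A common choice, assumed here purely for illustration, is the Pearson correlation coefficient between the reconstructed spectrum and the ground truth. The function name cross_correlation and the example spectra are hypothetical.

```python
import numpy as np

def cross_correlation(truth, estimate):
    """Pearson correlation between two spectra (assumed definition).

    Returns 0.0 when either spectrum is constant, since the correlation
    is undefined in that case.
    """
    t = truth - truth.mean()
    e = estimate - estimate.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(e)
    return float(t @ e / denom) if denom > 0 else 0.0

truth = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 0.0])

perfect = cross_correlation(truth, truth)               # identical spectra
rescaled = cross_correlation(truth, 2.0 * truth + 0.3)  # offset and gain ignored
flat = cross_correlation(truth, np.full(6, 0.25))       # featureless output

FAILURE_THRESHOLD = 0.5  # below this the reconstruction is considered failed
```

With this definition a flat reconstruction scores zero regardless of its level, and the metric is insensitive to a global offset or scaling of the reconstructed spectrum.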
Having shown the strength of DL in the compressive sensing regime, it is of interest to investigate its capabilities in the regime of dense information, under conditions where the sparsity assumption underpinning CS ceases to be valid. We start again with a numerical illustration in Fig. 3c to visualize the amount of information that can be encoded through speckle images. Using Y=43 available wavelength channels, we encoded separately the red, green and blue (RGB) channels of 14 independent RGB images using the experimentally obtained transmission matrix. The remaining unused wavelength channel allows us to assess the residual cross-talk. Raw RGB reconstruction data are given in the Supporting Information.
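The channel bookkeeping of this dense test (14 RGB images occupying 42 of the 43 channels, with one blank channel left over) can be sketched as below. The ordering of channels, the image size and the function name pack_rgb_images are assumptions made for the example, and random arrays stand in for the actual images.

```python
import numpy as np

def pack_rgb_images(images, n_channels=43):
    """Stack RGB images into one spectrum per pixel.

    images : (n_images, H, W, 3) array
    Returns a (H, W, n_channels) cube. The assumed channel order is
    image 0 R, G, B, then image 1 R, G, B, and so on; unused channels stay zero.
    """
    n_images, h, w, _ = images.shape
    used = 3 * n_images
    if used > n_channels:
        raise ValueError("not enough wavelength channels for these images")
    cube = np.zeros((h, w, n_channels))
    # Move the image axis next to the colour axis, then merge the two.
    cube[:, :, :used] = images.transpose(1, 2, 0, 3).reshape(h, w, used)
    return cube

rng = np.random.default_rng(2)
images = rng.random((14, 8, 8, 3))  # hypothetical 8x8 pixel images
cube = pack_rgb_images(images)

blank_channel = cube[:, :, 42]      # the 43rd channel carries no signal
```

Any signal that a reconstruction places in the blank channel is then a direct measure of residual cross-talk.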
Figure 3c shows the reconstructed RGB images obtained using our DL approach in either the regime of undersampling (Y/X=0.84) or oversampling (Y/X=9.30). In the case of oversampling (Y/X>1), the neural network has much more input information to work with, which results in excellent image reconstruction quality and low residual cross-talk. For the undersampling regime, the images are still discernible but with significant reconstruction noise and cross-talk. These trends are again quantified in the accompanying analysis of Fig. 3d, showing the cross-correlation with the ground truth versus the number of non-zero wavelengths. We see that the network output is almost perfectly correlated in the oversampled case (perfect reconstruction), which holds even for dense spectra, where signal is present in all wavelength channels (Nλ=43). As seen in Fig. 3c, for an increasing number of wavelengths the effect of undersampling results in a rapid decrease of reconstruction fidelity.
In Fig. 4 deep learning (DL) is compared directly with both the TR and CS reconstruction. In this benchmark, 1000 random spectra were generated numerically from the experimental transmission matrix. Figure 4a shows the oversampling case (Y/X=9.30) corresponding to an ROI of 20x20 pixels, while Fig. 4b shows the undersampling case (Y/X=0.58) corresponding to only 5x5 pixels. Several examples of typical spectra are shown (blue dash, ground truth), together with their corresponding reconstructions using DL (red), TR (orange) and CS (light blue). The lower two examples correspond to continuous spectra with a high density of spectral information.
Figure 4c,d gives the full quantitative analysis of the average of the cross-correlation between each of the 1000 randomly generated spectra (ground truth) and its respective reconstruction. In the oversampling regime, all methods perform well with correlation values >0.95. In the undersampling regime TR fails completely (average cross-correlation < 0.5), as it produces a mostly flat spectrum in all cases irrespective of the spectral shape. DL yields a very good performance and even clearly outperforms CS for dense spectra in the undersampling case.
The slightly weaker performance of DL compared to CS in the oversampling case can be explained by the statistical training procedure in DL. CS, on the other hand, is an analytical approach that generally yields a close-to-optimum solution, however at a significantly higher computational cost compared to the neural network reconstruction, as discussed further below. In the undersampling case we observe that DL tends to result in inferred spectra that are smoother than the original input spectrum, whereas CS results in more spikes in the spectra even when the input is smooth.
The previous tests considered speckle reconstruction using a multispectral transmission matrix in the complete absence of noise. In a real-life scenario, one can expect some level of noise to be present in the image, either shot noise, electronic camera noise or other non-specific backgrounds. An imaging system may also experience some drift caused by environmental effects, such as vibrations and thermal variations. To assess the robustness of the different approaches against typical perturbations, we compare in Fig. 5 the respective performance of DL, CS and TR. While CS and TR are based on analytic methods which are intrinsically inflexible to variations, DL has the advantage of allowing some level of adaptability through the choice of training data.
To investigate the adaptability of the DL approach, we trained the neural network using noisy and spatially shifted training data with the aim of making it more robust against these effects. To account for the effect of noise, we added normally distributed random intensity noise to every pixel of the speckle pattern. Figure 5a shows a parameter map of the relative performance of DL trained on noisy data (DL+N) against CS, which is quantified as the ratio of their respective cross-correlations with the input spectra. Noise-adapted DL consistently outperforms CS over a large parameter space. Blue regions indicate a superior DL performance. DL always performs better for relatively dense spectra with Nλ>15. Even in most of the "white" parts of the colour plot, DL outperforms TR and CS by at least a few percent (see bar plots in Fig. 5b). CS performs better only on sparse data at very low sampling rates. Furthermore, TR contains a free parameter which has to be empirically adjusted to match a given noise level in order to achieve the specified performance [9]. The neural network trained on noisy data (DL+N) outperforms the normal DL when dealing with noisy spectra. This increased adaptability of the DL+N neural network comes at the cost of a relatively small loss of performance when dealing with noiseless data, as seen in Fig. 5b.
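The noise augmentation of the training data described above can be sketched as follows. The text specifies normally distributed noise added to every pixel but not how the noise percentage is normalized. The sketch assumes the standard deviation is a fraction of each image's mean intensity and clips negative values to zero; both choices, and the name add_intensity_noise, are assumptions.

```python
import numpy as np

def add_intensity_noise(images, noise_level, rng):
    """Add zero-mean Gaussian noise to every pixel of each speckle image.

    images      : (n_samples, n_pixels) array of non-negative intensities
    noise_level : noise standard deviation as a fraction of each image's
                  mean intensity (assumed convention, e.g. 0.10 for 10%)
    """
    scale = noise_level * images.mean(axis=1, keepdims=True)
    noisy = images + rng.normal(0.0, 1.0, size=images.shape) * scale
    return np.clip(noisy, 0.0, None)  # assumption: intensities stay non-negative

rng = np.random.default_rng(3)
clean = rng.random((4, 25))  # hypothetical batch of 5x5 speckle images

noisy_10 = add_intensity_noise(clean, 0.10, rng)
unchanged = add_intensity_noise(clean, 0.0, rng)
```

Training on such perturbed inputs while keeping the clean spectra as targets is what allows the network to learn a noise-tolerant mapping; the analytical methods have no equivalent training step.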
To simulate spatial drift, the cropped ROI was randomly shifted by one pixel in arbitrary directions. We compare in Fig. 5c,d the performance of networks that were trained with and without spatial shift (DL+S). If the neural network is trained without such perturbed data, the performance of DL decreases, just like for the conventional methods TR and CS. However, if the training datasets contain accordingly perturbed data, DL shows a distinct performance boost in the reconstruction from imperfect speckle images. In the case of shift (Fig. 5c,d), accordingly trained DL dramatically outperforms the other techniques on perturbed data, while again the loss of performance on unshifted data is relatively small.
Finally, apart from its universal applicability and superior stability, the most important advantage of DL over CS is its reconstruction speed. We therefore demonstrate that our deep learning based hyperspectral reconstruction method is capable of real-time image processing. To this end we projected a grayscale video via an SLM onto the MCMMF. During video playback we randomly changed the wavelength of the illumination using the AOTF. We analyzed in real time the speckle patterns of the 2700 fiber cores captured by the CMOS camera, using artificial neural networks (one network for each fiber) which were trained in advance and pre-loaded into RAM. Figure 6a shows frames from the original video data in the top row, and the first three wavelengths of the spectral reconstruction in the bottom rows. Upon switching of the wavelength, the reconstructed image changes channel with limited cross-talk between channels. At 1.74 s, the AOTF was switched during the frame acquisition, briefly resulting in two spectral components. We note that the reconstruction quality is somewhat worse than for the synthetic data shown before, which is due to imperfect intensity stability of the setup and some spurious residual wavelength correlations in the shorter (2.54 cm) fiber.
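The spatial-drift perturbation described earlier in this section, a random one-pixel shift of the cropped ROI, can be sketched as follows. The text does not say which directions were allowed; the sketch assumes the eight nearest-neighbour offsets. The function names and the small test frame are hypothetical.

```python
import numpy as np

def crop_roi(frame, top, left, size, shift=(0, 0)):
    """Crop a square ROI from a camera frame, optionally displaced by shift.

    The ROI must lie at least one pixel inside the frame so that a
    shifted crop does not run over the edge.
    """
    dy, dx = shift
    return frame[top + dy: top + dy + size, left + dx: left + dx + size]

def random_one_pixel_shift(rng):
    """Pick one of the eight neighbouring offsets (assumed set of directions)."""
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    return offsets[rng.integers(len(offsets))]

rng = np.random.default_rng(4)
frame = np.arange(100).reshape(10, 10)  # hypothetical 10x10 pixel frame

reference = crop_roi(frame, top=2, left=2, size=5)
shift = random_one_pixel_shift(rng)
drifted = crop_roi(frame, top=2, left=2, size=5, shift=shift)
moved_down = crop_roi(frame, top=2, left=2, size=5, shift=(1, 0))
```

Mixing shifted and unshifted crops in the training set, with unchanged target spectra, corresponds to the DL+S training condition discussed above.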
The training of the networks for 2700 fiber cores takes about 8 hours on an Nvidia Quadro P6000 GPU. This is a one-time procedure since the speckle patterns can be maintained stable over long times. The hyperspectral image reconstruction itself is the time-critical process in many applications. The trained neural networks are capable of reconstructing all 2700 fiber cores in only 132 ms on an Intel i7-3770 CPU. In comparison, CS requires around 2 minutes per frame for processing of the full fiber bundle. Figure 6b shows the durations of the steps of DL hyperspectral image reconstruction on the Intel i7-3770 CPU, working with 32 GB of RAM for network pre-loading. Including the preprocessing and rendering of the final hyperspectral images, the algorithm in its current state can reconstruct about 5 full images per second, which is faster than the total acquisition and transfer time of our CMOS camera (about 0.35 s).
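The timing figures quoted above can be put into perspective with a few lines of arithmetic. The sketch uses only the numbers given in the text; the variable names are chosen for the example, and "around 2 minutes" is taken as 120 s.

```python
n_cores = 2700             # fiber cores reconstructed per frame
dl_time_s = 0.132          # DL reconstruction of all cores on the CPU
cs_time_s = 2 * 60         # CS, around 2 minutes per frame
camera_time_s = 0.35       # camera acquisition plus transfer
full_pipeline_fps = 5      # DL including preprocessing and rendering

per_core_us = dl_time_s / n_cores * 1e6    # time per fiber core, microseconds
speedup = cs_time_s / dl_time_s            # DL versus CS, reconstruction only
frame_period_s = 1 / full_pipeline_fps     # time per fully processed frame

print(f"{per_core_us:.1f} us per core, about {speedup:.0f}x faster than CS")
print(f"frame period {frame_period_s:.2f} s vs camera {camera_time_s:.2f} s")
```

The reconstruction step alone is close to three orders of magnitude faster than CS, and the full processing time per frame of about 0.2 s stays below the 0.35 s needed by the camera, so the camera rather than the reconstruction limits the frame rate in this setup.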
There is significant scope for further increasing the reconstruction speed of DL by performance optimization of the code, by using multi-CPU or multi-GPU platforms, or by developing hardware implementations of the networks, for example based on Field Programmable Gate Arrays (FPGAs) [47]. In our current implementation using Python, a more direct hardware/software communication could increase frame rates to 5 fps and beyond, which is outside the scope of this proof of principle study.

CONCLUSION
In conclusion, we have shown that with deep learning, a multicore multimode fiber bundle can be used as a real-time hyperspectral camera, robust to noise and spatial shifts. Using wavelength- and fiber-core-dependent speckle patterns, we have used deep learning to cope with large amounts of data at a video rate of several frames per second on conventional computer hardware. The imaging spectrometer and deep learning technique are versatile by design, and the calibrated wavelength range can be tailored to specific applications. The approach can be easily scaled to any number of wavelength channels, to the desired spectral resolution via the length of the multicore fiber bundle, and with respect to spatial imaging resolution. Deep learning in combination with speckle spectrometry enables a new class of low-cost, compact hyperspectral imaging systems with real-time data processing capabilities.

Fig. 1. a) Scheme of the experimental setup including broadband supercontinuum laser source, acousto-optical tunable filter (AOTF), Spatial Light Modulator (SLM) used for image generation and the multicore, multimode fiber (MCMMF). b) Original projected image and detected camera image of the exit interface of the fiber bundle at a single selected wavelength with typical speckle patterns of a selected fiber core for different input wavelengths (λ1-λn). La Linea © CAVA/QUIPOS.

Fig. 3. a) Numerical illustration and b) calculation of reconstruction quality using DL for different sampling rates Y/X, for Nλ=1 and Nλ=10 non-zero wavelengths. One wavelength carries the encoded image (smiley), all other non-zero channels encode the image of a capital "X", which becomes slightly visible at low sampling rates due to cross-talk (see Supporting Information). c) Numerical illustration of image reconstruction using DL for dense spectra (Nλ=42) showing 14 RGB images that are encoded in 42 wavelength channels; the 43rd, blank channel serves for cross-talk control. Reconstructions are shown for undersampling Y/X=0.84 and oversampling Y/X=9.30 regimes. d) Cross-correlation with ground truth as a function of number of non-zero wavelengths in the spectrum for different sampling rates. Results in (b,d) are averaged over the whole fiber stack and for 100 spectra per fiber core. Light areas indicate the standard deviation of the data. Dashed line at Y/X=1 corresponds to Nyquist-Shannon sampling limit. Dashed line at a cross-correlation of 0.5 indicates the threshold below which the reconstruction is considered to have failed.

Fig. 4. a) Examples of speckle images and reconstructed spectral information for sparse (three top rows) and dense spectra (two bottom rows, generated by a random-walk-like algorithm). Left column: oversampling regime with Y/X=9.3, right column: undersampling regime with Y/X=0.58. The black box inside the speckle pattern shows the ROI used in the undersampling case. b) Histograms comparing average cross-correlations from 1000 randomly generated sparse (< 50% sparsity) and dense (all wavelengths non-zero) spectra, obtained with deep learning (DL), Tikhonov regularization inversion (TR) and compressive sensing (CS).

Fig. 5. a) Map showing ratio of reconstruction quality (cross-correlations) for deep learning trained on noisy data (DL+N) over compressive sensing (CS) for 10% added noise. Blue indicates DL+N better than CS, red indicates CS better than DL+N. Contour lines indicate the cross-correlation of DL+N speckle reconstruction. b) Calculated cross-correlations without noise and in presence of 25% noise. DL+S outperforms CS over a large part of parameter space where the cross-correlation < 0.9. c) Robustness of the reconstruction against shifts of the speckle patterns by one pixel in a random direction. d) Calculated cross-correlations without shift and with shift of 1 pixel. DL can be trained on data including shift (DL+S), which renders the method very robust in such a scenario, largely outperforming TR and CS.

Fig. 6. a) Real-time speckle-based hyperspectral video reconstruction via DL. A video is projected on the fiber bundle using an SLM in amplitude-modulation configuration. During playback the wavelength of the projecting light is changed. Top: Original frames of the input video. Bottom: Spectral reconstruction for the first three wavelength channels of the full multi-core fiber (approx. 2700 fiber cores). b) Bar graph showing the timings of the different execution steps. Full video is included in the Supporting Materials of this study. La Linea © CAVA/QUIPOS.