Does deep learning always outperform simple linear regression in optical imaging?

Abstract: Deep learning has been extensively applied in many optical imaging applications in recent years. Despite this success, the limitations and drawbacks of deep learning in optical imaging have seldom been investigated. In this work, we show that conventional linear-regression-based methods can, to some extent, outperform previously proposed deep learning approaches for two black-box optical imaging problems. The weakness of deep learning is most evident when the number of training samples is small. The advantages and disadvantages of linear-regression-based methods and deep learning are analyzed and compared. Since many optical systems are essentially linear, a deep learning network containing many nonlinear functions may not always be the most suitable option.

Despite its success, deep learning has its own limitations and drawbacks, like any other approach [32]. For example, a huge number of training samples is usually required to train a deep neural network, and these may not always be available in practical applications. Optimizing the connection weights of the network with many training samples incurs a considerable computational cost. The design of the network structure and the tuning of the network parameters are often carried out empirically and intuitively, with weak explainability. A deep neural network trained and tested on one category of samples may fail when it is applied to different testing samples. In fact, deep learning may well perform worse than other machine-learning (or non-machine-learning) methods in certain application scenarios. In previous works, the deficiencies of deep learning in solving optical imaging problems, compared with other methods, were seldom investigated.
In recent works [33,34], deep learning has been employed to attack a random-phase-encoded optical cryptosystem [33] and to perform blind reconstruction for single-pixel imaging [34]. A random-phase-encoded optical cryptosystem is a coherent imaging system with multiple diffractive optical elements such as lenses and random phase masks. The input plaintext light field is sequentially modulated by each phase mask during forward propagation and is finally transformed into a ciphertext light field at the system output. The objective of attacking an optical cryptosystem is to recover the input image from a given output light field when the encoding of all the phase masks is unknown. In single-pixel imaging [35], the object image is sequentially illuminated by different structured light intensity patterns, and the total light intensity of the entire object scene is recorded by a single-pixel detector for each pattern. The object image can then be computationally reconstructed when both the illumination patterns and the single-pixel intensity sequence are known. In a blind reconstruction [34], however, the objective is to recover the object image from the intensity sequence when the encoding of all the illumination patterns is unknown.
The two systems [33,34] described above are both linear and can be regarded as a black box when the encoding of the elements (phase masks or illumination patterns) is unknown. The random-phase-encoded optical cryptosystem is usually coherent, while single-pixel imaging is usually incoherent. In the previous work [36], it is shown that a multiple-phase-mask diffractive system and a single-pixel imaging system are similar in several aspects, such as performing optical pattern recognition. In the works [33,34], each system is modeled by a different deep learning network optimized with many pairs of input and output training samples. The input image can then be predicted by the network from an arbitrary given output. Since the input and output of the two optical imaging systems are linearly related, we point out that simple linear-regression-based methods can produce the same results as deep learning. To some extent, a linear regression scheme can recover the object image more efficiently than deep learning for these two problems. The advantages and disadvantages of linear-regression-based methods and deep learning are analyzed and compared.
This paper is structured as follows. The linear regression model is described in Section 2. The two black-box optical imaging problems, i.e. attacking a random-phase-encoding-based optical cryptosystem and blind reconstruction in single-pixel imaging, are described in Section 3 and Section 4. The results and a discussion of the comparison between linear-regression-based methods and deep learning are given in Section 5. A final conclusion is drawn in Section 6.

Linear Regression Model
For a linear optical imaging system, the input X and the output Y can each be denoted by a column vector, $X = [x_1\ x_2\ \dots\ x_M]^T$ and $Y = [y_1\ y_2\ \dots\ y_N]^T$, where the input X has M pixels in total and the output Y has N pixels in total. The relationship between Y and X can be modeled as a matrix multiplication, given by Equation (1):

$$Y = WX \qquad (1)$$

The weighting matrix W, consisting of N × M elements $w_{nm}$ (N rows and M columns), can be employed to model a black-box optical system. When a large set of training samples (many pairs of X and Y) is available, the elements of W can be estimated by optimization if they are not given. The elements of W can be iteratively optimized with a gradient descent algorithm. Initially, all the elements of W are set to random values. Each element of W is then updated for each training sample as follows:

$$w_{nm}' = w_{nm} + r\,(y_n - y_n')\,x_m \qquad (1 \le m \le M,\ 1 \le n \le N)$$

where $y_n'$ denotes the actual output obtained by multiplying the input of one training sample by the current W, $y_n$ denotes the target output of that training sample, and r denotes the pre-defined learning rate. If the input values are complex amplitudes instead of real intensities, the update needs to be slightly modified to

$$w_{nm}' = w_{nm} + r\,(y_n - y_n')\,\mathrm{conj}(x_m)$$

where conj() denotes the conjugate of a complex value. After many iterations, the adaptively optimized matrix W multiplied by a given input will yield an output close to the target one. A linear regression model can be considered a one-layer fully connected neural network without nonlinear activation functions, shown in Fig. 1. In a true fully connected neural network, shown in Fig. 2(b), linear connections and nonlinear activation functions are densely interconnected in multiple cascaded layers between the network input and output. Modern deep learning networks, such as the ones proposed in the previous works [33,34] and shown in Fig. 3 and Fig. 5, usually have even more complicated structures than a fully connected neural network. Compared with a deep learning network, a linear regression model has very low complexity.
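The per-sample update rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the function name `fit_linear_model`, the array layout (one training sample per row) and the scale of the random initial weights are assumptions made for the example.

```python
import numpy as np

def fit_linear_model(X, Y, lr=0.01, n_iter=300, seed=0):
    """Estimate W in Y = W X by per-sample gradient descent.

    X: (num_samples, M) inputs, real or complex.
    Y: (num_samples, N) target outputs.
    Returns W with shape (N, M).
    """
    rng = np.random.default_rng(seed)
    n_out, n_in = Y.shape[1], X.shape[1]
    # Random initial weights; the 0.01 scale is an assumption of this sketch.
    W = 0.01 * rng.standard_normal((n_out, n_in))
    if np.iscomplexobj(X) or np.iscomplexobj(Y):
        W = W.astype(complex)
    for _ in range(n_iter):
        for x, y in zip(X, Y):
            y_pred = W @ x  # actual output y' for the current W
            # w_nm' = w_nm + r (y_n - y_n') conj(x_m); conj is a no-op for real x.
            W += lr * np.outer(y - y_pred, np.conj(x))
    return W
```

For noise-free data generated by a fixed matrix, the estimate should approach that matrix provided the learning rate is small relative to the input energy; too large a learning rate makes the iteration diverge.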

Problem 1: attacking a random-phase-encoding-based optical cryptosystem
As proposed in many previous works [33,37,38], an optical image encryption system can be constructed with an optical setup containing cascaded lenses and random phase masks. Typical examples include Double Random Phase Encryption (DRPE) and Triple Random Phase Encryption (TRPE) [33]. In this work, the one with the more complicated structure, i.e. a TRPE system, is considered and its optical setup is shown in Fig. 2. With three random phase masks $R_1$, $R_2$ and $R_3$ placed in the input plane, Fourier plane and output plane, the encryption and decryption can be written as

$$C = \mathrm{IFT}\{\mathrm{FT}\{O \cdot R_1\} \cdot R_2\} \cdot R_3 \qquad (2)$$

$$O = \mathrm{IFT}\{\mathrm{FT}\{C \cdot R_3^{*}\} \cdot R_2^{*}\} \cdot R_1^{*} \qquad (3)$$

where FT and IFT denote the Fourier transform and inverse Fourier transform, the asterisk denotes complex conjugation, O denotes the input plaintext image and C denotes the encrypted light field (ciphertext). Ideally, the plaintext image O cannot be recovered from the ciphertext C if the key is not known, and information security is protected in this way. However, the encryption system can be cracked by a known-plaintext attack (KPA) if the attacker collects an adequate number of plaintext-ciphertext pairs. In a KPA, the objective is to recover the plaintext O from the corresponding ciphertext C without knowing $R_1$, $R_2$ and $R_3$. The entire system is linear, so the ciphertext can be regarded as the input vector X and the plaintext as the output vector Y in the linear regression model described in Section 2. The two-dimensional matrices C and O can be rearranged as the one-dimensional vectors X and Y, respectively. Consequently, a KPA on a TRPE system can be implemented with complex-amplitude linear regression (CLR), in addition to deep learning. In the previous work [33], the deep learning network structure shown in Fig. 3, referred to as DecNet, was employed for the KPA. In this work, CLR is compared with DecNet for the same KPA on a TRPE system.

Fig. 3 Deep learning network for attacking a TRPE system proposed in the previous work [33] (DecNet)

Problem 2: blind reconstruction in single-pixel imaging
In single-pixel imaging (SPI), the light intensity is recorded by a sensor containing only a single pixel, instead of a pixelated sensor array. A typical optical setup for an SPI system is shown in Fig. 4. The two-dimensional object image $O(x, y)$ is sequentially illuminated by N varying two-dimensional structured light patterns $P_n(x, y)$ $(1 \le n \le N)$ and a single-pixel intensity sequence $I_n$ $(1 \le n \le N)$ is recorded. Mathematically, each element of $I_n$ is the inner product between $O(x, y)$ and the corresponding pattern $P_n(x, y)$. The object image $O(x, y)$ can be computationally reconstructed when both the illumination pattern sequence $P_n(x, y)$ and the recorded intensity sequence $I_n$ are known. The total number of pixels in $O(x, y)$ and in each pattern $P_n(x, y)$ is assumed to be M. The sampling ratio S is defined as N/M. In single-pixel imaging, various kinds of algorithms can be employed to reconstruct $O(x, y)$ from $P_n(x, y)$ and $I_n$ [39]. However, all the illumination patterns $P_n(x, y)$ must be known in these reconstruction algorithms. It is usually easier to reconstruct a high-quality object image when the sampling ratio S is higher. A blind reconstruction in SPI by deep learning was attempted in the previous work [34], where the object image $O(x, y)$ is recovered from $I_n$ alone when the patterns $P_n(x, y)$ are not given. Blind reconstruction in SPI is favorable for some applications such as scattering imaging [34,40]. It is assumed that multiple pairs of different object images and single-pixel intensities are given for the fixed illumination patterns, which can be used as training samples in deep learning.
The blind reconstruction in SPI essentially contains two steps: (a) recovery of the unknown illumination patterns $P_n(x, y)$ from the training samples, which is similar to the KPA in random-phase-encoding-based optical encryption described in Section 3; (b) object image reconstruction in SPI from a given $I_n$ and the estimated $P_n(x, y)$ obtained in Step (a). The deep learning approach in the previous work [34] is end-to-end and both steps are realized within the network shown in Fig. 5. In SPI, $O(x, y)$ and $I_n$ have a linear mathematical relationship. The M pixels in $O(x, y)$ can be rearranged as the one-dimensional input vector X in Equation (1), and $I_n$ is equivalent to the output vector Y in Equation (1). All the N illumination patterns $P_n(x, y)$ jointly constitute the weighting matrix W in Equation (1), and each pattern corresponds to one row of W. Consequently, the unknown illumination patterns can be recovered from the training samples by linear regression in Step (a). A compressive sensing scheme with total variation minimization [39,41,42] is then employed for image reconstruction in Step (b) in this work. No training samples are required for compressive sensing reconstruction since it is not a machine learning process. Our proposed scheme is referred to as "Linear Regression + Compressive Sensing (LRCS)". The LRCS scheme is compared with the deep learning network proposed in the previous work [34], referred to as Wang's Net. It should be noted that no linear regression is performed to recover the illumination patterns in the previous work [34], even though compressive sensing is adopted there for image reconstruction under the assumption that the illumination patterns are known.
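The two-step structure described above can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the LRCS implementation: Step (a) uses the per-sample gradient descent of Section 2, while Step (b) swaps the total-variation compressive sensing solver for a plain minimum-norm least-squares inverse, which is only adequate when enough patterns are available. All function names are hypothetical.

```python
import numpy as np

def simulate_spi(images, patterns):
    """Single-pixel intensities I_n = sum over pixels of O * P_n.

    images: (num_images, M) flattened objects; patterns: (N, M).
    Returns an array of shape (num_images, N).
    """
    return images @ patterns.T

def estimate_patterns(images, intensities, lr=0.01, n_iter=300, seed=0):
    """Step (a): recover the pattern matrix W (one pattern per row)."""
    rng = np.random.default_rng(seed)
    n_patterns, n_pixels = intensities.shape[1], images.shape[1]
    W = 0.01 * rng.standard_normal((n_patterns, n_pixels))  # assumed init scale
    for _ in range(n_iter):
        for o, i_meas in zip(images, intensities):
            # Same update as in Section 2, with real-valued intensities.
            W += lr * np.outer(i_meas - W @ o, o)
    return W

def reconstruct_min_norm(intensity, patterns_est):
    """Step (b) stand-in: minimum-norm least squares, NOT TV minimization."""
    return np.linalg.pinv(patterns_est) @ intensity
```

At low sampling ratios the minimum-norm inverse would perform poorly, which is where a sparsity-promoting reconstruction such as total variation minimization becomes important.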

Attacking a random-phase-encoding-based optical cryptosystem
A complex-amplitude linear regression (CLR) is performed to crack a TRPE optical cryptosystem. For comparison, a DecNet [33] is constructed and the corresponding cracking results are obtained as well. The size of the plaintext image, the ciphertext and the random phase masks is 32 × 32 pixels. Plaintext images are randomly selected from the handwritten-digit images in the MNIST dataset [43] and the fashion images in the Fashion-MNIST dataset [44]. The output ciphertext light fields corresponding to the plaintext images are generated from a simulated TRPE system. In the training, plaintext images are used as the target output and complex-amplitude ciphertexts are used as the input for both CLR and DecNet. Various numbers of training samples are tried: 50, 100, 200, 500, 2000 and 5000. In addition, 200 samples randomly selected from each dataset, different from the training samples, are employed to test the attacking capability of CLR and DecNet after training. The peak signal-to-noise ratio (PSNR) between the original plaintext image and the result recovered from the ciphertext by each method is employed to evaluate performance.
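For reference, the PSNR metric used in the evaluation can be computed as in the sketch below. The peak value of 1.0 assumes images normalized to the range [0, 1]; the normalization actually used in the experiments is not stated here, so this is an assumption of the example, as is the function name.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of 0.1 on a unit-peak image gives a mean squared error of 0.01 and therefore a PSNR of 20 dB.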
In CLR, the learning rate is 0.01 for the MNIST dataset and 0.001 for the Fashion-MNIST dataset. The number of iterations is set to 300. The results of our complex-amplitude linear regression are compared with the ones using deep learning [33] in Table 1, Table 2 and Fig. 6. It can be observed that the performance of both methods improves as the number of training samples increases. However, CLR performs much better than DecNet when the number of training samples is small (e.g. from 50 to 500).
DecNet can only yield satisfactory results when the number of training samples is at least 2000 or 5000. The results from CLR with 200 training samples are close to the results from DecNet with 5000 samples. Evidently, CLR has significant advantages over DecNet in attacking a TRPE system when the number of training samples is inadequate. It can be observed from Fig. 6 that the MNIST images recovered by DecNet are contaminated with stripe noise and the Fashion-MNIST images recovered by DecNet are heavily blurred when the number of training samples is small. Theoretically, it is possible for a deep learning network like DecNet to accurately model a linear system. However, the network may overfit at a local optimum during training when the number of training samples is small. Since the global optimal solution is not reached, the network may yield unfavorable prediction results for the testing images.
In the previous work [33], DecNet can still work when only ciphertext intensities are available as the network input, instead of both ciphertext intensities and phases. However, CLR will not work in this situation since the input-output relationship is no longer linear. This is one major limitation of CLR compared with DecNet. The similarities and differences between the proposed CLR scheme and the previously proposed DecNet for attacking a TRPE system are summarized in Table 3.
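The loss of linearity for intensity-only data can be checked numerically. The sketch below uses a hypothetical 4 × 4 complex matrix `A` as a stand-in for a coherent optical system (it is not the TRPE system itself) and tests whether superposition holds for the complex field and for its intensity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complex system matrix standing in for a coherent optical system.
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
o1 = rng.random(4)
o2 = rng.random(4)

def field(o):
    """Complex-amplitude output, linear in the input."""
    return A @ o

def intensity(o):
    """Intensity-only output: squared modulus of the field."""
    return np.abs(A @ o) ** 2

# Superposition holds for the complex field ...
field_is_linear = np.allclose(field(o1 + o2), field(o1) + field(o2))
# ... but not for intensities, because of the interference cross term.
intensity_is_linear = np.allclose(intensity(o1 + o2),
                                  intensity(o1) + intensity(o2))

print(field_is_linear, intensity_is_linear)
```

The intensity of a sum contains the cross term 2·Re(a·conj(b)), so no single weighting matrix can relate intensity-only ciphertexts to plaintexts, which is consistent with the limitation of CLR noted above.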

Blind Reconstruction in Single-pixel Imaging
In the simulation, the size of the object image and of each illumination pattern is 32 × 32 pixels. The pixel intensity values in each illumination pattern are randomly distributed between 0 and 1. Four different numbers of illuminations, N = 51, N = 205, N = 410 and N = 1024, corresponding to four sampling ratios S = 0.05, S = 0.2, S = 0.4 and S = 1, are tried. Various numbers of training images and 200 testing images are randomly selected from the MNIST dataset. The single-pixel intensity values are obtained from the SPI model described in Section 4. Both our proposed "linear regression + compressive sensing" (LRCS) scheme and Wang's Net proposed in the previous work [34] are implemented to recover the original object image. In the linear regression step of LRCS, the learning rate is set to 0.01 and the number of iterations is 300 in all cases. The average PSNR of the images blindly reconstructed from simulated single-pixel intensity values for the 200 testing samples is presented in Table 4 and Table 5. Some examples of the reconstructed images are shown in Fig. 7.

It can be observed that the performance of LRCS improves as the sampling ratio increases and as the number of training samples increases. The performance of deep learning improves significantly as the number of training samples increases, but it does not necessarily improve as the sampling ratio increases. At a very low sampling ratio, S = 0.05, the images reconstructed by LRCS are very heavily degraded, but most images reconstructed by deep learning still have acceptable visual quality when the number of training samples is adequate. Since deep learning can extract some high-level common features from the training images, the test object image can still be well recovered from these features when the sampling ratio is very low. On the other hand, the feature extraction and reconstruction may cause more unpredicted errors in the recovered images when the dimension of the input data is higher. For the deep learning approach, therefore, the quality of the recovered images is not always worse at a lower sampling ratio and better at a higher sampling ratio. Fig. 7 shows that some images reconstructed by LRCS suffer from quality degradation, but the shapes of the digits match the original ground-truth MNIST images. In contrast, some images reconstructed by Wang's Net can be noise-free but the digits have distorted shapes, which causes a lower PSNR. From the results, it can be observed that the recovered image quality of LRCS with 200 training samples is comparable with that of deep learning with 5000 samples, except when the sampling ratio is very low.

It takes only 20 minutes to train the linear regression part of the LRCS scheme with 200 training samples in a Matlab R2018a environment with an Intel(R) Core(TM) i5-7200U CPU (2.50 GHz) and 8 GB RAM. In contrast, it takes at least about 6 hours to train Wang's Net with 5000 training samples in a CPU environment. Owing to the simplicity of the model, LRCS is much more computationally efficient than deep learning in the training step. The similarities and differences between LRCS and deep learning are summarized in Table 6.

The reconstruction results with LRCS and Wang's Net from the experimentally recorded data are shown in Fig. 9. It is reported in the previous work [34] that deep learning is significantly more robust to the noise in the experimental data. Since the optical setup in this work is different from the one in the previous work [34], the type of noise and its strength can be different in the experiment. For example, no laser illumination is employed in this work, so speckle noise contamination does not occur. In our observation, the performance of both LRCS and Wang's Net is slightly degraded by the extra experimental noise that does not appear in the simulated training data. It is still evident, however, that Wang's Net performs better than LRCS at a low sampling ratio and that LRCS performs better than Wang's Net when the number of training samples is small.
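As a quick check on the simulation settings, the illumination counts quoted for the four sampling ratios follow from N = S × M with M = 32 × 32 = 1024 pixels. The sketch below assumes that N is obtained by rounding to the nearest integer, which is consistent with the quoted values but is not stated explicitly; the helper name is hypothetical.

```python
M = 32 * 32  # pixels per object image and per illumination pattern

def illuminations_for(sampling_ratio, n_pixels=M):
    """Number of patterns N for a sampling ratio S = N / M (rounded, assumed)."""
    return round(sampling_ratio * n_pixels)

counts = {S: illuminations_for(S) for S in (0.05, 0.2, 0.4, 1.0)}
for S, N in counts.items():
    print(f"S = {S}: N = {N}")
```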

Conclusion
In this work, we point out that linear-regression-based methods can be used to solve two black-box optical imaging problems that were previously addressed by deep learning approaches. For attacking a TRPE optical cryptosystem, a complex-amplitude linear regression (CLR) scheme is proposed. For blind image reconstruction in an SPI system, a "linear regression + compressive sensing (LRCS)" scheme is proposed. In these two problems, linear-regression-based methods show some advantages over deep learning, such as requiring fewer training samples and a shorter training time. Simulation and experimental results indicate that deep learning does not always outperform linear regression in this type of black-box optical imaging problem and that each approach has its own advantages and disadvantages. The similarities and differences between linear-regression-based methods and deep learning are analyzed and summarized.

Funding
National Natural Science Foundation of China (61805145, 11774240); Leading Talents Program of Guangdong Province (00201505); Natural Science Foundation of Guangdong Province (2016A030312010).

In a TRPE system, the pixel intensities of the input light field represent the plaintext image O. The input light field is then optically Fourier transformed and inverse Fourier transformed with a double-lens 4f setup. The light field in the output plane becomes the ciphertext C. The plaintext image can be decrypted from the ciphertext with the same setup by backward light field propagation. Three random phase masks $R_1$, $R_2$ and $R_3$ are placed in the input plane, Fourier plane and output plane, respectively. The pixel values of all the phase masks are encoded as random phases in the range [0, 2π]. The three phase masks serve as the encryption and decryption key. The mathematical model of TRPE encryption and decryption is given by Equations (2) and (3).
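The TRPE model described above can be simulated with a short NumPy sketch. Several simplifications are assumed: the discrete FFT (without centering shifts) stands in for the optical Fourier transform, the plaintext O is used directly as the input field, and decryption is modeled as backward propagation through the conjugate masks. The function names are hypothetical.

```python
import numpy as np

def random_phase_mask(shape, rng):
    """Unit-modulus mask with phases drawn uniformly from [0, 2*pi)."""
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=shape))

def trpe_encrypt(O, R1, R2, R3):
    """C = IFT{ FT{O * R1} * R2 } * R3."""
    return np.fft.ifft2(np.fft.fft2(O * R1) * R2) * R3

def trpe_decrypt(C, R1, R2, R3):
    """Undo each encryption step in reverse order with the conjugate masks."""
    return np.fft.ifft2(np.fft.fft2(C * np.conj(R3)) * np.conj(R2)) * np.conj(R1)

rng = np.random.default_rng(0)
shape = (32, 32)
R1, R2, R3 = (random_phase_mask(shape, rng) for _ in range(3))
plaintext = rng.random(shape)          # placeholder image, not MNIST data
ciphertext = trpe_encrypt(plaintext, R1, R2, R3)
recovered = trpe_decrypt(ciphertext, R1, R2, R3)
```

Because every step is either a Fourier transform or an element-wise multiplication by a fixed mask, the mapping from plaintext to complex ciphertext is linear, which is the property the CLR attack relies on.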

Fig. 2 Optical setup of a triple random phase encryption (TRPE) system

Fig. 4 Optical setup of a single-pixel imaging system

Fig. 5 Deep learning network for blind image reconstruction in SPI proposed in the previous work [34] (Wang's Net)

Table 3 Similarities and differences between CLR and DecNet for attacking a TRPE system
Similarity: … image from the ciphertext light field for a TRPE system (or other similar optical cryptosystems) after the model is trained with a certain number of training samples
Differences, CLR: (1) Works with a small number of training samples; (2) Simple explicit model with few parameters; (3) Does not work for intensity-only ciphertext
Differences, DecNet: (1) Only works with a large number of training samples; (2) Complicated black-box model with many parameters for tuning; (3) Works for intensity-only ciphertext

Fig. 7 Comparison of reconstructed image results for a SPI system with LRCS and deep learning in the simulation

Table 6 (partial, deep learning entries): (3) End-to-end reconstruction; (4) Possibly reconstructs high-quality results at a very low sampling ratio

In this work, LRCS and Wang's Net are also evaluated on experimentally recorded data. The SPI experiments are conducted using the optical setup shown in Fig. 8. Each object image is printed on a paper card and illuminated by the patterns projected by a JmGO G3 projector. The single-pixel intensity values are recorded by a Thorlabs FDS1010 photodiode detector and a NI USB-6216 data acquisition card. In total, ten different object images are tested in the experiment.

Fig. 8 Optical setup of our SPI experiment

Fig. 9 Comparison of reconstructed image results with LRCS and Wang's Net for a SPI system based on the recorded data in real optical experiments

Table 2 Comparison of recovered plaintext image quality with complex-amplitude linear regression (CLR) and DecNet for the Fashion-MNIST dataset
Fig. 6 Comparison of recovered plaintext image results for a TRPE system with CLR and DecNet