Optical neural network via loose neuron array and functional learning

This research proposes a deep-learning paradigm, termed functional learning (FL), to physically train a loose neuron array: a group of non-handcrafted, non-differentiable, and loosely connected physical neurons whose connections and gradients are beyond explicit expression. The paradigm targets training non-differentiable hardware and therefore solves several interdisciplinary challenges at once: the precise modeling and control of high-dimensional systems, the on-site calibration of multimodal hardware imperfections, and the end-to-end training of non-differentiable, model-free physical neurons through implicit gradient propagation. It offers a methodology for building hardware without handcrafted design, strict fabrication, or precise assembly, thus forging new paths for hardware design, chip manufacturing, physical neuron training, and system control. In addition, the functional learning paradigm is numerically and physically verified with an original light field neural network (LFNN). It realizes a programmable incoherent optical neural network, a well-known challenge, which delivers light-speed, high-bandwidth, and power-efficient neural network inference by processing parallel visible-light signals in free space. As a promising supplement to existing power- and bandwidth-constrained digital neural networks, the light field neural network has various potential applications: brain-inspired optical computation, high-bandwidth power-efficient neural network inference, and light-speed programmable lenses/displays/detectors that operate in visible light.


Figure S1. Accuracy of using FL to train different neuron arrays (Regular-2, Regular-3, Normal-3, Uniform LC neurons) with the numerical simulation.

Here we verify this concept with numerical simulation (Figure S1), where the input neurons are point light sources.

Table S1. Accuracy comparison of different layer spacings with the numerical simulation. All neuron arrays are regularly aligned with two or three LC layers, where the spacing between the input and output layers is reported in the first row. The LC layers equally split the spacing. For example, the distance between two LC layers is 30 mm for the regular-3 array with 120 mm spacing.

Figure S2. Neuron arrays with 0%, 20%, 40%, and 60% randomly malfunctioning neurons. We report the accuracy of testing different Bernoulli arrays on both the numerical simulation and the actual LFNN device for the 1-layer MNIST classification.

We further conduct a test to verify that training random neurons is also feasible in real-world systems. We apply different Bernoulli distributions to the regular-2 array, numerically and physically, to generate random neuron arrays (Figure S2). We randomly select and deactivate certain ratios of LC neurons in our LFNN prototype. In addition, all input and output gains are deactivated to emphasize the change in the number of LC neurons.
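As a minimal illustration of how such Bernoulli arrays can be generated in simulation, the following Python sketch deactivates a given ratio of neurons; the array shape, ratio, and seed are illustrative placeholders rather than the exact configuration of our prototype.

    import numpy as np

    def bernoulli_array(weights, malfunction_ratio, seed=0):
        """Randomly deactivate a given ratio of LC neurons.

        weights: 2-D array of LC neuron parameters.
        malfunction_ratio: fraction of neurons forced to 0 (e.g., 0.2, 0.4, 0.6).
        """
        rng = np.random.default_rng(seed)
        # keep_mask[i, j] == 1 with probability (1 - malfunction_ratio).
        keep_mask = rng.binomial(1, 1.0 - malfunction_ratio, size=weights.shape)
        return weights * keep_mask

    # Example: a 64x64 LC layer with 40% randomly malfunctioning neurons.
    lc_layer = np.ones((64, 64))
    masked = bernoulli_array(lc_layer, 0.4)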

The test results are summarized in Figure S2. The physically captured LFNN output is comparable to that of the equal-configuration simulation, confirming that it is possible to train random physical neurons in practice.

The experiment also assesses the robustness of training loose neuron arrays with the FL paradigm by showing reasonable accuracy even with up to 60% of the neurons malfunctioning. Given that robustness, there can be many possible configurations of loose neuron arrays to adapt to different actual environments and applications.

S2 Light Field Neural Network

We physically prototype a loose neuron array, termed the light field neural network (LFNN), to realize the regular-2 array and the Bernoulli array by randomly deactivating neurons (Section S1). We use off-the-shelf components, e.g., liquid crystal display (LCD) panels, polarizers, and a machine vision camera, without tedious calibration. As a result, the actual LFNN is not precisely the same as the regular-2 array.

The applied LCD panels are expected to show high transmittance and good linearity between the applied voltage and the polarization rotation angle. Among available LCD panels, we use the Chimei Innolux AT070TN83 as the optical layer. A photograph of our prototype is shown in Figure S3. A modified backlight system adapted from a commercial projector illuminates the input plane. The front and rear polarizing films are removed from the two LCDs. A diffuser and a polarizing film are located at the camera's focal plane as the output plane. All layers are assembled into separate acrylic frames, and all frames are installed on the optical table. The spacing between adjacent layers is 30 mm. Here we first set the distance between the input plane and the output plane to 90 mm so that the central neuron's energy distribution can roughly cover the entire output plane (Figure S4).

Figure S3. Light field neural network. The LFNN prototype consists of an input plane, an output plane, two layers of liquid crystal, and three perpendicular linear polarizers. The output plane is a scattering plane followed by a camera that acquires the data. We use an extra LCD as the input plane, representing artificial neurons with pixels.
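For readers reproducing the setup in simulation, the distances and components stated above translate into a configuration like the following Python sketch; the field names are our own and purely illustrative.

    # LFNN prototype geometry as described above; all distances in millimeters.
    LFNN_GEOMETRY = {
        "lc_panel": "Chimei Innolux AT070TN83",
        "num_lc_layers": 2,
        "num_polarizers": 3,           # perpendicular linear polarizers
        "layer_spacing_mm": 30,        # spacing between adjacent layers
        "input_to_output_mm": 90,      # input plane to output plane
    }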

Note that we use only off-the-shelf, low-cost components to build the LFNN prototype without fine-tuning; as a result, the neurons are not consistent, and the connections are uncertain.

Figure S5 and Figure S6 visualize the characterized optical behavior of LC1. For Figure S5, we set the parameter of the input plane's central neuron to 1 and the others to 0, set all neuron parameters of LC2 to 1, and then set one of LC1's neuron parameters to 1 at a time in sequence to capture a slice of the light field. The input plane's central neuron is affected by only a small part of LC1's neurons. We can observe different patterns for different color channels.

For Figure S6, all neuron parameters of the input plane are set to 1 to yield less noisy captured patterns. However, the captured patterns are still not symmetrical circles or regular distributions.

Figure S7 and Figure S8 are the counterparts for LC2. LC2 has a wider influence area with respect to the input plane's central neuron. In Figure S7, strange red patterns diverge from the primary bright spots; they are irregular and asymmetric for unknown reasons.

Figure S4. Captured energy distributions of the input plane's neurons. To capture this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for the input plane, and set all neuron parameters to 1 for LC1 and LC2. The captured outputs are linearly normalized between 0 and 1 for visualization.

Figure S5. Captured energy distributions of the input plane's central neuron with respect to LC1. To capture this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for LC1; set the central neuron's parameter to 1 and the others to 0 for the input plane; and set all neuron parameters to 1 for LC2. The captured outputs are linearly normalized between 0 and 1 for visualization.

Figure S6. Captured energy distributions with respect to LC1. To capture this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for LC1, and set all neuron parameters to 1 for the input plane and LC2. The captured outputs are linearly normalized between 0 and 1 for visualization.

Figure S7. Captured energy distributions of the input plane's central neuron with respect to LC2. To capture this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for LC2; set the central neuron's parameter to 1 and the others to 0 for the input plane; and set all neuron parameters to 1 for LC1. The captured outputs are linearly normalized between 0 and 1 for visualization.

Figure S8. Captured energy distributions with respect to LC2. To capture this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for LC2, and set all neuron parameters to 1 for the input plane and LC1. The captured outputs are linearly normalized between 0 and 1 for visualization.

The contrasts of the input plane, LC1, and LC2 are measured and visualized in Figures S9, S10, and S11. We apply 256 pixel values varying from 0.0 to 1.0 to each plane's central neuron and capture the corresponding output. As shown in the figures, even though the input parameters range from 0 to 1 with 256 gray levels, the number of actually distinguishable gray levels is less than 256, because the noise is too large to separate similar values. The per-pixel valid gray level is also automatically learned by the FL paradigm, so we do not need to define it explicitly.
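To make the notion of a valid gray level concrete, the number of distinguishable levels could be estimated from repeated captures as in the sketch below. This is our own illustrative procedure; FL does not compute it explicitly but absorbs the effect during training.

    import numpy as np

    def valid_gray_levels(captures):
        """Count control levels whose responses are separable from noise.

        captures: array of shape (256, repeats) holding repeated measurements
        of one neuron's output for each of the 256 control values.
        """
        means = captures.mean(axis=1)
        noise = captures.std(axis=1).mean()
        levels, last = 1, means[0]
        for m in means[1:]:
            if abs(m - last) > 2 * noise:  # separable beyond roughly 2 sigma
                levels += 1
                last = m
        return levels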
Figure S9. Captured energy distributions of the input plane's central neuron with changing parameter values. To capture this light field slice, we sweep the central neuron's parameter from 1.0 to 0.0 while keeping the others at 0 for the input plane, and set all neuron parameters to 1 for LC1 and LC2. The captured outputs are linearly normalized between 0 and 1 for visualization.

Liquid crystal panels are commonly considered linear components that are disentangled between layers¹. However, we find that the LFNN prototype is not a perfect end-to-end linear system in practice, which makes it non-differentiable and hard to train.

As shown in Figure S12, we fix the parameters of two planes and assess the linearity of the last plane by linearly compositing random parameters. The parameters of the third row are the sum of those of the first two rows. With a perfectly linear system, the third row's output would equal the sum of the first two rows' outputs; instead, we observe a structured bias, implying that the bias is a function of the parameters of the other two planes. In practice, the whole device is a high-dimensional system whose layers are entangled, and it is difficult to represent the system accurately with classic explicit or static implicit models. As such, our functional learning paradigm is necessary and works well even in this nonlinear system.
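The additivity test of Figure S12 can be stated compactly as follows. In this sketch, capture_output is a hypothetical stand-in for driving the LFNN with a parameter set and reading the camera; it is not part of any published API.

    import numpy as np

    def additivity_bias(capture_output, params_a, params_b, fixed_planes):
        """Measure the deviation from additivity on one tested plane.

        capture_output(plane_params, fixed_planes) -> 2-D camera image.
        params_a, params_b: random parameter sets for the tested plane.
        fixed_planes: frozen parameters of the other two planes.
        """
        out_a = capture_output(params_a, fixed_planes)
        out_b = capture_output(params_b, fixed_planes)
        out_sum = capture_output(params_a + params_b, fixed_planes)
        # For a perfectly linear system the bias would be zero everywhere.
        bias = out_sum - (out_a + out_b)
        return np.abs(bias).mean()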

There can be many possible sources of nonlinearity. One known source is the imperfection of the electronic control circuit and the photoelectric conversion of the system: the actual physical response is not always linear in the control parameter. To rule out such a nonlinearity, we composite binary parameters, i.e., each neuron is either 0 or 1, but we still observe a structured bias, which implies that there are other sources of nonlinearity in the system, such as other device properties or non-optical sources.

In general, this experiment verifies that there are always gaps between real-world systems and theoretical models.

Because the bias sources of the LFNN device are unclear, it is impossible to accurately train such a system in classic ways via numerical models.

Figure S11. Captured energy distributions of LC2's central neuron with changing parameter values. To capture this light field slice, we sweep LC2's central neuron parameter from 1.0 to 0.0 while keeping the others at 0; we set all neuron parameters to 1 for LC1, and set the central neuron's parameter to 1 and the others to 0 for the input plane. The captured outputs are linearly normalized between 0 and 1 for visualization.

The prototype system is built with off-the-shelf components as a proof of concept rather than to pursue cutting-edge bandwidth.

Thus, the components' optical and electrical parameters, such as transmittance, power efficiency, and resolution, are not state of the art.

Figure S12. Linearity assessment of the LFNN device (panels: Assess LC2, Assess LC1, Assess Input Plane). We generate and composite random parameters for the input plane, LC1, and LC2. In each test case, we change only one plane's parameters and capture three outputs. For the tested plane, the third row's parameters equal the sum of the other two rows' parameters. In theory, the output of the third row should equal the sum of the first and second rows if the device were a perfectly linear system. The captured outputs are linearly normalized between 0 and 1 for visualization.
In theory, the computing capability of the optical system reaches the scale of tera-operations per second (TOPS). However, using the current training paradigm to train such a huge neural network system is impractical in terms of time. Therefore, we only use a small portion of the computing capability.

Specifically, we merge multiple LC neurons into one by sharing their weights to reduce the number of training parameters (see the sketch after Figure S13). The total power consumption of the system, including the light source, LCD panels, and the camera, is 37.275 watts. In addition, we use a computer to control the system, whose power consumption is 58 watts. In general, the primary challenge to fully utilizing the computing capability is the training time. Using only a trivial nonlinear activation is also an option that balances performance and cost, and it can likewise be robustly trained by FL, as shown in Figure S13.

Figure S13. Log of training the 3-layer LFNN on the CIFAR10 classification task with or without X-activation. The training of the distribution layout is discussed in Section S4.
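A minimal sketch of the neuron-merging idea, assuming PyTorch, an 800x480 panel, and a hypothetical block size of 4: a coarse grid of trainable parameters is upsampled so that each trainable value drives a whole block of physical LC neurons.

    import torch

    class SharedLCLayer(torch.nn.Module):
        """Merge blocks of LC neurons into single trainable parameters."""

        def __init__(self, full_shape=(480, 800), block=4):
            super().__init__()
            self.block = block
            coarse = (full_shape[0] // block, full_shape[1] // block)
            # One trainable value per block of physical LC neurons.
            self.coarse_weights = torch.nn.Parameter(torch.rand(coarse))

        def forward(self):
            # Repeat every coarse weight over a block x block patch of pixels.
            w = self.coarse_weights.repeat_interleave(self.block, dim=0)
            w = w.repeat_interleave(self.block, dim=1)
            return w.clamp(0.0, 1.0)  # LC control parameters live in [0, 1]

This reduces the number of trainable parameters by a factor of the block size squared while keeping the full panel addressable.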

We compare the functional learning paradigm with existing machine learning paradigms to verify its efficacy (Table S2).

The test is conducted on our LFNN prototype for the 1-layer MNIST classification task. The chosen paradigms and the overall results are reported in Table S2. To calibrate the linearity of the control parameters, we imitate gamma calibration and build a lookup table for each panel by measuring the actual system output. Even though we build a forward model to approximate the LFNN system, this paradigm yields a low prediction accuracy, since the actual physics contains thousands of correlated parameters (Section S2.4), forming a high-dimensional state space that cannot be measured in practice. The train losses of these two paradigms are generally comparable, as shown in Figure S14 and Figure S15. For the finite difference paradigm, we test several step sizes and report the best accuracy, obtained with 0.001; the train loss is plotted in Figure S16. We also evaluate a gradient-free paradigm under the same hyperparameter setting as our FL configuration; within equal training time (100 epochs), its prediction accuracy is 14.06%. As shown in Figure S17, its loss decreases more steadily than that of the finite difference paradigm.
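For concreteness, the finite difference baseline can be sketched as follows; evaluate_loss stands for running the physical LFNN on a batch and measuring the loss, and the step size 0.001 matches the best-performing value reported above.

    import numpy as np

    def finite_difference_grad(evaluate_loss, params, step=0.001):
        """Estimate gradients by perturbing one parameter at a time.

        evaluate_loss(params) -> scalar loss measured on the device.
        One device evaluation is needed per parameter, which is why this
        scales poorly to thousands of correlated parameters.
        """
        base = evaluate_loss(params)
        grad = np.zeros_like(params)
        flat, gflat = params.ravel(), grad.ravel()
        for i in range(flat.size):
            flat[i] += step           # perturb in place
            gflat[i] = (evaluate_loss(params) - base) / step
            flat[i] -= step           # restore
        return grad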

However, exploring a high-dimensional solution space without gradients is inefficient because there are too many parameters to explore.

To evaluate the capability of the FNN in terms of reflecting the real physics, we compare the predicted output of the FNN with the actual output of the LFNN. Figure S18 shows the FNN's predicted point spread functions of the input neurons, which are supposed to match the results of Figure S4. Figure S19 shows the difference between the predicted and captured outputs. As can be seen, the FNN approximates the real LFNN output well, except for a small amount of noise.

Figure S18. The FNN's predicted energy distributions of the input plane's neurons. To produce this light field slice, we set one neuron parameter to 1 and the others to 0 in sequence for the input plane, and set all neuron parameters to 1 for LC1 and LC2. The predicted outputs are linearly normalized between 0 and 1 for visualization.
We test object classification, object recognition, and single-image depth estimation on our LFNN prototype. The results are summarized in Table S6. The 'coffee mug' category of an RGB-D dataset⁸ is used to test the 4-layer LFNN on the depth estimation task.

The original depth data are scanned by a Kinect and contain many holes. We use the dataset's original mask labels to separate the object and background layers, then use a nearest-neighbor-search scheme to fill the holes in each layer (Figure S20). However, the original mask labels are not very accurate and thus introduce some errors into the object shapes. There are in total 4,500 such RGB-D pairs in the training set and 300 pairs in the test set.

We use the cross-entropy loss for the p-learning of the classification and recognition tasks, and the L1 loss for the depth estimation task. The optimization algorithm is Adam⁹, and the learning rate is 0.001. Because we use a commercial low-speed camera (four frames per second) to implement the LFNN prototype, the performance bottleneck of the training process is the time consumed by capturing z-data. The average training time of one epoch is therefore about 4 minutes. The total training time for each task can be easily calculated from the total number of epochs reported in this section.
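The loss and optimizer configuration above corresponds to the following PyTorch sketch; model stands for the trainable parameters of the differentiable surrogate and is not reproduced here.

    import torch

    # Loss terms as described: cross-entropy for classification/recognition,
    # L1 for depth estimation.
    classification_loss = torch.nn.CrossEntropyLoss()
    depth_loss = torch.nn.L1Loss()

    def make_optimizer(model):
        # Adam with a learning rate of 0.001, as used for all tasks here.
        return torch.optim.Adam(model.parameters(), lr=0.001)

    def training_step(model, optimizer, inputs, targets, task="classification"):
        optimizer.zero_grad()
        pred = model(inputs)
        loss = (classification_loss(pred, targets)
                if task == "classification" else depth_loss(pred, targets))
        loss.backward()
        optimizer.step()
        return loss.item()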

For the classification tasks, the predicted probabilities are represented by different distribution layouts. Specifically, a class probability is the intensity sum over relevant areas of the output plane. One naive probability distribution layout is to divide the plane into a regular grid and use one cell to represent the probability of one class, for example, using the intensities of ten light spots to represent the classification probabilities of the ten hand-written digits of the MNIST dataset². Because manually optimizing the probability distribution layout requires an explicit model of the LFNN, there is no trivial algorithm to do so. Instead of assigning a handcrafted layout, we therefore make the probability distribution layout itself trainable. We find that splitting a large spot into four small spots opens a large optimization space for the layout: a feature at one corner of the input plane might not reach the opposite side of the output image due to limited scattering angles, but multiple gathering spots on the output plane mitigate this problem.

While training the distribution layout, some connections are pruned, which might dramatically change the probability distribution layout and cause a sudden accuracy decrease. The pruned connections are replaced by other connections later, and the accuracy gradually recovers after the pruning; a minimal sketch of the procedure follows the figure caption below.

Figure S21. Illustration of the process of training the distribution layout. This figure shows an example of training the distribution layout for two classes. To begin with, we divide the output plane into 64 grid cells to represent output neurons. Every output neuron is connected to every class probability, and the connections are trained during the training for classification. At the end of every epoch, each class probability prunes its least-weighted connection. We repeat this process until only a target number of connections is left for each class probability. Finally, the weights of the remaining connections are set to 1, so the sums of intensity over different regions of the output plane represent the predicted probabilities of the different classes.
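A hedged sketch of the layout training in Figure S21, assuming PyTorch: the 64-cell grid and the per-epoch pruning of each class's least-weighted connection follow the caption, while names such as target_connections are illustrative.

    import torch

    class TrainableLayout(torch.nn.Module):
        """Trainable mapping from output-plane cells to class probabilities."""

        def __init__(self, num_cells=64, num_classes=10):
            super().__init__()
            # One trainable connection from every cell to every class.
            self.weights = torch.nn.Parameter(torch.rand(num_classes, num_cells))
            self.alive = torch.ones(num_classes, num_cells)  # pruning mask

        def forward(self, cell_intensities):
            # Class score = weighted sum of cell intensities.
            return cell_intensities @ (self.weights * self.alive).t()

        def prune_once(self, target_connections):
            """At the end of an epoch, drop each class's weakest live link."""
            with torch.no_grad():
                for c in range(self.weights.shape[0]):
                    if int(self.alive[c].sum()) <= target_connections:
                        continue
                    w = self.weights[c].abs()
                    w = w.masked_fill(self.alive[c] == 0, float("inf"))
                    self.alive[c, torch.argmin(w)] = 0.0

        def finalize(self):
            # Remaining connections get weight 1: class probabilities become
            # plain intensity sums over regions of the output plane.
            with torch.no_grad():
                self.weights.copy_(self.alive)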

Figure S24. Layer outputs of MNIST Classification (1 Layer) for test cases 1 to 5. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

Figure S33. Layer outputs of CIFAR10 Classification for test cases 6 to 10. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

As can be seen, the classification accuracy of the LFNN is lower on the CIFAR10 dataset than on the MNIST dataset. In the following sections, we analyze the possible reasons.

Deep learning has become a complicated system. It is supported by a highly active research community with many exciting discoveries every day, covering dataset generation, problem modeling, network architecture, loss design, training paradigms, and dedicated hardware. The final performance is jointly determined by all these factors, many of which are orthogonal to our study. The known factors affecting the performance of neural networks can be generally categorized into network complexity, problem complexity, and learning complexity¹⁰.

Network complexity partially depends on the hardware, including the neuron complexity, the number of neurons in each layer, the number of layers, and the number and type of interconnection weights. The neuron complexity is the same for the MNIST and CIFAR10 datasets since we use the same mechanism for both datasets. The modulation resolution is mainly determined by the number of trainable LC neurons at each neural network layer. As shown in Table S4, the number of LC neurons affects the performance on both datasets. While increasing the resolution gains only 0.03% and 0.24% improvements on the MNIST dataset, the improvement on the CIFAR10 dataset is up to 2.08%. A possible explanation for this difference is that a low modulation resolution cannot extract the detailed real-world image features of the CIFAR10 dataset. This could be the first reason for the accuracy decline on the CIFAR10 dataset, and it can be alleviated by high-resolution LC panels.

Table S4. Accuracy comparison of different numbers of LC neurons (2048×3 and 3072×3) with the numerical simulation, for the Regular-2 and Regular-3 arrays. All neuron arrays are regularly aligned with two or three LC layers, and the LC layers equally split a spacing of 120 mm.

Regarding the interconnections, which depend on the physical implementation of the LC panels (Figure S4), we conduct an experiment on various layer spacings (Table S1). The results show that a small (60 mm) layer spacing, i.e., fewer interconnections, produces a larger performance decline on the CIFAR10 dataset than on the MNIST dataset. We believe this is because the subtle features of the CIFAR10 images demand connectivity between neurons at wide angles. The performance gap between 120 mm and 240 mm becomes narrow but still exists, revealing a possibility to further increase the LFNN's performance, for example by adding a Fresnel lens. The weights of the interconnections are determined by the LC neuron layout. Figure S1 and Table ?? show the impact of the LC neuron layout on the performance. As can be seen, 3 LC planes and a random neuron distribution yield higher prediction accuracy, as they provide higher modulation resolution and computational flexibility that the subsequent data-driven training process can exploit. In addition, the results show that the CIFAR10 dataset benefits more than the MNIST dataset, and additional LC planes can continue to improve the accuracy.

The performance does not depend only on the hardware and the algorithm, but also on the problem complexity. For example, the CIFAR10 images are captured under uncontrolled lighting, making it harder to expose critical features for identification. Third, the noise in the data introduced by motion blur, out-of-focus effects, and system noise reduces the signal-to-noise ratio of the CIFAR10 dataset.

Fourth, the recognizability of the classes is fundamentally different between the two datasets. While the MNIST dataset contains symbols designed for easy recognition and reading, the CIFAR10 dataset contains naturally similar real-world objects and backgrounds with shared features and colors.

In our experiments, the digital DNN, i.e., a standard multilayer perceptron (MLP), the numerical simulation, and the LFNN system all produce a very large accuracy gap between the CIFAR10 and MNIST datasets (Table S6). Therefore, we believe the primary reason for the difference is that the problem complexity of the CIFAR10 dataset exceeds the capability of a standard densely connected neural network.

Figure S34. Evaluation of LFNN captured and FNN predicted results for Digit 0 Recognition.

Figure S35. Hardware parameters of Digit 0 Recognition (Layer 1: LC 0, LC 1, Input Gain, Output Gain).

Figure S36. Layer outputs of Digit 0 Recognition for test cases 1 to 5. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

Figure S37. Layer outputs of Digit 0 Recognition for test cases 6 to 10. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

Figure S38. Evaluation of LFNN captured and FNN predicted results for Plane Recognition.

Figure S39. Hardware parameters of Plane Recognition (Layer 1: LC 0, LC 1, Input Gain, Output Gain).

Figure S40. Layer outputs of Plane Recognition for test cases 1 to 5. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

Figure S41. Layer outputs of Plane Recognition for test cases 6 to 10. For each test case, the first row shows the LFNN captured outputs and the second row the FNN predicted outputs (panels: Input, Output, Distribution, Reference).

Figure S45. Layer outputs of Depth Estimation for test cases 6 to 10.