Hyperparameter tuning of optical neural network classifiers for high-order Gaussian beams

High-order Gaussian beams, which have multiple propagation modes, have been studied for free-space optical communications. Fast classification of such beams using a diffractive deep neural network (D2NN) has been proposed. Optimizing a D2NN is important because it has numerous hyperparameters, such as interlayer distances and mode combinations. In this study, we classify Hermite-Gaussian beams, a family of high-order Gaussian beams, using a D2NN and automatically tune one of its hyperparameters, the interlayer distance. We used the tree-structured Parzen estimator (TPE), a hyperparameter auto-tuning algorithm, to search for the best model. The results indicate that the classification accuracy obtained by auto-tuning the hyperparameters is higher than that obtained by manually setting the interlayer distances at equal intervals. In addition, we confirmed that the accuracy gain from auto-tuning grows as the number of classification modes increases.


Introduction
Gaussian beams have high-order propagation modes. Examples include Laguerre-Gaussian (LG) beams and Hermite-Gaussian (HG) beams, which have cylindrical symmetry and rectangular symmetry, respectively. LG beams are light with a helical wavefront known as an optical vortex. An LG beam carries an orbital angular momentum of ℓℏ per photon (ℓ represents an integer and ℏ represents the reduced Planck constant) [1], which can increase the number of channels in free-space optical communications. An LG beam has orders p and ℓ in the radial and azimuthal directions, respectively. p represents an integer greater than or equal to zero, and ℓ represents a positive or negative integer; thus, an LG beam theoretically has an infinite number of propagation modes.
An HG beam is a rectangular beam with symmetry about the propagation axis; it has an order m in the x-direction and an order n in the y-direction in the plane perpendicular to the propagation direction z. Since both orders are integers greater than or equal to zero, an HG beam also has an infinite number of propagation modes. Multiple-bit communication by assigning codes to each of these modes [2,3], along with conventional amplitude shift keying (ASK) [4] and multiplexing using orthogonality [5], has been studied. Terabit free-space optical communication with high-order Gaussian beams has already been achieved by combining coding and multiplexing [6].
A system for detecting the mode of a beam is required to realize multibit communications. Several methods for detecting a beam's code have been investigated, including mode sorters [7], computer-generated holograms displayed on a spatial light modulator (SLM) [8], and deep learning [6,9]. Imaging with a charge-coupled device (CCD) camera and the computational load of neural networks are the bottlenecks of these methods.
Beam classification using electrical deep neural networks based on semiconductors can be expected to achieve high accuracy; however, the classification speed is limited by the semiconductor's clock frequency, and power consumption is also high. Recently, an optical computer known as a diffractive deep neural network (D2NN), which uses passive optical modulators, has been proposed [10]. In a D2NN, the training calculation is performed on a computer, and the trained parameters are recorded in optical modulators (photopolymers or SLMs). The light modulators correspond to the layers of a neural network, and by arranging them at equal intervals, a deep neural network is mimicked. The input light is modulated in each layer, and a photodetector placed in the output layer detects the calculation result of the D2NN. Since classification is performed only by light propagation and diffraction, it proceeds at the speed of light with minute power consumption.
Multimode classification of high-order Gaussian beams using a D2NN has been demonstrated [6,11,12], as has multiplexed beam generation and separation [12,13,14]. In both cases, the beam class can be detected by a photodetector, which speeds up detection by two to three orders of magnitude compared to a CCD camera.
However, in D2NN training, the number of layers, the distances between layers, and the choice of propagation modes are hyperparameters that should be optimized to improve classification accuracy. In general, hyperparameters are determined empirically.
In this study, we propose automatic tuning of the hyperparameters of a D2NN for classifying HG beams, a family of high-order Gaussian beams. Instead of manual tuning, automatic tuning algorithms using random search, grid search, and Bayesian optimization have been proposed [15]. We use a Bayesian optimization method, the tree-structured Parzen estimator (TPE), as the auto-tuning algorithm to optimize the interlayer distances, and we compare the result with equally spaced layers. Phase disturbance due to the atmosphere was added to the HG beams to model free-space optical communications.
Section 2 describes HG beams and atmospheric turbulence, Section 3 explains the proposed method, Section 4 presents simulation results, and Section 5 concludes this study.

High-order Gaussian beam and atmospheric turbulence
In this section, we describe the HG beam used in this study and the disturbance model of the beam in the atmosphere.

Hermite-Gaussian beam
An HG beam is a beam with rectangular symmetry [1], which has the following two degrees of freedom in the plane perpendicular to the optical axis: order m in the x-direction and order n in the y-direction. It is represented by

$$u_{mn}(x, y, z) = \frac{w_0}{w(z)} H_m\!\left(\frac{\sqrt{2}\,x}{w(z)}\right) H_n\!\left(\frac{\sqrt{2}\,y}{w(z)}\right) \exp\!\left(-\frac{x^2 + y^2}{w(z)^2}\right) \exp\!\left(-j\left[kz + \frac{k(x^2 + y^2)}{2R(z)} - (m + n + 1)\arctan\frac{z}{z_R}\right]\right),$$

where j = √−1, w_0 represents the beam waist at z = 0, w(z) and R(z) represent the beam radius and the wavefront radius of curvature at z, z_R represents the Rayleigh range, k represents the wavenumber, and H_m represents the Hermite polynomials. Both m and n represent integers greater than or equal to zero, and there are theoretically infinite beam patterns. Figure 1 shows examples of HG beams. As shown in Fig. 1, an HG beam is a rectangular, symmetric beam, and the number of divisions of the beam varies with the values of m and n. An HG beam has multiple propagation modes, and multiple-bit communication is possible by assigning codes to each mode.
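As an illustration, the following minimal sketch samples an HG beam at the waist (z = 0), where the expression above reduces to Hermite polynomials multiplied by a Gaussian envelope. The grid size and sampling pitch follow the settings in Section 4; the beam waist value w0 is a placeholder, not the paper's setting.

```python
import numpy as np
from scipy.special import eval_hermite  # physicists' Hermite polynomials H_m

def hg_beam(m, n, w0=0.5e-3, N=256, pitch=39e-6):
    """Complex field of an HG_mn beam at the waist (z = 0) on an N x N grid."""
    x = (np.arange(N) - N / 2) * pitch
    X, Y = np.meshgrid(x, x)
    u = (eval_hermite(m, np.sqrt(2) * X / w0)
         * eval_hermite(n, np.sqrt(2) * Y / w0)
         * np.exp(-(X**2 + Y**2) / w0**2))
    return u / np.sqrt(np.sum(np.abs(u)**2))  # normalize total power to 1

u23 = hg_beam(2, 3)  # e.g., the (2, 3) mode shown in Fig. 1
```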

Atmospheric turbulence
When light propagates through the atmosphere, the phase of the beam is disturbed. In this study, we used the von Karman model to add atmospheric disturbance to an HG beam [16]. The fluctuation spectrum of the refractive index, Φ_n(κ), is expressed by

$$\Phi_n(\kappa) = 0.033\, C_n^2\, \frac{\exp(-\kappa^2/\kappa_m^2)}{(\kappa^2 + \kappa_0^2)^{11/6}},$$

where κ_0 = 2π/L_0 and κ_m = 5.92/l_0. L_0 is the outer scale of the disturbance, and l_0 represents the inner scale of the disturbance. κ represents the spatial frequency, and C_n² represents the atmospheric structure constant, which expresses the strength of the disturbance caused by the atmosphere. By the Markov approximation, the spectrum of the phase can be expressed by

$$\Phi_\phi(\kappa) = 2\pi k^2 \Delta z\, \Phi_n(\kappa),$$

where k represents the wavenumber of the beam and Δz represents the propagation distance in the atmosphere. The phase disturbance of the propagated beam in the spatial domain can be expressed by

$$\theta = \mathrm{Re}\!\left\{\mathcal{F}\!\left[R\sqrt{\Phi_\phi(\kappa)}\right]\right\}\frac{2\pi}{N\Delta},$$

where Δ is the sampling pitch, N represents the number of pixels in the height and width of the beam, and R represents an N × N random complex matrix generated with mean 0 and variance 1. F represents the Fourier transform, and Re represents the operator that extracts the real part. The resulting θ represents the phase disturbed by the atmosphere.
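Below is a minimal sketch of the FFT-based phase-screen generation described above: complex Gaussian noise is filtered by the square root of the von Karman phase spectrum, and the real part is kept. The values of Cn2, L0, l0, and dz are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def phase_screen(N=256, delta=39e-6, wavelength=1550e-9,
                 Cn2=1e-15, L0=10.0, l0=0.01, dz=100.0):
    """FFT-based von Karman phase screen, following the equations above."""
    k = 2 * np.pi / wavelength           # wavenumber of the beam
    dkappa = 2 * np.pi / (N * delta)     # spectral sampling pitch
    kappa = (np.arange(N) - N / 2) * dkappa
    KX, KY = np.meshgrid(kappa, kappa)
    kappa2 = KX**2 + KY**2
    kappa0, kappa_m = 2 * np.pi / L0, 5.92 / l0
    # von Karman refractive-index spectrum Phi_n(kappa)
    Phi_n = (0.033 * Cn2 * np.exp(-kappa2 / kappa_m**2)
             / (kappa2 + kappa0**2)**(11 / 6))
    # phase spectrum under the Markov approximation
    Phi_phi = 2 * np.pi * k**2 * dz * Phi_n
    # N x N complex Gaussian noise with mean 0 and variance 1
    R = (np.random.randn(N, N) + 1j * np.random.randn(N, N)) / np.sqrt(2)
    theta = np.real(np.fft.fft2(np.fft.ifftshift(R * np.sqrt(Phi_phi)))) * dkappa
    return theta  # phase in radians; apply as u * np.exp(1j * theta)
```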

Proposed method
Figure 2 shows the D2NN model used in this study. The D2NN consists of multiple diffraction layers. In the conventional method shown in Fig. 2(a), the layers are equally spaced, whereas in the proposed method shown in Fig. 2(b), they are unequally spaced. The pixels in each layer correspond to neurons, and the light diffracted from the previous layer is optically coupled to each neuron in the next layer. The angular spectrum method was used for the light propagation calculation. The lightwave u_l propagating from layer l − 1 to layer l is calculated by

$$u_l = \mathcal{F}^{-1}\!\left\{\mathcal{F}\!\left[t_{l-1}\, u_{l-1}\right]\exp\!\left(j 2\pi \Delta z \sqrt{\frac{1}{\lambda^2} - \nu_x^2 - \nu_y^2}\right)\right\},$$

where ν_x and ν_y are the spatial frequencies, F^{-1} represents the inverse Fourier transform, λ is the wavelength of light, and Δz is the propagation distance between adjacent layers. The transmission coefficient t_l is the complex amplitude that changes the amplitude and phase of the light passing through a neuron and is denoted by

$$t_l = a_l \exp(j \phi_l),$$

where a_l is the amplitude, a real number between 0 and 1, and φ_l is the phase retardation, ranging from 0 to 2π. In this study, we assume that the layers in the D2NN are pure phase modulators with an amplitude of a_l = 1.
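A minimal sketch of one layer-to-layer step, combining the phase-only modulation t_l with the angular spectrum propagation above; variable names are illustrative.

```python
import numpy as np

def propagate_layer(u_prev, phase, dz, wavelength=1550e-9, pitch=39e-6):
    """One D2NN step: phase-only modulation, then angular spectrum propagation."""
    N = u_prev.shape[0]
    t = np.exp(1j * phase)              # a_l = 1: pure phase modulation
    nu = np.fft.fftfreq(N, d=pitch)     # spatial frequencies [1/m]
    NUX, NUY = np.meshgrid(nu, nu)
    arg = 1 / wavelength**2 - NUX**2 - NUY**2
    # transfer function; evanescent components (arg < 0) are suppressed
    H = np.where(arg > 0,
                 np.exp(1j * 2 * np.pi * dz * np.sqrt(np.maximum(arg, 0))),
                 0)
    return np.fft.ifft2(np.fft.fft2(t * u_prev) * H)
```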
The D2NN has several diffraction layers and takes a lightwave as input. Detector regions, one per class, are arranged in the output layer, and the D2NN is trained to focus light on the detector corresponding to the input beam, as sketched below.
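For illustration, the classification result can be read out by summing the output intensity over each detector region and selecting the brightest one. The region layout (detector_slices) is an assumption, as the paper does not specify it.

```python
import numpy as np

def classify(u_out, detector_slices):
    """detector_slices: one (row_slice, col_slice) pair per class label."""
    intensity = np.abs(u_out)**2
    powers = [intensity[rs, cs].sum() for rs, cs in detector_slices]
    return int(np.argmax(powers))  # label of the brightest detector region
```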

Black-box optimization
Hyperparameter optimization is essential for improving the performance of neural networks. The hyperparameters of the D2NN in this study include the number of layers, the distances between layers, and the selection and combination of beam patterns. These are difficult to optimize with stochastic gradient methods such as the Adam optimizer [17] used in machine learning, because computing their gradients with the error back-propagation algorithm is difficult. Thus, other optimization methods are needed to find the optimal hyperparameters. The relationship between hyperparameter values and classification accuracy is unknown; therefore, the best hyperparameter settings are obtained by training with various values and comparing the losses. Grid and random searches have been used for this purpose. Grid search is intuitive and can be parallelized; however, it also evaluates non-dominant hyperparameters, which increases the computation time [18]. Random search has the advantages that the search can be terminated partway through and that the results of multiple independently executed searches can be combined [19]. However, methods such as random and grid search are inefficient because they require many computationally expensive training runs.
Therefore, we used black-box optimization. Black-box optimization observes hyperparameters x and the corresponding value of the loss function L, and predicts the x that minimizes L from these observations. Since the search proceeds while predicting the optimal hyperparameters, it is more efficient than random and grid searches.
TPE is a black-box optimization technique used as an automatic hyperparameter tuning algorithm for neural networks. TPE is a type of Bayesian optimization, consisting of a surrogate model that approximates the distribution of promising hyperparameters and an acquisition function that samples the next hyperparameters to be evaluated. TPE uses kernel density estimators as the surrogate model and expected improvement as the acquisition function [20]. TPE first sets random hyperparameters x and evaluates them. After several evaluations, the evaluated hyperparameters are divided into a good group and a bad group, and their distributions l(x) and g(x) are approximated by kernel density estimation. Next, several points are sampled, and the x that maximizes l(x)/g(x) is selected as the next observation point. By repeating this search, TPE can efficiently obtain excellent hyperparameters in few trials.
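As a sketch, a TPE search over the two interlayer distances might look as follows. Optuna is used here as one common TPE implementation; this choice is our assumption, since the paper does not name its library, and train_and_evaluate is a hypothetical helper standing in for a full D2NN training run.

```python
import optuna

TOTAL = 0.30  # fixed distance from input plane to output plane [m]

def objective(trial):
    # Sample two of the three gaps; the third is fixed by the 30 cm total.
    d1 = trial.suggest_float("d1", 0.01, 0.28)  # input plane -> layer 1
    d2 = trial.suggest_float("d2", 0.01, 0.28)  # layer 1 -> layer 2
    if d1 + d2 > TOTAL - 0.01:
        raise optuna.TrialPruned()  # infeasible arrangement; skip this trial
    d3 = TOTAL - d1 - d2            # layer 2 -> output plane
    # hypothetical helper: trains the D2NN and returns the validation loss
    return train_and_evaluate(d1, d2, d3)

study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                            direction="minimize")
study.optimize(objective, n_trials=40)
print(study.best_params)
```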

Results
In this study, we set the wavelength of the HG beams to 1550 [nm]; the beam waist and the atmospheric structure constant C_n² were set to values assumed for free-space optical communication during calm daytime (the values are listed in Table 1). For the D2NN model, the number of pixels of the input beam plane, the detector plane, and all layers is 256 × 256 pixels, and the sampling pitch is 39 [µm]. We prepared 1000 images for each beam mode: 800 images for training data and 200 images for validation data. We fixed the distance from the input plane to the output plane at 30 [cm] and searched for the best layer arrangement. The loss function L is the mean squared error:

$$L = \frac{1}{N}\sum_{i=1}^{N} (y_i - t_i)^2,$$

where N represents the number of pixels in the output image and the ground-truth data, y_i represents the light intensity of each pixel in the output image, and t_i represents the light intensity of each pixel in the ground-truth image. For the ground-truth image, we prepared an image in which only the detector region corresponding to each label was brightened. An example of our dataset is shown in Fig. 3, and the parameters used for HG beam classification are listed in Table 1. In this study, we set the number of layers to two to increase the flexibility of tuning the interlayer distances.
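A minimal sketch of this loss and ground-truth construction, under the assumption that the output intensities are normalized to [0, 1] and reusing the hypothetical detector_slices layout from Section 3.

```python
import numpy as np

def ground_truth(label, detector_slices, N=256):
    """Dark image with only the correct class's detector region brightened."""
    img = np.zeros((N, N))
    rs, cs = detector_slices[label]
    img[rs, cs] = 1.0
    return img

def mse_loss(u_out, target_img):
    y = np.abs(u_out)**2
    y = y / y.max()                # assumption: intensities scaled to [0, 1]
    return np.mean((y - target_img)**2)
```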
Table 2 shows the layer positions and classification accuracies obtained using the proposed method. Table 2 also indicates the classification accuracy of the conventional method with layers equally spaced at 10 [cm], as shown in Fig. 2(a). Since the classification accuracy depends on the random initial values of the parameters in each layer, we trained the conventional method 40 times. For 36 classification modes, we set the number of searches to 20 because training takes a long time.
As shown in Table 2, for 16 classification modes, auto-tuning the interlayer distances using TPE improved the accuracy by 0.5% over equally spaced layer placement. Similarly, the accuracy improved by 3.4% for 25 classification modes and by 10.1% for 36 classification modes. The difference in accuracy between TPE and equidistant placement is larger for 25 and 36 classification modes than for 16 classification modes, indicating that the importance of auto-tuning the interlayer distances with TPE increases as the number of modes increases. Figure 4 illustrates the interlayer distances obtained by TPE optimization; such interlayer distances would be difficult to find manually.
Figure 5 shows an example of an image inferred by the trained D2NN. The inference was performed using the D2NN that achieved the highest accuracy of 94.9% in classifying the 36 modes of HG beams in Table 2. As shown in the figure, a specific detector region is brightened by the inference calculation, so the original bits encoded in an HG beam can be determined from this light intensity using a photodetector. However, the surrounding detector areas become brighter along with the desired detector area, which results in classification errors. These errors occur in the areas above, below, and to the left and right of the desired detector area. They can be caused by the proximity of the detector areas or by the similarity of beam patterns due to the proximity of the input beam modes. To confirm this, we randomly switched the detectors' positions and the mode assigned to each beam; however, the same errors occurred for beams with similar modes. Thus, for HG beam classification using the D2NN, classification errors occur between close modes and do not depend on the position of the detector.
Next, we describe the computational environment used for training. We used Windows 10 Enterprise as the operating system, an AMD Ryzen 5 2600 as the CPU, and an NVIDIA GeForce RTX 2060 as the GPU. We used TensorFlow-GPU 2.2.0 and Keras 2.

Conclusion
In this study, we investigated how hyperparameter tuning can improve classification accuracy in multibit free-space optical communications with HG beams when a D2NN is used as the classifier. The results indicate that hyperparameter tuning can find interlayer distances that yield higher classification accuracy than manually placing the D2NN layers at equal intervals. Classification accuracy was improved by approximately 0.5%, 3.4%, and 10.1% for 16, 25, and 36 classification modes, respectively, over equally spaced layers. In general, as the number of classification modes increases, the classification accuracy decreases, making it more important to improve accuracy by tuning the interlayer distances. To increase communication capacity in the future, it will be necessary to use more beam modes, such as 49 or 64 modes, and automatic hyperparameter tuning should be performed to cope with this increase.
Funding. This research was funded by the Japan Society for the Promotion of Science (19H04132, 19H01097).
Disclosures. The authors declare no conflicts of interest.

Figures
Figure 1: Examples of Hermite-Gaussian (HG) beams. (m, n) represents the orders m and n of an HG beam.
Figure 2: Model of the diffractive deep neural network: (a) conventional method; (b) proposed method.
Figure 3: Example dataset. The top images are Hermite-Gaussian (HG) beams input to the diffractive deep neural network, and the bottom images are the corresponding ground-truth images. (m, n) represents the orders m and n of an HG beam.
Figure 4: Distances between the learned layers.
Figure 5: Example of inference results.

Table 1: Parameters of the diffractive deep neural network model.