Recognition of Orbital Angular Momentum of Vortex Beams Based on Convolutional Neural Network and Multi-Objective Classifier

Abstract: Vortex beams carry orbital angular momentum (OAM), and their inherent infinite dimensional eigenstates can enhance the ability for optical communication and information processing in the classical and quantum fields. The measurement of the OAM of vortex beams is of great significance for optical communication applications based on vortex beams. Most of the existing measurement methods require the beam to have a regular spiral wavefront. Nevertheless, the wavefront of the light will be distorted when a vortex beam propagates through a random medium, hindering the accurate recognition of OAM by traditional methods. Deep learning offers a solution to identify the OAM of the vortex beam from a speckle field. However, the method based on deep learning usually requires a lot of data, while it is difficult to attain a large amount of data in some practical applications. To solve this problem, we design a framework based on convolutional neural network (CNN) and multi-objective classifier (MOC), by which the OAM of vortex beams can be identified with high accuracy using a small amount of data. We find that by combining CNN with different structures and MOC, the highest accuracy reaches 96.4%, validating the feasibility of the proposed scheme.


Introduction
As a special beam that carries OAM and possesses a spiral wavefront, the optical vortex has important applications in many fields such as optical manipulation [1], optical information processing [2], photonic computing [3], quantum communication [4][5][6], and free-space optical communication [7]. OAM-based optical communication has become one of the research hotspots in recent years. In addition to amplitude, phase, frequency, and polarization, OAM can be used as a new modulation parameter in an optical communication system. Because OAM is independent of these other physical quantities, it can be combined with other communication methods, which greatly increases the transmission capacity.
The demand for the transmission of large amounts of data in the communication field is becoming more and more urgent. OAM-based free-space optical communication has unique advantages and great prospects in future communication applications. OAM measurement is critical for these applications. Methods such as spiral interference fringes and optical conversion have been proposed to measure the OAM of vortex beams [8][9][10][11]. However, these methods are limited to identifying the OAM of vortex beams in free space because they require a well-defined intensity pattern. With the continuous development of artificial intelligence, machine learning techniques are widely applied in many engineering fields such as spectral analysis, computer vision, and intelligent machine control [12][13][14][15], because they automatically learn data trends and patterns that may be overlooked by humans. Many OAM recognition methods based on machine learning have been proposed. In 2014, Krenn et al. were the first to use the BP-ANN model to identify the intensity maps of superimposed Laguerre-Gaussian (LG) beams, using 16 combinations with an error rate of 1.7% [16]. In 2016, Knuston et al. used the VGG16 network model to classify 110 different OAM states with a classification accuracy of 74% [17]. In 2017, Doster and Watnik showed that the AlexNet network model demultiplexes Bessel-Gaussian multiplexed beams better than the traditional method [18]. By comparison with the recognition rates of BP-ANN, Li et al. demonstrated that machine learning methods based on the convolutional neural network are better choices for demultiplexing LG beams [19]. These results show that pattern recognition based on machine learning overcomes the restriction of traditional methods to vortex beams in free space, and offers a new solution for OAM recognition.
Optical fiber is widely used to transmit information over long distances. A specially designed optical fiber can ensure the distortionless transmission of vortex beams [20][21][22]. However, such a fiber requires a complex manufacturing process and is costly, which hinders its popularization and application. In contrast, the manufacturing process for ordinary optical fiber is mature and low-cost. A multimode fiber (MMF) can simultaneously transmit a large number of modes, providing a solution for large-capacity data transmission. When a vortex beam is transmitted in the MMF, an irregular speckle image is generated at the distal end due to mode coupling and superposition, which hinders the recognition of the OAM [23]. A deep learning-based method realized the recognition of OAM from the speckle image at the distal end of the MMF [24], but it requires a large amount of data. The performance of deep learning-based methods depends strongly on the amount of training data, and excellent results require a large amount of data, which is difficult to obtain in some practical problems. Classical machine learning algorithms can effectively deal with small data problems and can make correct, but not necessarily optimal, decisions. In order to combine the advantages of both, in this study we propose a framework based on a CNN and an MOC which achieves high-accuracy recognition of the OAM of a vortex beam from speckle with a small amount of data. We extract features with the pretrained CNN model, send the extracted features and corresponding labels to the MOC for training, and finally classify them. This method can greatly reduce the amount of data used while maintaining high recognition accuracy.

Dataset
The experimental setup is shown in Figure 1. A vertically polarized laser beam (Onefive Origami-10XP, 400 fs, 1 MHz) with a wavelength of 1028 nm propagates through a half-wave plate (HWP), which rotates the polarization direction of the beam to 45° with respect to the horizontal axis. The light transmitted through the beam splitter (BS) is divided into two beams with orthogonal polarization by the polarization beam splitter (PBS). Two phase-only spatial light modulators (SLM1, HAMAMATSU X13138-03 and SLM2, HAMAMATSU X13138-09) impose a helical phase on the horizontally and vertically polarized beams, respectively. The two vortex beams, with orthogonal polarization and the same or different OAM, are recombined by a PBS and coupled into an MMF (Thorlabs, M31L20, 62.5 µm, NA = 0.275, 20 m) through microscope objective lens 1 (O1). The speckle image at the distal end of the MMF is collected by microscope objective lens 2 (O2). Finally, the image is captured by a charge-coupled device (CCD, Pike F421B, AVT). The OAM-related topological charges of the vortex beams generated by SLM1 and SLM2 each vary from 1 to 10.9 with an interval of 0.1. As the two topological charges change alternately, 10,000 different speckle images are generated at the distal end of the MMF. The distribution of light intensity is shown in Figure 2.
Photonics 2023, 10, 631
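As a quick consistency check on the label set described in this section, the following sketch enumerates the topological-charge pairs. It assumes that "change alternately" means every combination of the two charges is recorded once, which agrees with the stated total of 10,000 images; the variable names are illustrative and not taken from the study.

```python
from itertools import product

# Topological charges 1.0, 1.1, ..., 10.9 (step 0.1), built from integers
# to avoid accumulating floating-point error.
charges = [n / 10 for n in range(10, 110)]

# Assumption: one speckle image per (charge on SLM1, charge on SLM2) pair.
label_pairs = list(product(charges, charges))

print(len(charges))      # 100 values per beam
print(len(label_pairs))  # 100 * 100 = 10000 labelled images
```

Each image therefore carries two targets, one per beam, which is why a classifier with two outputs is needed later.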

Network Structure
To ensure high recognition accuracy and reduce the amount of data required for training, we designed an architecture based on CNN and MOC. The network architecture is shown in Figure 3.
The CNN is used to fully extract the image features. The manual feature extraction method of traditional machine learning technology has limitations in the correlation of features and may include human bias, affecting the quality of the features and the corresponding results. Therefore, we use CNN to extract features and learn the importance of features automatically through backpropagation, thus eliminating some of the problems and limitations of manual feature extraction. The feature extraction depends on the structure of the CNN. In this study, different structures of CNN, including ResNet [25], ResNeXt [26], DenseNet [27], and GoogLeNet [28], are selected to extract features. The module structure diagrams of these networks are shown in Figure 4.
ResNet and ResNeXt are composed of residual structures with jump (skip) connections. This connection structure alleviates the vanishing gradient problem, so a deeper network can be built. Figure 4a and Figure 4b show the ResNet module and the ResNeXt module, respectively. ResNeXt decomposes the residual module of ResNet into several uniform branch structures. Through this design, the network structure becomes clearer and more modular. The number of parameters that need to be adjusted manually is reduced, and the performance is better for the same number of parameters.
GoogLeNet is made up of several identical modules connected in series. The module structure is shown in Figure 4c. With this structure, more convolutions can be stacked in a receptive field of the same size, which helps the network learn richer features and thus improves its performance.
DenseNet is a densely connected network. Unlike the addition operation between layers of ResNet and ResNeXt, the connection between different layers of DenseNet is a concatenation (stacking along the channel dimension). This connection mode reduces the number of parameters in the network. The feature reuse of the dense connection and the rich feature information gathered by concatenation allow more features to be extracted with fewer convolutions. In addition, this structure builds a deeper network and reduces the risk of overfitting by improving the flow of information and gradients through the whole network. The structure diagram is shown in Figure 4d.
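The difference between the two connection types can be illustrated with a small sketch. It uses plain Python lists as stand-ins for feature maps and two hypothetical toy layers; it is not the actual ResNet or DenseNet module, only a picture of how addition keeps the feature width fixed while the DenseNet-style connection makes it grow.

```python
def residual_step(x, layer):
    """ResNet-style connection: element-wise x + F(x); width is unchanged."""
    fx = layer(x)
    assert len(fx) == len(x), "addition requires matching widths"
    return [a + b for a, b in zip(x, fx)]

def dense_step(x, layer):
    """DenseNet-style connection: concatenate x with F(x); width grows."""
    return x + layer(x)  # list concatenation, not arithmetic addition

def same_width_layer(x):
    """Hypothetical toy layer returning as many features as it receives."""
    return [0.5 * v for v in x]

def growth_layer(x, growth=2):
    """Hypothetical toy layer emitting `growth` new features from all inputs."""
    mean = sum(x) / len(x)
    return [mean] * growth

x_res = x_dense = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):
    x_res = residual_step(x_res, same_width_layer)
    x_dense = dense_step(x_dense, growth_layer)

print(len(x_res), len(x_dense))  # 4 10
```

After three layers the residual path still has 4 features, while the dense path has 4 + 3 × 2 = 10, because every layer sees all earlier features and appends only a few new ones.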
To extract features with the CNN, we take a pretrained model and finetune its parameters with our data. Each image propagates through the layers and stops at the last layer, where the current vector is taken as the feature vector. This training method is called transfer learning, and the pretrained model is obtained by training on ImageNet data [29]. ImageNet contains more than 14 million images, and the subset commonly used for pretraining covers 1000 categories. A model trained on such a large dataset has already learned general-purpose features, so only finetuning is needed to train our added final classification layer.
The second part of the architecture is the MOC, which can achieve high recognition accuracy with a small amount of data. Each speckle pattern is generated by the transmission of two vortex beams through an MMF, so each sample has two target values. Therefore, it is necessary to use an MOC, which fits and predicts each target with a chosen type of evaluator (estimator). The evaluator used in this study is the random forest ensemble learning classifier [30], whose basic estimator is the decision tree [31].
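One common way to handle two targets per sample is to fit one independent classifier per target. The sketch below shows that idea under two loudly stated assumptions: the study does not say how its MOC is implemented, and a tiny nearest-centroid classifier is swapped in for the random forest to keep the example short. `NearestCentroid`, `MultiTargetClassifier`, and the toy features are all hypothetical.

```python
from collections import defaultdict

class NearestCentroid:
    """Tiny stand-in base classifier (the study uses a random forest)."""
    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(x)
            sums[label] = [s + v for s, v in zip(sums[label], x)]
            counts[label] += 1
        self.centroids_ = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids_, key=lambda c: dist2(x, self.centroids_[c]))
                for x in X]

class MultiTargetClassifier:
    """Fits one independent copy of the base classifier per target column."""
    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_targets = len(Y[0])
        self.estimators_ = [
            self.make_estimator().fit(X, [row[t] for row in Y])
            for t in range(n_targets)
        ]
        return self

    def predict(self, X):
        per_target = [est.predict(X) for est in self.estimators_]
        return [tuple(labels) for labels in zip(*per_target)]

# Toy features: the first feature tracks OAM1, the second tracks OAM2.
X = [[1.0, 5.0], [1.1, 5.1], [3.0, 5.0], [3.1, 7.0], [1.0, 7.1], [3.0, 7.0]]
Y = [(1, 5), (1, 5), (3, 5), (3, 7), (1, 7), (3, 7)]

moc = MultiTargetClassifier(NearestCentroid).fit(X, Y)
print(moc.predict([[1.05, 6.9], [2.9, 5.2]]))  # [(1, 7), (3, 5)]
```

In the framework described here, the rows of `X` would be CNN feature vectors and each row of `Y` the pair of topological charges for that speckle image.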
As a basic classification and regression algorithm, the decision tree has a tree structure composed of nodes and directed edges (Figure 5). A decision tree contains a root node, several internal nodes, and leaf nodes. The root node includes all the sample sets, the leaf nodes correspond to the decision results, and the other nodes correspond to attribute tests. The path from the root node to each leaf node corresponds to a sequence of tests. The core of the algorithm is to recursively select the optimal feature and segment the data according to that feature so as to find the best classification result for each sub-data set.
Random forest is a combinatorial classification algorithm of ensemble learning (Figure 6). Ensemble learning is mainly focused on producing a strong classifier with an excellent classification performance by combining several base classifiers. Based on this idea, multiple decision trees form a random forest. The core idea of the random forest algorithm is to resample the training set to form multiple training subsets. Each subset generates a decision tree, and the final result is decided by all the decision trees through voting.
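The resample-and-vote idea can be sketched in a few lines. This is a simplified illustration, not the classifier used in the study: each "tree" is a depth-1 stump, the Gini impurity is assumed as the splitting criterion, and random feature subsampling is omitted. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Depth-1 tree: choose the (feature, threshold) with lowest weighted Gini."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[j] <= t]
            right = [lab for x, lab in zip(X, y) if x[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # no valid split: always predict the majority class
        label = majority(y)
        return lambda x: label
    _, j, t, left_label, right_label = best
    return lambda x: left_label if x[j] <= t else right_label

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict_forest(trees, x):
    """Final decision by majority vote over all trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.05]), predict_forest(forest, [0.95]))
```

Individual stumps can be poor when their resample happens to miss most of one class, but the vote over many resamples smooths out such cases.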

Results
As the two topological charges of the vortex beams change from 1 to 10.9 alternately, 10,000 different speckle patterns are recorded by the CCD. These speckle images are randomly assigned to the training data or the test data to train and test the network. To finetune the pretrained CNN, the input images need to be preprocessed. The preprocessing consists of center cropping and normalization, and the data are converted into tensors for the network. The training set, validation set, and test set are processed in the same way. After the training of the multi-objective classifier, the performance of the network is tested with the test data. The test results are compared with the ground-truth labels to calculate the recognition accuracy of both OAMs together and of each single OAM. All the charts and diagrams in our study were produced with Origin, a data analysis and graphing software, and the results are as follows.
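The two kinds of accuracy mentioned above can be computed as in the following sketch. It assumes each label is a pair (OAM1, OAM2) and that "accuracy of two OAMs" counts a sample as correct only when both charges are predicted correctly; the function name and the example values are illustrative.

```python
def oam_accuracies(y_true, y_pred):
    """Per-target accuracy and joint accuracy for (OAM1, OAM2) label pairs."""
    n = len(y_true)
    oam1 = sum(t[0] == p[0] for t, p in zip(y_true, y_pred)) / n
    oam2 = sum(t[1] == p[1] for t, p in zip(y_true, y_pred)) / n
    both = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"OAM1": oam1, "OAM2": oam2, "both": both}

y_true = [(1.0, 2.5), (3.1, 4.0), (5.5, 5.5), (10.9, 1.0)]
y_pred = [(1.0, 2.5), (3.1, 4.1), (5.4, 5.5), (10.9, 1.0)]

print(oam_accuracies(y_true, y_pred))
# {'OAM1': 0.75, 'OAM2': 0.75, 'both': 0.5}
```

The joint accuracy can never exceed either single-target accuracy, since a sample with one wrong charge counts as a joint error.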
The recognition accuracy depends on the amount of data used to train the network (Figure 7). The images are cropped to 224 × 224 by center cropping. The blue and red curves correspond to the test results of ResNet34 and ResNet34+MOC, respectively. High accuracy can be achieved by the CNN alone when a large amount of data is available. Adding an MOC improves the accuracy, which reaches 100% when the combined network is trained with enough data.
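The preprocessing step (center cropping to 224 × 224 followed by normalization) can be sketched without any image library. The sketch assumes a single-channel 256 × 256 image with values in [0, 1], and the mean and standard deviation of 0.5 are placeholder values, since the normalization constants are not given in the text.

```python
def center_crop(image, size):
    """Return the central size x size window of a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def normalize(image, mean, std):
    """Shift and scale every pixel: (pixel - mean) / std."""
    return [[(px - mean) / std for px in row] for row in image]

# Synthetic 256 x 256 grayscale image with values in [0, 1] (horizontal ramp).
image = [[col / 255 for col in range(256)] for _ in range(256)]

cropped = center_crop(image, 224)          # removes a 16-pixel border
normalized = normalize(cropped, 0.5, 0.5)  # mean and std are assumed values

print(len(cropped), len(cropped[0]))  # 224 224
```

Cropping discards the outer 16 pixels on every side, so information near the image border never reaches the network.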

Conventionally, since a larger training set benefits the training of the network, the training set is much larger than the test set. In contrast, in this test the training data contain 2000 speckle images, while the test data contain 8000 speckle images. The images are cropped to 224 × 224 by center cropping. The CNNs used are GoogLeNet, DenseNet121, ResNet50, ResNet101, ResNeXt50, and ResNeXt101. Figure 8a shows the test results of the CNN, and Figure 8b shows the test results of the CNN+MOC. The values in red correspond to the recognition accuracy of both OAM1 and OAM2, and the values in beige and blue correspond to the recognition accuracy of OAM1 and OAM2, respectively.
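The inverted split of 2000 training and 8000 test images can be reproduced with a seeded shuffle of the image indices. The sketch below is only illustrative: the function name and the seed are assumptions, and the text does not describe how the random selection was actually implemented.

```python
import random

def split_indices(n_total, n_train, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(10_000, 2_000)
print(len(train_idx), len(test_idx))  # 2000 8000
```

Fixing the seed keeps the split reproducible, so different CNN structures can be compared on exactly the same training and test images.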
Since the ResNeXt101 network gives the best accuracy, it is also trained with 2000 images using the original image size of 256 × 256 as input. The test results are shown in Table 1.

Discussion
With a large amount of data for training, ResNet34 can effectively learn and extract useful features. The OAM recognition accuracy is 99.8% (Figure 7). After further training by the MOC, the accuracy reaches 100%. With the continuous reduction in training data, the recognition accuracy also decreases. When the training data are reduced to 2000, ResNet34 can learn and extract useful features to a certain extent. However, due to the small amount of data, the learned features are limited, and the recognition accuracy of OAM is greatly reduced to 76.6%. After further training by the MOC, the recognition accuracy is improved, reaching 84.5%.
To further increase the recognition accuracy, we change the structure of the CNN. Both semantic information and resolution information are indispensable for learning and extracting features effectively. More abstract features can be learned by increasing the depth of the network, and features at a larger resolution can be learned by enhancing the function of the convolution module. However, blindly adding layers or enhancing the module function does not improve performance; depth and module function need to be balanced according to the actual problem. The recognition accuracy of GoogLeNet and DenseNet121 is only 36.3% and 56.1%, respectively. GoogLeNet pays more attention to the module function than to the depth of the network, so it fails to fully learn and extract the features and underfits. DenseNet121 pays attention to network depth and module function simultaneously, but with the small amount of data it fits the training data well and the test data poorly, that is, it overfits. Since ResNet34 has better learning and feature extraction performance, ResNet50 and ResNet101 improve on it by increasing the number of network layers. To improve the performance further, ResNeXt50 and ResNeXt101 optimize the module structure to enhance the module function on top of the deep network. ResNeXt101 achieves the highest recognition accuracy of 86.3% with its optimized module structure and deeper network. When the extracted features and corresponding labels are then used to train the MOC, the recognition accuracy improves further to 94.7%.
Feeding the image at its original size into the network provides additional information, so that a network with good feature learning and extraction capability can learn richer information. This further improves the performance of the network, and the accuracy of OAM recognition reaches 96.4%.
By increasing the depth of the network and optimizing the network structure to enhance the learning and extraction of semantic and resolution information, the recognition accuracy of OAM is improved. With a CNN that effectively extracts semantic and resolution information, followed by MOC training, the highest recognition accuracy of OAM is 94.7%. An input containing more information improves the performance again, and the recognition accuracy exceeds 96%. Therefore, the architecture based on CNN and MOC proposed in this study can recognize the OAM of vortex beams from the speckle patterns with high accuracy.

Conclusions
In conclusion, we propose a combined CNN and MOC method, which successfully identifies the OAM of vortex beams from speckle patterns. Although a traditional CNN can recognize the OAM of vortex beams from speckle patterns, excellent performance requires a large amount of training data. To reduce this dependence, we add an MOC to the CNN. Through the combination of CNNs with different structures and the MOC, the highest recognition accuracy reaches 96.4% even with only a small amount of data to train the network. The proposed network structure offers a solution to the problem of small data.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.