Slimmed optical neural networks with multiplexed neuron sets and a corresponding backpropagation training algorithm

Due to their intrinsic capability for parallel signal processing, optical neural networks (ONNs) have recently attracted extensive interest as a potential alternative to electronic artificial neural networks (ANNs), offering reduced power consumption and low latency. The parallelism of optical computing has been widely confirmed by applying wavelength division multiplexing (WDM) to the linear transformation part of neural networks. However, inter-channel crosstalk has obstructed the deployment of WDM technologies in the nonlinear activation part of ONNs. Here, we propose a universal WDM structure called multiplexed neuron sets (MNS), which applies WDM technologies to optical neurons and enables ONNs to be further compressed. A corresponding back-propagation (BP) training algorithm is proposed to alleviate or even cancel the influence of inter-channel crosstalk on MNS-based WDM-ONNs. For simplicity, semiconductor optical amplifiers (SOAs) are employed as an example of MNS to construct a WDM-ONN trained with the new algorithm. The results show that the combination of MNS and the corresponding BP training algorithm significantly downsizes the system and improves the energy efficiency by tens of times while delivering performance similar to traditional ONNs.


Introduction
Machine-learning (ML) technologies have been developing rapidly in recent years. The capabilities of ML have been proved to match or even surpass human intelligence in a few specific fields, such as speech recognition, image classification and intelligence-competitive games [1][2][3]. With the technological boom of ML, especially in artificial neural networks (ANNs), optical neural networks (ONNs) have entered people's vision as a potential part of the future infrastructure for ML and are believed to be a competitive alternative to their traditional electronic counterparts [4][5][6][7][8][9]. Since optical systems feature inherent parallelism with low energy consumption and low latency, the merging of electronics and optics is expected to alleviate some of the drawbacks of fully electronic systems [10,11]. Regarding the two fundamental elements of ANNs, vector-matrix multiplication and the non-linear activation function have both been proved to benefit from space and time division multiplexing in ONNs [9,[12][13][14][15][16][17][18].
While remarkable efforts have been made at both the hardware and software levels towards a slimmed ONN, the focus of WDM technologies applied to ONNs has been limited to the vector-matrix multiplication part [27][28][29]. As for optical non-linear activation functions, various optoelectronic devices, such as SOAs, ring resonators and optical phase modulators, have been proposed and experimentally investigated [17,[30][31][32][33]. However, the non-linear response of those devices inevitably introduces crosstalk between channels when WDM signals are applied. A universal plan for slimming ONNs by multiplexing non-linear neurons without downgrading performance is still lacking.
In this work, we propose a structure called multiplexed neuron sets (MNS) and a corresponding back-propagation (BP) training algorithm. The combination of the two can compress n parallel-deployed neurons into one with the help of WDM while maintaining the original performance.
We take semiconductor optical amplifiers (SOAs) as a typical example for the implementation of the MNS. The corresponding BP algorithm is designed to overcome the performance degradation caused by crosstalk between wavelength channels in SOAs. A slimmed ONN constructed with the MNS is proposed and trained with the corresponding BP algorithm. The result proves that the reduced scale greatly improves the energy efficiency of the whole system. Although SOAs are employed here as one possible implementation of the MNS for simplicity, other photonic devices are potential elements for MNS as long as they satisfy the features described in the following contents.
The designed BP algorithm is universally suitable for various ONN architectures with inter-channel crosstalk.

MNS structure and SOA-Based MNS
A simplified scheme of fully connected neural networks (FCNNs) is shown in Fig. 1(a). The neuron, marked in a gray-shadowed box, acts as one of the basic elements of FCNNs. The propagation of data is realized through the full connections of the neurons in adjacent layers. Those connections, called synapses, have different weights and can be abstracted into a weight matrix, which executes linear vector-matrix multiplications while the data propagate forward. The neurons, on the other hand, execute the summation (Σ) and non-linear activation (f) when they receive data from the previous layer. The summation represents the last-step operation of the vector-matrix multiplication, which is part of the linear transformation. In conventional FCNNs, each physical connection carries only one channel, which strictly represents one synapse, while the weights introduced by all synapses define the weight matrix.
When WDM is applied to ONNs, multiple wavelength channels (i.e. multiple synapses) are compressed into one physical connection. In the mathematical picture, each column of the weight matrix can be coded onto different wavelengths and then compressed into one physical connection that virtually represents multiple synapses [22,34,35]; alternatively, each row of the weight matrix can be compressed [19,20,23]. However, as far as we know, all those compression approaches for WDM-ONNs have only been applied to the linear transformation part, in either the input vector or the weight matrix.
It is natural to think that WDM can be further deployed in non-linear activation functions. As sketched conceptually in Fig. 1(b), parallel activation functions are coded onto different wavelengths and executed in a single device, labeled in a dashed box. If the summation function (Σ) is multiplexed together with the nonlinear activation function (f), the multiple neurons standing in a column in Fig. 1(a) can be further compressed into one single functional unit, which we name a multiplexed neuron set (MNS). The ultimate concept of MNS is to simplify the system by implementing several summation and activation functions with one single photonic or optoelectronic device. Thus, at the network level, one device is multiplexed to act as multiple neurons.
In Fig. 2(a), Layer m is decomposed into a weight matrix and an MNS. The corresponding physical structure of Layer m is emphasized in the gray box. The input of the MNS structure in Layer m is the vector resulting from the vector-matrix multiplication in Layer m, and is encoded on the input power of the MNS channels with various wavelengths. In this work, we give one example of an MNS realized by an SOA, as pictured in Fig. 2(b). The reasons why we choose SOAs as an example are:
• SOAs are commercially mature devices and have become easy to access;
• the intrinsic gain-saturation characteristic of SOAs has been employed as a non-linear activation function elsewhere [33,34];
• SOAs are suitable for processing multiple inputs encoded on various wavelengths in parallel.
A MUX is used to combine the input ports and feed the signals into the multi-channel SOA. At the output port of the SOA, a DEMUX is used to split the outputs into separate channels. Between the input and output of each channel, a set of non-linear activation functions is accomplished.
For an ONN architecture containing a device satisfying the feature of Fig. 2(c), the concept of MNS naturally helps to scale down the number of devices in use. However, the non-linear response of this conceived device inevitably introduces crosstalk between wavelength channels. The crosstalk may cause errors to propagate and result in performance degradation. We believe this has so far obstructed the deployment of WDM on non-linear activation functions in practice. For devices like SOAs, the crosstalk has been a pain point for their applications in ONNs [20,33]. Every input channel contributes to the gain-saturation effect, and the output signals suffer from deviations in amplification. In other words, the output signal of each channel is determined not only by the input of this channel, but also by the inputs of the other channels. This phenomenon, induced by the gain-saturation effect, is generally called cross-gain modulation (XGM). A compact model for XGM working at a relatively low modulation rate can be written as

G = G_ss · exp[-(G - 1) · P_in / P_sat],    (1)

where G and G_ss are the single-pass gain and the small-signal single-pass gain of the SOA respectively, and P_sat is the saturation power. As shown in Fig. 2(b), P_in becomes the summation of a series of optical powers of various wavelength channels. The summation of the inputs can be expressed as

P_in = Σ_k P_in,k,    (2)

where P_in,k represents the input power of the k-th channel. For simplicity, the wavelength dependence of the single-pass gain is ignored. Since we have the input power of each channel and the single-pass gain, it is easy to calculate the output power of each channel:

P_out,k = G · P_in,k.    (3)

For a more straightforward demonstration of the inter-channel crosstalk, we give an example of a 2-channel-multiplexed SOA in Fig. 3.
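As an illustration, the saturated gain of Eq. (1) is implicit in G and can be solved numerically, after which the per-channel outputs follow from Eq. (2)-(3). The sketch below is a minimal illustration of this standard gain-saturation model as we read it, not the authors' simulation code; all parameter values are arbitrary examples.

```python
import math

def saturated_gain(p_in_total, g_ss, p_sat):
    """Solve the implicit gain equation G = G_ss * exp(-(G - 1) * P_in / P_sat)
    by bisection; G - G_ss * exp(...) is strictly increasing in G, so the
    root on (0, G_ss] is unique."""
    lo, hi = 1e-12, g_ss
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - g_ss * math.exp(-(mid - 1.0) * p_in_total / p_sat) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mns_outputs(p_in_channels, g_ss=100.0, p_sat=1.0):
    """Eq. (2)-(3): all channels share one gain set by the total input power."""
    g = saturated_gain(sum(p_in_channels), g_ss, p_sat)
    return [g * p for p in p_in_channels]

# Crosstalk demo: raising Ch-1's input lowers Ch-2's output via the shared gain.
low = mns_outputs([0.05, 0.05])   # powers in units of P_sat (example values)
high = mns_outputs([0.50, 0.05])  # Ch-1 input raised, Ch-2 input unchanged
```

With zero input the solver recovers the small-signal gain, and any additional power on Ch-1 pulls the shared gain, and hence Ch-2's output, down — exactly the XGM behavior discussed above.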

The corresponding BP training algorithm
To enable the use of MNS in ONNs, a new BP training algorithm is developed to alleviate or even cancel the degradation caused by inter-channel crosstalk. For an SOA with multi-channel input, the output of each channel can be represented as a multi-variable function with the inputs of all channels as its variables. The whole output vector, composed of the outputs of all the channels, is a set of multi-variable functions sharing the same input variables. For an n-channel SOA, following Eq. (1)-(3), the i-th channel output can be written as

y_i = G(x_1 + x_2 + ... + x_n) · x_i,

where y_i is the i-th channel output and x_n is the n-th channel input. For simplicity and universality, we abstract the multi-variable functions as y_i = f_i(x_1, x_2, ..., x_n), through which MNS constructed with nonlinear optical or optoelectronic devices are matched. For both the output layer and the hidden layers of the network, the output of a specific layer is a column of such multi-variable functions.
During the training process of a specific layer, the partial derivative of the loss L with respect to the weight matrix W is calculated according to the chain rule. The corresponding new BP algorithm inherits the idea of minimizing the loss along the gradient direction while coupling the matrix below into the chain rule:

∂y/∂s = [ ∂f_1/∂s_1   ∂f_1/∂s_2   ...   ∂f_1/∂s_n ]
        [ ∂f_2/∂s_1   ∂f_2/∂s_2   ...   ∂f_2/∂s_n ]
        [    ...          ...     ...       ...    ]
        [ ∂f_n/∂s_1   ∂f_n/∂s_2   ...   ∂f_n/∂s_n ]

Here s represents the result vector of the vector-matrix multiplication of this layer. We have to bear in mind that this matrix embodies the inner difference, brought by the crosstalk, between the new BP algorithm and the traditional one. Each element in the matrix has a definition corresponding to the crosstalk among channels, as exemplified in Figure 3. In the traditional BP algorithm, only the elements on the diagonal are defined, while the off-diagonal elements are left undefined.
The new BP algorithm deals with the physical crosstalk by coupling the corresponding mathematical operations into the items left undefined by the traditional BP algorithm. [For the detailed derivation of the new BP algorithm, see Supplementary Information.]
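To make the role of the full Jacobian concrete, the sketch below backpropagates through a toy coupled activation in which all channels share one gain. The explicit gain formula here is a simplified stand-in for the abstract f_i, chosen by us for illustration, not the SOA model itself. The full-Jacobian gradient matches a finite-difference check of the loss, while a diagonal-only gradient (as in traditional BP) does not.

```python
import numpy as np

def activation(s, g_ss=10.0, p_sat=1.0):
    # Toy coupled activation: one gain shared by all channels (crosstalk).
    g = g_ss / (1.0 + np.sum(s) / p_sat)
    return g * s

def numeric_jacobian(f, s, eps=1e-6):
    # Central-difference Jacobian J[i, j] = d f_i / d s_j.
    out = f(s)
    J = np.zeros((out.size, s.size))
    for j in range(s.size):
        d = np.zeros(s.size); d[j] = eps
        J[:, j] = (f(s + d) - f(s - d)) / (2.0 * eps)
    return J

s = np.array([0.3, 0.5, 0.2])
target = np.zeros(3)
dL_dy = activation(s) - target          # dL/dy for L = 0.5 * ||y - target||^2

J = numeric_jacobian(activation, s)
grad_full = J.T @ dL_dy                 # new BP: off-diagonal (crosstalk) terms kept
grad_diag = np.diag(J) * dL_dy          # traditional BP: diagonal terms only

# Reference: finite differences of the loss itself.
loss = lambda s_: np.array([0.5 * np.sum((activation(s_) - target) ** 2)])
grad_ref = numeric_jacobian(loss, s)[0]
```

The design point is simply that once the activation couples channels, the correct dL/ds is Jᵀ · dL/dy with the whole matrix, which is exactly what the new BP algorithm supplies for the previously undefined off-diagonal items.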

Crosstalk level evaluation in SOA-based MNS
The new BP algorithm aims to alleviate or even cancel the performance degradation brought by inter-channel crosstalk while the device integration level rises. Therefore, the factors that influence the crosstalk level of the SOA-based MNS need to be investigated. As shown in Eq. (1)-(3), the output of the k-th channel, P_out,k, changes with the inputs of the other channels even if P_in,k remains constant. Based on the partial derivative ∂P_out,k/∂P_in,i, the crosstalk that the i-th channel brings to the k-th channel can be evaluated precisely in terms of how severely P_out,k is affected by P_in,i.
For gain saturation, the result of the partial derivative is shown in Fig. 4. With the two parameters G_ss and P_sat on the x-axis and y-axis and ∂P_out,k/∂P_in,i on the z-axis, the inter-channel crosstalk becomes more severe as G_ss increases.
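This trend can be checked numerically: under the gain-saturation model of Eq. (1)-(3), the magnitude of the cross-derivative ∂P_out,k/∂P_in,i grows with G_ss. The sketch below estimates it by central differences for a 2-channel case; the input powers and G_ss values (20 dB vs. 26 dB, matching the crosstalk levels used later) are illustrative assumptions.

```python
import math

def saturated_gain(p_in_total, g_ss, p_sat):
    # Bisection on G - G_ss * exp(-(G - 1) * P_in / P_sat), increasing in G.
    lo, hi = 1e-12, g_ss
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - g_ss * math.exp(-(mid - 1.0) * p_in_total / p_sat) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cross_derivative(p1, p2, g_ss, p_sat, eps=1e-6):
    """dP_out,2 / dP_in,1 for a 2-channel MNS, by central difference."""
    out_ch2 = lambda a: saturated_gain(a + p2, g_ss, p_sat) * p2
    return (out_ch2(p1 + eps) - out_ch2(p1 - eps)) / (2.0 * eps)

db = lambda d: 10.0 ** (d / 10.0)  # dB -> linear gain
xt_low = cross_derivative(0.05, 0.05, db(20), 1.0)   # G_ss = 20 dB
xt_high = cross_derivative(0.05, 0.05, db(26), 1.0)  # G_ss = 26 dB
```

Both derivatives are negative (more power on Ch-1 suppresses Ch-2's output), and the magnitude is larger at the higher G_ss, consistent with the trend in Fig. 4.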

Results
The ONN architecture involving SOA-based MNS structures is trained with the new algorithm.
Simulation results based on the traditional BP algorithm under different crosstalk levels are obtained for performance comparison. The effectiveness of the new BP algorithm for the architecture with MNS is then evaluated.
The proposed ONN has the scheme shown in Fig. 5(a). The hidden layer utilizes 2-channel, 4-channel or 6-channel multiplexing SOAs as MNS. The output layer utilizes a traditional electrically realized Sigmoid function, which is a common approach in existing on-chip ONNs [9]. The network scale of the ONN architecture is set as: 784 inputs, 60 neurons in the hidden layer and 10 neurons in the output layer. Considering the scaling factor of MNS, the number of devices in the hidden layer decreases by one-half, three-fourths, five-sixths, or even more if more channels of the SOA are multiplexed.
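The device-count reduction above is straightforward to state in code: with 60 hidden neurons, compressing n neurons into one n-channel SOA requires ⌈60/n⌉ devices. A trivial check (our illustration, not from the paper's code):

```python
import math

def soa_count(neurons, n_mux):
    # One n_mux-channel SOA replaces n_mux parallel neurons.
    return math.ceil(neurons / n_mux)

counts = {n: soa_count(60, n) for n in (1, 2, 4, 6)}
# 60, 30, 15 and 10 SOAs for 1-, 2-, 4- and 6-channel multiplexing respectively.
```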

Performance analysis
Two classification tasks are assigned for performance analysis, based on two datasets: MNIST handwritten digits and fashion-MNIST. When the proposed ONN is trained by the new BP algorithm, the individual panels in Fig. 5(b)-(d) show that the classification accuracy varies only slightly under different crosstalk levels. Also, referring to the figures in a row, the performance from 2-channel to 6-channel multiplexing SOAs remains similar. On the other hand, as shown by the solid line with triangle marks, blindly improving the integration level through WDM without utilizing the new algorithm greatly decreases the classification accuracy. These trends not only prove the new BP algorithm's strong resistance to crosstalk, but also demonstrate that denser-multiplexed MNS can be realized without significant performance degradation with the help of the new BP algorithm.
For each proposed ONN composed of n-channel (n = 2, 4 or 6) multiplexing SOAs, the training accuracy of the new BP algorithm under different crosstalk levels is averaged, and so is that of the traditional BP algorithm. The gap between these two values, defined as an improvement factor, reveals the performance improvement when a proposed ONN with an n-channel MNS is trained by the new BP algorithm. From another perspective, the necessity of the new BP algorithm for the proposed ONN with an n-channel MNS can be evaluated through this factor. In Fig. 5(e) and (f), the improvement factor increases with the multiplexing level of the SOAs. It is obvious that our new BP algorithm strongly alleviates the problem brought by parallel signal processing in nonlinear devices, and it becomes a necessity when denser-multiplexed MNS (SOAs with more multiplexed channels in this case) are employed in WDM-ONNs.
The stability of the new algorithm against inter-channel crosstalk comes from the fact that it includes the error induced by crosstalk in the process of backpropagation. In other words, as long as the crosstalk can be measured (formulated in this case), the algorithm takes it into consideration and maintains the performance. The more accurately the crosstalk is measured, the better the performance is. The green dashed line, together with the right y-axis, directly shows the performance improvement brought by the new algorithm.
The training deviation is defined as the difference between the maximum and minimum accuracy over the 10 repetitive training processes of a certain ONN. The lower training deviation and accuracy deviation observed in Fig. 6 indicate that the proposed ONN trained by the traditional BP algorithm does not converge as well as that trained by the new BP algorithm. As the error induced by the crosstalk is not taken into consideration in the traditional BP algorithm, the cost function does not descend along the gradient direction. As a result, whether the network converges to the global minimum becomes a random process. Also, with the increase of the crosstalk level and the number of multiplexed channels of the SOAs, the descending direction of the cost function deviates further from the gradient direction. Though the randomness caused by the traditional BP algorithm may not always result in a larger training deviation and accuracy deviation in Fig. 6(b), since we only take a finite number of simulations, it is a fatal drawback of the traditional BP algorithm.
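The two robustness metrics used in this comparison are simple to implement. The sketch below follows the definitions given in the text (function names and example numbers are ours):

```python
import statistics

def training_deviation(final_accuracies):
    """Max minus min final accuracy over repeated trainings (10 runs in the paper)."""
    return max(final_accuracies) - min(final_accuracies)

def accuracy_deviation(accuracy_curve, window=10):
    """Std of the accuracy over the last `window` iterations of one training run."""
    return statistics.pstdev(accuracy_curve[-window:])

runs = [0.91, 0.92, 0.90, 0.93, 0.91]      # example final accuracies
curve = [0.85, 0.88, 0.90] + [0.91] * 10   # example accuracy trace, flat at the end
```

A perfectly converged run (flat tail) gives an accuracy deviation of zero, which is what makes this metric a convergence indicator.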

Power consumption and integration level prospects
The performance-maintaining ability of the proposed ONN and the new BP algorithm is proved by the data presented in the previous section. It is therefore fair to discuss the advantages of this combination over traditional ONNs. One direct advantage is the reduction in the number of devices used in the nonlinear activation part, which benefits both the scaling of integration and the flexibility of signal routing. On the other hand, from the perspective of energy saving, signals are combined together in the MNS, so the required input power of each channel can be several times lower than in a traditional optical neuron to reach the non-linear operation regime.
In other words, the light sources can be substituted with low-power ones. For MNS realized by SOAs, the power consumption of the non-linear activation part is also reduced.
Based on the principles above, we theoretically analyze the power consumption of a specific layer with 60 neurons in the proposed ONN, following the scheme shown in the inset of Fig. 7(a). The consumption induced by the vector-matrix multiplication can be treated as a black box with a constant insertion-loss factor, which is a very common case in mainstream ONNs composed of passive devices.
Eq. (1)-(3), used in the previous simulation, are applied in the analysis, together with an external quantum efficiency of η = 0.6. M is defined as the number of SOAs utilized in the MNS structure of this layer. The laser power consumption is also estimated with the external quantum efficiency.
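A rough back-of-the-envelope version of the laser-power saving: when n channels share one MNS, each channel needs roughly 1/n of the optical power a stand-alone neuron would need to reach the nonlinear regime, and electrical power follows through η. The sketch below is our illustrative reading of this scaling; the per-neuron optical power is a made-up placeholder, not a figure from the paper.

```python
def layer_laser_power_mw(neurons, p_opt_per_neuron_mw, n_mux, eta=0.6):
    """Electrical laser power for one layer's light sources.

    Assumes each of the n_mux channels sharing an MNS needs ~1/n_mux the
    optical power of a stand-alone neuron (signals are combined in the MNS),
    converted to electrical power via external quantum efficiency eta
    (0.6 as in the text).
    """
    p_channel = p_opt_per_neuron_mw / n_mux
    return neurons * p_channel / eta

p_no_mux = layer_laser_power_mw(60, 1.0, 1)  # no multiplexing
p_6ch = layer_laser_power_mw(60, 1.0, 6)     # 6-channel MNS
```

Under this assumption the laser budget scales as 1/n with the multiplexing level, which is the qualitative trend reported for Fig. 7.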
In Fig. 7(a), the total power consumption of the WDM-ONN clearly decreases by a factor of tens of times as the number of multiplexed channels of the SOAs increases, whether the SOA part is examined individually or together with the light-source part. Furthermore, if we separate Fig. 7(a) into two parts, as shown in Fig. 7(b), we can clearly read and analyze the proportion of the total power consumption occupied by the SOA part. The green line with square marks shows that the denser-multiplexed MNS occupies a smaller proportion of the total power consumption.

Discussion
We have proposed a WDM structure called MNS, which can be implemented with various non-linear devices to improve the parallelism of ONNs, and a corresponding BP training algorithm to alleviate or even cancel the influence of the inevitable inter-channel crosstalk brought by the high parallelism of MNS. The performance comparison proves that the combination of the proposed MNS-based WDM-ONN and the new BP algorithm provides performance very similar to traditional ONNs, while the footprint of the physical system is decreased. Also, the power consumption of the MNS-based WDM-ONN declines significantly, by tens of times, as the parallelism of MNS increases. These results prove that our work paves the way towards a new sort of ONN architecture with smaller scale and lower energy consumption. Moreover, our work is demonstrated at a highly abstracted level and thus sets up a paradigm for a series of further works.

Figure 1 :
Figure 1: (a) A scheme of a traditional FCNN; the layers are connected by the black lines, which correspond to the weight matrix. The neurons separately realize the summation and non-linear activation function without influencing one another. (b) An example of a non-linear activation function and how it can be conceptually multiplexed in a single device.

Figure 2 :
Figure 2: (a) A block diagram of a WDM-ONN with an MNS structure. Multiple neurons are encoded on various wavelengths and input into the MNS. (b) The MNS structure in this work is realized by a multi-channel SOA. (c) A schematic connection picture for a WDM-ONN with a hidden layer composed of MNS.

Fig. 3(a) shows the output of Ch-2 versus the inputs of both Ch-1 and Ch-2, with the overall variation of the single-pass gain shown in the inset. When the input of Ch-2 remains constant, the gain decreases as the input of Ch-1 increases, and thus the output of Ch-2 decreases. This is obvious evidence of the crosstalk between Ch-1 and Ch-2. To further investigate the influence of the inputs on the output, we can calculate the partial derivatives of the output with respect to the inputs. As shown in Fig. 3(b) and (c), ∂P_out[Ch-2]/∂P_in[Ch-1] and ∂P_out[Ch-2]/∂P_in[Ch-2] are plotted, as these partial derivatives are fundamentally important elements in the BP training algorithm.

Figure 3 :
Figure 3: For a 2-channel SOA, (a) visualizes the output of Ch-2 versus the input of Ch-1 and the input of Ch-2. The inset shows the overall gain versus the input of Ch-1 and the input of Ch-2. The partial derivatives of the output with respect to the inputs are visualized in (b) ∂P_out[Ch-2]/∂P_in[Ch-1] and (c) ∂P_out[Ch-2]/∂P_in[Ch-2].
To compare the performance of the proposed ONN under different crosstalk levels, three G_ss values (G_ss = 20, 23 and 26 dB) are taken to represent the low, medium and high crosstalk levels. The value of P_sat is kept unchanged during training.

Figure 4 :
Figure 4: The term ∂P_out,k/∂P_in,i evaluates the crosstalk level brought by the i-th channel. The x-axis and the y-axis are G_ss and P_sat respectively, the two parameters affecting the crosstalk level. The red box indicates the origin of the inset on the right. The inter-channel crosstalk level clearly increases with G_ss.

Figure 5 :
Figure 5: (a) The scheme of the proposed ONN for simulation. (b)-(d) The performance of the proposed ONN with 2-channel, 4-channel and 6-channel multiplexing SOAs. The x-axis indicates the crosstalk level. The proposed ONN trained by the new BP algorithm demonstrates steady performance as the crosstalk level and the number of multiplexed channels increase. The one trained by the traditional BP algorithm suffers performance degradation induced by inter-channel crosstalk. (e)-(f) The performance improvement of the new BP algorithm over the traditional one rises as more channels of the SOAs in the proposed ONN are multiplexed. The new BP algorithm shows significant relevance for larger ONN networks with denser-multiplexed MNS structures.
In Fig. 5(b)-(d), the performance of the proposed ONN with 2-channel, 4-channel and 6-channel multiplexing SOAs as MNS is shown. The upper and lower rows of Fig. 5(b)-(d) correspond to the MNIST handwritten digits and fashion-MNIST tasks respectively. In each panel of Fig. 5(b)-(d), the solid line with round marks comes from the result of the proposed ONN trained by the new BP training algorithm. For comparison, the solid line with triangle marks is the result trained by the traditional BP training algorithm. The x-axis indicates the crosstalk level, while the left y-axis indicates the classification accuracy after training.
In Fig. 6(a) and (b), the training deviation of the proposed ONN trained by the new BP algorithm and by the traditional BP algorithm is shown for both classification tasks. The results for n-channel multiplexing SOAs (n = 2, 4 or 6) are presented in a row. In most cases, the training deviation of the proposed ONN trained by the new BP algorithm is lower than that trained by the traditional BP algorithm. Also, the accuracy deviation, defined as the fluctuation in accuracy during an individual training process of a certain ONN, is shown in Fig. 6(c) and (d) for both classification tasks. Taking the standard deviation of the accuracy of the last 10 iteration steps during an individual training, the accuracy deviation of the proposed ONN trained by the new BP algorithm proves to be much lower than that trained by the traditional BP algorithm, regardless of the crosstalk level and the number of multiplexed channels of the SOAs.

Figure 6 :
Figure 6: (a)-(b) The training deviation of the proposed ONN for both the MNIST handwritten digits and fashion-MNIST classification tasks. The training deviation of the proposed ONN with n-channel (n = 2, 4 or 6) multiplexing SOAs is shown separately in a row. (c)-(d) The accuracy deviation of the proposed ONN with n-channel (n = 2, 4 or 6) multiplexing SOAs is shown separately in a row. The x-axis indicates the crosstalk level in (a)-(d). The lower training deviation and accuracy deviation prove that the traditional BP algorithm causes the proposed ONN to converge along a direction deviating from the gradient.

Figure 7 :
Figure 7: The total power consumption and the MNS power consumption of a specific layer with 60 neurons are shown in (a). The detailed proportions of energy consumption are shown in (b). The denser-multiplexed MNS not only lowers the overall power consumption but also occupies a smaller share of the total power consumption.