Photonic Extreme Learning Machine based on frequency multiplexing

The optical domain is a promising field for physical implementation of neural networks, due to the speed and parallelism of optics. Extreme Learning Machines (ELMs) are feed-forward neural networks in which only output weights are trained, while internal connections are randomly selected and left untrained. Here we report on a photonic ELM based on a frequency-multiplexed fiber setup. Multiplication by output weights can be performed either offline on a computer, or optically by a programmable spectral filter. We present both numerical simulations and experimental results on classification tasks and a nonlinear channel equalization task.


1 Introduction
Feed-Forward Neural Networks (FFNs) are among the most employed machine learning algorithms due to their simplicity and their universal approximation property. The training procedure of FFNs is usually both time and power consuming, as it consists in optimizing each network weight via slow gradient descent algorithms. Extreme Learning Machines (ELMs, Figure 1a) are FFNs composed of a single hidden layer in which only the output weights are updated, usually in a single step, while the other parameters remain fixed during training, thereby speeding up the learning [7,9,8,6].
The ELM paradigm can be implemented in physical systems of various natures. The transformation that the chosen system performs between its input space and output space is analogous to the untrained set of internal connections of the ELM (Figure 1b). A system that maps its input space into a higher dimensional output space through a nonlinear transformation is expected to be a good candidate for an ELM. The training of such a "physical ELM" consists in the search for the optimal linear transformation which, acting on the system output, best approximates the desired target. The coefficients of this linear transformation are analogous to the output weights of the network.
The optical domain offers good parallelization capabilities, many nonlinearities and high speed; it is thus considered a promising substrate for neural network implementations [21]. Many schemes used to implement optical neural networks can also be used to implement ELMs. For instance, free-space propagation through scattering media, exploited by Diffractive Deep Neural Networks (D²NN) [12,22], has been employed for ELMs [18]; similarly, time-multiplexed fiber loops, extensively exploited for Reservoir Computing (RC) [1,16], have been employed for ELMs [15]. In these ELM implementations, the physical system output is recorded by a computer and the final transformation, i.e. the multiplication by output weights, is calculated digitally.
Here we present a photonic ELM based on frequency multiplexing (Figure 2), where information processing is mostly performed optically, including the multiplication by output weights. The states of both input and hidden nodes are encoded in the amplitudes of different lines of a frequency comb. The comb is generated by a Phase Modulator acting on monochromatic laser light. Input features are encoded in the amplitudes of the comb by a programmable spectral filter. The input layer is transformed into the hidden layer via frequency mixing carried out by a second Phase Modulator: this technique, introduced in Quantum Optics [13,14], has already been employed for optical Reservoir Computing [3]. A second programmable spectral filter is used either to apply the output weights, thus optically generating the output layer, or to scan the frequencies of the hidden layer comb, thus measuring the state of each hidden node. The only nonlinearity is the quadratic one performed by the readout photodiodes.
In Section 2 we describe the experimental setup and the model employed in numerical simulations. In Section 3 we describe all the phases of the experiment, from input to performance evaluation, including the training algorithm and the optical weighting scheme. In Section 4 we describe the results obtained on different classification tasks and on a Nonlinear Channel Equalization task, comparing them with simulations, other machine learning algorithms and previous literature. We also discuss the dependence of performance on the hyperparameters. Section 5 contains conclusions and perspectives.

2 Experimental system

2.1 Experimental setup
Our experimental setup is depicted in Figure 3. The light source is a C-Band continuous wave laser propagating in polarization-maintaining fibers. The two Phase Modulators, PM_1 and PM_2, are driven by the same Radio Frequency (RF) signal generator at frequency Ω/2π = 16.96860 GHz. Ω defines the spacing of the comb, as shown in Section 2.2, and its exact value is not important. The same RF signal goes through two amplifiers which provide two different fixed gains (hence the RF powers reaching the two PMs cannot be set independently, as only the RF generator power can be tuned). During the experiment, PM_1 and PM_2 are driven by RF powers of 30 dBm and 20 dBm respectively. The strength of modulation is better characterized by the dimensionless number m = πV/V_π, where V is the amplitude of the signal applied to the PM, and V_π is the PM characteristic voltage. In our setup, m_1 ≈ 7.87 and m_2 ≈ 2.18. The programmable spectral filters SF_1 and SF_2 are two Finisar Waveshapers, model 1000 and 4000 respectively. SF_1 is employed to encode the input, applying the proper attenuation to each component of the comb. SF_2, instead, allows applying two different filters simultaneously, redirecting the two results to two different outputs. The time to set a new spectral filter is approximately 500 ms. The two outputs of SF_2 are connected to two photodiodes, PD_1 and PD_2, and their readings are transferred to a computer. Each hidden node can be read by using SF_2 to implement the corresponding notch filter. Since two filters can be set simultaneously, up to two different nodes can be read at the same time. To perform
optical multiplication with output weights, instead, more complex filter shapes are set in SF_2, in such a way that each photodiode, integrating the optical power over the whole spectrum, measures a specific linear combination of comb component powers. The programmable filter SF_1 provides a 20 GHz bandwidth resolution, while SF_2 provides a 10 GHz resolution. Considering the value of Ω, equal to the spacing between comb lines, these filter resolutions should in principle be enough to fix the attenuation of each comb component separately. However, we measured a slight crosstalk effect between two adjacent lines filtered by SF_1, meaning that the value encoded on one input node may slightly influence the adjacent ones. Simulations suggest that this crosstalk has no effect on performance, but it could be avoided by increasing Ω or choosing a programmable spectral filter with finer resolution.

Figure 2: Conceptual scheme of the ELM based on frequency multiplexing. The physical system, in red, generates a frequency comb (input comb) and subsequently transforms it into a new comb (output comb), mixing its frequencies through phase modulation. The input layer is encoded in the input comb, hence the output comb plays the role of the hidden layer. Each hidden node is a linear combination of input nodes. The only nonlinearity is the quadratic one realised by the readout photodiodes.

2.2 Description of the electric field
A Phase Modulator acts on monochromatic laser radiation as follows:

E(t) = E_0 e^{iωt} e^{−im cos(Ωt)} = E_0 e^{iωt} Σ_k (−i)^k J_k(m) e^{ikΩt},   (1)

where E_0 is the input electric field amplitude, ω is the input electric field angular frequency, Ω is the RF frequency driving the PM, m is its modulation strength, and the J_k(m) are Bessel functions of the first kind. The series expansion of the term e^{−im cos(Ωt)} is known as the Jacobi-Anger expansion. The coefficients of this expansion decrease as |k| increases, thus the series can be truncated in numerical simulations.
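As a numerical sanity check of the truncated Jacobi-Anger expansion, the comb line amplitudes (−i)^k J_k(m) can be computed directly; since phase modulation only redistributes power among lines, the line powers J_k(m)² must sum to one. A minimal pure-Python sketch (the value m_1 ≈ 7.87 is taken from the text; the series length and truncation bound are illustrative choices):

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def bessel_j(n, x, terms=60):
    """Bessel function of the first kind J_n(x), via its power series.
    Negative orders use J_{-n}(x) = (-1)^n J_n(x)."""
    sign = 1
    if n < 0:
        n, sign = -n, (-1) ** n
    s = sum((-1) ** t / (math.factorial(t) * math.factorial(t + n))
            * (x / 2) ** (2 * t + n) for t in range(terms))
    return sign * s

def comb_lines(m, kmax):
    """Complex amplitude of comb line k after a phase modulator of strength m."""
    return {k: (-1j) ** k * bessel_j(k, m) for k in range(-kmax, kmax + 1)}

comb = comb_lines(7.87, 40)                      # m1 from the experiment
total_power = sum(abs(a) ** 2 for a in comb.values())
# Phase modulation conserves power: the line powers sum to 1 (up to truncation).
```

Note that this ideal model produces a symmetric comb, |J_{−k}(m)| = |J_k(m)|, which is exactly the symmetry that the second-harmonic correction discussed below breaks.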
We define E_comb the electric field at the output of PM_1; E_in the electric field at the output of SF_1; E_hidden the electric field at the output of PM_2; and E_readout,1 and E_readout,2 the two electric fields at the two outputs of SF_2, hence at the inputs of PD_1 and PD_2 (see Figure 3). These definitions reflect the function of the fields in the ELM context: E_comb represents the blank comb before any input is encoded on it, E_in represents the input layer of the ELM, and E_hidden represents the hidden layer. Note that E_readout,1 and E_readout,2 do not necessarily represent the output layer of the ELM: their content depends on how SF_2 is set, as described in Section 3.4, and needs postprocessing to reconstruct the actual output layer. The shapes of the programmable spectral filters are described by the attenuations that they apply to the frequencies ω + kΩ, which are the frequencies of the comb components, i.e. the frequencies of each node. We define F^in_k the attenuation that the filter set on SF_1 applies to the frequency ω + kΩ, and F^readout,1_k and F^readout,2_k the attenuations that the two filters set on SF_2 apply to the frequency ω + kΩ. Hence, the electric fields across the setup are described by the following equations:

E_comb(t) = E_0 e^{iωt} Σ_k (−i)^k J_k(m_1) e^{ikΩt} ≡ e^{iωt} Σ_k E^comb_k e^{ikΩt},   (2)

E^in_k = F^in_k E^comb_k,   (3)

E^hidden_k = Σ_j (−i)^{k−j} J_{k−j}(m_2) E^in_j,   (4)

E^readout,1_k = F^readout,1_k E^hidden_k,   (5)

E^readout,2_k = F^readout,2_k E^hidden_k.   (6)

The photodiodes PD_1 and PD_2 provide measurements of the overall optical intensity integrated over the whole spectral extension of the filtered comb:

I_{1,2} = ⟨|E_readout,1,2(t)|²⟩ = Σ_k |E^readout,1,2_k|²,   (7)

where ⟨·⟩ indicates a time average. The model described by Eq. (1) generates symmetrical input combs (Figure 4c), while the comb measured experimentally shows clear asymmetries (Figure 4a). The asymmetry suggests the presence of a second harmonic of the RF signal driving the Phase Modulators. In order to achieve realistic simulations, we correct Eq.
(1) as follows:

E(t) = E_0 e^{iωt} e^{−im cos(Ωt)} e^{−iεm cos(2Ωt+Φ)},   (8)

where the second exponential factor accounts for a second harmonic effect whose strength is quantified by ε. Simulations can still be performed easily, since the two exponential factors featuring cos(Ωt) and cos(2Ωt + Φ) can be expanded in two Jacobi-Anger series, as follows:

e^{−im cos(Ωt)} = Σ_{k1} (−i)^{k1} J_{k1}(m) e^{ik1Ωt},   (9)

e^{−iεm cos(2Ωt+Φ)} = Σ_{k2} (−i)^{k2} J_{k2}(εm) e^{ik2(2Ωt+Φ)},   (10)

where, as before, the sums can be truncated when the coefficients get small enough. After manipulating the indexes in Eq. (10), we can correct Eq. (2) to account for comb asymmetries:

E^comb_k = E_0 Σ_{k2} (−i)^{k−k2} J_{k−2k2}(m_1) J_{k2}(εm_1) e^{ik2Φ}.   (11)

The values of ε and Φ have been fitted to match experimental measurements of the combs generated by PM_1. We found ε = 0.0471 and Φ = 1.31 rad. The new equation provides a more realistic comb, as shown in Figure 4b.
A similar expression can be derived for Eq. (4), but it appears unnecessary, since PM_2 is driven by a weaker RF signal and exhibits weaker nonlinearity.
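The full forward model, comb generation by PM_1, input encoding by SF_1, and frequency mixing by PM_2 (the mixing referenced as Eq. (4) above), amounts to multiplying line amplitudes by filter attenuations and then convolving with the Jacobi-Anger coefficients of PM_2. A hedged sketch of this pipeline, ignoring the second-harmonic correction (the encoding pattern and truncation bound are illustrative; the feature values are invented for the example):

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def bessel_j(n, x, terms=60):
    """Bessel function of the first kind J_n(x) via its power series."""
    sign = 1
    if n < 0:
        n, sign = -n, (-1) ** n
    return sign * sum((-1) ** t / (math.factorial(t) * math.factorial(t + n))
                      * (x / 2) ** (2 * t + n) for t in range(terms))

def phase_modulate(lines, m, kmax):
    """Frequency mixing: output line k is a linear combination of input
    lines j, weighted by (-i)^(k-j) J_(k-j)(m)."""
    return {k: sum((-1j) ** (k - j) * bessel_j(k - j, m) * a
                   for j, a in lines.items())
            for k in range(-kmax, kmax + 1)}

KMAX = 30
laser = {0: 1.0}                                  # monochromatic source
comb = phase_modulate(laser, 7.87, KMAX)          # comb after PM1 (m1)
# SF1 encodes input features as amplitude attenuations on the central lines;
# the three feature values below are purely illustrative.
features = {-1: 0.2, 0: 0.9, 1: 0.5}
encoded = {k: a * features.get(k, 1.0) for k, a in comb.items()}
hidden = phase_modulate(encoded, 2.18, KMAX)      # hidden layer after PM2 (m2)
hidden_powers = {k: abs(a) ** 2 for k, a in hidden.items()}
```

Since phase modulation only redistributes power among lines, the total power of the hidden comb matches that of the encoded comb up to truncation error, which is a useful check on any implementation.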
Regardless of the number of input nodes, the hidden layer is always considered to be composed of 31 nodes, i.e. only the 31 most central lines of the comb are read and linearly combined. This is because, given the values of m_1 and m_2, the components E^hidden_k with |k| > 15 are always too weak to encode information, as shown in Figure 5. Note that E^hidden_k may be too weak to be measured even for certain k ∈ [−15, 15], but these "silent" nodes are not expected to affect our training algorithm (see Section 3.2).

3 Principle of operation

3.1 Notation
In the following we indicate with u a single set of input features supplied to the ELM, i.e. a single input layer; with ỹ its corresponding target output layer, i.e. the correct output expected from a well trained network; and with h the hidden layer of the ELM. The output layer y is generated by multiplying the hidden layer h with the set of output weights W: y = h · W. The multiplication can be performed digitally or optically, as described in the following. u and h are row vectors; y and ỹ are scalars if the task requires only one output node, or row vectors otherwise; W is a column vector if the output layer contains only one node, or a matrix otherwise. To describe the training phase, it is useful to collect all the input layers submitted to the network, all the corresponding hidden layers and all the corresponding target outputs in matrices. We hence define U, H and Ỹ in such a way that the i-th row of U represents the i-th set of input features submitted to the network, the i-th row of H represents the corresponding hidden layer generated by the network, and the i-th row of Ỹ represents the corresponding target output layer. U and H are matrices, while Ỹ is a column vector if the task requires only one output node, or a matrix otherwise.

3.2 Training algorithm
The training consists in finding the optimal set of output weights, W, such that when the input u is presented to the network, the output layer approximates the corresponding target ỹ, i.e. y = h · W ≈ ỹ. Note that only the output weights W are trained, while the internal mechanism which transforms the input layer into the hidden layer is left untouched. Hence the ELM training consists in a single operation and does not require slow gradient descent algorithms. In this work we employ the ridge regression algorithm to estimate the optimal set of output weights W. Ridge regression consists in the minimization of the quantity

‖HW − Ỹ‖² + λ‖W‖²,   (12)

where λ is a regularization parameter whose purpose is described below. The W minimizing (12) is

W = (H^T H + λI)^{−1} H^T Ỹ,   (13)

where (·)^T indicates the transposed matrix and (·)^{−1} the inverse matrix. In our system no hidden node will measure exactly zero, and the algorithm may erroneously attribute importance to dark noise, assigning enormous weights to silent hidden nodes. The regularization parameter λ defines a penalty for having large components in the vector W, thus preventing this error. The optimal value of λ depends on the task and is obtained by testing different possibilities. It is worth introducing here also the Ordinary Least Squares (OLS) estimation, which is equivalent to ridge regression with λ = 0. As described in Section 3.4, OLS is employed during the optical weighting. The solution in this case is

W = pinv(H) Ỹ,   (14)

where pinv(H) = (H^T H)^{−1} H^T is the Moore-Penrose inverse of H.
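The closed-form ridge solution W = (H^T H + λI)^{−1} H^T Ỹ is straightforward to implement. A small self-contained sketch for a single output node, with a Gaussian-elimination solver (the toy matrix H and targets are illustrative; in the experiment H holds the measured hidden-node powers):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(H, y, lam):
    """Output weights W = (H^T H + lam*I)^(-1) H^T y for one output node."""
    n_feat = len(H[0])
    HtH = [[sum(row[i] * row[j] for row in H) + (lam if i == j else 0.0)
            for j in range(n_feat)] for i in range(n_feat)]
    Hty = [sum(row[i] * yi for row, yi in zip(H, y)) for i in range(n_feat)]
    return solve(HtH, Hty)

# Toy example: targets are exactly 2*h1 - h2, so with a tiny lambda the
# regression recovers those weights.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
W = ridge(H, y, 1e-9)
```

For multiple output nodes the same normal matrix H^T H + λI is reused and the solve is repeated per column of Ỹ, which is why ridge training remains a single cheap operation.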

3.3 Dataset preprocessing and input
Usually, in an FFN the number of input nodes equals the number of features of the dataset. Nonetheless, our experimental scheme allows supplying the same input feature to d input nodes, with d ≥ 1. We provided for this possibility since some components of E_comb may be too weak to properly encode an input. For instance, in Fig. 4a, the comb component for k = 4 is almost vanishing: if an input feature is encoded on the amplitude of this component, it will be negligible compared to the other input features. Setting d > 1 proved useful to avoid this risk, as discussed in Section 4.1.
The preprocessing of the input data consists in the following operations. The input dataset is rescaled in such a way that each input feature assumes values in the range [0, 1]. Then, the feature values are linearly converted into attenuations in the range [−30 dB, 0 dB]. Finally, each input entry u is stretched according to the selected value of d. If u contains N elements, it is transformed as follows:

u = (u_1, u_2, …, u_N) → (u_1, …, u_1, u_2, …, u_2, …, u_N, …, u_N),   (15)

where each element is repeated d times. After being preprocessed, the feature vector u contains the attenuations to be applied to the comb. In our experiments we always encoded the input in the most central part of the frequency comb, where most of the optical power is contained. Thus, for example, if u (after the stretching operation) contains M = N · d elements and M is odd, the first attenuation u_1 is assigned to F^in_{−(M−1)/2} and the last one, u_M, is assigned to F^in_{(M−1)/2}. The remaining part of the filter F^in, i.e. the part acting on comb lines not encoding any input node, is set to zero attenuation. Not filtering out unused parts of the input comb proved to be beneficial for tasks requiring few input nodes when operating at low d, most probably because this lets more power into the system, hence leads to a richer hidden layer.
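The preprocessing chain, rescaling to [0, 1], conversion to attenuations in [−30 dB, 0 dB], and the stretching of Eq. (15), can be sketched as below. The direction of the dB mapping (larger feature value, less attenuation) is our assumption; the text only fixes the two ranges:

```python
def preprocess(dataset, d):
    """Rescale each feature column to [0, 1], convert values to dB
    attenuations in [-30, 0], and repeat each feature on d consecutive
    input nodes (the "stretching" operation)."""
    n_feat = len(dataset[0])
    lo = [min(row[i] for row in dataset) for i in range(n_feat)]
    hi = [max(row[i] for row in dataset) for i in range(n_feat)]
    out = []
    for row in dataset:
        entry = []
        for i, v in enumerate(row):
            scaled = (v - lo[i]) / (hi[i] - lo[i])   # feature in [0, 1]
            atten_db = -30.0 * (1.0 - scaled)        # assumed mapping direction
            entry.extend([atten_db] * d)             # stretch to d nodes
        out.append(entry)
    return out

data = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]       # toy two-feature dataset
stretched = preprocess(data, d=2)
# -> first entry [-30, -30, -30, -30], second [0, 0, 0, 0]
```

With d = 2 the two features occupy four consecutive comb lines, centered on the comb as described above.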

3.4 Weight estimation
A task is defined by the set of input features recorded in the matrix U, and the set of target outputs recorded in Ỹ, which, as described before, may be a vector or a matrix according to the number of output nodes required. For each task, U is preprocessed as described in Section 3.3, then it is split in two parts: 70% of the entries constitute the "train dataset", and the remaining 30% constitutes the "test dataset". The train dataset is employed to estimate the optimal set of weights W, while the test dataset is employed to evaluate the performance of the trained network. To gather statistics about the performance, for each task we tested different random repartitions into train and test datasets.
First, the optimal set of weights W has to be estimated. For each input layer contained in the train dataset, the corresponding hidden layer is recorded, hence building the matrix H. Each hidden layer node is read by loading the proper notch filter, i.e. a filter selecting only the desired comb component, on SF_2 and redirecting its power towards one of the photodiodes. To speed up the procedure, we exploited the dual-output capabilities of SF_2, setting two different notch filters at the same time, hence selecting two different comb lines simultaneously and redirecting them towards PD_1 and PD_2. Once H is recorded, the ridge regression algorithm described in Eq. (12) is applied to estimate the optimal output weights W.
Once W has been estimated, the performance of the ELM is evaluated on the test dataset, comparing the network outputs with the target ones. The output layers are obtained by multiplying the hidden layers by the output weights. This multiplication can be performed digitally or optically, as described in the following.
Digital weighting.For each entry u of the test dataset, the corresponding hidden layer h is recorded by using notch filters, as described above.Then, the output layer y = h • W is calculated on the computer.
Optical weighting. For simplicity, first suppose that the output layer is composed of a single node, hence W is a column vector. Two sets of weights, W_+ and W_−, are generated from W: the first contains only the positive weights, with zeros in place of the negative ones; the second contains only the negative weights, taken without sign, with zeros in place of the positive ones. Note that, by definition, the vectors W_+ and W_− cannot both contain a non-zero element in the same position. Two different filter shapes, F_+ and F_−, are then generated starting from W_+ and W_− respectively. The procedure is similar to that employed to generate F^in: the weights are rescaled to the range [0, 1] and then linearly converted into attenuations in the range [−30 dB, 0 dB], with the exception of the weights equal to zero, which are converted into a complete block state. The readout proceeds as described by Eqs. (5) and (6), with F^readout,1 = F_+ and F^readout,2 = F_−. Figure 6 contains an example of readout spectral filters employed during the experiment. The results of the application of these two sets of weights are read by the two photodiodes PD_1 and PD_2. Since the photodiodes integrate power over the whole spectrum, their readings are equivalent to two linear combinations of hidden node powers, whose coefficients are the attenuations in F_+ and F_−. The output node is reconstructed as

y = C_+ I_1 − C_− I_2 + C_0,   (16)

where I_1 and I_2 represent the readings from the two photodiodes. The set of coefficients C = (C_+, C_−, C_0) could in principle be obtained from W. Nonetheless, we employed 10% of the acquired data to learn the optimal set of coefficients C through the Ordinary Least Squares algorithm. If the first n entries of U are employed to train C, adapting Eq.
(14), the optimal set of coefficients is given by

(C_+, −C_−, C_0)^T = pinv(M) (ỹ^1, …, ỹ^n)^T,   with the i-th row of M being (I^i_1, I^i_2, 1),   (17)

where I^i_1, I^i_2 and ỹ^i are, respectively, the two intensity readings and the target output value corresponding to the i-th entry of the input dataset. Note that the column of ones in the inverted matrix is required to learn the optimal offset C_0. The set of coefficients learnt in this way performs better than the one that could be obtained from W, since in this last training phase C is adjusted to compensate both for the presence of dark noise in the measurements and for the difference between the responses of PD_1, measuring I_1, and PD_2, measuring I_2. Note that the coefficients C are not universal, i.e. they have to be calculated for each task, because they also account for the normalization of the task-dependent weights.
If the task requires an output layer composed of more than one node, the procedure here described is repeated multiple times, employing different sets W + , W − and C for each output node.
We found that the optical weighting configuration often provides better performance than the digital weighting one (see Section 4). This effect is most probably due to the extra training phase introduced in the optical weighting mode, as described by Eq. (16).
Finally, we point out that the optical weighting scheme does not intrinsically require a computer to compute differences: a differential amplifier is sufficient to evaluate I_1 − I_2.
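The sign-splitting at the heart of the optical weighting can be checked numerically: with ideal filters and ideal coefficients C = (1, 1, 0), the difference of the two photodiode readings reproduces h · W exactly. A sketch (the dB conversion and the OLS fit of C are omitted for clarity; the hidden-node powers and weights are illustrative):

```python
def split_weights(W):
    """Separate W into non-negative filter shapes: W+ keeps the positive
    weights, W- keeps the magnitudes of the negative ones."""
    w_plus = [w if w > 0 else 0.0 for w in W]
    w_minus = [-w if w < 0 else 0.0 for w in W]
    return w_plus, w_minus

def photodiode(powers, filt):
    """A photodiode integrates over the spectrum: a linear combination of
    hidden-node powers weighted by the filter attenuations."""
    return sum(f * p for f, p in zip(filt, powers))

h = [0.3, 1.2, 0.7, 0.05]            # hidden-node powers (illustrative)
W = [0.5, -0.2, 0.1, -0.4]           # trained output weights
w_plus, w_minus = split_weights(W)
I1 = photodiode(h, w_plus)           # reading of PD1
I2 = photodiode(h, w_minus)          # reading of PD2
y = I1 - I2                          # ideal C = (1, 1, 0): equals h . W
```

In the experiment the learned coefficients C absorb the weight normalization, photodiode response mismatch and dark-noise offsets that this idealized sketch ignores.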

4 Results
We mainly tested the ELM on classification problems, such as Iris [10] and Wine [20] Classification, as well as Banknote Authentication [2]. In classification tasks the ELM is required to assign the correct class to each "sample", i.e. to each set of input features. The network has as many output nodes as there are possible classes, and after each readout the class corresponding to the node attaining the highest value is taken as the prediction of the network. Note that if only two classes are present, one output node is enough to encode the prediction (if y = h · W ≤ 0.5 the network predicts the first class, otherwise it predicts the second one). Experimental results are compared both with simulations and with the scores obtained by a Support Vector Machine (SVM). We also considered the Nonlinear Channel Equalization problem [11], which is well known in the Reservoir Computing community and is described below. The results on this task are compared both with simulations and with other experimental results in the literature.
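The decision rules just described are simple post-processing steps on the output layer; a minimal sketch:

```python
def predict_multiclass(output_layer):
    """Winner-take-all: the predicted class is the index of the output
    node attaining the highest value."""
    return max(range(len(output_layer)), key=lambda i: output_layer[i])

def predict_binary(y, threshold=0.5):
    """Single output node encoding two classes: class 0 if y <= 0.5,
    class 1 otherwise."""
    return 0 if y <= threshold else 1
```

For example, an output layer of [0.1, 0.7, 0.3] yields class 1, while a single-node reading of 0.4 yields the first class.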
Iris Classification. The Iris Classification task consists in selecting the correct class among three different ones, given a set of four different features. The ELM is thus composed of 4 input nodes and 3 output nodes, one for each possible output class. Performances on the Iris Classification task are reported in Figure 7.
Wine Classification. The Wine Classification task consists in selecting the correct class among three different ones, given a set of thirteen different features. The ELM is thus composed of 13 input nodes and 3 output nodes, one for each possible output class. Performances on the Wine Classification task are reported in Figure 8. In digital weighting mode the ELM reached an accuracy of 97.5% (setting d = 1 and λ = 10⁻⁶). The average accuracy recorded over 10 optical weighting runs was 94.4% (setting d = 1). A Support Vector Machine reached an accuracy of 97.8%.
Banknote Classification. The Banknote Classification task consists in selecting the correct class among two different ones, given a set of five different features. The ELM is thus composed of 5 input nodes and 1 output node, which is enough to encode the two possible classes. Performances on the Banknote Authentication task are reported in Figure 9. In digital weighting mode the ELM reached an accuracy of 99.4% (setting d = 1 and λ = 10⁻⁵). The average accuracy recorded over 10 optical weighting runs was 98.8% (setting d = 1). A Support Vector Machine reached an accuracy of 100%.
Nonlinear Channel Equalization. The Nonlinear Channel Equalization task consists in reconstructing a signal after transmission through a channel which induces a nonlinear distortion and has memory. The input signal is a sequence of random symbols u(t) uniformly drawn from the set {−3, −1, 1, 3}. This signal first goes through a linear channel exhibiting memory effects:

q(t) = 0.08u(t+2) − 0.12u(t+1) + u(t) + 0.18u(t−1) − 0.1u(t−2) + 0.091u(t−3) − 0.05u(t−4) + 0.04u(t−5) + 0.03u(t−6) + 0.01u(t−7),

and then through a noisy nonlinear channel:

x(t) = q(t) + 0.036 q(t)² − 0.011 q(t)³ + ν(t),

where ν(t) is a Gaussian noise with power selected to achieve a desired Signal to Noise Ratio (SNR). For each timestep t, the channel outputs x(t−7), x(t−6), ..., x(t+1), x(t+2) are supplied to the ELM and the task consists in reconstructing u(t). Thus, this task is equivalent to a classification into four possible classes given ten input features. Unlike the previous tasks, here we employ only one output node and we take as output of the ELM the value in the set {−3, −1, 1, 3} closest to the output node value. We tested the performance over different SNR values, ranging from 8 dB to 24 dB, and in a no-noise configuration. Performance on the Nonlinear Channel Equalization task is evaluated by the Symbol Error Rate (SER), i.e. the ratio between errors and total transmitted symbols, and is reported in Figure 10. These results are obtained setting d = 2 and the best performing λ value for each SNR (selected λ values belong to the range [10⁻¹⁰, 10⁻⁵]). In terms of SER, our ELM running in optical weighting mode outperforms by almost one order of magnitude a previous optical implementation of a time-multiplexed ELM [15].
Our ELM was tested on 1000 symbols, hence SERs below 10⁻³ are undetectable. Increasing the number of input symbols by an order of magnitude is currently experimentally unfeasible, due to the slow settling time of the programmable filters. However, in numerical simulations we found SERs of 2.3 × 10⁻⁴ for an SNR of 28 dB, 7.2 × 10⁻⁵ for an SNR of 32 dB, and 2.8 × 10⁻⁵ in a no-noise configuration. These simulated performances are comparable with the ones obtained by Reservoir Computers (RC) reported in the literature [19,17,5,4], and, in some cases, even almost one order of magnitude better. Note that these Reservoir Computing approaches also rely on the capability of the network to memorize previous inputs, since only the current state of the channel is supplied as input at each timestep. Contrary to RCs, an ELM does not have memory of past inputs, since the network features no recurrence. As a consequence, for the Channel Equalization task, memory has to be implemented outside the network: both in our case and in [15] it is implemented in the script generating the input layers, as described above.
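The benchmark data for this task can be generated in a few lines. Only the last three taps of the linear channel survive in the text above; the full tap set below is the standard Nonlinear Channel Equalization benchmark that those taps match, so treat the remaining coefficients as an assumption (noise is omitted, i.e. the no-noise configuration):

```python
import random

# Linear channel taps: q(t) = sum_k TAPS[k] * u(t - k); negative k are
# "future" symbols. Only the k = 5, 6, 7 taps appear explicitly in the text;
# the others are assumed from the standard benchmark.
TAPS = {-2: 0.08, -1: -0.12, 0: 1.0, 1: 0.18, 2: -0.1,
        3: 0.091, 4: -0.05, 5: 0.04, 6: 0.03, 7: 0.01}

def channel(u):
    """Noiseless channel: FIR filter with memory, followed by the
    polynomial nonlinearity x = q + 0.036 q^2 - 0.011 q^3."""
    n = len(u)
    q = [sum(c * u[t - k] for k, c in TAPS.items() if 0 <= t - k < n)
         for t in range(n)]
    return [qi + 0.036 * qi ** 2 - 0.011 * qi ** 3 for qi in q]

rng = random.Random(0)
u = [rng.choice([-3, -1, 1, 3]) for _ in range(1000)]   # symbol sequence
x = channel(u)   # the ELM sees windows of x and must reconstruct u(t)
```

Each training entry is then the window (x(t−7), ..., x(t+2)), which is how the task's memory is moved outside the memoryless ELM.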

4.1 Dependence on hyperparameters
The system has been simulated according to the model described in Section 2.2. For each input layer, the corresponding hidden layer is simulated and the output layer is calculated as described in Section 3.4 in the digital weighting case. The simulation allows us to evaluate performance while systematically scanning the hyperparameters d, m_1 and m_2. Note that such an accurate scan is unfeasible in the experimental setup, both because of the prohibitive time it would require and because m_1 and m_2 cannot be set independently. We scanned the performance of each tested task: Iris (Figure 12) and Wine classification (Figure 13), Banknote authentication (Figure 14) and Nonlinear Channel Equalization (Figures 15 and 16).
The simulated scans allow two observations about the working mechanism of this ELM. First, when d = 1, the performance is extremely dependent on m_1 and shows sharp drops for certain values of this hyperparameter. We found that the positions of these drops depend on the arrangement of the input features. Since this effect is strongest when d = 1, we conclude that drops in performance happen when an important input feature is assigned to a comb component too weak to encode it properly. Second, when d = 2, the Nonlinear Channel Equalization and Wine classification tasks perform poorly when m_1 is too low. These two tasks require many features, respectively 10 and 13, which, when d = 2, are encoded in 20 and 26 input nodes respectively. Hence, they can be completely encoded only when the input comb is large enough, that is, when m_1 is large enough. This last effect also applies to all the other tasks when d = 3. Scans of the performance when d = 3 do not display any additional interesting feature and are not reported here.
Simulation scans also suggest that a high m_2 parameter is not a prerequisite for good performance. In Fig. 11 we plot the simulated accuracy versus m_2, keeping m_1 equal to the experimental value of 7.87, for the two selected tasks which are the most sensitive to m_2 variations. m_2 determines how strongly the input nodes are mixed to generate the hidden layer; hence, when m_2 = 0, the hidden layer is an exact copy of the input one. We checked that in this situation the network performs similarly to a perceptron, i.e. a machine learning algorithm whose output is simply a linear combination of the input features.

The arrangement of the input features is also a free parameter. As described in Section 3.3, during the experiments we encoded the inputs in the central part of the comb, assigning each feature to d consecutive comb lines. However, alternative schemes could be employed: for example, the same feature could be assigned to d non-consecutive lines, or features could be encoded in the most powerful lines of the comb, regardless of their position. Numerical simulations suggest that these approaches do not affect performance noticeably, but they could be investigated further in the future.

5 Conclusion
An Extreme Learning Machine consists in a randomly initialized Feed-Forward Neural Network in which only the output connections are trained. This concept can be translated from software to real physical substrates, exploiting the transformation that a physical system performs between its input and output spaces. We demonstrated the feasibility of an ELM implemented in a frequency-multiplexing optical fiber setup, where the multiplication by output weights can also be performed optically. Our experiment can be interpreted as an interferometer in the frequency domain, and is very stable: weights learned one day can be used the day after with no recalibration.
The current scheme is affected by two main limitations. The first is the speed of execution of the experiment, currently limited by the settling time of the programmable filters, which is ∼500 ms. We expect that an update rate at least comparable to the 60 Hz video frequency can be achieved by employing LCD-based optical filters. The second limitation concerns the topology of the network. The number of input nodes could be increased by increasing the power of the RF signal applied to PM_1. However, the strength of the mixing, i.e. the number of input nodes contributing to the state of a hidden node, depends only on the power of the RF signal applied to PM_2.
The parallelization capabilities offered by the optical domain remain to be tested. For example, using more than one input wavelength could lead to improvements in the scheme: one could have multiple overlapping or non-overlapping combs, which could enrich the dynamics, increase the size of the input and hidden layers, or even allow the execution of multiple tasks in parallel. This will be studied both numerically and experimentally in the future.

Figure 1 :
Figure 1: An ELM (a) is trained by acting on the output weights, W, in green, while the weights between input and hidden layer, W_int, in red, are randomly selected and kept fixed. In a physical implementation of an ELM (b) the untrained connections between input and hidden layer are substituted with the action of a physical system; the output of this system constitutes the hidden layer of the network.

Figure 3 :
Figure 3: Scheme of the experimental setup. Red lines represent optical connections, green lines represent input from the computer and blue lines represent RF connections. The first Phase Modulator, PM 1 , generates a frequency comb out of monochromatic laser radiation. The first programmable spectral filter, SF 1 , encodes input features in this comb, thus generating the input layer. The second Phase Modulator, PM 2 , mixes the input comb components, generating the hidden layer. The second programmable spectral filter, SF 2 , is employed for the readout. The two photodiodes PD 1 and PD 2 provide an integrated reading of all the optical power impinging on them. A computer drives the programmable filters (connections not shown) and records the photodiode measurements.

Figure 4 :
Figure 4: Comb intensities after PM 1 corresponding to the parameters reported in the text. Measurements (a); simulation accounting for the second harmonic correction (b); simulation without correction (c).

Figure 5 :
Figure 5: Two hidden combs generated from two different input layers during the experiment. Note that the comb contains about 31 lines. k = 0 corresponds to the frequency of the laser source.

Figure 6 :
Figure 6: Typical readout filters employed during optical weighting mode. The attenuation corresponding to the complete-block state is plotted as 30 dB.

Figure 7 :
Figure 7: Experimental and simulation results for the Iris Classification task. The boxplot diagram describes statistics obtained from 100 cross-validation tests in the case of digital weighting and simulation, and 10 different runs of the experiment in the case of optical weighting. The extremes of the colored boxes represent the first and third quartiles of the score distributions; horizontal lines external to the boxes represent the minimum and the maximum of the score distributions; horizontal lines inside the colored boxes represent the median of the score distributions. Note that the scores are quantized, hence these elements can be superimposed. This is the case, for example, of the optical weighting scores for d = 2: the only recorded accuracies were 92.3% and 100% (corresponding to one error and no errors, respectively); hence, the minimum value equals the first quartile, while the median equals the third quartile and the maximum value.

Figure 8 :
Figure 8: Experimental and simulation results for the Wine Classification task. The boxplot diagram (see Figure 7) describes statistics obtained from 100 cross-validation tests in the case of digital weighting and simulation, and 10 different runs of the experiment in the case of optical weighting.

Figure 9 :
Figure 9: Experimental and simulation results for the Banknote Authentication task. The boxplot diagram (see Figure 7) describes statistics obtained from 100 cross-validation tests in the case of digital weighting and simulation, and 10 different runs of the experiment in the case of optical weighting.

Figure 10 :
Figure 10: Experimental and simulation results for the Nonlinear Channel Equalization task. All the experiments are executed with d = 2 and λ = 10 −9 over 1000 transmitted symbols. The boxplot diagram (see Figure 7) describes statistics obtained from 100 cross-validation tests in the case of digital weighting and simulation, and 10 different runs of the experiment in the case of optical weighting. Downward-pointing arrows indicate that no errors were recorded after 1000 transmissions, hence SER < 10 −3 . Note that in the two less noisy configurations (SNR = 24 dB and no noise) digital weighting always recorded either no error or only one error per run.

Figure 11 :
Figure 11: Simulated accuracy varying m 2 with m 1 = 7.87 for the Iris classification (a) and the banknote authentication (b) tasks. When m 2 is too small, the mixing effect provided by the Phase Modulator PM 2 is negligible and the hidden layer is identical to the input one; thus the accuracies are comparable to the ones obtained by a perceptron.

Figure 12 :
Figure 12: Simulated accuracy for the Iris Classification task with d = 1 and d = 2 as a function of m 1 and m 2 . Higher is better.

Figure 13 :
Figure 13: Simulated accuracy for the Wine Classification task with d = 1 and d = 2 as a function of m 1 and m 2 . Higher is better.

Figure 14 :
Figure 14: Simulated accuracy for the Banknote Authentication task with d = 1 and d = 2 as a function of m 1 and m 2 . Higher is better. Sharp drops in performance can be noted in this plot when d = 1 and m 1 ≈ 5.3 or m 1 ≈ 8.6.

Figure 16 :
Figure 16: Simulated Symbol Error Rate for the Nonlinear Channel Equalization task with SNR = 12 dB and with d = 1 and d = 2 as a function of m 1 and m 2 . Logarithmic scale, lower is better. Note the sharp drop in performance when d = 1 and m 1 ≈ 7.9, due to the particular shape of the input comb, which is unable to encode an important feature.