Development of Use-specific High Performance Cyber-Nanomaterial Optical Detectors by Effective Choice of Machine Learning Algorithms

Due to their inherent variabilities, nanomaterial-based sensors are challenging to translate into real-world applications, where reliability/reproducibility is key. Recently we showed that Bayesian inference can be employed on engineered variability in layered nanomaterial-based optical transmission filters to determine optical wavelengths with high accuracy/precision. In many practical applications, the sensing cost/speed and long-term reliability can be equally or more important considerations. Though various machine learning (ML) tools are frequently used on sensor/detector networks to address these, their effectiveness on nanomaterial-based sensors has not been explored. Here we show that the best choice of ML algorithm in a cyber-nanomaterial detector is mainly determined by specific use considerations, e.g., accuracy, computational cost, speed, and resilience against drifts/ageing effects. When sufficient data/computing resources are provided, the highest sensing accuracy can be achieved by the kNN and Bayesian inference algorithms, but these can be computationally expensive for real-time applications. In contrast, artificial neural networks are computationally expensive to train, but provide the fastest result under testing conditions and remain reasonably accurate. When data is limited, SVMs perform well even with small training sets, while other algorithms show considerable reduction in accuracy if data is scarce, hence setting a lower limit on the size of required training data. We show that by tracking/modeling the long-term drifts of the detector performance over a long (1 year) period, it is possible to improve the predictive accuracy with no need for recalibration. Our research shows for the first time that if the ML algorithm is chosen specific to the use-case, low-cost solution-processed cyber-nanomaterial detectors can be practically implemented under diverse operational requirements, despite their inherent variabilities.

nanomaterials, nano-tubes, quantum-dots, etc., can be found in the fabrication of optical detectors, 1-3 molecular and bio-sensors, [4][5][6][7][8] ion and radiation sensors, 9 chemical sensors, 10,11 gas sensors, 12,13 temperature sensors 14 and many other cases of detection and sensing. There are many aspects that make nanomaterials promising candidates for these applications compared to bulk materials. For instance, their enhanced optoelectronic and novel chemical/physical properties make them efficient choices for sensing, while their small dimensions lead to devices with lower power consumption and smaller size. In many cases, nanomaterials are much more attractive than conventional semiconductor sensors due to their low cost, earth-abundant availability, and compatibility with affordable solution-processable techniques. Their high surface-to-volume ratio makes them highly sensitive as chemical sensors, whereas their quantum confinement or excitonic processes enable them to be excellent target-specific photodetectors. As a result, over the past decades, there has been tremendous progress in fundamental understanding and proof-of-concept demonstrations of chemical, biological, optical, radiological and a variety of other sensors using nanomaterials. [1][2][3][4][5][7][8][9][10][11][12][13][14][15][16] However, there exist many challenges in real-world implementation of sensors made from nanomaterials; above all, the difficulty of reproducing them, since the size and physical location of the fabricated nanomaterials on the substrate are unpredictable and uncontrollable. Moreover, nanomaterials undergo gradual decay in ambient conditions, called "drift", i.e. they are not very stable; there is also often large noise in their measurements because, owing to their small size, nanomaterials not only respond to what they are designed to measure, but are also very sensitive to many other conditions in their environment.
These shortcomings, not to mention the gradual decay of nanomaterials, have introduced huge challenges in the mass production of reliable devices from them, where predictable and controllable manufacturing processes are essential to the industry.
In recent decades, the emergence of machine learning (ML) has demonstrated a great potential for enhancing statistical analysis in the field of material science. Nowadays, ML provides popular tools for obtaining information from internet of things (IoT) networks [17][18][19][20][21] such as charge-coupled devices (CCDs), 22,23 complementary metal-oxide-semiconductor (CMOS) detectors, [24][25][26] or regular Silicon-based spectrometers, which are examples of sophisticated networks of optical detectors. 27,28 In physics, on one hand, people employ machine learning to analyze, predict, or interpret physical quantities; on the other hand, underlying physical principle has also been employed to facilitate designing effective machine learning tools. 29,30 ML methods have been successfully applied for accelerated discovery [31][32][33] and development of materials and metamaterials with targeted properties, [34][35][36][37][38][39][40] predicting chemical [41][42][43][44] and optoelectronic properties of materials, [45][46][47][48] and synthesizing nanomaterials. 49 The variations in nanomaterial properties are usually considered as "noise" and various experimental or statistical approaches are often pursued to reduce these variations or to capture the useful target data from noisy measurements. [50][51][52][53][54] However, the direct applications of the data analytic approaches have never been sought on the variability of the nanomaterials themselves to utilize these variations as information instead of treating them as noise. In the context of sensing applications, one way to overcome the aforementioned challenges of nanomaterials is to use ML on a multitude of sensors in order to extract relevant response patterns towards achieving accurate, reliable, and reproducible sensing outcomes.
Our previous work (Hejazi et al., 2019 15,55 ) demonstrated the power of using advanced data analytics on the measured data from a few uncontrolled, low-cost, easy-to-fabricate semiconducting nanomaterials in order to estimate the peak wavelength of any incoming monochromatic/near-monochromatic light over the spectrum range of 351-1100 nm with high precision and accuracy, in which we created the world's first cyber-physical optical detector. In that work, we applied Bayesian inference on the optical transmittance data of 11 nanomaterial filters fabricated from two transition-metal dichalcogenides, MoS2 and WS2 (see Fig. 1). We were also able to reduce the number of filters to two via step-wise elimination of the least useful filters and still achieve acceptable results. We also discussed that it is possible to choose suitable materials for desired spectrum ranges for optical filter fabrication.
In the present work, our aim is to augment our analytical tools by employing various ML techniques, compare their efficacy in color sensing, and finally choose the most suitable ML algorithm for color detection based on the application requirements. We note in doing so, it is important to discuss the data-analytical process of ML techniques within the context of nanoscience datasets, so that they can be appropriately utilized in analyzing nanoscience data of other types as well. Hence, we provide below a brief outline, using schematic visualizations, of how different ML approaches are analyzing our data. When ML is used as a discriminative model in order to distinguish different categories (e.g. different optical wavelengths), it comes in one of these two forms: "supervised learning", where new samples are classified into N categories through training based on the existing sample-label pairs; and "unsupervised learning", where the labels are not available, and the algorithm tries to cluster samples of similar kind into their respective categories. In our target application in this paper, labels are wavelengths that combined with measured transmittance values that we will call filter readings, create the set of sample-label pairs known as the training set. Therefore, we chose our analytical approaches based on the supervised ML algorithms. Apart from the Bayesian inference, we employed k-nearest neighbour (kNN), artificial neural networks (ANN), and support vector machines (SVM); the details of each can be found in the Computational Details section. In the following discussions, we provide a brief overview of each method to clarify their algorithmic steps.
As for the Bayesian inference, we discussed its underlying statistical approach in detail in our previous article. 55 For a given set of known sample-label pairs (i.e. training set), Bayesian inference gathers statistics of the data and uses them later to classify an unknown new sample by maximizing the collective probability of the new sample belonging to the corresponding category (see Fig. 2(c)).
In pattern recognition, the kNN is a non-parametric supervised learning algorithm used for classification and regression, 56 which searches through all known cases and classifies unknown new cases based on a similarity measure defined as a norm-based distance function (e.g. Euclidean distance or norm 2 distance). Basically, a new sample is classified into a specific category when on average that category's members have the smallest distance from the unknown sample (see Fig. 2(a)). Here, k is the number of closest cases to the unknown sample, and extra computation is needed to determine the best k value. This method can be very time-consuming if the data size (i.e. total number of known sample-label pairs) is large.
ANNs are computing models that are inspired by, but not necessarily identical to, biological neural networks. Such models "learn" to perform tasks by considering samples, generally without being programmed with any task-specific rules. An ANN is based on a collection of connected units or nodes called artificial neurons, which upon receiving a signal can process it and then pass the processed signal to the additional artificial neurons connected to them. A neural network always has an input layer, which holds the features of each training sample, and an output layer, which holds the classes in a classification problem, while it can also be only a number in a regression problem. However, there are often more than just two layers in an ANN model. The extra layers, which are always located between the input and output layers, are called hidden layers. The number of hidden layers, the number of neurons in each layer, and how these layers are connected form the neural network architecture. [57][58][59] In general, having more hidden layers increases the capacity of the network to learn more details from the available dataset, but having many more layers than necessary can result in overfitting the model to the training set, i.e. the model might perform well on the training set but poorly on the unseen test set. 58,60 In this work we have used two different fully-connected ANN architectures to investigate their efficacy on optical wavelength estimation. The schematic of a three-layer fully-connected ANN model is shown in Fig. 2(d). Backpropagation is the central mechanism by which a neural network learns. An ANN propagates the signal of the input data forward through its parameters, called weights, towards the moment of decision, and then backpropagates the information about the error, in reverse through the network, so that it can alter the parameters. In order to train an ANN and find its parameters using the training set, we give labels to the output layer, and then use backpropagation to correct any mistakes which have been made until the training error falls within an acceptable range. 61
When it comes to supervised classification, SVM algorithms are among the powerful ML inference models. [61][62][63] In its primary format as a non-probabilistic binary linear classifier, since it leads to a model that is more generalizable to an unseen test data.
[Fig. 2 annotations: there is one probability distribution per wavelength per filter; by a kernel trick the data can be transformed from feature space t to its dual space φ(t).]
An interesting observation from our results is that the SVM model shows slightly larger estimation errors compared to the rest of the algorithms, however it is not sensitive to data size and is more resistant to time-dependent variations in optoelectronic response of nanomaterials i.e. to drift. Bayesian inference turns out to be very accurate, and quite fast as well.
By looking at the outcomes of our estimation problem, we have also discovered another important aspect of the data that we are dealing with in nanomaterial applications. We noticed a significant drift of the nanomaterial measurements over time in our dataset, which can be described as "evolving class distributions". This means the same object (i.e. light ray) will not create the same responses on the nanomaterial filters over time. Therefore, a model where the nanomaterial filters have drifted even more. By observing the transmittance of the filters over a period of more than a year, we were able to predict the drift in transmittance after two months and improve the performance of the wavelength estimation. This was, however, only possible with the kNN and Bayesian algorithms, since they employ no parameters other than the transmittance values themselves, while SVM and ANN train their own corresponding parameters. In the next section we will summarize the main findings of our work.

Results and Discussion
The detailed description of each ML algorithm, the number of parameters to be trained, and the computational complexity of each technique will be discussed in the Computational Details section. The resolution of the collected wavelength samples is 1 nm. To discuss the efficacy of our wavelength estimators, we define the estimation error percent as
$$\mathrm{error}\% = \frac{|\lambda_{\mathrm{real}} - \lambda_{\mathrm{estimated}}|}{\lambda_{\mathrm{real}}} \times 100.$$
We first present our results comparing the wavelength estimation accuracy from various techniques. Fig. 3 Fig. 4(a)). In addition, we have performed the ANN using both 1 and 2 hidden layers, which is presented in the comparison data shown in Fig. 4 and subsequent figures, where we can see a fifth batch of columns for the 2 hidden layer ANN, labeled ANN(2h), as opposed to ANN(1h) with 1 hidden layer.
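As a small illustration of this metric, the sketch below assumes the estimation error percent is the absolute wavelength deviation relative to the true wavelength, times 100. The function names and the three sample values are hypothetical and are not taken from the measured data.

```python
def error_percent(lam_true, lam_est):
    """Relative wavelength estimation error in percent (assumed definition)."""
    return abs(lam_true - lam_est) / lam_true * 100.0


def mean_error_percent(true_vals, est_vals):
    """Average the per-sample error percent over a set of test samples."""
    errors = [error_percent(t, e) for t, e in zip(true_vals, est_vals)]
    return sum(errors) / len(errors)


# Hypothetical true and estimated wavelengths in nm.
true_nm = [400.0, 700.0, 1000.0]
est_nm = [401.0, 700.0, 998.0]
mean_err = mean_error_percent(true_nm, est_nm)
print(f"mean error: {mean_err:.3f} %")
```

Averaging this quantity over all test samples gives one number per model, which is the kind of value a bar chart of model accuracy would compare.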
To investigate the sensitivity of the models to the size of the training set, we randomly picked different portions of the training set to perform the training and testing, i.e. by randomly choosing 1/5, 2/5, etc. of the original dataset (see Fig. 4(a)). As expected from theory, the SVM model is least sensitive to the size of the training data, followed by the Bayesian inference. However, the ANN and kNN show considerable reduction in performance when the training set size is reduced. We can see this in Fig. 4(a), where SVM shows minor changes from one data size to another, while for instance the 1 hidden layer ANN shows a steep change in error values. Another important fact that we learn from this figure is the minimum size of training set required to perform reasonable estimation. As we see, in each case using only   in the 1 hidden layer ANN, suggesting that ANN with more hidden layers appears to "learn" better from the available data and yield more accurate estimations. The other consideration is that the available data is not quite sufficient for this problem even when all data is used. This can be seen from the fact that even going from 4/5 to all of the data there is a noticeable change in overall accuracy, whereas we would expect only a minor change in the accuracy of each model if the supplied data were sufficient.
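The random sub-sampling of the training set can be sketched as follows. This is only an illustration under stated assumptions: the helper `subsample`, its seed, and the synthetic stand-in pairs are hypothetical, and the paper does not specify how the random portions were drawn.

```python
import random


def subsample(train_pairs, numerator, denominator=5, seed=0):
    """Return a random subset holding numerator/denominator of the pairs."""
    size = len(train_pairs) * numerator // denominator
    return random.Random(seed).sample(train_pairs, size)


# Synthetic stand-in for (filter readings, wavelength label) pairs.
pairs = [([0.0] * 11, 351 + (i % 750)) for i in range(1500)]

# Subsets holding 1/5, 2/5, ..., 5/5 of the stand-in training set.
sizes = [len(subsample(pairs, n)) for n in range(1, 6)]
print(sizes)  # [300, 600, 900, 1200, 1500]
```

Each model would then be retrained on every subset and evaluated on the same held-out test samples, so that only the training set size changes between runs.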
We next analyze the performance of each algorithm in terms of the time required for each model to train, and afterwards to test. In kNN and Bayesian models there are no real likely it would not be the case if larger dataset were used (see Fig. 4(b)). As for the Bayesian algorithm, the training part is limited to collecting the statistics from the training data. In the testing step, the model searches through all probability distributions and maximizes the posterior; this is obviously time consuming, but it is independent of the training set size. Hence, in both models the main and/or only required time is for testing.
As for ANN and SVM the training step can be dynamically decided by desired conditions.
In the case of SVM, the training step is governed by the choice of tolerance, kernel type, etc.
After the support vectors are found, the testing step is carried out by checking on which side of the hyperplanes the test sample falls. In our study, different choices of kernel/tolerance did not yield a meaningful improvement in the estimation efficacy of the trained SVM models.
The situation is quite different for ANN, since one can iterate the training loop indefinitely and the results may either improve, converge, or just get stuck in a local minimum.
Time and computational resources for training are the real costs of the ANN algorithm, but in general an ANN can fit very complicated non-linear functions on which other models might not perform as well. After the end of the training step (decided by the experimenter based on the desired level of accuracy), the testing step is basically only a few matrix multiplications, as explained in the Computational Details section. Hence, the testing time of ANN is quite short and independent of the size of the training set. In addition, we found that with smaller training sets the ANN model is prone to over-fitting, i.e. the model might perform well on the training set itself but not on a new test set. The required testing time for each sample when all training steps are completed is shown in Fig. 4(b), which is the

Conclusions
In conclusion, we have successfully demonstrated the efficacy of various ML techniques in estimating the wavelength of any narrow-band incident light in the spectrum range 351-1100 nm with high accuracy, using the optical transmittance information collected from a few low-cost nanomaterial filters that require minimal control in fabrication. With the available data, the kNN algorithm shows the highest accuracy, with the average estimation errors reaching 0.2 nm over the entire 351-1100 nm spectrum range, where the training set is collected with 1 nm spectral resolution; but this method is not suitable for real-time applications since the required testing time is linearly proportional to the training set size. The situation is almost the same with the Bayesian algorithm, which performs very well; although its speed does not depend on data size, the process is still much slower than the other methods. The real-time speed considerations can be very well satisfied with ANN models, where the estimation time can be as low as 10 µs, but these models, as well as Bayesian and kNN, turn out to be more sensitive to drift in the spectral transmittance of nanomaterials over time. On the other hand, SVM models show slightly lower accuracy compared to the rest but do not suffer from smaller data sizes and are more resilient to drift in spectral transmittance. Even though we have shown in our previous work that re-calibrating the filters will overcome the drifts and wear in nanomaterials, if re-calibration is not a readily available option for the user, the SVM model offers acceptable accuracy and longer usability over time. If speed is a consideration, the ANN models would be the best choice, and they turn out to perform well if enough data is provided. We also observed that ANN models with more layers seem to learn better from the available data.
The choice of model depends on the application; for instance spectroscopy does not demand a fast real-time output but accurate and precise estimations. There are other applications especially in biology, for instance in DNA sequencing, 66 where the accuracy of the peak wavelength is not of importance as long as it is estimated close enough, but the time is of vital importance. can be huge and very complex. The future work is to generalize the methods of this paper to broad-band optical spectra. All said, we believe that application of advanced data analytic algorithms has been very limited in optical sensing applications, and our findings can open up a new path for designing new generation optical detectors by harnessing advanced data analyzing algorithms/ ML techniques and significantly transform the field of high-accuracy sensing and detection using cyber-physical approaches.

Computational Details
Data Structure. The analysis of our data was performed on transmittance values measured over a wide spectral range (351 nm < λ < 1100 nm) for each of the 11 nanomaterial filters, as well as 110 repetitions of these wavelength-dependent data. As mentioned in our previous article, the repeated data was acquired to account for drifts, fluctuations, and other variations commonly observed in physical measurements, especially in nanomaterial-based systems, which tend to be sensitive to their environments. 15,55 In addition, larger training data usually results in better performance of most ML algorithms. From the mentioned 110 spectra of each nanomaterial filter, 100 were labeled as "training data" or sample-label pairs and used for training the models (M = 750 × 100 = 75000 training samples). The other 10 spectra per filter were labeled as initial "test samples" (M = 750 × 10 = 7500 test samples or original test samples), and were used only for testing the "trained" models. In other words, the test samples were not part of the training process and the machine learning models did not "see" these samples until the testing step.
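The data layout described above can be sketched as follows. Random numbers stand in for the measured transmittances, and taking the first 100 repetitions for training is an assumption made only for illustration; the names `spectra`, `to_pairs`, `train_pairs` and `test_pairs` are hypothetical.

```python
import random

WAVELENGTHS = list(range(351, 1101))  # 750 wavelength labels, 1 nm step
N_FILTERS = 11
N_REPEATS = 110
N_TRAIN_REPEATS = 100  # assumed: the first 100 repetitions are used for training

rng = random.Random(0)
# spectra[r][w][q]: stand-in transmittance of filter q at wavelength index w
# in repetition r (random values here, not measurements).
spectra = [
    [[rng.random() for _ in range(N_FILTERS)] for _ in WAVELENGTHS]
    for _ in range(N_REPEATS)
]


def to_pairs(repeats):
    """Flatten repetitions into (filter readings, wavelength label) pairs."""
    return [
        (reading, lam)
        for spectrum in repeats
        for reading, lam in zip(spectrum, WAVELENGTHS)
    ]


train_pairs = to_pairs(spectra[:N_TRAIN_REPEATS])
test_pairs = to_pairs(spectra[N_TRAIN_REPEATS:])
print(len(train_pairs), len(test_pairs))  # 75000 7500
```

Each pair holds one 11-element filter reading and its wavelength label, which is the sample-label form that all four algorithms consume.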
In our classification problem there are N different classes: one per wavelength, and we are trying to classify our transmittance data into these N classes. Here, we will concisely introduce each ML method and give their mathematical equations; also we will mention the number of parameters that are being trained in each model. Computations are carried out in Python 3.7 using a 2.5 GHz Quad-core Intel Core i7.
Bayesian Inference. The filters are not chemically independent from each other, for they are mixtures of different proportions of the same two nanomaterials; nevertheless, for computational purposes we assume independence between their outcomes, and model them with the Naive Bayes algorithm. 55,67 The Bayesian inference for the wavelength estimation problem can be formulated as follows: Let $\Lambda = \{\lambda_1, ..., \lambda_i, ..., \lambda_N\}$ be N different wavelengths in the desired spectral range and with specified granularity (i.e. 351-1100 nm with 1 nm step in this study), and $T = \{t_1, ..., t_i, ..., t_Q\}$ be the transmittance vector of Q filter values (i.e. Q = 11 when all of the filters are used in this study). Employing the Bayesian inference, the probability of the monochromatic light having the wavelength $\lambda_j$ based on the observed/recorded transmittance vector $T$ is called the posterior probability,
$$P(\lambda_j \mid T) = \frac{P(T \mid \lambda_j)\, P(\lambda_j)}{P(T)},$$
which is the conditional probability of having wavelength $\lambda_j$ given transmittance vector $T$; $P(\lambda_j) = \frac{1}{N}$ is the prior probability, which is a uniform weight function here since all of the wavelengths are equally likely to happen; N is the total number of quantifiable wavelengths in the range under study. Moreover,
$$P(T \mid \lambda_j) = \prod_{i=1}^{Q} P(t_i \mid \lambda_j)$$
is the probability of observing transmittance data $T$ given wavelength $\lambda_j$, and is called the likelihood; $P(T)$ is the marginal probability, which is the same for all possible hypotheses that are being considered, so it acts as a normalization factor to keep the posterior probability in the range of 0 to 1.
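A minimal sketch of this naive-Bayes formulation is given below, assuming Gaussian per-filter likelihoods, a uniform prior, and log-probabilities for numerical stability. The function names, the three-filter synthetic data and the noise level are hypothetical and serve only as an illustration.

```python
import math
import random


def fit_gaussians(train_pairs):
    """Collect per-wavelength, per-filter (mean, std) from (reading, label) pairs."""
    grouped = {}
    for reading, lam in train_pairs:
        grouped.setdefault(lam, []).append(reading)
    stats = {}
    for lam, readings in grouped.items():
        per_filter = []
        for column in zip(*readings):  # one column per filter
            mu = sum(column) / len(column)
            var = sum((t - mu) ** 2 for t in column) / len(column)
            per_filter.append((mu, math.sqrt(var) + 1e-9))  # floor avoids /0
        stats[lam] = per_filter
    return stats


def map_estimate(stats, reading):
    """Return the wavelength with the largest log-posterior (uniform prior)."""
    def log_likelihood(per_filter):
        # Sum of Gaussian log-densities; constants common to all classes dropped.
        return sum(
            -math.log(sd) - 0.5 * ((t - mu) / sd) ** 2
            for t, (mu, sd) in zip(reading, per_filter)
        )
    return max(stats, key=lambda lam: log_likelihood(stats[lam]))


# Synthetic example: 3 filters whose mean transmittance depends on wavelength.
rng = random.Random(0)


def synthetic_reading(lam):
    return [0.1 + 0.001 * (lam - 400) * (q + 1) + rng.gauss(0.0, 0.01)
            for q in range(3)]


train = [(synthetic_reading(lam), lam) for lam in (400, 500, 600) for _ in range(50)]
stats = fit_gaussians(train)
estimate = map_estimate(stats, [0.2, 0.3, 0.4])
print(estimate)
```

Because the prior is uniform and $P(T)$ is the same for every candidate wavelength, maximizing the summed log-likelihood is equivalent to maximizing the posterior.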
Individual $P(t_i \mid \lambda_j)$ values are assumed to be Gaussian (normal) distributions for each filter at each wavelength, and their mean values and standard deviations were calculated from the training data (i.e. the 100 measured transmittance spectra) collected for each filter at each wavelength. Finally, given the measured transmittance sample $T$ (a vector of Q = 11 elements, one transmittance value per filter at an unknown wavelength), the wavelength $\lambda^*$ of the unknown monochromatic light is estimated by choosing the value of $\lambda_j$ that maximizes the posterior probability $P(\lambda_j \mid T)$:
$$\lambda^* = \arg\max_{\lambda_j} P(\lambda_j \mid T).$$
This optimization is called the maximum a posteriori (MAP) estimation. [68][69][70]
(2) by-instance-based, which is the standard kNN approach, in which a new case is classified by a majority vote of its neighbors, with the case being assigned to the class that is most common among its k nearest neighbors measured by a distance function. If k is 1, then the case is simply assigned to the class of its nearest neighbor. Since a kNN model with small k is prone to over-fitting, usually a finite odd number is chosen for k. There are various kinds of distance functions; of these, four well-known ones (Euclidean, Manhattan, Chebyshev, and Minkowski) are used in this study, but only the results of the Euclidean distance function are presented, which is the classical presentation of distance and is given by
$$D(X, Y) = \sqrt{\sum_{i=1}^{Q} (x_i - y_i)^2}.$$
Here, X refers to each sample in the training set and Y refers to the unknown (test) sample. To apply it to our data we need to find the distance of a new transmittance vector of Q = 11 elements, $T = \{t_1, ..., t_i, ..., t_Q\}$, from all transmittance vectors $T' = \{t'_1, ..., t'_i, ..., t'_Q\}$ that are already known and labeled in the training set, so the distance function is
$$D(T, T') = \sqrt{\sum_{i=1}^{Q} (t_i - t'_i)^2}.$$
The distance between $T$ and all M training samples is calculated, and the M calculated distance values are sorted from smallest to largest using a typical sorting algorithm. Afterwards, the k nearest neighbors, i.e.
wavelengths that have smallest distance values from the test T are found, which are the arguments of the first k numbers of the sorted list. Each nearest neighbor is assigned a uniform weight of 1/k, and the k neighbors are classified.
Then, the test case T is assigned to the group with largest vote or population. In order to find the best k for our system we tried different values for k in the range k = [1,20], and picked k = 7 which performed the best. As mentioned before, kNN is a non-parametric classification algorithm so, no parameters are being learned in kNN.
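The voting procedure described above can be sketched in a few lines. This is an illustrative brute-force version under stated assumptions, not the code used for the reported results; the function names and the two-filter synthetic training set are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two transmittance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_pairs, reading, k=7):
    """Majority vote (uniform weights) among the k nearest training samples."""
    by_distance = sorted(train_pairs, key=lambda pair: euclidean(pair[0], reading))
    votes = Counter(lam for _, lam in by_distance[:k])
    return votes.most_common(1)[0][0]


# Synthetic two-filter training set: ten samples per wavelength label.
train = []
for lam in (400, 500, 600):
    base = (lam - 300) / 1000.0  # 0.1, 0.2, 0.3
    for i in range(10):
        train.append(([base + 0.002 * i, base - 0.002 * i], lam))

label = knn_classify(train, [0.21, 0.19], k=7)
print(label)  # 500
```

Every query sorts all M distances, which is why the testing time of this approach grows with the training set size.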
Artificial Neural Networks. In an ANN model each layer is made from a fixed number of neurons. The output of each neuron is a linear combination of the corresponding inputs followed by a non-linear activation function such as the logistic sigmoid or softmax. These layers are connected by weight matrices, so that for an input sample $T$, by performing layer-by-layer matrix multiplication we would like to get as close as possible to the real label (y value) of that sample. Let us denote the layers of a three-layer ANN model (with 1 hidden layer) by $a^{(1)}$, $a^{(2)}$ and $a^{(3)}$. To calculate each layer $a^{(l)}$ ($l > 1$), first, a matrix multiplication is performed between the previous layer and the hypothesis matrix $\theta^{(l-1)}$ to get $z^{(l)}$; then, an activation function $g(z)$ (usually the sigmoid function $g(z) = \frac{1}{1+e^{-z}}$) is applied on $z^{(l)}$, which results in the $l$-th layer:
$$z_j = \sum_{i=1}^{Q} w_{ji}^{(1)} t_i + w_{j0}^{(1)}, \qquad a_j^{(2)} = g(z_j),$$
where $j = 1, ..., H$, and H is the size of the first hidden layer; the superscripts (1), (2) indicate that the corresponding parameters are in the first or second layer of the network.
$w_{ji}^{(1)}$ are the corresponding weights, $w_{j0}^{(1)}$ are referred to as biases, $z_j$ are called activations, and $g(z)$ is the mentioned nonlinear activation function. At each layer of the ANN there is such a transformation; for example, in a three-layer ANN which includes only 1 hidden layer, the elements of the third layer take the form
$$a_n^{(3)} = g\Big(\sum_{j=1}^{H} w_{nj}^{(2)} a_j^{(2)} + w_{n0}^{(2)}\Big),$$
where $n = 1, ..., N$ and N is the total number of outputs. $a_n^{(3)}$ is the final output of the hypothesis that is going to be compared with the known target wavelengths. The bias parameters can be absorbed into the set of weight parameters by defining an additional input variable $t_0$ whose value is kept fixed at $t_0 = 1$, and the same for other layers, so we combine these various stages to give the overall network function that, for sigmoidal output unit activation functions, takes the form
$$y_n(T, W) = g\Big(\sum_{j=0}^{H} w_{nj}^{(2)}\, g\Big(\sum_{i=0}^{Q} w_{ji}^{(1)} t_i\Big)\Big).$$
The cost (error) function over the M training samples is the cross-entropy, or more explicitly
$$E(W) = -\sum_{m=1}^{M} \sum_{n=1}^{N} \big[ r_{mn} \ln y_{mn} + (1 - r_{mn}) \ln(1 - y_{mn}) \big],$$
where $y_{mn}$ denotes $y_n(X_m, W)$, with $X_m$ the m-th training sample and $r_{mn}$ the corresponding target value. So far, we have explained the FeedForward propagation.
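The FeedForward step and the cross-entropy cost can be sketched as below, assuming sigmoid activations in both layers and biases stored as the first weight of each row (paired with a fixed input of 1). The layer sizes, zero-initialized weights and one-hot target are hypothetical choices that keep the result easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, inputs):
    """One layer: row[0] of each weight row is the bias, paired with input 1."""
    extended = [1.0] + list(inputs)
    return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weights]


def feed_forward(w1, w2, reading):
    """Three-layer network: input -> hidden (w1) -> output (w2)."""
    hidden = layer_forward(w1, reading)
    return layer_forward(w2, hidden)


def cross_entropy(outputs, targets):
    """Cross-entropy cost of one sample with sigmoid outputs."""
    return -sum(
        r * math.log(y) + (1 - r) * math.log(1 - y)
        for y, r in zip(outputs, targets)
    )


Q, H, N = 11, 4, 3  # inputs, hidden units, outputs (H and N are illustrative)
w1 = [[0.0] * (Q + 1) for _ in range(H)]  # untrained weights, all zero
w2 = [[0.0] * (H + 1) for _ in range(N)]

outputs = feed_forward(w1, w2, [0.5] * Q)
cost = cross_entropy(outputs, [1, 0, 0])
print(outputs, cost)
```

With all weights at zero every output is 0.5, so the cost equals $N \ln 2$; training would lower this value by adjusting the weights.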
At first there is a large cost because the model is not trained yet. An important step in the ANN learning process is called Backpropagation, which, unlike the FeedForward propagation explained above, propagates from the last layer and stops at the second layer. In Backpropagation, each of the weight parameters is updated by a small amount proportional to the gradient of the cost (error) function with respect to that weight parameter. The proportionality factor is called the learning rate, which defines the updating rate for each parameter in Backpropagation. The training process happens by iterating many cycles, that in each cycle, we perform the Feed- where we have taken the factor $\frac{1}{\|w\|}$ outside the optimization over m because W does not depend on m. 61 On the other hand, there are different kernel tricks to create non-linear models, hence creating a larger feature space by a non-linear kernel function $k(T_i, T_j) = \varphi(T_i)^{\top} \varphi(T_j)$.
This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space by replacing the Equation (11) The transformation may be non-linear and the transformed space high-dimensional. 62,71,72 The RBF kernel, for example, uses a Gaussian distribution for each feature $T_i$ and creates M different features using the kernel function $k(T_i, T_j) = e^{-\gamma \|T_i - T_j\|^2}$ for $\gamma > 0$. In this work, apart from linear SVM, we also tried different kernel functions such as Polynomial, Gaussian Radial Basis Function (RBF), sigmoid, and Hyperbolic Tangent, but we only report the results of the linear and RBF models, since they performed slightly better than the other models, and their outputs are nearly the same for our data; for that reason we present only one set of results for SVM, which covers both linear SVM and SVM with RBF kernel. The number of parameters to be learned is $N(Q + 1) = 9000$ for the linear model, and $N(M + 1) \sim 56 \times 10^6$ for the RBF model, where N = 750 is the number of classes, M = 75000 is the number of training samples and Q = 11 is the dimension of each training sample.
Even though these numbers seem large, especially for the RBF kernel, most of these parameters are zero, and the calculation is carried out using a sparse matrix of parameters. In fact, SVMs are called sparse kernel machines. 61 In this project, Python's SciKit package is used for building the SVM model.
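As a quick check of the kernel definitions and the parameter counts quoted above, the sketch below evaluates a linear and an RBF kernel on two vectors and recomputes $N(Q+1)$ and $N(M+1)$. The function names and the value of `gamma` are illustrative choices, and this is not the library code used to build the reported SVM models.

```python
import math


def linear_kernel(a, b):
    """Inner product of two transmittance vectors."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||a - b||^2), with gamma > 0."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


N, Q, M = 750, 11, 75000  # classes, filters, training samples (from the text)
linear_params = N * (Q + 1)
rbf_params = N * (M + 1)
print(linear_params)  # 9000
print(rbf_params)     # 56250750

print(rbf_kernel([0.2, 0.4], [0.2, 0.4]))  # 1.0 for identical vectors
```

The RBF count grows with M because each training sample contributes one kernel feature, which is why sparsity in the learned coefficients matters in practice.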
Time Complexity Analysis. As given above, we have N = 750 classes of all possible wavelengths, Q = 11 filters as feature number and totally M = 75000 samples for training.
Once the models are trained, we care more about their inference efficiency. The time complexity is analyzed as follows.
For Bayesian estimation, for each data point, the conditional distribution $P(T \mid \lambda_j)$ is calculated for all possible wavelengths, of which there are N. The product for the joint probability takes N operations as well. However, a power operation is included in the Gaussian distribution density function. As we compute this density across the whole spectrum, the exponent in this operation will be N related; therefore, each iteration of the implementation takes O(N where $H_i$ stands for the hidden neuron numbers at layer i, L stands for the total layer numbers (except input). As Q and H are data size-independent, this method is supposed to be much faster than the other methods. In our ANN architecture the number of neurons increases by almost an order of magnitude as we go from the input to the output layer, so the complexity is dominated by the last layer. N is the output layer size, and if we denote the last hidden layer size by $H_L$, the time complexity will be of the order of $O(H_L N)$.
For SVM, for each query the kernel operation is carried out across all support vectors within the training data. The inference complexity for the linear and RBF models will be $O(Q M_{sv})$ since we are solving the dual form here; $M_{sv}$ stands for the number of support vectors, which most of the time will be much less than M but can also be up to M, so we can show its upper limit as $O(QM)$. The theoretical complexity estimations are in agreement with the measured time required for testing each sample (see Fig. 4(b)).
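To put these orders of magnitude side by side, the sketch below counts approximate multiply-add operations per query. N, Q and M come from the text; the last-hidden-layer size `H_LAST` and the support vector count `M_SV` are hypothetical placeholders, and the kNN entry assumes a brute-force search that compares the query with all M training vectors.

```python
N, Q, M = 750, 11, 75000  # classes, filters, training samples (from the text)
H_LAST = 100              # hypothetical last-hidden-layer size, not from the paper
M_SV = 5000               # hypothetical number of support vectors, M_SV <= M

costs = {
    "kNN (brute force), O(Q*M)": Q * M,
    "ANN last layer, O(H_L*N)": H_LAST * N,
    "SVM, O(Q*M_sv)": Q * M_SV,
    "SVM upper bound, O(Q*M)": Q * M,
}

for name, ops in costs.items():
    print(f"{name}: ~{ops} multiply-adds per query")
```

Only the kNN and SVM counts scale with the amount of training data; the ANN count depends on the architecture alone, which is consistent with its short, data-size-independent testing time.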