Machine learning for composition analysis of ssDNA using chemical enhancement in SERS

Surface-enhanced Raman spectroscopy (SERS) is an attractive method for biochemical sensing due to its potential for single-molecule sensitivity and the prospect of DNA composition analysis. In this manuscript we leverage the metal-specific chemical enhancement effect to detect differences in SERS spectra of 200-base-long single-stranded DNA (ssDNA) molecules adsorbed on gold or silver nanorod substrates, and then develop and train linear regression and neural network models to predict the composition of ssDNA. Our results indicate that substrates of different metals hosting a given adsorbed molecule yield distinct SERS spectra, which allows metal-molecule interactions to be probed under distinct chemical enhancement regimes. Leveraging this difference and combining spectra from different metals as input to PCA (Principal Component Analysis) and NN (Neural Network) models significantly lowers the detection errors compared to manual feature-choosing analysis and to the case where data from a single metal are used. Furthermore, we show that the NN model provides superior performance in the presence of complex noise and data dispersion factors that affect SERS signals collected from metal substrates fabricated on different days.


Introduction
Surface-enhanced Raman spectroscopy (SERS), discovered in the 1970s [1][2][3], provides an attractive method for bio-sensing applications [4][5][6][7][8], as it combines the high degree of specificity inherent to Raman scattering with a high scattering cross section, mainly due to the electromagnetic (EM) enhancement mechanism. These features make SERS an appealing method for DNA composition analysis, which discriminates between DNA sequences according to the total number of bases of each type, with a variety of potential applications in genome evolution studies [9][10][11], cell sorting [12,13] and mutation detection [14], where the exact DNA sequence is not crucial. Furthermore, SERS offers several advantages over standard Raman spectroscopy, such as overcoming the strong fluorescent background and requiring less excitation power, leading to a prominent increase of the signal-to-noise ratio (SNR) [2] and reducing the complexity of the optical spectrometers and detection systems necessary for biomedical analysis and sensing applications. Moreover, in contrast to fluorescence microscopy, SERS is a label-free technique that does not require complex preparation steps such as specific probe design, and results in simpler bio-assays [15,16]. Despite its large potential for a wide range of applications in bio-detection and sensing, especially due to the prospect of single-molecule sensitivity [15,17,18], SERS results are known to be highly sensitive to preparation methods due to several physical mechanisms that affect the orientation of the adsorbed molecule [19] and the chemical enhancement (CE) effect, which stems mostly from the charge transfer mechanism between the molecule and the metal [20,21].
The latter leads to discrepancies in the SERS spectra reported in the literature [22,23], and has also given rise to recent attempts to take advantage of the CE effect for nucleotide detection [24]. Given the complexity of the involved effects and the numerous features of the corresponding SERS spectra, signal processing is a highly relevant resource to account for spectral variability and heterogeneity. In particular, principal component analysis (PCA), which employs a linear transformation to identify a smaller set of linearly uncorrelated variables referred to as principal components (PCs), is one of the most commonly used techniques for analyzing Raman spectroscopic data. For instance, Raman spectral analysis using PCA has been used to interpret complex tumor signatures [25] and to assist with identification of the microstructure of a DNA helix [26]. Furthermore, pairing PCA with a supervised machine learning (ML) algorithm is known to improve the classification of Raman scattering results. For example, Sitole et al. [27] combined PCA with a linear discriminant analysis algorithm to develop a reliable HIV bio-marker, Ref. [28] reported PCA paired with Euclidean-distance classification to discriminate melanoma from normal skin cells, Ref. [29] employed PCA paired with an SVM model for superior diagnosis of prostate cancer, and Ref. [30] used PCA paired with SVM for detection of drugs in human urine using dynamic SERS. However, despite the relative simplicity of PCA and of basic ML techniques, these have not been widely explored in the realm of DNA composition analysis, especially for long (>30 bases) ssDNA molecules. The latter are particularly suited for overall composition analysis because of the complementary base pairing of DNA: the composition of one DNA strand uniquely determines the composition of the other strand and of the full DNA molecule.
In this manuscript, with Fig. 1 schematically describing its main concept, we employ gold and silver nanorod array substrates, fabricated using a straightforward single-step oblique-angle deposition (OAD) method [31] known to provide a prominent EM effect [32][33][34], to experimentally study SERS spectra of 200-base-long ssDNA molecules adsorbed on these substrates. The need for using SERS is demonstrated in Appendix section A2, where we include a comparison between normal Raman spectra (i.e. without a metal substrate) and SERS spectra of the same ssDNA molecules used in this work (see Fig. 8). The SERS spectra clearly admit much higher SNR values, and consequently are more efficient than standard Raman spectroscopy for detection applications. We experimentally show that adsorption of ssDNA molecules to gold and silver gives rise to a distinct CE effect, which manifests as characteristic spectral peak shifts and selective intensification of various vibrational modes, as also qualitatively supported by our numerical simulation results. More importantly, while several works exploited SERS for DNA sensing on gold [7] and on silver [8,16], in our work we employ both metals to demonstrate the distinct CE effect associated with each of them, which can be used for enhanced specificity. In particular, we employ PCA to identify the orthogonal features present in the experimentally acquired spectra, and then use these features (i.e. PCs) as the training set of our linear regression model. Furthermore, we also incorporate several data pre-processing methods, including Gaussian smoothing, normalization and multiplicative-scattering correction (MSC). The use of MSC here is especially important due to numerous noise-generating factors such as multiplicative light scattering [35], and the fact that SERS is an extremely sensitive method.
In principle, SERS spectra can contain very detailed information that allows small sample-to-sample differences to be detected; these differences can in turn strongly affect the visual assessment of the spectrum by causing small arbitrary spectral shifts that may not contain information relevant to the ssDNA composition (see Appendix section A7). We show that PCA multiple-feature linear regression greatly benefits from such a noise elimination procedure, and consequently demonstrates elevated sensitivity relative to the more basic one-feature peak-ratio regression model. Besides linear regression, we also utilize another ML model from the deep learning family, namely a neural network (NN). The NN concept was established in the 1980s-90s and has since been developed by numerous researchers [36,37]. Most NNs organize their neurons into layers, and in a layered NN the neurons in the input layer accept numeric data points as their inputs. In particular, each neuron admits a weight, which upon multiplication with the input data yields the neuron output that is transferred to the next layer [38]. We employ this NN model to train and test on data combined from both metals for superior performance.

Fig. 1. Schematic description of key experimental and data processing components: (a) Rough metal surface (gold or silver) formed by an array of nanorods of mean height h and mean distance Λ, functionalized with Raman-active ssDNA molecules comprised of adenine (pink) and cytosine (blue) bases. (b) DNA composition analysis, which employs PCA and linear regression (for the separate gold and silver datasets) and a neural network (for the combined gold and silver datasets) to predict the percentage of adenine and cytosine bases in the ssDNA molecule. (c) List of the 200-base-long ssDNA training molecules (see Appendix section A1, Table 5, for the specific sequences and the random test sample).

Metal nanorod array substrate fabrication
The metal nanorod array structure is fabricated using the OAD technique with a Denton Discovery Sputter system. The average height of the nanorods, schematically described in Fig. 1(a), is h = 200 nm, and the average diameter of the rods is approximately Λ = 50 nm for gold and Λ = 100 nm for silver (see relevant SEM images in Fig. 9 in Appendix section A3). The substrate is tilted such that the zenithal deposition angle is α = 75°. Following Barranco et al. [31], we set the tilt angle of the nanorods to approximately 60° relative to the substrate normal (see also Appendix section A4 for the detailed calculation).

ssDNA functionalization
The ssDNA solutions are prepared by diluting the DNA stock solution to 25 µM in 10 mM 4-(2-hydroxyethyl)-1-piperazine ethanesulfonic acid (HEPES), and then forming a 1:3 mixture with a 10 mM MgCl₂ solution. We then drop-cast this ssDNA solution onto the metal nanorod substrate and let it dry overnight. Before SERS measurements, all samples are rinsed with deionized water to remove excess crystallized salt and unbound ssDNA molecules, and then blow-dried.
A table of the sequences of each ssDNA mixture is given in Appendix section A1. The ssDNA concentration is chosen to provide a sufficiently large surface concentration, enabling us to measure the SERS signal uniformly over a map of units on the substrates. The salt ratio is used to neutralize the negative phosphate backbone and enable bonds between the DNA bases and the metal substrate [18].

SERS spectra acquisition and data processing
The SERS spectra are collected using a Renishaw inVia Raman spectrometer with the following settings. Each spectrum is obtained with a 785 nm Raman excitation laser at 50 mW output power, an acquisition time of 5 s and 1 accumulation per spectrum. The objective magnification is 50x with NA = 0.75. The grating used is 1200 l/mm at 785 nm. The grating setting in the built-in spectrometer software is set to a static regime, with the acquired spectral range extending from 600 cm⁻¹ to 1700 cm⁻¹. The resultant spectral resolution in our setup is approximately 1 cm⁻¹. For presentation purposes, most of the SERS spectra presented in this work are cropped to the range from 600 cm⁻¹ to 1200 cm⁻¹. Full SERS spectra can be found in Appendix section A7. The mapping setting of the spectrometer is used to acquire 100 measurements from a total substrate area of 50 × 50 µm²; this area is divided into 10 × 10 square units of 5 × 5 µm² each, and each spectrum is acquired from a different unit.
To analyze the CE effect, 100 SERS spectra are acquired from DNA bases without phosphate backbone adsorbed to gold and silver nanorod substrates. To demonstrate ssDNA composition analysis, 200 SERS measurements in total (divided into 2 maps, 100 measurements in each map) are made for each 200-base ssDNA composition (5 control sequences and a single test sequence).
The single-feature linear regression model is developed in MATLAB using the LinearModel.fit algorithm, whereas the PCA multiple-feature linear regression model is developed in Python using libraries including scikit-learn, numpy, scipy, pandas and matplotlib, and the NN model is built in Python with TensorFlow, using libraries including keras, scikit-learn, scipy and matplotlib. In all models, five control sequences form our training dataset, with the following A and C compositions: 100% A - 0% C, 75% A - 25% C, 50% A - 50% C, 25% A - 75% C and 0% A - 100% C; a single MATLAB-generated random sequence with a composition of 54% A - 46% C is used for testing (the percentage of A and C in the testing sequence is checked in MATLAB after generation; see Appendix section A1 for the specific sequences used). In some models, i.e. the PCA linear regression and the NN, a validation set randomly extracted from one third of the testing set is also used for model validation. For each control sequence, we have three sample datasets prepared on different dates. Multiple samples are needed to improve the training performance of some models, such as the PCA linear regression. Similarly, we have three different samples used for testing and for the calculation of errors, in order to verify the robustness of the system. The pre-processing steps for the data consist of baseline subtraction and cosmic-ray removal, which are performed with the built-in algorithms of the Renishaw WiRE 4 software, followed by Gaussian smoothing (smoothing window = 5), signal normalization and MSC, which are implemented in both Python and MATLAB and are described in detail in Appendix section A7.
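The pre-processing chain named above (Gaussian smoothing, normalization, MSC) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation, which is described in Appendix section A7: the function names, the kernel width `sigma`, the edge padding and the choice of the mean spectrum as the MSC reference are all assumptions made here.

```python
import numpy as np

def gaussian_smooth(spectrum, window=5, sigma=1.0):
    """Smooth one spectrum with a normalized Gaussian kernel.

    `window` is assumed odd (5 in the text); `sigma` is an assumed value.
    """
    offsets = np.arange(window) - (window - 1) / 2.0
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # Pad with edge values so the output has the same length as the input.
    padded = np.pad(spectrum, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def min_max_normalize(spectrum):
    """Rescale one spectrum to the range [0, 1]."""
    lo, hi = spectrum.min(), spectrum.max()
    return (spectrum - lo) / (hi - lo)

def msc(spectra, reference=None):
    """Multiplicative scatter correction of a (n_spectra, n_points) array.

    Each spectrum s is modelled as s = intercept + slope * reference and
    replaced by (s - intercept) / slope. The reference defaults to the mean
    spectrum, which is a common but assumed choice.
    """
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)
        corrected[i] = (s - intercept) / slope
    return corrected
```

In this sketch, two spectra that differ only by an additive offset and a multiplicative gain become identical after `msc`, which is the kind of sample-to-sample variation the correction is meant to suppress.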

Results and discussion
3.1. CE effect of a single DNA base: comparison between gold and silver nanorod substrates
To probe the distinct CE effects introduced by gold and silver on the SERS spectra of DNA bases adsorbed to these metals, we first consider a numerical simulation using density functional theory (DFT), employing the Gaussian 09 [39] software on the Gordon supercomputer at the University of California, San Diego [40]. In particular, we consider a simplified model of a single nucleotide without a phosphate backbone adsorbed to the corresponding metal through a fixed nitrogen atom, and examine the effect of changing the type of metal on the corresponding Raman spectra. The simulation result is presented in Appendix section A5, where Fig. 10(a,b) presents the simulated Raman intensity spectra (see [41] for the formal definition of Raman intensity and activity) of adenine (A) and cytosine (C), respectively, obtained with the B3LYP computational method and the LANL2DZ basis set. Each base is bound to a tetrahedral nanoparticle comprised of 20 silver or gold atoms. In so doing, the model probes the CE effect for a given orientation by introducing a metal-nitrogen bond, and bypasses computationally expensive crystalline structures, which typically require a larger number of metal atoms [42]. Importantly, orientation-dependent effects are eliminated because the models describe DNA bases bound to the different metals with the same orientation.
In particular, the simulated ring breathing mode (RBM) peak of A unbound to metal is centered at 711 cm⁻¹, and is shifted to 712 cm⁻¹ when bound to silver and to 713 cm⁻¹ when bound to gold. A more significant shift is detected for the simulated RBM of C: 754 cm⁻¹ without metal, 757 cm⁻¹ when bound to silver and 769 cm⁻¹ when bound to gold. Furthermore, the simulated spectra show different Raman intensities of the RBM. For example, the intensity of the RBM of A bound to gold is higher than that of A bound to silver. We attribute the differences between the SERS spectra of molecules adsorbed to gold and to silver to the difference in the Fermi energy levels of these metals, which in turn affects the charge transfer effect [4]. That said, a model that relates the strength of the dominant SERS spectral mode to the work function of the relevant metal is beyond the scope of this work. Nevertheless, the distinct CE effects observed for these simple numerical cases suggest that distinct CE effects should be present in the experimental results considered below.
In the next step, we perform experimental SERS measurements of A and C (i.e. DNA bases without phosphate backbone) adsorbed to gold and silver nanorod array substrates, and to the same substrates covered with a 2-nm-thick dielectric layer of Al₂O₃ deposited via atomic-layer deposition (ALD) using a Beneq ALD system. The small thickness of this layer preserves a prominent EM effect, but is expected to eliminate the CE effect by blocking the charge transfer between the adsorbed molecule and the metal surface [24]; it also allows a quantitative measure of the CE factor of each metal, as described shortly below. Fig. 2(a,b) presents the arithmetic mean of 100 SERS spectra of A and C, respectively, whereas Table 1 presents the intensities of the corresponding RBM modes and their dependence on the CE effect. These results indicate that the RBM peaks in the SERS spectra are higher when the EM and CE effects are both present than when only the EM effect is present. We can estimate the relative strength of the CE effect on a certain vibrational mode by considering the so-called chemical enhancement factor (EF) [43], given by the following ratio,

$$\Gamma^{(i,j)}_{\mathrm{CE}} = \frac{I^{(\mathrm{EM+CE})}_{\mathrm{surf}}}{I^{(\mathrm{EM})}_{\mathrm{surf}}}, \qquad (1)$$

where the indices i, j stand for the substrate metal and the DNA base adsorbed to that metal, respectively. Here, $I^{(\mathrm{EM+CE})}_{\mathrm{surf}}$ and $I^{(\mathrm{EM})}_{\mathrm{surf}}$ correspond to the peak intensities of the relevant vibrational mode in the SERS spectra with and without the CE effect, respectively; in our setup these correspond to the metal nanorod substrate without and with the thin Al₂O₃ layer. All SERS intensity spectra in our plots are normalized between zero and unity, both for uniform presentation and as a preparation step for ML model training. The peak intensities used in this experiment were extracted prior to normalization. Note that the enhancement factor defined in Eq. (1) is a proper figure of merit for the strength of the CE effect under the plausible assumption that the two cases with and without the CE effect (i.e. without and with the ALD Al₂O₃ dielectric layer) admit the same total number of optically excited molecules (see Appendix section A6) and an identical EM effect. Table 1 below indicates that the chemical EF of the adenine RBM on the gold nanorod substrate is approximately $\Gamma^{(\mathrm{Au,A})}_{\mathrm{CE}} \approx 23$, whereas on the silver nanorod substrate the EF of the adenine RBM is $\Gamma^{(\mathrm{Ag,A})}_{\mathrm{CE}} \approx 2$. Similarly, the corresponding EFs of cytosine adsorbed to gold and silver nanorod substrates are both given by $\Gamma^{(\mathrm{Au,C})}_{\mathrm{CE}} \approx \Gamma^{(\mathrm{Ag,C})}_{\mathrm{CE}} \approx 2$; both EFs correspond to the RBM. In both the numerical simulation and the experimental results, the CE effect appears to be stronger for DNA bases adsorbed to gold than for bases adsorbed to silver. However, the calculation in our simulation took into account only the non-resonant charge transfer effect, which includes charge redistribution within the molecule or the metal structure itself at the ground state. Therefore, the higher CE effect in our experiment is most likely due to the involvement of resonant charge transfer between the metal and our DNA bases. Moreover, we also observe a high standard deviation in the EF values, which stems from fluctuations of the SERS intensity due to the low number of adsorbed molecules (surface concentration), leading to an uneven distribution of hotspots in the nanorod substrate [44]. Specifically, the final concentration of the DNA solution used in this experiment was about 6 µM, which converts to about 0.5 molecule/cm² of metal if all molecules have the same chance to bind. This number is not high, and could be much lower in reality because many molecules in the droplet might not come in contact with the metal surface. Additionally, Fig. 2(a,b) presents a shift of the RBM peaks in the SERS spectra of DNA bases adsorbed to metals, relative to the cases when the bases are measured in bulk or on top of an Al₂O₃ layer. For instance, the RBM peak of A shifts from 723 cm⁻¹ to 738 cm⁻¹ and to 740 cm⁻¹ when adsorbed to gold and silver, respectively, compared to when it is adsorbed to the Al₂O₃ layer or measured in its bulk form. The results in Fig. 2(a,b) indicate that the CE effect is practically eliminated when the nanorod array substrate is covered with a thin Al₂O₃ layer, presumably due to a blockage of charge transfer between the DNA bases and the metal. This is demonstrated by the resemblance between the SERS spectra of molecules adsorbed to the dielectric layer and the Raman spectra of bulk samples. More importantly, the different CE effects induce changes in SERS scattering wavelengths and intensities, resulting in distinct spectral shifts and enhancement ratios, which account for the DNA prediction uncertainty for different types of metal.

Fig. 2. Experimental results presenting normalized SERS spectra of (a) adenine (A) and (b) cytosine (C) bases bound to gold and to silver nanorod array substrates. The SERS spectrum of DNA bases bound to the gold nanorod substrate is higher without the Al₂O₃ layer; a similar trend is also present in the SERS spectrum of ssDNA bound to the silver nanorod substrate.

Table 1. RBM peak values of experimentally acquired SERS spectra of A and C molecules on Ag and Au nanorod substrates, as well as on Ag and Au covered with a thin Al₂O₃ film, and the corresponding values of the chemical enhancement factor, $\Gamma_{\mathrm{CE}}$, defined by Eq. (1). All spectra were obtained by averaging the results of 100 measurements.

The differences between the experimental SERS spectra and the numerical simulation can originate from the different binding sites of the DNA bases on the metal substrate, which result in different orientations of the molecule relative to the metal; the orientation is known to depend on numerous experimental conditions and occasionally leads to controversial results [19]. Specifically, while in our simulation we consider the CE effect of a DNA base with a fixed binding site and orientation, in practice each nucleotide admits several binding possibilities, each giving rise to a different orientation and in principle leading to a different CE effect [42].

Composition detection of ssDNA using single-feature linear regression model
In this section, we analyze SERS spectra of ssDNA sequences and employ a simple linear regression model to probe their chemical compositions. Fig. 3 and Fig. 4 present SERS spectra of 200-base-long ssDNA molecules bound to gold and silver nanorod array substrates, respectively. Interestingly, Fig. 3(a-e) and Fig. 4(a-e) both present two prominent peaks: $p_A$ at ∼725 cm⁻¹, which is associated with ssDNA sequences where A is present, and $p_{A+C}$ at ∼790 cm⁻¹, which is present in all cases, including those where A or C is missing. Consequently, we employ the ratio between the peak intensities, $p_A/p_{A+C}$, as a single feature in the regression model. The ratio values of each ssDNA composition are fitted with a lognormal distribution, and the natural logarithms of their medians are taken as control values for building the linear regression model. Fig. 3(f-j) and Fig. 4(f-j) present the corresponding probability distribution functions (PDFs) of the ratio for each of the control sequences, together with their median values (R) and the corresponding natural logarithm values (ln(R)). The latter reflects a nonlinear relation between R and the adenine concentration $C_A$, caused by a nonlinearity of the CE effect [45] as a function of $C_A$. A similar effect was also reported in Ref. [16] for the case of adenine adsorbed to silver random islands. Fitting a linear regression model against these five control ln(R) values yields the following linear regression function $f(C_A)$ and the corresponding inverse function,

$$\ln(R) = f(C_A) = a_i C_A + b_i, \qquad (2)$$

$$C_A = f^{-1}(\ln(R)) = \frac{\ln(R) - b_i}{a_i}. \qquad (3)$$

Here $a_i$ and $b_i$ (i = Au, Ag) are obtained with a built-in MATLAB curve fitting algorithm, and are given by $a_{\mathrm{Au}} = 0.0215$, $b_{\mathrm{Au}} = -1.68$ for gold, with the analogous pair $a_{\mathrm{Ag}}$, $b_{\mathrm{Ag}}$ for silver; these are employed in Fig. 3(m) and Fig. 4(m), respectively. Applying the linear regression model described by Eq. (2) and Eq. (3) to the SERS spectra of the test sample described in Fig. 3(k-l) and Fig. 4(k-l) leads to the following prediction of the adenine concentration: on gold: $\hat{C}_A = 51.2\% \pm 7.32\%$; on silver: […], where the corresponding predicted cytosine concentration satisfies $\hat{C}_C = 100\% - \hat{C}_A$ on each of the substrates, with the same detection errors. Detailed calculations are described in Appendix section A9. Since the ground-truth composition of the test ssDNA molecule is 54% A (and 46% C), we conclude that in our setup the silver nanorod array substrate gives a better prediction than the gold substrate: the silver substrate yields both a smaller difference relative to the actual concentration value and a lower detection error. The higher detection error on gold could stem from several factors, either physical (the SERS enhancement effects) or related to the data analysis process. As seen in section 3.1 above, the SERS enhancement of the Au nanorod substrate is higher and fluctuates more than that of the Ag nanorod substrate. Moreover, the goodness of fit of the lognormal distribution was lower for the Au substrate than for Ag, as described in detail in Appendix section A8 (see Figs. 16, 17).
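The single-feature estimate can be illustrated with a short Python sketch. It assumes that the regression has the linear form ln(R) = a·C_A + b with C_A in percent, and it uses only the gold coefficients quoted in the text. Taking the mean of the log-ratios as ln(R) is also an assumption: it equals the log-median of a maximum-likelihood lognormal fit, which may differ from the fitting routine the authors used. All function names are hypothetical.

```python
import math

A_AU = 0.0215  # slope quoted in the text for gold (per percent of adenine)
B_AU = -1.68   # intercept quoted in the text for gold

def log_median_ratio(peak_a, peak_ac):
    """ln(R) for one map of spectra.

    For a maximum-likelihood lognormal fit, the log of the fitted median is
    the mean of ln(p_A / p_{A+C}) over the map.
    """
    logs = [math.log(pa / pac) for pa, pac in zip(peak_a, peak_ac)]
    return sum(logs) / len(logs)

def predict_adenine_percent(ln_r, slope=A_AU, intercept=B_AU):
    """Invert the assumed linear model ln(R) = slope * C_A + intercept."""
    return (ln_r - intercept) / slope

def predict_cytosine_percent(ln_r, slope=A_AU, intercept=B_AU):
    """Cytosine content follows from C_C = 100% - C_A."""
    return 100.0 - predict_adenine_percent(ln_r, slope, intercept)
```

A call such as `predict_adenine_percent(log_median_ratio(p_a, p_ac))`, with `p_a` and `p_ac` holding the two peak intensities of every spectrum in a map, would return the estimated adenine percentage for that map.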
To test the repeatability of the experiments, we consider the model performance on three samples with the same ssDNA test sequence, measured on substrates fabricated on different days. Table 2 below summarizes the results where training was performed on set #1 and the model was then applied to predict the test sequence on the other substrates (see also Table 7 in Appendix section A11 for more details). As convenient metrics of the model's sensitivity, we utilize the following two parameters: the regression residual (RR) and the standard deviation (SD). The RR value indicates how much the predicted (mean) value deviates from the actual ground-truth value, while the SD, which is given by the root mean square error (RMSE), indicates the spread of the predicted values around the predicted mean. As expected, the table indicates that on average SERS spectra acquired from samples fabricated on different days lead to a higher error than training and testing on samples from the same day. We reach a similar conclusion in the multiple-feature analysis below.
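The two metrics can be written down compactly. The sketch below follows the verbal definitions given above; the sign convention of RR (predicted mean minus ground truth) and the use of the population form of the spread are assumptions, and the function names are hypothetical.

```python
import math

def regression_residual(predictions, truth):
    """RR: deviation of the mean prediction from the ground-truth value."""
    mean_pred = sum(predictions) / len(predictions)
    return mean_pred - truth

def prediction_sd(predictions):
    """SD: root mean square deviation of the predictions about their mean."""
    mean_pred = sum(predictions) / len(predictions)
    squared = [(p - mean_pred) ** 2 for p in predictions]
    return math.sqrt(sum(squared) / len(squared))
```

With per-spectrum predictions of the adenine percentage for one test map, `regression_residual(preds, 54.0)` measures the bias and `prediction_sd(preds)` the scatter, so a model can be accurate on average yet still show a large SD.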

Composition detection of ssDNA using PCA with multiple-feature linear regression
Consider the experimentally acquired SERS spectra as being composed of a set of L features with L = 1021, corresponding to the number of wavenumbers in one spectrum. In so doing, we notice that each spectrum in the training set shares some common features with the others. By diagonalizing the covariance matrix formed by treating each spectrum as a vector in an L-dimensional space, we are able to describe each spectrum by a weighted sum of a smaller subset of features, also known as PCs. By determining this PC subset, as briefly shown below, we are able to extract additional features, which was not possible in the single-feature model considered above. After Gaussian smoothing and MSC, we implement the PCA transformation, which reduces the number of features to some number n ≪ L. First, we perform an error analysis as a function of n in order to determine the optimal number n specific to our data. The numbers of PCs used for training were chosen by inspecting the mean-square error (MSE) curves, which were calculated from the predicted and the actual values of the validation ssDNA composition set. The chosen values of n for the gold and silver substrates are n = 43 and n = 145, respectively. Detailed results and analysis are given in Appendix section A10.
We then construct a linear regression model and train it with this set by employing built-in Python capabilities (see more details in the Methods section) to predict the chemical composition of our test sample. Fig. 5 presents the results of ground truth versus predicted adenine count for gold (Fig. 5(a,b)) and silver (Fig. 5(c,d)), leading to the following predicted $\hat{C}_A$ values: on gold: $\hat{C}_A = 53.5\% \pm 2.17\%$; on silver: $\hat{C}_A = 54.3\% \pm 1.30\%$.
These results indicate that the silver nanorod array substrate admits a better detection sensitivity than the gold one; the corresponding RMSE values between the predicted and the actual values are 1.36% for the silver substrate and 2.20% for the gold substrate, equivalent to approximately three and four bases in our 200-base-long sequences. Importantly, our analysis indicates that the RR and SD values of $C_A$ are reduced significantly when PCA with the ML linear regression model is employed (Fig. 5), as opposed to the single-feature linear regression analysis alone (Fig. 3). For these datasets, the two most important pre-processing steps we use are Gaussian smoothing and MSC. PCA linear regression greatly benefits from pre-processing procedures that eliminate the multiplicative scattering noise because the whole spectrum is taken as its input. Single-feature linear regression, however, may in some cases not require such steps because the noise originates mostly from spectral regions that are not near the two RBM-related peaks (see Figs. 11, 12 in the Appendix). Therefore, if we include MSC in the input data of the single-feature regression, the sensitivity may not improve as much as in the case of PCA regression. The standard detection errors are summarized in Table 6 in Appendix section A7.
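The PCA regression pipeline and the choice of n from the validation MSE can be sketched as follows. This is a NumPy-only illustration rather than the scikit-learn code used in the work; the function names, the SVD-based PCA and the candidate list are assumptions. The inputs are pre-processed spectra as rows and adenine percentages as targets.

```python
import numpy as np

def fit_pca_regression(X_train, y_train, n_components):
    """Project spectra onto the leading PCs, then fit ordinary least squares."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions of the centered spectra, i.e.
    # the eigenvectors of their covariance matrix.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    components = vt[:n_components]
    scores = (X_train - mean) @ components.T
    design = np.column_stack([scores, np.ones(len(scores))])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return mean, components, coef

def predict(model, X):
    """Predict the composition of new spectra with a fitted model."""
    mean, components, coef = model
    scores = (X - mean) @ components.T
    return scores @ coef[:-1] + coef[-1]

def choose_n_components(X_train, y_train, X_val, y_val, candidates):
    """Return the candidate n with the lowest validation MSE, plus all MSEs."""
    mse = {}
    for n in candidates:
        model = fit_pca_regression(X_train, y_train, n)
        mse[n] = float(np.mean((predict(model, X_val) - y_val) ** 2))
    best = min(mse, key=mse.get)
    return best, mse
```

Scanning `candidates` over a range of n and plotting the returned `mse` values gives a validation curve of the kind used in the text to select n for each metal.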

Combining Au and Ag models for composition detection of ssDNA
Although the PCA linear regression model is already powerful compared to the single-feature one, it is still limited by the number of features contained in the spectra of a single metal. Therefore, by developing a fusion model where the data acquired from different metals, such as gold and silver, are merged into one input, we can increase the number of input features available for effective training. Instead of a simple multivariate linear regression, the model we use to fit this combined dataset is a neural network (NN). Since, according to Fig. 5, the Ag model yields better ssDNA composition detection results than the Au model, we would like to assign different weights to the input data corresponding to different metals in order to optimally utilize the combined Au-Ag dataset for training. This can be done in NN training, which is markedly different from our linear regression model described above, where all input spectra are treated equally (with the same weight). Our NN model consists of five layers in total: one input layer, one output layer, and three hidden layers connecting the input and the output (see Fig. 6(a)). There are several criteria for determining the number of hidden layers and hidden neurons [38], which help avoid underfitting or overfitting of the model. In this work we chose to implement an NN where the number of hidden neurons in one layer is 2/3 of the number of neurons in the layer directly before it, which provides the lowest detection RMSE in our case. In particular, each spectrum has 1021 wavenumbers, representing 1021 features. When combining the two metals, the total number of features becomes 1021 × 2 = 2042, which also serve as the inputs of our NN model. According to the mentioned rule, the numbers of hidden neurons in our model are 2042 in the first hidden layer, 2042 × 2/3 ≈ 1361 in the second hidden layer, and 1361 × 2/3 ≈ 907 in the third hidden layer.
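The layer widths quoted above follow from repeatedly applying the 2/3 rule to the 2042 combined inputs, which a few lines of Python can reproduce. The helper name and the rounding to the nearest integer are assumptions made for this illustration.

```python
def hidden_layer_sizes(n_inputs, n_hidden=3, ratio=2 / 3):
    """Widths of the hidden layers: the first matches the number of inputs,
    and each following layer has `ratio` times the neurons of the previous."""
    sizes = [n_inputs]
    while len(sizes) < n_hidden:
        sizes.append(round(sizes[-1] * ratio))
    return sizes

# Two metals, 1021 wavenumbers each, concatenated into one input vector.
combined_features = 1021 * 2
print(hidden_layer_sizes(combined_features))  # [2042, 1361, 907]
```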
We also used a nonlinear activation function, ReLU, for the input and hidden layers; the model was trained by back-propagation for error minimization and validated over several rounds, namely "epochs." Training is terminated when the mean-squared error between the predicted and ground-truth values of the validation set is at its lowest and remains unimproved over the next 30 epochs. Fig. 6 presents a comparison between linear regression for gold (b), linear regression for silver (c) and an NN model that combines gold and silver (d). Importantly, the result of combining the data of the two metals is superior to the results obtained by considering each of the metal models separately. Table 4 presents the NN model performance in predicting the composition of the test sequence on substrates fabricated on different days, with training done only on the first-day data (referred to below as Test set #1). The results over the three test sets indicate smaller errors and more consistent prediction values compared to the linear multiple-feature PCA model. For a fair comparison with the PCA regression method, we also build additional NN models in which different numbers of training sample sets are used and in which silver and gold are treated as separate inputs (see Tables 10, 11, 12 and Figs. 20, 21 in Appendix section A11). In particular, the results in Table 10 and Fig. 20 indicate that, unlike the case of PCA linear regression, the prediction sensitivity of the metal-fusion NN model does not improve if the number of training sets is increased. Importantly, our results indicate the following two points: first, the NN model yields superior detection sensitivity (see Fig. 22) relative to the linear multi-feature PCA model, and second, the average sensitivity of the NN model is superior to both the single-feature and multiple-feature cases (Fig. 7).
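The stopping rule described above can be made explicit with a small, framework-independent sketch. It takes a recorded validation-MSE curve and reports which epoch would be kept and when training would stop; the function name and return convention are hypothetical. In Keras, comparable behaviour is commonly obtained with the `EarlyStopping` callback and `patience=30`, although the text does not state which mechanism was used.

```python
def early_stopping(val_mse, patience=30):
    """Return (best_epoch, stop_epoch) for a validation-MSE curve.

    Training stops at the first epoch that lies `patience` epochs past the
    best one without improvement, or at the last recorded epoch.
    """
    best_epoch, best_value = 0, float("inf")
    for epoch, value in enumerate(val_mse):
        if value < best_value:
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_mse) - 1
```

For a curve that reaches its minimum at epoch 2 and then stays flat, the sketch keeps epoch 2 and stops 30 epochs later, at epoch 32.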
While in this work we employed ssDNA molecules with two DNA bases and showed that the NN can significantly reduce the prediction errors relative to linear regression models, incorporating all four bases in the molecule should in principle be possible in future work, as all four DNA bases exhibit different features in SERS spectra (see [42] for studies of bases adsorbed to silver) and are thus well suited to the NN method.

Conclusions
In this work we experimentally and numerically studied the CE effect of nucleotides adsorbed to gold and silver nanorod structures, and employed PCA paired with an ML algorithm as well as an NN model to demonstrate a highly sensitive ssDNA composition detection method for 200-base long ssDNA sequences. In particular, we demonstrated that our method is superior to a single-feature linear regression model which takes into account only the SERS intensity of two dominant peaks, and leads to lower RMSE prediction values for SERS spectra acquired on both gold and silver nanorod array substrates. Importantly, our results indicate that each metal possesses a distinct CE effect with the same molecule, and therefore operates as an independent probe which provides additional information due to a distinct metal-molecule interaction regime. In particular, our NN-based analysis indicates that the prediction error drops once input data from both metals are used as a training set, compared to the case when input data from only one of the metals are used. Furthermore, comparing the performance of multi-feature PCA and NN in cases where the training and test data belong to samples prepared on different days indicates that the NN provides superior performance, and is therefore useful for mitigating the effects of data dispersion, which may emerge due to bio-degradation of the samples over time or due to slight differences between nanorod substrates fabricated on different days. The latter is particularly relevant, as fabricating substrates with the nano-scale features required for SERS sensing at high levels of uniformity and reproducibility is still challenging [46]. In future studies it would be interesting to investigate the effect of these and other potential factors on the SERS signal and to determine more effective NN modalities.
Straightforward future directions for extending our basic NN model include improving its detection performance and its explainability/interpretability, as well as incorporating non-linear models with data visualization in order to enhance ssDNA sensitivity. We hope that our work will stimulate future studies in which different ML and deep learning models are exploited to realize all-optical, SERS-based single-base detection sensitivity for long ssDNA and dsDNA molecules.

Table 5. ssDNA sequences employed in our work; five control sequences and a test sample.

A2. Comparison between SERS spectra and normal Raman spectra of ssDNA
The normal Raman spectra were collected by depositing the ssDNA solution onto a plain silicon substrate, with the same ssDNA solution concentration and in the same way as we deposited ssDNA on the gold and silver substrates. The results are shown in Fig. 8, where the SERS signal of ssDNA measured on a gold nanorod substrate is shown on the left (a) and the normal Raman signal of ssDNA measured on a plain unprepared silicon substrate is shown on the right (b).

Similar to the single-feature model above, we tested the prediction performance of the multiple-feature model on test sequences deposited on samples fabricated on different days, where the training was done on various days. However, in this case, due to the variation in the data components from one day to another, a model that has been trained on one dataset with a fixed number of PCs does not perform well on datasets from different dates (see Table 3 below). The prediction errors can be reduced once there are enough samples coming from different measurement dates. In order to effectively monitor the prediction improvement, we averaged the RR and SD values of the above three test datasets. Fig. 19 indicates that increasing the number of datasets used in the training from one to two and then to three leads to enhanced average detection sensitivity. The average detected composition improved from 47.1% ± 3.55% to 54.7% ± 3.77% for Ag, and from 47.8% ± 2.98% to 55.3% ± 2.63% for Au. Detailed predicted values for each set in different training rounds and their average values are listed in Appendix section A11 (see Tables 8, 9).

Fig. 9. SEM images of gold (left) and silver (right) nanorod array substrates.

A4. Oblique-angled deposition (OAD) tilted angle calculation
The tangent rule reported in [31] is given by $\tan\beta = \tfrac{1}{2}\tan\alpha$, where α is the deposition angle and β is the nanorod tilt angle. Inserting α = 75° into the equation above leads to a nanorod tilt angle of β = 61.8°.

Fig. 10(a,b) present the effect of different metals on the ring-breathing mode (RBM) of the adsorbed A and C bases; peak #1 and peak #2 stand for the RBM modes of A and C, respectively.
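The quoted tilt angle can be verified numerically. The sketch below assumes the tangent rule in its commonly used form, tan β = (1/2) tan α, which reproduces the value β = 61.8° stated above; the function name `tilt_angle_deg` is introduced here only for illustration.

```python
import math


def tilt_angle_deg(alpha_deg: float) -> float:
    """Nanorod tilt angle beta (degrees) from the deposition angle alpha
    (degrees), assuming the tangent rule tan(beta) = 0.5 * tan(alpha)."""
    return math.degrees(math.atan(0.5 * math.tan(math.radians(alpha_deg))))


beta = tilt_angle_deg(75.0)
print(f"beta = {beta:.1f} degrees")
```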

A6. Calculation of CE effect factor Γ CE
From the average SERS EF equation $\mathrm{EF} = (I_{\mathrm{surf}}/N_{\mathrm{surf}})/(I_{\mathrm{bulk}}/N_{\mathrm{bulk}})$, where $I_{\mathrm{surf}}$ and $N_{\mathrm{surf}}$ are the SERS intensity and the number of DNA molecules measured on the surface, and $I_{\mathrm{bulk}}$ and $N_{\mathrm{bulk}}$ are the normal Raman intensity and the number of DNA molecules measured in the bulk, respectively, we have the EF calculations for the system with and without CE to be: Because the amount of DNA solution drop cast on the substrate is similar, we can assume that the number of molecules excited by the incoming laser is the same regardless of the CE effect, $N_{\mathrm{surf}}(\mathrm{EM+CE}) = N_{\mathrm{surf}}(\mathrm{EM})$. Therefore, Eq. (7c) becomes
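As a hedged sketch of how this step may proceed, the derivation below assumes that Γ_CE is defined as the ratio of the enhancement factor with CE to the enhancement factor without CE. That definition, and the argument notation (EM+CE) and (EM) on the intensities, are assumptions made here to match the symbols used above; the bulk terms and the equal molecule numbers then cancel.

```latex
\begin{align}
\mathrm{EF}_{\mathrm{EM+CE}} &= \frac{I_{\mathrm{surf}}(\mathrm{EM+CE})/N_{\mathrm{surf}}(\mathrm{EM+CE})}{I_{\mathrm{bulk}}/N_{\mathrm{bulk}}}, \\
\mathrm{EF}_{\mathrm{EM}} &= \frac{I_{\mathrm{surf}}(\mathrm{EM})/N_{\mathrm{surf}}(\mathrm{EM})}{I_{\mathrm{bulk}}/N_{\mathrm{bulk}}}, \\
\Gamma_{\mathrm{CE}} &= \frac{\mathrm{EF}_{\mathrm{EM+CE}}}{\mathrm{EF}_{\mathrm{EM}}}
 = \frac{I_{\mathrm{surf}}(\mathrm{EM+CE})}{I_{\mathrm{surf}}(\mathrm{EM})}
 \quad \text{when } N_{\mathrm{surf}}(\mathrm{EM+CE}) = N_{\mathrm{surf}}(\mathrm{EM}).
\end{align}
```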

A7. Signal processing of SERS measurements
After being collected, the SERS spectra are put through Gaussian smoothing, normalization and Multiplicative Scatter Correction (MSC) before being used in the calibration curve. For single-feature linear regression, the data only undergo Gaussian smoothing and normalization. For multiple-feature linear regression, the data undergo all the pre-processing steps: Gaussian smoothing, normalization and MSC. The reason has been explained in the text above. To normalize the data to the range between 0 and 1, each spectrum is scaled using the following equation: where X represents the SERS intensities in one spectrum. Afterwards, MSC can either be performed or not on the dataset. In MSC processing, we assume that the measured spectrum was scaled and had some white noise added, which we need to remove to bring the data closer to the ideal data. In other words, the light scattering or change in path length for each spectrum is estimated relative to that of an ideal spectrum: Here we assume that the noise is Gaussian white noise: $E \sim G(0, \nu I)$. Because we have $E = X_c - b R_c$, the variable $X_c$ also has a Gaussian PDF: The scattering factor b can be calculated using Maximum Likelihood estimation (MLE): To find the value of b such that $\lVert X_c - b R_c \rVert^2$ is minimum, we need to find the zero of the first derivative: We input data that have and have not been pre-processed with MSC into our two linear regression methods, and compared the detection errors obtained from the two methods. The results are shown in Figs. 13, 14, 15.
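The normalization and MSC steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the processing code used for the reported results: min-max scaling is assumed for the normalization to [0, 1], the subscript c is taken to mean mean-centering, and b is obtained as the least-squares minimizer of ‖X_c − bR_c‖², i.e. b = (R_c · X_c)/(R_c · R_c). All function names are hypothetical, and Gaussian smoothing is omitted.

```python
import numpy as np


def min_max_normalize(x):
    """Scale one spectrum to the range [0, 1] (assumed min-max scaling)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def msc_scale_factor(x, r):
    """Least-squares scattering factor b for the model X_c = b * R_c + E,
    where X_c and R_c are the mean-centered spectrum and reference."""
    x_c = x - x.mean()
    r_c = r - r.mean()
    return float(r_c @ x_c / (r_c @ r_c))


def msc_correct(x, r):
    """Remove the estimated scale and offset of x relative to reference r."""
    b = msc_scale_factor(x, r)
    return (x - x.mean()) / b + r.mean()


# Hypothetical reference spectrum and a scaled, offset copy of it.
wavenumbers = np.linspace(0.0, 1.0, 1021)
reference = np.exp(-((wavenumbers - 0.4) ** 2) / 0.002)
measured = 2.0 * reference + 3.0

b = msc_scale_factor(measured, reference)
corrected = msc_correct(measured, reference)
normalized = min_max_normalize(measured)
print(b)
```

In practice the reference is often taken as the mean spectrum of the dataset, which is a common choice and not something stated in the text.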

A8. Probability functions with lognormal distribution fit
For the single-feature linear regression calculation, the data were fitted with a lognormal distribution because the peak ratios cannot be negative, and this distribution was previously shown to have a high goodness of fit [16]. In this paper, the PDFs and CDFs are plotted again to confirm that the fit still holds. The plots were created with the histfit function in MATLAB. If the empirical CDF is similar to the theoretical CDF, the chosen distribution is a good fit for our data. From the plots, we can see that the fit was not as good for gold as for silver. As the percentage of adenine increases, the histogram becomes more skewed and a good fit could not be found for the data. However, the lognormal distribution was the best choice compared to other types of distribution. The lognormal fit was good for the other mixtures, and exceptionally good for those measured on silver.

A9. Single-feature linear regression model and DNA composition calculation
The single-feature linear regression was performed in MATLAB. We used a built-in MATLAB function called LinearModel, which takes in the five training datapoints (lognormal of the ratio A/(A+C) for the five DNA compositions) and returns a line fitted to the five datapoints. The testing data are then plugged into this linear equation to predict the DNA composition. The quantification error for this case is calculated from the RMSE of the fitting model. Consider the general form of the regression line and of the DNA composition C A as: where R is the ratio of the two peaks mentioned in the main text, A 1 is the central value of the adenine percentage, and A 2 is the prediction uncertainty. With the standard deviation of the peak ratios std(R) included, the percentage of adenine becomes: For example, the lines derived from the LinearModel function for gold and silver without MSC processing are: Therefore, from the lognormal of the ratio obtained from the testing sample, we can estimate the adenine percentage and the prediction uncertainty.

The PCA linear regression was performed in a Python notebook. The built-in PCA function from the sklearn library was used to transform the input matrix and reduce its dimensionality. Afterwards, another built-in function, LinearRegression, also from sklearn, was used to take in the PCA-processed data for training and for prediction of the testing sample. Fig. 18 indicates that for both gold (Fig. 18(a)) and silver (Fig. 18(b)), the explained variance converges rapidly towards unity; at 10 PCs about 97.5% of the data variance is explained, whereas above 30 PCs approximately 100% of the data variance is explained for ssDNA binding.
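The PCA-plus-regression workflow described above can be sketched with the two sklearn classes named in the text, `PCA` and `LinearRegression`. Everything else is an assumption made for illustration: the synthetic "spectra" are random mixtures of two hypothetical basis vectors standing in for A and C signatures, and the choice of 5 components, the noise level and the dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_features = 50  # stand-in for the 1021 wavenumbers of a real spectrum

# Hypothetical pure-component signatures (not measured data).
basis_a = rng.random(n_features)
basis_c = rng.random(n_features)


def make_spectra(fraction_a):
    """Synthetic spectra: linear mixture of the two signatures plus noise."""
    fraction_a = np.asarray(fraction_a)[:, None]
    clean = fraction_a * basis_a + (1.0 - fraction_a) * basis_c
    return clean + 0.01 * rng.standard_normal(clean.shape)


y_train = rng.random(200)  # adenine fraction of each training spectrum
X_train = make_spectra(y_train)
y_test = rng.random(50)
X_test = make_spectra(y_test)

# Reduce dimensionality with PCA, then regress composition on the PC scores.
model = make_pipeline(PCA(n_components=5), LinearRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = float(np.sqrt(np.mean((y_pred - y_test) ** 2)))
print(f"test RMSE: {rmse:.4f}")
```

Wrapping both steps in one pipeline ensures that the test spectra are projected with the PCA basis learned from the training data only.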

A10. Determination of the PC number for training and testing data in PCA linear regression model
From the red dotted lines in Fig. 18, we conclude that the optimal PC numbers, which correspond to the lowest quantification errors for the gold and silver substrates, are n = 43 and n = 145, respectively.

Fig. 18. Explained variance and MSE of prediction plots for gold (a) and silver (b) nanorod array substrates. The explained variance plot (blue) shows the number of PCs required to express 100% of the data variance (where explained variance = 1). The MSE plot (red) indicates the number of PCs for which the detection error is lowest.
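The selection of the PC number can be illustrated by scanning the number of components and keeping the one with the lowest prediction MSE, alongside the cumulative explained variance. The sketch below uses synthetic data and a simple hold-out split; the data, the scanned range of 1 to 10 components and the variable names are all assumptions, not the procedure or values behind Fig. 18.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n_features = 40

# Hypothetical spectra: two-component mixtures plus noise (not measured data).
basis_a, basis_c = rng.random(n_features), rng.random(n_features)
y = rng.random(300)
X = y[:, None] * basis_a + (1.0 - y[:, None]) * basis_c
X += 0.02 * rng.standard_normal(X.shape)

X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Cumulative explained variance as a function of the number of PCs.
cumulative_variance = np.cumsum(PCA().fit(X_train).explained_variance_ratio_)

# Validation MSE for each candidate number of PCs.
candidates = list(range(1, 11))
mse_per_n = []
for n in candidates:
    model = make_pipeline(PCA(n_components=n), LinearRegression())
    model.fit(X_train, y_train)
    mse_per_n.append(float(np.mean((model.predict(X_val) - y_val) ** 2)))

best_n = candidates[int(np.argmin(mse_per_n))]
print(best_n, mse_per_n[best_n - 1])
```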

A11. Predicted C A values for different test datasets using different ML models and training sets
We have 200 spectra in test sets #1 and #2, and 400 spectra in test set #3. The average RR and SD values were calculated based on the values collected from each test set and the number of spectra in each set. In particular, we will have:

$$\text{Average SD} = \sqrt{\frac{2\,(\mathrm{RMSE}_{\#1})^{2} + 2\,(\mathrm{RMSE}_{\#2})^{2} + 4\,(\mathrm{RMSE}_{\#3})^{2}}{8}} \qquad (19)$$

$$\mathrm{GT} = 54\% \qquad (20)$$

A11.1. Single-feature linear regression

Fig. 19. Plots of predicted average adenine composition values when the model was trained on different numbers of datasets. The results were averaged over the three test sets described above. As more datasets were used for training, a higher detection sensitivity was obtained for the model.
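The weighted averaging of the per-set errors in Eq. (19) can be sketched as below. The weights 2, 2 and 4 follow from the 200, 200 and 400 spectra in the three test sets; taking the square root of the weighted mean of squares is an assumption of this sketch (it keeps the result in the same units as the RMSE values), and the numerical inputs are hypothetical.

```python
import math


def average_sd(rmses, weights=(2, 2, 4)):
    """Weighted root-mean-square of per-test-set RMSE values.

    Weights are proportional to the number of spectra in each test set
    (200, 200, 400). The square root is an assumption of this sketch.
    """
    weighted_sum = sum(w * r ** 2 for w, r in zip(weights, rmses))
    return math.sqrt(weighted_sum / sum(weights))


# Hypothetical RMSE values (in % adenine) for test sets #1, #2 and #3.
example = average_sd([1.0, 1.0, 2.0])
print(f"{example:.3f}")
```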

A11.3. Neural Network models
• For (Au + Ag) fusion
• For Au only
• For Ag only