Retrieving genuine nonlinear Raman responses in ultrafast spectroscopy via deep learning

Noise manifests ubiquitously in nonlinear spectroscopy, where multiple sources contribute to experimental signals generating interrelated unwanted components, from random point-wise fluctuations to structured baseline signals. Mitigating strategies are usually heuristic, depending on subjective biases like the setting of parameters in data analysis algorithms and the removal order of the unwanted components. We propose a data-driven frequency-domain denoiser based on a convolutional neural network with kernels of different sizes acting in parallel to extract authentic vibrational features from nonlinear background in noisy spectroscopic raw data. We test our approach by retrieving asymmetric peaks in stimulated Raman spectroscopy (SRS), an ideal test-bed due to its intrinsic complex spectral features combined with a strong background signal. By using a theoretical perturbative toolbox, we efficiently train the network with simulated datasets resembling the statistical properties and lineshapes of the experimental spectra. The developed algorithm is successfully applied to experimental data to obtain noise- and background-free SRS spectra of organic molecules and prototypical heme proteins.

Nonlinear optics has enabled and fostered the application of spectroscopy to the ultrashort time scales.5][16] A common approach consists in using multiple ultrashort pulses, shaped in frequency and time, to resolve the induced modifications to a certain optical observable in a differential manner when one of the pulses is switched on and off.Dynamical insights can be obtained by adding a photochemical actinic pump and/or by tuning the pulses in resonance with the electronic absorption edges from which excited-state relaxation occurs.4][25][26] Post-processing routines are usually strongly dependent on the specific sample and on the experimental parameters.The sources of noise are not quantitatively known and often the exact lineshapes of both the baseline and signal cannot be predicted in advance. 27,28Moreover, point-wise denoising and baseline subtraction are generally heavily correlated operations, which cannot be factorized, particularly when the target clean spectrum contains asymmetric lineshapes or components with largely different relative intensities. 29he practical solution is often delegated to the experienced eye of the spectroscopist, a strategy which hampers automatization of the routine and may lead to suboptimal resolutions, ambiguity and human biases.
To overcome these critical issues and enhance the resolution of nonlinear Raman techniques beyond the limitations induced by the background and low signal-to-noise conditions, we have devised and trained a deep neural network (NN) based on multiple convolutional layers operating in parallel for denoising and baseline removal of raw SRS spectra.By designing the network architecture and choosing a suitable loss function and optimization strategy during training, we show how to perform the two tasks in parallel and avoid the difficulties which hinder the applications of standard data processing algorithms.
0][41][42][43][44] The lineshapes in these techniques are always positive and the luminescence background is usually not structured.More importantly, the removal of the luminescent background and the denoising is tipically tackled separately.In the case of nonlinear Raman spectroscopy, due to asymmetric and complex lineshapes which are typical of SRS signals, [45][46][47] there are no optimal methods to disentangle the two tasks and perform them sequentially.Supervised training of the algorithms is further complicated by the absence of large enough labeled datasets ensuring a statistically relevant representation of the diverse SRS baseline and peak structures.To address these challenges, we combine here a multi-parallel convolutional NN architecture with a supervised training built on a theoretical toolbox based on the density matrix perturbative expansion for accurate modeling of the spectroscopic signals and their characteristic noise.

A. Signal and noise in Stimulated Raman Scattering
SRS is a third order nonlinear optical effect that can be generated in the sample by the joint action of a broadband femtosecond Probe pulse (PP) and a narrowband picosecond Raman pulse (RP) overlapped in time, 17 and then detected for spectroscopic applications to access the vibrational structure of the system under investigation (Fig. 1a).The signal is coherently stimulated when the energy difference between the two laser pulses matches a Raman active molecular transition.Similarly to the spontaneous case, where the Raman peaks are spectrally located at the red and blue shifted sides of the excitation wavelength, the SRS signal is positively and negatively offset with respect to the RP.SRS features, however, arise as peaks, dips or even dispersive signals generated on top of the broadband spectral envelope of the probe, which is spectrally dispersed by means of a monochromator and detected (Fig. 1b).SRS spectra usually display the normalized difference between the PP spectra recorded after the sample with the RP switched on and off (Raman Gain, RG), as a function of the detected frequency shift with respect to the RP (Raman shift).If the RP wavelength is tuned in resonance with the absorption of the sample, the transmission of the PP itself can be modulated even in the absence of stimulated Raman, due to the fast electronic response of the sample (Transient Absorption effect, TA). 48This, together with additional nonlinear processes, including but not limited to solvent effects 49 and nonlinear phase modulation, 24 causes the presence of an unwanted baseline, usually broader than the Raman features, which needs to be removed from the SRS raw data to extract the Raman spectra and correctly retrieve the vibrational information.
1][52] This latter is typically due to the electronic fluctuations of the photogenerated carriers, the error related to the readout process, and the intrinsic shot noise limit caused by the quantum nature of light.Expected in any type of heterodyne-detected spectroscopy, these sources of noise combine with the inherent ambiguity of the baseline subtraction procedure, with the result that the overall signal-to-noise ratio achieved in real case scenarios is often considerably lower than the nominal sensitivity of the technique.This is exemplified in Fig. 1c, in which we show the outcomes of different baseline estimation techniques applied to a typical SRS spectrum of the wild-type Green Fluorescent Protein.Without any assumption on the nonlinear dynamics and on the Raman peaks lineshapes and spectral positions, the estimated baseline is highly dependent on the subtraction algorithm.Consequently, it is often not possible to infer any conclusive statement on such ambiguously extracted features.
In order to simulate SRS spectra closely resembling the data obtained by the experiments, we used a perturbative framework based on the density matrix expansion. 53e simulated two datasets with different levels of noise, the high noise (HN) and low noise (LN) datasets.Each dataset consists of 5000 raw SRS spectra of 801 points each, with noise and baseline, associated to the corresponding clean spectra, serving as ground truth (GT).80% of the samples (4000 simulated spectra for each dataset) were used during the training phase.Additional details on the datasets are given in Methods and SI, section Ib.

B. Supervised training of the neural network
Considering the different spectral scales present in the problem, we adopted a convolutional neural network architecture [54][55][56] combining different kernel sizes in parallel to reach different receptive fields at the same time.The general architecture of the model is shown in Fig. 2 and is based on parallel linear and nonlinear transformations consisting in zero padded convolutions with different kernel sizes and nonlinear activation functions.Each of the N kernel parallel convolutional branches features N layers layers of branch-depended kernels with N param trainable parameters.A detailed description is provided in Methods.The network's output is y = f θ,w,b (x) where the parametric nonlinear map f θ,w,b depends on the hyperparameters {θ}, summarized in table I, and on the weights w and biases b of the convolutional layers.The task of the training step is to learn the set of {w, b} which being the hyperparameters {θ} fixed before the training.In eq. 1, {(x 1 , y GT 1 ), (x 2 , y GT 2 ), . . .(x L , y GT L )} is the training set of simulated noisy raw spectra (x k ) and corresponding clean ones (y GT k ), i.e. the ground truth (GT).Both x k and y GT k are 1D vectors of size n input , functions of the sampled frequencies ω 1 , . . ., ω ninput .We performed stochastic gradient descent 58 to minimize the custom loss function L: where ∥ • ∥ ℓ indicates the ℓ norm and y = f θ,w,b (x).L contains a reconstruction term and a derivative term depending on the gradients of the reconstructed spectrum and GT.The two terms are activated at different epochs during training: initially only the reconstruction term is adopted for N 0 epoch , allowing the algorithm to perform first estimations of possible baselines and noise without heavily impacting on the latter.Then the gradient term is included in the loss calculation and the model is trained for additional N 1 epoch epochs.The gradient term impacts heavily on the removal of the point-wise fluctuating noise, while the reconstruction term is sensitive to both the noise and corrections of the broadband baseline.The parallel optimization of these two terms is the key to obtain more accurate results with respect to performing these two tasks sequentially.The balance weight between the reconstruction and the gradient term is controlled by the hyperparameter W grad ∈ (0, 1), which has been optimized by looking at the network performances and fixed to the value reported in table I nator on the signal-to-noise level of the data, as detailed in section II.d of the SI.

C. Validation on simulated data
We tested the networks trained on the HN and LN datasets on test sets containing the 20% of the samples of the corresponding datasets, previously unseen by the networks during training.In Fig. 3, we present the result for five typical spectra extracted from the HN test set.The corresponding results and the analysis for the LN test dataset are reported in section II.c of the SI.The spectra reconstructed by the NN (green lines) show an excellent agreement with the clean GT spectra (black dotted lines), for very different baseline shapes, which are shown by the raw data in red in the top panel.These examples are compared to the results obtained by an iterative spline (iSpline) procedure for baseline removal 59 followed by a Savitzky-Golay (SG) filtering.We note that the parameters of the SG filter have been chosen a posteriori by using the GT, in order to obtain a tradeoff between an optimal smoothing of the point-wise noise and the preservation of the linewidth of the peaks.Consequently, this is an optimal procedure which cannot be meet in a real case scenario, i.e. when the GT is not available.Notwithstanding, the reconstruction of the lineshape, linewidth and relative intensity of the peaks and the overall removal of the baseline and the noisy pointwise fluctuations are qualitatively better when using the NN.
For a quantitative comparison, it is necessary to define indicators that measure the capability of a given algorithm to perform a specific task.We combined selected metrics from the literature with custom-defined ones to evaluate the performance of the NN for the tasks more closely related to spectroscopic purposes, namely the identification of peak position and lineshapes, evaluation of relative intensities, separation of overlapped components and minimization of false-peak predictions.Specifically, we computed the Structural Similarity Metric (SSIM), Normalized Mean Absolute and Mean Squared Errors (NMAE and NMSE), as defined in Methods.In addition to these standard metrics, we developed a custom edge finder, which determines the spectral positions FIG. 3. Evaluation of representative simulated test samples from the HN dataset.For each sample, noisy raw data are shown in the top panel (red lines).Central and bottom panels show the SRS spectra obtained by using the NN (green) or the iterative spline procedure for baseline subtraction followed by a Savitzky-Golay filtering (blue).GT are also reported for comparison (black dotted lines).
of the positive and negative peaks from an input spectrum resulting from the processing of the noisy raw data and compares them to the associated GT spectrum.We used the edge finder to count the correctly assigned edges (true positive, TP) and the errors induced by the data processing (false positive and false negative, FP and FN).The edges are considered TP if their spectral position deviates from the corresponding GT more than a tunable value of tolerance.In the following, we set the tolerance to one pixel.This allows us to redefine this problem to a classification task and calculate standard metrics as the F1 score and precision on retrieving the spectral positions of the features accurately.Finally, we defined a custom Signal-to-Noise Ratio (SNR) metrics as where ω ν = {ω i , ∀i : |ω i − ω GT peak | < ϵ} and ϵ = 80 pixels, which was set taking into consideration the typical width of the Raman features.This metric measures the capability of an algorithm to obtain a clean and smooth baseline with respect to the area of the main peaks in the spectral position defined by the GT.
We have compared the results obtained by the NN with those obtained by four different traditional algo-rithms for baseline removal: third-order modified polynomial (iModPoly), 60 iterative spline, SG filter and peakscreened Asymmetric Least Squares (ALS). 61For each algorithm, baseline removal was followed by an additional SG filter for point-wise denoising and smoothing (see Methods for additional details).The results obtained for the HN dataset are shown in Fig. 4, while the corresponding results for the LN set are reported in section II.c of the SI.The NN outperforms the standard algorithms in identifying all the edges in the SRS spectrum, with a full precision score achieved in the 63% of the test samples, compared to the 42% of the best standard algorithm (iSpline+SG filter).More importantly, the distribution of the F1 score achieved by the NN is narrower and shifted to the higher values, with a mean of 0.86 and standard deviation of 0.18 to be compared with the mean of 0.70 and standard deviation of 0.30 of the standard data processing routine.This is shown in Fig. 4a, where the distribution of the results on the F1 score for the worst quartile of the test set is presented by whisker box plots, while the full histogram is reported in Fig. S2 of the SI.We note that the values of precision depend on the amount of tolerance on the peak position discrepancy between the GT and reconstructed data, but this dependence does not alter the results presented here (details are reported in SI, section IIb).Also for the other tasks that we have considered, the NN shows superior performances, as illustrated by the analysis of the associated metrics in Fig. 4b.In particular, for the metrics NMAE, NMSE and for 1/SNR, which project optimal results towards zero, the medians relative to the NN are lower than the first quartile of all the distributions of the standard methods.There is also a large improvement in the SSIM metric -optimal at SSIM = +1 -for which the median of the NN is larger than the 95th percentile of the best standard method.This is correlated to the ability of the NN to also recover the overall lineshape of the Raman features, in addition to the signal-to-noise, intensity and peak positions, which are best measured by the other metrics.The performances of the standard methods are comparable among each other, with the polynomial and iterative spline algorithms giving the best results.

D. Validation on experimental data
To validate the NN denoiser in a real case scenario, we have applied it to the resonant SRS spectra measured on equine heart deoxy Myoglobin (Mb) dissolved in pH 7.4 buffer and Cresyl Violet (CV) dissolved in methanol, using the hyperparameters found by training with the LN dataset, whose signal-to-noise ratios are comparable to the ones obtained for these particular samples with our SRS setup.In Fig. 5a, we report the raw experimental SRS spectra of Mb pumped across the Soret absorption band, with the RP wavelength tuned at 447 nm and 472 nm for the red and blue sides of the spectrum, respectively.The resonant condition impacts differently on the two sides of the spectrum and makes the Raman lineshapes highly dependent on the RP wavelength and, for the blue side, on the frequency of the normal mode. 45In the central and lower panels, we show the corresponding spectra processed by the NN and the polynomial algorithm.The results are in good agreement with the literature 59,62 and the best results obtained by traditional processing routines, for both the gain and dispersive lineshapes.
We then tested the algorithm in the more demanding case of CV undergoing photoexcitation due to a resonant, high-fluence RP tuned at 580 nm.In these conditions, the RP also acts as an actinic pump, inducing electronically excited-state populations in the sample by means of two additional light-matter interactions preceding the SRS process. 64The efficiency of this higher-order nonlinear process depends on the time delay between the Raman and the probe.Tuning the delay, it is possible to control the amount of excited-state population induced by the Raman pulse.Here a positive (negative) sign indicates that the PP follows (precedes) the RP.At large negative delays, the Raman scattering mainly involves molecules initially in the electronically ground state.At positive delays, SRS can probe vibrational transitions originating from a mixture of excited and ground-state populations, with the cost of a lower signal-to-noise ratio, spectral distortions and an overall increased complexity of the baseline and lineshapes.We considered the SRS spectra for two different delays between the Raman and probe pulses: −0.8 ps (PP preceding RP) and +1.2 ps (PP following RP).In Fig. 5b we report the raw SRS spectra at the two delays and the corresponding ones processed by the NN and by the iModPoly algorithm followed by SG filtering.For the negative delay (left panel), the processed spectra present sharp and intense features, with a high signal-to-noise ratio and spectral positions matching the frequencies of both the electronically excited and ground-state Raman modes of CV, due to the resonant RP wavelength.For the positive delay (right panel), as a consequence of the interference with the additional Raman processes originating from the excited-state populations, spectral features appear broadened and red-shifted, their lineshapes become dispersive and asymmetric and the signal-to-noise ratio decreases drastically, with different impact on the two methods used to process the raw spectra.Notably, the combined iModPoly and SG algorithm are not able to resolve the excited-state peak at 733 cm −1 , which is conversely retrieved by the NN.The polynomial processing also shows parasitic broad peaks between 850 and 900 cm −1 , and poorly captures the dispersive lineshape at 680 cm −1 .Moreover, the two approaches give different relative intensities between the peaks, as also observed for the simulated datasets.These results confirm that the NN algorithm achieves higher performances also in the case of complex experimental spectra.

II. DISCUSSION
Low signal-to-noise ratios and structured spectral lineshapes in nonlinear Raman spectroscopy cause loss of information and increase the complexity in performing the measurements, due to the need of longer exposure times for averaging, and data analysis.We have demonstrated that wisely-designed deep neural networks can overcome these limitations and achieve background removal and denoising of stimulated Raman spectra.By means of multiple kernel sizes operating in parallel within a convolutional residual neural architecture, it is possible to adapt the receptive field of the network to the informative features in the spectra and treat the multiple spectral scales present in the SRS data, related to the complex Raman lineshapes, to the background and to point-wise noisy fluctuations.We have shown that such architecture is able to adapt to different levels of noise and prominence of the Raman bands from the baseline, by training on datasets simulated through the diagrammatic theory and characterized by noise and material parameters resembling those present in experimental conditions.Depending on the level of noise, the NN demonstrated itself comparable or superior to the standard algorithms commonly used for SRS data postprocessing.The advantages are particularly evident in presence of multiple features with asymmetric lineshapes and intensities that are weak compared to the noise and to the other bands in the same spectral region.The NN algorithm was able to identify the Raman bands, reconstruct the correct lineshapes and relative intensities and enhance the spectral resolution by resolving the vibrational frequencies and bandwidths for close or overlapping Raman features.These abilities are pivotal for the interpretation of experiments leveraging on resonant optical excitations to detect cooperative mechanisms between coherent vi-  S2).(b) SRS spectra of Cresyl Violet obtained upon resonant excitation with a RP tuned at 580 nm at two different delays between the RP and PP: −0.8 ps (PP preceding RP, left panels) and +1.2 ps (PP following RP, right panels).Raw data are indicated by red lines, while green and blue lines represent processed spectra predicted by the LN network and by the iModPoly algorithm for baseline removal followed by a SG noise filter.Vertical dashed lines indicate the spectral position of the Raman modes of the ground (in black) [63][64][65][66] and the excited states (in orange), [63][64][65] which are reported in SI, table S3.
brations and electronic excitations.Once trained, the network generalizes to experimental spectra obtained on different samples, preserving its high performances.We note that SRS is a optimal test-bed for investigating deep learning applications to nonlinear spectroscopy, given its spectral complexity and variety of lineshapes for both the Raman features and background contributions.For these reasons, we anticipate the possibility to extend the use of the proposed NN architecture to different linear and nonlinear spectroscopic techniques which are affected by similar noise effects by means of fine tuning of the final NC layers and transfer learning techniques.

A. Simulated training datasets
The HN and LN datasets have been simulated using the nonlinear response perturbation theory.In this theoretical framework, the nonlinear signal is obtained from the n-th order nonlinear optical polarization P (n) , which consists of the convolution between matter correlation functions and the electromagnetic fields.The radiationmatter interaction is treated perturbatively and the density matrix is expanded in power of the fields, applying many-body Green function techniques in the Liouville space.Diagrammatic representations are exploited to isolate all the relevant terms in the expansion and calculate non-equilibrium expectation values of the correlation functions. 67,68Different sets of diagrams are associated to the SRS signal and to the TA baseline, 69 which can be simulated separately (see SI, section Ia).Shot noise is also included by means of fluctuations scaling as the square root of the number of detected photons and the associated uncertainty is propagated through the spectroscopic signal a .Consequently, all the three different spectral scales which are peculiar of the SRS data are present in the simulations: long-scale baseline variations, point-wise noise fluctuations and, in between of these two extrema, the Raman features.Within each dataset, half of the samples has been generated by simulated signals in the red side of the SRS spectrum with respect of the narrowband RP, half in the blue side.All the molecular and experimental parameters have been varied randomly through the datasets and sampled from a uniform distribution within the boundaries summarized in the SI, table S1.The main difference between the two datasets consists in the different number of averaged acquisitions per sample, N acquisition .Averaging impacts on the overall noise of the sample as a multiplicative factor inversely proportional to the squared root of the number of acquisitions.Each sample in a dataset resulted from the average of N acquisition noisy replica of the experiment.N acquisition was set to 1 and to 100 for the HN and LN datasets, respectively.To avoid biases due to the selection of a particular dynamical model or pulses synthesis methods, no explicit dynamics is considered and the pulses are modeled as Gaussian temporal profiles with random duration.A variety of possible lineshapes and spectral properties is sampled by the randomly selecting the experimental and molecular parameters.
a The electronic noise due to the detection process is normally negligible with respect to the shot noise when using single shot detection with a large dynamic range CCD camera, which is the common acquisition scheme for SRS.In different contexts, it can be included by means of a random additive noise.

B. Neural network architecture
The network takes as an input the SRS spectra sampled at n input points in frequency.The input is replicated to feed N kernel parallel branches, which are composed by N layers convolutional blocks each and return feature maps of different dimensions due to the different number of output channels.In particular, each block performs c i convolutions with a different, randomly initialized filter of size k i , followed by a Rectified Linear Unit (ReLU) activation function. 70Both the number of filters c i and the kernel size k i are branch-dependent and fixed along the branch.After the last convolution block, each branch returns a feature map of dimension b + {c i , (1, n input )} where b is the size of the training batches.The features from different branches are concatenated along the channel dimension after the last convolutional block of each branch and feed a single final branch of N conv convolutional blocks with ReLU activation.In the final branch the kernel size is fixed to one and the number of filters downscales by half each block.In such a way, we obtain a parametric linear combination of the features obtained from the different branches to be optimized by training.The last convolution has n input filters so that the output channel dimension matches the input to recover feature interpretability and precedes a final residual step in which the input is subtracted from the feature map.The residual layer allows the neural model to learn only the structure of the noise, reducing the number of required labeled samples and the overall complexity of the training phase.The final model architecture was selected after cross-validating over all the model hyperparameters by means of grid-search, including N layers , N conv , the kernel sizes, learning rate, batch sizes, number of training epochs and the relative weight between the gradient and reconstruction loss terms.The learning rate was also dynamically decreased during training accordingly to a scheduler.This also allowed for balancing the effects of the training with the full loss function in eq. 2, which occurs with a smaller learning rate.Batch normalization and regularization techniques were not adopted since overfitting was not observed and they decreased the performance of the neural net.The kernel sizes have been varied linearly across the branches, from a minimal size of 5 to a maximal size M k which was chosen by an hyperparameter sweep and then fixed to M k = 88.The number of filters and hence the output channels of each convolutional block was chosen in a kernel-size dependent way to obtain the same number of parameters N param in each convolutional branch.Also N param has been chosen by an hyperparameter sweep, in order to find a trade-off between the needs of computational feasibility and performance.Backpropagation with the Adam optimizer was used to train the model. 70

C. Algorithm implementation
The algorithms described in this work have been developed in Python (3.7) based on the open source libraries Keras and TensorFlow (2.10) for deep learning and the NVIDIA driver CUDA (10.1) for GPU acceleration.Training was performed on a Tesla V100-SXM2-32GB provided by the dual NVIDIA DGX-1 at SapienzaAI. 71raining and optimization have been tracked by using the Weights&Biases platform. 72All the samples in each dataset have been preprocessed by subtracting the mean and normalizing by the standard deviation calculated across the dataset and further rescaled by the maximum of the training set.The experimental data have been preprocessed in the same way and then rescaled by the same factor used for the training data.The optimal learning rates to be used during the training have been identified during the hyperparameter sweep and then fixed.The initial learning rate was set to 10 −3 and then to 5 • 10 −4 at the beginning of the second phase of the training, after N 0 epoch epochs, when the gradient term is included in the loss function.Additionally, the value of the learning rate has been slightly decreased every epoch by means of an exponential scheduler.The final model required 1h46m to be trained for the HN network and 35 minutes for LN network.Code with minimal example implementation, datasets and pretrained weights discussed here are available online. 73

D. Standard denoising and background subtraction methods
In this work, we compared four standard methods for baseline removal: • Savitzky-Golay (SavGov) • Asymmetric Least Square (ALS) We combined each method with a sequential application of a Savitzky-Golay filter 74 for point-wise noise subtraction.For each spectral bin of a sample in the dataset, the SavGov filter uses a least-squares procedure to fit the data in a window of length l centered at the input bin to a polynomial of order p and replaces the input value with the central point of the fitted polynomial function.We used the SciPy implementation 75 and fixed the order of the polynomial to p = 3 and the window length to l = 551 for the baseline removal routine and to l = 17 for the denoising filter.In the iSpline method, we used iterative spline interpolation to distinguish the baseline and the Raman features, thanks to their different spectral complexity.Specifically, initially the spline is performed on the full spectral range, and the regions that deviate from it up to a positive fixed threshold are defined as Ramanlike.Then, the interpolation is repeated for the spectral regions marked as baseline in the previous step with a lower threshold, until the number of iterations provided by the user is reached.We chose the thresholds to be proportional to the standard deviation of the retrieved baseline.The ALS algorithm attempts to find the baseline by a weighted least square fit in which values of the candidate baseline lower than the data at a given point are favored and in which the second derivative is penalized at the same time.We used a modified version implemented in, 61 which includes peak screening.Finally, the iModPoly baseline subtraction routine 60 consists in an iterative polynomial fitting with a peak-removal procedure and a correction to account for noisy fluctuations by means of the standard deviation of the residual.For each iteration, to classify a specific region of the data as either Raman feature or baseline, the input is compared to the sum of the residuals of the fit and their standard deviation.We used the implementation in 76 with a third order polynomial.

E. Evaluation metrics
The SSIM reported in the Results section measures the structural resemblance of two dataset within a given convolution window by means of the following quantity: with and µ i ,σ i and σ ij for i, j = x, y are the local mean, standard deviation and covariance of the two datasets within the convolutional window, C n = K n L for n = 1, . . ., 3, being L the dynamic range of the data, are regularizers to avoid singularities.Following, 77 we set C 1 = 0.01, C 2 = 0.03 and Mean SSIM is then calculated as the mean over all the scanned windows.In addiction to the signal-to-noise-ratio (SNR) defined in the main text, we have also analyzed Normalized Mean Absolute Error (NMAE) and Mean Squared Error (NMSE) which are less and more sensitive measures to outliers, respectively.Finally, using the custom edge finder, 73 we redefined the evaluation problem as a classification task and calculated the values of true positives (TP), false positives (FP) and false negatives (FN) predicted by the NN or the reference standard algorithm with respect to the GT.Here with edge we refer to both gains and losses in the Raman spectra.From these values, standard metrics as the precision and recall have been calculated: precision = T P T P + F P (8) recall = T P T P + F N Precision measures the number of correctly retrieved edges scaled by the total number of edges found, while recall scales the retrieved true positives to the number of all the actual positive values (TP+FN).The F1-score is calculated as the harmonic mean of precision and recall and is a trade-off between possible predicting biases of a model towards high false positives and false negatives: F. Spectroscopic setup and measurements The SRS setup have been described in details in previous works. 59The two beams needed for SRS are synthesized from a common regenerative amplified Ti:sapphire mode-locked laser, which generates 3 mJ, 40 fs pulses at 800 nm with 1 kHz of repetition rate.The femtosecond probe pulse is obtained by white light continuum generation in 2-mm-thick Sapphire crystal.A narrowband Raman pulse is synthesized and tuned by means of a two-stage optical parametric amplifier (OPA) followed by a spectral compression stage based on frequency doubling in a 25 mm beta barium borate crystal.An additional cleaning of the spectral frequency is performed by a double-pass (2f) spectral filter.The temporal ordering of the pulses is controlled by mechanical delay lines.
The Raman and probe pulses have linear and parallel polarization.After interaction with the sample the probe pulse is frequency dispersed by a spectrometer onto a CCD, able to perform single-shot acquisitions.A synchronized chopper blocks alternating Raman pulses in order to obtain the Raman gain using successive probe pulses.Cresyl Violet (CV) was dissolved in methanol and measured at ambient temperature.Details on the preparation of the Mb can be found in. 59nd where N counts are the photon counts revealed by the detector, OD is the optical density of the material which can be used to accounts for the linear absorption of the PP (here it is set to one) and S (3) is given by eq.S8.
b. Parameters selection for the high and low noise datasets The HN and LN datasets used to train the neural networks have been generated using the model presented in the previous section.The experimental and molecular parameters were randomly extracted by uniform distributions whose limiting values are reported in table S1.The dipole elements µ ij in eq.S9 have been grouped in common terms, leading to three effective transition factors that give the relative probability between TA and SRS transitions (µ T A ), between SE/GSB and ESA transitions (µ ESA ), and among different Raman peaks (µ vib ).The latter are indicated in the dipole matrix elements in the SoS expressions by the primed mute indexes.Additionally, we accounted for a different value for the number of averaged acquisitions per sample N acquisition in each dataset, i.e. each sample in a dataset resulted from the average of N acquisition noisy replica of the experiment.N acquisition was set to 1 and to 100 for the HN and LN datasets, respectively.

Parameter (units) Description
Value   S1 for a graphical description of the physical meaning of the parameters.

II. ANALYSIS OF THE PERFORMANCE OF HN AND LN NETWORKS ON THE SIMULATED DATASETS a. Additional analysis of the performances of the HN network
In Fig. S2-S3, we report the histograms of the classification and regression metrics for the HN network and the iSpline method, trained and tested on independent subsets of the HN dataset.
We now evaluate the dependence of the HN network on the number of parameters in the convolutional branches.N param and N kernel define the majority of the parameters of the network and thus largely determines its complexity FIG.S2.Histograms of the F1 score, precision and recall obtained on the test set by the HN network and the iSpline algorithm for baseline removal followed by a SG noise filter.FIG.S3.Histograms of the performances obtained for selected metrics on the test set by the HN network and the iSpline algorithm for baseline removal followed by a SG noise filter.and computational cost for training.Their values are larger for the HN network with respect to the LN.We found that similar results for the metrics can be obtained on the HN dataset using a network with N param = 10k and N kernel = 21, the same number of parameters used for the LN, by using ℓ = 1 norm for the gradient component of the loss function and re-training the network.This is shown in Fig. S4, where NN reduced is the new, lighter network with 10k parameters in the convolutional branches.This may be indicative of the fact that a further increase of the number of parameters, besides requiring more computational cost, does not necessarily guarantee an increase of the performances.However, a closer look at specific samples in the dataset demonstrates that NN reduced is less effective in recovering the details of the lineshape in situations with large noise (see for example the relative intensities of the peaks in panel 2 and the shoulder of the central peak in panel 4 in Fig. S5).The analysis is also indicative of the necessity to find additional metrics to make the evaluation of the performance more robust against subtle differences between GT and reconstructed lineshapes.The value of classification metrics based on the edge finder for a specific algorithm depends on the amount of tolerance on the peak position discrepancy between the GT and reconstructed data.Setting a very small tolerance (as in the case of the analysis reported in Fig. 3a) reports on the performance of the algorithm in recovering both the peaks positions and the number of peaks.A larger value of tolerance is less sensitive to the peak position and consequently allows to discriminate the cases in which the number of peaks is not correctly retrieved.These dependence affects similarly the different algorithms and does not affect the analysis of the performances reported in the main text.This is shown in Fig. S6, in which we report the percentage of perfect score on precision (precision=1) obtained by different algorithms as a function of the tolerance in pixels.For each value of the tolerance, the best performances are obtained by the NN and NN reduced .The best standard algorithm achieves a full score for about 20% fewer samples than the threshold set by neural networks.

c. Results on the LN dataset
In this section, we report the analysis of the performance of the LN network, trained and tested on independent subsets of the LN dataset.In Fig. S7, we report the reconstruction of clean lineshapes for typical samples in the LN test set and compared the results of the LN network with those of the iSpline algorithm followed by SavGov filtering.A comparison between the performance of the NN and the standard algorithms is reported in Fig. S8.Similarly to what observed for the HN dataset, the NN outperforms the standard methods for all the tested metrics.In Fig. S9-S10, we report the histograms of the classification and regression metrics for the LN network and the iSpline method.

d. Combining different networks for different noise level
The HN and LN network can be combined to tackle input data with an unknown level of noise by means of a discriminator.We first obtained a raw estimate of the SNR using the following procedure, which is applicable also to experimental data.We extracted n random spectra from the simulated dataset and select a narrow spectral region in which Raman peaks are not present.Then we subtracted a raw local baseline, which is well approximated by applying a moving average filter and calculate the standard deviation.This procedure gives an estimation of the SNR, whose error can be inferred by changing the extrema of the spectral region and the number of spectra and repeating the calculation.For the two simulated dataset, we obtained σ high = 1e −2 and σ low = 1e −3 .The choice of the NN architecture can be adapted to the noise of the experimental data using a discriminator which measures σ for the input and then adopt one of the two networks depending on the result: for σ greater than a threshold fixed at 0.5e −3 , the input is processed by the network LN, otherwise by the network HN.Alternatively, given the adaptability of the parallel kernel convolutional architecture on very different amount of noise, depending on the details of the experimental setups and data acquisitions, the NN can be retrained or selectively fine tuned on specific layers, including more noise levels and additional sources of noise.

FIG. 1 .
FIG. 1.(a) Experimental scheme of SRS spectroscopy.A femtosecond broadband probe is focused on the sample together with a narrowband picosecond Raman pulse.The probe is then spectrally dispersed and detected.The relative arrival time of the pulses can be varied by a delay line.A mechanical chopper allows for the differential detection of the signal with and without the Raman pulse.(b) Spectral envelope of the probe pulse detected after the sample in presence (black line) and in absence (black dashed line) of the Raman pulse (grey line).SRS features are obtained on the top of the probe spectrum when the two pulses interact with the sample.(c) Example of the ambiguities that can arise from baseline removal in SRS red side data measured on a fluorescent protein (wt-GFP).The application of different algorithms -polynomial fitting, Asymmetric Least Square (ALS), Penalized Least Squares (PLS), iterative morphological fitting-determine the baseline differently (bottom panel).This results in differences in the retrieved SRS spectra (top panel).

FIG. 4 .
FIG. 4. Comparison of the statistical analysis over selected metrics between the NN method and multiple non-data driven routines applied to the HN test set.Panel a shows the results of the classification metric obtained by means of the edge finder algorithm.The whisker plots report on the F1 score relative to the portion of 30% of the test set for which each method has the worst score.Panel b shows the whisker plots relative to the results obtained by the different methods measure by the SSIM, SNR, NMSE and NMAE metrics.For all the whisker plots, the box covers from first to the third quartile, while the whiskers extend from the box to the 5th and to the 95th percentile.The orange line and dot indicate the median and the mean, respectively.Black dots indicate values that are past the end of the whiskers.

FIG. 5 .
FIG. 5. Application to SRS experimental data.(a) SRS spectra of deoxy Mb.Left panels: red side Raman spectra obtained with RP tuned at 447 nm.Right Panels: blue side Raman spectra obtained with the RP tuned at 472 nm.Raw data are indicated by red lines, while green and blue lines represent processed spectra predicted by the LN network and by the iModPoly algorithm for baseline removal followed by a SG noise filter.Vertical dashed lines indicate the spectral position of the Raman modes of deoxy Mb for resonant excitation in the Soret absorption band 59 (reported in SI, tableS2).(b) SRS spectra of Cresyl Violet obtained upon resonant excitation with a RP tuned at 580 nm at two different delays between the RP and PP: −0.8 ps (PP preceding RP, left panels) and +1.2 ps (PP following RP, right panels).Raw data are indicated by red lines, while green and blue lines represent processed spectra predicted by the LN network and by the iModPoly algorithm for baseline removal followed by a SG noise filter.Vertical dashed lines indicate the spectral position of the Raman modes of the ground (in black)[63][64][65][66] and the excited states (in orange),[63][64][65] which are reported in SI, tableS3.
FIG. S4.Histograms of the performances obtained for selected metrics on the test set by the HN network and its reduced version.
FIG. S5.Comparison between of evaluation of the selected HN test samples by the NN (bottom panel, green) and NN reduced (top panel, red).The samples and the output of the NN network are the same of those reported in Fig.3of the main text.GT are also reported for comparison (black dotted lines).
FIG. S8.Comparison of the statistical analysis over selected metrics between the NN and multiple standard routines applied to the LN test set.Panel a shows the results of the classification metric obtained by means of the edge finder algorithm.The whisker plots report on the F1 score relative to the portion of 30% of the test set for which each method has the worst score.Panel b shows the whisker plots relative to the results obtained by the different methods measure by the SSIM, SNR, NMSE and NMAE metrics.For all the whisker plots, the box covers from first to the third quartile, while the whiskers extend from the box to the the 5th and to the 95th percentile.The orange line and dot indicate the median and the mean, respectively.Black dots indicate values that are past the end of the whiskers.

TABLE I
FIG. 2. Architecture of the neural network.Input data (blue box) feed multiple convolutional branches which operates in parallel (green boxes).Within each branch, data are transformed by convolution layers (Conv1, . . ., Convn) with the same kernel size and number of channels, connected by nonlinear activation functions (ReLU, black squares).Concatenated output from the convolutional branches feed a series of 1D convolutional layers.The output (orange box) is obtained after a last residual layer. .Hyperparameters for the HN and LN networks obtained by training with the HN and LN datasets and optimization with a grid search over the network architecture shown in Fig. 2.

TABLE S1 .
Experimental and molecular parameters used to generate the simulated datasets.Values in brackets refer to the lower and upper limits of the uniform distribution from which the actual parameters are extracted for each sample of the dataset.ωRP and tPP, for which single values are shown in the table, were kept fixed for all the samples.We refer to Fig.