Pipeline for the removal of hardware related artifacts and background noise for Raman spectroscopy

Raman spectroscopy is a real-time, non-contact, and non-destructive technique that provides information about the composition of materials, chemicals, and mixtures. It uses the energy transfer properties of molecules to detect the composition of matter. Raman spectroscopy is mainly used in the chemical field because background fluorescence and instrumental noise affect biological (in vitro and in vivo) measurements. In this method, we describe how hardware-related artifacts and fluorescence background can be corrected without affecting the signal of the measurement. First, we apply manual correction for cosmic ray spikes, followed by automated correction to reduce fluorescence and hardware-related artifacts based on partial 5th degree polynomial fitting and Tophat correction. Along with this manuscript we provide a MATLAB® script for the automated correction of Raman spectra.
• “Polynomial_Tophat_background_subtraction_methods.m” offers an automated method for the removal of hardware-related artifacts and fluorescence signals in Raman spectra.
• “Polynomial_Tophat_background_subtraction_methods.m” provides a modifiable MATLAB® file that can be adjusted for multipurpose spectroscopy analysis.
• We offer a standardized method for Raman spectra processing suitable for biological and chemical applications on modular confocal Raman spectroscopes.


Introduction
Raman spectroscopy is a vibrational spectroscopic technique based on an energy transfer between the incident light and the illuminated sample. In contrast with, e.g., infrared (IR) spectroscopy, which analyses the absorbed and transmitted fractions of the light, Raman spectroscopy makes use of scattered radiation. Although the predominant mode of scattering is elastic Rayleigh scattering, a small proportion of the photons (about 1 in 10⁹ or 10¹⁰) is scattered inelastically. These photons shift to a higher or lower energy, resulting in anti-Stokes and Stokes scattering, respectively [1].
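The energy shift of the inelastically scattered photons is usually expressed as a Raman shift in wavenumbers (cm⁻¹), i.e. the difference between the reciprocal excitation and scattered wavelengths. The following minimal Python sketch illustrates this standard conversion; the function names are hypothetical and the block is not part of the script provided with the manuscript.

```python
def raman_shift_cm1(excitation_nm: float, scattered_nm: float) -> float:
    """Raman shift in cm^-1: positive for Stokes, negative for anti-Stokes."""
    # 1 nm = 1e-7 cm, so 1e7 / wavelength_nm gives the wavenumber in cm^-1.
    return 1e7 / excitation_nm - 1e7 / scattered_nm


def scattered_wavelength_nm(excitation_nm: float, shift_cm1: float) -> float:
    """Wavelength (nm) of light scattered with the given Raman shift."""
    return 1e7 / (1e7 / excitation_nm - shift_cm1)


# Example: a Stokes shift of 1000 cm^-1 under 785 nm excitation
# corresponds to scattered light at a longer wavelength than 785 nm.
stokes_nm = scattered_wavelength_nm(785.0, 1000.0)
```

With 785 nm excitation, a Stokes shift of 1000 cm⁻¹ corresponds to scattered light at roughly 852 nm.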
Raman spectra provide both qualitative and quantitative molecular-level information. The basis of the qualitative information is the fingerprint nature of the Raman shift, which is unique to each material. Raman spectroscopy is also usable in an aqueous environment [2], which makes it an interesting and suitable technique for ophthalmic purposes. It is a non-contact and non-destructive technique with real-time visualization, which also makes it suitable for in vivo application.
Biological samples often emit fluorescence signals that may interfere with Raman signals, since fluorescence emission has a much higher yield than Raman scattering [3]. Furthermore, hardware-related artifacts (instrumental noise) are found in Raman spectra. To extract the Raman signal from the raw acquired spectrum, it is therefore necessary to pre-process the acquired spectra [4]. As recognized by Byrne et al., no standardized protocols are available for this purpose yet [5]. Hence, we developed a method to deal with background contributions from multiple sources. This paper guides you through the steps taken to optimize Raman spectra and make them ready for analysis, as done in the study by Bertens et al. [6]. For the full data set of this project we refer to the supplementary data of Zhang et al. [7].

Background of the data processing
As mentioned earlier, there is no gold standard for the processing of Raman data. Several approaches have been proposed to minimize the influence of background fluorescence [5]. Raman scattering is an instantaneous effect, whereas fluorescence requires time to occur. If the detector (or a filter) could be switched on and off at a high temporal resolution, the fluorescence signal could be prevented from interfering with the Raman signal. However, this is expensive, complicated, and not commercially available [8, 9]. Therefore, the most accepted method for fluorescence background subtraction is polynomial fitting, for which unfortunately no standardized protocols are available (Fig. 1-3) [5]. Zhao et al. introduced an automated polynomial background subtraction method for biomedical applications [10]. Zhang et al. also developed an automated method for fluorescence background subtraction, named "automatic preprocessing method for Raman imaging data set (APRI)" [11]. However, both methods encountered difficulties when handling spectra containing instrumental noise. In some in vivo experiments, the contribution from instrumental noise is inevitable and cannot be neglected, and it therefore affects the conventional polynomial methods. Hence, further treatments have been developed to eliminate the instrumental noise. Perez-Pueyo et al. introduced a morphology-based baseline removal method for Raman spectra [12]. It employs Tophat filtering, using basic operations such as dilation and erosion to filter the features beyond or below a pre-set threshold, thereby removing the instrumental noise (Fig. 1-2).
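To illustrate the idea behind Tophat filtering, the following minimal Python sketch implements a generic one-dimensional white top-hat: an erosion (moving minimum) followed by a dilation (moving maximum) gives the morphological opening, which is then subtracted from the signal. This is a textbook version under stated assumptions, not the exact algorithm of Perez-Pueyo et al. [12] nor the script provided with the manuscript. The function names and the `half_width` parameter are hypothetical, and the flat structuring element is assumed to be wider than the widest Raman peak.

```python
import numpy as np


def _moving_extreme(signal, half_width, reducer):
    """Apply reducer (np.min or np.max) over a centered sliding window."""
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(signal, half_width, mode="edge")
    width = 2 * half_width + 1
    return np.array([reducer(padded[i:i + width]) for i in range(len(signal))])


def white_tophat(signal, half_width):
    """Return (signal minus its morphological opening, the opening itself)."""
    signal = np.asarray(signal, dtype=float)
    eroded = _moving_extreme(signal, half_width, np.min)   # erosion
    opened = _moving_extreme(eroded, half_width, np.max)   # dilation of the erosion
    return signal - opened, opened
```

Features narrower than the window survive the subtraction, whereas slowly varying contributions are captured by the opening and removed. Values near the ends of the spectrum can be affected by the edge padding.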
A third factor affecting Raman spectra is cosmic radiation. Cosmic rays create randomly occurring spikes in the spectrum (Fig. 1-1). They affect different wavenumbers each time they occur, and can easily be detected by comparing different frames of one measurement. Spikes created by cosmic rays need to be removed before the frames are averaged; otherwise they can be interpreted as peaks [4, 11].

Set-up of the Raman system
A modular confocal Raman spectroscopic system was used in the study. The Raman system was connected via a power conditioner to prevent power peaks from disturbing the measurements and to protect the system. The Raman system is equipped with a 785 nm diode-emitting laser with a continuous power of 26 mW, and a 671 nm diode-emitting laser with a continuous power of 14 mW. Raman spectra were recorded with a high-performance Raman module model 2500 with a charge-coupled device (CCD) operating at −60 °C. This module introduces the laser light through a diamond optical fiber, and shapes and conditions the beam through a pinhole to the measurement stage (Fig. 2). The light emitted from the spectrometer is collimated using a converging lens (f80, see Fig. 2-f). Collimation of the light was checked using the Melles Griot shear-plate. The lens was moved along the laser optic axis towards or away from the exit aperture of the spectrometer until the stripes on the shear-plate indicated a collimated beam (Fig. 3).
Three types of sample set-ups were used:
• Cuvette set-up (Fig. 4a): in front of the sample, an f80 lens was used when the sample was measured in a Brand® cuvette.
• Jena lens set-up (Fig. 4b): in front of the sample, a long-working-distance microscope objective lens (Jena lens) was used.
• Gonio lens set-up (Fig. 4c): in front of the sample, first an f60 lens was placed, followed by a Gonio lens. The Gonio lens was connected to the cornea of an eye (in vivo or ex vivo) using topically applied Methocel® 2%.

Calibration
Once the laser of the Raman system was collimated, the lens used for the measurement was set in place and the system was calibrated using the built-in calibration procedure of the spectrometer. Thereafter, the system was further calibrated against the reference spectrum of the National Institute of Standards and Technology (NIST)-standard calibration glass provided with the spectrometer. The full calibration was done according to the spectrometer manual. All measurements were performed in the dark.

Data acquisition
When the laser was correctly positioned, the fingerprint signal of the material was measured with the 785 nm laser and exported as a '.txt' file for further processing. An example of a measurement is provided in Fig. 6.
Fig. 6. Raman spectra providing the fingerprint signal (left column) and the high-wavenumber signal (right column) of PBS and three different drugs (ketorolac (Acular®), bromfenac (Yellox®), and diclofenac (Naclof®)) in ophthalmic solution, with the corresponding molecular structures.
[...] zones (1, 2, and 3), after which a line was fitted through those zones based on a 5th order polynomial function. The predicted line was subtracted from the graph.

Removal of cosmic ray spikes
All Raman spectra were loaded into OriginPro 9.0.0 (64 bit ed., OriginLab Corp., Northampton, US) and checked manually, one by one, for cosmic ray spikes. The values at wavenumbers affected by cosmic ray spikes were replaced by the values at the same wavenumbers from another frame. The files were then saved and loaded into MATLAB® (Version 2017b, The Mathworks Inc., Natick, MA, US) for further processing.
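In this work the spikes were identified and replaced manually. For illustration only, the same idea (compare the frames of one measurement and replace affected values with those of another frame) can be sketched in Python as follows. The automated detection rule, the function name `replace_spikes`, and the `threshold` value are assumptions that are not part of the described workflow.

```python
import numpy as np


def replace_spikes(frames, threshold=8.0):
    """Flag and replace cosmic ray spikes in frames of shape (n_frames, n_points).

    A value is flagged when it exceeds the per-wavenumber median across frames
    by more than `threshold` times a robust noise estimate (assumed rule).
    Flagged values are replaced by the value of another, unflagged frame.
    """
    frames = np.asarray(frames, dtype=float)
    residual = frames - np.median(frames, axis=0)
    # Median absolute deviation scaled to approximate a standard deviation.
    # If the frames are identical apart from spikes, this is 0 and every
    # positive deviation from the median is flagged.
    noise = 1.4826 * np.median(np.abs(residual))
    spikes = residual > threshold * noise

    cleaned = frames.copy()
    for frame_idx, point_idx in zip(*np.nonzero(spikes)):
        donors = np.nonzero(~spikes[:, point_idx])[0]
        if donors.size:  # leave the value untouched if no clean frame exists
            cleaned[frame_idx, point_idx] = frames[donors[0], point_idx]
    return cleaned, spikes
```

The returned boolean mask can be inspected to verify each replacement, which keeps the step close to the manual check described above. At least three frames per measurement are assumed, so that the median is not dominated by a single spike.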

Averaging of the frames and removal of background and instrumental noise
The following process is programmed in the MATLAB® file ("Polynomial_Tophat_background_subtraction_methods.m") provided with the manuscript. First, frames were averaged to reduce fluctuations. Because the baseline has a strong influence on the polynomial approximation, the polynomial degree must be selected according to the shape of the baseline. In our system, using eyes, a 5th degree polynomial fitting resulted in the best background correction (Fig. S1). Therefore, we applied partial 5th degree polynomial fitting to remove the fluorescence background, combined with the morphology approach of Perez-Pueyo et al. [12] to remove instrumental noise. First, all spectra were divided into different zones: 350 cm⁻¹ to 450 cm⁻¹, 450 cm⁻¹ to 750 cm⁻¹, 750 cm⁻¹ to 1250 cm⁻¹, 1250 cm⁻¹ to 1650 cm⁻¹, and 1650 cm⁻¹ to 1800 cm⁻¹. Zones that only contain fluorescence (400 cm⁻¹ to 450 cm⁻¹, 800 cm⁻¹ to 1200 cm⁻¹, and 1600 cm⁻¹ to 1800 cm⁻¹) (Fig. 7, zones 1, 2, and 3) were used to calculate the polynomial function coefficients. The zone containing the water peak (1550 cm⁻¹ to 1650 cm⁻¹) was excluded from the polynomial function fitting calculation. The resulting 5th degree polynomial function was applied to the full spectrum (400 cm⁻¹ to 1700 cm⁻¹) to remove the fluorescence background (Fig. 7). Thereafter, the morphology-based Tophat method of Perez-Pueyo et al. [12] was applied to eliminate instrumental noise.
Examples of processed Raman signals are shown in Fig. 8. Fig. 9 shows the effect of data processing using the MATLAB® program on a sample without (Fig. 9a) and with (Fig. 9b) instrumental noise. In both cases, a flat baseline is observed, and in Fig. 9b instrumental noise is reduced without affecting the peaks. A full overview of the corrected data can be found in Bertens et al. [6], and the full data set is available as supplementary material to the manuscript of Zhang et al. [7].
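The averaging and partial polynomial fitting steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB® script provided with the manuscript: the function name and parameters are hypothetical, the default zone limits are the fluorescence-only zones quoted in the text, and the water peak region is not treated separately here (pass adjusted `zones` to exclude it).

```python
import numpy as np
from numpy.polynomial import Polynomial

# Fluorescence-only zones in cm^-1, as quoted in the text (zones 1, 2, and 3).
FLUORESCENCE_ZONES = ((400.0, 450.0), (800.0, 1200.0), (1600.0, 1800.0))


def subtract_polynomial_background(wavenumbers, frames,
                                   zones=FLUORESCENCE_ZONES, degree=5):
    """Average the frames, fit a polynomial to the zones, subtract it everywhere.

    wavenumbers: array of shape (n_points,)
    frames:      array of shape (n_frames, n_points), spikes already removed
    Returns (corrected spectrum, fitted baseline).
    """
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    spectrum = np.mean(np.asarray(frames, dtype=float), axis=0)

    # Select only the points that lie inside one of the fitting zones.
    in_zone = np.zeros(wavenumbers.shape, dtype=bool)
    for low, high in zones:
        in_zone |= (wavenumbers >= low) & (wavenumbers <= high)

    # Polynomial.fit rescales the x-axis internally, which keeps a 5th degree
    # fit on wavenumbers of order 1000 numerically well conditioned.
    fitted = Polynomial.fit(wavenumbers[in_zone], spectrum[in_zone], degree,
                            domain=[wavenumbers.min(), wavenumbers.max()])
    baseline = fitted(wavenumbers)
    return spectrum - baseline, baseline
```

Because only the selected zones enter the fit, Raman peaks outside these zones do not pull the baseline upwards. A Tophat step would then be applied to the corrected spectrum to address the remaining instrumental noise.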