BACTERIAL PATTERN IDENTIFICATION IN NEAR-INFRARED SPECTRUM

Identification of microorganisms, primarily bacterial identification and pathogen detection, is important in many areas of microbiology (diagnosis of infectious diseases, food protection). In this paper, bacterial strains were identified by near-infrared (NIR) spectroscopy (wavelengths from 900 nm to 2500 nm). Several classification techniques (CVA, ANN, ...) were examined. An accuracy of 100% was reached on a limited number of samples. Because removing water from the sample is a time-consuming step in sample preparation, the influence of water on the spectrum was examined. NIR spectroscopy seems to be a suitable method for rapid identification of bacteria. It can be used in a wide variety of applications in food protection, medical microbiology, detection of bioterrorism threats and environmental studies.


Introduction
Traditional methods of bacterial identification are based on morphological examination of cells or colonies, Gram staining, examination of growth characteristics and biochemical testing. These tests are time-consuming (in vitro cultivation) and subject to important sources of uncertainty. More recently, the advantages of molecular techniques have been exploited. Real-time polymerase chain reaction (PCR) is currently the most frequently used molecular technique.
This highly sensitive and specific method amplifies DNA and enables its sequencing. An alternative technique is infrared (IR) spectroscopy. The advantage of infrared (especially NIR) spectroscopy lies in its speed and low cost (no additional reagents or pretreatment are required). This type of spectroscopy is a widespread technique in analytical chemistry, and interest in NIR spectroscopy in microbiology has recently been growing.
The membrane structure of the bacterial cell and the ratio of lipids, proteins and polysaccharides (IR-active molecular bonds C-H, N-H, O-H) depend on the bacterial species. These differences can appear in the IR vibrational spectrum, and the corresponding overtone and combination bands should be detectable in the NIR spectrum. Unlike the IR spectrum, the NIR spectrum is difficult to interpret because of wide and overlapping combination and overtone peaks. To overcome this problem, chemometric statistical methods are used. The procedures applied for signal preprocessing and classification have an essential influence on the final identification efficiency.

Material and methods
In the first step, the influence of water was examined. Omitting the water-removal step could significantly accelerate the examination process. The main part of the experiment focused on the verification and validation of models.
Firstly, spectra of suspensions of three common food-borne pathogenic bacteria (Listeria ivanovii, Escherichia coli, Salmonella) were acquired. The bacteria were grown in tryptic soy broth (TSB) for 24 hours. Constant shaking accelerated bacterial growth. To eliminate the effect of the supernatant, the samples were centrifuged and the tryptic soy broth was replaced by distilled water. This procedure removes potential bacterial products and allows measurement of clean biomass. To ensure the same concentration level, the sample absorbance of wideband (420 nm-580 nm) light was maintained around 0.85 (measured in 0.1 ml by a Labsystems Bioscreen C). Near-infrared spectra of bacteria in aqueous suspension were acquired with a diffuse reflectance integrating sphere in the region 1100 nm-2500 nm using an FT-NIR Perkin Elmer Spectrum One NTS instrument (Shelton, CT, USA).
A transparent capsule with the test suspension was placed directly on the measurement window. This technique collects both transmitted and scattered light, which is useful for studying the chemical and physical properties of bacteria. All accessories were cleaned with 70% ethanol between measurements. The final spectrum consisted of 5090 points (4 cm⁻¹ spectral resolution) calculated from 128 scans.
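The reported point count can be checked against the scanned range. The sketch below converts the wavelength limits to wavenumbers. The 1 cm⁻¹ data interval is an assumption (the text states only the 4 cm⁻¹ resolution), chosen because it is consistent with the 5090 points reported; the function and variable names are illustrative.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nm to a wavenumber in cm^-1."""
    return 1e7 / wavelength_nm

high = nm_to_wavenumber(1100.0)  # short-wavelength limit, about 9090.9 cm^-1
low = nm_to_wavenumber(2500.0)   # long-wavelength limit, 4000.0 cm^-1
span = high - low                # scanned range in cm^-1

# ASSUMPTION: data interval of 1 cm^-1 (not stated in the text).
data_interval = 1.0
n_points = int(span / data_interval)
print(n_points)
```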
The exact optical path length was set by a mirror adapter, which defined the sample thickness as 1 mm. To ensure sufficient generality of the data, 15 spectra of each bacterial strain were acquired (see Fig. 1). These data were used for the development and validation of models.

Fig. 1. Spectra of bacteria cells in water
In the second experiment, the same bacterial strains were grown in TSB. Samples were centrifuged and resuspended in distilled water. The bacterial suspensions were then filtered through a glass fiber filter and dehydrated to remove the absorption effect of water. The filter was placed directly on the measuring window and covered by an aluminum mirror. On each filter, five measurements at different places were carried out. Altogether, 20 spectra of each bacterial strain were acquired (see Fig. 2).
Spectra processing in both experiments was carried out in MathWorks Matlab 8.1a. Derivatives were computed by the Savitzky-Golay algorithm in order to obtain smooth spectra without artifacts. To separate chemical absorption from light scattering, the Extended Multiplicative Scatter Correction (EMSC) algorithm [5] was applied. EMSC brings the NIR spectrum closer to Beer's law, which takes into consideration only chemical light absorption. Because bacterial identification is a classification problem, whereas Beer's law quantifies components, it must be kept in mind that the identification is based on differing amounts of chemical constituents in the bacterial strains. The EMSC model considers wavelength-dependent, multiplicative and additive effects (see eq. 1).
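Equation 1 itself is not reproduced in this text. A sketch of the usual EMSC form, written with the symbols the text defines, is given here. The measured spectrum z and the scalar coefficients a, b, h and d are assumed names, and the original equation may contain additional terms (for example a quadratic wavelength term).

```latex
z = a\,I + b\,m + h\,p + d\,\lambda + \varepsilon \tag{1}
```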
In eq. 1, I is a unity matrix, m is the mean spectrum (the mean of all measurements), p belongs to the first canonical component and λ is the vector of wavelengths. The symbol ε represents the residual caused by measurement uncertainties. The residual is minimized by the method of least squares.
Four classification methods were compared. Canonical Variate Analysis (CVA) [7] belongs to a set of classical methods related to linear discriminant analysis. CVA is a discrimination method that finds the vectors maximizing the ratio of between-groups to within-groups variation.
The governing equation for canonical correlation is relatively simple (eq. 2). The matrix R is formed from the inverse correlation matrices and the cross-correlations between the input (independent, spectral data) and the output (grouping variable): $R = R_{yy}^{-1} R_{yx} R_{xx}^{-1} R_{xy}$. The space of CVA scores computed in the second experiment is shown in Fig. 4, where the clusters of the different bacterial strains are clearly separated.
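The idea behind CVA can be illustrated with a minimal sketch that solves the between-groups versus within-groups eigenproblem. This is not the implementation used in the experiments: the data are synthetic, the function names `cva_fit` and `cva_scores` are hypothetical, and the small ridge term is an assumption added to keep the within-groups matrix invertible. Real spectra with thousands of variables would first need dimension reduction.

```python
import numpy as np

def cva_fit(X, y, n_components=2, ridge=1e-6):
    """Return the grand mean and canonical variate loading vectors."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))  # within-groups scatter
    S_b = np.zeros((n_features, n_features))  # between-groups scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - grand_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # ASSUMPTION: a small ridge keeps S_w invertible in this sketch.
    S_w += ridge * np.eye(n_features)
    # Eigenvectors of S_w^-1 S_b maximize between/within variation.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return grand_mean, eigvecs.real[:, order]

def cva_scores(X, grand_mean, W):
    """Project samples onto the canonical variates."""
    return (X - grand_mean) @ W

rng = np.random.default_rng(0)
# Synthetic stand-in for spectra: 3 classes, 20 samples each, 6 features.
centers = np.array([[0, 0, 0, 0, 0, 0],
                    [3, 0, 0, 0, 0, 0],
                    [0, 3, 0, 0, 0, 0]], dtype=float)
y = np.repeat(np.arange(3), 20)
X = centers[y] + rng.normal(scale=0.5, size=(60, 6))

grand_mean, W = cva_fit(X, y)
scores = cva_scores(X, grand_mean, W)
print(scores.shape)
```

With three classes there are at most two informative canonical variates, which is why a two-dimensional score plot is sufficient for this case.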
Another frequently used classification procedure is Soft Independent Modeling of Class Analogies (SIMCA) [7]. In this procedure, a separate model is created for each class by principal component analysis (PCA), considering only the data of that class. Each new unknown item is assessed against each of the groups according to its position relative to the PCA models. SIMCA was developed as an adaptation of principal component regression for classification purposes, just as PLS-DA was developed as an extension of PLS.
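A minimal sketch of the SIMCA idea follows, using synthetic data. It is a simplification: the item is assigned to the class whose PCA model leaves the smallest residual, whereas full SIMCA applies statistical acceptance limits and may accept an item into several classes or none. The names `fit_class_pca`, `residual_distance` and `simca_predict` are hypothetical.

```python
import numpy as np

def fit_class_pca(Xc, n_components):
    """PCA model of a single class: class mean and loading vectors."""
    mean = Xc.mean(axis=0)
    # Rows of Vt are the principal directions of the centered class data.
    _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
    return mean, Vt[:n_components]

def residual_distance(x, model):
    """Distance of x from the PCA subspace of one class."""
    mean, loadings = model
    centered = x - mean
    reconstructed = (centered @ loadings.T) @ loadings
    return np.linalg.norm(centered - reconstructed)

def simca_predict(x, models):
    """Assign x to the class whose model leaves the smallest residual."""
    return min(models, key=lambda label: residual_distance(x, models[label]))

rng = np.random.default_rng(1)

def make_class(offset, direction, n=20):
    """Synthetic class: points scattered along one direction in 5-D."""
    t = rng.normal(size=(n, 1))
    return offset + t * direction + rng.normal(scale=0.05, size=(n, 5))

train = {
    "A": make_class(np.array([0.0, 0, 0, 0, 0]), np.array([1.0, 0, 0, 0, 0])),
    "B": make_class(np.array([2.0, 2, 0, 0, 0]), np.array([0.0, 1, 0, 0, 0])),
}
models = {label: fit_class_pca(Xc, n_components=1) for label, Xc in train.items()}

label_near_b = simca_predict(np.array([2.0, 2.5, 0.0, 0.0, 0.0]), models)
label_near_a = simca_predict(np.array([1.5, 0.0, 0.0, 0.0, 0.0]), models)
print(label_near_a, label_near_b)
```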
Partial Least Squares Discriminant Analysis (PLS-DA) [7] is, mathematically, not a fully adequate method for classification, because PLS is defined for quantitative variables. However, PLS-DA has previously been shown to work well in practice. The last method, the Artificial Neural Network (ANN) [7], is loosely modeled on the brain's neural network for computational purposes.
An ANN consists of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. Connections are weighted, and the outputs of neurons (processing elements) are defined by an activation function. The ANN paradigm determines the parameters of the network (number of layers, number of neurons, activation function, learning function), which need to be set very carefully. To increase prediction accuracy and simplify the model, a spectrum resampling method can be applied. The spectra were divided into 509 bands, and 509 models were evaluated, each based on the spectra without one band. Bands whose elimination increases classification accuracy are removed, and the procedure is repeated until the highest accuracy is reached.
The method is similar to the well-known jackknife method. An alternative is to select variables according to analysis of variance (ANOVA): only variables with high variation between classes are considered.
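The band-elimination procedure can be sketched as a greedy backward search. The details below are assumptions made for illustration: a nearest-centroid classifier with leave-one-out accuracy stands in for the actual classifiers, the data are synthetic (6 bands instead of 509), and the names `loo_accuracy` and `eliminate_bands` are hypothetical.

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    classes = np.unique(y)
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        centroids = np.array([X[keep & (y == c)].mean(axis=0) for c in classes])
        nearest = np.argmin(np.linalg.norm(centroids - X[i], axis=1))
        hits += classes[nearest] == y[i]
    return hits / len(y)

def eliminate_bands(X, y, n_bands):
    """Greedy backward elimination of spectral bands."""
    bands = np.array_split(np.arange(X.shape[1]), n_bands)
    active = list(range(n_bands))

    def accuracy_of(band_ids):
        cols = np.concatenate([bands[b] for b in band_ids])
        return loo_accuracy(X[:, cols], y)

    best = accuracy_of(active)
    while len(active) > 1:
        # One candidate model per band: the spectra without that band.
        trials = [(accuracy_of([b for b in active if b != drop]), drop)
                  for drop in active]
        trial_acc, drop = max(trials)
        if trial_acc <= best:
            break  # no single elimination improves accuracy any further
        best = trial_acc
        active.remove(drop)
    return active, best

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 12)
# Synthetic data: 6 bands of 5 points; only band 0 carries class information.
X = rng.normal(scale=2.0, size=(36, 30))
X[:, :5] = 2.0 * y[:, None] + rng.normal(scale=0.3, size=(36, 5))

baseline = loo_accuracy(X, y)
active, best = eliminate_bands(X, y, n_bands=6)
print(baseline, best, active)
```

The search stops as soon as no single elimination improves accuracy, so it can end on a plateau rather than at the global optimum; that trade-off keeps the number of evaluated models manageable.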

Results and conclusions
In the spectrum of the aqueous suspension, peaks caused by water dominate (the O-H overtone band at 1450 nm and the wide O-H combination band at 1950 nm).
Since these peaks are significantly stronger than the signal of the bacterial cells, detecting that signal is a difficult task. Despite the application of advanced statistics, all models exhibited very low correct classification rates. Fig. 3 shows the pure spectra of bacterial cells. High variance between bacterial strains can be found in the band 2300-2500 nm. It could be related to lipids and proteins contained in the bacterial cells. The peaks around 1750 nm are related to the C-H first overtone and may be a further indicator of proteins and their different ratios.
However, these signals are very weak compared with the strong absorption of water. This implies that direct NIR measurement of biomass would need extremely concentrated samples to obtain a sufficient signal from bacterial cells without saturation of the spectra by water. Nevertheless, microbiological examination of aqueous specimens based on overall chemical changes is still possible. All classification models were evaluated on the same data set. The accuracy of the classifiers is expressed by the correct classification rate (CCR), computed by leave-one-label-out cross-validation. Spectra obtained from different measurements carry different labels; in total, 4 groups of measurements, each comprising 15 spectra (5 for each bacterial strain), were used.
In every cross-validation step, one group (one measurement, i.e. 15 spectra) is excluded from training and used for validation. All groups are selected in turn, and the final CCR is computed as the mean of the CCR values over all iterations. Classification in the CVA method was performed by the K-nearest neighbors algorithm (K=5). This technique is sensitive to the quality of the input spectra; with improperly chosen preprocessing settings, the algorithm exhibits lower accuracy. PLS-DA is more robust, but its overall accuracy (CCR) is lower. SIMCA shows better results if EMSC is used.
This may be because SIMCA relies on PCA, which takes into account variance in the spectra irrespective of classes. The strong adaptability of the ANN leads to a high correct classification rate even without preprocessing. To work correctly, an ANN needs a sufficient number of samples; otherwise it is prone to over-fitting. Therefore, only preliminary CCR results are reported for the ANN.
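The leave-one-label-out validation with a K-nearest neighbors classifier (K=5) described above can be sketched as follows. The data are synthetic and only mirror the stated layout (4 measurement groups, 3 strains, 5 spectra each); the feature construction and the names `knn_predict` and `leave_one_label_out_ccr` are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples (Euclidean)."""
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dist)[:k]]
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def leave_one_label_out_ccr(X, y, groups, k=5):
    """Mean correct classification rate over held-out measurement groups."""
    rates = []
    for g in np.unique(groups):
        held_out = groups == g
        pred = knn_predict(X[~held_out], y[~held_out], X[held_out], k=k)
        rates.append(np.mean(pred == y[held_out]))
    return float(np.mean(rates))

rng = np.random.default_rng(0)
# 4 measurement groups x 3 strains x 5 spectra = 60 samples.
groups = np.repeat(np.arange(4), 15)
y = np.tile(np.repeat(np.arange(3), 5), 4)
X = rng.normal(scale=0.3, size=(60, 4))
X[:, 0] += 2.0 * y       # hypothetical strain-dependent feature
X[:, 1] += 0.1 * groups  # hypothetical small offset between measurements

ccr = leave_one_label_out_ccr(X, y, groups)
print(ccr)
```

Holding out a whole measurement group, rather than single spectra, prevents spectra from the same measurement from appearing in both the training and validation sets.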
The resampling method has a positive effect. All models achieved a high score because the resampling procedure is able to correct the data and remove data that do not depend on the class. A similar effect was reached by optimization: the optimized model shows better results. Omitting further preprocessing methods leads to a decrease in performance.
Overall, the results indicate good performance of the models in classifying the tested microorganisms, although the accuracy depends on the right choice of classifier and preprocessing method.