A Reliable Method for Identification of Antibiotics by Terahertz Spectroscopy and SVM



Introduction
Antibiotics are a large class of antibacterial chemical substances that occur naturally or are semisynthetic or synthetic.
There is a great variety of antibiotics, which are divided into seven major classes, namely, tetracyclines, macrolides, aminoglycosides, peptide antibiotics, lincosamides, streptogramins, and β-lactam antibiotics [1]. Nearly every bacterium has a specific antibiotic against it. Antibiotics are mainly used to treat bacterial infections in humans and, in livestock, to promote growth. However, the problem of antibiotic residues has become increasingly severe due to excessive antibiotic use. Therefore, antibiotic detection and identification is of high importance [2]. In the past few decades, numerous efforts have been made to develop analytical methods for the qualitative or quantitative determination of antibiotics. Conventional methods for antibiotic detection mainly include high-performance liquid chromatography (HPLC) [3] and gas chromatography-mass spectrometry (GC-MS) [4]. Although these chromatography-based techniques are sensitive and reliable, they are usually time-consuming. Capillary electrophoresis (CE) [5], immunochemistry [6], and enzyme-linked immunosorbent assay (ELISA) [7] can achieve high-accuracy detection of antibiotics. However, these procedures usually involve complex sample preprocessing, which needs to be done by well-trained professionals. The high costs of surface plasmon resonance (SPR) sensors [8] and Raman spectroscopy [9] have restricted their widespread application. Given the above, it is necessary to establish a sensitive, fast, and reliable method for antibiotic detection [10].
The THz band covers frequencies from 0.1 to 10 THz, which lie between the infrared and microwave regions.
THz waves are transient, safe (a single photon at 1 THz carries only about 4.1 meV), and highly penetrating, and they have fingerprinting properties [11]. THz time-domain spectroscopy (THz-TDS) has already found extensive applications in biological tissue identification [12,13], food and drug detection [14,15], and explosive detection [16]. The vibration-rotation energy levels of macromolecules such as antibiotics are located within the THz band. Compared with other spectral detection methods, THz spectroscopy exhibits unique advantages: it can detect not only molecular rotation and lattice vibration but also the inner structure and organizational features of drugs. Limwikrant et al. [17] obtained the THz spectra of ofloxacin and its complex with oxalic acid. Zhang et al. [18] analyzed the molecular vibration modes of piracetam and 3-hydroxybenzoic acid. Zhang et al. [19] obtained the THz fingerprint spectra of metronidazole, tinidazole, and ornidazole. Xie et al. [20] showed through DFT calculation that tetracycline has distinct THz absorption features at certain frequencies. Many studies have demonstrated the feasibility of applying THz spectroscopy to antibiotic detection. Qin et al. [21-24] applied THz spectroscopy to the detection of tetracycline hydrochloride and achieved good results. Massaouti et al. [25], Wang et al. [26], and Long et al. [27] used a similar method for the quantitative detection of antibiotics in samples.
THz technology offers extensive applications in pesticide and antibiotic identification and in pesticide residue detection [28]. Although these studies show that THz spectroscopy is a feasible detection technique for antibiotics, most of them focus on quantitative detection, and the use of THz spectroscopy to identify antibiotics has rarely been reported. Yan et al. [29] applied three-layer BP neural networks to identify the absorption spectra of nine illicit drugs and six antibiotics, but the average identification rate was low. In this study, THz-TDS was applied to the detection of sixteen types of antibiotics, including penicillins, cephalosporins, macrolides, and tetracyclines. Then, the THz absorption spectra of these antibiotics were calculated. Dimensionality reduction was performed using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). Next, pattern recognition was performed using a support vector machine (SVM) whose parameters were optimized by grid search (GS-SVM), a genetic algorithm (GA-SVM), and particle swarm optimization (PSO-SVM), and the best identification model was found by comparison.
Thus, a novel method for fast and reliable antibiotic identification was established.

Terahertz Spectroscopy System.
A Z-3 time-domain spectrometer (Zomega, USA) was used for the experiments. The system was located within a closed hood during the measurement process to reduce the influence of water vapor. The ambient temperature was controlled at 23°C, and the humidity was below 2%. The THz-TDS parameters were set as follows: femtosecond laser wavelength 800 nm, repetition frequency 80 MHz, pump light power 100 mW, probe light power 20 mW, scan range 50 ps, useful spectral range 0.2-2 THz, and dynamic range above 70 dB. The experimental layout is shown in Figure 1.

Sample Preparation.
Sixteen types of antibiotics, including β-lactams, tetracyclines, macrolides, and cephalosporins, were used. The name, class, and main ingredient of each antibiotic are shown in Table 1. First, the drug samples were ground in an agate mortar to avoid scattering of the THz waves caused by particle heterogeneity and to increase the signal-to-noise ratio. Then, a certain amount of the sample was weighed and placed in the automatic tablet press, with the pressure set to 2 tons and a holding time of 1 min. A digital caliper (precision 0.02 mm) was used to measure the thickness of the sample tablets.
Thus, 40 samples were prepared for each of the 16 types of antibiotics and used to measure the THz absorption spectra. Finally, the 640 samples were randomly divided into a training set (16 × 30 samples) and a test set (16 × 10 samples).
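The random division into 30 training and 10 test tablets per antibiotic can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_split`, the integer class ids, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter

def stratified_split(labels, n_test_per_class, seed=0):
    """Return (train_idx, test_idx) with a fixed number of test samples per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)                       # random order within the class
        test_idx.extend(members[:n_test_per_class])
        train_idx.extend(members[n_test_per_class:])
    return train_idx, test_idx

# 16 antibiotic classes x 40 tablets each (class ids 0..15 are placeholders).
labels = [cls for cls in range(16) for _ in range(40)]
train_idx, test_idx = stratified_split(labels, n_test_per_class=10)
print(len(train_idx), len(test_idx))  # prints "480 160"
```

Splitting within each class keeps every antibiotic equally represented in both sets, which matches the 16 × 30 and 16 × 10 counts given above.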

Data Processing.
The THz time-domain spectra of the samples were obtained. The reflection peaks were removed by empirical mode decomposition [30]. Denoising was done by Savitzky-Golay filtering, followed by a Fourier transform to convert the time-domain information to the frequency domain.
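The denoising and transform steps can be illustrated with a short NumPy sketch. It is not the authors' pipeline: the 5-point quadratic Savitzky-Golay kernel, the 0.05 ps sampling step, and the synthetic 1 THz trace are assumptions chosen for the example (the filter window used in the study is not stated), and the empirical mode decomposition step is omitted.

```python
import numpy as np

# 5-point quadratic Savitzky-Golay smoothing kernel (symmetric, sums to 1).
SG5 = np.array([-3.0, 12.0, 17.0, 12.0, -3.0]) / 35.0

def smooth_sg5(signal):
    """Smooth with the 5-point kernel; edges are handled by repeating end values."""
    padded = np.pad(signal, 2, mode="edge")
    return np.convolve(padded, SG5, mode="valid")

def spectrum(signal, dt_ps):
    """Return (frequencies in THz, amplitude, unwrapped phase) of a real time trace."""
    spec = np.fft.rfft(signal)
    freqs_thz = np.fft.rfftfreq(len(signal), d=dt_ps)  # 1/ps equals THz
    return freqs_thz, np.abs(spec), np.unwrap(np.angle(spec))

# Synthetic stand-in for a measured trace: 50 ps window, 1 THz tone plus noise.
rng = np.random.default_rng(0)
dt_ps = 0.05
t = np.arange(1000) * dt_ps
trace = np.sin(2 * np.pi * 1.0 * t) + 0.05 * rng.standard_normal(t.size)

freqs, amp, phase = spectrum(smooth_sg5(trace), dt_ps)
peak_thz = freqs[np.argmax(amp[1:]) + 1]  # strongest bin, skipping DC
```

Applying the same routine to the sample and reference traces yields the amplitudes and phases from which the amplitude ratio and phase difference are formed.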
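The optical parameters were extracted using the model proposed by Dorney et al. [31] and Duvillaret et al. [32], and the absorption coefficient of the sample, α(ω), was calculated. The amplitude ratio and phase difference between the sample and reference signals are

$$\rho(\omega) = \frac{A_s(\omega)}{A_r(\omega)}, \qquad \varphi(\omega) = \varphi_s(\omega) - \varphi_r(\omega),$$

where ρ(ω) is the amplitude ratio; A_s is the signal amplitude of the sample; A_r is the signal amplitude of the reference; φ(ω) is the phase difference; φ_s is the phase of the sample; and φ_r is the phase of the reference. The index of refraction and absorption coefficient are calculated using the formulae below:

$$n(\omega) = \frac{\varphi(\omega)\,c}{\omega d} + 1,$$

$$\alpha(\omega) = \frac{2}{d}\,\ln\!\left[\frac{4\,n(\omega)}{\rho(\omega)\,\bigl(n(\omega)+1\bigr)^{2}}\right],$$

where n(ω) is the index of refraction; α(ω) is the absorption coefficient; c is the speed of light in a vacuum; ω is the angular frequency; and d is the sample thickness.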
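As a worked illustration of the parameter extraction, the sketch below assumes the commonly used thick-sample expressions n = φc/(ωd) + 1 and α = (2/d) ln{4n / [ρ(n + 1)²]}, with φ taken as a positive delay of the sample relative to the reference. The function name, the units, and the example values are assumptions for the example, not values from the study.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s

def optical_parameters(rho, phi, freq_thz, thickness_mm):
    """Return (n, alpha in cm^-1) from amplitude ratio rho and phase difference phi (rad)."""
    omega = 2.0 * math.pi * freq_thz * 1e12   # angular frequency, rad/s
    d = thickness_mm * 1e-3                   # sample thickness, m
    n = phi * C_LIGHT / (omega * d) + 1.0
    alpha_per_m = (2.0 / d) * math.log(4.0 * n / (rho * (n + 1.0) ** 2))
    return n, alpha_per_m / 100.0             # convert m^-1 to cm^-1

# Hypothetical single-frequency reading for a 1 mm tablet at 1 THz.
n, alpha = optical_parameters(rho=0.5, phi=6.0, freq_thz=1.0, thickness_mm=1.0)
print(f"n = {n:.3f}, alpha = {alpha:.1f} cm^-1")
```

Evaluating the function at every frequency bin in the useful spectral range gives the absorption spectrum used as input to the classifiers.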

Results and Discussion
Spectral Analysis.
THz-TDS was performed for the sixteen types of antibiotics shown in Table 1. The THz time-domain spectra thus obtained are shown in Figure 2(a), and on this basis, the frequency-domain spectra and absorption spectra were calculated. The spectra for frequencies from 0.2 to 1.5 THz are shown in Figures 2(b) and 2(c). The sixteen types of antibiotics could barely be differentiated by their time-domain and frequency-domain spectra. Some of the antibiotics shared the same absorption peaks, so the antibiotics could not be differentiated by spectral features alone. To solve this problem, we introduced chemometric pattern recognition and established identification models.

Visualization of the Sample Classification.
PCA reduces a large number of intercorrelated variables to a smaller set of uncorrelated composite variables. PCA usually consists of the following steps [33]. First, calculate the covariance matrix of the sample data, and then calculate the eigenvalues of the covariance matrix and the corresponding orthonormal eigenvectors. Sort the eigenvalues, and choose the largest eigenvalues and their eigenvectors. Finally, project the data onto the new space spanned by these eigenvectors. PCA retains most of the information in the original data and resolves the problems of information overlap and multicollinearity while reducing the dimensionality of the data.

Figure 1: Experimental layout of terahertz spectroscopy system.

t-SNE is a method that introduces a t-distribution to alleviate the crowding problem suffered by the original SNE algorithm [34]. The core principle of t-SNE is to model the similarity of the data points by a normalized Gaussian kernel in the high-dimensional space and by a t-distribution in the low-dimensional space. Under this model, similar points are more likely to be selected as neighbors, and dissimilar points are less likely to be selected.
This algorithm consists of the following steps [35]. First, represent the similarity between two data points by a conditional probability.
Then, represent the joint probability distribution of the low-dimensional data by a t-distribution with one degree of freedom.
Finally, obtain the optimal low-dimensional points by gradient descent that minimizes the KL divergence over all points. Thus, the samples in the low-dimensional subspace are obtained. The aim is for the probability distribution q_{j|i} of the data mapped into the low-dimensional space to closely approximate the probability distribution p_{j|i} in the high-dimensional space.

For the selected frequency band, each absorption spectrum had 143 dimensions. To reduce the training time and increase the accuracy of the identification models, dimensionality reduction was performed using PCA and t-SNE, followed by pattern recognition using different models. Then, the methods were compared to find the optimal dimensionality reduction method for antibiotic identification. PCA was applied to the absorption spectra of the 640 samples (16 × 40). Figure 3(a) shows the 3D distribution of the principal components of the absorption spectra for the different antibiotics. Three principal components (PC1, PC2, and PC3) were retained, and their contribution rates were 86.62%, 10.23%, and 1.13%, respectively. The sum of the contribution rates of the three principal components was 97.98%. Therefore, these three principal components could sufficiently represent the original absorption spectra. Figure 3(b) shows the 3D distribution of the different antibiotics visualized by t-SNE. The separation of the samples in Figure 3(b) is clearly far greater than that in Figure 3(a). The samples were well clustered, with few overlaps between different classes.
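The PCA steps listed above (covariance matrix, eigen-decomposition, sorting, projection) can be sketched in NumPy as follows. The function name `pca_project` and the synthetic 640 × 143 matrix are assumptions for the example; the measured spectra are not available, so the contribution rates printed here will not match those reported for the real data.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project rows of X onto the top principal components.

    Returns (scores, explained_variance_ratio).
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # features x features
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest ones
    components = eigvecs[:, order]
    scores = X_centered @ components
    ratio = eigvals[order] / eigvals.sum()            # contribution rate per component
    return scores, ratio

# Synthetic stand-in for the 640 x 143 absorption-spectrum matrix.
rng = np.random.default_rng(0)
latent = rng.standard_normal((640, 3)) * np.array([9.0, 3.0, 1.0])
mixing = rng.standard_normal((3, 143))
X = latent @ mixing + 0.01 * rng.standard_normal((640, 143))

scores, ratio = pca_project(X)
print(scores.shape, ratio.round(3))
```

For t-SNE, an existing implementation such as scikit-learn's `sklearn.manifold.TSNE(n_components=3)` could play the same role; it exposes only `fit_transform`, so the embedding is computed for all samples together.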

Identification Analysis.
After dimensionality reduction by either PCA or t-SNE, the new data matrix (640 samples × 3 dimensions) was used to replace the original spectral data matrix (640 samples × 143 dimensions). The 640 samples were randomly divided into a training set (16 × 30 = 480 samples) and a test set (16 × 10 = 160 samples). The parameters of the SVM model were trained using the training set. Then, SVM was combined with GS, GA, and PSO, respectively, to optimize the model parameters [36,37]. Finally, the prediction accuracy of the model was evaluated using the test set. The optimal combination of dimensionality reduction method and parameter optimization method was determined by comparison. Thus, the optimal identification model for the THz spectra of the antibiotics was established.
Here, the identification model was built based on an SVM, a supervised machine learning algorithm. An SVM finds the optimal decision hyperplane, that is, the one that maximizes the distance from the hyperplane to the nearest samples of the two classes on either side of it. In this way, good generalization is achieved for identification. The performance of an SVM mainly depends on the penalty factor c and the kernel parameter g of the model. To achieve the optimal identification result, the model must be trained with well-chosen parameters. To do this, parameters c and g were first optimized by grid search (GS). Then, GS was combined with the different dimensionality reduction methods to establish the No-GS-SVM (no dimensionality reduction), PCA-GS-SVM, and t-SNE-GS-SVM models. The optimal cross-validation accuracy (CVAccuracy) of each model was determined using 5-fold cross-validation, along with the prediction accuracy of the model on the training set and test set. The results are shown in Table 2. Figure 4 shows the results of parameter optimization by GS-SVM. Figure 4(a) shows the 3D results of parameter selection by No-GS-SVM, with a CVAccuracy of 99.5833%. Figure 4(b) shows the 3D results of parameter selection by PCA-GS-SVM, with a CVAccuracy of 99.7917%. Figure 4(c) shows the 3D results of parameter selection by t-SNE-GS-SVM, with a CVAccuracy of 100%. Clearly, the recognition accuracy was highest after dimensionality reduction with t-SNE.
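A grid search over c and g with 5-fold cross-validation can be sketched with scikit-learn, assuming an RBF kernel, where `C` plays the role of the penalty factor c and `gamma` the role of the kernel parameter g. The toolbox used in the study is not stated; the synthetic 16-cluster data, the grid range, and the random seeds below are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 640 x 3 reduced data: 16 well-separated clusters.
centers = np.array([[5.0 * i, 5.0 * j, 0.0] for i in range(4) for j in range(4)])
X, y = make_blobs(n_samples=640, centers=centers, cluster_std=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=160, stratify=y, random_state=0)

# Coarse log2 grid over the penalty factor (C) and the RBF kernel parameter (gamma).
param_grid = {
    "C": [2.0 ** k for k in range(-4, 5, 2)],
    "gamma": [2.0 ** k for k in range(-4, 5, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("CV accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))
```

The cross-validation score is computed on the training set only, so the held-out test set remains an independent check of the selected parameters.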
A genetic algorithm (GA) and particle swarm optimization (PSO) were introduced to find the optimal combination of parameters c and g and to further improve the prediction accuracy. For the GA, the initial population size was set to 20, and the number of iterations was 50.
The CVAccuracy, training set accuracy, and prediction set accuracy of No-GA-SVM, PCA-GA-SVM, and t-SNE-GA-SVM under 5-fold cross-validation are shown in Table 2. The fitness curves of the three models are presented in Figure 5. Figure 5(a) shows the fitness curve of No-GA-SVM, with a CVAccuracy of 99.7917%; Figure 5(b) shows the fitness curve of PCA-GA-SVM, with a CVAccuracy of 100%; Figure 5(c) shows the fitness curve of t-SNE-GA-SVM, also with a CVAccuracy of 100%. Unlike the GA, PSO does not include crossover and mutation operations; instead, the global optimum is searched for by tracking the current optimal value. This may explain why the accuracy of PSO was higher. For PSO, the initial population size was likewise set to 20, and the number of iterations was 50. The CVAccuracy, training set accuracy, and prediction set accuracy of No-PSO-SVM, PCA-PSO-SVM, and t-SNE-PSO-SVM under 5-fold cross-validation are shown in Table 2. The fitness curves of the three models are shown in Figure 6. Figure 6(c) shows the fitness curve of t-SNE-PSO-SVM, with a CVAccuracy of 100%. The optimal recognition accuracy was reached with PSO.
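The PSO search described above, with a population of 20 and 50 iterations, can be sketched in plain Python. This is a generic illustration under stated assumptions: the inertia and acceleration coefficients are common textbook values rather than the study's settings, and `toy_fitness` is a hypothetical smooth stand-in for the 5-fold CVAccuracy of an SVM evaluated at (log2 c, log2 g).

```python
import random

def pso_maximize(fitness, bounds, n_particles=20, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize fitness(position) inside box bounds; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    leader = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[leader][:], pbest_fit[leader]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

def toy_fitness(p):
    """Hypothetical stand-in for CV accuracy, peaking at log2 c = 3, log2 g = -2."""
    log2_c, log2_g = p
    return 1.0 - 0.01 * ((log2_c - 3.0) ** 2 + (log2_g + 2.0) ** 2)

best, best_fit = pso_maximize(toy_fitness, [(-8.0, 8.0), (-8.0, 8.0)])
print(best, best_fit)
```

In practice, `toy_fitness` would be replaced by a function that trains an SVM with c = 2^p[0] and g = 2^p[1] and returns its cross-validation accuracy.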
As shown in Table 2, under 5-fold cross-validation, the CVAccuracy of a given SVM model combined with t-SNE was at least as high as that of the same model combined with PCA or without dimensionality reduction. When the same dimensionality reduction method was used, PSO-SVM exhibited the best identification performance compared to GA-SVM and GS-SVM.
PCA and t-SNE were each combined with GS-SVM, GA-SVM, and PSO-SVM. The samples were randomly divided into a training set and a test set. Each model was run 100 times to calculate the average training accuracy, average prediction accuracy, and average time consumption.
The comparison results are shown in Table 3. For the same recognition model, both the average training accuracy and the average prediction accuracy were higher with dimensionality reduction than without it. For the THz spectra of antibiotics, t-SNE was consistently superior to PCA and to no dimensionality reduction. Additionally, the training time of the model was significantly shortened after dimensionality reduction. The time of a single training run after dimensionality reduction with PCA was shorter than that with t-SNE. This comprehensive comparison indicated that, of the 9 recognition models, t-SNE-PSO-SVM had the highest average prediction accuracy (99.91%). Therefore, t-SNE-PSO-SVM was better for the recognition of THz spectra of antibiotics and had higher practical application value.

Journal of Spectroscopy

Conclusions
The present study was mainly concerned with antibiotic identification based on THz-TDS. Antibiotics come in many forms, and their direct differentiation may be impossible. We found that the THz time-domain spectra and absorption spectra displayed only minor differences between different antibiotics, which made direct differentiation difficult. Therefore, chemometric pattern recognition was introduced to build recognition models for antibiotics. PCA and t-SNE were each used for feature extraction and dimensionality reduction. Then, these two methods were combined with GS-SVM, GA-SVM, and PSO-SVM to build the identification models.
The optimal model was chosen after parameter optimization and comparative analysis. The experiments showed that the training time of the identification model was significantly shortened after dimensionality reduction, and the recognition accuracy was higher with t-SNE than with PCA. The comprehensive comparison indicated that t-SNE-PSO-SVM had the highest average prediction accuracy among all models, namely 99.91%. Therefore, t-SNE-PSO-SVM was the most suitable model for antibiotic identification. Our study also confirmed that the combination of THz-TDS and chemometric pattern recognition has great potential for drug detection.

Data Availability
The data used to support the findings of this study have not been made available because all the experimental data in the paper were obtained from experiments of our own design and need to be kept confidential; we are still using them for further research.