Malignant and benign thyroid nodule differentiation through the analysis of blood plasma with terahertz spectroscopy

: The liquid and lyophilized blood plasma of patients with benign or malignant thyroid nodules and healthy individuals were studied by terahertz (THz) time-domain spectroscopy and machine learning. The blood plasma samples from malignant nodule patients were shown to have higher absorption. The glucose concentration and miRNA-146b level were correlated with the sample’s absorption at 1 THz. A two-stage ensemble algorithm was proposed for the THz spectra analysis. The first stage was based on the Support Vector Machine with a linear kernel to separate healthy and thyroid nodule participants. The second stage included additional data preprocessing by Ornstein-Uhlenbeck kernel Principal Component Analysis to separate benign and malignant thyroid nodule participants. Thus, the distinction of malignant and benign thyroid nodule patients through their lyophilized blood plasma analysis by terahertz time-domain spectroscopy and machine learning was demonstrated.


Introduction
Thyroid nodules are the most widespread among the pathologies of the endocrine system. About 5-15% of them are malignant and require surgical intervention, the decision for which is to be taken based on a preoperative diagnosis [1,2]. For this purpose, various imaging techniques are used, including computed [3], magnetic resonance [4,5], and positron emission tomographies [6]. However, ultrasonography [7] and thyroid nodules biopsy under ultrasonography control with fine-needle aspiration cytology (FNAC) [8,9] are the most widely used. The latter requires highly skilled executing personnel and has a risk of errors [10]. Even after adequate sampling, the conclusion turns out to be indefinite in 15-30% of cases because cytological features are insufficient to distinguish between benign and malignant thyroid nodules. In these cases, it is necessary to carry out lobectomy to establish the concluding diagnosis [11].
Diagnostics accuracy increases when using immunocytochemistry and thyroid cancer's molecular biomarker evaluation [12][13][14]. The latter includes the detection of somatic mutations and translocations, determination of expression pattern of protein-encoding genes, and specific micro-RNA (miRNA) [15], as well as oncometabolites [16]. MiRNA is a class of small endogenous non-coding RNA functioning as the negative regulators of gene expression. These miRNAs may contribute to major cell processes in carcinogenesis, namely the growth, differentiation, apoptosis, and adhesion [17]. The miRNA tissue profiles help differentiate benign and malignant thyroid nodules; however, it uses an invasive biopsy procedure [18,19]. Quite the contrary, analysis of miRNA [20] and metabolites in body fluids (blood serum or plasma, urine) possess potential in early cancer detection [21].
Despite a broad list of existing methods, thyroid cancer early detection is not still solved. Optical spectroscopy allows one to measure molecular biomarkers in various samples of biological origin. For example, Raman or Fourier-transform infrared spectroscopy combined with multidimensional analysis distinguished pathologically changed and normal thyroid nodules [22][23][24].
Terahertz time-domain spectroscopy (THz-TDS) is attractive for designing new noninvasive or minimally invasive diagnostic tools [25][26][27]. The THz-TDS gives a possibility of direct refractive index measurement, in addition to the absorption coefficient. Herewith, the sample's dielectric function can be restored [28,29], providing a more informative analysis [30][31][32]. The THz-TDS proved to be a sensitive method of determining structural changes in tumor tissues [33][34][35], particularly in vivo [36]. The high sensitivity of THz spectroscopy to blood content was established [37][38][39][40][41][42][43][44][45]. In particular, the diabetes patients' blood absorption in the THz range depends linearly on the glucose level [41,43]. THz spectroscopy distinguished the blood plasma of healthy and diabetic rats [38,44]. Blood plasma THz absorption was decreased for mice with grafted Ehrlich's carcinoma compared to healthy ones [46]. The absorption in the 0.05−1.0 THz range of rat's blood serum was changed during the hepatic cancer development, correlated with blood total protein concentration [47]. It confirms that pathologies, including cancer, cause essential changing blood optical characteristics in the THz range.
This work investigates the abilities of THz-TDS and machine learning in differentiating malignant and benign thyroid nodules by analyzing blood plasma. The liquid and lyophilized blood plasma were studied. The spectral data analysis was conducted by the Kernel Principal Component Analysis (PCA), multidimensional scaling, and Uniform Manifold Approximation and Projection (UMAP) methods. The prognostic models were developed by the linear kernel Support Vector Machine (SVM). The spectral data were also compared with a biochemical composition of blood and the tissue's miRNA profile.

Sample preparation
The study was conducted following the Russian Federation's legislation; each patient signed informed consent; the clinical data were depersonalized. The Ethics Committee of Municipal Clinical Hospital No.1 (Novosibirsk, Russia) approved the study protocol.
Healthy volunteers (n = 6) and patients with thyroid nodules (n = 10), provided by the Municipal Clinical Hospital No. 1 of Novosibirsk, were recruited. Peripheral blood samples were collected in vacutainers with EDTA (Sarstedt AG & Co. KG, Germany). The plasma was separated by centrifugation at 2800 g for 15 min at +4°C. Blood plasma samples were frozen and stored at -80°C. Standard biochemical blood tests were conducted at the Municipal Clinical Hospital No.1 (Novosibirsk).
For all patients with thyroid nodules, FNAC was performed, and molecular markers in the biopsy material were determined. Histological examination was carried out for 5 samples. We ranked the importance of these methods (from more reliable to least) as follows: histological, molecular, and cytological analysis. A minimum set of markers was extracted (levels of HMGA2 mRNA and miR-375, -221, and -146b combined with the mitochondrial-to-nuclear DNA ratio) and yielded highly accurate discrimination between benign and malignant thyroid nodules [19]. The final clinical diagnoses were made by combining FNAC with histological examination, analysis of molecular markers, and clinical data. Based on this, patients with thyroid nodules were divided into two groups: group 1, with benign thyroid nodules (1b -5b), and group 2, with malignant thyroid nodules (1m -5m) ( Table 1). The detailed patients' description is shown in Table S1. The blood plasma was lyophilized by freeze-drying VaCo 2 (ZirBus, Germany) at -50°C and a pressure of 3 Pa. Pellets with a diameter of 5 mm had been made from lyophilized plasma samples using laboratory press «Specac, GS15011 (Great Britain)» at the pressure of 1 ton. Two variants of the pellets of different thickness («thick» and «thin») from each sample were made. The pellets' weight and thickness were measured by analytic scales «A&D, GR-200 (Japan)» and micrometer «MK-25 0.01(Russia)». The thick pellets' weight and thickness were in the range of 21.1-26.2 mg and 0.75-1.11 mm, respectively, and of the thin pellets were 9.3-13.2 mg and 0.385-0.565 mm, respectively. The density of pellets was in the range of 1.128-1.562 mg/mm 3 .

Experiment setup
The measurements were carried out with a THz-TDS spectrometer based on a femtosecond Spectra Physics Tsunami laser system (λ = 800 nm, τ = 80 fs, 5 nJ, rep. rate is 80 MHz) [48]. A commercial multi-dipole photoconductive antenna was used as the emitter of THz radiation. The radiation power of the latter was within 70 ÷ 75 mW, and the bias voltage was 15 V. The detector was a 4 mm ZnTe crystal. The repetition period of the THz pulses was 12 ns; the THz pulse energy was 10 −13 J. The latter is below a tissue damage threshold [49]. A detailed description of the spectrometer was presented in our previous works [44,47,48].
Measurements were carried out at a temperature of (21 ± 1)°C. A temporal averaging over 1024 sample scans with a 25 ps total duration was used. The registered spectral range was from 0.2 to 2 THz for lyophilized samples and from 0.2 to 1.6 THz for liquid plasma samples with a spectral resolution of 40 GHz. One measurement takes about 3 minutes. For increasing reliability, each serum sample was analyzed in triplicates, and then all spectral scans were averaged. The algorithm of the measurements is shown in Fig. 1. The pellets from lyophilized blood plasma were placed on a holder and measured in a transmission mode. Liquid plasma analysis was carried out using two identical cells of 120 µl volume (A145, Bruker Optics) with a spacer of 0.12 mm thickness [50]. One cell was filled with distilled water, the second one -with blood plasma. The plasma was thawed at 4°C and centrifuged at 1000 g for 10 minutes to remove any precipitate. Transmission of the cell with water was used as a reference signal, making it possible to eliminate water's dominant contribution in blood plasma [47]. The cell transmission measurements had been repeated several times to reduce the influence of laser radiation power temporal drift.

Spectral data extraction
The blood plasma transmittance T w was normalized to air for dry plasma and to water for liquid plasma: where E sample (f ) is the spectrum of the THz wave transmitted through a dry or liquid sample, E reference (f ) is the transmission spectrum of the reference THz wave transmitted through an empty frame for pellets or a cell with water for liquids. The use of water to measure a reference signal makes it possible to consider the reflection losses and re-reflections on the cell walls. A plasma sample absorption coefficient and the refractive index were calculated using the simplified formulas [47]: where n average = c·∆t d + 1 is the averaged refractive index, R = n average −1 n average +1 is the reflection loss, ∆t is the shift of the pulse transmitted through the sample relative to the reference signal, d is the sample thickness, f is the frequency, c is the speed of light.
We do not consider multiple reflections in thin pellets since this effect is negligible due to the strong absorption of the pellet material. The absorption coefficient for the field amplitude and not for the power was used, which is twice as large and generally accepted. Instrumental errors associated with the accuracy of the cell thickness measurement and the baseline drift ultimately lead to a total error of up to 5%. The error deteriorates up to 15% for liquid plasma at the frequency spectrum edges (0.05-0.07 and 1.2-1.6 THz) and lyophilized plasma less than 10% for the total spectral range.
The approximate formulas (2), (3) cause an error in absorption calculation of less than 1% for frequencies above 0.05 THz and for liquid plasma layer of at least 0.12 mm and lyophilized plasma pellets thickness from 0.385 to 1.11 mm. The variation in pellet thickness causes a total error of optical characteristics below 15%, with the method described above.

Spectral data analysis methods
The following dimensionality reduction methods were used: Kernel PCA, multidimensional scaling, and UMAP methods [51][52][53]. We also applied a Composite Multiscale Entropy (CMSE) method for the THz data cluster analysis [54]. These methods were chosen due to the difference in respective basic ideas and the application fields' versatility. Kernel PCA with linear, radial based function kernel and original kernel, derived from the Green's function for the Ornstein-Uhlenbeck (OU) process [55] was used. The latter has the following form: where parameters γ, λ define the shape of the kernel. We used the following values: γ = 0.001, λ = 0.01, and the pairwise Euclidean distance as a norm ∥ . . . ∥.
The UMAP method was applied with the number of neighbors 4, minimal distance 0.7. Multidimensional scaling was used with the default set of parameters. CMSE has the embedded dimension 2, threshold value 0.15, and scale numbers were assigned to 10 because we have lower spectral resolution than in the reference work. All these computations were conducted in Python 3.8.2 and Scikit-learn package 0.23.2.
The amount of the data is not large, and the classification could suffer overfitting. Here, we used Supervised Learning methods to estimate the linear separability of the data, for example, by using linear kernel SVM [56]. To validate estimation of the data separability, we used a standard approach for Supervised Learning: initial data were randomly divided into 3 splits with the same distribution of the data among classes as in the original dataset. Linear SVM was trained on each split with class weight balancing. Obtained quality metrics such as sensitivity, specificity, accuracy, and precision were averaged. Also, Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC) analyses were performed for each data split and the averaged ones.
A two-stage ensemble scheme was proposed for the data analysis (see Fig. 2). The first stage aims to separate healthy and thyroid nodules, enabling to estimate of the potential of sensitivity. The second stage was aimed to separate benign and malignant thyroid nodules, allowing evaluating the potential of specificity. By applying them consequently, the possibility of constructing a good prediction model arises.

Statistical data analysis
The OriginLab Corp. statistical software package was used to process the data. Mann-Whitney test was used for paired comparisons. The Pearson correlation coefficient was used to estimate linear relations between biochemical parameters and THz optical characteristics. The null hypothesis feasibility was taken at 5% of significance.

Liquid blood plasma
Liquid blood plasma samples were analyzed without additional processing. Their absorption coefficient and refractive index spectra are shown in Fig. 3.   Fig. 3. Average absorption coefficient a) and refractive index b) spectra of water (1, blue), healthy individuals (2, green), and patients with thyroid nodules (3, red) with error bars. Averaging was carried out over 30 measurements for blood plasma of patients with thyroid nodules, over 15 measurements for both blood plasma healthy individuals and water.
The absorption coefficient and refractive index values of blood plasma from patients with thyroid nodules are slightly higher than those from healthy individuals, but there are no statistically significant differences.
The shapes of blood plasma spectra and the water spectrum are the same; however, the amplitude of blood plasma absorption and refractive index spectra is much less than that of water. We observed a similar phenomenon in human and animal blood plasma [38,44,47,57]. The coincidence of blood plasma and water spectra shape is explained by the fact that water makes up more than 90% of blood plasma. At the same time, such blood plasma components as glucose, proteins, and various salts have a significant effect on the structure of water (it mainly bind water molecules to a state with a lower THz absorption) and change its THz response [29,48,50,[58][59][60]. Despite the high absorption of THz radiation by water, the THz TDS is sensitive to small variations of the liquid blood plasma component concentrations in the low-frequency range [40,41,43,47].
It is convenient using spectra subtraction to visualize the differences between the optical characteristics of the blood plasma: n dif = n plasma − n water (6) where α dif , n dif -difference of absorption and refraction values spectra of liquid plasma, α plasma , n plasma − plasma spectrum values for the case when the reference signal E reference (f ) is air, α water , n water − spectral characteristics of water. The difference of absorption coefficient and refractive index values spectra of blood plasma of healthy individuals (green squares) and patients with thyroid nodules (red squares) are shown in Fig. 4. Spectral differences of liquids are small compared to the total error. Figure 5 shows the difference in absorption coefficient and refractive index values spectra for blood plasma from group 1 (purple squares) and 2 (orange squares).  5. Averaged difference spectra of absorption coefficient and refractive index for the liquid plasma of two groups: malignant thyroid nodules (purple squares) and benign thyroid nodules (orange squares) normalized to water. Averaging was carried out over 15 measurements (each measurement with an independent reference signal obtained by passing through a cell with water). Group 1 is characterized by significantly higher blood glucose levels than group 2 (5.66 ± 0.17 mM versus 4.84 ± 0.16 mM, p <0.05) and 1.2 times higher protein levels. The absorption coefficient and refractive index, normalized to water, are lower for group 1 than group 2. The differences are most significant at 0.2 THz. Earlier, the most reliable transmission spectral range of 0.07 to 0.5 THz for blood serum was found for rat experimental liver cancer [47].

Lyophilized blood plasma
Since the water contribution to the THz blood plasma spectrum dominates over the other components, lyophilization was used to increase THz spectroscopy's informativity. Figure 6 (a,  b) shows the THz absorption coefficient of lyophilized blood plasma pellets. Using pellets of various thicknesses improves the optical parameters evaluation accuracy and expands the data's spectral range. Thus, when thickness increases, the transmitted pulse is more attenuated, and the error in thickness determining affects less. But at high frequencies, the spectral intensity is attenuated below the noise level. The advantage of thin pellets is more transparency at the high-frequencies of the THz spectral range and the additional ability to determine the pellet's thickness and the average refraction through internal re-reflections. Figure 6 shows that the absorption coefficient is the same for both types of pellets independently of the sample type. Moreover, "thick" pellets expand the reliable spectral range to the region from 0.04 to 0.1 THz, and "thin" ones to the region from 1.5 to 2 THz. Note that no difference was found in the refractive index between thick and thin pellets. It allows us to average each sample over two pellet types. The resulting spectra are shown in Fig. 7. The absorption is higher in blood plasma from healthy individuals up to a frequency of 0.4 THz. In the range of 0.4-0.6 THz, there is a change in the dynamics of absorption. Samples from individuals with thyroid nodules exhibit stronger absorption, which is higher from 0.6 THz and more. The same trend was observed in the absorption spectra of liquid samples (see Fig. 3 and Fig. 4).
The pellet spectra do not have any prominent features. Glucose has phonon lines at 1.45, 2.12, 2.5 THz [50]. However, there is evidence that, upon lyophilization of liquid samples containing proteins and sugars, bonds are formed between them, and resonances disappear [61]. The refractive index spectrum does not change, but the absolute values correlate with the disease's presence (see Fig. 7(b)).
The spectra were averaged over groups 1 and 2 to identify spectral differences depending on benign or malignant thyroid nodules (see Fig. 8). The samples from malignant nodule patients had higher absorption compared with that of benign nodule patients. The same trend was observed when studying liquid samples (see Fig. 5). However, spectral differences in liquids are small compared to the total error. Thus, both liquid and lyophilized blood plasma samples with malignant thyroid nodules have higher absorption than samples with benign thyroid nodules. For dry pellets, the differences are significant. The shapes of the spectra showing the increase in absorption with frequency are the same for all samples.

Correlation of spectral data and blood plasma composition
The absolute value of the Pearson correlation coefficient |R| is a measure of the strength of the linear relationship between the values of a data pair. A value of |R| equal to 1 means that there is a perfect linear relation; |R|>0.5 represents a moderate to a strong relationship, 0.3< |R| <0.5 represents a weak to reasonable relationship, and |R|<0.3 represents a weak relationship [41].
The Pearson correlation coefficient was applied to study the relationship between the blood plasma THz absorption spectra and glucose or protein concentrations. We used the two-tail test, which allots half of the α in one direction (r>0 side) and half of the α in the other direction (r<0 side). We used p=0.05 as a threshold of significant correlation. If p <0.05 was true, we took an |R| to indicate their correlation strength. If p> 0.05, no correlation was found. The strength of the linear correlation between glucose concentration and the absorption at 1 THz is shown in Fig. 9.
For glucose, the calculated Pearson's correlation coefficient is R=-0.84, p=0.0047 in liquid blood plasma ( Fig. 9(a)) and R=-0.79, p=0.01 in lyophilized blood plasma ( Fig. 9(b)), which means the correlation is high and negative. Correlation dependencies for total protein and miRNA are presented in Table 2. Notably, the THz response change depending on glucose concentration is more significant than expected from an effective medium parameters model at such a low volume fraction of glucose. It can be presumably caused by rebuilding protein structure due to glucose presence during lyophilization [61]. Probably, for this reason, no correlation between protein concentration and the absorption coefficient was found. A strong correlation was observed between the absorption coefficient and glucose concentration both for liquid and lyophilized samples. Glucose and protein make the most significant contribution to the blood plasma absorption in the THz range [41,43,44,47].

Table 2. The correlations between the absorption coefficients at 1 THz and the examined variables. R is the Pearson correlation coefficient; p is the two-tail p-value
The dependence of the malignancy degree of thyroid nodules on the several miRNAs content in tissue and blood plasma has been established [18][19][20]. Overexpression of miRNA-146b in papillary and follicular thyroid carcinoma compared to levels in normal thyroid tissues was reported [19]. This upregulation was positively correlated with tumor aggressiveness. Essential miR-146b content differences between groups (p<0.05) were observed. Group 2 had higher miRNA-146b values compared to Group 1 (see Table 1). The level of miRNA 146 had a positive correlation with the lyophilized plasma absorption at 1 THz (R = 0.63, two-tail p-value = 0.049) (see Table 2).

Machine learning application
For the first stage of the ensemble scheme (Fig. 2), kernel PCA, multidimensional scaling, and UMAP showed the best results (see Fig. 10). The exception was CMSE, which produced features with a dimension of scale number (10 in our case), and additional dimensionality reduction was required. Linear separability of the healthy versus benign and malignant thyroid nodules classes was verified by linear SVM, which was applied to the data, preliminary transformed by the linear kernel PCA (see Fig. 11 (a)). The averaged quality metrics: sensitivity was equal to 0.92 ± 0.04; specificity was equal to 0.85 ± 0.05 accuracy was equal to 0.88 ± 0.04; precision was equal to 0.80 ± 0.03. The mean ROC curve and AUC analysis for the linear SVMs is shown in Fig. 11 (b). The results demonstrate the potential of developed models for the separation of healthy and thyroid nodules groups.
Also, linear SVM allows obtaining the most informative features, based on the proximity of the feature vector to the separating hyperplane. An averaged relative features' importance was estimated for the three train/test splits and linear SVMs. The most significant absorption frequency for group separating was 1.11095 THz; relative importance was 0.008. The latter was low due to the normalization of a large number of the feature vector components. Figure 12 illustrates the downfall of relative features importance and most informative THz absorption frequencies.
Next, we developed a method to test the separability of the benign and malignant groups. The thick and thin blood plasma pellets were found to give different information for malignant and benign thyroid nodules distinguishing. For example, data after multidimensional scaling and linear kernel PCA are not linearly separable; UMAP performs well on both data. We suppose that it can be related to the high complexity of the UMAP model and the low data volume.
The OU Kernel PCA provided explicit separability of groups of benign and malignant thyroid nodules when lyophilized blood plasma pellets with a thickness of 0.385-0.565 mm had been used (see Fig. 13(a)). We believe that it relates to their "transparency" in the THz high-frequency range, as was mentioned earlier. That is why the thickness of the pellets should be taken into  account: thin ones are more suitable for the groups under study distinction. This result is consistent with the established spectral data and blood plasma composition correlations.
The benign and malignant groups' size is small. Therefore, the creation of a predictive model cannot be reliable. Still, the confirmed separability of these groups is very promising.

Conclusion
We investigated blood plasma samples of healthy individuals and patients with benign and malignant thyroid nodules by THz-TDS. A strong correlation between the glucose concentration and the absorption for liquid and lyophilized blood plasma samples was established. We also demonstrated the correlation between miRNA-146b level and the absorption at 1 THz for the lyophilized blood plasma samples.
In total, THz spectra differences of liquid blood plasma samples were comparable with the experimental error value. The study of lyophilized blood plasma, pressed into pellets, showed a reliable separation of healthy individuals and patients with thyroid nodules and separation of patients with benign and malignant thyroid nodules through THz absorption coefficient and refractive index values. This separation was confirmed by machine learning as follows. The raw data were projected to the lower-dimensional space by kernel PCA, multidimensional scaling, and UMAP methods. All these methods give similar results except CMSE. Linear separability between healthy and thyroid nodules classes was demonstrated by the linear SVM with 92% sensitivity, 85% specificity, and 88% accuracy. The informative feature histogram was obtained to show the most informative frequencies for the classification.
In our case, an individual algorithm was not sufficient to distinguish patients with benign and malignant thyroid nodules using THz spectra of lyophilized blood plasma samples. A two-stage ensemble scheme was proposed. On a first step corresponded to a sensitivity estimation, the thyroid nodule group is separated from the healthy one. On a second step corresponded to a specificity estimation, benign and malignant groups are separated.
The current number of blood plasma samples from patients with benign and malignant thyroid nodules is not enough to create a robust predictive model. Still, the confirmed separability of these groups is very promising. Thus, we proved that THz TDS could be sensitive to blood composition changes, depending on the thyroid nodule malignancy degree. Collecting enough samples can give an opportunity for differential diagnosis of thyroid nodules through blood plasma analysis by THz spectroscopy and machine learning.