Diagnosis of hepatocellular carcinoma based on a terahertz signal and VMD-CWSE

: A novel strategy on combining variational mode decomposition (VMD) and composite weighted-scale sample entropy (CWSE) modiﬁed from composite multiscale entropy (CMSE) is proposed to screen hepatocellular carcinoma (HCC) by measuring the terahertz (THz) pulse signals of ten normal and ten HCC serums. Eight measured HCC specimens are negative in serum biomarker alpha fetoprotein (AFP) determination. In CWSE, the time series with weighted-scales are generated from the weighted average processing in the coarse-grained time series corresponding to each scale of the CMSE algorithm. VMD served as a preprocessing method was introduced into decomposing THz signal to obtain the mode functions of speciﬁc bandwidth for identiﬁcation. Final results reveal that more obtainable entropy values of CWSE for recognition in comparison to those of CMSE on the basis of the rule of statistically signiﬁcant diﬀerence and eﬀect size and also manifest the stronger discriminability than the traditional THz parameters. This study provides a new potential auxiliary tool for diagnosis HCC and develops the methodology on the discrimination for similar THz signals.


Introduction
Liver cancer is a kind of high mortality malignant tumor, mainly containing hepatocellular carcinoma (HCC), cholangiocarcinoma, and combined hepatocellular-cholangiocarcinoma. HCC is the most common one among them and has become the fifth most common cancer worldwide and the second frequent cause of cancer-related death in terms of World Health Organization [1]. Nowadays, with the advancement of medical technology, quite a few breakthrough works have been achieved for diagnosis HCC. However, there are some defects in the clinical application. In serological diagnosis, past decade has witnessed extensive clinical availability dependent on serum biomarker including alpha fetoprotein (AFP), Golgi protein 73 (GP73), and desγ-carboxyprothrombin (DCP, also known as prothrombin-induced by vitamin K absence-II, PIVKAII), etc. Nevertheless, the low sensitivity of AFP [2,3] may result in misdiagnosis. Additionally, it is reported that the elevated AFP levels as an early alarm suggests the advent of yolk sac cancer, liver metastasis from gastric cancer, testicular cancer and nasopharyngeal cancer as well [4]. A higher sensitivity (69%) of GP73 is utilized to detect early-stage HCC and 57% sensitivity to screen AFP-negative HCC [5]. With regard to DCP, roughly 30% AFP-negative HCC can be identified as DCP-positive [3]. Therefore it is difficult to diagnose HCC. An effective and reliable rapid screening method or auxiliary technique is urgently needed to identify HCC.
Terahertz (THz) region, defined as 0.1-10THz in the electromagnetic spectrum, provides masses of information on rotational or vibrational transitions of molecules and intermolecular vibrations. It is feasible and convenient to discriminate masses of materials by virtue of this unique property. Moreover, considering the intriguing nature of low energy and nonionizing of THz radiation, THz spectroscopy has been made wide applicability in biomedical field [6]. More than that, THz system can currently be operated by unprofessional experts [7] with the advances made in hardware and commercialization. However, the polar substances (i.e. water) with high absorption of THz wave limits the generalized biomedical applications. Conversely, it is this opposite mechanism that was substantially utilized to distinguish diverse types of tissues [8][9][10], assess the living state of bacteria [11,12], and detect the PCR amplified DNA in aqueous solution quantitatively [13] pursuant to different absorption of water content or of hydration and bulk water. It should be noted that those above researches were conducted by using thin volume aqueous solution in THz transmission spectroscopy or by employing reflection spectroscopy to protect THz signal against immersion. Although the previous reports have described the THz properties of whole blood, plasma, blood cells, thrombus and glucose [14][15][16] as well, the serious possibility of specific spectral feature is usually not available for diagnostic purposes due to the dominated factor by water. Obviously it is hard to identify HCC samples only dependent on the nuance of spectral properties and whereupon a novel strategy combining THz technique is strongly desired.
Until now, multifarious complexity measures of time series have been presented for understanding or solving the issues in the nonlinear systems [17]. Recent years have seen that some entropy-based measures are powerful tools for quantifying the complexity of time series [18][19][20][21]. Sample entropy (SaEn) is a popular method among these entropy-based measures, which is less dependent on the length of time series and remains relative consistency [20]. However, some results indicated that these measures dependent on single-scale often failed to analyze the complexity of long-term structure correctly due to the multiple scales of the structures in time series from complex systems [22]. As regards this weakness, multiscale entropy (MSE) [23] based on coarse-graining technique was put forward to solve various issues on long-term complex time series [24][25][26][27]. Nevertheless, to reduce the estimated errors at large scale in entropy measurements resulted in proposing composite multiscale entropy (CMSE) [28]. Their results regarding the performance comparison between MSE and CMSE demonstrated the latter could provide a more precise estimation of entropy and elevate the distinguishability as a latent tool for identification.
Analogously, the entropy values of different scales reflecting some properties of the samples can be regarded as an alternative approach for identification. THz signal can be seen as a type of time series so that its complexity is apropos to be measured by SaEn. So the measured material information in THz signal can be transformed into entropy values. Zhang et. al. [29] discriminated the THz signals between fresh porcine skin and muscle tissues by employing CMSE to measure the entropy, which showed the preponderance of CMSE method for identification comparing with some familiar parameters (such as the specified amplitudes in time-domain or frequency-domain, refractive index, and absorption coefficient at some frequencies).
In order to produce more distinguishable SaEns and enhance discrimination, the weight coefficients within the data points were taken into account for further processing the coarsegrained time series in CMSE or MSE. Thus the time series with weighted-scales are generated and the following calculation steps are similar to those of CMSE. We named this modified algorithm composite weighted-scale sample entropy (CWSE). Prior to that, variational mode decomposition (VMD) technique served as a preprocessing tool was attempted to introduce to decompose THz signal into several band-limited mode functions so as to select the significative components for subsequent analysis. VMD with solid theoretical background and noise robustness [30] was recently proposed for adaptive decomposing the signal into some components compactly around the central frequencies so that its attractive characteristic has been widely applied in bearing fault diagnosis [31], processing wind power series [32], and analysis and forecasting crude oil price [33], etc. Normal and HCC serums are probably sensitive to dissimilar frequency components within THz band, which is reasonable to introduce VMD to this experiment. In this work, THz signals of normal and HCC serums (including eight AFP-negative samples) were acquired and processed by means of VMD-CWSE. To prove the validity of VMD-CWSE, some traditional THz parameters and VMD-CMSE were considered as comparison.

THz measurement
THz signals of samples were measured using a Picometrix T-ray 5000 fiber-coupled THz-TDS system in reflection mode, which is displayed as the sketch in Fig. 1. The equipment mainly consists of pulsed THz controller, transmitter and receiver heads, and computer with processing software. The sample was placed on a quartz plate (55mm*55mm*20mm) with two centimeter thickness so that it is thick enough to isolate the sample signal from the interference for signal of back surface.

Sample preparation
Ten normal and ten HCC serum specimens (including eight of AFP-negative a ), were provided by clinical laboratory of Beijing Tongren Hospital, Capital Medical University. All the HCC patients were diagnosed by imageological examination or biopsy. Chemiluminescent immunoassay was used to detect AFP level of the serum. The AFP level of each patient is listed in Table 1. All the samples were frozen in the refrigerator at -80°C for 3 months. They were transferred into the cold storage at 4°C in the day before experiment. Prior to THz measurement, the samples were fetched out and put under the room temperature for about 10 minutes. And next the serums were inhaled 1mL by syringe and uniformly spread upon the prepared quartz plate, in terms of labeled number in sequence. After each measurement, the dropper was employed to suck serum and a piece of paper towel was used to soak the residual liquid gently. A new paper towel was taken out to clean the surface after dropping 3∼4 drops of alcohol and pure water. Note that these behaviors were cautious enough to guarantee highly precious phase to avoid corresponding errors. Human serums measured in this experiment were approved by the Ethics Committee of Beijing Tongren Hospital, Capital Medical University.

Composite multiscale entropy (CMSE)
Given an one-dimensional time series k,i+p } defined as coarse-grained time series at scale s can be generated according to Eq. (1). By means of dividing time series into non-overlapping windows of length s, coarse-grained time series is a sequence consisting of the mean of data points within each window. SaEns are computed from all coarse-grained time series and CMSE (Eq. (2)) is attained by calculating the averages of s entropy values. The more detailed description about CMSE can be consulted in [28]. In this study, the embedded dimension m=2, threshold value r = 0.15σ and scale numbers were assigned 20 in terms of the previous study [22,28,35], where σ is the standard deviation of the time series.

Composite weighted-scale entropy (CWSE)
Deriving from SaEn, the multiscale established in CWSE can be viewed as the weighted-scales based on each corresponding scale of CMSE. Herein we took the scale factor 2 to illustrate our modified method in constructing the time series of weighted-scales from schematic THz signal. As shown in Fig. 2, the coarse-grained time series is produced, under which is the time series with weighted-scales. It is obtained by performing weighted averaging in dividing the coarse-grained time series into non-overlapping windows, where the number of windows is dependent on weighted-scale factors. The weight coefficients of data points are represented by the rate of change of amplitude ω (the ratio between amplitude difference and corresponding time) between the point and the consecutive prior one while the weight coefficient of the first point is reasonable to be ignored as its amplitude difference cannot be calculated. We select this rate of change as weight coefficient in that its combination the characteristic for amplitude and time delay of THz signal, which manifests the optical properties of the material to some extent, is able to represent the weightage. Based on the kth coarse-grained time series of CMSE, the weighted average of data points could be obtained as the detailed expressions listed in formula (3), where s and ss denote the scale and weighted-scale factors separately. The further interpretation can be seen the pseudocode listed as follows. And final CWSE value is calculated by defining as averaging the first k entropy values for the kth weighted-scale, expressed with the Eq. (4)∼(6). Thus CWSE, with respect to scale and weighted-scale, has access to more entropy values (several times the amount of CMSE) to supplement some certain information useful. Herein the weighted-scales under each scale were set to 20 to facilitate comparison.
1 ≤ k ≤ s, 1 ≤ kk ≤ ss The essential of VMD is to seek the optimal solution for constrained variational model to decompose the multi-component signal into K discrete number of specified mode functions that are compact around corresponding center frequencies . The quadratic penalty term α that controls bandwidth and Lagrangian multiplier λ were introduced to better solve the constrained variational problem. A more detailed principle description of VMD algorithm is obtainable in literature [30]. Generally speaking, the number of mode functions K and parameter α affect the performance of VMD. However, there is no unified standard to determine K and α. In this paper, K was set to 3 and the top two mode functions were chosen. The further demonstration can be seen in the Appendix. Additionally, the next section will discuss the selection of α.

Results and discussion
THz time signals with 1000 data points for all the measured normal and tumor (HCC) samples are shown in Fig. 3(a) and they are seriously overlapped and unrecognized seemingly. The common time-domain parameters of E max , E min , E max -E min were applied to discriminate the normal and HCC samples. The results were listed in Table 2, from which only E min can manifest the marked difference. Generally, in the statistically significant difference levels of t-test, p-value less than 0.05 denoting significant difference is different from p<0.01(very significant difference) and p<0.001 (extremely significant difference) [36]. Furthermore, it is beneficial to use the effect size of standardized difference between two means to understand how substantially different between two groups. In compliance with the rule of thumb [37,38], its magnitude can be considered as large and very large with d>0.80 and d>1.20 respectively, which means different levels of practical importance. So the significant difference and large practical importance exist between the normal and HCC samples with the aid of the parameter E min . Figure 3(b)∼(d) show the spectra of FFT amplitude (the effective band selected from 0.1to 0.6THz), refractive index and absorption coefficient of the normal and HCC samples. According  to the results of t-test, there is no obvious difference (p>0.05) between these two groups of samples. So it is unnecessary to consider effect size for this situation. CMSE and CWSE were independently applied to identify them with the criteria of statistical differences assessed by Student's t-test. Effect size was further evaluated if significant difference statistically. Profile curves on averaged CMSE as a function of scales are shown in Fig. 4(a). It can be seen from the CMSE profiles that there is a sharp rise for the entropies of the first 6 scales and gradual descent for the remaining scales. And there is no significant difference in CMSE between the normal and tumor group. Note that in this paper 83% confidence interval was adopted as error bars [39]. The images of CWSE distribution for the average normal and tumor sample groups are shown in Figs. 5(a) and 5(b), where it can be seen that entropies posterior to 5th scale and weighted-scale are mainly larger than those in the previous scales and weighted-scales and the majority of entropies are larger than those of CMSE. It can be implied that the generated weighted-scales, as a preprocessing tool, improved the complexity of signal, which is probably pertinent to the enhanced distinctions between two groups. It should be note that in CWSE from 14th scale some undefined entropy values owing to the shortened data [40] are responsible for the vacant p-values. Here binary image representation was pondered to exhibit the results for CWSE due to the cumbersome representation of entropy curves for all scales. The binarization result of CWSE, the area of the statistically significant difference is denoted with red, is also unable to discriminate them [ Fig. 4(b)].  THz signal can be seen as a synthesis composed of sub-signals with different frequency components. We speculate decomposing the original signal into several parts could enhance the differences and integrated signal potentially restricts the identification. Therefore it is necessary to employ VMD to decompose the signals into mode functions with different bandwidths for further analysis and yielding more expected entropies.
Herein an exhaustive search optimization was applied to seek the optimal parameter α without any prior knowledge. The searching area is defined to initialize α=1200 as a center and 1000 as a radius with the interval of 10. Above t-test method is employed as the criterion to select optimum α. It is worth noting that the searching area needs to expand to seek other possible available α in the case of the optimum α adjacent to the border. In the searching area, a total of 201 candidates α produce different mode functions to calculate CMSE and CWSE.
(A) The first mode function (M1) Figure 6(a) displays the p-values distribution of all α for CMSE applied to M1(s), from which it can be seen that there is no distinguishable entropy value in CMSE. Apropos of CWSE, the top three α (α=360, 270, 280) were extracted in the approach of sorting α by counting the total number of entropies that meet the criterion of t-test, as shown in Figs. 6(b)-6(d), where the total amount of CWSE values in Fig. 6(b) is 33. It is sufficient to exhibit the overwhelming superiority of CWSE, which benefits from substantial available entropies of weighted-scales. Additionally, the magnitudes of effect sizes were large for those significant differences. We select the case of α=360 at the 4th scale, including the largest number of separated entropies, in above CWSE to obtain the corresponding curves (Fig. 7). The p-values for significant differences and corresponding effect sizes are listed in Table 3. It can be seen the distinguishable entropies from 10th to 20th scale, which demonstrates the complexities of M1(s) between these two types of samples are different statistically. Nevertheless, there is no case of very significant difference (p<0.01) and very large effect size (d>1.2). In order to further validate the preponderance of VMD and VMD-CWSE, especially for comparing with the parameter E min , the binary image of effect size and three-level thresholding image of p-values are presented in the discussion of the second mode function.  In the three-level thresholding image (Fig. 8), the area of p<0.01, 0.01≤p<0.05, and p≥0.05 is respectively denoted with yellow, red and blue. Comparing with M1, more discriminative entropies either for CMSE or CWSE are gained from the global view of the p-values distribution. Besides that, the very large effect sizes can be observed from the binary images (Fig. 9). Figure 10 shows the specific cases of above CMSE (α=390) and CWSE (α=1610, scale=7) that include the largest amount of distinguishable entropies in the premise of p<0.01 and d>1.2. Tables 4 and 5 list the p-values for very significant differences and corresponding effect sizes.  In contrast to the previous results without executing VMD, either in CWSE or CMSE, considerable improvement on identification for normal and tumor groups can be observed. This just verifies above assumption that the sufficient discrimination was limited by the integrated raw signals. The larger distinctions exist in the frequency components of M2(s) than those of M1(s)  between the measured normal and tumor specimens. This verifies VMD is an indispensable step for knotty identification.

Conclusion
In this work, we have successfully distinguished the normal and HCC serum samples (including AFP-negative) by adopting the novel strategy combined VMD and new proposed CWSE based on measured THz reflection signals. This provides the motivation to be considered as the potential auxiliary tool for diagnosis HCC (including AFP-negative) as it is hard to diagnose clinically. In consequence of the large number of established weighted-scales, it is more possible for CWSE than CMSE on recognition with the aid of VMD, which also suggests CWSE can be deemed as a new index for identification. Furthermore, it is promising and attractive to adopt machine learning tools based on the proposed strategy in the case of handling with complex identification or relevant qualitative analysis. Nevertheless, apropos of VMD, the approach for determining the bandwidth control parameter (α) seemingly intricate. In spite of that, the powerfulness and potential of VMD are conspicuous. As the purpose of this work is mainly concentrated on discrimination the normal and HCC serum samples, the concise and precise methodology in determining the parameter will be investigated in the future.

Appendix
The parameter α was tentatively assigned to 700, 1200, and 1800 for preliminary evaluation the performance of VMD. Figure 11 exhibits these three mode functions (abbreviated M1, M2, M3 in turn) that represent different frequency components. It can be seen from Fig. 12 that different components information are preserved in terms of the corresponding correlation coefficients and energy ratios and M1 contains majority of original information. It can be observed the slight differences for different α in these mode functions. Therefore it is necessary to determine the apropos parameter α for identification. Nevertheless, the correlation coefficient and energy ratio of M3 are the lowest among these mode functions and M3 maybe contain noise components. So M1 and M2 were selected as the above discussion.