Data selection. To ensure relevant and generalizable experimental results, this paper uses the thyroid disease dataset throid0387 from the UCI machine learning repository, which contains thyroid disease records from the Garavan Institute and J. Ross of the College of New South Wales [5–6]. This is a standard thyroid dataset, widely used as a research target by data analysis and machine learning scholars. It contains 20 subcategories and 6 major categories, with 9172 samples in total. In this paper, the thyroid disease types were classified into the following categories: hyperthyroid conditions, hypothyroid conditions, binding protein, general health, replacement therapy, and discordant results. To facilitate the analysis, these labels were transformed into the numerical labels 0, 1, 2, 3, 4, and 5 for subsequent analysis.
Data mining. Statistical analysis yields the number of samples in each of the six categories, as summarized in Table 1. The gap between class sizes is within an acceptable range, and no serious class imbalance occurs. A large number of unlabeled samples were also removed, leaving a final total of 2282 usable samples. Relevance between objective quantities can usually be described by functional or statistical relationships. The correlations between the features in this paper clearly cannot be captured by a simple functional relationship, so Pearson correlation analysis was used [7]. The Pearson coefficient is a statistical indicator of the degree of linear correlation between two variables, defined as the quotient of their covariance and the product of their standard deviations:
$$r = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i} - \bar{y})^{2}}} \left(1\right)$$
where \(r\) represents the Pearson correlation coefficient, \((x_{i}, y_{i})\) is the value pair of sample \(i\), and \(\bar{x}, \bar{y}\) are the corresponding sample means. The specific distribution of each category is shown in Fig. 1, and the statistical results of the data are shown in Table 1. The correlation analysis is visualized as a heat map in Fig. 2; the strongly correlated features are the several important hormonal indicators used as the main diagnostic basis, as described above.
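Eq. (1) can be computed directly from its definition. The following is a minimal sketch in NumPy (the function name `pearson_r` and the toy data are illustrative, not from the paper):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of Eq. (1): the covariance of x and y
    divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Perfectly linearly related variables give r = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_r(x, 2 * x + 1))  # -> 1.0
```

In practice a feature-by-feature matrix of such coefficients (e.g. via `pandas.DataFrame.corr`) is what gets rendered as the heat map in Fig. 2.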
Table 1
Statistics of data set categories
| class | hyperthyroid | hypothyroid | binding protein | general health | replacement therapy | discordant results |
|---|---|---|---|---|---|---|
| index | 0 | 1 | 2 | 3 | 4 | 5 |
| num | 182 | 593 | 412 | 562 | 336 | 197 |
Feature interpolation algorithm. The diagnosis of thyroid disease mainly relies on collected patient information such as TSH, but real data are likely to contain missing values. In this paper, the missing values of each feature are first counted; the missing data fall into two categories, missing continuous values and missing Boolean values. The missing continuous values are shown in Table 2, and the following strategy is used to deal with them: features in which missing values account for more than 50% are deleted. Accordingly, this paper removes TBG, together with the referral source, which is unrelated to the analysis of etiology, and obtains 27 valid feature columns.
The remaining missing values are mainly missing continuous hormone measurements. In current medicine, however, hormones differ in diagnostic validity and importance; for example, TSH can serve as a basis for the initial diagnosis of many thyroid diseases, and its diagnostic reliability is usually high. Accordingly, this paper proposes a weighted feature analysis method based on feature importance. Its main algorithmic steps are as follows. First, the data are trained with the random forest algorithm to obtain a relevance baseline. To further improve the trained model, a grid search with cross-validation is used to tune the specified training parameters, performing k-fold cross-validation to obtain the optimal hyperparameters. Then, the importance statistics are obtained by random forest-based importance analysis, as shown in Fig. 3, and the importance index is denoted as
$$I = RF\left(x, y\right) \left(2\right)$$
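A sketch of this tuning-and-scoring step using scikit-learn, assuming a synthetic stand-in for the thyroid feature matrix (the data, the parameter grid, and the variable names are illustrative only, not the paper's actual settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the thyroid features: labels driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Grid search with k-fold cross-validation to tune the forest (Eq. 2, I = RF(x, y)).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
)
grid.fit(X, y)

# Impurity-based importance scores from the tuned forest.
I = grid.best_estimator_.feature_importances_

# Range normalization as in Eq. (3): divide by the importance range.
I_norm = I / (I.max() - I.min())
print(I_norm)
```

Note that `feature_importances_` is already normalized to sum to one; the extra division by the range reproduces the scaling of Eq. (3).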
The importance score \(I\) corresponding to each feature column with missing entries is thus obtained. Then, the importance score is normalized as:
$$I_{\mathrm{norm}} = \left[\frac{I_{1}}{I_{\max} - I_{\min}}, \dots, \frac{I_{n}}{I_{\max} - I_{\min}}\right] \left(3\right)$$
The coefficients of the corresponding fitting functions are obtained by iterating through the missing columns, analyzing the fitting relationship between each missing column and the other feature columns, and fitting by least squares:
$$a_{ij}, b_{ij}, c_{ij}, d_{ij} = F\left(\min\left(\left(\phi\left(y_{j}\right) - y_{i}\right)^{T}\left(\phi\left(y_{j}\right) - y_{i}\right)\right)\right) \left(4\right)$$
where \(a_{ij}, b_{ij}, c_{ij}, d_{ij}\) denote the coefficients of the objective function obtained by fitting the \(i\)-th column of missing values from the \(j\)-th column of features, and \(F\) represents the solution by the method of undetermined coefficients.
The result of fitting the i-th column using the j-th column is
$$\hat{y}_{ij} = a_{ij} + b_{ij} y_{j} + c_{ij} y_{j}^{2} + d_{ij} y_{j}^{3} \left(5\right)$$
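The least-squares solution of Eqs. (4)–(5) is a standard cubic regression. A minimal sketch with NumPy (the function names `fit_cubic`/`predict_cubic` and the test polynomial are illustrative):

```python
import numpy as np

def fit_cubic(y_j, y_i):
    """Return (a, b, c, d) of Eq. (5), minimizing the squared residual of
    Eq. (4) between the cubic in y_j and the target column y_i."""
    # Design matrix phi(y_j) = [1, y_j, y_j^2, y_j^3]
    Phi = np.vander(np.asarray(y_j, dtype=float), N=4, increasing=True)
    coeffs, *_ = np.linalg.lstsq(Phi, np.asarray(y_i, dtype=float), rcond=None)
    return coeffs  # a, b, c, d

def predict_cubic(coeffs, y_j):
    a, b, c, d = coeffs
    return a + b * y_j + c * y_j ** 2 + d * y_j ** 3

# Recover a known cubic from noiseless data.
y_j = np.linspace(-2, 2, 50)
y_i = 1.0 + 2.0 * y_j - 0.5 * y_j ** 2 + 0.1 * y_j ** 3
coeffs = fit_cubic(y_j, y_i)
print(np.round(coeffs, 3))  # ~ [1.0, 2.0, -0.5, 0.1]
```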
Feature interpolation filling can then be expressed as a process of introducing prior knowledge: the feature importance scores of 'TSH measured', 'TSH', 'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U', 'FTI measured', and 'FTI' are used to weight the fitted results for a given missing column, and the weighted value is selected to fill in the missing entries:
$$y_{i} = \sum_{j=1}^{k} I_{\mathrm{norm},j} \hat{y}_{ij} \left(6\right)$$
The results after fitting are shown in Fig. 2(b). Compared with the static interpolation algorithm alone, the sequence after feature interpolation, which incorporates prior knowledge, expresses richer detail; the imputed values fluctuate around the static estimates and preserve the distribution trend of the original feature columns. After removing the feature columns containing a large number of missing values, the remaining feature columns with missing entries are processed by the proposed weighted interpolation algorithm. The specific interpolation process is as follows: first, all feature columns are initially completed using median interpolation and set aside; if several feature columns of the same sample are missing simultaneously, the means of the other columns are used as temporary data for interpolation to obtain the initially completed data. Then, using the prior data of the other associated features of the same sample, a weighted fit is performed according to the feature importance scores to complete the entries in the missing columns of the original data. In Chap. 5, ablation and comparison experiments show that the feature interpolation algorithm yields outstanding improvements in accuracy.
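The two-stage procedure above (median initialization, then importance-weighted cubic refitting per Eqs. (5)–(6)) can be sketched end to end as follows. This is a minimal NumPy illustration under stated assumptions: the function name `feature_interpolate` is hypothetical, and the weights are renormalized to sum to one, which Eq. (6) leaves implicit:

```python
import numpy as np

def feature_interpolate(X, I_norm):
    """Weighted feature interpolation sketch.
    Step 1: initialize every missing entry with its column median.
    Step 2: for each originally-missing entry, fit a cubic (Eq. 5) from each
    other column and fill with the importance-weighted combination (Eq. 6).
    X: 2-D float array with np.nan marking missing entries.
    I_norm: normalized importance score per column (Eq. 3)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: median initialization per column.
    X_init = X.copy()
    for j in range(X.shape[1]):
        X_init[np.isnan(X_init[:, j]), j] = np.nanmedian(X[:, j])
    # Step 2: importance-weighted cubic refit for originally-missing cells.
    X_filled = X_init.copy()
    n_cols = X.shape[1]
    for i in range(n_cols):                      # target column i
        rows = np.where(missing[:, i])[0]
        if rows.size == 0:
            continue
        fit, weight = np.zeros(rows.size), 0.0
        for j in range(n_cols):                  # fitting column j
            if j == i:
                continue
            coeffs = np.polyfit(X_init[:, j], X_init[:, i], deg=3)
            fit += I_norm[j] * np.polyval(coeffs, X_init[rows, j])
            weight += I_norm[j]
        # Assumption: renormalize so the importance weights sum to one.
        X_filled[rows, i] = fit / weight
    return X_filled
```

On strongly correlated columns this pulls the imputed value from the column medians toward the values implied by the other features of the same sample, which is the behavior described for Fig. 2(b).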
Table 2
Missing values statistics table
| Feature | Missing value ratio |
|---|---|
| TBG | 0.994303 |
| T3 | 0.223488 |
| T4U | 0.042507 |
| FTI | 0.042068 |
| TT4 | 0.005259 |