Efficient thyroid disorder identification with weighted voting ensemble of super learners by using adaptive synthetic sampling technique

Abstract: Millions of people all over the world suffer from thyroid disease. For thyroid cancer to be effectively treated and managed, a correct diagnosis is necessary. In this article, we propose an approach for diagnosing thyroid disease that combines an adaptive synthetic sampling method with a weighted average voting (WAV) ensemble of two distinct super learners (SLs). Resampling techniques are used in the proposed methodology to correct the class imbalance in the datasets, and a group of two SLs made up of various base estimators and meta-estimators is used to increase the accuracy of thyroid cancer identification. To assess the effectiveness of the proposed methodology, we used two publicly accessible datasets: the KEEL thyroid illness dataset (Dataset1) and the hypothyroid dataset (Dataset2) from the UCI repository. Applying the adaptive synthetic (ADASYN) sampling technique to both datasets yielded considerable gains in accuracy, precision, recall and F1-score. The WAV ensemble of the two distinct SLs exhibited improved performance compared to prior studies on identical datasets and produced higher prediction accuracy than any individual model alone. The proposed methodology has the potential to increase the accuracy of thyroid cancer categorization and could assist with patient diagnosis and treatment. The computational complexity of the WAV ensemble strategy and the ideal choice of base estimators in the SLs remain limitations of this study that call for further investigation.


Introduction
Thyroid cancer is one of the most common endocrine malignancies, accounting for approximately 3.4% of all new cancer cases globally [1,2]. It is estimated that there were 567,233 new cases of thyroid cancer and 41,071 deaths from the disease in 2020 alone [3]. Thyroid cancer is particularly prevalent in women, with a female-to-male incidence ratio of 3:1 [4]. Figure 1 shows the prevalence of hypothyroidism in various countries as a percentage of the population. Risk factors for developing thyroid cancer include exposure to ionizing radiation, a family history of thyroid cancer and certain genetic mutations [5]. The growing incidence has been attributed to numerous factors, such as increased exposure to ionizing radiation, environmental pollutants and improved diagnostic techniques such as high-resolution ultrasound and fine-needle aspiration biopsy [6,7]. The thyroid gland produces two hormones, thyroxine (T4) and triiodothyronine (T3), and dysregulation of these hormones can result in several pathological conditions, including hypothyroidism, hyperthyroidism and thyroid cancer [8]. Noninvasive techniques such as ultrasound and computed tomography (CT) scans and invasive techniques such as fine-needle aspiration biopsy (FNAB) are used for the detection of thyroid cancer [9][10][11]. Ultrasound can distinguish between solid and cystic nodules and can also identify features suggestive of cancer, such as irregular borders, microcalcifications and increased vascularity. If an ultrasound reveals a suspicious nodule, an FNAB may be performed to obtain a tissue sample for microscopic examination. Early diagnosis is very important for improving the prognosis and reducing the mortality rates associated with this malignancy. Recent advances in machine learning (ML) techniques have the potential to significantly improve the accuracy and efficiency of thyroid cancer classification, aiding clinicians in making better-informed treatment decisions.
Machine learning techniques have demonstrated their utility in various aspects of cancer research and clinical practice, such as disease diagnosis, prognosis and treatment selection [12]. In the context of thyroid cancer, ML algorithms have been employed to analyze a variety of data types, including imaging data, genomic data and clinical data, to provide valuable insights into the classification and prediction of the disease [13][14][15][16][17]. For instance, ML techniques have shown promising results in the classification of thyroid nodules using ultrasound images [18,19], prediction of aggressive tumor features based on clinical and histopathological data [20,21] and molecular classification of thyroid cancer subtypes using genomic data [22,23]. The integration of ML techniques into thyroid cancer diagnostics and treatment decision-making has the potential to enhance patient care by improving the accuracy of diagnoses, reducing unnecessary interventions and facilitating personalized treatment planning. However, despite these promising advancements, challenges remain in terms of data quality, model interpretability and clinical implementation, warranting further research and development in this area.
After the pre-processing part, handling the class imbalance issue effectively for both datasets is a top concern. In machine learning, imbalanced datasets occur frequently when there are significantly fewer examples in one class than in another. This can have a negative impact on the accuracy of machine learning models, which is especially problematic for marginalized groups. Two datasets are used in this research work. The first dataset comprises three classes in the target variable, representing normal, hypothyroidism and hyperthyroidism. The total instances for each target class are 166 samples for normal, 6666 samples for hypothyroidism and only 368 samples for hyperthyroidism. Similarly, the second dataset includes two classes, with 3,481 samples labelled as P and 291 as N. Both datasets clearly contain imbalanced classes, which can directly affect the performance and accuracy of the proposed model because of the limited number of training samples for specific target classes. Therefore, in this study, we implemented adaptive synthetic (ADASYN) sampling to resample the minority class target variables. Table 2 reports the total number of samples for the original and ADASYN-generated datasets.
The last part of this research study focuses on the implementation of an ensembling technique for the two implemented super learners. Super learning is itself an ensembling technique, in which a meta-estimator combines the predictions of multiple base estimators to improve overall performance. The super learner method uses cross-validation to estimate the performance of many machine learning models. By lowering bias and variance and eliminating parametric assumptions, the super learner method can increase the accuracy of machine learning models. It can also help to avoid overfitting and improve model generalization. In this study, two super learners were implemented and then combined through a voting ensemble to further improve the accuracy and performance of the proposed approach on both datasets.
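The weighted average voting step can be illustrated with a minimal sketch. It assumes each super learner outputs a matrix of class probabilities; the probability values and the weights 0.4 and 0.6 below are hypothetical and are not taken from this study.

```python
import numpy as np

def weighted_average_vote(prob_list, weights):
    """Combine class-probability matrices from several models by weighted averaging."""
    probs = np.stack(prob_list, axis=0)        # shape: (n_models, n_samples, n_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize so the weights sum to 1
    avg = np.tensordot(w, probs, axes=(0, 0))  # shape: (n_samples, n_classes)
    return avg, avg.argmax(axis=1)             # averaged probabilities and predicted labels

# Hypothetical outputs of two super learners for 3 samples and 3 classes.
sl1 = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]])
sl2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.2, 0.2, 0.6]])

avg_probs, labels = weighted_average_vote([sl1, sl2], weights=[0.4, 0.6])
print(labels)
```

For the second sample the two learners disagree (class 1 versus class 2), and the larger weight on the second learner decides the vote. In practice the weights would typically be chosen from validation performance.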
The main contributions of this work are as follows:
• Novel ensemble modeling approach utilizing two super learners, each containing three distinct classifiers, to boost classification performance and reduce model variance.
• Various preprocessing and feature selection techniques employed, including feature importance techniques, dimensionality reduction methods, class imbalance handling, outlier detection and feature standardization, to streamline the datasets and identify the most relevant features for thyroid disease classification.
• Class imbalance issues were addressed using the adaptive synthetic (ADASYN) sampling technique, oversampling the minority class to ensure equal representation of all classes.
The paper is structured into several sections, starting with a review of the relevant literature and prior research on thyroid disease classification in Section 2. Section 3 describes the methodology employed in this study, including data acquisition, preprocessing, feature importance, outlier detection, class imbalance handling and ensemble modeling. Section 4 presents the findings of the study, including a comparison with existing works. Section 5 provides an interpretation and analysis of the results. Finally, Section 6 summarizes the study's main contributions, limitations and potential for future research.

Related works
Several studies have used ML techniques such as support vector machines (SVM), artificial neural networks (ANN) and deep learning algorithms such as convolutional neural networks (CNNs) for the detection and classification of thyroid cancer. One area where ML has shown promise is in the diagnosis of thyroid nodules, which is crucial for accurate and timely treatment planning. The study [24] proposed a deep learning technique based on a deep convolutional neural network (CNN) to distinguish between benign and malignant thyroid nodules using ultrasound images. The dataset consisted of 1,000 ultrasound images of thyroid nodules, which were divided into training, validation and testing sets. The CNN model achieved an accuracy of 87.6% on the testing set and demonstrated high sensitivity in detecting malignant nodules. Another study [25] employed a machine learning approach to predict the presence of the BRAF mutation in cancerous thyroid nodules. The researchers used 96 ultrasonic images of thyroid nodules and extracted 86 radiomic features. They utilized three different models, namely linear regression (LR), support vector machine (SVM) and random forest (RF), to predict the likelihood of the BRAF mutation being present. Another study [26] proposed a thyroid nodule classification system based on feature fusion and deep learning techniques. The dataset consisted of 5,310 ultrasound images of thyroid nodules and the proposed system achieved high accuracy (95.2%), sensitivity (93.1%) and specificity (96.8%) using a combination of CNN and LSTM networks. In the research conducted in [27], Chen et al. utilized the LASSO technique along with an LR model to select the ultrasonic characteristics associated with malignant thyroid nodules. Subsequently, they employed RF to categorize the malignant thyroid nodules. By using LASSO-based LR (LLR) in conjunction with RF, they achieved the highest level of accuracy, which was 82%.
ML has also been applied to predict the risk of malignancy in thyroid nodules using radiomics features extracted from CT images, and to prognostic modeling in thyroid cancer, which is essential for personalized treatment planning and improved patient outcomes. The study described in [28] used two machine learning techniques, namely SVM and RF, to detect thyroid disorders using the thyroid dataset provided. The SVM model achieved 91% accuracy, while the RF model achieved 89%. The study in [29] aimed to forecast thyroid disease, categorizing it into two types: hypothyroid and euthyroid. The assessment criteria adopted in the research encompassed accuracy, precision, recall, F1-score, ROC-AUC, confusion matrix and classification. The random forest classifier stood out as the most effective approach, achieving a success rate of 99.5%. The study emphasized the capacity of machine learning algorithms in detecting and diagnosing thyroid disease in its initial stages. In another work [30], the model employed was an ensemble of homogeneous ensembles that combined multiple attribute selection approaches. The findings of the study demonstrated that the proposed method achieved an accuracy of 99.6%, which surpassed the other state-of-the-art approaches. The algorithm emerged as the best technique used in the study. Another study [31] proposed an artificial neural network (ANN) model to differentiate between benign and malignant nodules and improve the accuracy of objective diagnosis based on ultrasound (US) images. The ANN accurately predicted 82.3% of thyroid cancer cases with an AUC value of 0.818 and an accuracy rate of 84.5%.
In another study [32], it was observed that SVM was more effective than RF in identifying thyroid conditions. The study employed ML classifiers to predict the presence of thyroid disorders. To enable algorithms to identify the likelihood of patients developing a particular disease, data preparation techniques were implemented to simplify the data. Disease prediction using machine learning is a common practice and several methods, such as SVM, DT, LR, ANN and KNN, are employed to predict the likelihood of a patient acquiring thyroid disease. In [33], clinical datasets were employed to evaluate and compare the performance of three classifiers: SVM, NB and DT. SVM is widely utilized in machine learning. The study [34] categorized thyroid disease into three groups based on data, i.e., overactive thyroid and hypothyroidism. The study implemented several classification methods, including SVM, DT, RF, NB, LR, KNN, LDA and MLP. The most accurate classifier was RF, achieving 89% accuracy.
In another study [35], researchers employed three ML techniques (ANN, RF and SVM) to identify thyroid texture. The researchers created 30 attributes based on spectral energy, using autoregressive modeling of 2D thyroid ultrasound image variation, to train the classifiers. The characteristics of thyroid tissues were illustrated using image-based features instead of text-based descriptors. When the three techniques were combined, the accuracy rate was around 90%. In [36], the authors used data mining techniques in Python to create algorithms for identifying thyroid illness types, which enabled cost-effective thyroid diagnostic reports to be made available to patients. Two well-known systematic attribute selection techniques, namely sequential forward and sequential backward selection, were utilized. The evolutionary method was used as a popular strategy for selecting features in nonlinear optimization problems. The SVM was employed to detect hypothyroidism.
In a cross-sectional study [37], a classification algorithm was developed by integrating SVM, MLP, CHAID and iterative dichotomiser-3. To address dataset imbalance issues, classification methods, bootstrap aggregating (Bagging) and boosting procedures were utilized, which improved the classification outcomes. The study revealed that SVM bagging produced 100% precision and specificity, 73.33% recall and 84.62% F-measure. In a different study [38], the attribute partitioning criteria for detecting thyroid disease were determined using DT. The authors aimed for an accuracy rate of 99.89% and compared the diagnostic results using DT, SVM and NB methodologies. In another study [39], DT, KNN and SVM were used to evaluate the risk of thyroid illness based on a patient's medical history using various ML methods for disease-prevention diagnostics. Table 1 compares existing studies on thyroid disease detection using various datasets for evaluation. For our study, we chose a well-known UCI dataset. While previous studies achieved high accuracy in detecting and classifying thyroid disease, there has been limited research on feature selection for this classification problem. Prior studies on thyroid problems categorize them into three classes: normal, hypothyroidism, or hyperthyroidism. However, for proactive prediction and treatment, categorizing patients based on their treatment and general health condition would be more effective. Furthermore, there has been limited discussion on evaluating and comparing the performance of machine learning and deep learning-based techniques for thyroid disease classification. To address these limitations, we propose a multiclass solution for thyroid disease classification that utilizes feature selection and provides a comprehensive performance comparison of machine learning and deep learning-based approaches.

Materials and methods
The methodology employed in this study is depicted in Figure 2 and comprises the following steps: data acquisition, preprocessing, feature importance, class imbalance handling, outlier detection, feature standardization, ensemble modeling and performance evaluation. Two datasets were used for analysis: the KEEL thyroid disease dataset and the hypothyroid dataset from the UCI repository. During the preprocessing phase, various data exploration techniques were applied to gain insights into the datasets. Feature importance techniques were employed to identify the most relevant features for thyroid disease classification. To explore the selected features and their relationships, dimensionality reduction techniques like principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) were employed.
The class imbalance issue was addressed using the adaptive synthetic (ADASYN) sampling technique, which oversampled the minority class to ensure equal representation of all classes. Subsequently, outlier detection techniques were applied to identify and remove anomalous observations from the selected features. The features were then standardized to ensure consistent scaling across all variables. The methodology employed in this study involved the development of an ensemble model composed of two super learners, with each super learner containing three distinct classifiers. This ensemble model aimed to boost the classification performance by leveraging the strengths of multiple classifiers and reducing the overall model variance. The ensemble model was evaluated using a range of performance metrics, such as accuracy, specificity, sensitivity and F1-score, to thoroughly assess its effectiveness in classifying thyroid diseases.

Dataset description
Two datasets were employed in this work to enhance the analysis and thyroid disease classification. The first dataset, the KEEL thyroid disease dataset, offers a comprehensive collection of attributes related to thyroid function tests, patient demographics and clinical data. The second dataset, the Hypothyroid dataset from the UCI repository, complements the KEEL dataset by providing additional instances and features pertinent to hypothyroidism, a common type of thyroid disorder. By utilizing both datasets, the analysis benefits from a diverse and extensive set of instances that cover a broader spectrum of thyroid disease cases. This comprehensive dataset allows for a more accurate evaluation of the classification models and ensures a robust analysis of the factors influencing thyroid disease classification. The details of each dataset are given as follows:

Dataset 1: KEEL Dataset
The KEEL thyroid disease dataset provides a comprehensive collection of data related to thyroid conditions, enabling us to develop and evaluate machine learning models for diagnosing and predicting thyroid disorders. The dataset combines demographic information (age, sex), medical history (on thyroxine, on antithyroid medication, thyroid surgery, I131 treatment) and various thyroid-related conditions and treatments (query on thyroxine, query hypothyroid, query hyperthyroid, lithium, goitre, tumor, hypopituitary, psych) as attributes. It includes essential thyroid hormone levels (TSH, T3, TT4, T4U, FTI), providing valuable insights into the patient's thyroid function. The three classes in the dataset represent distinct thyroid disease categories, enabling researchers to develop multi-class classification models for disease detection and prognosis.

Dataset 2: Hypothyroid Dataset
The second dataset under consideration consists of 30 attributes for 3,772 patients, with 29 variables being categorical and one being an integer value. This dataset has a significant amount of missing data. Among the 30 attributes, eight crucial features contain missing data. Seven of these features, TT4, FTI, T4U, age, sex, TSH and T3, have 231, 385, 387, 1, 150, 369 and 769 missing samples out of the total 3,772 instances, respectively. The eighth, the TBG feature, consists entirely of missing values. The target class distribution, represented as a binary class, includes 3,481 samples labeled as P and 291 as N.

Data preprocessing
In the initial stage of preprocessing, the datasets are thoroughly examined for potential errors and inconsistencies, such as incorrect formatting, duplicate entries and invalid values. These issues are rectified through a meticulous data cleaning process, ensuring the integrity of the data. Subsequently, the datasets are scrutinized for missing values, which are imputed using a variety of techniques, encompassing mean, median, mode and k-nearest neighbor imputation methods.
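As an illustration of the simpler imputation strategies named above, the following sketch fills missing entries of a single numeric column with its mean, median or mode. The helper name `impute_column` and the sample values are made up for this example; k-nearest neighbor imputation is not shown.

```python
import numpy as np

def impute_column(values, strategy="mean"):
    """Fill NaN entries of a 1-D numeric array using the chosen statistic."""
    col = np.asarray(values, dtype=float).copy()
    missing = np.isnan(col)
    observed = col[~missing]                    # statistics use observed values only
    if strategy == "mean":
        fill = observed.mean()
    elif strategy == "median":
        fill = np.median(observed)
    elif strategy == "mode":
        uniq, counts = np.unique(observed, return_counts=True)
        fill = uniq[counts.argmax()]            # most frequent observed value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    col[missing] = fill
    return col

# Made-up hormone measurements with two missing entries.
column = np.array([1.0, np.nan, 2.0, 2.0, np.nan, 7.0])
mean_filled = impute_column(column, "mean")
median_filled = impute_column(column, "median")
mode_filled = impute_column(column, "mode")
```

The mean is pulled toward the large value 7.0, while the median and mode are not, which is one reason skewed laboratory measurements are often imputed with the median.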
In the succeeding step, features that make no significant contribution to the model are identified and eliminated from the datasets. Such features may encompass irrelevant or redundant data, or data that exhibits high correlation with other features. This step aids in streamlining the datasets and mitigating noise, which ultimately enhances the model's accuracy and reliability.
Following this, redundant values are identified and removed from the datasets. This process entails detecting and eliminating duplicate data present across the datasets, as well as any additional redundant information that may exist. By eradicating redundant values, the datasets are further simplified, which bolsters the efficiency and accuracy of the machine learning model applied in the study.

LASSO model-based attribute importance
In conjunction with the pre-processing steps detailed earlier, this study also utilized a least absolute shrinkage and selection operator (LASSO) model-based attribute importance technique to identify significant features from the preprocessed data. The LASSO model, a well-established linear regression model, is frequently employed in machine learning for the purpose of feature selection. The model is advantageous as it not only minimizes the residual sum of squares but also constrains the sum of the absolute values of the coefficients. This constraint leads to the shrinkage of some coefficient estimates to zero, effectively excluding them from the model and resulting in a more parsimonious and interpretable model.
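The LASSO model can be represented mathematically as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon$$

where $y$ is the dependent variable, $x_1, x_2, \dots, x_p$ are the independent variables, $\beta_0, \beta_1, \beta_2, \dots, \beta_p$ are the regression coefficients and $\epsilon$ is the error term. The LASSO model seeks to minimize the following objective:

$$\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1$$

where $\lVert y - X\beta \rVert^2$ is the residual sum of squares, $\lambda$ is the penalty parameter and $\lVert \beta \rVert_1$ is the L1 norm of the coefficients.
The coefficient estimates can be obtained by solving the following equation:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1 \right\}$$

where $\hat{\beta}$ is the vector of coefficient estimates and $X$ is the preprocessed data.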
Once the LASSO model was trained on the preprocessed data, we extracted the non-zero coefficients as the most important features for predicting the target variable. These important features were then used as input for the final machine learning model, which was trained and evaluated using standard techniques such as cross-validation and hyperparameter tuning.
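The selection of features with non-zero LASSO coefficients can be sketched as follows. This is an illustrative coordinate-descent solver for the penalized residual sum of squares, not the implementation used in this study; the synthetic data, the penalty value 40.0 and the function names are assumptions made for the example.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of feature j removed.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # The threshold is lam / 2 because the squared loss carries no 1/2 factor.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
y = y - y.mean()                              # centering removes the intercept

beta = lasso_coordinate_descent(X, y, lam=40.0)
selected = np.flatnonzero(beta != 0.0)        # indices of the retained features
print(selected, beta.round(3))
```

Larger values of the penalty drive more coefficients to exactly zero, so the penalty controls how many features are retained.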

Data visualization
Upon determining the most significant features from the preprocessed data using the LASSO model-based attribute importance technique, the subsequent step involves visualizing the data in a manner that emphasizes its inherent structure. In this research, we employed both principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to investigate the chosen features and discern any patterns or clusters present within the data.

Principal component analysis
Principal component analysis (PCA) is a widely employed dimensionality reduction technique utilized in various fields like image processing, finance and genetics [49]. The method is a mathematical algorithm that endeavors to decrease the number of features within a dataset while preserving the most essential information. PCA does this by transforming the dataset into a new coordinate system that is aligned with the principal components of the original data, where each principal component constitutes a linear combination of the original features. The objective of PCA is to maximize the variance of the data along each principal component, thereby ensuring that the most significant information in the data is retained. PCA is particularly useful for visualizing data in two or three dimensions, but it can also be applied to higher dimensional data.
Given a dataset X, which contains n observations and p features, the first step of PCA is to calculate the covariance matrix C. The covariance matrix describes the relationship between the different features of the dataset. Specifically, it measures how much two features vary together.
The diagonal elements of the covariance matrix represent the variance of each feature, while the off-diagonal elements represent the covariance between the features.
Next, PCA computes the eigenvectors $v_1, v_2, \dots, v_p$ and the corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$ of the covariance matrix $C$ of $X$. The eigenvectors $v_1, v_2, \dots, v_p$ form an orthonormal basis for the $p$-dimensional space and can be used to project the data onto a new coordinate system that captures the maximum amount of variance in the data.
The eigenvectors are the directions in which the data varies the most and the corresponding eigenvalues indicate the amount of variance in the data along these directions. The eigenvectors and eigenvalues are sorted in descending order of the eigenvalues and only the top $k$ eigenvectors are retained. These retained eigenvectors form an orthonormal basis for a $k$-dimensional subspace. The projection of $X$ onto the $k$-dimensional subspace spanned by the first $k$ eigenvectors is given by the matrix multiplication:

$$Z = X V_k$$

where $V_k$ is the matrix consisting of the first $k$ eigenvectors of $C$. The projected data $Z$ has dimensions $n \times k$, where $k$ is the number of retained eigenvectors. The resulting projected data $Z$ can be used for further analysis or visualization. PCA is particularly useful when dealing with high-dimensional datasets, as it can significantly reduce the number of features while retaining the most important information. PCA is also used for feature extraction, anomaly detection and clustering. Figure 5 (a) presents the PCA projection of the first dataset, while Figure 5 (b) presents the PCA projection of the second dataset before resampling.
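The PCA steps described above (covariance matrix, eigendecomposition, sorting, projection) can be sketched with NumPy as follows. The synthetic data and the function name `pca_project` are assumptions for illustration; the data are mean-centered before projection, which is the usual convention.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components via the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each feature
    C = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                     # first k eigenvectors as columns
    Z = Xc @ V_k                             # projected data, shape (n, k)
    explained = eigvals[:k] / eigvals.sum()  # share of variance per component
    return Z, V_k, explained

# Synthetic data with very different variances along four axes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ np.diag([5.0, 2.0, 0.5, 0.1])

Z, V_k, explained = pca_project(X, k=2)
print(Z.shape, explained.round(3))
```

The `explained` ratios indicate how much of the total variance a two-dimensional plot retains, which helps judge how faithful such a visualization is.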

t-Distributed stochastic neighbor embedding
t-Distributed stochastic neighbor embedding (t-SNE) is a commonly utilized technique for nonlinear dimensionality reduction, which enables high-dimensional data to be represented visually in a lower-dimensional space. Given a dataset X, which contains n observations and p features, t-SNE constructs a lower-dimensional map Y where the distances between points reflect the similarities in their probabilities.
The first step of t-SNE is to model the high-dimensional data as a set of probabilities. Specifically, it constructs a probability distribution P over pairs of high-dimensional data points such that similar points have a higher probability of being chosen than dissimilar points. It then constructs a probability distribution Q over pairs of low-dimensional data points that aims to preserve the similarity structure of the high-dimensional data. The algorithm works by minimizing the Kullback-Leibler divergence between the two distributions P and Q.
The cost function to be minimized is $C = \mathrm{KL}(P \,\|\, Q)$, where KL is the Kullback-Leibler divergence, which measures the difference between the probability distributions $P$ and $Q$:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

The probability $P_{ij}$ that point $i$ would choose point $j$ as its neighbor in the high-dimensional space is computed using a Gaussian kernel:

$$P_{ij} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

where $x_i$ and $x_j$ are the feature vectors of points $i$ and $j$ in the high-dimensional space and $\lVert x_i - x_j \rVert$ is the Euclidean distance between them. The parameter $\sigma_i$ is the standard deviation of the Gaussian kernel for point $i$ and is computed as the distance to its $k$th nearest neighbor. This parameter is chosen to reflect the density of the data around each point, which helps to balance the probabilities for points in dense and sparse regions of the data. To compute the probability $Q_{ij}$ that point $i$ would choose point $j$ as its neighbor in the low-dimensional space, t-SNE uses the Student's t-distribution. Specifically, it defines $Q_{ij}$ as:

$$Q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \lVert y_i - y_k \rVert^2\right)^{-1}}$$

where $y_i$ and $y_j$ are the coordinates of points $i$ and $j$ in the low-dimensional space and $\lVert y_i - y_j \rVert$ is the Euclidean distance between them. To normalize the probabilities, the denominator sums over all other points in the low-dimensional space. By employing gradient descent, t-SNE reduces the cost function with respect to the coordinates $y_i$ of the points in the low-dimensional space. This is accomplished by iteratively updating the coordinates of the points until the cost function reaches a minimum. The algorithm is recognized for its capacity to preserve the local structure of data, which renders it highly advantageous for visualizing intricate datasets like images and text. Figure 6 shows the t-SNE projections of the first and second datasets, respectively.
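The quantities in this description can be computed directly for a small example. The sketch below builds Gaussian-kernel probabilities in the high-dimensional space, Student's t probabilities in the low-dimensional space, and the Kullback-Leibler cost between them. It follows the per-point normalization and the k-th-neighbor bandwidth described in the text; reference t-SNE implementations instead tune each bandwidth from a perplexity target, symmetrize P and normalize Q over all pairs. The function names and random data are assumptions for illustration, and the gradient-descent optimization is omitted.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.maximum(D, 0.0)                 # clip tiny negative rounding errors

def kth_neighbor_sigma(X, k):
    """Bandwidth per point: distance to its k-th nearest neighbor."""
    D = np.sqrt(pairwise_sq_dists(X))
    return np.sort(D, axis=1)[:, k]           # column 0 is the point itself

def gaussian_affinities(X, sigma):
    """P[i, j]: Gaussian-kernel probability that point i picks j as neighbor."""
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(W, 0.0)                  # a point never picks itself
    return W / W.sum(axis=1, keepdims=True)

def student_t_affinities(Y):
    """Q[i, j]: heavy-tailed (Student's t, one degree of freedom) counterpart."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """Sum of P * log(P / Q) over all pairs with P > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                  # high-dimensional points
Y = rng.normal(size=(30, 2))                  # a random, unoptimized 2-D map

P = gaussian_affinities(X, kth_neighbor_sigma(X, k=5))
Q = student_t_affinities(Y)
cost = kl_cost(P, Q)
print(round(cost, 3))
```

A random map gives a clearly positive cost; gradient descent on the coordinates in `Y` would lower it as the map comes to reflect the neighborhood structure of `X`.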

Adaptive sampling
The issue of class imbalance in machine learning has long been recognized as a significant challenge, as it can introduce bias in favor of one class while under-representing another. To address this challenge, a range of sampling techniques have been developed over time, including synthetic over-sampling, which involves generating artificial data points to represent the minority class.
Among the various synthetic sampling methods, ADAptive SYNthetic (ADASYN) sampling has emerged as particularly effective, owing to its non-linear interpolation scheme. This approach introduces non-linearity to the sampled dataset by generating synthetic examples that lie between existing minority examples and their k-nearest neighbors from the majority class. These synthetic samples are then generated in accordance with the density distribution of the minority class in the feature space, which captures the underlying non-linear relationship between the minority and majority classes.
As a result, ADASYN generates minority samples that are uniquely representative of the minority group, introducing new patterns and variations within the dataset. This, in turn, can enhance the ability of machine learning models to capture the non-linear relationships between the feature and target variables.
Let $X$ be a dataset with $N$ samples and $M$ features. Let $C_1$ be the majority class and $C_2$ be the minority class, where $|C_1| > |C_2|$. To reduce the class imbalance, we use ADASYN sampling. We first calculate the density distribution of the minority class samples. For each minority sample $x_i$ in class $C_2$, the density distribution $D(x_i)$ is computed from $\mathrm{dist}(x_i, x_j)$, the Euclidean distance between the $i$th sample and its $j$th nearest neighbor drawn from both classes $C_1$ and $C_2$, together with a decay parameter $p$ that controls the rate at which the contribution of distant samples to the density distribution decays. The weight $w_j(x_i)$ assigns a weight to each sample based on its similarity to $x_i$; it depends on $d_j(x_i)$, the Euclidean distance between $x_i$ and $x_j$, and on a bandwidth parameter $\sigma$. The class imbalance ratio $I_r$ is the ratio between the number of majority class samples $S_M$ and the number of minority class samples $S_m$:

$$I_r = \frac{S_M}{S_m}$$

The number of synthetic samples to generate for each minority sample $x_i$ is denoted $G(x_i)$ and depends on a hyperparameter $\alpha$ that controls the degree of randomness in the sampling process. Finally, the synthetic samples are generated in a series of iterations. For each minority sample $x_i$, we select $k_i$ nearest neighbors from both classes $C_1$ and $C_2$, where $k_i$ is a hyperparameter. We then generate $k_i$ synthetic samples, where $SS_k$ is the $k$th synthetic sample generated for the $i$th minority sample, $\alpha_k$ and $\beta_k$ are random numbers between 0 and 1, $x_r$ is a randomly chosen minority sample and $x_j$ is a randomly chosen sample from the $k_i$ nearest neighbors.
The values of $\alpha_k$ and $\beta_k$ are determined from a hyperparameter $\delta$, which controls the degree of randomness in the sampling process, and two random numbers $r_1$ and $r_2$ between 0 and 1. Figure 7 shows the target variable distribution of the first and second datasets after ADASYN resampling, visualized with PCA, while Figure 8 shows the same distributions visualized with t-SNE.

Figure 8. Visualization of both datasets using t-SNE after ADASYN resampling: (a) first dataset after resampling; (b) second dataset after resampling.

Figure 9 compares the target variable distribution of the original and resampled datasets. Figure 9 (a) represents the first dataset before and after applying ADASYN resampling and Figure 9 (b) represents the second dataset before and after applying ADASYN resampling. It can be observed that the distribution of both classes has become more balanced in the resampled datasets. This indicates that the ADASYN resampling technique has successfully addressed the class imbalance problem in both datasets, which can potentially lead to more accurate and robust classification models.
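To make the resampling idea concrete, the following is a minimal sketch of the commonly described ADASYN recipe, which is simpler than the density-weighted formulation above: minority points with more majority-class neighbors receive more synthetic samples, and each synthetic sample is an interpolation between a minority point and one of its minority neighbors. The function name `adasyn` and the parameters `k`, `beta` and `seed` are illustrative assumptions, not the configuration used in this study.

```python
import math
import random

def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Sketch of standard ADASYN; returns a list of synthetic minority points."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    # Total number of synthetic samples needed to reach the desired balance
    target = (len(majority) - len(minority)) * beta

    # r_i: share of majority points among the k nearest neighbours of x_i
    ratios = []
    for i, x in enumerate(minority):
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: math.dist(x, item[0]))
        ratios.append(sum(1 for _, lab in others[:k] if lab == 0) / k)

    total = sum(ratios)
    if total == 0:  # no minority point has a majority neighbour
        return []

    synthetic = []
    for i, (x, r) in enumerate(zip(minority, ratios)):
        # Nearest minority neighbours used as interpolation partners
        peers = sorted((q for j, q in enumerate(minority) if j != i),
                       key=lambda q: math.dist(x, q))[:k]
        if not peers:
            continue
        # Harder-to-learn points (larger r) receive more synthetic samples
        for _ in range(round(r / total * target)):
            z = rng.choice(peers)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (2.0, 2.0), (2.0, 3.0), (3.0, 2.0), (3.0, 3.0), (1.2, 0.2)]
new_points = adasyn(minority, majority)
```

Because every synthetic point is a convex combination of two minority points, the new samples stay inside the region already occupied by the minority class.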

Local Outlier Factor
The Local Outlier Factor (LOF) algorithm is a popular technique for detecting anomalies in a dataset. In this research, we employed the LOF algorithm as part of our methodology for identifying outliers in our dataset. To implement this algorithm, we first defined a distance metric between data points using the commonly used Euclidean distance. We then used the algorithm to calculate the local density of each data point by measuring the average distance between the point and its k-nearest neighbors. To determine the value of k, we performed a sensitivity analysis and chose the value that resulted in the best performance.
With the local density of each data point calculated, we proceeded to compute the LOF score for each point. This score reflects the degree to which a data point deviates from its neighbors in terms of local density. Specifically, a data point whose local density is significantly lower than that of its neighbors is considered an outlier.
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset consisting of $n$ data points, where each data point $x_i$ belongs to a $d$-dimensional feature space. We define the distance metric between two data points $x_i$ and $x_j$ as the Euclidean distance, given by the equation:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{d} (x_{im} - x_{jm})^2}$$

Given a data point $x_i$, we define its k-distance as the distance between $x_i$ and its $k$th-nearest neighbor, given by the equation:

$$k\text{-dist}(x_i) = d\big(x_i, x_k(i)\big)$$

where $x_k(i)$ is the $k$th-nearest neighbor of $x_i$. Using the k-distance of each data point, we define the local reachability density (LRD) of a data point $x_i$ as the inverse of the average reachability distance from $x_i$ to its k-nearest neighbors, where the reachability distance to a neighbor $x_j$ is the larger of the k-distance of $x_j$ and the actual distance $d(x_i, x_j)$. This is given by the equation:

$$\mathrm{LRD}_k(x_i) = \left( \frac{1}{k} \sum_{x_j \in N_k(x_i)} \max\big(k\text{-dist}(x_j),\, d(x_i, x_j)\big) \right)^{-1}$$

where the sum is taken over $N_k(x_i)$, the k-nearest neighbors of $x_i$. Finally, we define the LOF score of a data point $x_i$ as the average ratio of the LRD of its k-nearest neighbors to the LRD of $x_i$, given by the equation:

$$\mathrm{LOF}_k(x_i) = \frac{1}{k} \sum_{x_j \in N_k(x_i)} \frac{\mathrm{LRD}_k(x_j)}{\mathrm{LRD}_k(x_i)}$$

where the sum is taken over the k-nearest neighbors of $x_i$, excluding $x_i$ itself. A score close to 1 indicates a density similar to that of the neighbors, while a score well above 1 indicates an outlier.

Table 2 shows the change in the number of training samples for both datasets before and after applying ADASYN and outlier removal. Dataset 1 had 5,760 samples in the original dataset, which increased to 14,410 after applying ADASYN and reduced to 12,969 after outlier removal. Similarly, Dataset 2 had 3,017 samples in the original dataset, which increased to 5,592 after applying ADASYN and then reduced to 5,032 after outlier removal. The number of samples increased significantly after applying ADASYN, which helps to balance the class distribution in both datasets.
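The LOF computation can be sketched in a few lines of plain Python. The sketch below follows the standard LOF definition based on reachability distances; the function names `k_nearest` and `lof_scores`, the choice `k=2` and the toy points are illustrative assumptions, and the code assumes no duplicate points (which would cause a division by zero).

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def lof_scores(points, k=2):
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour
    k_dist = [math.dist(points[i], points[neighbours[i][-1]]) for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], math.dist(points[i], points[j]))
                 for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]

# Four clustered points and one isolated point (hypothetical data)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(points, k=2)
```

In this toy example the four clustered points score about 1, while the isolated point scores far above 1 and would be flagged as an outlier.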

Classification models
In this research study, we utilized two iterations of the super learner (SL) ensemble technique as our primary methodology to predict outcomes in our dataset. The SL ensemble technique is an effective method for combining multiple machine learning models to achieve higher prediction accuracy.
In our study, we employed two SL ensembles that consisted of three base estimators each. The first SL ensemble comprised logistic regression, decision trees and support vector classification. We selected these estimators based on their individual strengths and potential synergies when combined. To combine the predictions of the base estimators, we utilized a random forest as the meta estimator, which is known for its ability to reduce overfitting and improve prediction accuracy. Similarly, for the second SL ensemble, we selected random forest, AdaBoost and bagging classifier as base estimators based on their respective strengths in handling large datasets, improving weak learners' performance and reducing overfitting. For this second ensemble, a decision tree was employed as the meta estimator to integrate the predictions of the individual models. The decision tree recursively partitions the dataset into smaller subsets based on input features until a stopping criterion is reached. It then predicts the output variable based on the most prevalent class within each subset, and these predictions are used to generate the final prediction. Figure 10 shows an overview of the model implemented in this study.
By utilizing two iterations of the SL ensemble technique with different combinations of base estimators and meta estimators, we were able to achieve higher prediction accuracy than any individual model could achieve alone. After applying SL 1 and SL 2 ensemble techniques to our dataset, we wanted to further improve the accuracy of our predictions. Therefore, we decided to use a weighted average voting technique as our final step.
After calculating the performance metrics for each model, we assigned weights to each model based on their performance. We assigned higher weights to the models with better performance and lower weights to those with weaker performance. The weights were assigned in such a way that the total sum of weights was equal to one. After assigning the appropriate weights to each model, we combined their predictions by calculating a weighted average of their outputs. To obtain the final prediction, we took a weighted average of the predicted probabilities for each potential outcome. This approach allowed us to derive a more accurate prediction by taking into account the strengths and weaknesses of each individual model.
To gauge the effectiveness of the weighted average voting technique, we compared its performance to that of the individual models and super learner ensembles. We utilized several performance metrics, including accuracy, precision, recall and F1-score, to evaluate the effectiveness of the weighted average voting technique. Further details on the classifiers utilized in this study are provided in the subsequent subsections.

Classifier 1: Logistic regression
Logistic regression is a popular and powerful algorithm for supervised machine learning that can be used for binary classification tasks [40]. The goal of logistic regression is to estimate the probability of a binary outcome based on one or more input features [34]. The input features are combined with a set of weights and an intercept term to produce a linear combination of the inputs. This linear combination is then passed through a sigmoid function to obtain the predicted probability.
The logistic regression algorithm determines the weight and intercept terms that minimize the disparity between the predicted probabilities and the actual binary outcomes within the training data. This is accomplished by minimizing a cost function with an optimization algorithm such as gradient descent. The logistic regression model can be expressed as follows:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}}$$

where $P(y = 1 \mid x)$ is the predicted probability of $y = 1$ given input features $x$, $w$ is the weight vector and $b$ is the intercept term.
To train the logistic regression model, we use a training set of input features and binary target variables. The weights and intercept term are initialized randomly and the cost function is iteratively minimized using an optimization algorithm such as gradient descent. The resulting model can then be used to predict the binary outcome for new input features, as shown in Figure 10.
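As an illustration of the training loop described above, here is a minimal sketch of logistic regression fitted by batch gradient descent on the log-loss. The function names, the learning rate `lr`, the number of `epochs` and the toy data are assumptions made for the example; the weights start at zero rather than at random values so that the result is reproducible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (w, b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: P(y=1|x) - y
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical one-feature data: small values are class 0, large values class 1
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
```

After training, `predict_proba([0.0], w, b)` falls below 0.5 and `predict_proba([3.0], w, b)` rises above it, so thresholding at 0.5 recovers the two classes.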

Classifier 2: Decision trees
The decision trees algorithm is a highly adaptable supervised machine learning model that can accommodate both categorical and numerical data and perform both classification and regression tasks [29]. The algorithm does this by recursively dividing the data into smaller subsets based on the most important attributes until a stopping criterion is reached [34]. The resulting structure is a visual representation of a decision-making process, with nodes representing decisions based on particular attributes and branches representing the outcomes of those decisions [48]. The root node signifies the initial decision with maximum entropy, while the leaf/terminal nodes indicate the final decisions with zero entropy [50]. This approach has proven highly effective in a wide range of problem domains and can offer valuable insights into complex decision-making processes.
Assume we have a dataset $D_1 = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with input features $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\}$ and a target variable $Y$, where
• $D_1$ is the dataset with $x_i$ input features and $y_i$ target variables.
• $x_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}\}$ is the feature vector for the $i$th observation.
• $y_i = \{y_{i1}, y_{i2}, y_{i3}, \ldots, y_{in}\}$ is the $i$th target variable.
To select the root feature, we use an entropy-based impurity measure. Entropy is calculated using the following:

$$H(S) = -\sum_{x} p(x) \log_2 p(x)$$

where $H(S)$ is the entropy and $p(x)$ is the proportion of class $x$ in the attribute node. The information gain after the split is calculated using the following:

$$G(S, x_i) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$

where $S_v$ is the subset of $S$ for which the feature $x_i$ takes the value $v$. The gain for each feature $x_{ij}$ is computed and the feature that maximizes the impurity reduction is selected, e.g., $G(S, x_{i1}) > G(S, x_{i2})$. This process is iterated until a stopping criterion is met, such as reaching the maximum depth or minimum impurity reduction threshold or having a minimum number of observations per node.
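The split criterion can be illustrated with a short sketch that computes entropy and information gain for a categorical feature. The function names and the toy labels are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the parent node minus the weighted entropy of the children."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = [0, 0, 1, 1]
perfect_split = information_gain(["a", "a", "b", "b"], labels)
useless_split = information_gain(["a", "b", "a", "b"], labels)
```

A feature that separates the classes perfectly removes all of the parent entropy, while a feature whose values are spread evenly across the classes yields no gain, so the first feature would be chosen for the split.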

Classifier 3: Support Vector Classification
Support vector classification (SVC) or support vector machines (SVM) is a popular and effective supervised machine learning algorithm that can be used for both classification and regression tasks [34]. The goal of SVC is to find a hyperplane that best separates the input data into different classes or to find a hyperplane that best fits the input data for regression tasks [45,51]. The hyperplane is chosen to maximize the margin, or the distance between the hyperplane and the closest data points from each class [40,41].
To train an SVC model, we first select a kernel function, denoted as K(x, x ′ ), that maps the input features to a higher-dimensional space, where the input data is more separable [50]. The kernel function takes two input vectors x and x ′ and outputs a scalar value that measures the similarity between them. The most common kernel functions are linear, polynomial and radial basis function (RBF) [51]. The choice of kernel function depends on the characteristics of the input data and the specific problem domain [52].
The input data consists of $n$ feature vectors $x_i$, where $i \in [1, n]$, and the corresponding labels $y_i$, where $y_i \in \{-1, 1\}$ for classification tasks and $y_i \in \mathbb{R}$ for regression tasks. For classification tasks, we aim to find a hyperplane in the feature space that separates the two classes with the largest possible margin. For regression tasks, we aim to find a hyperplane that best fits the input data with minimum error.
The optimization problem for SVM is defined as follows:

$$\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i$$

subject to $y_i (w^{T} x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, where $w$ is the weight vector, $b$ is the bias term and $\xi_i$ is the slack variable that allows for misclassifications in the margin. The parameter $C$ controls the trade-off between maximizing the margin and minimizing the classification or regression error [52]. The solution to the optimization problem is obtained by solving its dual form, which is given by:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i} \alpha_i y_i = 0$, where $\alpha_i$ is the Lagrange multiplier associated with the $i$th data point and the support vectors are the data points with non-zero Lagrange multipliers. The weight vector $w$ and the bias term $b$ can be computed from the support vectors and their corresponding Lagrange multipliers.
Once the hyperplane is determined, new input data can be classified or predicted by computing its distance from the hyperplane. For classification tasks, the predicted label is determined by the sign of the distance.
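The prediction step can be sketched as follows: the decision value is a kernel-weighted sum over the support vectors, and its sign gives the class. The RBF kernel parameter `gamma`, the support vectors, the multipliers `alphas` and the bias `b` are hand-picked hypothetical values, not the output of a trained model.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Signed score: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def classify(x, support_vectors, alphas, labels, b):
    return 1 if decision_value(x, support_vectors, alphas, labels, b) >= 0 else -1

# Hypothetical support vectors; note that sum(alpha_i * y_i) = 0 as required
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [-1, 1]
b = 0.0

near_negative = classify((0.1, 0.1), support_vectors, alphas, labels, b)
near_positive = classify((1.9, 2.0), support_vectors, alphas, labels, b)
```

A point close to the negative support vector is assigned to class -1 and a point close to the positive one to class +1, because the kernel value decays with distance.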

Classifier 4: Random forest
The random forest algorithm is an influential machine learning technique that finds extensive application in both classification and regression tasks [53]. It falls within the category of ensemble learning algorithms, which combine multiple decision trees to produce more accurate predictions [47]. One of the principal advantages of random forest is its use of bootstrap aggregation (also referred to as bagging) to enhance the performance of the model by decreasing variance [50]. This is achieved by training each decision tree on a randomly selected subset of the original data, which helps to mitigate overfitting and enhance the generalization capability of the model. For a given dataset $X$ consisting of $n$ examples, where each example has $m$ features and $Y$ is the respective set of class labels, the random forest algorithm aims to learn a function $f : X \to Y$ that can predict the class for a new input vector $x$.
To create a random forest classifier, we first generate a set of bootstrap samples $D_i$ of size $n'$ that are uniformly and randomly selected with replacement from the original dataset $X$. For each feature $F$ and target variable $T$ in each sample set $D_i$, we calculate entropy and gain to create one decision tree per bootstrap sample. The entropy of a selectable feature $F$ and the target variable $T$ is calculated using:

$$E(T, F) = \sum_{c \in F} P(c)\, E(c)$$

where $E(c)$ is the entropy of the respective class and $P(c)$ is the proportion of samples belonging to the respective class.
We can then calculate the information gain after the split using:

$$\mathrm{Gain}(T, F) = E(T) - E(T, F)$$

where $E(T)$ is the entropy of the target variable and $E(T, F)$ is the entropy of the target and the feature. The predicted outcomes $t_i$ from each of the bootstrap samples in $D_i$ are then compared to form an aggregate score:

$$\hat{y} = \arg\max_{c} \sum_{i} \delta(t_i, c)$$

where $\hat{y}$ is the predicted outcome and $\delta$ is the Kronecker delta function.
Finally, the predictions from a random forest algorithm can be given by:

$$f(x) = \frac{1}{N} \sum_{i=1}^{N} T_i(x)$$

where $f(x)$ is the prediction for an input vector $x$, $N$ is the number of decision trees in the forest and $T_i(x)$ is the prediction of the $i$th decision tree. The scaling factor $1/N$ normalizes the aggregated votes so that the output can be read as a probability distribution over the possible class labels.
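The aggregation step can be illustrated with a small sketch that turns the votes of individual trees into vote shares and a final label. The function name `forest_vote` and the example votes are assumptions; the trees themselves are not modeled here.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the majority label and the share of trees voting for each label."""
    n = len(tree_predictions)
    shares = {label: count / n
              for label, count in Counter(tree_predictions).items()}
    return max(shares, key=shares.get), shares

# Hypothetical votes from four trees for one input vector
label, shares = forest_vote(["sick", "healthy", "sick", "sick"])
```

Dividing each vote count by the number of trees is what makes the shares sum to one, so they can be interpreted as class probabilities.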

Classifier 5: AdaBoost
AdaBoost, short for adaptive boosting, is a popular ensemble learning algorithm used for binary classification and regression tasks [30]. AdaBoost combines multiple weak classifiers into a strong classifier by assigning weights to each weak classifier based on their accuracy [53]. Let $C(x)$ be the binary classifier that predicts the label of input data $x$. The final prediction of the AdaBoost model is given by:

$$H(x) = \mathrm{sign}\left( \sum_{t} \alpha_t C_t(x) \right)$$

where $H(x)$ is the final prediction, $\alpha_t$ is the weight assigned to weak classifier $C_t$ and sign is the sign function that returns $+1$ or $-1$ depending on the sign of its argument.
To update the weights of the input data samples after each iteration, we use the following formula:

$$w_i \leftarrow w_i \exp\big( -\alpha_t\, y_i\, C_t(x_i) \big)$$

where $w_i$ is the weight of data point $i$, $y_i$ is the true label of data point $i$ and $x_i$ is the input data. If data point $i$ is correctly classified by weak classifier $C_t$, $y_i C_t(x_i)$ is positive and the weight $w_i$ is decreased. If data point $i$ is misclassified, $y_i C_t(x_i)$ is negative and the weight $w_i$ is increased. The weight assigned to each weak classifier is determined by its accuracy on the training data. Let $\epsilon_t$ be the classification error of weak classifier $C_t$, defined as:

$$\epsilon_t = \frac{\sum_{i} w_i \, \mathbb{1}\big[ C_t(x_i) \ne y_i \big]}{\sum_{i} w_i}$$

The weight $\alpha_t$ is then computed as:

$$\alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$$

The weight $\alpha_t$ is positive if the classification error of $C_t$ is less than 0.5 and negative otherwise. A higher weight is assigned to weak classifiers with lower classification error.
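One boosting round can be sketched as follows. The function name `adaboost_round` and the toy labels are assumptions; the final renormalization of the sample weights is a customary step added here so that the weights remain a distribution.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: returns (alpha_t, normalised updated weights)."""
    total = sum(weights)
    # Weighted classification error of the weak classifier
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp) / total
    # Classifier weight; undefined when eps is exactly 0 or 1
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Misclassified points (yt * yp = -1) gain weight, correct ones lose weight
    updated = [w * math.exp(-alpha * yt * yp)
               for w, yt, yp in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Four equally weighted samples, the last one misclassified (hypothetical)
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

With one error in four equally weighted samples the error is 0.25, the classifier weight is positive, and after renormalization the misclassified sample carries half of the total weight for the next round.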

Classifier 6: Bagging classifier
The bagging classifier is a powerful ensemble learning method that utilizes multiple independently trained classifiers to enhance prediction accuracy and mitigate overfitting [47]. Bagging, a contraction of bootstrap aggregating, generates several bootstrap samples of the input data and trains individual classifiers on each sample [50].
The bagging classifier training process commences by randomly selecting data points from the input dataset with replacement, creating several bootstrap samples. Subsequently, a classifier is trained on each bootstrap sample, utilizing the same learning algorithm. Upon completing the training phase, the trained classifiers are employed to make predictions on unseen data. The final prediction is derived by consolidating the predictions of each classifier via majority voting. The bagging classifier's final prediction for a given input sample can be represented as:

$$\hat{y} = \mathrm{mode}(\hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_T) \tag{3.34}$$

where $\hat{y}$ is the final prediction of the bagging classifier, $\hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_T$ are the predictions of individual classifiers and mode is the statistical mode, which is the value that appears most frequently in the set of predictions.
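The bootstrap-then-vote procedure can be sketched with a deliberately simple weak learner. Everything named here (`fit_threshold`, `bagging_predict`, the one-dimensional toy data and the assumption that class 1 has the larger feature values) is hypothetical and only serves to show the mechanics.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Statistical mode of the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

def fit_threshold(sample):
    """Hypothetical weak learner: midpoint between the two class means (1-D)."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # degenerate bootstrap sample with a single class
        return None
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagging_predict(data, x, n_estimators=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        t = fit_threshold(bootstrap_sample(data, rng))
        if t is not None:
            votes.append(1 if x > t else 0)
    return majority_vote(votes)

data = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
high = bagging_predict(data, 9.5)
low = bagging_predict(data, 0.5)
```

Each bootstrap sample yields a slightly different threshold, and the majority vote smooths out those differences, which is the variance-reduction effect bagging relies on.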

Super learners
Super learners (SL) are ensemble methods that combine multiple machine learning models to improve prediction accuracy and reduce overfitting. The SL algorithm uses two stages: base learning and meta learning. During the base learning stage, multiple models are trained using the training data, while in the meta learning stage, a meta model is used to combine the predictions of these base models to generate the ultimate prediction. Figure 11 illustrates the implementation of super learner in this study.

Figure 11. The implementation of super learner in this study.

Let $X$ be the input data, $y$ be the target variable and $M$ be the set of base models. For each base model $m$ in $M$, let $f_m(X)$ be the predicted outcome of $m$ on $X$. Then, the super learner output is given by:

$$\hat{y} = g\big( f_1(X), f_2(X), \ldots, f_{|M|}(X) \big)$$

where $g$ is the meta model that combines the predictions of the base models.

For our first SL ensemble, we selected three base estimators: logistic regression (LR), decision trees (DT) and support vector classification (SVC). We chose these estimators based on their individual strengths and potential synergies that could be achieved by combining them. Once we had trained our base estimators, we used a random forest as the meta estimator to combine their predictions. The random forest algorithm is known for its ability to reduce overfitting and improve prediction accuracy by using multiple decision trees.
For our second SL ensemble, we selected three different base estimators: random forest (RF), AdaBoost and bagging classifier. The random forest algorithm was selected due to its capacity to handle large datasets with high dimensionality, the AdaBoost algorithm was chosen for its effectiveness in enhancing the performance of weak learners and the bagging classifier was chosen for its ability to alleviate overfitting and enhance generalization. We trained and evaluated each of these base estimators using various metrics to identify their strengths and weaknesses. To combine the predictions of the base estimators in our second SL ensemble, we used a DT as the meta estimator. The decision tree algorithm is a simple yet powerful algorithm that recursively splits the dataset into smaller subsets based on input features to predict the outcome.
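The two-stage structure of a super learner can be sketched with toy components. This is a sketch under stated assumptions, not the implementation used in this study: the base learners are hypothetical one-feature threshold rules (`fit_stump`) standing in for the real base estimators, the meta model (`fit_meta`) is a lookup table standing in for the random forest or decision tree, and the meta model is trained on out-of-fold base predictions, which is the customary way to build a super learner but is an assumption here.

```python
from collections import Counter, defaultdict

def fit_stump(rows, labels, feature):
    """Hypothetical base learner: threshold at the midpoint of the class means."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    t = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda r: 1 if sign * (r[feature] - t) > 0 else 0

def fit_meta(base_outputs, labels):
    """Hypothetical meta model g: majority label per pattern of base outputs."""
    table = defaultdict(Counter)
    for pattern, y in zip(base_outputs, labels):
        table[pattern][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    return lambda pattern: (table[pattern].most_common(1)[0][0]
                            if pattern in table else default)

def fit_super_learner(rows, labels, features, folds=2):
    # Stage 1: out-of-fold base predictions, so g is not fit on leaked labels
    oof = [None] * len(rows)
    for f in range(folds):
        train = [i for i in range(len(rows)) if i % folds != f]
        models = [fit_stump([rows[i] for i in train],
                            [labels[i] for i in train], ft) for ft in features]
        for i in range(f, len(rows), folds):
            oof[i] = tuple(m(rows[i]) for m in models)
    # Stage 2: fit g on those predictions, then refit base models on all data
    meta = fit_meta(oof, labels)
    models = [fit_stump(rows, labels, ft) for ft in features]
    return lambda r: meta(tuple(m(r) for m in models))

rows = [(0, 5), (1, 4), (0, 6), (1, 5), (5, 0), (6, 1), (5, 1), (6, 0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
predict = fit_super_learner(rows, labels, features=[0, 1])
```

The returned `predict` function first collects one prediction per base learner and then passes that tuple to the meta model, mirroring the composition $g(f_1(X), \ldots)$ described above.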

Weighted average voting
In our research, we also employed weighted average voting (WAV) as another ensemble method to combine the predictions of our base estimators. The weight assigned to each base estimator depends on its performance on the training data: the better a base estimator performs, the higher its weight in the ensemble. The weighted average of the predicted values of the base estimators can be represented as:

$$\hat{y}_{WAV} = \sum_{i=1}^{n} w_i \hat{y}_i \tag{3.36}$$

where $\hat{y}_{WAV}$ is the final prediction of the ensemble, $n$ is the number of base estimators, $\hat{y}_i$ is the predicted value of the $i$th base estimator and $w_i$ is the weight assigned to the $i$th base estimator, with the weights normalized so that they sum to one. We found that WAV can be a simple yet effective ensemble method for combining the predictions of multiple base estimators, especially when the base estimators have comparable performance on the training data.
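A minimal sketch of weighted average voting over predicted class probabilities is given below. The function name, the probability vectors and the performance scores used as raw weights are hypothetical values chosen for illustration.

```python
def weighted_average_vote(probabilities, scores):
    """Combine per-model class probabilities using performance-based weights.

    probabilities: one list of class probabilities per model
    scores: one performance score per model (e.g. accuracy)
    """
    total = sum(scores)
    weights = [s / total for s in scores]  # normalise so the weights sum to 1
    n_classes = len(probabilities[0])
    combined = [sum(w * p[c] for w, p in zip(weights, probabilities))
                for c in range(n_classes)]
    # Predicted class is the one with the largest weighted probability
    return combined.index(max(combined)), combined

# Two models that disagree; the better-scoring model favours class 1
label, combined = weighted_average_vote([[0.6, 0.4], [0.3, 0.7]], [0.90, 0.99])
```

Here the second model has the higher score and therefore the larger weight, so its preference for class 1 prevails in the combined prediction.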

Performance evaluation metrics
In this section, we provide a brief overview of the performance evaluation metrics used with our model. The F1 score is a performance evaluation metric that combines precision and recall, providing a single metric that balances both. The F1 score is defined as the harmonic mean of precision and recall and is calculated as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
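For reference, the four metrics reported in this study can be computed from the entries of a binary confusion matrix using their standard definitions. The function name and the counts in the example are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives,
# 2 false negatives and 88 true negatives
accuracy, precision, recall, f1 = classification_metrics(8, 2, 2, 88)
```

In this example accuracy is 0.96 while precision, recall and F1 are all 0.8, which shows how accuracy alone can look optimistic on an imbalanced dataset.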

Results
We used an 80-20 data split to train and test our models. Two super learner (SL) ensembles with three base estimators each were employed to enhance prediction accuracy: the first SL ensemble included logistic regression, decision trees and support vector classification, while the second comprised random forest, AdaBoost and bagging classifier. To improve prediction accuracy further, a weighted average voting technique was applied, with weights based on the performance of the individual models. We assessed the weighted average voting technique by comparing its performance to that of the individual models and the super learner ensembles using accuracy, precision, recall and F1-score, and found that the ensemble model consistently outperformed the individual models. These results highlight the benefits of combining SL ensembles with weighted average voting for predicting thyroid disease outcomes.

Table 3 shows the performance of the proposed methodology for the first dataset. Without ADASYN resampling, the model achieved an accuracy of 99.58%, precision of 99.48%, recall of 95.87% and F1-score of 97.56%. After resampling with ADASYN, the accuracy improved to 99.90%, precision to 99.89%, recall to 99.90% and F1-score to 99.90%.
This improvement in performance indicates that the proposed methodology is effective in dealing with imbalanced datasets.

Table 4 shows the performance of the proposed methodology for the second dataset. Without ADASYN resampling, the model achieved an accuracy of 99.602%, precision of 99.785%, recall of 97.413% and F1-score of 98.565%. After resampling with ADASYN, the accuracy changed to 99.714%, precision to 99.711%, recall to 99.717% and F1-score to 99.713%. The gains in accuracy, recall and F1-score again demonstrate the effectiveness of the proposed methodology in dealing with imbalanced datasets. The confusion matrices for the first dataset before ADASYN resampling are illustrated in Figure 12. Similarly, Figure 13 illustrates the confusion matrices for the second dataset before ADASYN resampling. Figures 14 and 15 show the confusion matrices for both datasets with ADASYN resampling and with weighted average voting. Figure 14 (a) shows the number of test samples classified by the model and Figure 14 (b) shows the results in percentage for the first dataset. Similarly, Figure 15 (a) shows the number of test samples classified by the model and Figure 15 (b) shows the results in percentage for the second dataset.

We have compared our proposed methodology with existing works on the same datasets. Table 5 presents a comparison of our proposed methodology, which uses an ensemble, with existing studies on the first and second datasets. Accuracy, precision, recall and F1-score are used to evaluate and compare the effectiveness of each technique employed in the respective studies.
For the first dataset (KEEL), our proposed SL ensemble method achieved an accuracy of 99.602%, precision of 99.785%, recall of 97.413% and F1-score of 98.565%. When compared to other studies on the same dataset, our methodology demonstrates superior performance in all metrics. For instance, study [54] using KNN achieved an accuracy of 98.600%, while study [55] employing KNN reached an accuracy of 96.900% and study [56] utilizing a liquid state machine (LSM) autoencoder had an accuracy of 98.900%. Additionally, our method's precision and recall values outperform those reported in [56].

For the second dataset (hypothyroid), study [57] using KNN achieved an accuracy of 98.000%, while study [58] employing a random forest (RF) with sequential minimal optimization (SMO) reached an accuracy of 99.440% and study [59] utilizing a decision tree (DT) had an accuracy of 99.580%. It is worth noting that our method's recall value surpasses that reported in study [59]. As illustrated in Table 6, the proposed methodology, which employs an ensemble, demonstrates superior performance in terms of accuracy, precision, recall and F1-score when compared to other studies on both datasets.

| Reference | Year | Dataset | Technique | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| [57] | 2018 | Hypothyroid | KNN | 98.000% | - | - | - |
| [58] | 2021 | Hypothyroid | RF with SMO | 99.440% | - | - | - |
| [59] | 2022 | Hypothyroid | DT | 99.580% | - | 99.600% | - |

Finally, the proposed approach is compared with several distinct methodologies that use a similar set of techniques for different problem statements. Three studies were selected for this purpose. In [60], kernel slow feature analysis (KSFA) is implemented for the fault detection of the air unit. KSFA is a feature extraction approach that can capture the temporal dynamics of time series data. It extracts time-invariant slow features that can be used to enhance the performance of machine learning models on time series data, and it can be applied in batches, which is beneficial when working with huge datasets.
ADASYN, on the other hand, produces additional training samples in order to build a more balanced dataset and thereby obtain an efficient and robust prediction model. In [61], a hybrid resampling technique (HRT) is used with an extreme learning machine ensemble. Both ADASYN and HRT are resampling approaches for enhancing machine learning model performance on unbalanced datasets, and the choice between them depends on the use case and the type of data. ADASYN is a computationally efficient and simple oversampling approach that adaptively produces synthetic data for minority class instances. HRT is a hybrid resampling approach that uses both oversampling and undersampling to balance the dataset and decrease model bias and variance; compared with ADASYN, it might be computationally costly and may not be appropriate for all kinds of unbalanced datasets. In [62], a technique of feature sparse representation is implemented. While feature sparse representation is intended to address the issue of sparse features in machine learning, ADASYN is intended to address the issue of class imbalance. Understanding why feature sparsity occurs is essential when constructing models, since it may lead to issues such as overfitting and suboptimal learning outcomes. As before, the use case and the kind of data determine which approach to use.

Discussion
The research study presented in this paper aims to provide insights into the effectiveness of a proposed methodology for dealing with imbalanced datasets and the performance of an ensemble model in predicting thyroid disease outcomes. The accuracy, precision, recall and F1-score showed significant improvements after applying ADASYN resampling in both datasets. This indicates that addressing class imbalance through resampling techniques is an essential step in the preprocessing of imbalanced datasets, as it improves the performance of the machine learning models. The results corroborate the importance of considering and addressing class imbalance in the data during the preprocessing stage.
The results demonstrate that the ensemble model, which combines multiple machine learning models, achieved higher prediction accuracy than any individual model alone. This finding supports the idea that combining the strengths of different models through ensemble techniques can lead to improved performance. The use of ensembles with distinct combinations of base and meta estimators further reinforces this notion, as it allows for leveraging the advantages of each model while mitigating their individual weaknesses. The weighted average voting technique, which assigns different weights to the models based on their performance, further improved the prediction accuracy of the ensemble model. This demonstrates that the weighted average voting technique helps to better capture the strengths of each model in the ensemble, leading to a more accurate and reliable prediction. The results obtained for both datasets indicate that the proposed methodology is not only effective in dealing with imbalanced datasets but also robust in predicting thyroid disease outcomes. The consistent improvement of performance metrics on both datasets demonstrates the potential of the methodology to be generalized and applied to other datasets with similar challenges.
For the first dataset (KEEL), our proposed SL ensemble method achieved an accuracy of 99.602%, precision of 99.785%, recall of 97.413% and F1-score of 98.565%. When compared to other studies on the same dataset, our methodology demonstrates superior performance in all metrics. The work [54] employed KNN, a powerful and flexible class of models that have shown remarkable success in various tasks. However, their accuracy of 98.600% falls short compared to our SL ensemble. The work [55] used k-nearest neighbors (KNN), a simple yet effective algorithm for classification tasks, but only reached an accuracy of 96.900%. The work [56] utilized a LSM autoencoder, a bio-inspired neural network model and achieved an accuracy of 98.900%. The precision and recall values reported in the paper [56] are lower than those in our method, which indicates that our method has better discriminatory power between the classes.
For the second dataset (hypothyroid), our proposed SL ensemble method achieved an accuracy of 99.714%, precision of 99.711%, recall of 99.717% and F1-score of 99.713%. When compared to other studies on the same dataset, our methodology again demonstrates superior performance in all metrics. The work [57] used KNN and achieved an accuracy of 98.000%, which is lower than our SL ensemble. The paper [58] employed a random forest (RF) with sequential minimal optimization (SMO), a combination of a powerful ensemble method and a technique for solving large-scale optimization problems. However, their accuracy of 99.440% is still lower than ours. Study [59] utilized a decision tree (DT), a popular and interpretable machine learning model and achieved an accuracy of 99.580%. Although their recall value is comparable to our method, our method still outperforms the study [59] in accuracy, precision and F1-score.
However, it is important to consider some limitations of this study and potential areas for improvement. While the ensemble model showed improved performance, it may be computationally expensive because it relies on multiple base estimators and repeated iterations of the super learner technique. Future research could explore methods to improve the computational efficiency of the ensemble model without compromising its performance. Additionally, the base and meta estimators in the super learner ensembles were selected based on their individual strengths and potential synergies when combined. However, the optimal combination of models may vary depending on the dataset and the problem at hand.
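To give a sense of where the computational cost comes from, the following Python sketch counts model fits for a super learner under a commonly used training scheme. It assumes each base estimator is fit once per cross-validation fold to produce out-of-fold predictions, refit once on the full training set, and combined by a single meta-estimator; the function name and the numbers of base estimators and folds are hypothetical, not the configuration used in this study.

```python
def super_learner_fit_count(n_base, n_folds):
    """Count model fits for one super learner under the assumed scheme.

    Each base estimator: n_folds fits for out-of-fold predictions plus one
    refit on the full training set. One additional fit for the meta-estimator.
    """
    return n_base * (n_folds + 1) + 1


# Hypothetical configuration: two super learners, 4 base estimators each, 10 folds.
fits_per_sl = super_learner_fit_count(n_base=4, n_folds=10)
total_fits = 2 * fits_per_sl
print(fits_per_sl, total_fits)
```

Under these assumptions the fit count grows with the product of the number of base estimators and the number of folds, so reducing either is one direct way to lower training cost.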

Conclusions
The proposed methodology for thyroid cancer classification, which uses a super learner ensemble model with resampling techniques and weighted average voting, showed significant improvements in performance on imbalanced datasets. The results demonstrate the importance of addressing class imbalance during the preprocessing stage and the benefits of combining multiple machine learning models to improve prediction accuracy. The super learner ensemble method achieved higher prediction accuracy than any individual model alone, and the use of distinct combinations of base and meta estimators further improved performance. The proposed methodology outperformed other studies on the same datasets, suggesting that it could be applied to other datasets with similar challenges. However, the computational complexity of the ensemble model and the optimal selection of base and meta estimators remain limitations that require further research. Overall, the proposed methodology shows promise in improving the accuracy of thyroid cancer classification and can potentially aid in the diagnosis and treatment of thyroid cancer patients.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.