Predict The Thyroid Abnormality Particular Disease Likelihood of The Symptoms’ Certainty Factor Value and Its Confidence Level: A Regression Model Analysis

Traditional expert systems (TES) in the medical field commonly use a rule-based certainty factor (CF) algorithm that combines several symptoms to reach an inference. The main limitation of a TES is predicting the likelihood of a particular disease for new patients. CF is calculated from symptoms related to clinical signs in patients’ diagnoses, so a TES is unlikely to predict uncertain outcomes such as the likelihood of a particular disease. Supervised learning, such as linear regression, can address this problem. We analysed an existing TES for thyroid disorders and modelled a regression equation to predict the likelihood of a particular thyroid abnormality from the symptoms’ CF values and their confidence level. We compared multiple linear regression (MLR) and multiple polynomial regression (MPR) to find the best regression model. The results show that MPR is the best model for predicting the likelihood of a particular thyroid abnormality, supported by an R-squared of 94.7%, an adjusted R-squared of 94.4%, an F-value of 265.925, and a p-value < 0.05, a better fit than the MLR model. Our study proposes a foundation for expert system development that focuses more on machine learning expert system (MLES) analysis approaches than on TES.


Introduction
The thyroid gland regulates hormone levels and is a crucial organ for quality of life [1]. Thyroid patients may present different symptoms and have been reported to experience several psychological issues caused by abnormal T4 hormone regulation [2]. Patients with this illness frequently ignore their actions [3], [4]. In the era of health informatics, expert systems (ES) have been used to diagnose this condition [5].
ES, as a product of artificial intelligence (AI), are developed together with medical experts [6], [7]. Related studies have implemented ES in medical fields such as COVID-19 [8], [9], psychology [10]-[13], traditional medicine [14], [15], internal medicine [16], [17], and cancer [18]-[21]. ES are intended to help medical experts solve health problems, including diagnosis, prediction, and treatment [22]. Hariadha et al. [23] developed a medical ES for thyroid disorders, and Al-Hakim et al. [5] implemented this ES as an Android-based app that includes inference for traditional medication. Unfortunately, this ES has not been evaluated for its performance or for future prediction purposes (for new patients), which are essential. This study therefore evaluates the performance of this ES with supervised learning (SL), using regression model analysis.
This paper proposes a regression model analysis for predicting (estimating) the probability of a specific disease among thyroid abnormalities. With a regression model, this study models the likelihood of particular thyroid diseases based on the collected symptoms and aims to increase the performance of the expert system (ES). The paper is timely in supporting the role of health information in the function of an expert system, as machine learning (ML) is vital for predicting the risk of particular thyroid abnormalities. The research therefore focuses on the following research questions: 1. What is the prediction performance in this study, based on the statistical analysis results? 2. How can the likelihood of a particular disease be predicted from the symptoms of thyroid abnormality, based on the certainty method results, using a supervised machine learning method? 3. If the prediction model performs better, should it be applied to future cases with new or related symptoms of thyroid abnormality?
This paper contributes to improving the performance of traditional expert systems (ES) based on the certainty factor method, providing an opportunity to predict the likelihood of a particular disease in thyroid abnormality cases using an intelligent computing method. Health informaticists, biomedical engineers, and computer scientists play a role in improving the performance of traditional ES methods, including the certainty factor method.
The traditional expert system (TES), such as the certainty factor (CF) method, is used to transform uncertain information into certainty values, and it contains knowledge from human experts (CFrule) and users (CFuser). One of the most widely used rule-based algorithms for calculating CF values is MYCIN [23]. This rule-based algorithm calculates CF values representing symptoms [8], [13], [59], [60] or specific symptom-related datasets [61]-[63]. Equation (1) gives the calculation of the CF value [64].
Based on equation (1), certainty factors (CFs) take values from 0 (definitely untrue) to 1 (definitely true). The rule structure also allows conditional statements of CFs to be included. When a rule's premise is unclear owing to uncertain facts, and its conclusion is uncertain due to the rule's specification, the CF is calculated from both CFrule (RuleCF) and CFuser (PremiseCF). The final CF value, called CFcombine, determines the confidence level [32].
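Because equation (1) is not reproduced in this text, the calculation can only be illustrated under assumptions. The sketch below uses the textbook MYCIN form for non-negative CFs: each symptom's CF is the product of CFrule and CFuser, and CFs are merged pairwise into CFcombine. The function names and the sample values are hypothetical and are not taken from the thyroid dataset.

```python
from functools import reduce
from typing import Iterable, Tuple


def symptom_cf(cf_rule: float, cf_user: float) -> float:
    """CF of one symptom: the expert's weight scaled by the user's certainty."""
    return cf_rule * cf_user


def combine_cf(cf_old: float, cf_new: float) -> float:
    """Merge two CFs in [0, 1] (assumed MYCIN rule for non-negative values)."""
    return cf_old + cf_new * (1.0 - cf_old)


def cf_combine(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fold all (CFrule, CFuser) pairs of one disease into a final CF value."""
    cfs = [symptom_cf(rule, user) for rule, user in pairs]
    return reduce(combine_cf, cfs, 0.0)


# Hypothetical (CFrule, CFuser) pairs for three symptoms of one disease.
example_pairs = [(0.8, 0.6), (0.6, 0.8), (0.4, 1.0)]
final_cf = cf_combine(example_pairs)
print(f"CFcombine = {final_cf:.5f}")
```

With this rule the combined value never exceeds 1 as long as every input lies between 0 and 1, which matches the range described above.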
In the era of artificial intelligence (AI), ES have been developed and integrated with machine learning algorithms (MLAs) [65], [66]. One class of MLAs is supervised learning (SL), which relies on label information [67] on which the algorithms are trained [66]. A common SL analysis is linear regression, which is easy to use and follows a standard statistical procedure [68]. SL addresses uncertainty [69], which is commonly the central issue of ES, including in medical problems [70]. SL methods work on the premise of constructing theories from existing dataset instances to predict future data instances. A set of labelled cases is fed into an SL algorithm, which builds a model to categorise or predict future occurrences [71].
For regression analysis in particular, research generally focuses on choosing the best regression model. As intelligent tools that help experts solve problems, ES must be evaluated for performance, including prediction performance. Evaluating performance is required to predict future instances [71] based on knowledge-based dataset representations. In the medical context, the performance of an ES must be investigated, and regression analysis is one of the best ways to do so [71], [73].
ES have been developed to diagnose thyroid disorders [23] and several psychological signs in thyroid patients [5], and many ES have been implemented for thyroid cases [74]. As the gold standard, thyroid diagnosis requires laboratory test results [73]. Moreover, each patient has a different physiological condition, and Hariadha et al. [23] reported that thyroid symptoms and clinical signs may differ from person to person.
In conclusion, and for the purpose of future prediction, the prediction performance of the existing ES must be evaluated. We propose two regression analyses, MLR and MPR, to obtain the best prediction model. This study therefore proposes two hypotheses: 1. H1: the MLR regression model can be used as the best prediction model (if p-value < 0.05, one-tailed); 2. H2: the MPR regression model can be used as the best prediction model (if p-value < 0.05, one-tailed). Both hypotheses may be accepted; we then select the best model based on R-squared, adjusted R-squared, F-value, and p-value.

Data Collection, Acquisition, and Analysis
Figure 1 shows the flowchart of this study. With permission, we adopted the thyroid dataset from previous studies [5], [23] to investigate the role of input features, such as the certainty factor (CF) value based on symptom collection and the confidence level (CL) in decimals, in predicting the likelihood of a particular disease from thyroid disease information, and to model the relationship between the input features and the target variables. We investigated the input features using multiple linear regression (MLR) and multiple polynomial regression (MPR) analysis, both supervised machine learning algorithms used for regression tasks [22], [68], [75]. MLR used CFrule (RuleCF) and CFuser (PremiseCF) as predictor variables (Xn) and the confidence level (CL) as the response variable (Y1), while in MPR the predictor variables were not involved with each other for the CL variable (Y2). The aim was to find the regression equation that best predicts the CL from the CFrule and CFuser variables, which is essential for predicting future CF values, especially for new patients, based on this dataset. We used R Studio [76] to analyse the MLR problem, shown in equation (2), and the MPR problem, shown in equation (3) [77].
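Equations (2) and (3) are not reproduced in this text. For orientation, the general forms of the two models can be written as below, with X_1 = CFrule and X_2 = CFuser. The first line is the standard MLR form. The second line is an assumption: a second-degree polynomial without an interaction term, chosen because the predictors are described as not being involved with each other. The coefficients of the two models are unrelated even though the same symbols are reused.

```latex
\begin{align}
Y_1 &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \\
Y_2 &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^{2} + \beta_4 X_2^{2} + \varepsilon
\end{align}
```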

Expert Interview
The symptom collection was adopted from a medical doctor (Aviasenna Andriand, MD) and supported by medical references, including the New Castle Index (NCI), Wayne Index (WI), and Billewicz Index (BI). Following the previous study by Hariadha et al. [23], as improved by Al-Hakim et al. [5], this symptom dataset was also used for the MYCIN rule-based algorithm calculation. The rule-based algorithm was the basis for calculating the certainty factor (CF) value, and it adopted knowledge from the expert (the MD, represented as RuleCF or CFrule) and the user (the diagnosed patient, represented as PremiseCF or CFuser). In total, 64 symptoms were collected. Each symptom was identified by a code and assigned a CF between 0 and 1 (as a decimal), and these values formed the dataset for this study.

Data Modelling and Evaluation
The CF values for the 64 symptoms are categorised by thyroid disease information (label). These parameters (CFrule and CFuser), used as predictor variables, were also used to calculate the percentage confidence level (CL, used as the response variable) for the respective thyroid disease label. According to the previous studies by Hariadha et al. [23] and Al-Hakim et al. [5], this dataset is used to determine the diagnostic inference, with forward chaining as the inference engine.
Hariadha et al. [23] and Al-Hakim et al. [5] categorised four thyroid diseases: hypothyroidism, hyperthyroidism, goitre, and thyroid nodule. These studies also reported the percentage confidence level (CL) for each disease. Uncertain quantities, including thyroid symptom predictions, should however be predicted as continuous values. Those studies [5], [23] did not address this issue, so we used their dataset to predict continuous values from a set of input features and to make predictions on new data. The results analysed in this paper comprise a statistical analysis of multiple linear regression and a statistical analysis of multiple polynomial regression.
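Predicting a continuous value from CFrule and CFuser amounts to an ordinary least-squares fit. The following is a minimal sketch in plain Python, not the R Studio workflow used in this study. The helper names, the synthetic data, and the second-degree MPR design without an interaction term are all assumptions made for illustration.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows: Sequence[Sequence[float]],
                      y: Sequence[float]) -> List[float]:
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def mlr_row(cf_rule: float, cf_user: float) -> List[float]:
    """Design row for MLR: intercept, CFrule, CFuser."""
    return [1.0, cf_rule, cf_user]


def mpr_row(cf_rule: float, cf_user: float) -> List[float]:
    """Design row for the assumed MPR: adds squared terms, no interaction."""
    return [1.0, cf_rule, cf_user, cf_rule ** 2, cf_user ** 2]


# Synthetic (made-up) data: a 5 x 5 grid of CF values with a known quadratic
# response, used only to show that the fit recovers the coefficients.
grid = [0.2, 0.4, 0.6, 0.8, 1.0]
true_coeffs = [0.5, 0.1, 0.2, -0.05, 0.3]
samples = [(r, u) for r in grid for u in grid]
rows = [mpr_row(r, u) for r, u in samples]
targets = [sum(c * x for c, x in zip(true_coeffs, row)) for row in rows]
coeffs = fit_least_squares(rows, targets)
```

Swapping `mpr_row` for `mlr_row` fits the linear model to the same data, so the two designs can be compared on equal terms.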

Statistical Analysis of Multiple Linear Regression
The MLR analysis gives an F-value of 3.394 and a p-value of 0.04 < 0.05 (one-tailed), so the model is significant (see Table 1 for the statistical report). The regression model for this analysis is shown in equation (4). The constant of 0.7439 is the value of the CL variable (Y1) when the CFrule and CFuser predictors do not contribute, and it is positive. The regression coefficient of CFrule is 0.16, meaning that if CFrule increases while the other predictor (CFuser) is held constant, the CL variable (Y1) also increases. Likewise, the regression coefficient of CFuser is 0.133, meaning that if CFuser increases while the other predictor (CFrule) is held constant, the CL variable (Y1) also increases.
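Equation (4) is not reproduced in this text, but the constant and the two coefficients reported above are enough to sketch the fitted MLR model. The function name and the input check below are hypothetical additions for illustration only.

```python
MLR_INTERCEPT = 0.7439   # constant reported for the MLR model
MLR_COEF_CF_RULE = 0.16  # regression coefficient reported for CFrule
MLR_COEF_CF_USER = 0.133 # regression coefficient reported for CFuser


def predict_cl_mlr(cf_rule: float, cf_user: float) -> float:
    """Predicted confidence level (Y1) from the reported MLR coefficients."""
    for name, value in (("cf_rule", cf_rule), ("cf_user", cf_user)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    return (MLR_INTERCEPT
            + MLR_COEF_CF_RULE * cf_rule
            + MLR_COEF_CF_USER * cf_user)


# Holding CFuser fixed, a higher CFrule gives a higher predicted CL.
low = predict_cl_mlr(0.2, 0.5)
high = predict_cl_mlr(0.8, 0.5)
print(f"CL at CFrule=0.2: {low:.4f}, at CFrule=0.8: {high:.4f}")
```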
The adjusted R-squared corrects the R-squared (coefficient of determination) for its weakness and indicates how much of the variability in the response variable (Y1, the CL in the MLR analysis) is explained by the predictors (CFrule and CFuser). The adjusted R-squared is 0.071, which means that CFrule and CFuser contribute 7.1% to the CL value. The combination of CF variables therefore affects the CL variable only slightly, and a value of this size is generally considered no effect or a very weak effect size [78]. The remaining 92.9% is attributable to other factors not examined in this study; we assume it arises because each symptom contributes to the final CF calculation (CFcombine) [22], that is, the certainty factor method must combine CFrule with CFuser to calculate CFcombine, the final CF value [22]. The linear regression model proposed here is related to the research by Al-Hakim and Andriand [79] on leptospirosis cases, which used only a single linear regression model. In this study, the multiple linear regression was modelled successfully, so the model could be implemented for disease prediction.
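For reference, the standard definition of the adjusted R-squared is given below, where n is the number of observations and p is the number of predictors. The symbols n and p are introduced here for illustration and are not stated in this text.

```latex
\[
R^{2}_{\text{adj}} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1}
\]
```

The correction factor grows with p, so adding predictors that explain little lowers the adjusted value even when R-squared itself rises.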

Statistical Analysis of Multiple Polynomial Regression
The MPR analysis gives an F-value of 265.925 and a p-value of 0.00 < 0.05 (one-tailed), so the model is significant (see Table 2 for the statistical report). The regression model for this analysis is shown in equation (5). All regression coefficients are significant (p-value < 0.05, one-tailed), the coefficient of determination (R-squared) is 94.7%, and the adjusted R-squared is 94.4%. According to Moore et al. [78], an R-squared value above 0.7 is considered a large effect size. The MPR model is better than the MLR model, so the best regression model in this study is the proposed MPR regression model.
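The reported statistics can be cross-checked with the standard relations between R-squared, adjusted R-squared, and the overall F-value. The sketch below rests on assumptions that this text does not state: 64 observations (one per symptom), 2 predictor terms for MLR, and 4 predictor terms for MPR. All function and variable names are hypothetical.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictor terms."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def r2_from_adjusted(adj_r2: float, n: int, p: int) -> float:
    """Invert the adjustment to recover the plain R-squared."""
    return 1.0 - (1.0 - adj_r2) * (n - p - 1) / (n - 1)


def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-value of a regression with an intercept."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


N_OBS = 64    # assumption: one observation per symptom
P_MLR = 2     # CFrule and CFuser
P_MPR = 4     # assumption: linear and squared terms, no interaction

# MPR: start from the reported R-squared of 0.947.
mpr_adj = adjusted_r2(0.947, N_OBS, P_MPR)
mpr_f = f_statistic(0.947, N_OBS, P_MPR)

# MLR: only the adjusted value (0.071) is reported, so recover R-squared first.
mlr_r2 = r2_from_adjusted(0.071, N_OBS, P_MLR)
mlr_f = f_statistic(mlr_r2, N_OBS, P_MLR)

print(f"MPR: adjusted R2 = {mpr_adj:.3f}, F = {mpr_f:.1f}")
print(f"MLR: R2 = {mlr_r2:.3f}, F = {mlr_f:.2f}")
```

Under these assumptions the sketch gives an adjusted R-squared of about 0.943 and an F-value of about 264 for MPR, and an F-value of about 3.4 for MLR. These are close to the reported 0.944, 265.925, and 3.394, and the small gaps are consistent with rounding in the reported inputs.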
The MPR regression model (equation 5) shows that the predictors (CFrule and CFuser) are not involved with each other, which means that the symptoms reported by the patient (the source of CFuser) and the symptoms confirmed by the doctor (the source of CFrule) are not related. This matters because each patient has a different physiological condition, so the symptoms a patient experiences do not always lead to certainty in the doctor's anamnesis results. When this model (equation 5) is applied to new patients with new symptoms, it will better predict possible thyroid disorders in the future.

Conclusion and Recommendation
Based on our analysis of the two regression models, both MLR (multiple linear regression) and MPR (multiple polynomial regression) are acceptable as prediction models (both are significant, with p-value < 0.05). However, the MPR regression model provides better predictive capability for new patient cases in the future. It can be a foundation for expert system (ES) development that focuses more on machine learning (ML) analysis approaches than on traditional rule-based expert systems.
However, our analysis does not establish whether the predictions of the existing expert system are correct. It only uses a statistical regression approach to test which regression model is better for future predictions. With this approach, supervised learning analysis can be an alternative in the study of expert systems, and we hope it can be a foundation for further research on improving their performance.
Selecting the best regression model involves determining the variables to be included in the regression [68]. The purpose of selecting the best regression model is usually forecasting or prediction [72]. Types of regression analysis, including multiple linear regression (MLR) and multiple polynomial regression (MPR), are commonly used to model predictions from existing datasets. MLR involves each predictor and the response variable, whereas MPR ignores the involvement of predictor and response variables [68].