Autism Spectrum Disorder (ASD) Identification Using Feature-Based Machine Learning Classification Model

Autism Spectrum Disorder (ASD) is a developmental disorder that impairs the development of behavior, communication, and learning ability. Early detection of ASD helps patients obtain better training for communicating and interacting with others. In this study, we identified ASD and non-ASD individuals using a machine learning (ML) approach. We used K-Nearest Neighbor (KNN), Random Forest (RF), Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) with a linear basis function, and Decision Tree (DT). We preprocessed the data using imputation methods, namely linear regression, Mice forest, and Missforest. We selected the important features from all 21 features using the Simultaneous perturbation feature selection and ranking (spFSR) technique.


INTRODUCTION
Autism spectrum disorders (ASD) are a group of conditions characterized by some degree of difficulty with social interaction and communication [1]. These difficulties include unusual patterns of activity and behavior, such as difficulty switching between activities, difficulty concentrating, and atypical responses to sensations. Although symptoms may first appear in early childhood, a diagnosis of autism is frequently established much later in life. Epilepsy, depression, anxiety, hyperactivity, attention deficit disorders, and other co-occurring disorders of the central nervous system, as well as challenging behaviors, including difficulty falling asleep and self-harm, are frequently present in children with autism [2]. The intellectual abilities of children with autism spectrum disorders vary widely, from severe impairment to higher levels of functioning [2].
Around one in 100 children globally has autism [3]. Before 2000, estimates ranged from 2-5 to 15-20 cases of autism per 1,000 live births, or 1-2 cases per 1,000 people worldwide [4]. According to ASA (Autism Society of America) statistics from 2000, 1 in 250 people were autistic [4]. However, according to data from the CDC (Centers for Disease Control and Prevention, USA), 1 in 150 residents had autism in 2001, and the figure was around 1 in 100 in various parts of the USA and the UK [4]. The CDC also recorded that 1 in 88 children had autism in 2012, rising by 30% to 1.50%, or 1 in 68, in 2014 [4]. The prevalence of ASD from 2000 to 2018 is shown in Table 1 [5]. ASD has been detected in around 1 in 44 children, according to estimates from the CDC's Autism and Developmental Disabilities Monitoring (ADDM) Network [6]. ASD has been reported in people of all racial, ethnic, economic, and social groups and is more than four times as common in boys as in girls [6].
The worldwide increase in the prevalence of ASD has prompted the need to compile data on behavioral traits. It is difficult to conduct a thorough investigation to improve the efficacy, sensitivity, specificity, and predictive accuracy of ASD screening. Few clinical or screening datasets on autism are currently available, and most of them are genetic in nature [7].
This study aimed to create an efficient ASD prediction approach based on a small set of crucial features by combining machine learning (ML) classification, imputation, and feature selection (FS) approaches. Recent ASD-related research has used various classification methods, but only some studies have focused on identifying critical features. In 2016, M. Duda et al. developed six ML algorithms with an average prediction accuracy of 95.6% [8]. In 2018, Heinsfeld et al. used deep learning methods and achieved an accuracy of 70% [9]. Also in 2018, Vaishali et al. used the binary firefly algorithm and achieved an accuracy of 92.12% [10]. In 2019, a Support Vector Machine (SVM) used by In-On Wiratsin et al. achieved a mean prediction accuracy of 90.8% [11]. SVM-RFE (Support Vector Machine Recursive Feature Elimination) was used in a 2019 study by C. Wang et al. that reported a prediction accuracy of 90.6% [12].
Numerous factors can result in missing values, such as respondents who did not wish to be questioned or could not be located, data not collected owing to officer errors, equipment malfunctions, and application malfunctions. In addition, missing values can appear as outliers that are discordant with the initial value [13], or as anomalous data entries. Missing values are associated with several issues, including inefficiency, difficulty in handling and interpreting data, and distortions between data containing missing values and complete data [13]. Therefore, additional processing with an imputation technique is required to address missing values.
Different algorithms and approaches to handling missing values can produce different estimates. This study therefore intends to enhance the predictive performance for ASD by incorporating ML-based imputation techniques and by emphasising the relevant features with FS techniques. Conda version 22.9.0 and Python version 3.9.12 were used for all analyses. Several Python packages, including scikit-learn, imbalanced-learn, miceforest, SciPy, and missingpy, were used to impute the data and to select the most critical features.

Data Collection
This study used three publicly accessible datasets from the machine learning repository at the University of California Irvine [14]. The datasets include data on routine health examinations from 1,100 participants, ages 4 to 64, gathered in 2017. Because the three datasets share the same data types and structure, they were merged into a single dataset to increase predictive power through a larger sample size.
To apply the imputation approach, entries marked with the sign ("?") were converted to ("NA") and then imputed using the three ML-based imputation methods proposed in this study [20]. We pre-processed the input by encoding the class target as integers (0, 1, and so on). Using a set of independent data points, it ascertains the probability of a particular occurrence, such as participating or not participating in a vote, and then reports that probability. We also used normalization techniques, such as Min-Max scaling for the age variable, to adjust the range of the data so that it falls between 0 and 3, using the highest and lowest values of each feature. Wrong data values were changed to "NA" and counted in the total number of missing values. We compared the imputation approaches based on Linear Regression (LR) [21][22], Mice Forest (MC) [23], and Missforest (MF) [24] to find the best one.
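The preprocessing steps described above can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: the toy records, the column names, and the `feature_range=(0, 1)` target range are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy records; column names and values are illustrative only.
raw = pd.DataFrame({
    "age": ["4", "17", "?", "64"],
    "A1_Score": ["1", "0", "1", "?"],
    "Class/ASD": ["YES", "NO", "NO", "YES"],
})

# Mark "?" entries as missing so that an imputer can recognise them later.
data = raw.replace("?", np.nan)
numeric_cols = ["age", "A1_Score"]
data[numeric_cols] = data[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Encode the class target as integers (NO -> 0, YES -> 1).
data["Class/ASD"] = LabelEncoder().fit_transform(data["Class/ASD"])

# Min-Max scale age; the target range is an assumed parameter.
# NaN entries are ignored when fitting and remain NaN afterwards.
scaler = MinMaxScaler(feature_range=(0, 1))
data["age"] = scaler.fit_transform(data[["age"]]).ravel()

# Total number of missing values left for the imputation step.
n_missing = int(data.isna().sum().sum())
```

Scaling before imputation works here because `MinMaxScaler` skips missing entries; the scaled column still contains the gaps that the imputation methods are meant to fill.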

Multiple Imputation Techniques
Several imputation techniques were used before classifying the data. We compared the performance of each imputation technique to obtain the best-imputed dataset to be integrated with the classifier.

2.1 Linear Regression
Regression is one of the most frequently employed statistical methods. A regression model describes the behavior of a random variable of interest. This variable could be the stock market value in the financial sector, the growth of a species, or the probability of detecting gravitational waves. It is the dependent variable and is denoted by "y" [21]. Consider the following multiple linear regression model:

$$Y = \mathbf{1}\theta_0 + X\theta + \epsilon$$

where $Y = [y_1, y_2, \ldots, y_n]^{\top}$ is an $n \times 1$ vector of responses, $\mathbf{1}$ is an $n \times 1$ vector of ones, $X = [x_1, x_2, \ldots, x_q]$ is an $n \times q$ non-stochastic design matrix, $\theta = [\theta_1, \theta_2, \ldots, \theta_q]^{\top}$ is a $q \times 1$ vector of unknown coefficients, and $\epsilon = [\epsilon_1, \epsilon_2, \ldots, \epsilon_n]^{\top}$ is a vector of independent and identically distributed error terms [21].
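As a rough illustration of how such a model can serve as an imputer, the sketch below fits the regression on complete rows and predicts the hidden entries. The synthetic data, the coefficient values, and the choice of scikit-learn's `LinearRegression` are assumptions made for the example; the study's exact procedure is not specified here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(999)

# Synthetic data: y depends linearly on two predictors plus small noise.
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hide the first 20 responses to mimic missing values.
missing = np.zeros(200, dtype=bool)
missing[:20] = True
y_obs = y.copy()
y_obs[missing] = np.nan

# Estimate theta_0 and theta on the complete rows only.
model = LinearRegression().fit(X[~missing], y_obs[~missing])

# Replace each missing response with the model's prediction.
y_imputed = y_obs.copy()
y_imputed[missing] = model.predict(X[missing])

theta0, theta = model.intercept_, model.coef_
max_abs_error = float(np.max(np.abs(y_imputed[missing] - y[missing])))
```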

2.2 Mice Forest
The MICE (Multivariate Imputation by Chained Equations) algorithm is likely one of the most widely used imputation algorithms and a standard interview topic. MICE first computes the mean of each column that contains missing values and uses that mean as a temporary substitute [25]. It then runs a sequence of regression models (chained equations) to impute each missing value in turn [25].
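The two stages described above (mean initialisation followed by chained regressions) can be sketched with scikit-learn's `IterativeImputer`. This is a stand-in chosen for illustration and is not the miceforest package; the synthetic data and the 10% missing rate are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(999)

# Synthetic correlated columns with about 10% of entries removed at random.
base = rng.normal(size=(300, 1))
X_full = np.hstack([base + rng.normal(scale=0.3, size=(300, 1)) for _ in range(4)])
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.10
X_miss[mask] = np.nan

# initial_strategy="mean" fills each gap with its column mean first;
# every column is then regressed on the others for up to max_iter rounds.
imputer = IterativeImputer(initial_strategy="mean", max_iter=10, random_state=999)
X_imputed = imputer.fit_transform(X_miss)

# Compare against plain column-mean imputation on the hidden entries.
rmse_chained = float(np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2)))
col_means = np.nanmean(X_miss, axis=0)
mean_fill = np.take(col_means, np.where(mask)[1])
rmse_mean = float(np.sqrt(np.mean((mean_fill - X_full[mask]) ** 2)))
```

On correlated data like this, the chained regressions should recover the hidden values more closely than the initial mean fill, which is the motivation for the second stage.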

2.3 Missforest
The Missforest method uses random forests to impute phenomics data. Missforest trains a random forest (RF) on the observed values of each variable, predicts the missing values, and repeats this iteratively until a stopping criterion is met. In addition, it can be run in parallel to save computation time, and it provides an OOB (out-of-bag) estimate of the imputation error for the continuous and categorical parts of the imputed dataset. OOB is a method for estimating the prediction error of a random forest. The quality of this estimate is assessed by comparing the absolute difference between the true imputation error (error_true) and the OOB imputation error (error_OOB) across all simulation runs [26].
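A MissForest-style loop can be approximated by plugging a random forest into scikit-learn's `IterativeImputer`. This is only an approximation under stated assumptions: it is not the missingpy implementation, it handles continuous columns only, and the data, forest size, and iteration count are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(999)

# Synthetic data with nonlinear relations between columns.
x0 = rng.uniform(-2, 2, size=200)
X_full = np.column_stack([
    x0,
    x0 ** 2 + rng.normal(scale=0.1, size=200),
    np.sin(x0),
])
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.10
X_miss[mask] = np.nan

# Each round trains a forest per column on the observed values and
# predicts that column's missing entries; n_jobs=-1 grows trees in parallel.
forest = RandomForestRegressor(n_estimators=30, n_jobs=-1, random_state=999)
imputer = IterativeImputer(estimator=forest, max_iter=5, random_state=999)
X_imputed = imputer.fit_transform(X_miss)

rmse = float(np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2)))
```

Because the true values are known in this simulation, `rmse` plays the role of the true imputation error; an OOB estimate would be needed when no ground truth is available.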

3 Feature Selection
Feature extraction and feature selection are two frequently used methods for decreasing data dimensionality. Feature extraction makes new features by mapping the original features onto a new (lower-dimensional) space [27]. We used several feature selection techniques that represented each approach and were widely used in ML studies [28].

3.1 Feature Selection F-Score
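Feature selection is required for a classifier that may use many observational variables to choose a relatively limited subset of variables, decrease computation requirements, and enhance algorithm performance [29]. The F-score formula is given in equation (1):

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_{+} - 1} \sum_{k=1}^{n_{+}} \left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_{-} - 1} \sum_{k=1}^{n_{-}} \left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \tag{1}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the $i$-th feature across the whole, positive, and negative datasets, $x_{k,i}^{(+)}$ is the $i$-th feature of the $k$-th positive instance, $x_{k,i}^{(-)}$ is the $i$-th feature of the $k$-th negative instance, and $n_{+}$ and $n_{-}$ are the numbers of positive and negative instances.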
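A minimal sketch of the F-score computation is given below, assuming binary labels coded as 1 (positive) and 0 (negative). The helper name `f_score` and the four-row toy matrix are hypothetical.

```python
import numpy as np

def f_score(X, y):
    """Return the F-score of every column of X for binary labels y (1 = positive)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    # Numerator: squared distance of each class mean from the overall mean.
    between = (pos.mean(axis=0) - mean_all) ** 2 + (neg.mean(axis=0) - mean_all) ** 2
    # Denominator: sum of the within-class sample variances (ddof=1).
    within = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return between / within

# Column 0 separates the classes well; column 1 barely does.
X_demo = np.array([[0.0, 5.0], [1.0, 3.0], [10.0, 4.0], [11.0, 6.0]])
y_demo = np.array([0, 0, 1, 1])
scores = f_score(X_demo, y_demo)  # -> [50.0, 0.125]
```

A larger score means the class means lie further apart relative to the spread within each class, so the first column would be ranked ahead of the second.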

3.2 Mutual Information (MI)
MI quantifies the amount of information shared between variables, where the uncertainty of a random variable is measured using information entropy [30]. Entropy can be represented as shown in equation (2) [31]:

$$H(X) = -\int p(x) \log p(x)\, dx \tag{2}$$

where $p(x)$ is the marginal probability density. The mutual dependence between $X$ and $Y$ can be measured by mutual information (MI), which is shown in equations (3) and (4):

$$I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy \tag{3}$$

$$I(X; Y) = H(Y) - H(Y \mid X) \tag{4}$$

where $p(x, y)$ is the joint probability density and $H(Y \mid X)$ is the conditional entropy of $Y$ when $X$ is known, which is computed as shown in equation (5):

$$H(Y \mid X) = -\iint p(x, y) \log p(y \mid x)\, dx\, dy \tag{5}$$
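For discrete variables, the integrals become sums, and mutual information can be checked numerically as entropy minus conditional entropy. The sketch below uses this discrete analogue with two hypothetical binary variables and compares the result with scikit-learn's `mutual_info_score`, which also reports MI in nats.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    """Shannon entropy (in nats) of a discrete sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def conditional_entropy(y, x):
    """H(Y|X) as the sum over values v of p(X = v) * H(Y | X = v)."""
    y, x = np.asarray(y), np.asarray(x)
    return float(sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x)))

# Hypothetical binary feature x and binary class label y.
x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# MI as the reduction in uncertainty about y once x is known.
mi_from_entropy = entropy(y) - conditional_entropy(y, x)
mi_sklearn = mutual_info_score(x, y)
```

A feature with higher MI removes more uncertainty about the class label, which is why MI can be used to rank features.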

3.3 Random Forest Importance (RFI)
Random Forests (RF) produce many distinct decision trees during the training phase. The final prediction combines the predictions of all trees: the mode of the predicted classes for classification, or the average prediction for regression. They are called ensemble methods because they rely on a set of outcomes to reach a conclusion. The probability of a node is computed by dividing the number of samples that reach the node by the total number of samples. The greater the importance value, the greater the importance of the feature [32], as shown in equation (6):

$$ni_t = \frac{N_t}{N} \left( \text{Impurity}_t - \frac{N_{t(right)}}{N_t}\, \text{Impurity}_{right} - \frac{N_{t(left)}}{N_t}\, \text{Impurity}_{left} \right) \tag{6}$$

where $N$ is the total number of rows in the data, $N_t$ is the number of rows in that specific node, $N_{t(right)}$ is the number of rows in the right child node, $N_{t(left)}$ is the number of rows in the left child node, and Impurity is the Gini index value.
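The node-level computation can be checked against scikit-learn on a single decision tree, as sketched below. The synthetic data and tree depth are assumptions. In a forest, the per-tree importances obtained this way are averaged over all trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=999)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              random_state=999).fit(X, y)
t = tree.tree_

n_total = t.weighted_n_node_samples[0]  # N: rows reaching the root
importance = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no contribution
        continue
    n_t = t.weighted_n_node_samples[node]
    # Weighted impurity decrease produced by the split at this node.
    decrease = (n_t / n_total) * (
        t.impurity[node]
        - t.weighted_n_node_samples[right] / n_t * t.impurity[right]
        - t.weighted_n_node_samples[left] / n_t * t.impurity[left]
    )
    importance[t.feature[node]] += decrease

# Normalise so that the importances sum to one.
importance /= importance.sum()
```

After normalisation, the manual values should match `tree.feature_importances_`.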

3.4 Simultaneous perturbation feature selection and ranking (spFSR)
SpFSR is an FS and ranking technique that extends a general-purpose stochastic optimization algorithm. SpFSR begins with an initial solution $\hat{w}_0$ and uses the following recursion to find a local minimum $\hat{w}^{*}$, as shown in equation (7):

$$\hat{w}_{k+1} := \hat{w}_k - a_k\, \hat{g}_k(\hat{w}_k) \tag{7}$$

where $a_k$ is the iteration gain sequence, with $a_k \geq 0$, and $\hat{g}_k(\hat{w}_k)$ is the gradient estimate at iteration $k$.
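The recursion can be illustrated with a generic simultaneous perturbation stochastic approximation (SPSA) loop. This is a toy sketch rather than the spFSR package: the quadratic loss stands in for the cross-validated classification error that a feature-selection wrapper would minimise, and the gain constants and the function name `spsa_minimize` are assumptions.

```python
import numpy as np

def spsa_minimize(loss, w0, n_iter=200, a=0.2, c=0.1, seed=999):
    """Minimise loss(w) using simultaneous-perturbation gradient estimates."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for k in range(n_iter):
        a_k = a / (k + 1) ** 0.602  # gain sequence, a_k >= 0
        c_k = c / (k + 1) ** 0.101  # perturbation size
        delta = rng.choice([-1.0, 1.0], size=w.shape)  # random +/-1 directions
        # Two loss evaluations yield an estimate of the whole gradient vector.
        g_hat = (loss(w + c_k * delta) - loss(w - c_k * delta)) / (2 * c_k * delta)
        w = w - a_k * g_hat  # w_{k+1} = w_k - a_k * g_hat_k(w_k)
    return w

# Hypothetical target: weights near 1 mark features to keep, near 0 to drop.
target = np.array([1.0, 0.0, 1.0, 0.0])
w_final = spsa_minimize(lambda w: float(np.sum((w - target) ** 2)), np.full(4, 0.5))
```

The appeal of this scheme for feature selection is that each iteration needs only two loss evaluations regardless of the number of features.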

Techniques for machine learning classification

2.4.1 Logistic regression
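Regression methods have become an integral part of any data analysis that describes the relationship between a response variable and one or more explanatory variables. The outcome variable is often discrete, taking two or more possible values. The most popular regression model for analyzing such data is the logistic regression model [33]. If $x = (x_1, x_2, \ldots, x_p)$ is a $1 \times p$ vector of independent variables, the relationship between the outcome variable $y$ and $x$ can be characterized by the conditional probability [34], as shown in equation (8):

$$P(y = 1 \mid x) = \pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} \tag{8}$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients.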
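The conditional probability of a logistic regression model can be evaluated directly, as in the short sketch below. The intercept `beta0`, the coefficient vector `beta`, and the input values are hypothetical and chosen only to show the shape of the function.

```python
import numpy as np

def logistic_probability(x, beta0, beta):
    """P(y = 1 | x) for a logistic model with intercept beta0 and slopes beta."""
    z = beta0 + np.dot(x, beta)
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([1.5, -2.0])
p_zero = logistic_probability(np.array([0.0, 0.0]), beta0=0.0, beta=beta)
p_high = logistic_probability(np.array([4.0, 0.0]), beta0=0.0, beta=beta)
```

With a zero linear predictor the probability is exactly one half, and it approaches one as the linear predictor grows.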

Random Forest
Random forests are supervised ensemble learners based on regression trees, a nonparametric model that learns variable interactions through recursive partitioning [35]. Random forests are particularly well suited to detecting relationships in high-dimensional nonlinear problems [36]. Their primary use has been classification or regression; they have only recently been used as time-series predictors [37]. CART is the primary approach for creating regression trees [35], and it uses the following formulation. To minimise the prediction error on the output space, z is used to partition the input space X into K regions Mk and to assign an output value YMk to each region. If the sum of squared errors is used as the prediction error, the optimal output predictor ḟ for a new input observation x(t) is as shown in equation (9).

Naïve Bayes (NB)
Knowing that the marginal density ratio is the best univariate classifier, we enhance this ratio by fusing the prior probability and the computed boundary, as shown in equation (10).

Support Vector Machine (SVM)
The SVM is a margin-based classification tool in which an optimal hyperplane separates the classes as effectively as possible while lowering structural risk. This gives SVM a robust ability to generalize and resistance to overfitting. In addition, SVM can handle nonlinear classification problems by using kernel functions to map the original feature space into a higher-dimensional feature space in which the cases are linearly separable. Furthermore, SVM can perform novelty detection [39] [38].
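The effect of a kernel function can be seen on a dataset that no straight line can separate. The sketch below compares a linear and an RBF kernel with scikit-learn's `SVC`; the ring-shaped synthetic data and the split sizes are assumptions made for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=999)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=999)

# A linear kernel looks for a separating hyperplane in the original space.
acc_linear = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# An RBF kernel implicitly maps the data to a higher-dimensional space.
acc_rbf = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
```

When the classes are already close to linearly separable, a linear kernel is usually sufficient and cheaper to fit.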

DT: Decision Tree
DT is a product of the Machine Learning (ML) community. Because multivariate statistics is a broad area of study spanning machine learning, computer science, bioinformatics, artificial intelligence, and some chemometrics, machine learning notation and terminology frequently differ from those commonly found in the chemometrics literature. To aid the reader, the following glossary defines a few terms. This syntax is consistent with what is typically found in the machine learning literature [40].

K-Nearest Neighbor (KNN)
The goal of the KNN classifier is to assign unlabeled observations to the class of the most similar labelled samples. Observation characteristics are collected for both the training and test datasets [41]. The KNN algorithm classifies each test observation by comparing it with the observations in the training dataset. Because the true classes of the observations in the test dataset are known, they can be used to assess the effectiveness of the KNN model. One of the most commonly used measures is the average accuracy, as shown in equation (11):

$$\text{Average accuracy} = \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i} \tag{11}$$

where TP, TN, FP, and FN stand for the true positives, true negatives, false positives, and false negatives, respectively. The subscript $i$ indicates the category, and $l$ is the total number of categories [41].
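Average accuracy can be computed from a confusion matrix by treating each class in turn as the positive class. The sketch below uses a hypothetical three-class example; the helper name `average_accuracy` is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def average_accuracy(y_true, y_pred):
    """Mean over classes of (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i)."""
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp  # actually class i but predicted as another class
    tn = total - tp - fp - fn
    return float(np.mean((tp + tn) / total))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
avg_acc = average_accuracy(y_true, y_pred)  # -> 14/18, about 0.778
```

For a two-class problem such as ASD versus non-ASD, this quantity reduces to the ordinary accuracy.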

Evaluation
By comparing the results of each method, we can determine which one gives the best accuracy, computed as shown in equation (12):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

We used a stratified 10-fold cross-validation technique (max inter = 10) with three rounds to evaluate performance, which reduces variability while maintaining computation speed. To enable future replication and independent confirmation of our results, the random state was set to 999 (random state = 999). All feature selection techniques were fitted and evaluated on the same data partitions, and the random state was kept fixed throughout all cross-validation procedures. This means that our experiments were paired, with far less variability than if each had been run on its own partitions. Because cross-validation is a random procedure, statistical tests are necessary to determine whether a performance difference between two FS or ML techniques is statistically significant or merely the result of sampling variation. We performed paired t-tests on the data before and after imputation to see whether there were statistically significant differences between ML-FS methodologies and feature-based ML approaches. Using the "stats.ttest" function from the "Scipy" Python library, we ran a paired t-test and then analyzed the p-values. A p-value of less than 0.05 indicates that the difference is statistically significant at a 95% confidence level.
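The evaluation protocol described above can be sketched as follows. The synthetic data and the choice of the two classifiers are assumptions, and `scipy.stats.ttest_rel` is used here as the paired test; the exact SciPy function used in the study is not confirmed by the text.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=999)

# Stratified 10-fold cross-validation repeated three times with a fixed seed,
# so both models are scored on exactly the same 30 train/test partitions.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=999)
acc_svm = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="accuracy")
acc_nb = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

# Paired t-test on the matched fold accuracies.
t_stat, p_value = stats.ttest_rel(acc_svm, acc_nb)
significant = bool(p_value < 0.05)
```

Pairing matters because fold difficulty varies; comparing the two models fold by fold removes that shared source of variation from the test.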

Results and Analysis
Using all the features (21 features), Table 2 compares the performance of the ML techniques utilized in this work. It is evident in Table 2 that all the imputation methods have slightly higher accuracies in predicting ASD compared to those using the unimputed data. All the imputation methods performed almost similarly, with Linear Regression (LR) being the best.

Machine learning classification technique performance using all features
In our test on the dataset imputed with the LR method, some irrelevant columns were removed, namely result, age_desc, ethnicity, country_of_res, used_app_before, relation, jaundice, and autism. This left 13 features in total. Table 3 evaluates how well the ML algorithms used in this study worked with all available features. SVM performed better than the other ML algorithms, with LR coming in second. To establish whether the differences in accuracy were statistically significant and did not occur by chance, we performed several paired t-tests. All P-values were below 0.05; the largest, 0.001, was obtained when comparing the performance of SVM and LR. Thus, SVM is the best method, with 100% accuracy in predicting ASD.

Support vector machine classification technique performance with various number of features
We used the SVM method as the wrapper classifier in the ML-FS framework to find the smallest number of features that achieves the same result as the full-feature prediction. We began with four features and increased the number until we reached the same accuracy as with all features, as can be seen in Table 4. To evaluate all the key performance indicators (accuracy, precision, recall, and F1-score), Table 5 presents the performance of the FS techniques using ten selected features. The outcomes showed that the 10-feature spFSR-SVM and the all-feature model outperformed the other FS techniques. They also show that the predictive ML performance using the full feature set and the ten key features attains the same highest accuracy of 100%. With ten features, we could reduce the number of features while still identifying the most critical ones, which was the aim of this research. The top features of the spFSR are shown in Figure 2, together with their importance scores, indicating that the A9_Score is the most crucial variable in predicting ASD.
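The search for the smallest sufficient feature count can be sketched as a simple sweep. This illustration swaps in scikit-learn's univariate `SelectKBest` filter for spFSR and uses synthetic data with 13 features, so it shows the procedure and does not reproduce the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           n_redundant=0, random_state=999)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=999)

# Reference accuracy with all features.
full_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=cv).mean()

# Start at four features and grow the subset one feature at a time.
accuracy_by_k = {}
for k in range(4, X.shape[1] + 1):
    model = make_pipeline(SelectKBest(f_classif, k=k), SVC(kernel="linear"))
    accuracy_by_k[k] = cross_val_score(model, X, y, cv=cv).mean()

# Smallest subset size that matches the full-feature accuracy.
smallest_k = min(k for k, acc in accuracy_by_k.items() if acc >= full_acc - 1e-12)
```

Placing the selector inside the pipeline means it is refitted on the training part of every fold, so the held-out fold never influences which features are chosen.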

Discussion
Our study applied several FS approaches to a previously normalized ASD dataset. In diagnosing ASD, FS reduces the number of features, resulting in an accurate, efficient, and cost-effective prediction. The ASD prediction using SVM with the full feature set was equivalent to SVM with a subset of ten features, yielding the maximum predictive performance across all measures (100% F1-score, recall, and accuracy). Importantly, we achieved this outcome using only the ten features chosen by spFSR within the spFSR-SVM framework.
Our study yielded the highest accuracy of 100% in predicting ASD using the FS technique; this exceeds previous studies, which achieved accuracies between 70% and 95.6% [8][9][10][11][12]. However, the reported results cannot all be compared directly because of different datasets and validation methodologies. The previous study by M. Duda et al., using the same dataset with the complete feature set, achieved a highest accuracy of 95.6% using SVM [8]. By integrating the FS-ML approach, we can still achieve slightly higher accuracy using only about half of the features, which demonstrates the method's efficiency.
This research also provides an alternative ML classification method that uses fewer features and is therefore faster, because collecting a complete set of features requires more effort, time, cost, and computational complexity [42]. From the results of our study, the ten recommended features for predicting ASD are A6_Score, A7_Score, A8_Score, A1_Score, A3_Score, A2_Score, A10_Score, A4_Score, A5_Score, and A9_Score. These features are essential to the ten-feature method, especially the outstanding A9_Score feature, for which the response to the statement "Usually, I can tell what someone is feeling or thinking by looking at their face" carries enough weight to affect the predicted ASD status. The findings also indicate that emotional challenges, including the inability to discern another person's feelings, are common in people with autism. Although this trait is almost universally regarded as a component of autism, there is little scientific evidence to support that view.
Although this research has many advantages, it also has limitations. Since the data are not high-dimensional, we applied the FS method to the complete dataset and then tested it using a repeated cross-validation procedure on the same full dataset. With such a simple technique, this can result in overfitting. A combined split-train-test approach is suggested as a better strategy. The dataset can be split into training and test halves, and the most significant features can then be selected on the training data using cross-validation procedures. The performance of the selected features can then be re-evaluated on the test data using repeated cross-validation. Another way to check that the same high accuracy can still be attained is to replicate the procedure on different datasets.
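The suggested split-train-test strategy can be sketched as below. The synthetic data, the 50/50 split, and the `SelectKBest` filter (standing in for spFSR) are assumptions for illustration; the key point is that the test half plays no part in selecting features or fitting the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=13, n_informative=6,
                           n_redundant=0, random_state=999)

# Split once into training and test halves, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=999)

model = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))

# Cross-validation for feature selection and tuning uses the training half only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=999)
cv_accuracy = cross_val_score(model, X_train, y_train, cv=cv).mean()

# Final fit on the training half, then one evaluation on the untouched test half.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
selected = model.named_steps["selectkbest"].get_support(indices=True)
```

A large gap between `cv_accuracy` and `test_accuracy` would be a sign that the selected features overfit the training data.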

CONCLUSIONS
The computational complexity of disease diagnosis can be reduced by incorporating spFSR into the SVM approach. In this study, an accuracy of 100% was achieved using ten features, representing the highest performance. This study shows that ASD can be predicted accurately and reliably on the initial dataset using only about half of the features, which improves efficiency while highlighting the most important ones. Future studies could consider applying this approach to larger or different disease datasets.

Figure 2
Figure 2 displays the importance of the top ten spFSR features. The FS technique aims to reduce variables and adequately represent relevant and needed data. Integrating FS techniques into ML methods can aid in the efficient and low-cost prediction of ASD.
ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 17, No. 3, July 2023: 259-270

Table 2.
The accuracies of various machine learning classification methods on all feature data imputed using different imputation methods

Table 3.
Values of various feature selection algorithms using all features

Table 4.
Values of various feature selections with various numbers of features

Table 5.
Values of various feature selection techniques utilizing ten features