Data Analysis and Classification of Autism Spectrum Disorder Using Principal Component Analysis

Autism spectrum disorder (ASD) is an early developmental disorder characterized by alterations in socialization associated with a deficit in the visual perception of emotional expressions. An estimated one in more than 100 people has autism. Autism affects almost four times as many boys as girls. Data analysis and classification of ASD remain challenging because of unresolved issues arising from the many severity levels and the wide range of signs and symptoms. To understand the functions involved in autism, neuroscience techniques have been used to analyze the responses of autistic individuals to audio and video stimuli. This study focuses on analyzing datasets of adults and children with ASD using the principal component analysis (PCA) method. To this end, the proposed method consists of three main stages: (1) dataset preparation, (2) data analysis, and (3) unsupervised classification. Experiments were performed to classify adults and children with ASD. Classification in adults gives a sensitivity of 78.6% and a specificity of 82.47%, while classification in children gives a sensitivity of 87.5% and a specificity of 95.7%.


Introduction
Autism spectrum disorder (ASD) is a condition characterized by a persistent deficit in social communication and social interaction and by the presence of restrictive and repetitive behavior. It is an early developmental disorder characterized by alterations in socialization associated with a deficit in the visual perception of faces and emotional expressions. This deficit in the perception of faces and emotional expressions seems to be linked to the peculiarities of the gaze in autistic pathology [1]. The study of this behavioral disorder is carried out by measuring different ocular parameters (fixation time, distance and speed of exploration, ocular path) during the perception of neutral faces (with direct or deviant gaze) and emotional faces (expressing joy or sadness) [2].
Some symptoms of ASD typically appear after 2 years of age [3]; therefore, early diagnosis could offer a better opportunity for treatment and healing [4]. It is generally recognized that traditional clinical methods have difficulty distinguishing patients from healthy controls (HC) [5]. Therefore, data analysis and classification of ASD remain challenging because of unresolved issues arising from the many severity levels and the wide range of signs and symptoms. The tools commonly used for analyzing autism datasets are functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and, more recently, eye tracking. Eye tracking is a gaze monitoring system that groups together a set of techniques which make it possible to record ocular movements and to measure several parameters, such as the fixation time on the image and the number of fixations on an area of the image. The objective of an eye tracking system is to examine the perceptual characteristics of ASD and to facilitate study of the abnormal visual attention and oculomotor patterns that contribute to the clinical characteristics of ASD. Through detailed and objective measures of pupil and eye behavior, eye tracking systems are used to identify disorder-specific characteristics, enhance early identification, and inform treatment. In particular, researchers studying ASD have benefited from integrating eye tracking into their research paradigms.
Eye tracking has been widely applied in these studies to reveal the mechanisms underlying impaired task performance and abnormal brain functioning, essentially through the processing of social information. While older children and adults with ASD account for the majority of research in this area, eye tracking is also useful for studying young children with the disorder, as it offers an extensive tool for assessing and quantifying early emerging developmental abnormalities. Implementing eye tracking with children with ASD, however, is associated with a number of challenges, including problems with compliant behavior arising from the requested task and disorder-related psychosocial considerations [6]. The eye tracking implementation includes (1) eye tracking equipment, (2) testing environment and stimuli, (3) procedures and analysis, and (4) representative results. The recordings in step 1 were carried out using a gaze-monitoring system comprising a computer equipped with two analogue cameras, as illustrated in Figure 1. Following a projection of images representing neutral faces or deviated eyes, this system makes it possible to capture the directions, movements, and positions of the eyes during the projection and to superimpose them in order to calculate the temporal and statistical measurements in real time [7]. To produce meaningful results, an eye tracking device is a good implementation option for tasks that may involve a lack of perception, such as photographs or films that carry meaning. The setup used in this test shows the patient a human face (or a movie involving social interactions) on a screen and, at the same time, records the position and focus of the patient's gaze on the screen as data for analysis.
Classifying autism automatically over time is interesting in more ways than one. First, it makes it possible to follow the evolution of the pathology under drug or non-drug therapy. One can, for example, judge the progress of rehabilitation according to whether or not the position of the subject is close to the group of people without autism. The second point concerns the informative parameters that allowed this classification. The temporal follow-up and the connection of these parameters with neurophysiological information can certainly help in understanding the mechanisms at work in people with autism [8].
This study is based on the classification of data provided by the monitoring equipment into two groups (autistic and control). The main aspect is to implement Principal Component Analysis (PCA), which allows us to reduce the size of the representation space and to retain only the parameters that provide discriminating information. Two approaches are followed: the first concerns the development of classifiers based on statistical data already provided by the eye tracking system, and the second seeks a new descriptor using the eye trajectories. The second aspect of this study is directed towards searching for new parameters through trajectory analysis. Given the complexity of the dynamics underlying the time series or trajectories, it is natural to turn to tools from information theory or chaos theory. This assumption is realistic if we consider that the trajectory corresponds to the output of a nonlinear dynamic system (the brain) excited by an input (the visual stimulus). Therefore, the main contribution of using PCA is to decrease the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining the variation present in the dataset, by choosing a threshold to retain only those variables that express a significant difference.

Materials and Methods
This study aims to analyze and classify the dataset of ASD in adult and child patients. The framework of the proposed method is illustrated as a block diagram in Figure 2. It consists of three main stages: (1) dataset preparation, (2) data analysis, and (3) unsupervised classification, including data recovery and thresholding. In the first stage, the dataset used in this study and its characteristics are described. In the second stage, the mathematical foundations of Principal Component Analysis (PCA), a method for reducing the size of data, are presented. In the third stage, the unsupervised classification method is used to classify adults and children with ASD in two steps: data recovery and data thresholding.

Data Set Preparation.
The dataset used in this work consists of two groups, as presented in Table 1. The first group includes 30 adult patients with ASD (15 male, 15 female) and 36 adults without ASD (17 male, 19 female). The second group includes 14 child patients with ASD (9 male, 5 female) and 22 children without ASD (12 male, 10 female).
Figure 1: Monitoring system [7].
All datasets cover an age range of 4 to 60 years. Each dataset includes five fields (the gaze position x, y; distance; left diam; right diam; and time), where x and y were obtained from the trajectory recorded by the eye tracking system. The distance field represents the distance between the point (x, y) and the central point of the screen (384, 512). Left diam and right diam correspond to the left eye and right eye, respectively. The time field represents the start and end of the experiment. A preliminary study of the eye tracking trajectories of the patients, as seen in Figure 3, provided a rudimentary statistical analysis. Principal Component Analysis (PCA) provides interesting results on the statistical parameters studied, such as the time spent in a region of interest and the attachment time. Other studies, involving tools from Euclidean and non-Euclidean geometry, also show interesting results.
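The distance field described above can be illustrated with a short sketch. It assumes a Euclidean distance and assumes that the centre (384, 512) is given in the same (x, y) order as the gaze samples; the function name and the sample points are hypothetical and are not taken from the study's data.

```python
import math

# Screen centre as given in the text; the (x, y) axis order is an assumption.
SCREEN_CENTER = (384.0, 512.0)


def distance_to_center(x, y, center=SCREEN_CENTER):
    """Euclidean distance between a gaze point (x, y) and the screen centre."""
    return math.hypot(x - center[0], y - center[1])


# Hypothetical gaze samples (x, y), for illustration only.
samples = [(384.0, 512.0), (387.0, 516.0), (300.0, 400.0)]
distances = [distance_to_center(x, y) for x, y in samples]
```

If the recording system uses a different metric or a different axis convention, only `distance_to_center` would need to change.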

Data Analysis Using PCA Method.
Principal Component Analysis (PCA) is a method of extracting important variables (in the form of components) from a large set of variables available in a dataset. It extracts a low-dimensional set of features from a high-dimensional dataset with the aim of capturing as much information as possible. In addition, with fewer variables, visualization becomes much more meaningful. PCA is useful when dealing with data of three or more dimensions. It operates on a symmetric correlation or covariance matrix. An inherent problem in multivariate statistics is the difficulty of visualizing data that have many variables. In datasets with many variables, groups of variables often move together, because more than one variable might be measuring the same principle governing the system. An abundance of instrumentation makes it possible to measure scores of variables. When this happens, one can take advantage of the redundancy of information and simplify the problem by replacing a group of variables with a single new variable. The PCA method generates a new set of variables, called principal components [9]. Each principal component is a linear combination of the original variables. All principal components are perpendicular to each other, so there is no redundant information. The principal components as a whole form an orthogonal basis for the space of the data. The first principal component is a single axis in the data space. Projecting the observations onto this axis generates a new variable, and the variance of this variable is the maximum over all possible choices of the first axis. The second principal component is another axis in the data space, perpendicular to the first. Projecting the observations onto this axis generates another new variable. The variance of this variable is the maximum over all possible choices of the second axis.
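The properties listed above can be stated compactly. The following is the standard textbook formulation, written under an assumed notation that does not come from the source: X is the column-centered data matrix with m observations and n variables, C its sample covariance matrix, v_k the k-th principal axis, and z_k the new variable obtained by projecting the observations onto that axis.

```latex
% Assumed notation (not from the source): X is the column-centered
% m x n data matrix, with m observations and n variables.
\begin{align}
  C &= \frac{1}{m-1}\, X^{\top} X, \\
  C\, v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0, \\
  v_i^{\top} v_j &= \delta_{ij}, \\
  v_1 &= \arg\max_{\lVert v \rVert = 1} v^{\top} C\, v, \\
  z_k &= X v_k, \qquad \operatorname{Var}(z_k) = \lambda_k .
\end{align}
```

The third line expresses the perpendicularity of the components, the fourth the maximum-variance property of the first axis, and the last shows that the variance of each new variable equals the corresponding eigenvalue.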
In this study, the main purpose of PCA is to decrease the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining the variation present in the dataset. Thus, from the matrix M [m × n] of the data (m is the number of observations and n the number of parameters), we project the data onto a reduced-size basis to establish two groups. To do this, we began by reducing the variables of the matrix M by choosing a threshold to retain only those that express a significant difference [10,11]. PCA plays three roles: (1) study the linkage (correlation) between the variables; (2) project the observations onto new axes that result from linear combinations of the initial variables, reducing the dimension and obtaining new coordinates; (3) change to a new orthonormal basis that reflects the variances of the data.
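The projection onto a reduced basis can be sketched in a few lines. This is a minimal illustration using only the Python standard library, with power iteration and deflation standing in for a full eigendecomposition; it is not the implementation used in the study (the source mentions Matlab), and all function names and the toy data are assumptions.

```python
import math


def center_columns(rows):
    """Subtract each column's mean from a row-major m x n matrix."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    return [[r[j] - means[j] for j in range(n)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (n x n) of an already centered matrix."""
    m, n = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def leading_eigenpair(cov, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix.

    Power iteration; it assumes the start vector is not orthogonal to the
    leading eigenvector, which holds for the toy data below.
    """
    n = len(cov)
    start = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * cv[i] for i in range(n)), v


def principal_components(rows, k):
    """First k (variance, axis) pairs, found by power iteration and deflation."""
    cov = covariance(center_columns(rows))
    n = len(cov)
    components = []
    for _ in range(k):
        lam, v = leading_eigenpair(cov)
        components.append((lam, v))
        # Deflation: remove the variance already explained by this axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(n)]
               for i in range(n)]
    return components


def project(rows, components):
    """Coordinates of each observation on the retained principal axes."""
    centered = center_columns(rows)
    return [[sum(r[j] * v[j] for j in range(len(v))) for _, v in components]
            for r in centered]


# Toy data (hypothetical): four observations of two correlated variables.
data = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
components = principal_components(data, 2)
scores = project(data, components)
```

For real data with dozens of parameters, a library routine based on singular value decomposition would normally be preferred; the sketch only makes the centering, covariance, and projection steps explicit.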

Unsupervised Classification
Unsupervised classification is used when the number of classes is not known. There are two categories of unsupervised classification: hierarchical and nonhierarchical. In hierarchical classification (HC), the created subsets are nested hierarchically in one another. We distinguish the descending HC, which starts from the set of all individuals and breaks it into a certain number of subsets, each of which is in turn divided into subsets, and so on, from the ascending HC, which starts from the individuals, groups them into subsets, which are in turn grouped, and so on. In nonhierarchical classification, individuals are not structured hierarchically. If each individual belongs to only one subset, the result is called a partition. If each individual can belong to several groups, with probability P(i) of belonging to group i, then we speak of overlap [12]. In this study, the unsupervised classification consists of two main steps, as described in the following subsections.
Advances in Bioinformatics

First Step.
Suppose l = line and c = column in the dataset file. The standard deviation matrix of the datasets was used to reduce the size of the matrix by removing data according to their standard deviation, using a threshold. The number of patients who participated in the experiment is 45. The file "Photo.txt" contains 275 records, and the recorded data are organized according to the criteria used; Table 2 presents the time criteria with sample datasets for three ASD patients, and Table 3 presents the statistical measurement criteria with sample datasets for three ASD patients.
Each element in a data record corresponds to one of the criteria. For example, the first criterion in the file "crietria.txt" is Time span shown start (seconds), which represents the start time of the experiment, and so on.

Data Thresholding.
The main aim of data thresholding in the methodology is to transform the dataset file based on the standard deviation matrix in order to reduce the size of the matrix by using a threshold. Therefore, three main steps are implemented, including extracting the mean and the median value. Table 4 shows the measurements (stdr, stdm, median, and variance).

Third Step. It is used to reduce the matrix by using the standard deviation of the dataset values. Accordingly, the algorithm can be implemented in Matlab using repmat: B = repmat(A, n) returns an array containing n copies of A in the row and column dimensions. The size of B is size(A) * n when A is a matrix.

In order to evaluate these results, two metrics, sensitivity and specificity, are used. We recall that sensitivity is defined as

\[ \text{Sensitivity} = \frac{VP}{VP + FN}, \]

where VP indicates the true positives and FN the false negatives. The specificity is defined by

\[ \text{Specificity} = \frac{VN}{VN + FP}, \]

where VN indicates the true negatives and FP the false positives. Thus, the test performed to classify adult patients with autism gives a sensitivity of 78.6% and a specificity of 82.47%.
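Sensitivity and specificity are the standard ratios of true positives to all actual positives and of true negatives to all actual negatives. The sketch below computes them from confusion-matrix counts. The counts are hypothetical: the source does not report a confusion matrix, and the values were chosen only because they are consistent with the adult group sizes (28 autistic, 154 controls) and with the reported percentages.

```python
def sensitivity(vp, fn):
    """True positive rate: VP / (VP + FN)."""
    if vp + fn == 0:
        raise ValueError("no positive cases")
    return vp / (vp + fn)


def specificity(vn, fp):
    """True negative rate: VN / (VN + FP)."""
    if vn + fp == 0:
        raise ValueError("no negative cases")
    return vn / (vn + fp)


# Hypothetical counts, NOT reported in the source.
vp, fn = 22, 6      # 28 autistic records
vn, fp = 127, 27    # 154 control records

adult_sensitivity = round(100 * sensitivity(vp, fn), 2)
adult_specificity = round(100 * specificity(vn, fp), 2)
```

With these assumed counts the two ratios come out at roughly 78.6% and 82.47%, which matches the adult figures quoted in the text; other counts could not be ruled out without the original confusion matrix.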

Result of Classification in Adults.
For adult data, we have a matrix M of size [182, 73] (182 = 154 controls + 28 autistic). For a threshold equal to 10, only 15 parameters are retained, and the matrix becomes [182, 15]. The results of the manual classification are given in Figure 4. Two groups are formed.

(2) Data thresholding is applied to transform the dataset file based on the standard deviation matrix in order to reduce the size of the matrix by using a threshold. For adult classification, with a threshold equal to 10, only 15 parameters are retained, while for children classification, with a threshold equal to 10, only 19 parameters are retained. Finally, for both children and adults, the performance of the classifiers is good, with on average about 80% sensitivity and 90% specificity.
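The reduction from 73 to 15 (or 19) parameters by a threshold on the standard deviation can be sketched as a column filter. The source does not state the exact rule, so this sketch assumes that a column is kept when its sample standard deviation exceeds the threshold; the function names and the toy matrix are hypothetical.

```python
import statistics


def column_std(rows):
    """Sample standard deviation of each column of a row-major matrix."""
    n = len(rows[0])
    return [statistics.stdev([r[j] for r in rows]) for j in range(n)]


def select_columns(rows, threshold):
    """Keep the columns whose standard deviation exceeds the threshold.

    The direction of the comparison is an assumption, not taken from the source.
    """
    stds = column_std(rows)
    kept = [j for j, s in enumerate(stds) if s > threshold]
    reduced = [[r[j] for j in kept] for r in rows]
    return kept, reduced


# Toy matrix (hypothetical): 4 observations, 3 parameters.
matrix = [
    [1.0, 100.0, 5.0],
    [1.5, 130.0, 5.1],
    [0.5, 70.0, 4.9],
    [1.0, 100.0, 5.0],
]
kept, reduced = select_columns(matrix, threshold=10)
```

Applied to a [182, 73] matrix with the same threshold, a filter of this kind would return the indices of the retained parameters together with the reduced matrix that is then passed to PCA.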

Conclusion
The main goal of this study is to analyze and classify the dataset of autism spectrum disorder (ASD) in adult and child patients based on the principal component analysis (PCA) method. The framework of this study consists of three main stages: dataset preparation, data analysis, and classification. Two groups of datasets were used, covering an age range of 4-60 years. The first group includes 30 adult patients with ASD (15 male, 15 female) and 36 adults without ASD (17 male, 19 female). The second group includes 14 child patients with ASD (9 male, 5 female) and 22 children without ASD (12 male, 10 female).
The unsupervised classification stage consists of three steps: data recovery, data thresholding, and application of PCA. Data recovery covers 73 criteria in the file and 275 person records (some persons have 6 datasets and some others have 8 datasets). The number of patients who participated in the experiment is 45. Data thresholding is needed to transform the dataset file based on the standard deviation matrix in order to reduce the size of the matrix by using a threshold. With a threshold equal to 10, only 15 of the 73 criteria are retained for adult classification and only 19 for children classification. The main purpose of PCA is to decrease the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining the variation present in the dataset. The results obtained after applying the PCA method to the dataset records show a fairly good classification for adults and a very good classification for children. To classify adult patients with autism, the test performed gives a sensitivity of 78.6% and a specificity of 82.47%, while the test performed to classify child patients with autism gives a sensitivity of 87.5% and a specificity of 95.7%.
Finally, for both children and adults, the performance of the classifiers is good, with on average about 80% sensitivity and 90% specificity. Future studies will use neural signal techniques to classify the signals obtained by the EEG device.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Finally, the test performed to classify child patients with autism gives a sensitivity of 87.5% and a specificity of 95.7%. The comparison of classification performance in both adults and children is presented in Table 5.
The comparison of methods in Table 5 shows that the proposed method obtained a sensitivity of 87.50% in children, a higher performance than the other methods for children. As for specificity, the result of 95.71% obtained for children is approximately equal to that of the SVM method. Overall, the proposed method performs better in children than in adults.

Discussion of Results
The results obtained after applying the PCA method to the dataset records show a fairly good classification for adults and a very good classification for children. Moreover, out of 73 criteria, only 15 were retained in adults and 19 in children. The correlation circles for adults and for children are shown in Figures 6 and 7, respectively. The following are some of the most discriminating parameters selected by the PCA method:

As mentioned earlier, unsupervised classification is based on two main steps: (1) data recovery, which covers 73 criteria in the file and 275 person records (some persons have 6 datasets and some others have 8 datasets).