Multichannel Deep Attention Neural Networks for the Classification of Autism Spectrum Disorder Using Neuroimaging and Personal Characteristic Data

Computer School, Beijing Information Science and Technology University, Beijing 100101, China
CAI, School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia
Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA
Ningbo Institute of Information Technology Application, CAS, Beijing, China
Computational Bioscience Research Center (CBRC), Computer Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
Department of Pediatrics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA


Introduction
Autism spectrum disorder (ASD) has been estimated to occur in more than 1.6% of children aged 8 across the United States [1]. As a chronic neurological condition, ASD is characterized by impairments in social interaction and communication, as well as by a restricted repertoire of activities and interests [2][3][4][5]. Patients with ASD exhibit different levels of impairment, with intellectual ability ranging from above average to intellectual disability. In neuroscience, ASD remains a formidable challenge due to its high prevalence, complexity, and substantial heterogeneity, which require multidisciplinary efforts [6][7][8]. Although clinical therapies have been developed to treat the symptoms, the diagnosis of ASD remains a challenging task. Currently, behavior-based testing is the standard clinical method for diagnosing ASD [9]. However, the diagnostic process for ASD is not only time consuming but also costly [10]. This results in a tremendous financial burden for patients' families. Meanwhile, because ASD is a lifelong condition, patients may have difficulties in normal socialization and working environments, increasing the overall social costs. Therefore, an automated diagnosis approach is desirable for earlier identification of ASD.
Machine learning is a promising tool for investigating the replicability of patterns across larger, more heterogeneous datasets [11][12][13]. For automated diagnosis of ASD, personal characteristic (PC) data, such as intelligence quotient (IQ) and Social Responsiveness Scale (SRS) scores, have been adopted in several studies [14][15][16]. In the study of ASD, IQ is a type of standard score derived from several standardized tests designed to assess human intelligence, and the SRS score is derived from a 65-item standardized questionnaire regarding behaviors that are associated with ASD [17]. ASD is highly associated with intellectual disability, which is mainly measured by IQ. Meanwhile, some studies [18,19] indicate that IQ discrepancy marks a meaningful phenotype in ASD. In this way, IQ becomes an important biomarker for classifying ASD.
Neuroimaging data have also been investigated to explore ASD biomarkers in recent decades. To facilitate the ASD research community, the Autism Brain Imaging Data Exchange (ABIDE), an international collaborative project, has collected data from over 1,000 subjects (e.g., structural MRI (sMRI), resting-state functional MRI (rs-fMRI), and PC data) and made the whole database publicly available. This provided a common platform to test hypotheses, search for key biomarkers, and develop advanced statistical and machine learning algorithms. For example, Ghiassian et al. [20] proposed an automated classifier that combines the histogram of oriented gradients approach for feature extraction from sMRI and rs-fMRI data with support vector machines (SVMs) for decision making. Their method was tested on the ABIDE dataset and achieved 65.0% accuracy on a hold-out set. More recently, Sen et al. [21] developed the LEFMS learner, which applies a sparse autoencoder to extract features from sMRI data and spatial nonstationary independent components to rs-fMRI data. An SVM was then utilized to classify ASD, improving accuracy by 0.042. Katuwal et al. [22] applied a random forest classifier to classify ASD and achieved an AUC of 0.61. By adding verbal IQ and age to the morphometric features, the AUC was improved to 0.68. By introducing a hypergraph learning technique, Zu et al. [23] proposed a novel learning method to discover complex connectivity biomarkers that go beyond the region-to-region connections widely used in conventional brain network analysis.
Deep learning has had a profound impact on many data analytic applications, such as speech recognition, image classification, computer vision, and natural language processing [24]. Based on data-driven feature construction, deep learning provides a new direction for data analytic modelling. Over the past few years, an increasing body of literature has confirmed the success of feature construction using deep learning methods. Deep learning has been demonstrated to outperform traditional machine learning algorithms on numerous recognition and classification tasks [24][25][26][27][28][29], which has inspired researchers in the ASD community to apply deep learning approaches to ASD classification. Earlier, deep neural networks (DNNs) were applied to identify ASD patients using rs-fMRI [26]. Their model achieved 70% accuracy by using the functional connectivity (FC) matrix as features for model training.
Kong et al. [27] constructed individual functional brain networks using the rs-fMRI data from 182 subjects of NYU Langone Medical Center, a data site within the ABIDE repository. FC features were used to represent the networks of all subjects and further ranked using F-score. Then, a stacked sparse autoencoder-based DNN model was developed. Significant performance improvement was achieved by comparing the proposed method with two existing algorithms.
More recently, ASD-DiagNet, a joint learning procedure using an autoencoder and a single-layer perceptron, was presented [28]. A data augmentation strategy was also designed for the FC features of functional brain networks, based on linear interpolation of available feature vectors, to ensure robust training of ASD-DiagNet. Evaluated on 1035 subjects from 17 different sites of the ABIDE repository, ASD-DiagNet achieved 70.1% accuracy, 67.8% sensitivity, and 72.8% specificity in 10-fold cross validation. In the model evaluation on individual data centers, ASD-DiagNet outperformed other state-of-the-art methods and increased accuracy by up to 20%, with a maximum accuracy of 80%.
In this work, we aim to develop a novel deep learning model for automated diagnosis of ASD. Specifically, we propose a multichannel deep attention neural network, called DANN, which integrates multiple layers of neural networks, an attention mechanism, and feature fusion to capture the interrelationships in multimodality data (functional neuroimaging data and PC data) and distinguish ASD patients from typical development controls (TDCs). Attention mechanism-based learning is a type of deep learning that has recently been used to understand which parts of the historical information weigh more in predicting diseases [30,31]. Taking advantage of the large heterogeneous dataset from ABIDE, multiscale brain functional connectomes and PC data were obtained as the features. We systematically evaluated the diagnostic power of our multichannel DANN on ASD classification and compared the performance of the proposed model with peer machine learning models. The rest of the paper is organized as follows. Section 2 describes the ASD data and the multichannel deep attention neural network. The experimental setup is shown in Section 3, followed by the experimental results and discussion in Section 4. Finally, the conclusion of this work is described in Section 5.

Subjects.
We collected preprocessed rs-fMRI and PC data from 809 subjects from the publicly accessible ABIDE repository, including 408 ASD subjects and 401 TDC subjects. Detailed demographic information of the subjects is listed in Table 1. The incidence of ASD between male and female subjects is significantly different, and thus the majority of the subjects in the ABIDE dataset are male. There is no significant difference between the ages of the ASD and TDC groups. All three IQ scores differed significantly between the two groups. Later, the variables gender, age, and three IQs were used as PC data in our ASD classification experiments.

Data Preprocessing.
Each rs-fMRI dataset was preprocessed using the Configurable Pipeline for the Analysis of Connectomes (CPAC) preprocessing pipeline, which includes slice timing correction, motion realignment, and intensity normalization. Nuisance variable regression was implemented through bandpass filtering and global signal regression strategies to clean confounding variations introduced by heartbeats and respiration, head motion, and low-frequency scanner drifts. Furthermore, boundary-based rigid body and FMRIB's linear and nonlinear image registration tools were used to register functional to anatomical images. Then, both functional and anatomical images were normalized to template space (MNI 152). Three scales of brain functional connectomes were extracted in this work. Mean blood oxygen-level dependent (BOLD) time-series signals were calculated for three sets of regions of interest (ROIs), i.e., atlases: the Automated Anatomical Labeling (AAL) atlas, the Harvard-Oxford (HO) atlas, and the Craddock 200 (CC200) atlas. The weights of functional brain connectivity were defined using Pearson's correlation coefficient between each pair of ROIs. For the AAL atlas, each subject was represented by a 90 × 90 FC adjacency matrix, symmetric about the diagonal, in which each entry represents the brain connectivity between a pair of ROIs. Similarly, each rs-fMRI dataset was also represented by 110 × 110 and 200 × 200 symmetric FC adjacency matrices using the HO and CC200 atlases, respectively. In addition, from the 809 subjects, we obtained five PC variables: sex, handedness, full-scale IQ (FIQ), verbal IQ (VIQ), and performance IQ (PIQ).
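As an illustration of the connectome construction described above, the following minimal Python sketch builds an FC matrix from per-ROI mean time series using Pearson's correlation and flattens it into a feature vector. It is a sketch under stated assumptions, not the CPAC implementation: the time series are synthetic, the series length of 120 time points is arbitrary, and taking the strictly upper triangle (diagonal excluded) is inferred from the AAL input size of 4005 = 90 × 89 / 2 quoted for the MLP block. The function names are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fc_matrix(roi_series):
    """Symmetric FC matrix from a list of per-ROI mean BOLD time series."""
    n = len(roi_series)
    fc = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            fc[i][j] = fc[j][i] = pearson(roi_series[i], roi_series[j])
    return fc


def upper_triangle(fc):
    """Flatten the strictly upper triangle (diagonal excluded) into a vector."""
    n = len(fc)
    return [fc[i][j] for i in range(n) for j in range(i + 1, n)]


# Synthetic stand-in for one subject: 90 AAL ROIs, 120 time points (assumed).
rng = random.Random(0)
series = [[rng.gauss(0.0, 1.0) for _ in range(120)] for _ in range(90)]
fc = fc_matrix(series)
features = upper_triangle(fc)
print(len(fc), len(features))  # 90 4005
```

Under the same assumption, the HO (110 ROIs) and CC200 (200 ROIs) atlases would yield 5995 and 19900 features per subject, respectively.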

Overview
Structure. An overview of the multichannel DANN is given in Figure 1. It consists of blocks of multichannel inputs, multilayer perceptron (MLP), self-attention, fusion, and aggregation. The various components are described in the following sections.

MLP.
The MLP block is composed of five layers: one dropout layer and four dense layers. The details of the block are shown in Figure 2. A dropout layer, which prevents overfitting during model training, is applied to the input data, e.g., AAL FC (input size is 4005). The white circles in Figure 2 denote units dropped according to the dropout probability. The dropout layer is followed by four dense layers with 1024, 512, 128, and 32 hidden units, whose activation functions are "elu," "tanh," "tanh," and "relu," respectively.
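The forward pass of such a block can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' Keras code: the dropout rate of 0.5, the random weight initialization, the inverted-dropout rescaling, and the scaled-down input size of 40 features (instead of 4005) are all assumptions made so that the example runs quickly, and every function name is hypothetical.

```python
import math
import random

ACTIVATIONS = {
    "elu": lambda z: z if z > 0 else math.exp(z) - 1.0,
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
}


def init_dense(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    act = ACTIVATIONS[activation]
    return [act(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def build_layers(n_in, rng, units=(1024, 512, 128, 32),
                 activations=("elu", "tanh", "tanh", "relu")):
    """Four dense layers with the unit counts and activations given in the text."""
    layers = []
    for n_out, act in zip(units, activations):
        weights, biases = init_dense(n_in, n_out, rng)
        layers.append((weights, biases, act))
        n_in = n_out
    return layers


def mlp_block(x, layers, rate, rng, training=True):
    """Dropout on the input followed by the stacked dense layers."""
    h = dropout(x, rate, rng, training)
    for weights, biases, activation in layers:
        h = dense(h, weights, biases, activation)
    return h


rng = random.Random(0)
layers = build_layers(40, rng)  # assumed toy input size; AAL FC would use 4005
x = [rng.uniform(-1.0, 1.0) for _ in range(40)]
out = mlp_block(x, layers, rate=0.5, rng=rng)  # rate=0.5 is an assumption
print(len(out))  # 32
```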

Self-Attention.
The attention mechanism was proposed to compute an alignment score between elements from two sources [32]. In particular, given an input FC adjacency matrix, which can be transformed into an FC adjacency sequence $x = [x_1, \ldots, x_n]$, and a representation of a query $q \in \mathbb{R}^{d}$, attention [33] computes the alignment score between $q$ and each element $x_i$ using a compatibility function $f(x_i, q)$. A softmax function then transforms the alignment scores $[f(x_i, q)]_{i=1}^{n}$ into a probability distribution $p(z \mid x, q)$, where $z$ indicates which element is important to $q$; a large $p(z = i \mid x, q)$ means that $x_i$ contributes important information to $q$. This attention process can be formalized as
$$a = [f(x_i, q)]_{i=1}^{n}, \qquad p(z \mid x, q) = \mathrm{softmax}(a). \tag{1}$$
The output $s_i$ is the weighted element according to its importance, i.e.,
$$s_i = p(z = i \mid x, q)\, x_i.$$
Additive attention mechanisms [33,34] are commonly used attention mechanisms in which the compatibility function $f(\cdot)$ is parameterized by an MLP, i.e.,
$$f(x_i, q) = w^{T} \sigma\left(W^{(1)} x_i + W^{(2)} q\right),$$
where $\sigma$ is an activation function. In contrast to additive attention, multiplicative attention [35,36] uses cosine similarity or inner product as the compatibility function $f(x_i, q)$, i.e.,
$$f(x_i, q) = \left\langle W^{(1)} x_i, W^{(2)} q \right\rangle.$$
In practice, although additive attention is expensive in time cost and memory consumption, it usually achieves better empirical performance for downstream tasks.
Self-attention [37,38] explores the importance of each feature to the entire FC given a specific task. In particular, $q$ is removed from the common compatibility function, which is formally written as the following equation:
$$f(x_i) = w^{T} \sigma\left(W^{(1)} x_i\right).$$
The output $s_i$ is the weighted element according to its importance, i.e.,
$$s_i = p(z = i \mid x)\, x_i.$$
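A minimal sketch of the self-attention step may help make the idea concrete. It assumes an additive compatibility function with the query removed, tanh as the activation, a softmax over the scores, and element-wise weighting of each input by its attention probability; the weights and inputs are toy values, and the function names are hypothetical rather than taken from the authors' implementation.

```python
import math


def softmax(scores):
    """Convert alignment scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, W, w):
    """Additive self-attention over a sequence x of d-dimensional elements.

    score_i = w . tanh(W x_i); p = softmax(scores); s_i = p_i * x_i.
    """
    scores = []
    for x_i in x:
        hidden = [math.tanh(sum(a * v for a, v in zip(row, x_i))) for row in W]
        scores.append(sum(a * h for a, h in zip(w, hidden)))
    p = softmax(scores)
    s = [[p_i * v for v in x_i] for p_i, x_i in zip(p, x)]
    return p, s


x = [[0.2, -0.1], [1.5, 0.3], [-0.4, 0.8]]  # three toy elements, d = 2
W = [[0.5, -0.3], [0.1, 0.9]]                # hypothetical learned weights
w = [1.0, -0.5]                              # hypothetical learned vector
p, s = self_attention(x, W, w)
print(round(sum(p), 6))  # 1.0
```

Elements with larger scores receive larger probabilities and therefore contribute more strongly to the downstream layers.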

Fusion.
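The fusion output $u$ is obtained by combining the outputs of the two dense layer blocks, which can capture the correlation between the types of spaces. The combination is accomplished by a fusion gate, as shown in Figure 3, i.e.,
$$F = \mathrm{sigmoid}\left(W^{(f1)} h_1 + W^{(f2)} h_2 + b^{(f)}\right), \qquad u = F \odot h_1 + (1 - F) \odot h_2,$$
where $h_1$ and $h_2$ denote the outputs of the two dense layer blocks, and $W^{(f1)}$, $W^{(f2)}$, and $b^{(f)}$ are the learnable parameters of the fusion gate.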
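The gating idea can be sketched as follows. The exact gate form is an assumption: a sigmoid gate computed from both block outputs is used to form a convex, element-wise combination of them, which is a common formulation of fusion gates. The parameter names `W1`, `W2`, and `b`, and the toy inputs, are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def fusion_gate(h1, h2, W1, W2, b):
    """Gate F = sigmoid(W1 h1 + W2 h2 + b); output u = F * h1 + (1 - F) * h2."""
    pre = [a + c + d for a, c, d in zip(matvec(W1, h1), matvec(W2, h2), b)]
    gate = [sigmoid(z) for z in pre]
    return [g * a + (1.0 - g) * c for g, a, c in zip(gate, h1, h2)]


h1 = [1.0, 0.0, 2.0]  # toy output of one dense block
h2 = [0.0, 4.0, 2.0]  # toy output of another dense block
zero = [[0.0] * 3 for _ in range(3)]
# With all-zero parameters the gate is 0.5 everywhere, so u is the plain average.
u = fusion_gate(h1, h2, zero, zero, [0.0, 0.0, 0.0])
print(u)  # [0.5, 2.0, 2.0]
```

Because the gate lies in (0, 1), each fused value stays between the corresponding values of the two inputs, and training decides how much each block contributes per dimension.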

Aggregation.
To aggregate the dense layer, self-attention, and fusion blocks into a DANN, the outputs of the self-attention and fusion blocks can be concatenated, multiplied, or averaged. In our implementation, the outputs of both the self-attention blocks and the fusion blocks are concatenated, followed by a dense layer and a sigmoid layer for classification:
$$\hat{y} = \mathrm{sigmoid}(\mathrm{Dense}(v)),$$
where $v$ is a vector of the combined outputs of both the self-attention blocks and the fusion blocks. $v = [s_{aal}, s_{ho}, s_{cc}, u_1, u_2, u_3, \mathrm{Demo}]$ represents the concatenation of the outputs $s_{aal}$, $s_{ho}$, $s_{cc}$ from the self-attention blocks, $u_1$, $u_2$, $u_3$ from the fusion blocks, and Demo from the demographic data. A sigmoid function on the dense layer is then used for data classification.
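The concatenation and classification step can be illustrated with a short sketch. The block outputs are toy two-dimensional vectors rather than the 32-dimensional outputs of the actual blocks, the dense layer is reduced to a single output unit, and the weights are set to zero purely to make the result easy to check; all names are hypothetical.

```python
import math


def aggregate(attention_outputs, fusion_outputs, demo, weights, bias):
    """Concatenate all block outputs with the PC features, then dense + sigmoid."""
    v = [val for block in attention_outputs + fusion_outputs for val in block]
    v += list(demo)
    logit = sum(w * val for w, val in zip(weights, v)) + bias
    return v, 1.0 / (1.0 + math.exp(-logit))


s_blocks = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy s_aal, s_ho, s_cc
u_blocks = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]  # toy u_1, u_2, u_3
demo = [1.0, 0.0, 0.5, 0.4, 0.6]                 # five toy PC features
v, prob = aggregate(s_blocks, u_blocks, demo, weights=[0.0] * 17, bias=0.0)
print(len(v), prob)  # 17 0.5
```

A probability above a chosen threshold (0.5 is the usual default) would be read as the ASD class.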

Model Evaluation.
We conducted a comprehensive evaluation in this study by employing the proposed multichannel DANN on the ABIDE dataset to classify ASD subjects from TDC subjects. Two evaluation strategies, k-fold cross validation and leave-one-site-out cross validation, were designed in our experiments. For k-fold cross validation, the whole ABIDE dataset was divided randomly into k portions. In each iteration, we used one portion of the data as testing data and the remaining (k − 1) portions as training data. This process was repeated k times until all data had been tested once. For the leave-one-site-out cross validation, we separated the whole ABIDE dataset according to data site. We removed the SBL site from this experiment due to its small sample size (N = 4). This resulted in a total of 12 data sites. We used data from one site as testing data and treated the data from the remaining 11 sites as training data. This was repeated 12 times until data from every site had been evaluated as testing data. Both the k-fold cross validation and leave-one-site-out experiments were repeated 50 times to understand the variability of the results. Mean and standard deviation (SD) were calculated. Student's t-test was applied to test the difference between continuous values, and the chi-square test was used for discrete values. One-way analysis of variance (ANOVA) was utilized to compare multiple conditions (i.e., multiple k-fold cross validation experiments). A p value < 0.05 was used for inferring statistical significance.
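The two splitting strategies can be sketched as index generators. This is a minimal illustration assuming simple random shuffling without stratification; the site labels in the second example are toy values, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) so that every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def leave_one_site_out(sites):
    """Yield (site, train_idx, test_idx), holding out one data site at a time."""
    by_site = defaultdict(list)
    for i, site in enumerate(sites):
        by_site[site].append(i)
    for held_out, test in by_site.items():
        train = [i for s, members in by_site.items() if s != held_out
                 for i in members]
        yield held_out, train, test


folds = list(k_fold_splits(809, k=10))
print(len(folds), min(len(test) for _, test in folds))  # 10 80

sites = ["NYU", "NYU", "USM", "UCLA", "UM", "USM"]  # toy site labels
for site, train, test in leave_one_site_out(sites):
    print(site, len(train), len(test))
```

Repeating the k-fold procedure with different seeds gives the repeated runs from which a mean and SD can be computed.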
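We calculated true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts for the classification by comparing the classified labels with the gold-standard labels. Then, we calculated accuracy, sensitivity, precision, and F-score by
$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \text{sensitivity} = \frac{TP}{TP + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}.$$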
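As a worked example, the sketch below computes these metrics from confusion-matrix counts using their standard definitions. Specificity, which is reported in the results, is included as TN / (TN + FP). The counts are toy values, not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


m = classification_metrics(tp=30, fp=10, tn=40, fn=20)  # toy counts
print(m["accuracy"], m["sensitivity"], m["precision"])  # 0.7 0.6 0.75
```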

Peer Machine Learning Models.
To compare our multichannel DANN with existing machine learning models, we also implemented random forest (RF), support vector machine (SVM) models, and multichannel DNN. Each model was designed to take multimodality data as inputs.

Random Forest (RF).
RF is one of the classic ensemble learning methods, which learns multiple decision trees to improve classification performance and control overfitting. The number of trees in the forest was optimized over the empirical values 20, 40, 60, 80, and 100. We set the maximal depth of each tree to 10.

Deep Neural Networks (DNNs).
In terms of existing deep learning models, we compared our model with a DNN model developed previously for ASD classification [26]. In brief, the compared DNN model is a 5-layer DNN whose input layer has one node per input feature, followed by hidden layers of 1024, 512, 128, and 32 nodes and an output layer with two output units. A cross-entropy loss function was adopted. The learning rate was set to 0.0001. The model was trained for 10 epochs to ensure convergence.

Developmental Environment.
The proposed DANN and peer machine learning models were implemented in the Python 3.7 environment. To build the deep learning models, we applied the Keras (2.2.4) package with the TensorFlow (1.13.1) backend. For the traditional models, we adopted the models from Sklearn 0.20 [39]. Statistical analyses were performed using Matlab 2019b.
All the experiments were conducted on a workstation with a 10-core Intel Core i9 CPU and 64 GB RAM. Due to the high computation cost of deep learning algorithms, we configured one GPU (Nvidia TITAN Xp, 12 GB RAM) to accelerate model training.

Performance Comparison on the Whole ABIDE Dataset.
We first compared the ASD classification performance of the proposed multichannel DANN model and multiple peer machine learning models, including RF, SVM, and multichannel DNN. The results were calculated based on 50 repeats of 10-fold cross validation experiments using the entire ABIDE dataset. The mean and SD of the performance metrics are listed in Table 2. The proposed multichannel DANN exhibited a significantly higher accuracy than the multichannel DNN (p = 0.01), SVM (p = 0.014), and RF (p = 0.008) models. Similarly, the multichannel DANN also had a better F-score than the multichannel DNN (p = 0.004), SVM (p < 0.001), and RF (p < 0.001) models. The sensitivity of the multichannel DANN was significantly higher than that of the multichannel DNN (p = 0.009), SVM (p = 0.015), and RF (p = 0.005) models. The specificity of the multichannel DANN was significantly higher than that of the SVM (p = 0.004) and RF (p < 0.001) models but was not significantly better than that of the multichannel DNN (p = 0.082). Since the multichannel DNN had a relatively lower sensitivity (0.673), it achieved the best mean precision in our experiments. No significant difference (p = 0.219) was found between the multichannel DNN and DANN on precision. The multichannel DANN model still exhibited higher precision than SVM (p = 0.003) and RF (p < 0.001). Overall, the proposed multichannel DANN achieved the best ASD classification accuracy, sensitivity, F-score, and specificity among the compared machine learning models, while the multichannel DNN had the highest precision.
Encouragingly, the proposed multichannel DANN significantly outperformed multichannel DNN on four of five performance metrics, increasing mean accuracy by 0.025, sensitivity by 0.072, F-score by 0.018, and specificity by 0.017. Although the difference was not significant, the precision of the proposed approach was lower than that of multichannel DNN by 0.01. The attention mechanism in our model, as the name implies, helps the deep learning model choose which features it should pay attention to. Our model can allocate attention by adjusting the weights it assigns to individual FC features. This process can decide which FC features are more important than others for the ASD classification task. In other words, it optimizes feature selection during the training of a deep learning model. The improved performance of DANN over DNN demonstrated the validity of the attention mechanism. The results in Table 2 also showed that multichannel DANN achieved significantly improved performance compared to the traditional models SVM and RF. This is consistent with multiple previous ASD classification studies [26,27]. The improvement was likely due to a combination of the attention mechanism and the superior capability of deep learning models on complex data patterns, such as FC features.

Leave-One-Site-Out Cross Validation of Multichannel DANN.
To test the generalizability of the proposed model on unseen data from different data sites, we performed a leave-one-site-out cross validation. Similar to k-fold cross validation, we reserved data from one data site as testing data and trained our model by using all data from the rest of the 11 data sites. But, since the training data were the same across all repeats, the performances have much smaller variations than k-fold cross validation. Table 3 shows the classification performance of our model and the size of subjects for each data site.
In the NYU data site, which contains the largest sample size, our model achieved an accuracy of 0.709 ± 0.019, sensitivity of 0.720 ± 0.086, precision of 0.758 ± 0.127, F-score of 0.738 ± 0.069, and specificity of 0.689 ± 0.072. When examining data sites with more than 40 subjects, we found that our model achieved the highest accuracy (0.803 ± 0.045) on the USM site and the best F-score (0.745 ± 0.052) on the UCLA site. These two sites contain nearly 100 subjects, so the results are very informative. We also noted that the lowest accuracy our model returned was 0.684 ± 0.026 from the UM site, suggesting that the data here may have variability that is different from other sites. Overall, our model reached a mean accuracy of 0.713 ± 0.022 and a mean F-score of 0.707 ± 0.043. This was significantly lower than the accuracy (p = 0.002) and F-score (p < 0.001) from the cross validation results in Table 2, indicating a large data variability among different data sites.

Robustness of Multichannel DANN on Varying Data Split Schemes.
Next, the robustness of our DANN was further tested using varying k-fold cross validation. A classification model that is not robust may appear to perform very differently with different k. Figure 4 shows plots of the accuracy, sensitivity, precision, F-score, and specificity of the proposed DANN over k-fold cross validation strategies (k = 6, 7, 8, 9, 10). Using one-way ANOVA, the proposed DANN exhibited no significantly different performance across varying k-fold experiments (p = 0.082), indicating the robustness of the proposed multichannel DANN model.
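For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as the ratio of between-group to within-group mean squares, as in the minimal sketch below. The accuracies are toy values for three hypothetical settings of k, not results from this study, and the function name is hypothetical. The p value additionally requires the F distribution, which a statistics package would supply (e.g., `scipy.stats.f_oneway` in Python; the analyses in this study were run in Matlab).

```python
def one_way_anova_f(groups):
    """Return the F statistic and (between, within) degrees of freedom."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = len(groups) - 1, n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy accuracies from repeated runs at three hypothetical values of k.
groups = [[0.72, 0.74, 0.73], [0.73, 0.75, 0.71], [0.74, 0.72, 0.76]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(df_b, df_w)  # 2 6
```

A small F statistic relative to the F(2, 6) distribution would indicate that the choice of k does not change mean performance appreciably.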

Impact of Data Modality on the Classification Performance.
Finally, we tested the performance of the multichannel DANN when different data modalities are used for ASD classification. All results were based on 50 repeats of 10-fold cross validation experiments. Table 4 lists the performance of multichannel DANN on varying combinations of FC data (marked as AAL, HO, and CC200) and PC data (marked as Demo). The upper part of Table 4 contains results based on both FC and PC data, while the lower part of the table focuses on FC data only. The combined FC and PC data (AAL + HO + CC + Demo) had better accuracy (p = 0.011), sensitivity (p = 0.039), and specificity (p = 0.025) than FC data alone (AAL + HO + CC), while no significant differences were observed on precision (p = 0.231) and F-score (p = 0.347). This demonstrated the predictive power of PC data. Without PC data, our model achieved the highest performance by combining FC from all three brain atlases. This suggests that brain connectivity data from different atlases may carry complementary information that assists ASD classification. Interestingly, the model using CC200 FC data (marked as CC in the table) performed better than the models using FC data derived from AAL (p = 0.012) and HO (p = 0.023). This is likely because the CC200 atlas is constructed from rs-fMRI data and thus represents a functional parcellation of the brain.
Figure 4: Performance of multichannel DANN over varying data split schemes with k-fold cross validation strategies (k = 6, 7, 8, 9, 10). Mean and standard deviation are displayed.
Table note: All data are mean and standard deviation. The highest metrics are marked in bold.

Conclusion
In summary, we developed a multichannel DANN model by applying state-of-the-art attention mechanism-based deep learning techniques for automated diagnosis of ASD. The k-fold cross validation experiments have shown that our multichannel DANN achieved an accuracy of 0.732, outperforming multiple peer machine learning models. The results of the leave-one-site-out cross validation experiments showed promise for our model to be applied to clinical data with unseen variations. The experiments using varying combinations of data modalities demonstrated the discriminative power of individual data modalities such as the brain functional connectome and PC data. This suggests a future direction of combining additional data modalities to move machine learning applications towards clinical usage of ASD computer-aided diagnosis tools. One limitation of the current work is that the selected cohort is in the adolescent and young adult population, which limits the generalizability of the model, since ASD diagnosis is usually made at a much earlier age. In future studies, we would retrain the model with additional data from a wider age range.

Data Availability
The dataset used to support the findings of this study is available at http://fcon_1000.projects.nitrc.org/indi/abide/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.