MSTGC: Multi-Channel Spatio-Temporal Graph Convolution Network for Multi-Modal Brain Networks Fusion

Multi-modal brain networks characterize the complex connectivities among different brain regions from both structural and functional aspects, and have been widely used in the analysis of brain diseases. Although many multi-modal brain network fusion methods have been proposed, most of them are unable to effectively extract the spatio-temporal topological characteristics of brain networks while fusing different modalities. In this paper, we develop an adaptive multi-channel graph convolution network (GCN) fusion framework with graph contrastive learning, which not only effectively mines both the complementary and discriminative features of multi-modal brain networks, but also captures the dynamic characteristics and the topological structure of brain networks. Specifically, we first divide ROI-based time-series signals into multiple overlapping time windows, and construct the dynamic brain network representation based on these windows. Second, we adopt an adaptive multi-channel GCN to extract the spatial features of the multi-modal brain networks with contrastive constraints, including multi-modal fusion InfoMax and inter-channel InfoMin. These two constraints are designed to extract the complementary information among modalities and the specific information within a single modality, respectively. Moreover, two stacked long short-term memory (LSTM) units are utilized to capture the temporal information transferred across time windows. Finally, the extracted spatio-temporal features are fused, and a multilayer perceptron (MLP) is used to realize multi-modal brain network prediction. Experiments on an epilepsy dataset show that the proposed method outperforms several state-of-the-art methods in the diagnosis of brain diseases.


I. INTRODUCTION
IN RECENT years, brain network analysis methods based on neuroimaging technology have attracted increasing attention [1] and have been widely used in brain disease diagnosis [2], [3]. Brain network analysis can detect the intrinsic correlations and interaction patterns in the brain. A brain network model can be represented by a set of nodes and edges: each node represents a brain region of interest (ROI) defined by a physiological template, and each edge measures the interaction between two ROIs [4]. In general, brain networks can be divided into structural connectivity networks, derived for example from diffusion tensor imaging (DTI) [5], and functional connectivity networks, derived for example from resting-state functional magnetic resonance imaging (rs-fMRI) [6]. Previous studies have shown that the different modalities of the brain network convey complementary information to each other [7], [8]. However, due to the heterogeneity and the complex topology of the different modalities, it is still a challenge to effectively fuse functional and structural connectivity networks so as to improve the performance of feature representation and diagnosis.
Many multi-modal brain network fusion methods have been proposed for brain disease diagnosis. As shown in Figure 1, existing methods for multi-modal brain network analysis can be mainly divided into two categories. The first category is based on the non-graph structured data fusion strategy, such as multi-view embedding, multi-kernel learning (MKL) and principal component analysis (PCA). For example, Dai et al. fused fMRI and sMRI image features by MKL for the diagnosis of hyperactivity disorder [9]. Yang et al. used information from one modality to aid the construction of the brain network from another modality [10], [11]. Unfortunately, these fusion methods need to stretch the brain network into vector form, which destroys the topology of the brain network. The second category is based on the conventional graph structured data fusion strategy, which considers the topological characteristics of the brain network during fusion. The most common model for this strategy utilizes a graph convolution network (GCN) to fuse multi-modal structural and functional information. As a powerful tool for representing graph data, GCN can integrate the node features and the internal graph structure of brain networks through nonlinear mapping. For example, Liu et al. [12] proposed a Siamese community-preserving graph convolutional network framework to learn the structural and functional joint embedding of brain networks.
It is worth noting that although these conventional fusion methods can integrate the structural and functional features of the brain network, they often ignore the topological and discriminative information of the brain network over time in each modality, which is important for classifying brain networks. On the one hand, the connection pattern of the functional brain network changes over time during the scanning period, so capturing the temporal dynamic information of the brain network is conducive to improving feature extraction [13]. Most of the existing multi-modal brain network analysis models are developed to fuse static brain networks [10], [14], and therefore cannot take advantage of the dynamic topological properties of brain networks for brain disease diagnosis. On the other hand, the specific features in the fMRI modality and the DTI modality of the brain networks should also be effectively extracted and fused. As the number of network layers increases, a deep model naturally loses information about the original input due to the information bottleneck problem. However, most deep models focus on finding the complementary features among modalities, without ensuring that the learned features richly reflect the original data [15]. Therefore, it is necessary to investigate a spatio-temporal brain network fusion analysis method that can simultaneously extract the complementary and specific information in multi-modal brain networks and exploit the spatio-temporal characteristics of the different modalities.
In order to solve the above challenging problems in multi-modal brain network fusion, in this paper we propose a multi-channel spatio-temporal graph convolution network to distill both the spatial and temporal topological information from multi-modal brain networks, and apply it to the diagnosis of epilepsy. First, we divide the ROI-based time-series signals into multiple overlapping time windows, and construct the dynamic functional brain network based on these windows. Second, an adaptive multi-channel GCN is employed to obtain the spatial features of the multi-modal brain networks, in which the graph models of the functional network and the structural network are respectively used as the input of the multi-channel GCN. The proposed multi-channel GCN consists of three convolution modules: two specific convolution modules that extract the unique features of the fMRI modality and the DTI modality, respectively, and a common convolution module that fuses multiple modalities and extracts complementary information. Then we use an attention mechanism to adaptively fuse the features encoded by the different channels. Moreover, in order to encourage consistency in the multi-modal graph representations and distill discriminative information from each modality, we develop two contrastive objectives: multi-modal fusion InfoMax and inter-channel InfoMin. The former extracts complementary information from different modalities so that the extracted features reflect the complementary information of the brain networks of different modalities, while the latter distinguishes different graph views so as to capture the specific information in each modality. Next, stacked long short-term memory (LSTM) units are exploited to capture the temporal information between time windows. Finally, we utilize an MLP to classify the spatio-temporal topological features extracted from the multiple modalities.
Figure 2 shows the main difference between the proposed method and the conventional multi-modal brain network fusion methods.
In brief, the proposed multi-modal brain network fusion method has the following advantages:
1) To the best of our knowledge, it is the first work that investigates spatio-temporal topological characteristics in multi-modal brain network fusion.
2) A multi-channel graph convolution network with a stacked LSTM module is proposed to effectively capture the dynamic graph features embedded in the dynamic brain network.
3) We develop a multi-modal graph contrastive learning method to improve the discriminability of the complementary and specific topological information among different modalities.
4) The results on an epilepsy dataset show that the proposed method is significantly superior to the state-of-the-art multi-modal fusion diagnosis methods.
The rest of the paper is organized as follows: In Section II, we introduce the related work. Then, Section III describes the datasets used in our work and the proposed method. We report the experimental results in Section IV. Finally, we summarize this work and draw a conclusion in Section V.

II. RELATED WORKS

A. Deep Learning Method With the Graph Data
Recently, graph neural networks have emerged as a powerful tool for understanding graph-structured data in many domains, such as protein structure [16] and brain networks [17]. In essence, the graph neural network methods used in brain disease diagnosis can be mainly divided into two types: node classification and graph classification. In node classification, each node represents a subject and is associated with a feature vector extracted from imaging data. The edge weights encode the pairwise similarities between subjects and their features obtained from auxiliary phenotypic data [18]. Graph classification regards the brain regions of interest as nodes and the constructed brain functional network as the adjacency matrix of the graph [19]. A single GCN layer, Eq. (1), maps the feature matrix X to a new representation using the adjacency matrix A and the weight matrix W. Therefore, the output of the GCN layers considers both the brain structural connectome and the functional features at the same time. Many multi-modal brain network methods using GCNs are based on a guiding strategy. Although the interaction among modalities can be exploited in guiding-strategy-based fusion methods, they tend to ignore the unique characteristics of each modality.
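To make the role of A, X and W concrete, the following is a minimal sketch of one graph convolution layer in plain Python. It assumes the widely used propagation rule with self-loops, symmetric degree normalization and a ReLU activation; these choices, and the helper names `matmul`, `normalize_adjacency` and `gcn_layer`, are illustrative assumptions rather than the exact layer used in this work.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I.

    Assumes non-negative edge weights, so every degree is positive.
    """
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, x, w):
    """One GCN layer: ReLU(A_hat X W), with A_hat the normalized adjacency."""
    a_hat = normalize_adjacency(adj)
    h = matmul(matmul(a_hat, x), w)
    return [[max(0.0, v) for v in row] for row in h]

# Toy graph with 3 nodes (ROIs) in a chain and 2 features per node.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so the output equals A_hat X
out = gcn_layer(adj, x, w)
```

Each output row mixes a node's own features with those of its neighbours, which is why the layer reflects both the connectivity and the node features.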

B. Contrastive Learning for Network Embedding
Recent work on self-supervised visual representation learning has shown that comparing congruent and incongruent views allows the encoder to learn effective features from the data, which is conducive to representing network information [20]. Unlike data in Euclidean space, the features in brain network data have both topological properties and adjacency relations, so it is natural to use graph contrastive learning to improve the feature representation of graph-structured brain networks. These methods maximize the mutual information between the input and the learned representation for fusion. For example, CREME [21] adopts node-to-node contrastive learning to distill information from embeddings generated from different graph views and capture the complementary information between them. Hassani and Khasahmadi [22] proposed a contrastive multi-view representation learning method on both the node and graph levels, in which the embedding features reflect the global structure information of the original network. Similarly, in brain network fusion analysis, we hope that the features obtained can reflect both the global and local features of the brain network. Therefore, we introduce graph contrastive learning in our model to maximize the mutual information between the embedding features and the brain network.

III. PROPOSED METHOD
In this work, we propose the multi-channel spatio-temporal graph convolution network for effectively fusing the functional and structural brain networks, and develop graph contrastive learning to enhance feature representation among modalities. Next, this section describes the proposed method in detail.

A. Dynamic Functional Brain Network and Structural Brain Network Construction
We assume that the rs-fMRI time-series data of a subject is $(x_1, \cdots, x_N)^{\top} \in \mathbb{R}^{N \times M}$, where each vector $x_n \in \mathbb{R}^{M}$ $(n = 1, \cdots, N)$ contains the blood oxygen level dependent (BOLD) measurements of the $n$-th ROI at $M$ successive time points, and $N$ is the number of ROIs. To characterize the temporal variability of the functional architecture associated with a set of given regions, we segment all rs-fMRI time series into $T$ overlapping windows of constant length $L$. For each subject, we can define a dataset consisting of $T$ windows, $\{(g_f^i, g_d^i, y)\}_{i=1}^{T}$, where $g_d^i$ and $g_f^i$ represent the DTI and fMRI modality of the subject, and $y$ is the label of the corresponding subject. Each modality can be described as a weighted graph $g_*^i = (X, A_*^i)$, where $X \in \mathbb{R}^{N \times d}$ is the node feature matrix consisting of the time series within each time segment, $A_*^i \in \mathbb{R}^{N \times N}$ is the brain network, and $* \in \{f, d\}$ denotes the modality (fMRI or DTI).
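The windowing step can be sketched as follows. This is an illustrative example only: it assumes a fixed stride between window starts and uses the Pearson correlation between ROI signals as the functional connectivity $A_f^i$ of each window, neither of which is specified in the text above. The names `sliding_windows`, `pearson` and `dynamic_networks` are hypothetical.

```python
import math

def sliding_windows(n_timepoints, length, stride):
    """Return (start, end) index pairs of windows; they overlap when stride < length."""
    return [(s, s + length)
            for s in range(0, n_timepoints - length + 1, stride)]

def pearson(u, v):
    """Pearson correlation of two equally long signals (0.0 if one is constant)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def dynamic_networks(series, length, stride):
    """series: N lists of M BOLD samples. Returns one (X, A) pair per window.

    X holds the windowed time series of every ROI (so d equals the window
    length here), and A is the N x N correlation matrix within the window.
    """
    n_timepoints = len(series[0])
    graphs = []
    for start, end in sliding_windows(n_timepoints, length, stride):
        x = [roi[start:end] for roi in series]
        a = [[pearson(u, v) for v in x] for u in x]
        graphs.append((x, a))
    return graphs

# Three toy ROI signals with 10 time points each.
series = [[float(t) for t in range(10)],
          [float(2 * t + 1) for t in range(10)],
          [float(-t) for t in range(10)]]
graphs = dynamic_networks(series, length=6, stride=2)
```

With $M$ time points, window length $L$ and stride $s$, this produces $T = \lfloor (M - L) / s \rfloor + 1$ windows, so the number of windows can be tuned through $L$ and $s$.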

B. Adaptive Multi-Channel GCN
The proposed multi-channel GCN is used to extract spatial information from multi-modal brain networks. The overall framework of MSTGC consists of three convolution modules: two specific convolution modules extract the unique features of the fMRI modality and the DTI modality, respectively, and a common convolution module realizes multi-modal fusion. Then an attention mechanism is exploited to adaptively fuse the features encoded by each channel.
1) Specific Convolution Module: The specific convolution module is composed of three stacked GCNs [23], and top-k pooling [24] is used to retain the important nodes in the graph. First, in order to obtain the spatial characteristics of the fMRI modality under each window, we take $g_f^i = (X, A_f^i)$ as the input of the module. The output of the $l$-th layer, $Z_f^{i(l)}$, can be represented as:
$$Z_f^{i(l)} = \sigma\left( (\tilde{D}_f^i)^{-1/2} \tilde{A}_f^i (\tilde{D}_f^i)^{-1/2} Z_f^{i(l-1)} W_f^{i(l)} \right) \quad (2)$$
where $\tilde{A}_f^i = A_f^i + I$, $A_f^i$ and $I$ denote the adjacency matrix and an identity matrix, respectively, $\tilde{D}_f^i$ is the $N \times N$ degree matrix of $\tilde{A}_f^i$, $W_f^{i(l)}$ is the weight matrix of the $l$-th layer, $\sigma(\cdot)$ is a nonlinear activation function, and $Z_f^{i(0)} = X$. We denote the output embedding of the last layer as $Z_f^i$. In this way, we can learn the node embedding that captures the specific information in the fMRI modality. Similarly, for the DTI modality, we use $g_d^i = (X, A_d^i)$ as the channel input, and the specific information encoded in the DTI modality is embedded as $Z_d^i$. The calculation can be expressed as:
$$Z_d^{i(l)} = \sigma\left( (\tilde{D}_d^i)^{-1/2} \tilde{A}_d^i (\tilde{D}_d^i)^{-1/2} Z_d^{i(l-1)} W_d^{i(l)} \right) \quad (3)$$
2) Common Convolution Module: Since fMRI reflects the functional characteristics of the brain and DTI reflects its structural characteristics, there is information coupling between the two modalities. Multi-modal fusion can discover the complementary information between them, which is beneficial for improving the performance of feature representation and model detection. Here we design a common convolution module with a parameter sharing strategy to fuse multi-modal information. Specifically, we utilize the common convolution module to extract the graph embedding $Z_{CF}^i$ from the fMRI graph $g_f^i = (X, A_f^i)$ as follows:
$$Z_{CF}^{i(l)} = \sigma\left( (\tilde{D}_f^i)^{-1/2} \tilde{A}_f^i (\tilde{D}_f^i)^{-1/2} Z_{CF}^{i(l-1)} W_c^{i(l)} \right) \quad (4)$$
where $Z_{CF}^{i(l-1)}$ is the node embedding in the $(l-1)$-th layer, and $W_c^{i(l)}$ is the weight matrix of the $l$-th layer of the common convolution module.
[Figure: Overview of the proposed framework, which comprises (1) the construction of the dynamic brain network, (2) the extraction of spatial features by MSTGC, and (3) the extraction of temporal features. As shown in the right panel, MSTGC consists of three GCN modules, in which the specific features module is used to extract the most distinguishing features of each modality separately, and the common features module is utilized to extract multi-modal complementary information. In addition, we use an attention mechanism and two graph contrastive objectives, InfoMax and InfoMin, to adaptively fuse multi-modal complementary and discriminative information from the learned graph data.]
In order to extract the shared information from the DTI modality, we share the same weight matrix $W_c^{i(l)}$ for each layer of the common convolution module, so the embedding of the DTI graph can be formulated as:
$$Z_{CD}^{i(l)} = \sigma\left( (\tilde{D}_d^i)^{-1/2} \tilde{A}_d^i (\tilde{D}_d^i)^{-1/2} Z_{CD}^{i(l-1)} W_c^{i(l)} \right) \quad (5)$$
where $\tilde{A}_d^i = A_d^i + I$, $\tilde{D}_d^i$ is the degree matrix of $\tilde{A}_d^i$, and $\sigma(\cdot)$ is a nonlinear activation function, as in the specific convolution module. The shared weight matrix helps extract the complementary information from the two modalities. Finally, we get two output embeddings $Z_{CF}^i$ and $Z_{CD}^i$, which are combined into the common embedding $Z_C^i$ of the two modalities according to Eq. (6).
3) Attention Mechanism: Through the above multi-channel GCN, we obtain two specific embeddings $Z_F$, $Z_D$ and one common embedding $Z_C$ under each window. In order to extract the most correlated information $Z^i$ for representing the subject, we utilize an attention mechanism $att(Z_F, Z_D, Z_C)$ to adaptively fuse these embeddings with the learned weights [25]:
$$(a_f, a_d, a_c) = att(Z_F, Z_D, Z_C) \quad (7)$$
where $a_f$, $a_d$, $a_c$ indicate the attention values of $Z_F$, $Z_D$, $Z_C$, respectively. Specifically, taking $Z_F$ as an example, we first transform the embedding through a nonlinear transformation, and then use one shared attention vector $q \in \mathbb{R}^{h \times 1}$ to get the attention value $w_F^i$ as in Eq. (8), where $W$ is the weight matrix and $b$ is the bias vector of the transformation. Similarly, we can get the attention values $w_D^i$ and $w_C^i$. Then we normalize the attention values with the softmax function to get the final weight:
$$a_f^i = \mathrm{softmax}(w_F^i) = \frac{\exp(w_F^i)}{\exp(w_F^i) + \exp(w_D^i) + \exp(w_C^i)} \quad (9)$$
The value of $a_f^i$ denotes the importance of $Z_F$. Similarly, we can also obtain $a_d^i$ and $a_c^i$ through the softmax function. Thus, we have the learned weights $a_f = [a_f^i]$, $a_d = [a_d^i]$, and $a_c = [a_c^i]$, and denote $a_F = \mathrm{diag}(a_f)$, $a_D = \mathrm{diag}(a_d)$ and $a_C = \mathrm{diag}(a_c)$. Then we combine these three embeddings to obtain the final embedding $Z$ of each window:
$$Z = a_F \cdot Z_F + a_D \cdot Z_D + a_C \cdot Z_C \quad (10)$$
Finally, we obtain a low-dimensional and dense representation for each subject.
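The attention-based fusion of the three channel embeddings can be illustrated with a small plain-Python sketch. It assumes that the nonlinear transformation is a tanh applied to an affine map and that attention weights are computed per node (row); both are assumptions made for illustration, as are the function names `attention_scores`, `softmax3` and `fuse`.

```python
import math

def attention_scores(z, w, b, q):
    """Score every row of z as q^T tanh(W z_row + b) (tanh is an assumed choice)."""
    scores = []
    for row in z:
        hidden = [math.tanh(sum(w[h][j] * row[j] for j in range(len(row))) + b[h])
                  for h in range(len(w))]
        scores.append(sum(qh * hh for qh, hh in zip(q, hidden)))
    return scores

def softmax3(a, b, c):
    """Numerically stable softmax over three scalar scores."""
    m = max(a, b, c)
    ea, eb, ec = math.exp(a - m), math.exp(b - m), math.exp(c - m)
    total = ea + eb + ec
    return ea / total, eb / total, ec / total

def fuse(z_f, z_d, z_c, w, b, q):
    """Weight the fMRI-specific, DTI-specific and common embeddings and sum them."""
    w_f = attention_scores(z_f, w, b, q)
    w_d = attention_scores(z_d, w, b, q)
    w_c = attention_scores(z_c, w, b, q)
    fused, weights = [], []
    for n in range(len(z_f)):
        a_f, a_d, a_c = softmax3(w_f[n], w_d[n], w_c[n])
        weights.append((a_f, a_d, a_c))
        fused.append([a_f * f + a_d * d + a_c * c
                      for f, d, c in zip(z_f[n], z_d[n], z_c[n])])
    return fused, weights

# Toy embeddings: 2 nodes, 2 features; hidden size h = 3 for the attention map.
z_f = [[1.0, 2.0], [0.5, -1.0]]
z_d = [[0.0, 1.0], [2.0, 0.5]]
z_c = [[1.0, 1.0], [1.0, 1.0]]
w = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
b = [0.0, 0.1, -0.1]
q = [1.0, -1.0, 0.5]
fused, weights = fuse(z_f, z_d, z_c, w, b, q)
```

Because the three weights of a node sum to one, the fused embedding is a convex combination of the channel embeddings, so a channel that is uninformative for a node can be down-weighted without being discarded.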

C. Contrastive Learning of Multi-Modalities
GCN extracts deep features for disease diagnosis by combining the topology of the brain network with the information on the brain-region nodes. In graph classification, however, the graph representation is converted to a vector representation that gathers information from all brain regions and therefore mainly reflects global properties, whereas both the global and local properties of brain networks are important for disease diagnosis. Therefore, in order to maximize the mutual information between the global and local representations of the fMRI and DTI graphs, we introduce a multi-view fusion InfoMax constraint on the common convolution module, so that the output representation can incorporate both the global properties and the local attributes of the brain network. In addition, in order to make the embedding features under each channel discriminative, we impose the inter-channel InfoMin constraint among channels.
1) Multi-View Fusion InfoMax: We design a graph contrastive constraint to further enhance the consistency of the two embeddings $Z_{CF}$ and $Z_{CD}$, so as to encourage the module to effectively integrate the information of the two modalities. The contrastive objective is designed to maximize the mutual information between the node representations of one modality and the graph representation of the other modality, as shown in Figure 3(a). Specifically, the brain network graph is first transformed into vector form, and then we calculate the mutual information between it and the features obtained by GCN. The consistency constraint $L_C$ between $Z_{CF}$ and $Z_{CD}$ is defined in Eq. (11). For the convenience of optimization, we estimate the mutual information $I(X; Y)$ in Eq. (11) by using the Jensen-Shannon divergence (JSD):
$$I_{JSD}(X; Y) = \mathbb{E}_{(x, y)}\left[ -\mathrm{sp}(-d(x, y)) \right] - \mathbb{E}_{(x, \tilde{y})}\left[ \mathrm{sp}(d(x, \tilde{y})) \right] \quad (12)$$
where $\mathrm{sp}(x) = \log(1 + e^{x})$, $(x, y)$ denotes a positive pair, $(x, \tilde{y})$ denotes a negative pair, and $d$ is a discriminator function, which takes the inner product with a sigmoid activation. The resulting multi-modal fusion representation can distill discriminative knowledge from each modality.
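A Jensen-Shannon style mutual information estimate with a softplus and an inner-product discriminator can be sketched as below. The mean readout used as graph summary, the way negative pairs are formed (corrupted node embeddings), and the names `readout`, `discriminator` and `jsd_mutual_information` are assumptions for illustration, not details confirmed by the text.

```python
import math

def softplus(x):
    """Numerically stable sp(x) = log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def readout(nodes):
    """Graph-level summary: mean of the node embeddings (assumed readout)."""
    n = len(nodes)
    return [sum(row[j] for row in nodes) / n for j in range(len(nodes[0]))]

def discriminator(node, summary):
    """Inner product between a node embedding and a graph summary, then sigmoid."""
    return sigmoid(sum(a * b for a, b in zip(node, summary)))

def jsd_mutual_information(positive_pairs, negative_pairs):
    """Mean of -sp(-d) over positive pairs minus mean of sp(d) over negative pairs."""
    pos = sum(-softplus(-discriminator(x, y)) for x, y in positive_pairs)
    neg = sum(softplus(discriminator(x, y)) for x, y in negative_pairs)
    return pos / len(positive_pairs) - neg / len(negative_pairs)

# Node embeddings of the fMRI common channel vs. the summary of the DTI channel.
z_cf = [[1.0, 1.0], [1.0, 0.5]]
z_cd = [[1.0, 1.0], [0.8, 1.2]]
corrupted = [[-1.0, -1.0], [-1.0, -0.5]]  # stand-in for corrupted node embeddings
s_cd = readout(z_cd)
positives = [(node, s_cd) for node in z_cf]
negatives = [(node, s_cd) for node in corrupted]
estimate = jsd_mutual_information(positives, negatives)
```

Maximizing this estimate (or minimizing its negative as a loss term) pushes node embeddings of one modality to agree with the graph-level summary of the other.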
2) Inter-Channel InfoMin: Although the embedding $Z_F$ and the embedding $Z_{CF}$ are learned from the same graph structure $g_f = (X, A_f)$, they capture different information from the brain networks. In order to make the embeddings under different channels more distinguishable, we add diversity constraints between channels. Our approach is to minimize the mutual information between the node representations of the different channels, $g_f$ and $g_{cf}$. Specifically, we regard all nodes in the two graphs as negative pairs and minimize the mutual information between the node feature vectors of each negative pair, as shown in Figure 3(b). The loss function of the pair $(g_{cf}^i, g_f)$ is given in Eq. (13), where $k$ is the number of remaining nodes and $\tau$ is a temperature parameter. We define $\theta(u, v) = s(p(u), p(v))$ as a critic function, where $s(\cdot, \cdot)$ is implemented using a simple cosine similarity and $p(\cdot)$ is a non-linear projection that enhances the expressive power of the critic function. In this work, we adopt an MLP as this non-linear projection. Since the two graphs are symmetric, the loss is summarized as $L_{SF}$ in Eq. (14). Similarly, the loss $L_{SD}$ between $Z_{CD}$ and $Z_D$ is calculated as in Eq. (15), and the disparity constraint $L_d$ is then set from $L_{SF}$ and $L_{SD}$ as in Eq. (16). In addition, we use cross-entropy to measure the loss of subject classification and denote it as $L_t$. Combining the subject classification loss and the constraints, we formulate the overall objective function as:
$$L = L_t + \alpha L_d + \beta L_C \quad (17)$$
where $\alpha$ and $\beta$ are the weights of the disparity and consistency constraint terms, respectively. Finally, we can optimize the proposed model via backpropagation and learn the subjects' embeddings for classification.
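The critic function and the way the loss terms are combined can be sketched as follows. The cosine critic and the weighted sum follow the description above; the particular penalty `channel_similarity_penalty` (a log-mean-exp of temperature-scaled critic values over all node pairs) is only one plausible, assumed form of an inter-channel term, and the identity projection stands in for the MLP $p(\cdot)$.

```python
import math

def cosine(u, v):
    """Cosine similarity s(u, v); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def critic(u, v, project=lambda x: x):
    """theta(u, v) = s(p(u), p(v)); project defaults to identity for illustration."""
    return cosine(project(u), project(v))

def channel_similarity_penalty(z_a, z_b, tau, project=lambda x: x):
    """Assumed form: log of the mean exp(theta / tau) over all node pairs.

    Every pair of nodes across the two channels is treated as a negative
    pair, so a smaller value means the channels are less similar.
    """
    values = [math.exp(critic(u, v, project) / tau) for u in z_a for v in z_b]
    return math.log(sum(values) / len(values))

def total_loss(l_t, l_d, l_c, alpha, beta):
    """Overall objective: classification loss plus weighted constraint terms."""
    return l_t + alpha * l_d + beta * l_c

# Two channels with identical embeddings are penalized more than opposite ones.
z_f = [[1.0, 0.0], [0.0, 1.0]]
same = channel_similarity_penalty(z_f, z_f, tau=0.5)
opposite = channel_similarity_penalty(z_f, [[-1.0, 0.0], [0.0, -1.0]], tau=0.5)
loss = total_loss(l_t=1.0, l_d=same, l_c=0.2, alpha=0.1, beta=1.0)
```

The temperature $\tau$ controls how sharply similar pairs dominate the penalty, while $\alpha$ and $\beta$ trade the two constraints off against the classification loss.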

D. Temporal Convolutional Layer
The brain is essentially a dynamic system, in which the brain network is constantly reorganized over time during the scanning period [26], [27]. Brain regions interact dynamically with each other over time. Considering that long short-term memory (LSTM) can properly mitigate the vanishing and exploding gradient problems of traditional RNNs [28], we use two stacked LSTM units to capture the temporal information transmitted across time windows. Each of these LSTM units is followed by batch normalization and tanh activation. Finally, a fully-connected layer is employed to learn a mapping between the dynamic embedding features and the disease progression prediction. The overall procedure of the method is shown in Algorithm 1.

E. Implementation
The proposed method was implemented in Python based on PyTorch, and the model was trained on a single GPU (NVIDIA GeForce RTX 2080 Ti). We optimized the proposed method via the Adam algorithm, with a learning rate of 0.001, 300 epochs, and a batch size of 25. In the multi-channel GCN module, we used 3 stacked GCN modules for each channel, and the dropout was set to 0.5. In the temporal convolutional layer, the numbers of neurons of the two LSTM layers were 180 and 90, respectively. Each temporal convolutional layer was followed by batch normalization and tanh activation. Based on the output of the stacked LSTM, a fully-connected layer with 2 neurons was utilized to predict the category of the subject, where the sigmoid was used as the activation function of the last fully-connected layer.
Algorithm 1 (excerpt):
4: Extract features using multi-channel GCN and then obtain the specific features $Z_f$, $Z_d$ and the common features $Z_c$;
5: Fuse multi-channel features by Eq. (10);
6: Obtain time information by two-stacked LSTM;
7: Maximize the mutual information between two modalities by Eq. (11);
...
10: Back-propagate L to update model weights;
11: epoch += 1;
12: end while

IV. EXPERIMENT
In this section, we compare the proposed method with brain network analysis algorithms from recent references. In addition, we carry out ablation experiments to verify the effectiveness of the proposed method.

A. Data and Pre-Processing
In this work, we used the epilepsy dataset collected from Jinling Hospital, Nanjing University School of Medicine for the experiments. It contains 103 frontal lobe epilepsy (FLE) patients (all right-handed, 50 female, age range: 17-51 years, mean age 24.1), 89 temporal lobe epilepsy (TLE) patients (all right-handed, 45 female, age range: 17-51 years, mean age 25.9) and 114 normal controls (NC) (all right-handed, 56 female, age range: 20-38 years, mean age 26.2). The raw rs-fMRI data and DTI data of all participants were collected using a Siemens Trio 3T scanner. The scan parameters of rs-fMRI are as follows: TR = 2000 ms, TE = 30 ms, flip angle = 90°, voxel size = 3.75 × 3.75 × 3.75 mm³. The scan parameters of DTI are as follows: TR = 6100 ms, TE = 93 ms, flip angle = 90°, voxel size = 0.94 × 0.94 × 3 mm³. All rs-fMRI images are preprocessed using SPM8 in the DPARSF toolbox. The resulting volumes have 240 time points and are parcellated into 90 regions of interest (ROIs) using the AAL atlas. These time series reflect information about brain activity. The DTI data are processed using the PANDA suite. First, we use the FSL toolbox to perform distortion correction on the DTI data, remove eddy currents and extract brain masks from the B0 image. Then, based on each subject's co-registered T1 images, TrackVis is used to obtain fiber images by a deterministic tracking method, and anatomical regions are defined using AAL conventions. Finally, from the number of fibers, we obtain the structural information of the brain network and the strength of the physical connections.
(1) G-Unet: In this work, the brain functional connectivity network constructed by fMRI of each subject is considered as the adjacency matrix of the graph, and the time series is considered as the feature matrix of each node.
(2) dFCN-LSTM: This model fuses past and future information from fMRI to effectively learn time-series changes in signals from brain regions by using LSTM.
(3) DFC: In DFC, Pearson correlation coefficient of the whole time series is used for static FC calculation. Dynamic FC algorithm considers the moving window of time series. After that, the main repetitive FC matrix is found by clustering algorithm.
(4) BrainNetCNN: Three specialized convolutional layer types for DTI datasets are proposed, aiming to exploit the inherent structure of weighted brain networks.
(5) SAGPool: In SAGPool, a graph pooling method based on self-attention is used to extract the information of graph structure, which can take into account both node features and graph topology.
(6) SCP-GCN: In SCP-GCN, a Siamese community-preserving graph convolutional network framework is utilized to learn the structural and functional joint embedding of brain networks.
(7) M2E: In this method, a multi-view brain network processing framework is used to extract the high-order representation features in multi-modal data. We treat fMRI network and DTI network as different views and use tensor techniques to exploit the correlations between the multi-view brain networks.
(8) MPCA: The MPCA method mainly uses the multilinear principal component analysis method to analyze the brain network. We first concatenate the fMRI and DTI data into a three-dimensional tensor, and then utilize MPCA to obtain the features of the tensor.
(9) GCNeuro: In the GCNeuro method, we trained two Graph Attention Networks (GATs) for the fMRI modality and the DTI modality separately by sharing weight. Then we aggregated the features of each node in two modalities and regard it as feature matrix of the composite graph.

C. Experimental Setup
In the experiment, we evaluate the proposed method and the comparison methods on the epilepsy dataset based on a 10-fold cross-validation strategy. Specifically, we divide the dataset into 10 parts. Due to the slight imbalance of the subject categories, we utilize stratified 10-fold cross-validation, that is, each fold maintains the same class proportions as the original dataset. Each time, eight folds are taken as the training dataset, one fold is used as the validation dataset to determine the optimal values of the objective function parameters, and the remaining fold is used as the test data. In order to compare the performance of each method fairly, grid search is adopted to find the optimal model parameters. For the deep learning based methods, training does not stop until the loss converges to the threshold. Four binary classification tasks are utilized to evaluate our model's diagnostic capacity for epilepsy (NC vs. TLE & FLE), its diagnostic capacity for each subtype (NC vs. TLE, and NC vs. FLE), and its diagnostic capacity between the two subtypes (TLE vs. FLE). The performance on each task is measured by five metrics, i.e., accuracy (ACC), precision (Pre), recall (Rec), F-Measure (F1), and area under the ROC curve (AUC). For all the methods, we perform 10-fold cross-validation 10 times, and report the average results.
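The stratified split with eight training folds, one validation fold and one test fold can be sketched in plain Python as follows. The round-robin assignment, the choice of the next fold as validation fold, and the names `stratified_folds` and `train_val_test_splits` are assumptions for illustration; the group sizes in the usage example are those of the NC vs. TLE task.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for y in sorted(by_class):
        indices = by_class[y]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class to the folds in turn.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def train_val_test_splits(folds):
    """Yield (train, val, test): fold k is the test set, the next fold validation."""
    n = len(folds)
    for k in range(n):
        val_k = (k + 1) % n
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, val_k) for i in fold]
        yield train, folds[val_k], folds[k]

# NC vs. TLE task: 114 normal controls and 89 TLE patients.
labels = ["NC"] * 114 + ["TLE"] * 89
folds = stratified_folds(labels)
splits = list(train_val_test_splits(folds))
```

Repeating the procedure with different seeds gives the repeated 10-fold protocol, with results averaged over all runs.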

D. Classification Performance on Epilepsy Dataset
We divide the comparison methods into two groups for analysis: a single-modal group and a multi-modal group. The results are shown in Table I. They show that the proposed method is effective on all four tasks on the epilepsy dataset and outperforms the other algorithms. Specifically, compared with static methods such as G-Unet and BrainNetCNN, the proposed method achieves better performance by integrating the spatio-temporal information of the brain network. Compared with DFC and dFCN-LSTM, our method shows a large improvement on each task, which indicates that multi-modal classification is indeed effective for the diagnosis of brain diseases. Other multi-modal fusion methods are also compared in the experiment, including the machine learning methods M2E and MPCA and the deep learning methods GCNeuro, SCP-GCN and SAGPool. We find that the deep learning models generally achieve better classification results than the other comparison methods. This is because GCN can preserve direct and indirect relationships between nodes, so that it can capture more potential high-order features. In addition, although SAGPool, SCP-GCN and GCNeuro utilize the GCN model and introduce fMRI and DTI to jointly learn the brain network, they only retain the common features between modalities and do not retain the unique features of fMRI and DTI. Overall, our method achieves the best classification results among all the methods on all four diagnosis tasks.

E. Ablation Studies
In order to verify the effectiveness of the three components of the proposed MSTGC method (multi-channel GCN, graph contrastive learning, and LSTM), we conduct ablation experiments and present the results in Table II. In the table, MSTGC-1, ..., MSTGC-7 denote variants of the proposed MSTGC method. We use "√" to indicate that the corresponding component is used; otherwise, it is not used. For example, MSTGC-1 neither uses graph contrastive learning to enhance the topological feature representation of the brain network nor utilizes LSTM to obtain the temporal information of the dynamic brain network. From the accuracies of MSTGC-4, MSTGC-5, and MSTGC-6, we find that abandoning any one of the multi-channel GCN, graph contrastive learning, and LSTM components reduces the accuracy. Comparing the group of MSTGC-1, MSTGC-2, and MSTGC-3 with the group of MSTGC-4, MSTGC-5, and MSTGC-6, it can be concluded that discarding any two components yields lower accuracy than discarding only one of them, which further demonstrates that these components improve disease diagnostic accuracy.
Multi-channel GCN obtains specific information and complementary features from the fMRI modality and the DTI modality through multiple channels, and uses the attention mechanism to adaptively fuse these features. Graph contrastive learning maximizes the mutual information between the graph representation and the extracted features, thereby enhancing the feature representation of brain networks from both global and local perspectives. The LSTM module obtains the temporal information of the dynamic brain network. Together, these three modules give our model promising disease diagnosis performance.

V. DISCUSSION
In this section, we discuss the sensitivity of parameters in the proposed model, and then analyze the discriminative connectivities and brain regions discovered by our method. In addition, we also visualize the embedding features.

A. Analysis of Parameter Sensitivity
According to Eq. (17), there are two parameters in the total loss function. We use the grid search method to show the influence of these parameters on our experimental results and to find their optimal values. Specifically, we set parameter α in the range of [0.1, 1] and parameter β in the range of [0.1, 1]. The results are shown in Figure 4. In general, it can be seen from the figure that our model obtains good results on the four epilepsy tasks across the tested values of α and β, which indicates that the proposed method is robust to these parameters.
We also examine the influence of the number of time windows on the experimental results. Specifically, we construct the DCNs with window numbers ranging from 5 to 15 and conduct experiments on the four classification tasks. The results are shown in Figure 5. As can be seen from Figure 5, for the NC vs. TLE & FLE task, the classification accuracy reaches its highest point when the number of time windows equals 6. For the NC vs. TLE task, the accuracy peaks when the number of time windows is 5. For the NC vs. FLE and TLE vs. FLE tasks, the model performs best when the number of time windows is 7. Moreover, the accuracy follows a similar trend across the four tasks as the number of windows changes: as the number of windows increases, the classification accuracy first increases and then decreases. This trend is reasonable: when the number of windows is too small, the information interaction among ROIs across different time windows is lost, whereas when the number of windows is too large, each time window may contain too much redundant information, which also decreases the classification accuracy.
We further examine the effect of dropout on the model results. The experimental results are shown in Figure 5, where the blue line corresponds to a dropout value of 0.5, the orange line to 0.4, and the green line to 0.3. The figure shows that the model achieves the best classification performance on all four tasks when the dropout value is 0.5. We believe this phenomenon is reasonable, because the lower the dropout value, the fewer nodes are retained, which easily causes information loss and thus lowers the classification accuracy.
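The dependence on the number of windows is easier to see with a small sketch of how a dynamic network sequence can be built from ROI time series. The sketch below assumes evenly spaced overlapping windows and Pearson correlation as the connectivity measure; the function name `dynamic_networks`, the window length, and the random input are assumptions for illustration and are not taken from the experimental setup.

```python
import numpy as np

def dynamic_networks(signals, n_windows, window_len):
    """Build one correlation network per overlapping time window.

    signals: array of shape (n_timepoints, n_rois).
    Returns an array of shape (n_windows, n_rois, n_rois).
    """
    n_timepoints, _ = signals.shape
    if window_len > n_timepoints:
        raise ValueError("window_len must not exceed the series length")
    # Evenly spaced start indices; windows overlap whenever
    # n_windows * window_len exceeds n_timepoints.
    starts = np.linspace(0, n_timepoints - window_len, n_windows).round().astype(int)
    # np.corrcoef treats rows as variables, hence the transpose.
    nets = [np.corrcoef(signals[s:s + window_len].T) for s in starts]
    return np.stack(nets)

# Random placeholder series (assumption: 120 time points, 5 ROIs).
rng = np.random.default_rng(0)
signals = rng.normal(size=(120, 5))
nets = dynamic_networks(signals, n_windows=6, window_len=40)
```

With a fixed series length, changing `n_windows` changes how finely the temporal evolution is sampled and how much consecutive windows overlap, which is the trade-off discussed above.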

B. Discriminative Functional Connectivity and Regions
Because not all ROIs are strongly associated with epilepsy, we use our proposed method to identify the most discriminative ROIs for understanding brain abnormalities. Specifically, we visualize the 10 most relevant ROIs and the connections between them for three tasks: NC vs. TLE, NC vs. FLE, and TLE vs. FLE. For each classification task, since the selected features differ across the folds of the 10-fold cross-validation, we choose the features that occur in all folds as the most important features. For NC vs. TLE, the most relevant ROIs are concentrated in the parahippocampal gyrus, olfactory cortex, and middle temporal gyrus. For NC vs. FLE, the key regions include the superior frontal gyrus, middle temporal gyrus, and supplementary motor area. For TLE vs. FLE, the most essential regions include the middle temporal gyrus and parahippocampal gyrus. These brain areas have also been suggested to be related to epilepsy in previous studies [34], [35], [36], [37]. In addition, according to Figure 6, although the most relevant brain regions found with the fMRI-based, DTI-based, and fMRI-DTI-based settings are not all the same, they are all related to epilepsy and complement each other, which further supports that mining specific and complementary features from different modalities can better describe brain networks.
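The rule of keeping only features that occur in every cross-validation fold amounts to a set intersection, as in the minimal sketch below. The function name `stable_features` and the ROI indices are hypothetical placeholders, not the regions reported above.

```python
def stable_features(selected_per_fold):
    """Return the features selected in every cross-validation fold."""
    folds = [set(fold) for fold in selected_per_fold]
    if not folds:
        return set()
    return set.intersection(*folds)

# Hypothetical ROI indices selected in three folds (illustration only).
selected = [
    [3, 7, 12, 40, 41],
    [7, 12, 25, 40],
    [1, 7, 12, 40, 88],
]
common = stable_features(selected)
```

Requiring membership in all folds is a strict criterion: a feature missed in a single fold is excluded, which favors stability over coverage.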

C. The Effectiveness of Multi-Modal Fusion
In the experiment, we utilize multi-channel GCN to extract specific and common features from the fMRI and DTI modalities and adaptively fuse them with the attention mechanism. MSTGC learns the specific features in each modality and the complementary features shared among the modalities. The significance of each feature is evaluated by the attention mechanism. We analyze these attention values on the 4 tasks of the epilepsy dataset, and the results are shown in Figure 7. From Figure 7, we can see that for all four tasks, the attention value of the common embeddings is larger than that of the DTI-specific features, and the attention value of the fMRI-specific features lies between them. This means that the complementary information among modalities is more important than the specific information within a single modality. This phenomenon is also consistent with the results of our ablation experiments in Table II. Figure 7 also shows that, on all four tasks, the fMRI-specific features, the DTI-specific features, and the common features all have relatively high attention values, which demonstrates the necessity of fusing modality-specific and complementary features. Moreover, in Table II, the diagnosis accuracy of MSTGC-5, which uses only the complementary features of fMRI and DTI, is clearly lower than that of our method. Therefore, the adaptive fusion of specific and complementary features from multi-modal brain networks can effectively improve the diagnosis performance.

D. Analysis of Embedding Features
To compare the methods more intuitively and further demonstrate the effectiveness of our proposed method, we visualize the learned embeddings on the epilepsy dataset. We take the output embedding of the last layer of MSTGC before SoftMax and plot it using t-SNE. The result is illustrated in Figure 8, in which the red dots represent healthy subjects and the green dots represent patients. According to Figure 8, the results of GCN and GCNeuro are not satisfactory, because the nodes with different labels are mixed together. Our method performs best: the embedding features it learns have the clearest boundaries among different classes. The reason our method can clearly distinguish healthy subjects from patients is that we introduce graph contrast learning, which makes the embedding features more discriminative and lets them reflect the global and local features of the original brain network.

VI. CONCLUSION
In this work, we propose a multi-channel spatio-temporal GCN that mines the dynamic topological features from rs-fMRI and DTI brain networks. Specifically, we employ an adaptive multi-channel GCN to learn the specific and complementary features of rs-fMRI and DTI. To reveal the global and local structures of the original brain networks, two graph contrast learning constraints, i.e., multi-modal fusion InfoMax and inter-channel InfoMin, are developed and imposed on the model. After obtaining the spatial characteristics of the multi-modal brain networks, LSTM units are leveraged to capture the temporal dynamic pattern of functional connectivities across multiple time windows. Finally, an MLP is used to classify the spatio-temporal features. The proposed MSTGC can effectively fuse the specific and complementary topological features of multi-modal brain networks together with their dynamic characteristics. Experimental results on the epilepsy dataset show that our proposed MSTGC achieves better diagnosis performance than state-of-the-art methods in identifying epilepsy patients from healthy controls.