Identification of image genetic biomarkers of Alzheimer’s disease by orthogonal structured sparse canonical correlation analysis based on a diagnostic information fusion

Abstract: Alzheimer’s disease (AD) is an irreversible neurodegenerative disease whose incidence increases yearly. Because AD patients develop cognitive impairment and personality changes, the disease places a heavy burden on families and society. Image genetics treats the structure and function of the brain as a phenotype and studies how genetic variation influences them. Based on the structural magnetic resonance imaging data and transcriptome data of AD and healthy control samples in the Alzheimer’s Disease Neuroimaging Initiative database, this paper proposes an orthogonal structured sparse canonical correlation analysis algorithm based on diagnostic information fusion. The algorithm adds structural constraints to the regions of interest (ROIs) of the brain, and integrating the diagnostic information of the samples improves the correlation analysis performance. The results showed that the algorithm could extract the correlation between the two data modalities and identify the brain regions most affected by multiple risk genes, together with their biological significance. In addition, we verified the diagnostic significance of the risk ROIs and risk genes for AD. The code of the proposed algorithm is available at https:


Introduction
Alzheimer's disease (AD) is a neurodegenerative disease caused by many factors, and its incidence is increasing yearly [1]. Changes in gene expression are linked to genetic variation, and the brain structure and function of AD patients also differ from those of the control group. Image genetics explores changes in brain structure and function from the perspective of genetic variation. Through a correlation analysis of imaging and genetic data, we can explore potential markers of AD.
Machine learning algorithms have been widely used in various bioinformatics fields, such as miRNA-disease relationship prediction. Ha and Park [2] proposed the metric learning for predicting miRNA-disease association (MLMD) algorithm, which can reveal not only novel miRNAs associated with disease, but also miRNA-miRNA and disease-disease similarities. Moreover, they proposed the matrix factorization with disease similarity constraint (MDMF) algorithm, which incorporates disease similarity constraints into matrix factorization to improve the prediction performance [3]. In addition, they proposed the simple yet effective computational framework (SMAP) algorithm to accurately predict miRNA-disease associations by combining miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity [4]. Recently, they incorporated a deep neural network to propose the node2vec-based neural collaborative filtering for predicting miRNA-disease association (NCMD) algorithm, which uses Node2vec to learn low-dimensional vector representations of miRNAs and diseases, and combines the linear ability of generalized matrix factorization with the nonlinear ability of a multi-layer perceptron [5]. They tested and confirmed the effectiveness of the algorithm on three datasets covering breast cancer, lung cancer and pancreatic cancer.
In the search for biomarkers of AD, many scholars have proposed algorithms related to the diagnosis and prediction of AD. Park et al. [6] proposed a deep learning model that integrates large-scale gene expression and DNA methylation data to predict AD. This method outperformed traditional machine learning algorithms that rely on typical dimensionality reduction methods and improved the accuracy of prediction. Wang et al. created the multi-task sparse canonical correlation analysis and regression (MT-SCCAR) model, combined with the annual total score of depression level, the clinical dementia rating scale, the functional activity questionnaire, and the neuropsychiatric symptom questionnaire. These four types of clinical data were used as compensation information and embedded in the algorithm through a linear regression. They confirmed the superiority and robustness of the algorithm on real and simulated data [7].
Canonical correlation analysis (CCA) is an algorithm that finds the maximum correlation between two sets of data. However, it is not suitable for an association analysis of high-dimensional data. For this reason, some scholars put forward the sparse canonical correlation analysis (SCCA) algorithm [8]. Building on CCA, the SCCA algorithm uses $\ell_1$ norm constraints to perform feature selection among high-dimensional features. However, because the $\ell_1$ norm constraint only considers sparsity at the level of individual features, it is only partially applicable to image data. Lesions in different brain regions may play a role at the same time; therefore, it is necessary to add structural constraints to the SCCA algorithm. Du et al. [9] proposed a graph-guided pairwise group lasso (GGL) and applied it to image data. GGL can be used in a data-driven mode that requires no prior knowledge. It treats each group as consisting of only two variables, which are extracted with similar or equal weights. They found that the performance of this algorithm on real data was superior to that of other competing algorithms. However, they did not consider the diagnostic information of the subjects. Previous studies showed that the addition of diagnostic information could effectively improve the correlation analysis performance of the algorithm [10,11]. In addition, there may be feature redundancy in imaging and genetic data, and orthogonal constraints can be added to the algorithm by linear programming.
We therefore propose integrating the structural magnetic resonance imaging (sMRI) and gene expression data of AD patients and controls using an orthogonal structured SCCA algorithm based on diagnostic information fusion. Specifically, after preprocessing the sMRI data, we extracted the gray matter volume of each region of interest (ROI) as a feature. Then, from the gene expression data, we selected the expression levels of the differentially expressed genes (DEGs) between the AD group and the control group as features. Based on the SCCA algorithm, this method adds GGL constraints on the image features and orthogonal constraints on both kinds of data, which can improve the performance of the association analysis and prevent feature redundancy from influencing the results. The experimental results showed that this algorithm was superior to other CCA-based algorithms and had a stronger correlation analysis ability. It yields top ROIs and top genes with biological and diagnostic significance, and the selected top biomarkers can provide a reference for the diagnosis of AD and for drug target discovery.

SCCA
The SCCA algorithm adds sparse constraints to the CCA algorithm. Given n samples, p ROIs and q genes, the sMRI data can be expressed as $X \in \mathbb{R}^{n \times p}$, and the gene expression data can be expressed as $Y \in \mathbb{R}^{n \times q}$. The objective of the SCCA algorithm is to adjust the canonical correlation weights $u \in \mathbb{R}^{p \times 1}$ and $v \in \mathbb{R}^{q \times 1}$ to maximize the correlation between $Xu$ and $Yv$. Its objective function is given in Formula (1), in which two regularization parameters control the sparsity of $u$ and $v$, respectively.
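For reference, a widely used penalized form of the SCCA objective is sketched below, with $X$ the n-by-p imaging matrix, $Y$ the n-by-q expression matrix, and $u$, $v$ the canonical weights. The symbols $\lambda_u$ and $\lambda_v$ for the sparsity parameters are assumptions, and the authors' Formula (1) may differ in notation or normalization.

```latex
\min_{u,\,v} \; -\,u^{\top} X^{\top} Y v
  \;+\; \lambda_u \lVert u \rVert_1
  \;+\; \lambda_v \lVert v \rVert_1
\qquad \text{s.t.} \qquad
\lVert X u \rVert_2^2 \le 1, \quad \lVert Y v \rVert_2^2 \le 1
```

Larger values of $\lambda_u$ or $\lambda_v$ push more entries of the corresponding weight vector to zero, which is what allows the method to act as a feature selector in high-dimensional data.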

GGL
The graph-guided fusion lasso differs from the conventional group lasso in that it does not rely on prior knowledge; however, it introduces estimation bias. GGL combines the group lasso with the graph-guided fusion lasso. It is defined in Formula (2), where $E$ is the edge set of a graph in which highly related features are connected.
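The pairwise structure of this penalty can be illustrated numerically. The sketch below assumes the commonly reported form of the pairwise group lasso, namely the sum over edges (i, j) of sqrt(u_i^2 + u_j^2); the function names, the chain graph and the example weights are hypothetical and are not taken from the authors' released code.

```python
import math


def ggl_penalty(u, edges):
    """Sum of sqrt(u_i^2 + u_j^2) over the edges (i, j) of the feature graph."""
    return sum(math.hypot(u[i], u[j]) for i, j in edges)


def chain_edges(p):
    """Illustrative graph (an assumption): connect each feature to its next neighbour."""
    return [(i, i + 1) for i in range(p - 1)]


# Example: two active neighbouring features, two inactive ones.
u = [3.0, 4.0, 0.0, 0.0]
print(ggl_penalty(u, chain_edges(len(u))))  # 5 + 4 + 0 = 9.0
```

Because each edge contributes a group norm over two weights, a pair is penalized jointly, which encourages connected features to be selected or discarded together.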

Orthogonal structured sparse canonical correlation analysis for diagnostic information fusion (OSSCCA-DIF)
This paper presents the OSSCCA-DIF algorithm, which builds on the SCCA algorithm and uses GGL as a structural constraint. Orthogonal constraints are used to prevent redundant features in the image genetics data from affecting the results. In addition, we add the diagnostic information of the samples to the algorithm as prior knowledge to improve its correlation performance. The objective function of the OSSCCA-DIF algorithm is given in Formula (3). In it, a vector stores the diagnostic information of the samples, two parameters control the orthogonal constraint strength of u and v, respectively, and a further parameter controls the strength of the GGL constraint.

The optimization algorithm
For the optimization of Formula (3), the Lagrange multiplier method can be used to take the partial derivatives with respect to the ROI weights u and the gene weights v, respectively. First, u is regarded as a constant term, and the objective function can be rewritten as Eq (4). Taking the derivative of Eq (4) with respect to v and setting it to zero gives Eq (5), from which the iterative solution formula for v can be written as Formula (6). Similarly, for u, if v is regarded as a constant term, the solution formula for u can be obtained, as shown in Eq (7). In Eq (7), the matrices associated with u and v are diagonal matrices, and their diagonal elements in the ith row are determined by $|u_i|$ ($i = 1, \cdots, p$) and $|v_i|$ ($i = 1, \cdots, q$), respectively. $G_u$ is also a diagonal matrix, whose diagonal elements in the ith row are defined for $i = 2, \cdots, p - 1$.
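The alternating scheme described above (fix one weight vector, solve for the other, repeat) can be sketched for plain SCCA as follows. This is a minimal illustration, not the authors' implementation: it covers only the sparsity terms and omits the GGL, orthogonality and diagnostic-information terms of OSSCCA-DIF. The function name, the parameters `lam_u` and `lam_v`, and the reweighting matrices with entries 1 / (2|u_i|) are assumptions based on the iteratively reweighted updates commonly used in the SCCA literature.

```python
import numpy as np


def scca_reweighted(X, Y, lam_u=0.1, lam_v=0.1, n_iter=50, eps=1e-8):
    """Alternating reweighted updates for plain SCCA (sparsity terms only)."""
    X = X - X.mean(axis=0)  # centre the columns so projections have zero mean
    Y = Y - Y.mean(axis=0)
    p, q = X.shape[1], Y.shape[1]
    u, v = np.ones(p) / p, np.ones(q) / q
    XtX, YtY, XtY = X.T @ X, Y.T @ Y, X.T @ Y
    for _ in range(n_iter):
        # Fix v: diagonal reweighting matrix built from the current |u_i|.
        D_u = np.diag(1.0 / (2.0 * np.abs(u) + eps))
        u = np.linalg.solve(XtX + lam_u * D_u, XtY @ v)
        u /= max(np.linalg.norm(X @ u), eps)  # rescale so that ||Xu||_2 = 1
        # Fix u: the same update for v.
        D_v = np.diag(1.0 / (2.0 * np.abs(v) + eps))
        v = np.linalg.solve(YtY + lam_v * D_v, XtY.T @ u)
        v /= max(np.linalg.norm(Y @ v), eps)  # rescale so that ||Yv||_2 = 1
    return u, v
```

Weights that become small receive a large diagonal entry in the next iteration and are pushed further toward zero, which is how the reweighting approximates the $\ell_1$ penalty.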

Data acquisition and pretreatment
We obtained 296 samples of AD, mild cognitive impairment (MCI), and healthy controls (HC) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu/). Table 1 provides the statistical information for each set of samples. We collected sMRI and gene expression data from these samples. For sMRI, we first corrected for head motion using the DiffusionKit software and then segmented the images into gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) using the Computational Anatomy Toolbox (CAT) of the Statistical Parametric Mapping (SPM) software package in Matlab. In this paper, we used the GM volumes of 140 brain regions as the ROI features. For the gene expression data, we used the limma algorithm to test for differential expression and obtained 962 DEGs. We extracted the expression levels of the DEGs as the genetic features of the samples.

Parameter selection and results of the algorithm
To obtain the best result, this paper took the canonical correlation coefficient (CCC), i.e., the correlation between the two projected data sets, as the performance measure of the algorithm. Using a grid search over the candidate values [0.01, 0.1, 1], the five hyperparameters of the OSSCCA-DIF algorithm were selected on the real data set (Figure 1). The sixth parameter combination gave the best result, and its values (0.01, 0.01, 0.01, 0.1 and 1) were taken as the optimal parameters. We detail the optimal parameter information in Table S1 of the Supplementary material. It can be seen from the figure that the CCCs obtained with different parameter combinations differ considerably. The proposed algorithm is therefore sensitive to its parameters, which affects the stability of the model. In practical applications, appropriately enlarging the search range of the parameters can help in selecting the optimal result.
Figure 2 shows the heat map of the weights u and v. Specifically, the abscissa of the graph represents either an ROI or a gene. The colors represent the weights of the ROIs or genes; the darker the color, the higher the weight. After taking the absolute values of u and v, we give the names and weights of the top 10 ROIs and the top 10 genes with the greatest weights in Table 2. Figure 3 is a visual display of the top 10 ROIs. Figure 4 displays the enrichment analysis results of the top 10 genes. We discuss the biological significance of these ROIs and genes, and the relationship between the enriched pathways and AD, in detail in the discussion section.
To show that the proposed algorithm has a good correlation analysis ability, we compared its CCC with those of the SCCA-FGL, SCCA, and CCA algorithms (Table 3); the CCC of the proposed algorithm was higher than those of the other three algorithms.
In addition, we present the Pearson correlation heat map of the top 10 ROIs and top 10 genes in Figure 5. Among them, lHip and ZC2HC1C reached the maximum positive correlation (0.4551), while lMidOccGy and POM121L12 reached the maximum negative correlation (−0.3227).
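The grid search and the CCC used in this section can be sketched as follows. This is an illustration under stated assumptions: the CCC is taken to be the Pearson correlation between the projections Xu and Yv, `fit` is a hypothetical solver that returns (u, v) for one parameter combination, and the helper names are not from the authors' code.

```python
import itertools
import math

GRID = [0.01, 0.1, 1]  # candidate values reported in the text


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def project(M, w):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, w)) for row in M]


def ccc(X, Y, u, v):
    """Canonical correlation coefficient, assumed here to be corr(Xu, Yv)."""
    return pearson(project(X, u), project(Y, v))


def grid_search(X, Y, fit, n_params=5):
    """Evaluate every combination; `fit` is a hypothetical solver returning (u, v)."""
    best = None
    for params in itertools.product(GRID, repeat=n_params):
        u, v = fit(X, Y, params)
        score = abs(ccc(X, Y, u, v))
        if best is None or score > best[0]:
            best = (score, params)
    return best


combos = list(itertools.product(GRID, repeat=5))
print(len(combos), combos[5])  # 243 (0.01, 0.01, 0.01, 0.1, 1)
```

If the combinations are enumerated in this lexicographic order and the parameters are listed in the same order as in the text, the sixth combination coincides with the reported optimum. This is an observation about the sketch, not a statement about how the runs were actually indexed.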

Diagnostic performance verification of top markers
In this section, we used the Receiver Operating Characteristic (ROC) curve to verify the diagnostic performance of the top markers ( Figure 6). ROC curves have been widely used in the biomedical field [12,13]. Based on the logistic regression algorithm, we constructed the diagnosis model using the top 10 ROIs and top 10 genes. Among them, the AUC of the diagnosis model constructed by the ROIs reached 0.898. The AUC of the diagnosis model created by the genes reached 0.853. In addition, we present the diagnostic models constructed using the top 10 ROIs and the top 10 genes from several other algorithms in Figure S1 of the Supplementary material. The AUC of the top 10 ROIs selected by the proposed algorithm was higher than that of several other algorithms. The AUC of the top 10 genes selected was slightly lower than that of the CCA algorithm. In addition, we present the details of the ROC curves of the diagnostic models constructed by the top ROIs/genes selected by the OSSCCA-DIF, SCCA-FGL, SCCA, and CCA algorithms in Table S2 and Table S3 of the Supplementary material.
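The AUC reported above can be computed from any set of predicted scores, for example the probabilities returned by a fitted logistic regression model. The sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the function name and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = AD, 0 = control; scores are model outputs.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why values such as 0.898 and 0.853 indicate that the selected markers separate the groups well.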

Experimental results on the synthetic dataset
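To further verify the effectiveness of the algorithm, we constructed two synthetic data sets ($X_1 \in \mathbb{R}^{n_1 \times p_1}$, $Y_1 \in \mathbb{R}^{n_1 \times q_1}$, $X_2 \in \mathbb{R}^{n_2 \times p_2}$, $Y_2 \in \mathbb{R}^{n_2 \times q_2}$) and generated two sets of weights ($u_1 \in \mathbb{R}^{p_1}$, $v_1 \in \mathbb{R}^{q_1}$, $u_2 \in \mathbb{R}^{p_2}$, $v_2 \in \mathbb{R}^{q_2}$). Here, $n_1 = 100$ and $n_2 = 500$, and the feature dimensions were set to 300, 500 and 800. In addition, we generated a latent variable for each data set, and $X_i$ and $Y_i$ ($i = 1, 2$) were drawn from normal distributions whose parameters specify the variance and the noise level of the noise. We set l to 10 and present the CCCs of several algorithms on the two synthetic data sets in Table 4. As can be seen from the table, the CCC of the proposed algorithm is larger than that of the other algorithms on both data sets.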
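One common way to build such a synthetic data set is to draw a shared latent variable and add Gaussian noise to its outer product with sparse ground-truth weights. The sketch below follows that recipe as an assumption; the exact generative model, the number of active features and the noise settings used in the paper are not recoverable from the text, and all names and sizes here are illustrative.

```python
import numpy as np


def make_synthetic(n, p, q, n_active_u=20, n_active_v=30, noise=1.0, seed=0):
    """Draw (X, Y) sharing one latent variable z, with sparse true weights u and v."""
    rng = np.random.default_rng(seed)
    u = np.zeros(p)
    u[:n_active_u] = 1.0  # only the first features carry signal (assumed pattern)
    v = np.zeros(q)
    v[:n_active_v] = 1.0
    z = rng.standard_normal(n)  # latent variable shared by both modalities
    X = np.outer(z, u) + noise * rng.standard_normal((n, p))
    Y = np.outer(z, v) + noise * rng.standard_normal((n, q))
    return X, Y, u, v


X, Y, u, v = make_synthetic(n=100, p=300, q=500)
ccc_true = np.corrcoef(X @ u, Y @ v)[0, 1]  # CCC under the true weights
print(X.shape, Y.shape, round(float(ccc_true), 3))
```

Since the true weights are known, an algorithm can be judged both by the CCC it reaches and by how closely its estimated weights match the planted sparsity pattern.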

Results of ablation experiments on real and simulated datasets
Additionally, we conducted ablation experiments on the proposed OSSCCA-DIF algorithm. Specifically, by removing each component of the OSSCCA-DIF algorithm either individually or in pairs (except for the sparse constraints), we compared the CCC on the real and simulated datasets under the same parameter settings (Table 5). The objective functions compared (scenarios 1-6) are given in Eqs (8)-(13).

Discussion
AD is a neurodegenerative disease for which there is no effective treatment. As the population ages, the incidence of AD is increasing year by year. Image genetics can mine disease-related markers by integrating imaging and genetic data through a series of correlation analysis algorithms. Therefore, this paper proposed the OSSCCA-DIF algorithm. Based on the SCCA algorithm, this algorithm adds orthogonal constraints on the weight vectors u and v and GGL structural constraints on the image feature weight u. In addition, the diagnostic information of the samples is added to the algorithm. The experimental results showed that, when integrating sMRI and gene expression data from real samples, this algorithm performed better than other CCA-based algorithms.
Most of the top ROIs mined by the proposed algorithm have been shown to be closely related to AD. First, our algorithm identified rEnt as the brain region with the largest weight, with a weight value of 0.06482. The entorhinal cortex (EC) is widely considered to be the earliest site of pathology in AD [14,15]. Thaker et al. [16] analyzed the relationship between the thickness of the entorhinal cortex (sMRI) and pathological changes (autopsy) in 50 AD patients and found that the thickness of the entorhinal cortex may be related to the severity of AD. Wang et al. [17] found that, compared with the control group, the volume and average thickness of the right entorhinal cortex in AD patients were lower and the expression level of lncRNA BACE1-AS in plasma exosomes isolated from AD patients was significantly increased. They therefore proposed combining the level of BACE1-AS in peripheral blood exosomes with the volume and thickness of the right entorhinal cortex as a potential biomarker of AD. Second, our algorithm identified lAmy and rAmy as top ROIs. These two brain regions are the left and right amygdala. As an important structure for emotional learning and memory, the amygdala is related to a range of brain disorders, including AD. The MRI volume of the amygdala may be related to the severity of dementia in AD, and the amygdala shows neuron loss and atrophy in AD patients [18-20]. The amygdala also has an excellent diagnostic value in sMRI of AD [21]. Finally, the top brain regions rParHipGy and lHip found by our algorithm have also been shown to be closely related to AD. In the memory system of the human brain, the connection between the posterior cingulate cortex and the hippocampus, either direct or indirect through the parahippocampal gyrus, is essential. These brain regions all play a vital role in the early progression of AD [22].
In a study evaluating the correlation between brain metabolism and orientation in AD, better orientation performance was related to higher brain metabolism in regions such as the left middle occipital gyrus, and a higher CERAD identification score was more strongly related to metabolic activity in the left medial temporal lobe regions (including the hippocampus, parahippocampal gyrus, and left fusiform gyrus) [23].
The top genes (ADAM17, CCND1, HIST1H2BM, ALDH3A1) determined by our algorithm have been shown to be either directly or indirectly related to AD. A well-known feature of AD is the accumulation of extracellular Amyloid-β (Aβ) plaques and of neurofibrillary tangles composed of tau in neurons [24-26]. Aβ is a proteolytic fragment cleaved from its precursor, the amyloid precursor protein (APP), by β- and γ-secretase, and tau is a microtubule-associated protein involved in microtubule stability. The main manifestations in AD patients are declines in memory, attention, spatial orientation, language ability, and olfactory function, which are all related to the deposition of tau protein and APP. ADAM17 has been shown to be a potential therapeutic target for AD because it can act as an α-secretase that regulates APP processing, thereby affecting the production of Aβ.
Additionally, the protease encoded by ADAM17 plays a role in the shedding of tumor necrosis factor-α (TNF-α). TNF-α is a key pro-inflammatory cytokine, and its signal transduction aggravates Aβ and tau pathology in vivo [27]. To explore the role of propionic acid (PPA) in the pathogenesis of AD, Aliashrafi et al. [28] selected 284 genes related to PPA and AD and identified CCND1 as an important hub gene, bottleneck gene, and seed gene through a network analysis and a Molecular Complex Detection (MCODE) analysis. Zeng et al. [29] also determined that CCND1 can be a core target for AD treatment. H2BC14 (HIST1H2BM) is a core component of the nucleosome. The nucleosome assembly protein 1-like 5 (NAP1L5) is downregulated in the brain tissue of AD patients, and the overexpression of NAP1L5 can alleviate APP metabolism and Tau phosphorylation [30]. ALDH3A1 is a protein-coding gene that encodes member A1 of the aldehyde dehydrogenase 3 family. In the enrichment analysis of the top 10 genes, we also found that aldehyde dehydrogenase is closely related to AD. The relationship between the other genes and AD needs further study.
Through the enrichment analysis of the top 10 genes, we found that many significant pathways were related to the occurrence and development of AD. McKibben and Rhoades [31] studied the role of the proline-rich region (PRR) in regulating the interaction between Tau and soluble tubulin. Aldehyde dehydrogenase can balance amine metabolism in neurodegenerative diseases such as AD [32]. Tao et al. [33] indicated that aldehyde dehydrogenase-2 could be a potential target for AD treatment. Rapamycin can block the G1/S transition in AD patients and normal controls; however, experiments have confirmed that, compared with the control group, AD patients who used rapamycin still progressed to the late cell cycle [34]. Cyclin-dependent kinase 5 (Cdk5) can also be used as a potential therapeutic target for AD [35].
Finally, we also determined the diagnostic significance of the top markers for AD. We verified the AUC of each top ROI and gene in the test set and found that the AUCs of all ROIs and genes were greater than 0.5. In addition, using the logistic regression algorithm, we found that the AUCs of the diagnosis models constructed from the top 10 ROIs and the top 10 genes reached 0.898 and 0.853, respectively, and both were within reasonable confidence intervals.

Conclusions
In this paper, we identified risk brain regions and risk genes closely related to the occurrence and development of AD by studying the relationship between two data modalities, sMRI and gene expression data. The proposed OSSCCA-DIF algorithm has advantages in correlation analysis performance and biomarker selection. However, most algorithms based on the SCCA algorithm assume that the relationship between imaging and genetic data is linear, and this assumption does not necessarily hold for real data. Therefore, in future research, we will try to introduce a deep-learning algorithm to address this limitation.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Conflict of interest
The authors declare there is no conflict of interest.