Identification and Analysis of Alzheimer’s Candidate Genes by an Amplitude Deviation Algorithm

Background: Alzheimer’s disease (AD) is the most common form of senile dementia, yet its pathological mechanisms are not fully understood. To investigate these mechanisms, researchers have applied diverse computational algorithms to AD-related DNA microarray data. More efficient algorithms are needed to process DNA microarray data and identify AD-related candidate genes. Methods: In this paper, we propose an algorithm based on the following observation: when an acrobat walks along a steel wire, his or her body must swing to some degree; if the swing can be controlled, the acrobat keeps balance; otherwise, the acrobat falls. Based on this idea, we designed a simple yet practical algorithm termed the Amplitude Deviation Algorithm (ADA). Deviation, overall deviation, deviation amplitude, and 3σ are introduced to characterize ADA. Results: Using ADA, 52 candidate genes for AD were identified, and the implications of some of these genes for AD pathogenesis are discussed. Conclusions: Through the analysis of these AD candidate genes, we believe that AD pathogenesis may be related to abnormal signal transduction (AGTR1 and PTAFR), decreased protein transport capacity (COL5A2 (221729_at), COL5A2 (221730_at), COL4A1), impaired axon repair (CNR1), and intracellular calcium dyshomeostasis (CACNB2, CACNA1E). However, because these genes were identified only by computation using ADA, their potential implications for AD pathology should be further validated by wet lab experiments.


Introduction to DNA microarrays and their application to AD genes
With little known about the cause of AD, it is necessary to identify more AD-related candidate genes. Prior to the year 2000, identifying AD candidate genes involved a large amount of time and money and yielded limited results. However, with the development of DNA microarray technology, researchers began to apply this technology in order to better identify potential AD genes [15].
DNA microarrays (also called gene chips) are a complex technology in molecular biology. A DNA microarray is typically a glass slide onto which DNA molecules are fixed in an orderly manner at specific locations called spots (or features). The DNA in a spot may either be genomic DNA or short stretches of oligonucleotide strands that correspond to a specific gene. The spots are printed onto the glass slide by a robot or are synthesized by the process of photolithography [16].
Researchers use DNA microarrays to measure the expression levels of thousands of genes simultaneously. Since a DNA microarray contains many spots, a single experiment yields the expression levels of many genes, whereas a Northern blot measures the expression of only one gene at a time [17].
To analyze DNA microarray data and identify AD candidate genes, researchers began designing computational methods (algorithms) to process the data. Currently, the most commonly used algorithms are the K-means clustering algorithm [18], the Principal Component Analysis (PCA) algorithm [19], and the Ant Colony Optimization (ACO) algorithm [20].

Organization of gene expression levels
In this manuscript, the original DNA microarray data, which include 22,283 genes, were downloaded from the GEO DataSets database at NCBI [21]. The data were obtained from control, incipient, moderate, and severe AD patients. All of these data are organized in a matrix format (Table 1). Table 1 consists of 22,283 rows and 9 columns, where the 22,283 rows correspond to the expression levels of the 22,283 genes, and the 9 columns represent the 9 samples (experiments). The matrix in Table 1 comes from male controls and is denoted by $T^{ctrl}$ in this paper. The other three matrices, for the incipient, moderate, and severe stages, are denoted by $T^{incip}$, $T^{moder}$, and $T^{severe}$, with 7, 8, and 7 columns, respectively.

The simple idea behind the amplitude deviation algorithm (ADA)
To identify AD candidate genes, we propose a new model based on the following principle: when an acrobat walks along a steel wire, his or her body must swing to some degree; if the swing can be controlled, the acrobat keeps balance; otherwise, the acrobat falls. Correspondingly, each gene can be seen as an acrobat, and the change in the expression level of each gene can be compared to the swing of the acrobat's body on the tightrope. In the controlled stage (i.e., acrobats maintaining balance on the tightrope), all genes are self-regulated, so their expression levels stay within a certain range. However, when AD pathology develops in the brain, the expression levels of certain genes go beyond the controlled range, analogous to acrobats losing their balance. These are the genes that may be associated with AD.

Data pre-processing
The data for the four stages are organized as the matrices $T^{ctrl}_{22{,}283 \times 9}$, $T^{incip}_{22{,}283 \times 7}$, $T^{moder}_{22{,}283 \times 8}$, and $T^{severe}_{22{,}283 \times 7}$, respectively. Since each column of a matrix comes from a different sample, the values in different columns are not directly comparable. To solve this problem, each of the four matrices is processed column by column: every entry is replaced by its value minus the average value of its column. We define this difference as the deviation:
$$s(i,j) = t(i,j) - \frac{1}{22{,}283}\sum_{k=1}^{22{,}283} t(k,j) \qquad (1)$$
where $t(i,j)$ is the expression level of gene $i$ in sample $j$ and $s(i,j)$ is the corresponding element of the deviation matrix.
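The column-centering step described above can be sketched in a few lines of Python. This is a minimal illustration on a toy matrix, not the authors' implementation; the function name `deviation_matrix` and the toy values are assumptions introduced here.

```python
from statistics import fmean

def deviation_matrix(expr):
    """Subtract each column's (sample's) mean from every entry in that column.

    expr is a list of rows (genes); each row holds one value per sample.
    """
    n_cols = len(expr[0])
    # One mean per sample, taken over all genes.
    col_means = [fmean(row[j] for row in expr) for j in range(n_cols)]
    return [[row[j] - col_means[j] for j in range(n_cols)] for row in expr]

# Toy data: 3 genes x 2 samples (hypothetical values, not from the dataset).
toy = [[1.0, 10.0],
       [2.0, 20.0],
       [3.0, 30.0]]
s = deviation_matrix(toy)
```

After centering, every column of `s` averages to zero, which is what makes values from different samples comparable.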
In the process of obtaining data, different experimental conditions (such as samples, equipment, and temperatures) may introduce errors. To reduce these potential errors, we average the deviations of each gene over the samples of the same stage; this average is the overall deviation $D(i)$:
$$D(i) = \frac{1}{n}\sum_{j=1}^{n} s(i,j), \qquad i = 1, 2, \ldots, 22{,}283 \qquad (2)$$
where $n$ = 9, 7, 8, and 7 for the control, incipient, moderate, and severe stages, respectively.
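The averaging over samples of one stage can be illustrated as follows. This is a small sketch under the assumption that the deviation matrix of a stage is stored as a list of gene rows; the name `overall_deviation` and the numbers are hypothetical.

```python
from statistics import fmean

def overall_deviation(dev):
    """Average each gene's deviations over all samples of one stage."""
    return [fmean(row) for row in dev]

# Hypothetical deviation matrix for one stage: 2 genes x 3 samples.
dev_stage = [[0.5, 1.0, 1.5],
             [-2.0, 0.0, -1.0]]
d = overall_deviation(dev_stage)
```

Each stage yields one value per gene regardless of how many samples (9, 7, 8, or 7) the stage contains, so the four stages can be compared gene by gene.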

Mathematical representation of the ADA
In this paper, each gene is compared to an acrobat, as described above. When an acrobat walks along a tightrope, his or her body must swing to some degree. Correspondingly, the expression level of each gene must change to some degree. More specifically, for each gene there are differences between the control stage and the incipient, moderate, and severe stages, and we use these differences to characterize the changes. We therefore introduce the deviation amplitude, which can be interpreted as this difference:
$$A_{incip}(i) = D_{incip}(i) - D_{ctrl}(i) \qquad (3)$$
$$A_{moder}(i) = D_{moder}(i) - D_{ctrl}(i) \qquad (4)$$
$$A_{severe}(i) = D_{severe}(i) - D_{ctrl}(i) \qquad (5)$$
where $D_{ctrl}(i)$, $D_{incip}(i)$, $D_{moder}(i)$, and $D_{severe}(i)$ denote the overall deviations in the four stages, and $A_{incip}(i)$, $A_{moder}(i)$, and $A_{severe}(i)$ denote the deviation amplitudes in the incipient, moderate, and severe stages, respectively.
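As a brief illustration, the deviation amplitude can be computed per gene from two overall-deviation vectors. The sketch below assumes the sign convention "disease stage minus control", which the text does not state explicitly; the function name and the toy values are likewise hypothetical.

```python
def deviation_amplitude(d_stage, d_ctrl):
    """Per-gene difference between a disease stage and the control stage.

    Assumed sign convention: stage minus control, so a positive value means
    the gene's overall deviation rose relative to control.
    """
    if len(d_stage) != len(d_ctrl):
        raise ValueError("both vectors must cover the same genes")
    return [ds - dc for ds, dc in zip(d_stage, d_ctrl)]

# Hypothetical overall deviations for 3 genes.
d_ctrl = [0.0, 1.0, -0.5]
d_incip = [0.25, 1.0, -1.5]
a_incip = deviation_amplitude(d_incip, d_ctrl)
```

The same call with the moderate-stage and severe-stage vectors would give the other two amplitudes.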
Here, the statistical histogram in Figure 1A shows that $A_{incip}(i)$ follows a normal distribution. Similarly, both $A_{moder}(i)$ and $A_{severe}(i)$ follow normal distributions (their averages are almost equal to 0, and their variances are 0.417 and 0.536, respectively).
The normal distribution follows the 3σ principle, which asserts that 99.7% of the data fall within 3σ of the mean. Any sample that falls outside this range is considered abnormal. We use the 3σ principle as the criterion for how large a range our "acrobats" can have while maintaining stability. That is to say, when the change in the expression level of a gene is greater than 3σ or less than −3σ (i.e., $|A_t(i) - \bar{A}_t| > 3\sigma$, where $t$ denotes the stage and $\bar{A}_t$ is the average of $A_t$) and its overall deviation is consistently and significantly upregulated or downregulated, the gene is a candidate for AD.
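The 3σ screen can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes the population standard deviation (`statistics.pstdev`) and interprets "consistently" as "flagged in the same direction in every disease stage"; both choices, the function names, and the toy data are assumptions.

```python
from statistics import fmean, pstdev

def three_sigma_flags(amplitudes):
    """Return +1, -1, or 0 per gene: above mean+3σ, below mean-3σ, or inside."""
    mu = fmean(amplitudes)
    sigma = pstdev(amplitudes)  # population σ (an assumption)
    flags = []
    for a in amplitudes:
        if a - mu > 3 * sigma:
            flags.append(1)
        elif a - mu < -3 * sigma:
            flags.append(-1)
        else:
            flags.append(0)
    return flags

def candidates(amp_by_stage):
    """Indices of genes flagged in the same direction in every stage."""
    per_stage = [three_sigma_flags(a) for a in amp_by_stage]
    return [i for i, f in enumerate(zip(*per_stage))
            if f[0] != 0 and all(x == f[0] for x in f)]

# Toy amplitudes for 20 genes in three stages; only the last gene swings widely.
incip = [0.0] * 19 + [10.0]
moder = [0.0] * 19 + [12.0]
severe = [0.0] * 19 + [15.0]
hits = candidates([incip, moder, severe])
```

A gene that exceeds the threshold in only one stage, or in opposite directions in different stages, is not reported by `candidates`.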
The computation was performed, and 12 genes were identified as AD candidate genes (Table 2). Among them are 7 genes whose average deviations are significantly and consistently upregulated (listed in the right column of Table 2), and 5 genes whose average deviations are significantly and consistently downregulated (listed in the left column of Table 2).
This result is not ideal: the number of AD candidate genes is small, and the collected data were affected by noise during the experimental process. If the conditions are relaxed appropriately while the basis of the original result is kept, the outcome should improve. Therefore, the specific ADA is formulated, in which the deviation amplitude is redefined in the forms of eqns. (6), (7) and (8). The statistical histogram of $A_{incip}(i)$ under this definition is shown in Figure 1B. Here, $A_{incip}(i)$ does not follow a normal distribution, and neither do $A_{moder}(i)$ and $A_{severe}(i)$.
In statistics, σ is commonly given by σ = √D(x), with D(x) = E(x²) − [E(x)]², where E(x) and D(x) are the expectation and variance of x, respectively.
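A short numerical check can make the role of E(x) in this formula concrete. The sketch below is illustrative only: it compares a toy sample with mean zero against a second sample that has the same squares but a positive mean. Absolute values are used purely as a convenient way to build such a sample; this is an assumption for the example and is not claimed to be the form of eqns. (6) to (8).

```python
from math import sqrt
from statistics import fmean

def sigma(xs):
    """Population σ computed as sqrt(D(x)) with D(x) = E(x^2) - E(x)^2."""
    ex = fmean(xs)
    ex2 = fmean(x * x for x in xs)
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(0.0, ex2 - ex * ex))

x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # E(x) = 0
y = [abs(v) for v in x]           # same E(x^2), but E(y) > 0

sigma_x = sigma(x)
sigma_y = sigma(y)
```

Because both samples share the same E(x²), the one with the larger mean has the smaller D(x), and therefore the smaller σ and the smaller 3σ threshold.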
When the deviation amplitude is defined as in eqn. (3), the value of E(x) is almost equal to zero. However, when the deviation amplitude is defined as in eqn. (6), the value of E(x) is greater than zero, which leads to a decreased value of D(x); correspondingly, 3σ becomes smaller as well. Since $|A_t(i) - \bar{A}_t| > 3\sigma > 3\sigma^{*}$ (where $\sigma^{*}$ denotes the value of σ obtained from eqns. (6), (7) and (8)), $3\sigma^{*}$ is selected as the largest range within which the acrobats can keep their balance. Since $A_t(i)$ ($t$ denotes the stage) does not follow a normal distribution, we propose the following two criteria to identify AD candidate genes:
Criterion 1:
Criterion 2:

Implementation of the ADA to identify AD candidate genes
Step 1: Use eqn. (1) to calculate the deviation matrices in the four stages.
Step 2: Use eqn. (2) to calculate the overall deviation of each gene in the four stages.
Step 3: Use eqns. (3), (4) and (5) to calculate the three deviation amplitudes of each gene (incipient, moderate, and severe stages).
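Steps 1 to 3 can be chained into a compact sketch. The code below is one possible arrangement, not the authors' implementation: the function names, the dictionary keyed by stage name (with `"ctrl"` for the control stage), the "stage minus control" sign convention, and the toy matrices are all assumptions made for illustration. The two selection criteria are not implemented, since they are not reproduced in the text.

```python
from statistics import fmean

def center_columns(m):
    """Step 1: subtract each column's mean from its entries."""
    means = [fmean(col) for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def row_means(m):
    """Step 2: average each gene's deviations over the samples of a stage."""
    return [fmean(row) for row in m]

def ada_steps(stages):
    """Step 3: per-gene amplitude of every disease stage relative to control.

    stages maps a stage name to its genes x samples matrix and must
    contain the key "ctrl" (a naming assumption of this sketch).
    """
    overall = {name: row_means(center_columns(m)) for name, m in stages.items()}
    ctrl = overall["ctrl"]
    return {name: [d - c for d, c in zip(vals, ctrl)]
            for name, vals in overall.items() if name != "ctrl"}

# Toy input: 2 genes, with a different number of samples per stage.
amps = ada_steps({
    "ctrl":   [[1.0, 3.0], [3.0, 5.0]],
    "incip":  [[4.0], [2.0]],
    "moder":  [[2.0, 2.0], [2.0, 2.0]],
    "severe": [[0.0], [4.0]],
})
```

The result holds one amplitude vector per disease stage, which is the input that a thresholding step would then screen.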

Results
Here, we identify 52 candidate genes. Of these, 27 have deviation amplitudes that rise consistently (Figure 2A), and 25 have deviation amplitudes that decline consistently (Figure 2B). It is worth noting that the 52 genes discovered with the ADA also contain the previously identified AD candidate genes. Thus, the second set of equations not only retains the original results but also relaxes the conditions; it should therefore better reflect the actual number of genes whose expression is related to AD. Tables 3 and 4 list the candidate genes selected by the first and second criteria, respectively.

Analysis of AD candidate genes
The healthy human body maintains physiological homeostasis in countless aspects, and this balance depends on coordination among proteins in the human body. Proteins are subject to the regulation of gene expression, so the coordination among proteins depends on the coordination of gene expression levels. Once the coordination of gene expression levels is destroyed, the body will enter pathophysiological status, resulting in disease. Since AD is a chronic neurodegenerative disorder that is characterized by memory impairment, cognitive dysfunction, and behavioral disturbances, this paper claims that AD revolves around a nervous system imbalance, which is associated with an imbalance of gene expression levels. Therefore, these genes whose expression levels are out of balance are AD candidate genes.
The ADA proposed in this paper attempts to find those genes with dramatic changes in gene expression levels, which then leads to an imbalance. It is possible that this imbalance may either be the cause or outcome of AD.
After analyzing the locations and functions of the proteins encoded by the identified genes, the following features were discovered and certain new pathological factors of AD were conjectured. First, most of the proteins encoded by these identified genes are located in the membrane and the cytoplasm (the distributions of proteins are shown in Figure 3A).
The functions of most proteins encoded by these identified genes correlate with signal transduction, metabolism, regulation of transcription, protein transport, immune response, and protein degradation, especially signal transduction, metabolism, regulation of transcription, and transport (Figure 3B). The locations and functions of these proteins suggest that analyzing the proteins at the membrane and in the cytoplasm is helpful for exploring AD pathology.
Moreover, proteins located specifically on membranes are mainly involved in signal transduction and protein transport (Figure 3C), and proteins located specifically in the cytoplasm mainly correlate with protein transport and degradation (Figure 3D). Based on these findings, we infer that the factors causing AD are significantly associated with signal transduction, metabolism, regulation of transcription, and protein transport and degradation. Loss of proper signal transduction and signaling pathway function [22,23], loss of regulation of transcription [24], and dysregulation of metabolism [25] have all been correlated with AD progression in previous studies.
Regarding the particular genes studied, the abnormal expression levels of AGTR1 and PTAFR are associated with increased protein kinase C (PKC) activity, which can promote the accumulation of amyloid-β (Aβ), potentially leading to AD. The protein AGTR1 (angiotensin II receptor type 1), encoded by the gene AGTR1, binds angiotensin II; this binding generates diacylglycerol, which in turn activates PKC [26], and PKC markedly decreases Aβ release from cells [27]. Since the expression level of AGTR1 consistently increases with AD progression (Figure 4), excessive cellular accumulation of Aβ will be promoted, which may lead to AD [7]. There is evidence that angiotensin II receptor blockers may be a viable option for AD treatment [28], possibly because these blockers prevent PKC from decreasing Aβ release.
An increase in the protein PTAFR, the membrane receptor (PAF-R) for platelet-activating factor (PAF), a phospholipid signaling molecule, increases PAF binding at the membrane surface, which can activate the phosphatidylinositol pathway and phospholipase C [7]. In the phosphatidylinositol pathway, extracellular signaling molecules that bind G protein-coupled receptors on the cell surface cause the hydrolysis of phosphatidylinositol diphosphate into two products: inositol triphosphate (IP3) and diacylglycerol (DG). DG can activate PKC [29]. PKC markedly decreases Aβ release from cells [27] and increases the accumulation of Aβ, which may lead to AD [7]. A recent study suggests that aberrant lipid signaling is correlated with AD [30].
Furthermore, the abnormal expression of the gene CNR1 correlates with the absence of long-term depression (LTD), which may impair long-term potentiation (LTP) and in turn induce AD. The protein CNR1 (cannabinoid receptor 1), which is mainly distributed in the central nervous system (CNS), is involved in preventing neurotransmitter release. Since CNR1 is significantly downregulated in AD (Figure 4), LTD is not activated [31,32]. The absence of LTD may contribute to abnormalities in learning and memory-related behavior [33], which are symptoms of AD.
Additionally, the abnormal expression of the genes COL5A2 (221729_at), COL5A2 (221730_at), and COL4A1 is associated with a decrease in energy supply, which can lead to neuronal apoptosis, a neuropathological feature of AD [34]. All of the proteins encoded by these genes are involved in phosphate transport. Phosphorus is one of the main elements of the human body; it is heavily involved in energy metabolism and is an important component of adenosine triphosphate (ATP) [35]. A recent study revealed that higher levels of serum phosphorus are correlated with an increased risk of dementia [36]. More research is needed on the connection between phosphorus and AD. Figure 5 shows that the expression of the proteins encoded by these genes (COL5A2 (221729_at), COL5A2 (221730_at), COL4A1) is consistently downregulated, which may contribute to energy metabolism disorders and in turn affect activities such as signal transduction, transcription, and protein degradation and transport, which, as mentioned previously, are related to AD. Neuronal apoptosis may also be a result of this downregulation.

Analysis of candidate genes associated with metal ions
Of the 52 identified genes, 14 (about one-quarter of the total) are associated with metal ions. After analyzing these metal ion genes, the following characteristics are observed. First, most of the genes associated with metal ions are related to calcium ions: as shown in Figure 6A, calcium accounts for 53% of the associations. Numerous studies have shown a close relationship between calcium dyshomeostasis and AD [37-39]; therefore, it is necessary to study the genes associated with calcium ions. In fact, a recent study that added new evidence for the Calcium Hypothesis of Alzheimer's and Brain Aging details how changes in calcium signaling can affect neurons and, in some cases, promote death and disease [40]. The connection between AD and calcium has been an active area of research for decades; since calcium plays such a significant role in neuronal function, it is no surprise that dysregulation of calcium may promote AD.
After analyzing the genes associated with calcium ions, we found that most of the proteins encoded by these genes are distributed in the membrane and extracellular regions. The percentages of these protein locations are shown in Figure 6B. In addition, we found that proteins encoded by the genes associated with calcium ions are mainly involved in signal transduction and metabolism. The percentages of these protein functions are shown in Figure 6C.
We also investigated the CACNB2 and CACNA1E proteins, which are the β2 subunit and α1E subunit, respectively, of a voltage-dependent calcium channel (VDCC). The VDCC, which is located in the cell membrane, controls the influx of calcium ions into the cell. VDCC activity is determined by the α1 subunit, whose functions are regulated by a β subunit (β1-β4) [41].
Since the expression level of the gene CACNB2 is significantly downregulated as AD worsens (Figure 7), the ability of the β2 subunit to regulate the α1 subunit may weaken, which may cause the expression of the gene CACNA1E to increase consistently with AD progression (Figure 7). The increased level of the protein encoded by CACNA1E may change VDCC activity. Moreover, changes in VDCC activity may alter calcium influx, which may lead to intracellular calcium dyshomeostasis and induce AD. This conclusion agrees with the idea that AD is correlated with intracellular calcium dyshomeostasis [39].

Potential implications for the AD candidate genes
In this paper, we studied eight genes selected from the list of identified AD candidate genes and discussed their potential implications in AD. Although we believe that these AD candidate genes are interesting for several reasons, including the dramatic changes in their expression and their roles in AD pathogenesis reported in other references (Figure 8), the potential implications should be used only as a reference.

Conclusions
Even though AD is the most common form of dementia, its pathological mechanisms are not fully revealed. However, three early-onset familial AD genes (APP, PSEN1, and PSEN2) and one genetic risk factor for late-onset AD (APOE) have been identified. With the applications of DNA microarray technology, identifying more AD candidate genes by computation appears particularly promising.
Using ADA, we identified 52 genes that showed dramatic changes in expression and can thus be regarded as potential AD candidate genes. Of these candidate genes, 27 showed average amplitudes that rose consistently (Figure 2A) and 25 showed average amplitudes that declined consistently as AD worsened (Figure 2B). By studying these AD candidate genes, we propose the following four pathogenetic roles: (1) the abnormal expression levels of the genes AGTR1 and PTAFR are associated with an increase in the activity of protein kinase C, which promotes the accumulation of Aβ and may in turn lead to AD; (2) the abnormal expression of the gene CNR1 correlates with the absence of LTD, which may impair LTP and subsequently induce AD; (3) the abnormal expression of the genes COL5A2 (221729_at), COL5A2 (221730_at), and COL4A1 is associated with decreases in energy supply, which may lead to neuronal apoptosis, a pathological feature of AD; (4) the abnormal expression of the genes CACNB2 and CACNA1E correlates with intracellular calcium dyshomeostasis, which is related to AD.
Finally, since these AD candidate genes were identified only by computation using ADA, their potential implications for AD pathology should be further validated by wet lab experiments.

Figure 3C: Percentages of functions of membrane proteins. The membrane proteins associated with signal transduction, protein transport, metabolism, immune response, and regulation/control of transcription account for 55%, 20%, 15%, 5%, and 5%, respectively.
Figure 3D: Functions of proteins in the cytoplasm. The percentages of proteins involved in protein transport, protein degradation, metabolism, signal transduction, and regulation/control of transcription are 34%, 22%, 22%, 11%, and 11%, respectively.
Figure 6A: Proportions of genes associated with various metal ions. The percentages of genes associated with calcium ions, manganese ions, magnesium ions, zinc ions, and ferric ions are 53%, 17%, 12%, 12%, and 6%, respectively.

Author Manuscript
Figure 6B: The distributions of proteins encoded by the calcium-related genes. The percentages of proteins located at the membrane, in the extracellular region, and at unknown locations are 78%, 11%, and 11%, respectively.
Figure 6C: The percentages for functions of proteins encoded by the genes associated with calcium ions. 56% are involved in signal transduction, and 44% are related to metabolism.
Figure 7: The expression levels of the genes CACNB2 and CACNA1E, which are associated with calcium ions. As the severity of AD increases, the expression levels change sharply.
Table 3: The AD candidate genes selected by the first criterion.
Table 4: The AD candidate genes selected by the second criterion.