Identification of network-based biomarkers of cardioembolic stroke using a systems biology approach with time series data

Background: Molecular signaling of angiogenesis begins within hours of stroke onset and is followed by regulation of endothelial integrity mediated by growth factor receptors and vascular growth factors. Recent studies have provided further insights into the coordinated patterns of post-stroke gene expression and the relationships between neurodegenerative diseases and neural function recovery processes after a stroke.

Results: Differential protein-protein interaction networks (PPINs) were constructed at 3 post-stroke time points, and proteins with a significant stroke relevance value (SRV) were identified. Genes including UBC, CUL3, APP, NEDD8, JUP, and SIRT7 showed strong associations with time after a stroke. Ingenuity Pathway Analysis showed that these post-stroke time series-associated genes were related to the molecular and cellular functions of cell death, cell survival, the cell cycle, cellular development, cellular movement, and cell-to-cell signaling and interactions. These biomarkers may be helpful for the early detection, diagnosis, and prognosis of ischemic stroke.

Conclusions: This is our first attempt to apply our systems biology framework to stroke. We focused on 3 key post-stroke time points and identified the network and corresponding network biomarkers for each of them; further studies are needed to experimentally confirm the findings and compare them with the causes of ischemic stroke. Our findings showed that stroke-associated biomarker genes at different time points were significantly involved in cell cycle processing, including G2-M, G1-S, and meiosis, which contributes to the current understanding of the etiology of stroke. We hope this work helps scientists reveal more hidden cellular mechanisms of stroke etiology and repair processes.


B. Data selection and pre-processing
The stroke microarray dataset was obtained from the NCBI GEO [11]. We chose GSE58294 [12] and its corresponding platform, GPL570, as our research object (Table 1); it contains gene expression data collected after a cardioembolic stroke. We used only data derived from non-processed primary biopsies to avoid discrepancies in gene expression that are intrinsic to cell culture and fixation. The dataset contained samples from stroke patients at 3 time points (3, 5, and 24 h post-stroke) and 23 control samples from non-disease subjects, with 23 samples for each stage. We built SPPINs for 3, 5, and 24 h post-stroke and the NPPIN for the normal stage. Prior to further analysis, the gene expression value, $h_{ij}$, of gene i in the j-th sample was z-transformed to a normalized score, $g_{ij}$, so that the normalized expression values of each gene had a mean $\mu_i = 0$ and standard deviation $\sigma_i = 1$ across samples j [13,14].
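The z-transformation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `z_transform`, the gene labels, and the toy values are hypothetical, and the use of the population standard deviation (dividing by the number of samples) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def z_transform(expression):
    """Z-transform each gene's expression values across samples.

    `expression` maps a gene ID to its list of raw values h_ij
    (one value per sample j). Returns the same mapping with the
    normalized values g_ij = (h_ij - mu_i) / sigma_i.
    """
    normalized = {}
    for gene, values in expression.items():
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population SD (divide by N)
        if sigma == 0:
            # A gene with constant expression cannot be scaled; centre it only.
            normalized[gene] = [0.0 for _ in values]
        else:
            normalized[gene] = [(h - mu) / sigma for h in values]
    return normalized


# Hypothetical toy data: two genes measured in three samples.
raw = {"GENE_A": [2.0, 4.0, 6.0], "GENE_B": [5.0, 5.0, 5.0]}
scores = z_transform(raw)
```

After this step every non-constant gene has zero mean and unit standard deviation across samples, which puts genes on a common scale before the association models are fitted.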
PPI data for Homo sapiens were extracted from the Biological General Repository for Interaction Datasets (BioGRID). BioGRID is an open-access archive of genetic and protein interactions curated from the primary biomedical literature of all major model organisms [15]. BioGRID was mined for candidate stroke PPINs, which were then pruned using the corresponding microarray data to delete false-positive PPIs. The resulting PPINs of the 3, 5, and 24 h post-stroke and normal stages were then compared to obtain network markers.

C. Selection of a protein pool and identification of PPINs for normal and stroke stages
To integrate gene expression with PPI data and construct the corresponding SPPINs and NPPIN, we set up a protein pool containing differentially expressed proteins. Gene expression values were assumed to correlate with protein expression levels. We used a one-way analysis of variance (ANOVA) to analyze the expression of each protein and select proteins with significantly different expression levels between the stroke and normal datasets. The null hypothesis (H0) was that the mean protein expression levels of the stroke and normal sets were the same. The Bonferroni adjustment [16], a correction for multiple testing, was applied when detecting differentially expressed proteins. Proteins with a p value of <0.01 were included in the protein pool, and proteins in the pool with no PPI information were eliminated. In addition, proteins not already in the pool were included if their PPI information indicated a close relationship with a protein already in the pool. As a result, the protein pool contained proteins with significant differences in expression levels and proteins closely related to them.
Based on the protein pool and the PPI information, candidate PPINs for the 3, 5, and 24 h post-stroke and normal stages were constructed by linking proteins that interacted with each other. In other words, proteins in the pool that had a documented PPI between them were linked together, resulting in the candidate PPINs.
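The pool selection and linking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the ANOVA p values are taken as already computed, the Bonferroni adjustment is implemented as multiplying each raw p value by the number of tests and comparing the result with 0.01, and a "close relationship" is interpreted as a direct interaction partner. The function names and the protein labels `P1` to `P4` are hypothetical.

```python
def build_protein_pool(p_values, ppi_edges, alpha=0.01):
    """Return (seeds, pool) for one stage.

    p_values  : dict mapping protein -> raw ANOVA p value
    ppi_edges : iterable of (protein_a, protein_b) interaction pairs
    alpha     : significance threshold applied to adjusted p values
    """
    n_tests = len(p_values)
    edges = list(ppi_edges)

    # Proteins that appear in at least one recorded interaction.
    interacting = set()
    for a, b in edges:
        interacting.update((a, b))

    # Assumption: Bonferroni adjustment as p * n_tests, capped at 1.
    significant = {
        protein
        for protein, p in p_values.items()
        if min(1.0, p * n_tests) < alpha
    }

    # Significant proteins without PPI information are eliminated.
    seeds = significant & interacting

    # Assumption: "close relationship" means a direct interaction partner.
    pool = set(seeds)
    for a, b in edges:
        if a in seeds:
            pool.add(b)
        if b in seeds:
            pool.add(a)
    return seeds, pool


def candidate_ppin(pool, ppi_edges):
    """Keep only interactions whose two endpoints are both in the pool."""
    return {
        tuple(sorted((a, b)))
        for a, b in ppi_edges
        if a in pool and b in pool
    }


# Hypothetical toy input.
p_values = {"P1": 1e-5, "P2": 0.2, "P3": 1e-4, "P4": 0.5}
edges = [("P1", "P2"), ("P2", "P4")]
seeds, pool = build_protein_pool(p_values, edges)
network = candidate_ppin(pool, edges)
```

In this toy input, `P3` is significant but has no recorded interaction, so it is dropped, while `P2` enters the pool only because it interacts with the significant protein `P1`.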
As the candidate PPIN included all possible PPIs under various environments and experimental conditions, it needed to be further confirmed by microarray data to identify the PPIs that are appropriate to the biological processes relevant to stroke. To remove false-positive PPIs from each candidate PPIN for the different biological conditions, we used both a PPI model identification scheme and a model order detection method to prune each candidate PPIN using the corresponding microarray data, so as to approach the actual PPIN of stroke. Here, the PPI of target protein i in the candidate PPIN can be depicted by the following protein association model:

$$x_i[n] = \sum_{j=1}^{M_i} \alpha_{ij}\, x_j[n] + \omega_i[n] \tag{1}$$

where $x_i[n]$ is the expression level of target protein i for sample n; $x_j[n]$ is the expression level of the j-th protein interacting with target protein i for sample n; $\alpha_{ij}$ is the association interaction ability between target protein i and its j-th interactive protein; $M_i$ is the number of proteins interacting with target protein i; and $\omega_i[n]$ is stochastic noise due to other factors or model uncertainty. The biological meaning of equation (1) is that the expression level of target protein i is associated with the expression levels of the proteins that interact with it. Consequently, a protein association (interaction) model for each protein in the protein pool can be built using equation (1).
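After constructing equation (1) for each protein in the candidate PPIN, the association parameters were identified from the microarray data, giving the identified model:

$$x_i[n] = \sum_{j=1}^{M_i} \hat{\alpha}_{ij}\, x_j[n] + \omega_i[n] \tag{2}$$

where $\hat{\alpha}_{ij}$ was identified using microarray data in accordance with the maximum likelihood (ML) estimation method.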
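For readers who want the estimator in explicit form, the following is a sketch of how the ML estimate can be written, assuming the noise $\omega_i[n]$ is zero-mean Gaussian with constant variance, in which case ML estimation coincides with least squares. The stacked notation ($X_i$, $\Phi_i$, $\theta_i$, $N$) is introduced here for illustration and is not taken from the text above.

```latex
% Stack the N samples of the association model for target protein i:
%   X_i = \Phi_i \theta_i + \Omega_i
X_i = \begin{bmatrix} x_i[1] \\ \vdots \\ x_i[N] \end{bmatrix}, \quad
\Phi_i = \begin{bmatrix}
  x_1[1] & \cdots & x_{M_i}[1] \\
  \vdots & \ddots & \vdots \\
  x_1[N] & \cdots & x_{M_i}[N]
\end{bmatrix}, \quad
\theta_i = \begin{bmatrix} \alpha_{i1} \\ \vdots \\ \alpha_{iM_i} \end{bmatrix}

% Under the Gaussian-noise assumption, the ML estimates are
\hat{\theta}_i = \left( \Phi_i^{T} \Phi_i \right)^{-1} \Phi_i^{T} X_i,
\qquad
\hat{\sigma}_i^{2} = \frac{1}{N}
  \left( X_i - \Phi_i \hat{\theta}_i \right)^{T}
  \left( X_i - \Phi_i \hat{\theta}_i \right)
```

The entries of $\hat{\theta}_i$ are the estimated association abilities $\hat{\alpha}_{ij}$, and $\hat{\sigma}_i^2$ is the residual variance. The closed form requires $\Phi_i^T \Phi_i$ to be invertible, which in practice means the number of samples should exceed the number of interacting proteins.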
Once the association parameters had been identified for every protein in the candidate PPIN, significant protein associations were determined using the model order detection method based on the estimated association abilities, i.e., by detecting the interaction number, $M_i$, in equation (2). The Akaike information criterion (AIC) [17] and a Student's t-test [18] were used for both model order selection and significance determination of the protein associations $\hat{\alpha}_{ij}$ (see Additional file S.2).
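As a rough illustration of AIC-based order detection, the sketch below scores candidate interaction numbers from their residual sums of squares and picks the minimum. It assumes the Gaussian-likelihood form of the AIC up to an additive constant, N ln(RSS/N) + 2M; the exact form the authors used is given in their Additional file S.2 and is not reproduced here. The function names and the toy residuals are hypothetical.

```python
import math


def aic(rss, n_samples, n_params):
    """Gaussian-likelihood AIC up to an additive constant.

    rss       : residual sum of squares of the fitted model (must be > 0)
    n_samples : number of samples N used in the fit
    n_params  : number of association parameters M in the model
    """
    return n_samples * math.log(rss / n_samples) + 2 * n_params


def select_order(rss_by_order, n_samples):
    """Return the order with the lowest AIC and all AIC scores.

    rss_by_order maps a candidate interaction number M to the residual
    sum of squares obtained when the model is fitted with M partners.
    """
    scores = {m: aic(rss, n_samples, m) for m, rss in rss_by_order.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical residuals for models with 1, 2, and 3 interacting proteins,
# fitted on 23 samples.
rss_by_order = {1: 10.0, 2: 4.0, 3: 3.9}
best_order, aic_scores = select_order(rss_by_order, n_samples=23)
```

Here the third partner reduces the residual only slightly, so the penalty term outweighs the gain and the two-partner model is preferred; this is the trade-off that the AIC formalizes.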

D. Determination of significant proteins and their network structures at 3, 5, and 24 h post-stroke and in normal cells
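After the interaction number, $M_i'$, was determined using the AIC order detection and Student's t-test, spurious false-positive PPIs, $\hat{\alpha}_{ij}$, in equation (2) were pruned away, and only the significant PPIs that remained were retained, as follows:

$$x_i[n] = \sum_{j=1}^{M_i'} \hat{\alpha}_{ij}\, x_j[n] + \omega_i'[n], \quad i = 1, 2, \ldots, M \tag{3}$$

where $M_i' \le M_i$ is the number of significant PPIs retained for target protein i. Applying this pruning to every protein (equation (3)) resulted in the following refined PPINs:

$$A_S^k = \begin{bmatrix} \hat{\alpha}_{S,11}^k & \cdots & \hat{\alpha}_{S,1M}^k \\ \vdots & \ddots & \vdots \\ \hat{\alpha}_{S,M1}^k & \cdots & \hat{\alpha}_{S,MM}^k \end{bmatrix} \tag{4}$$

$$A_N = \begin{bmatrix} \hat{\alpha}_{N,11} & \cdots & \hat{\alpha}_{N,1M} \\ \vdots & \ddots & \vdots \\ \hat{\alpha}_{N,M1} & \cdots & \hat{\alpha}_{N,MM} \end{bmatrix} \tag{5}$$

where k = 3, 5, and 24 h post-stroke; $A_S^k$ and $A_N$ are the interaction matrices of the refined PPINs of 3, 5, and 24 h post-stroke and of normal cells, respectively; and M is the number of proteins in the refined PPIN. Therefore, the protein association models for the SPPINs and the NPPIN for 3, 5, and 24 h post-stroke and normal cells can be represented by the following equations according to equations (4) and (5):

$$x_S^k[n] = A_S^k\, x_S^k[n] + w_S^k[n], \qquad x_N[n] = A_N\, x_N[n] + w_N[n] \tag{6}$$

where k = 3, 5, and 24 h post-stroke; $x_S^k[n]$ and $x_N[n]$ stack the expression levels of the M proteins for sample n; and $w_S^k[n]$ and $w_N[n]$ are the corresponding noise vectors. The difference between the two networks is

$$D^k = A_S^k - A_N = \begin{bmatrix} d_{11}^k & \cdots & d_{1M}^k \\ \vdots & \ddots & \vdots \\ d_{M1}^k & \cdots & d_{MM}^k \end{bmatrix}, \qquad d_{ij}^k = \hat{\alpha}_{S,ij}^k - \hat{\alpha}_{N,ij} \tag{7}$$

where k = 3, 5, and 24 h post-stroke; $d_{ij}^k$ is the protein association ability difference between the SPPINs and the NPPIN at k = 3, 5, and 24 h post-stroke and normal cells; and matrix $D^k$ is the difference in network structures between the SPPINs and the NPPIN for k = 3, 5, and 24 h post-stroke and normal cells. In order to investigate stroke-related factors from the difference matrix, $D^k$, between the SPPINs and the NPPIN in equation (7), a score, which we named the SRV, is presented to quantify the correlation of each protein in $D^k$ with the significance of stroke as follows [13]:

$$SRV_i^k = \sum_{j=1}^{M} \left| d_{ij}^k \right| \tag{8}$$

where i = 1, 2, ..., M and k = 3, 5, and 24 h post-stroke. The $SRV_i^k$ in equation (8) quantifies the differential extent of protein associations of the i-th protein (the absolute sum of the i-th row of $D^k$ in equation (7)) [...]. We also found 5, 9, and 4 significant proteins, respectively, as the specific network markers of 3, 5, and 24 h post-stroke.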
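The difference matrix and the SRV score described above reduce to an element-wise subtraction followed by a row-wise sum of absolute values. The following sketch shows this on a hypothetical three-protein network; the function names and the matrix entries are invented for illustration and do not come from the study.

```python
def difference_matrix(a_stroke, a_normal):
    """Element-wise difference D = A_S - A_N of two interaction matrices."""
    return [
        [s - n for s, n in zip(row_s, row_n)]
        for row_s, row_n in zip(a_stroke, a_normal)
    ]


def stroke_relevance_values(d):
    """SRV_i is the sum of absolute values in the i-th row of D."""
    return [sum(abs(entry) for entry in row) for row in d]


# Hypothetical refined interaction matrices for three proteins.
a_stroke = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, -0.25],
    [0.0, -0.25, 0.0],
]
a_normal = [
    [0.0, 0.5, 0.25],
    [0.5, 0.0, 0.0],
    [0.25, 0.0, 0.0],
]

d = difference_matrix(a_stroke, a_normal)
srv = stroke_relevance_values(d)
```

In this toy case the third protein changes its association with both of its partners between the two conditions, so it receives the largest SRV; proteins whose interactions are unchanged would score zero.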
These proteins showed significant changes between SPPINs and the NPPIN in the stroke process according to the corresponding stroke stage, and we suspected that these changes might play important roles in the stroke process. These findings warrant further investigation.

E. Pathway analysis
Known pathways, which are useful for describing most "normal" biological phenomena, provide additional cellular information. These pathways are the result of repeated testing and verification, and most links in the pathway network are well defined. Therefore, the proteins we identified as significant in the above network markers were mapped onto known pathway networks (e.g., the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways) to investigate which pathways are significantly associated with the network markers and to explore the relationships between these pathways and stroke. This approach supports the view that systems biology can help identify significant network biomarkers in stroke and relate their cellular roles to stroke etiology and repair processes.
Together with comprehensive pathway databases such as the KEGG, we used a series of bioinformatics pathway analytical tools to identify biologically relevant pathway networks. The KEGG includes manually curated biological pathways that cover 3 main categories: systems information (e.g., human diseases and drugs), genomics information (e.g., gene catalogs and sequence similarities), and chemical information (e.g., metabolites and biochemical reactions). At present, the KEGG contains 134,511 distinct pathways generated from 391 original reference pathways [19]. Therefore, to investigate the pathways involved in stroke etiology and repair processes, the DAVID bioinformatics database [20,21], which generates automatic outputs of the results from a KEGG pathway analysis [22], was used for the pathway analysis of significant proteins identified in network markers to determine their cellular roles in stroke.
To complement these results, we used the commercial software Ingenuity® Pathway Analysis (IPA®; QIAGEN, Redwood City, CA, www.qiagen.com/ingenuity) to perform multiple functional and pathway analyses. We then used the free network ontology analysis (NOA) software to perform the pathway analysis and gene set enrichment analysis (GSEA) on biological processes, cellular components, and molecular functions [23]. The NOA first defines a link ontology that assigns functions to interactions based on the known annotations of joint genes by optimizing 2 novel indexes, 'Coverage' and 'Diversity'. The NOA then generates 2 alternative reference sets to statistically rank the enriched functional terms for a given biological network. Wang et al. compared the NOA with traditional enrichment analysis methods in several biological networks and found that (i) the NOA can capture changes in functions not only in dynamic transcription regulatory networks but also in rewiring protein interaction networks, whereas the traditional methods cannot, and (ii) the NOA can find more relevant and specific functions than traditional methods in different types of static networks. The above description of the NOA is directly cited from their paper [24].