Targeted co-expression networks for the study of traits

Weighted Gene Co-expression Network Analysis (WGCNA) is a widely used approach for the generation of gene co-expression networks. However, networks generated with this tool usually create large modules with a large set of functional annotations hard to decipher. We have developed TGCN, a new method to create Targeted Gene Co-expression Networks. This method identifies the transcripts that best predict the trait of interest based on gene expression using a refinement of the LASSO regression. Then, it builds the co-expression modules around those transcripts. Algorithm properties were characterized using the expression of 13 brain regions from the Genotype-Tissue Expression project. When comparing our method with WGCNA, TGCN networks lead to more precise modules that have more specific and yet rich biological meaning. Then, we illustrate its applicability by creating an APP-TGCN on The Religious Orders Study and Memory and Aging Project dataset, aiming to identify the molecular pathways specifically associated with APP role in Alzheimer’s disease. Main biological findings were further validated in two independent cohorts. In conclusion, we provide a new framework that serves to create targeted networks that are smaller, biologically relevant and useful in high throughput hypothesis driven research. The TGCN R package is available on Github: https://github.com/aliciagp/TGCN.


Bootstrapped LASSO decreases algorithm's instability when identifying seed transcripts
The first step to generate a TGCN on a transcript expression profile is to select the transcripts that contribute to predict the trait of interest T with LASSO regression.We are concerned with the stability of LASSO when selecting the seeds.Our algorithm would be stable if a slight variation in the sample set does not significantly vary the seed transcripts set composition.To empirically assess stability, we used the gene expression profiles of 13 different brain regions from GTEx (see supplementary Table S1) to predict the age of the donors at the time of death and created 13 age-TGCNs.The empirical instability of LASSO is illustrated in Fig. 1A.Each curve corresponds to a different value of the number of times the transcripts are selected or ratio of appearance, r.We observe that, when the number of LASSO runs (x axis) increases, the number of seeds keeps increasing at high pace, preventing the seed geneset from becoming stable (see curve for r value of 1 in Fig. 1A).However, when we bootstrap the LASSO, then the curves stop growing early on (see curves for r values 6, 7 and 8 in Fig. 1A).We experimented with LASSO runs values up to 50 in all 13 brain regions and observed that, amongst all seeds detected with 10 LASSO runs, the ones that were selected in at least 75% of the models remain selected at the same rate after running 50 LASSO runs.This suggests no more than 10 bootstraps are needed.This strategy allowed us to detect the more confident and universal transcripts to predict the age of the donors in each brain region of the GTEx dataset.Additionally, we observed that, for most of the brain regions, the age-predictive models created with only the repeatedly selected transcripts as predictors (seed transcripts models) reported a lower RMSE on test compared with the LASSO models created for the same brain region (7.42 and 8.16 RMSE on test on average, respectively) (see supplementary table S2).Among the different brain regions, the highest R 2 was observed for the cerebellum region, with a mean R 2 of 0.732 ± 0.089, where the standard deviation represents the error of the cross validation.On the opposite side, the substantia nigra brain region reported the worst performance to predict the age of the donors, reporting a mean R 2 of 0.193 ± 0.155.On average, the seed transcripts models reported an average R 2 of 0.435 across the different brain regions.
Interestingly, we observe that the number of times that a transcript t was selected in the bootstrap is positively correlated with the magnitude for the coefficient in the LASSO models for the corresponding transcript t (see Fig. 1B and supplementary table S3).This suggests that bootstrapped LASSO tends to select the most relevant transcripts in the prediction of the trait.On the other hand, we empirically observe the trade-off between increasing values of t(r) and lower R 2 (Fig. 1C, left panel).Alternatively, large seed set sizes naturally increase R 2 (Fig. 1C, right panel).
Finally, we also investigated how sample size influences stability.We use the Jaccard coefficient across runs as an estimate.Figure 1D includes experiments for brain regions that showed the highest stability (cortex), the lowest stability (spinal cord) and middle stability (caudate, hippocampus and amygdala in this order).We observe that, by increasing the training sample size, we get an increase of stability (cor = 0.535, P < 5.78 × 10 -62 ).

Seed transcripts are representatives of independent pathways
Seed transcripts are expected to be important transcripts of each pathway involved in the trait or phenotype of interest.When creating TGCNs, we implicitly assume that biological pathways influence the phenotype of interest somewhat independently and additively.Accordingly, their respective representative transcripts, the seeds, should be independent of each other.In terms of how correlated seeds are pairwise, we observe a significant difference (T-test P < 0.05) in the mean of the maximum pairwise correlation of LASSO selected transcripts across runs (0.489) and that observed within randomly chosen genesets paired in size (0.756) for cortex tissue from GTEx dataset (Fig. 2A, supplementary table S4, see "Methods").Therefore, the bootstrapped LASSO identifies unassociated transcripts.
If seed transcripts represent independent pathways and we try to annotate them functionally as a geneset, we should not observe any significant enrichment.Figure 2A shows enrichment of GO, KEGG, REAC and HP terms found in seed transcripts grouped by ratio of appearance from 1 to 10. None of the seed transcript sets showed relevant functional abundance (see also methods).This shows that these transcripts are not involved in the same pathways.Therefore, they might be involved in different non overlapping pathways.The number of new transcripts selected by a LASSO run when increasing the number of runs in comparison to only one iteration for GTEx cortex tissue with donor´s age as the predicted variable.Increasing the number of runs shows a strong increase in the number of transcripts selected at least once (red curve).We are interested in the transcripts selected by LASSO in a reasonable number of iterations, e.g. 6, 7 or 8 in the graph.(B) The plot shows a positive association, R 2 0.279 (P < 8.66 × 10 -38 ) between the eligibility of a transcript (i.e., number of times it is selected across the b LASSO bootstraps) and the average size of the coefficients for that transcript within the linear models that predict the trait.This plot corresponds to the age-TGCN for GTEx cortex.Each point is a transcript, the x axis represents eligibility and the y-axis represents the mean coefficient across linear models predicting age (see supplementary table S3).(C) Both plots show the relation between the number of transcripts used to predict age and the adjusted R 2 for all age-TGCNs for GTEx brain tissues.We see a clear negative relation between adjusted R 2 at predicting age as the ratio of appearance increases (left), that is, because the number of transcripts decreases within the model (right).Plots in B and C illustrate a trade-off in using TGCNs.The larger the seed size, the better performer is the final linear model to predict the trait, but the TGCN gets also larger and difficult to interpret.(D) The influence of the number of samples used to train the model, m, is represented in terms of LASSO stability.For five different brain tissues of the GTEx dataset, 10 LASSO runs were applied using 25, 50, 75 and 100% of the samples for the training.We observe that, the higher the m, the higher the LASSO stability.Jaccard represents the overlap of relevant transcripts between pairs of iterations, nzero represents the number of transcripts selected through the 10 iterations and RMSE train and test represent the train and test error of the LASSO models.

Modules within a targeted GCN show low or null overlap
For each brain region of the GTEx dataset, the transcripts selected by LASSO in at least 8/10 iterations were kept as seed transcripts since we considered these are reliable and we also obtained a good balance between the number of modules created and the RMSE of the model (see supplementary table S2).Interestingly, we observed that the seed transcripts detected per brain region replicate, on average, in seven of the 12 different brain regions of the study (for more information, see supplementary Fig. S1).
During the process of module creation, potential module member transcripts may show high correlation with more than one seed.TGCN allows transcripts to belong to more than one module, although this seldom happens.Our main interest is the characterization of pathways involved in the trait and we know genes are multifunctional 29 .What is relevant is that the independence of seed transcripts leads to quite different modules.In Fig. 2B, we show pairwise Jaccard coefficients between modules within each brain region-specific network.For all brain regions, the overlap between modules is practically zero, therefore the modules hold different transcripts (see supplementary table S5).Extreme brain regions regarding modules composition overlap are the age-TGCN in substantia nigra, whose modules show no overlap (Fig. 2C) and the age-TGCN in hippocampus, which showed a significant pairwise overlap (Fisher exact test P < 0.05) between three of its modules, CLEC5, MARCO and VENTX modules (Fig. 2D).
Figure 2. Independence of seeds and overlap across modules.(A) For all 13 brain tissues, we obtained the pairwise correlation between transcripts selected by LASSO (magenta) and compared them with the same number of pairs of randomly chosen transcripts (turquoise) across runs.We consistently see across runs that LASSO gets transcripts with less association (absolute value of maximum Spearman correlation between pairs) than we would expect by random chance (see supplementary table S4).This illustrates that seeds show no association and it is maintained across bootstraps.(B) For each brain region of GTEx, enrichment (y axis) is shown for the age-specific seed transcripts grouped per ratio of appearance (x axis).Enrichment is measured as the sum of the − log 10 P of all the annotations normalized by group size in transcripts.No geneset enrichment reaches the required cut-off for significance, − log 10 (0.05).(C) Average pairwise overlap between modules in all age-TGCNs from GTEx brain tissues.For all tissues, we observe an overlap, as estimated by Jaccard, close to zero.This suggests each module represents different gene functions and pathways.(D) Overlap between modules within the substantia nigra age-TGCN.Each row and column refer to a module, led by the corresponding seed, at the row and column label.Each cell shows the number of common transcripts.Color scale represents the adjusted -log 10 P of the overlap based on a Fisher Exact test.(E) Same plot as D but for the hippocampus age-TGCN.We observed that CLEC5A module showed a significant overlap with CDKN2A (P < 1.58 × 10 -21 ), VENTX (P < 3.65 × 10 -91 ) and MARCO (P < 2.01 × 10 -20 ) modules with a Fisher exact test after p-value adjustment.In addition, we also observe a significant overlap between VENTX and MARCO modules (P < 1.40 × 10 -25 ).(F) PCA projection at transcript level with the union of the transcript of modules VENTX, MARCO and CLEC5A from the age-targeted network of the hippocampus brain tissue.We observe that the transcripts of module CLEC5A go in one direction, those of MARCO in another, those of VENTX in a different direction while the transcripts common to these three modules appear at the intersection of the point clouds.www.nature.com/scientificreports/Targeted GCNs provide more specific modules than WGCNA WGCNA is the most widely extended hypothesis free approach to create genome wide gene co-expression networks.Since TGCNs are focused on a specific trait and its transcript pool is restricted to gene expression pathways associated with the trait, TGCNs are smaller and much more specific than hypothesis free GCNs which act on the whole transcriptome.For example, on average, the GTEx WGCNA networks we created for the 13 brain tissues use transcript pools of 15,281 genes in size and have nine modules average each network, with a mean size for each module of 1787 transcripts.The corresponding age-TGCNs use, on average, 426 transcripts in total and have 12 modules average each network with a mean module size of 37 transcripts.35.9-fold more transcripts in WGCNA GCNs (see supplementary tables S6 and S7).Remarkably, even though TGCN modules are smaller in transcripts, TGCN achieves similar or higher functional abundance across all 13 brain regions, on average 67.95 and 44.02 functional abundance per module in TGCN and WGCNA respectively.

Seed transcripts in targeted GCNs are orthogonal to WGCNA hub transcripts
We evaluated the relevance of the GTEx age-targeted seeds transcripts into the corresponding WGCNA modules.
We found those transcripts we consider as seeds in TGCNs play no particular relevant role in their WGCNA module, showing a modest module membership, with an average percentile of 14.51 (see supplementary table S8).A plausible explanation for this is that in this case, the trait, age at death, has a subtle association with the genome wide gene expression profile.
We also investigated the possible role of WGCNA hubs transcripts into the age-TGCNs in brain samples.For each WGCNA module, we selected the top 10 transcripts with highest adjacency values (hub transcripts) and checked the overlap with the TGCNs modules from the same brain region.We found that only a small proportion (18.75%) of WGCNA hub transcripts significantly overlapped with at least one TGCN module from the same brain region (see supplementary table S9).

TGCN modules tend to be subsets of single WGCNA modules
In order to better understand the relation between WGCNA modules and TGCN modules, we investigated how transcripts in a TGCN spread across the WGCNA GCNs created from the same gene expression profile.We found that, across all GTEx brain regions, 75.95% age-TGCN modules are included within a single WGCNA module while 16.46% WGCNA modules contribute significantly to more than one age-TGCN.
All the transcripts not included within the age-TGCN fell into the cells of the rightmost column and only the MT1G, GDF15 and IGF2BP2 modules significantly overlapped with more than one module while the rest of modules overlapped with just a single module.WGCNA modules which include any transcript appearing in the corresponding TGCN, show significant overlap with more than one TGCN module.See for example, the yellow module including entirely six modules out of the 16 TGCN total modules.At the same time, only 53.58% of the transcripts in that module are used in TGCN modules.This suggests that TGCN modules' functions are local to single WGCNA modules and by virtue of using only those transcripts that contribute to predict the trait, and those that co-express with them, they filter out all functions not relevant to the trait.

TGCNs show specificity to the trait and brain region
TGCNs are a set of transcript-based modules, led by seed transcripts with no association to any other seed through mRNA levels, and that also resemble independent pathways within each module.We hypothesize that the pathways play a biological role, mediated by gene expression, in relation to the targeted trait.Therefore, the APP-targeted GCN that models the role of APP in the AD phenotype is assumed to describe the pathways APP is involved with.To what extent, do those pathways emerge from the fact that we are predicting a trait that is really associated with the gene expression of the samples used?What would happen if instead of using APP as the target, we would use as target a totally irrelevant gene in the omics plus trait context involved?Would the TGCN approach be specific enough to detect no relevant signal when there is actually none?To shed some light about these questions, we conducted an experiment with six genes not previously related with AD.The selection of control genes has been based on two criteria: (1) the selected control genes must be well studied, and their function characterized, to minimize the chance of eventually finding them associated with AD; (2) they must be functional, in the sense that they are associated with diseases other than of neurodegenerative nature.In supplementary table S10, the disease associated with each control gene is available.
Firstly, within all 10 LASSO bootstraps for APP, the minimum test set R 2 for the linear model using the seeds selected through that bootstrap is 0.935, mean of 0.952.In all 10 LASSO bootstraps for all the other six genes, the maximum R 2 was 0.702 and the minimum was 0.149 (see supplementary table S10, supplementary Fig. S3A).Secondly, the seeds detected by the method in all ten LASSO bootstraps for APP were nine.In three of the six other TGCNs, the method detected no seeds in any of the runs.In the other three, the method detected only two seeds in each (supplementary table S10, supplementary Fig. S3B).Finally, the average functional abundance of modules in the APP-TGCN, measured as the sum of -log 10 (Pval) for all enriched terms, is 293.9.We see practically no average enrichment in the other six networks, with 10.8 enrichment on average for the best network, TNNT2-TGCN.This is a 26-fold difference (see supplementary table S10, supplementary Fig. S3C), which we consider as a strong indication of TGCN specificity.www.nature.com/scientificreports/

Seed transcripts and co-expression modules of the APP-TGCN replicate in two alternative AD cohorts
TGCNs are constructed in three steps, by (1) detecting the seeds with model bootstrapping, (2) constructing the co-expression modules from those seeds and (3) annotating them (see Fig. 3).We constructed an APP-TGCN on the ROSMAP cohort (see materials) for r values of 8, 9 and 10.Thoroughly the paper, we focused on the ratio of appearance of eight, as it is a good trade-off between APP transcript expression prediction and complexity (i.e., it has 25 modules and 880 transcripts in total).We wanted to shed light about the APP transcript role in AD from brain samples gene expression.To assess whether this TGCN replicates, we used two alternative cohorts, Mayo After all bootstraps, the TGCN software reports transcripts selected one or more times, along with how many times they were selected.Seed transcripts are selected on the basis of their ratio of appearance across bootstraps, r.Those transcripts are then used within a linear regression model as predictors to evaluate their prediction power on the trait and, therefore, the relevance of gene expression predicting that trait.
Step 2: each seed transcript is the seed for a co-expression module.Modules are completed by adding the genes more co-expressed with the seed based on the absolute Pearson correlation.To determine how many transcripts to add to each module, the algorithm offers three options.In the first option, all seed transcripts are equally relevant within the network, therefore all modules should have the same size (i.e. 100 transcripts).In the second option, the higher the magnitude of the coefficient in the linear model, the more relevant the seed is within the TGCN.Therefore, the seed transcript with the highest coefficient receives the maximum number of transcripts (i.e. 100 transcripts).In the third option, we look for the minimum size of the module that reports the best enrichment possible.www.nature.com/scientificreports/and MSBB (see "Materials").The seeds detected in ROSMAP with a minimum ratio of appearance of eight lead to a linear model with high APP predictivity (R 2 = 0.97).These seeds also showed a high APP-predictivity in Mayo and MSBB cohorts, with R 2 of 0.82 and 0.86, respectively (see supplementary table S11 for details on all models for r values of 8, 9 and 10).To properly assess whether those values of R 2 were better than random chance, a permutation analysis was applied.To this end, we created 100,000 APP-predicting linear models with randomly chosen seed transcripts (same size of relevant seeds) and compared their performance with the APP-predictive model created with the relevant seed transcripts.This procedure was applied separately for both Mayo and MSBB.
Results showed that the APP-predictive model created with the relevant transcripts performs better than APPpredictive models created with randomly chosen seed transcripts, leading to empirical p-values of statistical significance P < 0.059 and P < 4 × 10 -5 , respectively (see supplementary table S11, supplementary Fig. S4A,B).This suggests that seed transcripts discovered in ROSMAP reasonably explain APP transcript expression at the other two cohorts as well, nominally in Mayo and clearly in MSBB.Additionally, we evaluated how critical it would be to not find some of the seed transcripts in the expression matrix of the replication set.To this end, a permutation test was applied to compare the performance of the model created with the real seeds with the models created (1) removing each seed individually; (2) removing sets of five seeds (10,000 simulations); (3) removing sets of 10 seeds (10,000 simulations).This strategy was applied for Mayo and MSSB cohorts separately.We observed that, since it is an additive model, the effects of missing seeds are not noticeable (see supplementary table S12).
In order to assess whether co-expression modules are also present in the replication cohorts, for each APP-TGCN module built on the ROSMAP dataset, we tested for significant overlaps in the modules of Mayo and MSBB APP-TGCNs (see supplementary Fig. S4C,D).Out of the 25 modules identified in ROSMAP, seven modules did not show a significant overlap with any other module of the replication networks.The small size of the unreplicated co-expression modules (average size is 40) might explain, in part, their low replicability.Regardless, the replicability of 18 out of 25 modules across three brain transcriptomic AD datasets, demonstrated the robustness of this new method.

Insights into the biology of APP and its role in AD
The TGCN R software package generates, for each targeted network, a whole bundle of information, in HTML format, on the TGCN construction process and results annotation (see the TGCN GitHub for the one generated for APP-TGCN).All this is summarized in a table with an entry for each module (see supplementary table S13).The complete APP centric network is composed of 880 transcripts, spread across 25 modules, whose size varies in the [10, 80] genes range (see "Methods").The 25 seeds alone can explain most of the APP gene expression variation in a linear model (R 2 0.96).The CellTypeEnriched column of supplementary table S13 shows four modules that clearly contain cell marker enrichment, including the ATP1B1 module enriched for dopaminergic neuron markers (Fisher's exact P < 4.55 .10 -3 ), the ABCD3 module, enriched for astrocytic markers (P < 1.083 .10 -3 ), the FRZB module enriched for mural cell markers (7.761 .10 -4 ), and the NFASC module enriched for oligodendrocyte markers (3.899 .10 -4 ).This adds evidence to the already known complex expression profile of APP in AD brains and different expression patterns across brain cell types.After multiple test corrections, the eigengenes of two of these modules were significantly correlated with AD status, i.e., the ATP1B1 and NFASC modules (P < 0.038 y P < 7.02 .10 -4 , respectively).The ATP1B1 module is enriched for GO biological processes like synaptic vesicle cycle (GO:0,099,504, P < 3.65 .10 -4 ) and vesicle mediated transport in synapse (GO:0,099,003, P < 4.46 .10 -4 ).The NFASC module seems to be involved in the development of neurons, particularly through axons in the neurons since this module is enriched for terms like main axon (GO:0,044,304, P < 9.32 .10 -5 ) and axon development (GO:0,061,564, P < 6.49 .10 -4 ).Both ATP1B1 and NFASC modules replicated in two independent cohorts.Specifically, AT1BP1 module replicated in Mayo and MSBB cohorts with empirical p-values of P < 2.76 .10 -9 and P < 1.96 .10 -10 , respectively.NFASC module replicated in Mayo and MSBB cohorts, reporting P < 4.51 .10 -4 and P < 1.03 .10 -11 , respectively.

Discussion
We have developed a new hypothesis-driven co-expression modeling approach as an alternative or to complement WGCNA-like networks (see Fig. 1), targeted co-expression networks (TGCNs).The approach is divided into three main steps: (1) seed transcript detection, (2) co-expression modules creation and (3) network annotation.The first step of the pipeline uses bootstrapped LASSO in order to select the most relevant transcripts (seed transcripts) in predicting the target trait.In this high dimensional data scenario, transcripts selected by LASSO can be slightly different from run to run.To address this sign of instability, we bootstrap the LASSO.Previous studies have demonstrated that bootstrapped LASSO increases the stability of the features selected 30,31 .As a proof of concept, we used LASSO to predict the age of death of the donors using transcriptomics from 13 different brain regions from the GTEx project 32 .Previous studies also identified a set of age-related genes for each tissue in the GTEx cohort.Yang et al. 33 used a linear regression approach with 100 bootstrapping resamples to detect age-related genes within each tissue with at least 80 samples in size from an early version of GTEx (it did not include any brain tissue as all were smaller than 80 samples).Wang et al. 34 ranked the age-related genes for the same 13 brain regions based on the Pearson correlation.For each brain region, the top 50 age-related genes for that region were used to create a predictive model.These models reported a mean RMSE of 7.89 ± 1.23, which is very close to the mean RMSE obtained with our models 7.42 ± 1.47 but we only require 12 transcripts on average to achieve that error (see supplementary table S2).
The second step of the TGCN approach characterizes the relation between the trait and the transcriptomic profiling.This is precisely what we have pursued by creating an APP-TGCN on AD samples: to shed light on the role of the APP gene within AD.Other works approach the AD phenotype from transcriptomics by extracting the most relevant features in predicting AD.LASSO, or LASSO combined with other approaches of feature selection (Boruta's algorithm or differential expression analysis), have been applied for this task using a variety of data types, including MRI [35][36][37] , single nucleotide polymorphisms imputed from whole genome sequencing 38,39 , microarray 40,41 , plasma lipid levels 42 and, of course transcriptomics 43,44 .However, since AD is a complex disease affected by multiple factors, we propose to approach the AD phenotype through a specific trait, as a way to simplify this challenging task, i.e. this is, precisely, the main motivation behind TGCNs.Naturally, the selected trait should be strongly associated with AD.In particular, we focused on the APP gene, which has a relevant role in AD 45 .Therefore, we created an APP-TGCN from the transcriptomics of the ROSMAP cohort.The model predicting the APP gene expression values reported a 0.97 R 2 .To the best of our knowledge, there is no other study where an APP-predictive model was created using transcriptomics data from human controls and AD donors.The reliability of the seed transcripts selected by TGCN was assessed in two independent cohorts, Mayo and MSBB, with R 2 of 0.82 and 0.86, respectively.Amongst the modules within the APP-TGCN, the module led by the HNRNPA2B1 seed transcript, a heterogeneous nuclear ribonucleoprotein from the A2B1 family, stands out from the rest because of its high correlation with donors neuro status (P < 2.13 × 10 -5 ).Donev et al. demonstrated that hnRNP A1 is involved in alternative splicing of APP exons 7 and 8. Berson et al. found that both hnRNP proteins A1 and A2/B1 were depleted in entorhinal cortex samples from AD patients.In addition, HnRNP A2/ B1 protein has also been linked to other neurodegenerative diseases such as amyotrophic lateral sclerosis 48,49 .Another module highly correlated with donor neuro status (P < 7.02 × 10 -4 ) was the one led by neurofascin (NFASC) seed transcript.Neurofascin is a transmembrane protein that plays an essential role in nervous system development and node of Ranvier function 50 .In this same line, Xu et al., 2014 reported that APP aggregates at nodes of Ranvier in the myelinated central nervous system axons but not in the peripheral nervous system.Bai et al. demonstrated, in the mouse brain, that APP interacts with neurofascin.Brinkmalm et al. detected that the concentrations of the synaptic protein neurofascin were significantly lowered in the CSF of AD donors.Furthermore, NFASC gene mutations were previously associated with autosomal recessive ataxia with demyelinating neuropathy 54 and was also proposed as a candidate autoantigen in multiple sclerosis 55 .Both modules, the ones led by HNRNPA2B1 and NFASC seed transcripts, were further replicated in the two independent cohorts (Fisher exact test P < 0.05), demonstrating that our findings are reliable, and the networks are robust.
Finally, we compared the TGCN networks with WGCNA networks on the same GTEx transcriptomic profiles.Since WGCNA clusters the whole pool of transcripts into modules, the resulting clusters tend to be very large in size [56][57][58] and more difficult to interpret.Since TGCNs are focused on a specific trait, these networks are more than 30 times smaller than WGCNA networks but surprisingly, equally rich in functional annotations (see "Results").Besides, we observe that the most important WGCNA modules are actually an amalgamation of a few of the TGCN modules which clearly demonstrates that TGCNs are a simpler and more specific way to describe a trait of interest in terms of transcriptomics.
The targeted co-expression model we propose has some limitations in terms of applicability.Obviously, it is not directly applicable when there is no specific trait involved.However, transcriptomic cohorts created with no specific research hypothesis focused on a specific trait can yet be approached with this model.In such scenarios, we would create a TGCN for each trait separately used as a target.But considering that a biologically interesting TGCN has the potential to generate a large volume of information, having to manage multiple TGCNs at the same time makes the interpretation of all those models a hard task.(syn5550404).Normalized gene counts from the temporal cortex were downloaded together with donor phenotype.Samples that did not pass quality control were removed, that is, the subjects with low % mapped reads (< 85%) or sex discrepant gene counts and samples with gene body coverage ratio of > 3.0.For more information about the QC, please see (syn20818651).Then, the same data preprocessing as ROSMAP was applied.
MSBB brain data were obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (syn3159438).The dataset includes 214 frontopolar prefrontal cortex samples.Only participants classified as healthy controls (CDR = 0) and AD donors (CDR > 3) were kept for downstream analysis.RNA-seq raw counts were downloaded from synapse and transformed to RPKM.Then, the same data preprocessing as ROSMAP was applied.

TGCN workflow
We have divided the generation of TGCN networks in three steps (Fig. 1).First, we select the specific transcripts for the trait of interest (i.e.seed transcripts) with LASSO regression 27 .Then, we create the network around those transcripts and annotate the modules.

Seed transcripts identification
Using the transcriptomic data as input and the trait of interest as outcome, we apply LASSO regression to identify the more relevant transcripts (Fig. 3, step 1).LASSO is one of the most popular regularization methods for linear models.It is characterized by the inclusion of a L 1 penalty function at its cost function.The penalty is the sum of the absolute value of the β coefficients at the linear model compound by p predictors multiplied by (see below).The LASSO cost function is as follows: where RSS (Residual Sum of Squares) is the sum of quadratic errors when predicting the dependent variable with each training sample.This penalty term will zero out some coefficients, giving the LASSO regression the capability of feature selection 64 , required to identify the seeds.This property is optimized via the hyperparameter, estimated with cross-validation 65 .Gene expression profiles are very particular datasets for machine learning: they are highly dimensional (the number of potential predictors is of the order of thousands) and the predictors show high collinearity (groups of genes present high correlation amongst them).In these situations, it is easy to find many alternative linear models with approximately similar predictive performance.This leads to algorithm instability, when a small change in the training samples imply an appreciable change in the gene set used at the final LASSO model.In order to increase LASSO stability, we bootstrap the LASSO.Increasing stability also improves the reliability of the predictors (transcripts in this case) selected as relevant by LASSO, i.e., the LASSO solution.To this end, we generate b different LASSO solutions from b bootstrap train-test resamples.Each LASSO model in the bootstrap is created using a 5-folds cross-validation with the cv.glmnet function from glmnet package 66 and only those transcripts with non-zero coefficients are considered.
Once all b LASSO models are created, the transcripts selected by the LASSO are grouped based on the number of times the transcript was selected, r, across the b runs, with 1 ≤ r ≤ b .If the user does not specify a value for r, then for each value of r, the software creates a linear model to predict the trait of interest using as predictors the transcripts selected r or more times.Models are created with the Caret R package 67 with a 5-folds cross-validation repeated b times (i.e., b is 10 by default).It is up to the user to find the right trade-off between complexity of the model and variance of the trait explained by it (more complex models lead to better approximation of the trait, but they are harder to manage as they will be larger).The TGCN software assists in this process through the generation of a detailed report of the whole analysis (see the front page of the software at GitHub for a detailed explanation).

Building co-expressed modules around each seed transcript
After selecting the seeds, the TGCN software characterizes the pathways involved to better describe how the seeds are associated with the trait.To do this, we select the most co-expressed transcripts for each seed (Fig. 3, step 2).We define co-expression of a pair of transcripts as the absolute value of their Pearson correlation.
To optimize on the number of transcripts/features to include in a module, the TGCN R Package offers three different approaches.The simplest approach assumes that all seed transcripts are equally relevant, therefore, all modules are created with the same number of transcripts, i.e., those with highest correlation with the seed (100 by default).An alternative option assumes that seeds have different relevance, represented by their coefficient, in absolute value, at the linear regression model created to predict the trait.Thus, the resulting module size is proportional to the relevance, with a minimum size specified as a parameter (10 by default).The seed transcript with the highest coefficient will lead to the largest module (100 transcripts by default), followed in size by the second most relevant seed transcript and so on.The size of module n, S n is.

Figure 1 .
Figure1.Study of lasso stability in high-dimensionality problems.(A) The number of new transcripts selected by a LASSO run when increasing the number of runs in comparison to only one iteration for GTEx cortex tissue with donor´s age as the predicted variable.Increasing the number of runs shows a strong increase in the number of transcripts selected at least once (red curve).We are interested in the transcripts selected by LASSO in a reasonable number of iterations, e.g. 6, 7 or 8 in the graph.(B) The plot shows a positive association, R 2 0.279 (P < 8.66 × 10 -38 ) between the eligibility of a transcript (i.e., number of times it is selected across the b LASSO bootstraps) and the average size of the coefficients for that transcript within the linear models that predict the trait.This plot corresponds to the age-TGCN for GTEx cortex.Each point is a transcript, the x axis represents eligibility and the y-axis represents the mean coefficient across linear models predicting age (see supplementary tableS3).(C) Both plots show the relation between the number of transcripts used to predict age and the adjusted R 2 for all age-TGCNs for GTEx brain tissues.We see a clear negative relation between adjusted R 2 at predicting age as the ratio of appearance increases (left), that is, because the number of transcripts decreases within the model (right).Plots in B and C illustrate a trade-off in using TGCNs.The larger the seed size, the better performer is the final linear model to predict the trait, but the TGCN gets also larger and difficult to interpret.(D) The influence of the number of samples used to train the model, m, is represented in terms of LASSO stability.For five different brain tissues of the GTEx dataset, 10 LASSO runs were applied using 25, 50, 75 and 100% of the samples for the training.We observe that, the higher the m, the higher the LASSO stability.Jaccard represents the overlap of relevant transcripts between pairs of iterations, nzero represents the number of transcripts selected through the 10 iterations and RMSE train and test represent the train and test error of the LASSO models. https://doi.org/10.1038/s41598-024-67329-7 https://doi.org/10.1038/s41598-024-67329-7

Figure 3 .
Figure 3. Workflow for the generation of targeted co-expression networks.The pipeline to create a TGCN is split into three steps (1) select the specific transcripts for the trait of interest (i.e.seed transcripts) based on a bootstrapped LASSO, (2) create the TGCN around those transcripts and (3) annotate the network modules.Step 1: to identify the transcripts that best predict the trait of interest, bootstrapped LASSO is used based on gene expression.Bootstrapped LASSO increases reliability of selected transcripts (seeds): the LASSO model is created b times with different train and test resamples.After all bootstraps, the TGCN software reports transcripts selected one or more times, along with how many times they were selected.Seed transcripts are selected on the basis of their ratio of appearance across bootstraps, r.Those transcripts are then used within a linear regression model as predictors to evaluate their prediction power on the trait and, therefore, the relevance of gene expression predicting that trait.Step 2: each seed transcript is the seed for a co-expression module.Modules are completed by adding the genes more co-expressed with the seed based on the absolute Pearson correlation.To determine how many transcripts to add to each module, the algorithm offers three options.In the first option, all seed transcripts are equally relevant within the network, therefore all modules should have the same size (i.e. 100 transcripts).In the second option, the higher the magnitude of the coefficient in the linear model, the more relevant the seed is within the TGCN.Therefore, the seed transcript with the highest coefficient receives the maximum number of transcripts (i.e. 100 transcripts).In the third option, we look for the minimum size of the module that reports the best enrichment possible. ) impairment based on clinical and neuropathological exams.ROSMAP cohort demographics are represented in supplementary Fig.S5.Only participants classified as healthy controls and AD were kept for downstream analysis.ROSMAP bulk RNA-seq data was downloaded as FPKM counts together with the corresponding metadata and phenotype.The same data preprocessing as ROSMAP was applied.The Mayo cohort includes whole transcriptome data for 276 temporal cortex samples from individuals with caucasian ancestry with neuropathological diagnosis of AD, progressive supranuclear palsy, pathologic aging or elderly controls without neurodegenerative diseases Vol.:(0123456789) Scientific Reports | (2024) 14:16675 | https://doi.org/10.1038/s41598-024-67329-7www.nature.com/scientificreports/