A simplified protocol for performing MAGMA/H-MAGMA gene set analysis utilizing high-performance computing environments

Summary Here, we present a quick-start protocol to perform generalized gene-set analysis of GWAS data on a metaset of gene lists generated by upstream pipelines, such as differential expression analysis, using the Multi-marker Analysis of GenoMic Annotation (MAGMA) software package and Hi-C coupled H-MAGMA annotation data (de Leeuw et al., 2015; Sey et al., 2020). We specifically tailor the steps and operations to meet the multithreading capability in modern computers, a feature nowadays shared by personal computers and high-performance clusters alike. For complete details on the use and execution of this profile, please refer to Yao et al. (2021).

4. Before commencing, it is strongly recommended to use a Python distribution/package manager such as venv/Miniconda/Anaconda to install and manage any dependencies that are used by creating virtual environments. 5. Download MAGMA and H-cAGMA from a credible source, preferably from its distributor site.
The distributor sites have been listed in the key resource table.

Create a new virtual environment
Timing: 30 min to 2 h (depending on whether Miniconda has been pre-installed) 6. Check the presence and install Anaconda/Miniconda on the system. a. Open a terminal window or connect to the host through an SSH terminal, type in: If there is any returning output similar to ''conda 4.10.1'', then it means that an anaconda/Miniconda package manager has been pre-installed with the system. Skip directly to 2. Otherwise, proceed with step b to install the Miniconda package manager.
Note: some university-operated computing clusters use their own package managers to load/ unload package modules, consult system administrator before processing.  iii. Download the MAGMA auxiliary files.
It is also handy to download the gene location data, SNP synonyms, and reference data files from Phase 3 of the 1000 Genomes Project (Genomes Project Consortium et al., 2015) for subsequent use (included in the Deposited data section of the key resources table). Note: all the data listed on this page were built based on GRCh37/hg19 genome. Hence, if you have any coordination data based on the more recent GRCh38/hg38 genome, you have to either perform a genome liftover or find the corresponding hg38 version of the reference files (the GitHub repository of this article includes an introduction on downloading and creating a set based on the 1000G European population). Place the auxiliary files in a separate path from the MAGMA executable (e.g., /data/ your-name/magma_aux_files/).

STEP-BY-STEP METHOD DETAILS
We will recapitulate the H-MAGMA analysis in Figure 4C of Yao et al. (2021), with an added section of standard MAGMA analysis using a user-generated annotation file instead of the gene annotation file provided by H-MAGMA from Hi-C data. Here we will start with a clean slate by directly using the .bim file (the SNP location file of the plink bim/bed/fam file sets) associated to the g1000_eur data set (SNP_Loc_File, download links included in KRT) and one revised version of the gene location file from the original MAGMA gene location build 37 file (Gene_Loc_File, download links included in KRT). As an example, we could use the .bim file (such as the g1000_eas.bim or g1000_eur.bim files from the 1000 Genome Project, downloadable in KRT) as the SNP_Loc_File and the Rev.NCBI37.3.gene.loc (also in KRT) as the Gene_Loc_File as a start.

Run gene annotation
1. Pre-run preparations. a. Confirmation of data format compatibility. It is of utmost importance that all files use the same naming rules during the analysis since different data files, especially reference data files, may be acquired from distinct sources. The common fields that could cause incompatibility and therefore error during analysis include, but are not limited to: i. Chromosome names: check the first few lines of both SNP_Loc_File and Gene_Loc_ File and confirm whether the chromosome names are consistent (e.g., both ''1'' or ''chr1''). Since the SNP_Loc_File (.bim) file usually comes with chromosome names in 1, 2, ., in most cases it would be the Gene_Loc_File that has an excessive ''chr'' prefix to the chromosome names. To remove the prefix, simply run the following bash script: ii. Gene names: Different Gene_Loc_File may use different strings (gene symbols, ENSEMBL gene ID, NCBI gene ID, etc.) in the first column of Gene_Loc_File. This column needs to be consistent with the names used in the gene set file (Set_File) that will be mentioned later. If different gene symbols (such as ENSEMBL gene symbols vs UCSC gene names) were used between the two files, some lookup table-based conversion may need to be performed. Some basic (but very handy) conversion can be performed on g:Profiler (https://biit.cs.ut.ee/gprofiler/convert) and the user could consult additional web resources, such as R biomaRt package, for more in-depth study themselves. iii. (Rare) 0-based or 1-based gene tracks. In some rare cases when the gene annotation file (Annot_File) was pre-generated rather than made de novo (as in the case of H-MAGMA), a 0-based Annot_File might be used together with a 1-based Set_File (especially when genomic intervals were used instead of gene names). In order to avoid such misadventure, it is recommended to spot-check a few known SNP and genes for their positions to see if they are consistent. b. Decide the size of the annotation window around genes.
In order to include up/down-stream SNPs into genes of interest for analysis, it is necessary to set a sliding window around the genes using the -annotate flag. Although there is no universally agreed value in setting the size of the sliding window, users are recommended to check related published results for optimized results.
(!) Warning: if the ignore-strand modifier is used, all genes will be assumed to be located on the positive (+) strand with the strand column information ignored. Use with caution.

Run the annotation.
It is recommended to write and save each command as a separate bash script for easy debugging and code recycling using editors such as Vim, GNU Emacs, or VS Code.
It is also recommended to pre-define the location of all data files as variables instead of embedding them in the magma command line, for easy code recycling.
Assuming that the magma executable path has been added to your $PATH variable.
Here we will run a standard annotation using window size of 100 kb upstream, 20 kb downstream, output file prefix is miR137_100k_20k and the output file will be sent to the current directory. If performing H-MAGMA analysis rather than the standard MAGMA model, skip from here to step 3. Save the following code as gene_annotation.sh.
Note: users should carefully choose the path of ''/data/your-name/magma-aux-files/ '' section since they need to define this part themselves. The Output_Prefix is customizable as long as it is kept consistent in all subsequent steps. This rule applies in all subsequent steps and associated code blocks.
Then execute the following command in the console.
After this operation, the user will find two files: miR137_100k_20k.genes.annot and miR137_100k_20k.log generated under the current directory. The .log file recorded the actual commands that were executed as well as some details during the program execution. Nevertheless, here we focused on the generated.genes.annot file (Annot_File) for the next step of gene analysis. If we examine the content of the Annot_File, we will find a row-based file that each gene was marked by an interval of its length plus the window size, followed by the id of all the SNPs that fall within the interval.

Gene analysis
Timing: 30 min to 1 h During this step, we will calculate the p-values and associate metrics of each gene and store them in the generated .genes.out and .genes.raw files for subsequent gene-level analysis uses. Moreover, we will take advantage of the parallel computing ability of modern computers/clusters to boost the process to a significantly faster degree.
3. Identify and confirm the available system resources to use.
a. Identify the numbers of system CPUs/threads. In the console, type in: SNP_Loc_File="/data/your-name/magma-aux-files/SNP_Loc_File" Gene_Loc_File="/data/your-name/magma-aux-files/Gene_Loc_File" b. Identify the size of available memory to use.
In the console, type in: The console returns: So we can see there are 458 GiB of memory at our disposal.
4. Compose the main code part for gene analysis using SNP p-value data with parallel processing. a. Verify GNU Parallel is installed on the system. In the console, type in: See if the console successfully returns the current version of the parallel installed. b. Calculate the appropriate number of parallel threads to use.
As a general rule, each MAGMA session would use approximately 1 GiB of memory. Hence, the maximum theoretical parallel thread numbers we can set will be the size of available memory divided by 1 GiB. However, we should also consider the fact that the number of parallel threads cannot exceed the count of logical cores of the system, i.e., the number we collected from step 3a. Also, we need to reserve some resources for system use. recommended that the number of parallel threads to use should not exceed 10 on computers < 16 logical cores and not exceed 40 in mainframe clusters, as long as the user-available memory permits. c. Prepare a temp working folder for storing intermediate batch files and write the code. Here we will use the data file from the Psychiatric Genomics Consortium Schizophrenia Wave 3 (scz2021) (The Schizophrenia Working Group of the Psychiatric Genomics Consortium et al., 2020) as the SNP_Pval_File). The download link has been included in the key resource table.
In a text editor, type in the following code and save it as gene_analysis.sh Then execute the following command in the console.
This snippet will automatically perform a series of operations as the in-line comments have noted.
Finally, it will generate three files started with the Out_Prefix: a .genes.raw file contains the raw permutation results, a .genes.out file contains the per-gene statistics, and a .log file records the commands used. Here we choose the SNP-wise mean model for our PGC Schizophrenia analysis since we know that mental disorders tend to be of cumulative, non-mendelian inheritance, therefore a model that is more attuned to the mean SNP association and skews towards associations in areas of higher LD in individual genes is preferable. However, the in-depth discussion on the statistical basis and their associated disease application would be too complex and beyond the scope of this protocol. For further discussion on all the modifiers, their statistical basis, and the potential impact on the performance, we recommend the user read the official MAGMA manual and original publications.
Users should also note the N/ncol modifier under the -pval option that is used to specify the sample size. First, not all the SNP_Pval_File has the sample size column. In these cases, the sample size is usually stated in the associated README file and users should specify N = [sample_size] (e.g., N = 25350) instead of using the ncol modifier. Secondly, SNP_Pval_File from different sources may have different names for their sample size column (e.g., Nca, Neff, Ngt). User should manually check the first few lines of the SNP_Pval_File file and confirm the column name before applying in code.
The MAGMA user manual contains a detailed explanation of the structures and columns of the generated .genes.out file. Users may want to refer to the corresponding section for the information contained in each column if interested.
CRITICAL: If performing H-MAGMA analysis instead of standard MAGMA analysis, it is crucial to use the H-MAGMA provided Annot_File corresponding to your cell/tissue of interest instead of the one generated in step 2.

Gene set analysis
Timing: 15-30 min In this section, we will integrate the gene-level analysis results derived from the previous section together with a list containing one or more gene sets for a linear regression model-based test.
Note: the original H-MAGMA package provided input files using gene names in ENSEMBL gene ID without the sub-version numbers, which are different from the record in most GENCODE or ENSEMBL GTF annotation files (e.g., ENSG00000153266 instead of ENSG00000153266.13). Users are advised to verify their own gene list to make sure they are compatible.
5. Prepare the gene set list file.
The gene set (Set_Annot_File) could be written in either row-based or column-based format (see Appendix A of the MAGMA manual for details). From our experience, the column-based format allows greater versatility and better clarity when working in conjunction with other bioinformatic package outputs, although using the column-based input file format does require the addition of the ''col'' modifier (see below). Another note is each line of the Set_Annot_File needs to terminate with the newline (\n) character only but not the Windows carriage return (\r\n), hence, some tab-delimited files exported directly from Microsoft Excel in Windows may not work before having their \r character stripped.
6. In a text editor, type in the following code and save it as gene_set_analysis.sh Then execute the following command in the console.
We also provided another file set for the readers to recapitulate the analysis in the Figure 4C of Yao et al. (2021) using the iPS-derived neuron Hi-C data from H-MAGMA (Figure 2A). The corresponding file set can be found in the GitHub repository of this article.
The structure and columns of the output .gsa.out file has been well-documented in the MAGMA manual. Essentially, we will need the beta (regression coefficient) and P values to see if our different gene sets are more concordant to our disease model (PGC3 schizophrenia here).
Optional: Make bubble plots using R and ggplot2 to show beta and enrichment value.

#!/bin/bash
Gene_Results_File="miR137_PGC_SCZ_w3.genes.raw" Set_Annot_File="/data/your-name/analysis-results/miR137_gene_set_list_4_magma.txt" Output_Prefix="miR137_100k_20k_gene_set_results" To replot the Figure 4C of Yao et al. (2021), create a new R project and a new R code window in RStudio. Ensure that library ggplot2and RColorBrewer has been installed. Then type in: Then execute the whole code block in RStudio (select-all then ctrl + ENTER) or execute line-by-line (ctrl + ENTER). The generated ggplot2 graph will appear in the plot window of the RStudio. This will generate panel B in Figure 2. To generate Figures 1 and 2A, use the code provided in the GitHub repository.

EXPECTED OUTCOMES
Here we can compare the two sets of outputs, one standard MAGMA results using sliding window size of 100 kb, 20 kb size, the other using H-MAGMA model from iPS-derived neuron Hi-C data  (Rajarajan et al., 2018) from the same five sets of genes in the Table S1 of Yao et al. (2021) (Figures 1  and 2, an organized, ready-to-use version was also provided in the associated GitHub repository). The raw .gsa.out files were provided on GitHub.

LIMITATIONS
As noted above, there has not been a universally accepted value of the sliding window size around genes and this may need several rounds of optimization. In addition, there are some recent discussions related to the statistical stability of the H-MAGMA analysis and possible fallouts (de Leeuw et al., 2020;Yurko et al., 2021). Users are advised to explore these discussions and make the best decisions for their specific cases.

TROUBLESHOOTING
Problem 1 MAGMA pipeline fails at merge (step 4c) In some rare cases, MAGMA returns a merging error after finishing the parallel gene analysis step. It is usually caused by one of the many parallel threads that unexpectedly failed during execution, thus resulted in a missing part in the generated file series.

Potential solution
In most cases, simply re-run the bash snippet should solve the issue. If the issue still remains unsolved, comment out the rm -r temp_annot line at the end of the snippet and investigate the content of temp_annot. Check if there is any missing entry in the consecutive numbering of $Output_Prefix_(method).batch(X)_N.genes.out files. If the missing number was found as X, step back to the parent directory and execute: Here N will be the same number of threads (e.g., 8) used in the previous snippet and this one-line will generate the single missing file. Then proceed with merging:

Problem 2
In some cases, the GWAS reference file may contain malformed lines. These lines may cause errors in some versions of MAGMA (malformed line numbers will be reported in the MAGMA output).

Potential solution
Open the text file in a text editor (e.g., VIM) and manually remove malformed lines. The line number option in VIM can be turned on by using :set number command

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the corresponding author (szhang@northshore.org).

Materials availability
No biological reagents are required as part of this protocol.

Data and code availability
The dataset used in Yao et al. (2021) is available for download as supplementary files of the original research paper. For the GWAS database used and additional sources, please refer to the key resource table. Additional data files, scripts, and sample output can be found at the GitHub repository for this protocol at https://github.com/endeneon/MAGMA_analysis_protocol.