Analysis plan for primary cohort GWAS of blood lipid levels for the Global Lipids Genetics Consortium


This protocol describes the guidelines for generating GWAS summary statistics that were provided to all participating cohorts in the Global Lipids Genetics Consortium HRC + 1KGP3 imputation meta-analysis, conducted jointly with the GIANT consortium. It covers generating the input phenotype and genotype files for use with the rvtests software. Key stages include imputation, generating phenotype files, and running the analysis to generate summary statistics. The exact time needed to carry out this protocol varies depending on the existing genotype files and the size of the cohort. The latest version of rvtests is available at: http://zhanxw.github.io/rvtests/


Overview
There are three major components to this analysis plan:
1) Genome-wide genotypes must be on the correct build (37/hg19) and correct strand (forward).
2) For ALL studies, imputation of genotypes is performed using the 1KG phase 3 panel (if you have not already done so) and, for studies with samples of European ancestry, also to the large haplotype panel from the Haplotype Reference Consortium. If you have already imputed to a different large haplotype panel (e.g. UK10K), please contact us.
3) Association analysis is conducted using specific software tools that will provide all necessary summary statistics for flexible central meta-analysis.
This coordinated plan between Global Lipids and GIANT is meant to reduce the burden on primary analysts. We welcome you to join one or both consortia. We thank you for your participation in past projects, and particularly welcome studies that are new to either consortium.

Software
Below is a list of the software you will need to complete this analysis plan. For some tools, using the specific version listed may be important.
Phasing algorithms such as SHAPEIT, Eagle, HapiUR, and Minimac are only necessary if you're conducting imputation in-house rather than using an imputation server. We provide generic instructions as well as instructions specific to the University of Michigan imputation server. Some studies may have used or be using other servers (e.g. the Sanger server); please contact them for instructions specific to that imputation server. All studies must have some version of a genome-wide array, for example with >200,000 genome-wide common variants. Targeted arrays with limited/incomplete coverage of the genome (e.g. MetaboChip, ImmunoChip, Exome Chip, etc.) should not be used unless merged with a genome-wide array. If you are unsure, please contact Joel and/or Cristen for specific advice.
Individual studies should provide information in the Excel spreadsheet about the manufacturer and version of the array(s) they are using.
QC should be done separately for individual studies, for major continental ancestries within a study, and also separately for samples of the same study that were genotyped on different arrays.

Genotype QC
Typical pre-imputation QC criteria: These are some steps that we recommend for sample QC. Additional QC steps may be needed and should be determined by the local analysts for each study. Studies should provide a brief description of QC criteria in the Excel sheet.

Prepare files for imputation
The "HRC/1KG Imputation Preparation and Checking Tool" developed by Will Rayner will check input data for accuracy relative to expected HRC or 1000G inputs prior to imputation.
This process will identify errors in your input data, including incorrect REF/ALT designations, incorrect strand designations, extreme deviations from expected allele frequencies, and palindromic (A/T and G/C) SNPs with allele frequency near 0.5 that are often the source of imputation errors, and it generates commands to make files that have fixed or removed these problematic variants.
The tool also requires frequency files from plink, which can be created as follows: plink --bfile <binary plink prefix> --freq --out <input file prefix>
With these input files the tool can be run. The perl script automatically produces a shell script called Run-plink.sh, which contains a set of plink commands that should be run to update or remove SNPs based on the checks and to create one updated binary plink fileset per chromosome. For this, the paths to the original study binary files should be adjusted in the shell script, and the script should be started by typing ./Run-plink.sh at the command prompt.
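The checking-tool workflow described above can be sketched as below. The study prefix ("mystudy") and the HRC site-list filename are assumptions; substitute your own binary fileset and the reference panel file you downloaded with the tool.

```shell
# 1) Allele frequencies from the study's binary PLINK fileset
#    ("mystudy" is a placeholder prefix):
plink --bfile mystudy --freq --out mystudy

# 2) Run the HRC/1KG checking script against the HRC site list
#    (-h selects HRC mode; the .tab filename is an assumption):
perl HRC-1000G-check-bim.pl -b mystudy.bim -f mystudy.frq \
     -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

# 3) The script writes Run-plink.sh; adjust its paths to your binary
#    files, then execute it to produce one corrected per-chromosome
#    binary fileset:
sh Run-plink.sh
```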
The cleaned/updated binary files (one for each chromosome) generated by this tool should be ready for upload to the imputation server below after changing them to VCF format using plink2, bgzip, and tabix in the UNIX environment.
There are two ways to conduct imputation: a) an imputation server or b) in-house imputation. The following instructions provide details for imputation with the Michigan Imputation Server, but others such as the Sanger server are also acceptable. We recommend you use an imputation server if possible.
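The conversion to bgzipped, indexed VCF might look like the sketch below. The per-chromosome prefix "mystudy-updated-chr" is an assumption about how Run-plink.sh names its output; adjust to your actual filenames.

```shell
# Convert each per-chromosome PLINK fileset to a bgzipped VCF and index it.
for chr in $(seq 1 22) X; do
  # plink2's "vcf bgz" export writes a .vcf.gz directly;
  # id-paste=iid uses the individual ID as the VCF sample ID.
  plink2 --bfile mystudy-updated-chr${chr} \
         --export vcf bgz id-paste=iid \
         --out mystudy-updated-chr${chr}
  tabix -p vcf mystudy-updated-chr${chr}.vcf.gz
done
```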
Note: If you prefer, the imputation server has a 'Quality Control only' option you can use prior to imputation to ensure that no samples will be eliminated during imputation of the autosomes or the X chromosome.
If you run 'QC and imputation' and samples are removed (e.g., due to gender discordance), please make sure that the number and order of samples in the X chromosome and autosomes match, as described in section 2.3.

Ancestry of samples and imputation panels:
We ask all studies that have not imputed to 1KG phase 3 to use the imputation server to do so. In addition, for studies with individuals of European-ancestry, we ask that they also use the imputation server to impute using the HRC panel.

Phasing
The imputation server will automatically determine if you provide phased or unphased VCF files. We encourage re-phasing using Eagle (which is implemented by the imputation server). If you would like to provide phased haplotypes, please convert them to VCF format.
An example command to convert haplotype format of HAPS/SAMPLE files from SHAPEIT to VCF would be: shapeit -convert -input-haps study.
Make an account and follow the instructions on the imputation server website. You will upload VCF genotype files to the server (only the VCF file format is supported; instructions on converting to VCF are contained on the imputation server website under "Help" and above in the section "Prepare files for imputation"). One VCF file must be submitted for each chromosome.
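A fuller version of the SHAPEIT conversion, with the output compressed and indexed for upload, might look like this. The output name "study.phased.vcf" is an assumption; "study" stands for your SHAPEIT .haps/.sample prefix.

```shell
# Convert SHAPEIT HAPS/SAMPLE haplotypes (prefix "study") to VCF:
shapeit -convert \
        --input-haps study \
        --output-vcf study.phased.vcf

# Compress and index for upload to the imputation server:
bgzip study.phased.vcf
tabix -p vcf study.phased.vcf.gz
```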
Downloading your imputed genotypes, info files, and QC report
When imputation has finished you will receive an email alert. The imputation server will automatically encrypt all your imputed genotypes (for protection during download). The password to decrypt the files will be in the email notification, so don't delete that email! When you download your imputed genotypes, please be sure to download all available files (the qcreport, statistics, zip files, and all the log files). We will ask that you submit most of these files to us along with all other files generated as part of this analysis plan. Please do a quick check of the qcreport.html and statistics.txt files before proceeding with the analysis plan. If you are unsure of your imputation quality, please contact us. You will also upload these files to the server (see 8c).
X chromosome
NOTES on chromosome X: 1) If the server detects a sex discrepancy between genotypes and the provided PED file, discordant samples will be removed automatically prior to phasing and imputation. You will need to resolve these discrepancies genome-wide before proceeding with analysis.
2) You will receive two output VCF files for chromosome X, one for males and one for females. You will need to merge these into a single chrX VCF.
Both of these issues are addressed below in section 2.3.

Imputation In-House
If you have chosen to use an imputation server, then you can ignore this step.
Some studies will not be able to use the imputation server, for example if original participant consents forbid it. These studies will have to conduct imputation on their own. Please contact us if you need assistance.

Post-imputation sample harmonization and variant pruning
There are a few post-imputation processing steps that are necessary prior to analysis.
A. Harmonize samples and sample order between autosomes and chromosome X
As mentioned above, chrX imputation may have a different number and order of samples from the autosomes due to automated filtering on the imputation server. You need to reconcile these possible discrepancies before creating a single whole-genome VCF file, which will be used to create a genomic relationship matrix (kinship matrix) for the linear mixed model. If the sets of samples are not identical between the X and autosome VCFs (most likely because possibly sex-discordant samples were dropped during imputation of the X), you will need to eliminate any samples present only in the autosome or X VCFs, and then reorder the samples in the X chromosome VCF file.
The following bcftools commands 1) make a list of sample IDs in the chr22 VCF, 2) merge the male and female chrX VCFs and reorder the resulting chrX VCF according to the order of samples in chr22, then 3) generate an ordered list of IDs from chrX. Compare the two ID lists to confirm they are identical before proceeding.
Note that some cohorts have a small number of males who are heterozygous at many X chromosome markers outside the pseudoautosomal region. Because these create problems later in analysis, we recommend removing these individuals as well, in addition to sex-discordant individuals. The script then proceeds to reorder the file, omitting these observations. If you encounter this warning message, you will need to use the sample list file ID_order_chrX.txt to select the overlapping samples when creating the polymorphic datasets in section B below.
If you do NOT receive the warning message, please check that the sample list files ID_order_chrX.txt and ID_order_autosomes.txt are exactly the same (in content and order). If these two files are identical, you may proceed without the sample selection option in section B. This will speed up the operation of bcftools in section B.
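The three harmonization steps can be sketched with bcftools as below. The input filenames (chr22.dose.vcf.gz and the male/female chrX VCFs) are assumptions about how your imputation output is named; all inputs to bcftools merge must be bgzipped and tabix-indexed.

```shell
# 1) Sample IDs, in order, from an autosome VCF:
bcftools query -l chr22.dose.vcf.gz > ID_order_autosomes.txt

# 2) Merge the male and female chrX VCFs, then reorder the samples to
#    match the autosome order (bcftools view -S outputs samples in the
#    order given in the file):
bcftools merge chrX.males.dose.vcf.gz chrX.females.dose.vcf.gz \
    -Oz -o chrX.merged.vcf.gz
tabix -p vcf chrX.merged.vcf.gz
bcftools view -S ID_order_autosomes.txt chrX.merged.vcf.gz \
    -Oz -o chrX.ordered.vcf.gz

# 3) Ordered sample IDs from the resulting chrX VCF:
bcftools query -l chrX.ordered.vcf.gz > ID_order_chrX.txt

# Confirm the two lists are identical before proceeding:
diff ID_order_autosomes.txt ID_order_chrX.txt && echo "sample orders match"
```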

B. Create a list of monomorphic sites and subset polymorphic variants
These reference panels impute a great many variants into your samples, the vast majority of which are rare, and many will not be polymorphic in your sample. To minimize the size of output files in the association analyses below, we ask that you remove all monomorphic sites (i.e., sites without at least one variant hard-call genotype) from the input VCF files prior to analysis. In conjunction, we ask that you upload a list of the variants that you drop in this process. The following bcftools commands will generate both the list of monomorphic SNPs and the pruned VCF files.
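One way to carry this out with bcftools is sketched below. The input names (chr*.dose.vcf.gz) and output names are assumptions; AC/AN are recomputed from the hard-call genotypes so that the minor-allele-count filters behave predictably.

```shell
for chr in $(seq 1 22) X; do
  in=chr${chr}.dose.vcf.gz
  # Recompute AC/AN from the FORMAT/GT hard calls:
  bcftools +fill-tags ${in} -Ou -- -t AC,AN > chr${chr}.tagged.bcf

  # Monomorphic list: sites with zero minor-allele hard calls:
  bcftools view -C 0:minor chr${chr}.tagged.bcf | \
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' \
    > chr${chr}.monomorphic.txt

  # Polymorphic subset, kept for association analysis:
  bcftools view -c 1:minor chr${chr}.tagged.bcf \
    -Oz -o chr${chr}.poly.vcf.gz
  tabix -p vcf chr${chr}.poly.vcf.gz
done
```

If the chrX warning from section A applies to your study, add the sample-selection option (e.g. -S ID_order_chrX.txt) to the bcftools view calls.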

· If LDL was measured after 1994 (measured and not estimated by Friedewald), then adjust LDL values for individuals taking lipid-lowering medication by using LDL/0.7 to approximate pre-medication LDL levels.
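The medication adjustment is a simple per-row division, sketched here with awk. The column positions (column 3 = LDL in mg/dl, column 4 = medication flag 0/1) and the tab-delimited file names are assumptions about your phenotype file layout.

```shell
# Divide LDL by 0.7 for medicated individuals; pass other rows through.
# Assumes a header row and tab-delimited columns: ... LDL(3) MED(4) ...
awk 'BEGIN{FS=OFS="\t"} NR>1 && $4==1 {$3=$3/0.7} {print}' \
    pheno_raw.txt > pheno_adjusted.txt
```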
(iii.a) Inverse normal transformed
Model: Generate residuals and inverse normal transform in men and women separately, and separately by case status if appropriate (e.g. where disease status is correlated with cholesterol you will have MenCase, MenControl, WomenCase, WomenControl, SexCombinedCase, and SexCombinedControl). For the sex-combined analyses, combine the "men-specific_LDL_INV" and "women-specific_LDL_INV" phenotype values to analyze together (all_LDL_INV).

(iii.b) Raw trait
Model: Only sex-combined. Generate residuals with sexes combined, but separately by case status if appropriate (e.g. where disease status is correlated with cholesterol you will have SexCombinedCase and SexCombinedControl).
(iv) Triglycerides
(iv.a) Natural log + inverse normal transformed
Model: Generate residuals and inverse normal transform in men and women separately, and separately by case status if appropriate (e.g. where disease status is correlated with cholesterol you will have MenCase, MenControl, WomenCase, WomenControl, SexCombinedCase, and SexCombinedControl). For the sex-combined analyses, combine the "men-specific_logTG_INV" and "women-specific_logTG_INV" phenotype values to analyze together (all_logTG_INV).
ln(raw triglycerides trait value in mg/dl) = age + age^2 + PCs (+ other study-specific covariates as needed) -> residuals -> inverse normal transformation (logTG_INV)
NOTE: GLGC requests that you include ~4 PCs as study-specific covariates.
ln(raw triglycerides trait value in mg/dl) = age + age^2 + sex + PCs (+ other study-specific covariates as needed) -> residuals (logTG_RAW)
NOTE: GLGC requests that you include ~4 PCs as study-specific covariates.
Non-HDL cholesterol (raw trait value in mg/dl for men) = age + age^2 + PCs (+ other study-specific covariates as needed) -> residuals -> inverse normal transformation (men-specific_nonHDL_INV)
Non-HDL cholesterol (raw trait value in mg/dl) = age + age^2 + sex + PCs (+ other study-specific covariates as needed) -> residuals (nonHDL_RAW)

Genotype-phenotype association
A. Group samples for analysis
· Please analyze major ancestry groups separately.

· We anticipate that some studies will have data from multiple genotyping arrays on samples from the same cohort. We expect there will likely be three typical situations:
o No sample overlap: analyze studies separately (batches with reasonably similar arrays can be analyzed together, using the array type as a covariate).
o All samples overlap: either a) select one array for imputation or, ideally, b) merge genotypes prior to imputation and perform a single analysis.
o Partial but significant overlap of samples: contact us, if needed, to customize a plan. The goal should be to upload sets of results from non-overlapping samples that can be combined in meta-analysis.

· If a set of samples being analyzed together is particularly large (for example, N > 30,000), the association plan may need to be modified. Please contact us to discuss further if this is the case.

B. Perform association analyses
Always test additive models using linear regression, using a method that accounts for genotype imputation uncertainty, and also accounting for known or cryptic relatedness (see below). Please indicate in your submission README file what method you have used for association analysis.
Each individual study will perform data quality control (QC) and analysis and provide summary results for meta-analysis. Results files will be deposited to a central repository (details are provided below) where QC/data cleaning and meta-analysis will be performed.
We request association analyses be carried out using rvtests (version 20170210).
This software can be used for samples that include families or when an empirical kinship matrix is required for analysis. The input files required (phenotype, covariates) for rvtests are compatible with PLINK formats. It also calculates the Hardy-Weinberg p-value and call rate for quality control purposes. We ask each study to generate kinship matrices (one for the autosomes and one for chromosome X) and fit linear mixed models to deliver single-variant analysis results for the additive model. Documentation for rvtests can be found here: https://github.com/zhanxw/rvtests
If you are unable to use a linear mixed model for some reason, please correct for ancestry/relatedness. At a minimum, use ~10 principal components as covariates to correct for population stratification and include principal components as indicated in the analysis models below. If you do not use rvtests, your results may not be usable for conditional analyses and aggregated analyses of rare variants. If in doubt, we are happy to advise. Please indicate in the Excel file requested below the method used to account for ancestry/relatedness.
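The kinship matrices can be generated with the vcf2kinship tool distributed with rvtests. The sketch below assumes a combined autosomal VCF and a chrX VCF with the names shown; --bn requests the Balding-Nichols kinship estimator described in the rvtests documentation.

```shell
# Autosomal kinship matrix for the linear mixed model:
vcf2kinship --inVcf all_autosomes.poly.vcf.gz --bn --out kinship_auto

# Chromosome X kinship (hemizygous region handled via --xHemi):
vcf2kinship --inVcf chrX.poly.vcf.gz --bn --xHemi --out kinship_chrX
```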

Summary level statistics for meta-analysis of variant associations
The following summary level statistics will be generated and shared for meta-analysis of variant associations, ideally using rvtests. rvtests requires an indexed VCF file (for genotypes) and a PED file (for phenotypes) as input. The instructions below are for rvtests. If you are using a different method, please contact us.
i. Basic VCF metrics, including reference and alternative alleles, chromosome positions, and strand. These statistics are not directly used in the computation of the variant tests but are needed for interpretation and meta-analysis.
ii. Single variant association test statistics, including direction of effect. We use score statistics calculated at each variant site. Another possible option would be to share estimates of genetic effects, but when variant frequencies are low, score statistics are more numerically stable and preferred.
iv. Covariance matrix for each genetic region. We compute the genotype covariance for all variants in a sliding window with a width of 500,000 base pairs. This matrix reflects linkage disequilibrium in the region and will be used for meta-analysis of aggregate region- or gene-based tests of rare/low-frequency variation.
2. Indexing the VCF files
rvtests works with tabix and takes indexed, bgzipped VCF files as input.
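If a chromosome's VCF is not yet bgzipped and indexed, the preparation is two commands; the filename here is an assumption.

```shell
# Compress with bgzip (tabix requires bgzip, not plain gzip) and index:
bgzip -c chr22.poly.vcf > chr22.poly.vcf.gz
tabix -p vcf chr22.poly.vcf.gz
```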

Specify VCF files
You should use --inVcf $your_vcf_file to specify which VCF to use.

Specify phenotypes
The rvtest tool requires a simple pedigree file that starts with the standard 5 columns (family id, individual id, father id, mother id, and sex) followed by the trait or trait residuals.
You can use --mpheno $phenotypeColumnNumber or --pheno-name to specify a given phenotype. The phenotype file is specified by the option --pheno example.pheno. The default phenotype column header is "y1". If you want to use an alternative column as the phenotype for association analysis (e.g. the column with header y2), you may specify the header name using either
· --mpheno 2 or
· --pheno-name y2
NOTE: to use "--pheno-name" the header line must start with "fid iid" as PLINK requires.
In the phenotype file, missing values can be denoted by NA or any non-numeric value. Individuals with missing phenotypes will be automatically dropped from the subsequent association analysis. For each missing phenotype value, a warning will be generated and recorded in the log file.
Optional: If using rvtests for phenotype transformation: if you would like to calculate residuals using the rvtests software, please refer to the covariates that need to be included for each trait as described in the phenotype transformation step; use the --covar and --covar-name options to designate covariates, and add the --inverseNormal and --useResidualAsPhenotype options while performing the analysis.

Analyses will loop through chromosomes
Many of the steps in this analysis plan will be done chromosome by chromosome. One very easy way to run by chromosome is with a for loop in bash or c-shell. An example is provided below; modify as appropriate for the phenotypes, sexes, and ancestries available in your study, the panel(s) being used for imputation, and for job submission with your compute resources.
NOTE: Some of these steps are not needed if you have already generated residuals and performed the inverse normal transformation. If using rvtests to generate residuals, note that the input covariates are not the same for all analyses.
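A per-chromosome loop of the kind described above might look like the sketch below. The trait name, phenotype file, VCF names, and kinship file names are assumptions carried over from earlier steps; --meta score,cov requests the score statistics and region covariance matrices described in the summary-statistics section.

```shell
#!/bin/bash
TRAIT=all_LDL_INV   # placeholder; loop over your traits as needed

# Autosomes: linear mixed model with the autosomal kinship matrix.
for chr in $(seq 1 22); do
  rvtest --inVcf chr${chr}.poly.vcf.gz \
         --pheno study.ped --pheno-name ${TRAIT} \
         --dosage DS \
         --kinship kinship_auto.kinship \
         --meta score,cov \
         --out ${TRAIT}_chr${chr}
done

# Chromosome X: add the X-specific kinship and hemizygous-region handling.
rvtest --inVcf chrX.poly.vcf.gz \
       --pheno study.ped --pheno-name ${TRAIT} \
       --dosage DS \
       --kinship kinship_auto.kinship \
       --xHemiKinship kinship_chrX.xHemi.kinship \
       --xHemi --meta score,cov \
       --out ${TRAIT}_chrX
```

Adapt the loop for each phenotype, sex stratum, and ancestry group in your study, and wrap it in your cluster's job-submission syntax as needed.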
For chrX, male genotypes in rvtests can be coded as 0, 1, 0/0, or 1/1, and dosages should be between 0 and 1; rvtests will convert these to a 0-2 scale automatically in analysis. rvtests already has preset values for the pseudoautosomal regions (PAR) and properly adjusts the analysis accordingly. rvtests will perform analysis on both PAR and non-PAR regions, but calculates HWE and allele counts in females only for non-PAR regions. rvtests association analysis should be conducted on polymorphic variants only.