Integrating Electronic Health Records and Polygenic Risk to Identify Genetically Unrelated Comorbidities of Schizophrenia That May Be Modifiable

Background Patients with schizophrenia have substantial comorbidity that contributes to reduced life expectancy of 10 to 20 years. Identifying modifiable comorbidities could improve rates of premature mortality. Conditions that frequently co-occur but lack shared genetic risk with schizophrenia are more likely to be products of treatment, behavior, or environmental factors and therefore are enriched for potentially modifiable associations. Methods Phenome-wide comorbidity was calculated from electronic health records of 250,000 patients across 2 independent health care institutions (Vanderbilt University Medical Center and Mass General Brigham); associations with schizophrenia polygenic risk scores were calculated across the same phenotypes in linked biobanks. Results Schizophrenia comorbidity was significantly correlated across institutions (r = 0.85), and the 77 identified comorbidities were consistent with prior literature. Overall, comorbidity and polygenic risk score associations were significantly correlated (r = 0.55, p = 1.29 × 10−118). However, directly testing for the absence of genetic effects identified 36 comorbidities that had significantly equivalent schizophrenia polygenic risk score distributions between cases and controls. This set included phenotypes known to be consequences of antipsychotic medications (e.g., movement disorders) or of the disease such as reduced hygiene (e.g., diseases of the nail), thereby validating the approach. It also highlighted phenotypes with less clear causal relationships and minimal genetic effects such as tobacco use disorder and diabetes. Conclusions This work demonstrates the consistency and robustness of electronic health record–based schizophrenia comorbidities across independent institutions and with the existing literature. It identifies known and novel comorbidities with an absence of shared genetic risk, indicating other causes that may be modifiable and where further study of causal pathways could improve outcomes for patients.

nearest neighbors were from the same population, we inferred the detected MEGA individual's ancestry.(If they are from more than one super population, we assigned the individual's ancestry as admixed one.)Among the 90,313 MEGA individuals, 87,558 (96.95%) were assigned to five homogeneous super-populations, i.e.AFR=13,808, AMR=2,446, EUR=70,473, EAS=441, SAS=390.(5) The final set used in our main analyses was the EUR only subgroup.
Prior to imputation, the SNP position, alleles, Ref/Alt assignments and frequency differences were checked by comparison with the Haplotype Reference Consortium (HRC) panel (Version r1.1 2016) (6) in build GRCh37(7) using McCarthy Group Tools (https://www.well.ox.ac.uk/~wrayner/tools/). SNPs with inconsistent alleles, with > 0.2 allele frequency difference, or not in the reference panel were removed.Phasing and imputation were then performed using a standard pipeline on the Michigan Imputation Server (MIS).(8) The total QC-ed SNP data was divided into five batches.Phasing was performed using Eagle version v2.4.(6) The HRC reference panel in build GRCh37 was selected using mixed population.(6,7) The genotype probabilities in post-imputed data were converted to hard-call genotypes using PLINK2 (hard-call >= 0.1).(9)SNPs were removed with imputation info score in any of the batches < 0.3, position duplicates, missing genotype rate > 0.02, or multiallelic states (>2).For autosomal data, we reused a series of QC filtering steps, excluding SNPs with MAF < 0.005, HWE test P value < 1x10 -6 , and removing the individuals with missing rate ≥ 0.02, excess heterozygosity rate over 3* interquartile range (IQR) of the upper heterozygosity quartile (Q3) for each subset.

Genotype sample description, ancestry estimation, and quality control (QC) at MGB
Mass General Brigham Biobank (MGBB) samples were genotyped using the Illumina Multi-Ethnic Global array with hg19 coordinates.(1,7)MGBB initially contained 1.7 million SNPs and 36,424 individuals.
Variant-level quality control filters removed variants with a call rate < 98%, as well as those that were duplicated across batches, monomorphic, not confidently mapped to a genomic location, or showed an association with genotyping batch.Sample-level quality control filters removed individuals with a call rate < 98%, MAF < 0.05, strand ambiguous SNPs and long-range LD regions (chr6:25-35Mb; chr8:7-13Mb inversion).
Principal components of ancestry were calculated in the 1000 GP3 reference panel(2) and subsequently projected onto the MGBB dataset, where a Random Forest classifier was used to assign ancestral group membership for those with a prediction probability > 90%.Using the assigned ancestral group membership, we identified individuals with EUR ancestry.Within the European samples, sample-level quality control filters removed samples with excessive autosomal heterozygosity (±3 standard deviations from the mean), related individuals (pi-hat < 0.2), or discrepant self-reported and genetically inferred sex.Variant-level quality control filters removed SNPs with a call rate < 98%, HWE test P value < 1x10 -10 and retained only autosomal SNPs (exclude indels and monomorphic SNPs).
The MIS (8) was then used to impute missing genotypes with the HRC(6) dataset serving as the reference panel.Imputed genotype dosages were converted to hard-call format and subjected to further quality control, where SNPs were removed if they exhibited poor imputation quality (INFO < .80),call rate < 98%, MAF < 0.01, or HWE test P value < 1x10 -10 .These procedures yielded a final analytic sample of 25,698 individuals in the MGBB with a set of high-quality autosomal SNPs (N=909,229).

AFR ancestry schizophrenia PRS calculation and PheWAS (VUMC only)
PRS were calculated using the Python based program PRS-CSx and the 1000 Genomes Project African cohort and European cohorts to represent the LD structure between SNPs of each ancestry.(2,10)The SNP effect sizes from the Psychiatric Genomic Consortium's (PGC) latest available genome-wide association study (GWAS) for schizophrenia were used to calculate the PRS.(11) These results came from a meta-analysis of 9 cohorts that included 6,152 cases and 3,918 controls of African ancestries.(11) We used the larger sample size of the EUR summary stats (53,386 cases and 77,258 controls) and the multi-ancestry capabilities of PRS-CSx to boost power for the AFR cohort.(11) The AFR cohort .bimfile was used as the base population for PRS-CSx estimates instead of the EUR cohort.The scaling parameter, phi, was set to 1e-2 to represent the polygenic architecture of schizophrenia.The PRS-CS generated posterior effects for SNPs were then used to calculate PRS for each cohort using the PLINK-score flag with sum as a modification to output SCORESUM instead of SCORE.(3)As with the EUR sample, PRS was calculated with schizophrenia patients excluded from the PheWAS analyses to exclude association as a byproduct of comorbidity.
Since we estimated PRS for all those genotyped without a schizophrenia diagnosis, there were 1,406 phecodes out of the possible 1,866 with at least one person as a case for the AFR cohort.For each of these 1,406 phecodes, we performed a logistic regression testing its association with the schizophrenia PRS including covariates of sex, current age, record length in days, and the first 10 principal components (PCs) for African ancestry.As with EUR ancestry, at least two instances of the phecode had to be present to be considered a case and phecodes listed as exclusions were removed from the controls.