Modeling of mitochondrial genetic polymorphisms reveals induction of heteroplasmy by pleiotropic disease locus 10398A>G

Mitochondrial (MT) dysfunction has been associated with several neurodegenerative diseases including Alzheimer’s disease (AD). While MT-copy number differences have been implicated in AD, the effect of MT heteroplasmy on AD has not been well characterized. Here, we analyzed over 1800 whole genome sequencing data from four AD cohorts in seven different tissue types to determine the extent of MT heteroplasmy present. While MT heteroplasmy was present throughout the entire MT genome for blood samples, we detected MT heteroplasmy only within the MT control region for brain samples. We observed that an MT variant 10398A>G (rs2853826) was significantly associated with overall MT heteroplasmy in brain tissue while also being linked with the largest number of distinct disease phenotypes of all annotated MT variants in MitoMap. Using gene-expression data from our brain samples, our modeling discovered several gene networks involved in mitochondrial respiratory chain and Complex I function associated with 10398A>G. The variant was also found to be an expression quantitative trait loci (eQTL) for the gene MT-ND3. We further characterized the effect of 10398A>G by phenotyping a population of lymphoblastoid cell-lines (LCLs) with and without the variant allele. Examination of RNA sequence data from these LCLs reveal that 10398A>G was an eQTL for MT-ND4. We also observed in LCLs that 10398A>G was significantly associated with overall MT heteroplasmy within the MT control region, confirming the initial findings observed in post-mortem brain tissue. These results provide novel evidence linking MT SNPs with MT heteroplasmy and open novel avenues for the investigation of pathomechanisms that are driven by this pleiotropic disease associated loci.


Scientific Reports
| (2023) 13:10405 | https://doi.org/10.1038/s41598-023-37541-y www.nature.com/scientificreports/ also evidence of significantly lower MT-DNA copy number in the temporal cortex 9 and dorsolateral prefrontal cortex 10 of AD patient brains compared to control subjects. Lastly, APOE ɛ4 is the common DNA variant that confers the highest increase in risk and age of onset of sporadic AD 11 and was associated with impaired MT structure and function from proteins measured within postmortem human brain tissues of AD patients 12 . As such, studying the role of MT function in AD pathogenesis is currently an active area of research 13,14 .
Here, we primarily focus on MT heteroplasmy as a potential mechanism affecting MT function and thus impacting on risk of disease. MT heteroplasmy refers to the presence of MT mutations that occur in only a fraction of mitochondrial DNA within a given sample 15 . We analyzed data from 1801 whole-genome sequenced (WGS) post-mortem tissue samples generated by the Accelerating Medicines Partnership in Alzheimer's Disease (AMP-AD) for the presence of MT heteroplasmy. After discovering a significant association between the mitochondrial SNP 10398A>G (rs2853826) and MT heteroplasmy, we characterized the gene networks and gene expression changes associated with this single-nucleotide polymorphism (SNP). 10398A>G corresponds to the Thr114Ala missense mutation of the MT-ND3 gene coded by the MT, and is a common variant found within most populations with a reported allele frequency of 41.8% from the gnomAD database v3.1.1 16 . The 10398A>G allele has been collectively associated with an expansive set of phenotypically diverse diseases including AD 17 , Parkinson's disease [18][19][20][21][22][23][24] , breast cancer [25][26][27][28][29][30][31][32][33][34] , gastric cancer, type 2 diabetes and several other human diseases and phenotypes [35][36][37][38][39] .
To explore the effect of 10398A>G on transcriptomic and MT variables, we performed experiments using lymphoblastoid cell lines (LCLs) from the Harvard Personal Genome Project (PGP) 40,41 . The PGP consists of many participants that have high coverage WGS data as well as self-reported phenotypic data that are publicly available for research 42,43 . We selected 60 participant LCLs that vary by the 10398 genotype (major allele A: n = 30, minor allele G: n = 30) and performed bulk RNA sequencing, as well as quantifying MT copy number and heteroplasmy, and report our findings herein.

Results
Significant heteroplasmy detected within the mitochondria control region from brain samples. We analyzed WGS data generated from 1801 post-mortem tissue samples collected as part of the Religious Order Study (ROS), Memory and Aging Project (MAP), Mount Sinai Brain Bank (MSBB) and Mayo Temporal Cortex (MAYO) studies. Across these four studies, WGS data was available from a total of seven different tissues (Table 1). We analyzed data from these samples to perform MT variant calling and to estimate MT heteroplasmy at each MT genomic site (MT heteroplasmy was classified as present if at least 5% of reads that were reliably mapped to either the major or minor allele were discordant with the remaining reads).
While MT heteroplasmy was robustly detected in many samples, there was much higher overall heteroplasmy within the blood compared to the brain samples (Fig. 1A). When the analysis was restricted to only brain samples, we detected significantly more heteroplasmy in the cortically derived samples than the cerebellum derived samples (Fig. 1B).
We observed that MT heteroplasmy detected within the blood derived samples demonstrated a distribution across the entire MT genome, whereas in the brain derived samples, heteroplasmy was entirely restricted to a set of several hundred bases corresponding to the MT control region (Fig. 2), a highly polymorphic 44 , non-coding region of the MT genome which contains the origins of both transcription and replication 45 . Mitochondrial heteroplasmy in the dorsolateral prefrontal cortex is not associated with Alzheimer's disease. We used a logistic regression approach to examine whether there are any systematic differences in mitochondrial heteroplasmy between samples from subjects with AD compared with aged con- Table 1. The full set of cohorts and tissue groups from which samples were used for analyzing whole-genome sequencing data for mitochondrial heteroplasmy. Samples were sequenced from the ROS (Religious Order Study), MAP (Memory and Aging Project), MSBB (Mount Sinai Brain Bank) and MAYO (Mayo Clinic) samples.  The distribution of detected heteroplasmic sites across the mitochondrial genome, within sample groups from the main tissue types represented within this study. Dot size area is proportional to the fraction of samples in which heteroplasmy was detected at that site. www.nature.com/scientificreports/ trols, while adjusting for Study, Sex, Age at death, Years of education, Post-Mortem Interval, and Ancestry. We restricted this analysis to the brain region with the largest subset of WGS samples, the dorsolateral prefrontal cortex (DLPFC) from the ROS/MAP studies (n = 451). Analyses were restricted to nine MT genomic sites with detectable heteroplasmy in at least five samples across this cohort. We did not identify any individual heteroplasmy sites with a significant (FDR < 0.05) association to AD status (Table S1). A complementary analysis performed on 'total mitochondrial heteroplasmy' estimated by summarizing each sample according to the number of mitochondrial sites at which heteroplasmy was detected also did not reveal an association with AD (P = 0.67).
10398A>G is significantly associated with mitochondrial heteroplasmy. Based on the analysis of mitochondrial heteroplasmy within brain samples, we performed a cis-heteroplasmy-QTL across the mitochondrial to determine if any SNPs were associated with heteroplasmy in the DLPFC samples from the ROS/MAP studies. While a number of SNPs were found to be significantly associated with heteroplasmy, the 10398A>G variant had the most abundant signal, being significantly associated with heteroplasmy at five separate loci (FDR < 0.05) ( Table 2). We further annotated each variant for its associated disease phenotypes using MITOMAP 46 and observed that 10398A>G was associated with the largest number of distinct phenotypes of any reported MT variant (n = 10). 10398A>G has been linked with AD 17 , Parkinson's disease [18][19][20][21][22][23][24] , breast cancer 25-34 , type 2 diabetes and many other diseases and conditions [35][36][37][38][39] .
The 10398A>G variant is significantly associated with MT-ND3 expression in human prefrontal cortex. Given the association of 10398A>G with mitochondrial heteroplasmy and diverse disease phenotypes, we analyzed its association with gene expression changes. By integrating RNA sequence data available on ROS/MAP cohort DLPFC samples, we performed a cis-eQTL analysis and found that 10398A>G was associated with MT-ND3 (Mitochondrially Encoded NADH Dehydrogenase 3) expression, where the minor G allele is associated with decreased expression (Fig. 3A). MT-ND3 is one of the subunits of the mitochondrial respiratory chain complex I that enables NADH dehydrogenase (ubiquinone) activity. We thus constructed a targeted gene regulatory network aimed at identifying genes that are downstream of MT-ND3, conditioned on the relationship with 10398A>G, using a causal inference testing approach 47 and identified a number of significantly associated downstream genes (Fig. 3B). We then performed a gene set enrichment analysis on the MT-ND3 subnetwork using Enrichr 48-50 focused on enrichments in Biological pathways and Gene Ontology components (see "Materials and methods" for gene set libraries) and observed several enrichments for NADH dehydrogenase activity and Mitochondrial respiratory chain Complex I components ( Table 3).

The 10398A>G variant is significantly associated with MT-ND4 expression in LCLs. After per-
forming RNA extraction and whole transcriptome RNA sequencing on all 60 LCL samples, we tested our data www.nature.com/scientificreports/ against previously discovered eQTLs in LCLs to determine if they were replicated within our samples. To do this, we integrated cis-eQTL data obtained from the GTEx portal for LCLs mapped in European-American subjects 51 . After filtering and pruning, we retained 764 SNPs that were also annotated as cis-eQTLs in GTEx and with available association statistics within our LCL data. We found an enrichment of GTEx cis-eQTL from our data in the same direction of effect ( Figure S1, Table S2). Of the 764 SNPs, 98 of them were nominally significant from our data using a threshold of P < 0.01, whereas one would expect only 7.64 to be significant under the null hypothesis (P = 7.4 × 10 -74 ). This result suggests that our LCL gene expression data is broadly concordant with the LCL data obtained from GTEx. We then analyzed whether 10398A>G is associated with expression levels of any MT encoded genes. Of the 37 MT genes we examined for analysis, only 28 demonstrated robust expression in our data. We observed that 10398A>G is significantly associated with lower expression of MT-ND4 (P = 5.96 × 10 -9 , FDR = 1.67 × 10 -7 ) (Fig. 4A, Table S3). MT-ND4 (NADH Dehydrogenase 4), which is a different subunit from MT-ND3, is also part of the mitochondrial respiratory chain complex I. The G allele is also significantly associated with lower expression of MT-ATP8, MT-ND2, MT-ND4L, MT-ATP6 and MT-CYB (FDR < 0.005) (Table S3). Notably, there was no significant difference in expression detected for the MT-ND3 gene (as we had observed in the ROS/MAP DLPFC samples) despite the G allele causing a Thr114Ala missense mutation (P = 0.853, FDR = 0.884) (Fig. 4B, Table S3).
Testing for association of 10398A>G with mitochondrial DNA copy number in the LCLs. Given that the effect of 10398A>G on MT gene expression within the LCL samples was to uniformly decrease MT gene expression (Table S3), we hypothesized that this may reflect an effect on reducing overall MT copy number, as has been described for 10398A>G in a previous disease context 52 . We thus tested whether 10398A>G had any effect on MT copy number in the LCLs. We obtained genomic and mitochondrial DNA for each of the 60 PGP LCLs and performed multiplex PCR of a mitochondrial fragment, a fragment on the X-chromosome, and a  Table 3. Selection of top Gene Ontology and Biological Pathways that are over-represented among the gene regulatory network constructed from neighbors downstream of MT-ND3, conditioned on MT:10398 A>G genotype. Enrichments were calculated using Enrichr. Gene sets with a false discovery rate (FDR) < 0.05 were classified as significantly over-represented among the MT-ND3 gene regulatory network.

Library
Gene www.nature.com/scientificreports/ fragment on chromosome 22 as a genomic control. The X-chromosome fragment serves as a positive control given that female samples should have significantly more X-chromosome DNA than male samples. Consistent with our expectation, the multiplex PCR approach demonstrated a significantly increased X-chromosome DNA copy number in our female samples compared to males, with clear separation between female and male samples (P = 1.28 × 10 -7 ) (Fig. 5A). We performed the equivalent analysis to determine if 10398A>G is associated with  www.nature.com/scientificreports/ altered MT copy number. The analysis did not detect any significant differences, suggesting that 10398A>G is not associated with altered MT copy number (P = 0.19) (Fig. 5B). In addition, we also performed MT copy number analysis using quantitative PCR (qPCR) on DNA from 52 LCL samples. Using a genomic DNA fragment on chromosome 17 as a baseline control, we did not observe any significant MT copy number difference with 10398A>G (P = 0.33) (Table S4).
10398A>G is significantly associated with increased mitochondrial heteroplasmy. Finally, we quantified MT DNA heteroplasmy within the DNA obtained from our 60 PGP LCL samples by next-generation sequencing of the MT control region, where we had observed all MT heteroplasmy within brain derived samples. We designed primers to sequence neighboring regions chrM:16043-16238 and chrM:16469-262 of the MT DNA and calculated a heteroplasmy score for each site (see "Materials and methods"). We used simple linear regression to look for association between age, sex, and MT copy number with individual heteroplasmy scores. There was no significant association between total heteroplasmy score and age (R 2 = − 0.01609, P = 0.79838), sex (R 2 = − 0.01567, P = 0.75984), or copy number mitochondrial Z-score (R 2 = − 1.5 × 10 -3 , P = 0.344). We then performed an association test of 10398A>G with the heteroplasmy scores and found that the minor G allele of 10398 was associated with significantly increased heteroplasmy at chrM:16469-262 (P = 0.011) (Fig. 6B). While there was a slight increase in heteroplasmy detected for chrM:16043-16238, the association statistic was not significant (P = 0.29) (Fig. 6A). Probing further into the individual sites, we found certain individual sites were contributing disproportionately to the overall increase in heteroplasmy (Fig. 7). We then performed an association test between all MT SNPs with overall heteroplasmy. We identified eight mitochondrial SNPs significantly associated (P < 0.05) with total heteroplasmy score ( Table 4). 10398A>G was significantly associated with higher heteroplasmy score (P = 0.021). Furthermore, we computed the squared R 2 correlation coefficient between each SNP identified as significantly associated with heteroplasmy score with 10398A>G and found that some of them were significantly correlated with 10398A>G (Table 4).

Discussion
We analyzed large cohorts of DNA sequencing data to determine the extent of detectable mitochondrial heteroplasmy in various tissue types for different donor samples. We discovered that the 10398A>G (rs2853826) allele is significantly associated with mitochondrial heteroplasmy in the control region of brain samples. Given the reported associations between 10398A>G and a multitude of human diseases, we explored the functional consequence of 10398A>G with data available to us and discovered that it is associated with MT-ND3 expression levels in brain samples. We then characterized lymphoblastoid cell lines (LCLs) from the Harvard Personal Genome Project to determine the effect of 10398A>G in LCLs and discovered that it is associated with MT-ND4 expression levels and the MT-ND3 expression levels were not significantly altered in the LCLs. This suggests that 10398A>G could be affecting mitochondrial gene expression in a tissue or cell-type specific manner. Finally, from the sequencing data obtained from the LCLs, we found that 10398A>G is also associated with mitochondrial heteroplasmy, suggesting that heteroplasmy may be a general consequence of 10398A>G within multiple tissue types and potentially a mechanism by which 10398A>G confers an effect upon diverse human diseases. www.nature.com/scientificreports/ Our initial motivation for exploring mitochondrial heteroplasmy was to determine its role in Alzheimer's disease pathology. Early evidence indicated that post-mortem AD brains were enriched for heteroplasmic mitochondrial control region mutations 53 . Furthermore, recent evidence suggests mitochondrial DNA mutations can lead to mitochondrial dysfunction and increased Aβ deposition 54 . We did not observe any significant difference in overall, or locus-specific MT heteroplasmy within the DLPFC between AD and aged controls. Our subsequent eQTL findings lead us to the mitochondrial variant 10398A>G, which we found to be a strong predictor of heteroplasmic mutations in the mitochondrial control region. In an early study, 10398A>G was shown to be associated with AD risk 17 . However, a later study of a cohort of patients from Tuscany reported no mitochondrial haplotype group association with AD 55 . A later study in a cohort of Japanese individuals suggests little to no contribution of 10398A>G to AD 56 . Recently, there was a report of other mitochondrial variants associated with AD in a 254 individuals Tunisian cohort 57 . While that study did not find 10398A>G to be associated with AD, they identified other variants such as 5656A>G and 13759A>G to be marginally associated with AD. These studies suggest that the effect of 10398A>G on AD, if true, will be moderate and may not be observed in small cohort studies of only a few hundred patients. Nevertheless, it remains to be determined if the underlying cause Besides AD, the role of mitochondrial heteroplasmy has also been studied for other neurological diseases. In autism spectrum disorders (ASD), researchers have discovered pathogenic heteroplasmic mitochondrial mutation being associated with ASD 58,59 . Perhaps the most common and well-studied mutation linked with mitochondrial heteroplasmy is 3243A>G, which has been associated with mitochondrial disorders and dysfunctions, including Mitochondrial Encephalopathy, Lactic Acidosis, Stroke-like episodes (MELAS), and Leigh syndrome 60 . As such, our finding of an inherited polymorphism 10398A>G being linked to mitochondrial heteroplasmy may inspire future research to investigate the role of mitochondrial heteroplasmy as a causal component of disease pathology.
Our study has several important limitations. In the current paradigm, we are unable to definitively establish the directionality of causal relationships between MT-ND3/MT-ND4 expression and mitochondrial heteroplasmy (both of which are presumably downstream in their association with inherited variant 10398A>G). Whether heteroplasmy is a consequence of altered MT-ND3/MT-ND4, induced directly by 10398A>G or even indicates a pleiotropic locus that is independently driving regulatory effects and heteroplasmy, awaits further experimentation to resolve. An additional limitation pertains to our focus on mitochondrially encoded transcripts, rather than the broader set of mitochondrial genes which are also encoded within the nuclear genome. This limitation was motivated by our usage of a cis-eQTL framework, which is limited to regulatory relationships occurring within 20 kilobases, though this may well overlook important regulatory relationships that occur between genomes.
In addition, we have not presented any functional mitochondrial data within this study, which may represent a fruitful direction for future studies. While we report an intriguing eQTL relationship, the actual fold-change difference, while strongly significant, is not large in magnitude. Given that 10398A>G is a common genetic variant, this eQTL relationship may indicate a physiological association that is non-pathogenic under most tissues and conditions, but which may become pathogenic under more extreme 'stress' conditions, such as altered pH, temperature or conditions of inflammation. This could be consistent with previous reports of 10398A>G associating with altered mitochondrial matrix pH and calcium dynamics 61 . It is also possible that the observed relationship between heteroplasmy and 10398A>G in the DLPFC are confounded with other disease comorbidities that themselves are linked with 10398A>G, which are present in the ROS/MAP cohort, though not characterized within the study and are thus not accounted for in our analyses. Despite this limitation, the replication of 10398A>G as a heteroplasmy QTL in our LCL experiments is supportive of this effect not being solely mediated by unobserved comorbidities. Additional avenues for further exploration could be to look at additional populations, larger sample sizes and additional tissues beyond DLFPC.
While we found that the 10398A>G was associated with decreased gene expression in MT-ND3/MT-ND4 and increased mitochondrial heteroplasmy, we should also be cognizant that the cause of this association could be due to other genetic variants linked with 10398A>G. 10398A>G is an established marker for the I, J and K European mitochondrial haplogroup and it is possible that other variants within these haplogroups account for the heteroplasmy association. The frequency of the G allele is also significantly higher in non-European (e.g. African and Asian) populations and future follow up studies from data generated from these populations may allow us to answer this question.
Given the association of mitochondrial heteroplasmy with such a wide range of human diseases, and the previously reported progression from mitochondrial heteroplasmy to mitochondrial dysregulation and disruption of the OXPHOS pathway 62 , an exploration of the downstream effects of heteroplasmy and the 10398A>G variant on metabolic pathways may be warranted. Further evaluation of associations between the 10398A>G variant with other AD-associated phenotypes may prove to be fruitful. 10398A>G has previously been linked with Human Papillomavirus (HPV) positivity in the context of cervical cancer, indicating a potential modulation of host-virus interactions 52 . Given prior evidence of the role of Herpes simplex virus 1 (HSV-1) in AD pathogenesis in APOE-ε4 carriers [63][64][65][66] , one avenue of experimentation might be to test the viability of viral infection (e.g. HSV-1) in cell lines from individuals with the A versus G allele at 10398. This may provide better context for the relationships between the 10398A>G variant, mitochondrial heteroplasmy and damage, and AD pathology. Table 4. List of mitochondrial SNPs that were associated with overall heteroplasmy. Position depicts the specific base-pair position in the mitochondria. Ref. and Alt. stands for reference and alternate allele, respectively. MAF stands for minor allele frequency. Effect size, standard error (SE) and P value are the respective statistics for associating these SNPs with overall heteroplasmy. R 2 w. 10398 shows Pearson's correlation coefficient values for each SNP with the SNP at position 10398 on the mitochondria. In summary, we have demonstrated the use of large scale whole-genome sequence data for analyzing mitochondrial heteroplasmy and performing cis-heteroplasmy-QTL analysis. Doing so led us to discover the association between 10398A>G and mitochondrial heteroplasmy. We then further characterize the effect of 10398A>G with the use of personalized LCL models. This highlights the utility of such resources as an effective way to perform in-vitro modeling for exploring genetic drivers of mitochondrial function in diseases, and suggests novel avenues for illuminating the role of heteroplasmy as a candidate mediating mechanism that links 10398A>G with diverse human diseases.

Materials and methods
Whole genome sequences from the ROS, MAP, MSBB and Mayo cohorts. Mitochondrial variables (germline variation, heteroplasmy and copy number) were estimated on tissue samples with available whole genome sequences (WGS) generated by the Accelerating Medicines Partnership in Alzheimer's Disease (AMP-AD). BAM files can be accessed at the AD Knowledge Portal (https:// adkno wledg eport al. org) for the MSBB (syn19987071), Mayo (syn19989379) and ROS/MAP (syn20068543) cohorts.

Mitochondrial heteroplasmy quantitative trait loci analysis. Mitochondrial (MT) heteroplasmy
quantitative trait loci (hetQTL) analysis was performed on whole genome sequences generated from dorsolateral prefrontal cortex (DLPFC) samples collected from 451 unique subjects within the Religious Orders Study (ROS) and Memory and Ageing Project (MAP). We applied a logistic regression approach, modelling the presence (or absence) of heteroplasmy at each MT site, as a function of each MT SNP genotype, while adjusting for AD status (NIA Reagan Score), Study (ROS or MAP), Sex, Age at death, Years of education, Post Mortem Interval, and Ancestry (estimated using the 10 largest population ancestry components precomputed from the WGS data). Within the ROS/MAP DLPFC samples, we identified 77 common MT variants (MAF ≥ 5%) and nine MT heteroplasmic sites that were detected in at least five subjects, which were carried forward into the hetQTL analysis. HetQTL associations were estimated using a generalized linear model using the R function "glm" and the family type "binomial". Associations between each MT variant and MT heteroplasmic site were adjusted using the Benjamini-Hochberg approach for controlling the false discovery rate (FDR). Associations with an FDR < 0.05 were classified as significant. Results were annotated with symbols from overlapping genes, and separately with symbols from genes within 20 kilobases of MT variants and heteroplasmy sites using annotations from the GRCh38 transcriptome assembly. www.nature.com/scientificreports/ Mitochondrial expression quantitative trait loci analysis. Mitochondrial (MT) cis expression quantitative trait loci (cisQTL) analysis was performed across the set of samples with whole genome sequences (originating from any tissue) and RNA-sequences generated from dorsolateral prefrontal cortex (DLPFC) samples, comprising 573 unique subjects within the Religious Orders Study (ROS) and Memory and Ageing Project (MAP). Within the ROS/MAP DLPFC samples, we identified 212 common MT variants (MAF ≥ 5%). Transcriptomic data was subset to the 37 MT encoded genes. Genes were retained in the analysis if expression was detected in at least five samples, resulting in 23 included genes. Normalized abundance values were offset by a count of 1 and log2 transformed. Cis-eQTL analysis was performed using the R package, MatrixEQTL. Gene expression was modelled using a linear additive approach, including terms for MT variant genotype and while adjusting for AD status (NIA Reagan Score), Study (ROS or MAP), Sex, Age at death, Years of education, Post-Mortem Interval, RNA Integrity Number (RIN), RNA-seq batch, and Ancestry (estimated using the 10 largest population ancestry components precomputed from the WGS data). Cis-eQTL relationships were defined as a maximum distance of 1 MB between a MT variant/gene pair, which was definitionally true for all examined MT pairs.
Construction of MT-ND3 gene regulatory network. We performed causal inference testing 47 , to build a causal gene regulatory network focused on MT-ND3 in post-mortem DLPFC brain tissue collected as part of the ROS/MAP studies. This approach requires paired gene expression and genotype data for a large number of samples to establish the direction of regulation between MT-ND3 and its correlated genes. Causal inference testing (CIT) has been well described previously 47 . Briefly, it offers a hypothesis test for whether a molecule (in this case, the expression of MT-ND3) is potentially mediating a causal association between a DNA locus (10398A>G), and some other quantitative trait (such as the expression of genes correlated with MT-ND3 and 10398A>G). Causal relationships can be inferred from a chain of mathematic conditions, requiring that for a given trio of loci (L), a potential causal mediator i.e., MT-ND3 (G) and a quantitative trait (T), the following conditions must be satisfied to establish that G is a causal mediator of the association between L and T: (a) L and G are associated (b) L and T are associated (c) L is associated with G, given T (d) L is independent of T, given G We used the R software package "cit" 47 , to perform the causal inference test, calculating a false discovery rate using 1000 test permutations. Trios with a Qvalue < 0.05 were classified as significant, and the associated T genes were considered downstream of MT-ND3.
Personal genome project lymphoblastoid cell lines and whole genome sequencing data. The lymphoblastoid cell lines (LCL) from the Personal Genome Project (PGP) cohort were obtained from the Coriell Institute for Medical Research (https:// www. corie ll. org/). The whole genome sequencing data (WGS) for PGP samples were obtained from the PGP website (https:// pgp. med. harva rd. edu). There were a total of 123 unique donors with both LCLs and WGS data available. We performed data mining of the 123 donors and determined that 30 donors carry the rs2853826 "G" allele of which 3 individuals also have the "GCT" allele (Table S5). Most of these individuals were self-reported as having European ancestry except for 4 donors (2 Chinese, 1 African-American and 1 Hispanic). We then selected an additional 30 donor samples carrying the rs2853826 "A" allele as control samples bringing the total number of donors analyzed to 60. We also performed sanger sequencing of the rs2853826 locus for all 60 samples to confirm their rs2853826 genotype (Fig. S2).

LCL cell culture, DNA and RNA extraction and RNA sequencing. Lymphoblastoid cell lines (LCL)
were cultured in RPMI medium supplemented with 10% Fetal Bovine Serum (FBS) and 1% penicillin-streptomycin. These cultured LCLs were maintained at 37 °C. DNA from the LCLs were extracted using the AccuPrep ® Genomic DNA Extraction Kit (Bioneer). RNA was extracted from the LCLs using the PureLink™ RNA Mini Kit (Thermofisher). The extracted RNA for whole transcriptome sequencing was sent to Psomagen for sequencing. Preparation of samples for RNA-Seq analysis was performed using the TruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA). Briefly, rRNA was depleted from total RNA using the Ribo-Zero rRNA Removal Kit (Human/Mouse/Rat) (Illumina, San Diego, CA) to enrich for coding RNA and long non-coding RNA, following the TruSeq Stranded Total RNA Sample Prep Guide, Part # 15031048 Rev. The Ribo-Zero libraries were sequenced on the Illumina NovaSeq 6000 System with 151 nucleotide paired end reads, according to the standard manufacturer's protocol (Illumina, San Diego, CA). The raw sequence reads were aligned to human genome hg38 (ensembl_GRCh38, GenBank Assembly ID GCA_000001405.15) with the star aligner (v2.5.2b) and gene level expression (read counts) were summarized by the "-quantMode GeneCounts" parameter in star. All sequences are depicted in the 5′ to 3′ orientation. The primer sequences were synthesized using Custom DNA Oligos Synthesis Services (Thermofisher). We used a Q5 High-Fidelity DNA Polymerase (NEB) and ran the PCR on a ProFlex Thermal Cycler (Thermofisher). The DNA libraries were then pooled and sent for whole-genome sequencing (Psomagen). We performed Illumina paired-end sequencing (read 150 bases) on the HiSeq X Ten system, with different index barcodes for each sample resulting in roughly one million paired-end reads per sample on average (Table S7).
Calculating heteroplasmy score. We performed alignment of the PCR-amplified sequencing reads for each sample to the hg19 assembly of the human genome using bwa 67 . We mapped the aligned sequences to a fasta file containing the primers and mitochondrial genome sequences using the mpileup function of samtools 68 . From the mpileup file, we constructed a statistic to measure the degree of heteroplasmy across our nine sites of interest and the surrounding bases using the following method.
For each of the 60 individuals in our cohort, we computed the fraction of non-reference alleles to the total number of reads at all PCR-amplified positions (each mtDNA site between bases 16043 to 16238 and bases 16469 to 262). We first identified any sites with measurable heteroplasmy (noted in the formula below with the indicator function) which are sites with allele fraction estimates between 0.05 and 0.95 (Table S8). Sites with non-reference allele fraction less than or equal to 0.05 or greater than or equal to 0.95 are assigned as stably inherited SNPs, and the heteroplasmy measure at these sites are set to zero. We then weighted each heteroplasmic site by the allele fraction and summed them up to generate heteroplasmy scores for each individual.
Het16043 i is the heteroplasmy score of individual i generated from sites between bases 16043 and 16238. Het16469 i is the heteroplasmy score of individual i generated from sites between bases 16469 and 262 and HetTotal i is the heteroplasmy score of individual i generated by taking the sum of Het16043 i and Het16469 i . Quality control and association analysis of MT SNPs with heteroplasmy. Plink 1.90b6.9 69 was used to perform quality control and association analysis. We ensured that no individuals had a genotyping rate less than 20%, discordant sex data, or were duplicated or shared significant identity-by-descent. Because of  www.nature.com/scientificreports/ limited sample size, SNPs with minor allele frequency less than or equal to 0.09 as well as those not in Hardy-Weinburg equilibrium (P < 1.0 × 10 -6 ) were removed. Of the 60 samples in our cohort of LCLs, we found that three individuals of non-European ancestry (2 Chinese, 1 African-American) were stratified in PCA (Fig. S3). We used Plink to implement a generalized linear regression model for the remaining 42 MT SNPs for the 60 LCL samples in our cohort, including the first two principal components from PCA as covariates, with HetTotal i and calculated the overall association statistics (Table S9).
Ethical approval. The use of the human cell lines for research in this work was determined to not be human subjects research by the IRB of UMass Chan Medical School (H00021419).

Data availability
Mitochondrial data derived from whole genome sequences are available via the AD Knowledge Portal with data identifier syn25927578 at https:// doi. org/ 10. 7303/ syn25 80853.