Diallel panel reveals a significant impact of low-frequency genetic variants on gene expression variation in yeast

Unraveling the genetic sources of gene expression variation is essential to better understand the origins of phenotypic diversity in natural populations. Genome-wide association studies have identified thousands of variants involved in gene expression variation; however, the variants detected explain only part of the heritability. In fact, variants such as low-frequency and structural variants (SVs) are poorly captured in association studies. To assess the impact of these variants on gene expression variation, we explored a half-diallel panel composed of 323 hybrids originating from pairwise crosses of 26 natural Saccharomyces cerevisiae isolates. Using short- and long-read sequencing strategies, we established an exhaustive catalog of single nucleotide polymorphisms (SNPs) and SVs for this panel. Combining this dataset with the transcriptomes of all hybrids, we comprehensively mapped SNPs and SVs associated with gene expression variation. While SVs impact gene expression variation, SNPs exhibit a higher effect size with an overrepresentation of low-frequency variants compared to common ones. These results reinforce the importance of dissecting the heritability of complex traits with a comprehensive catalog of genetic variants at the population level.

Thank you again for submitting your work to Molecular Systems Biology. We have now heard back from the three reviewers who agreed to evaluate your study. Overall, the reviewers acknowledge that the presented findings are potentially interesting. However, they do raise a series of substantial concerns regarding the conclusiveness and impact of the study. Notably, reviewer #3 is not convinced that the study presents a significant advance over existing knowledge in the field. During our cross-commenting process, in which the reviewers get the chance to make additional comments based on each other's reports, reviewers #1 and #2 mentioned that in their opinion the study is a potentially valuable contribution to the field, pending substantial revisions. Reviewer #3 mentioned that while they are still not convinced that the advance is major, they would see value in extending the study and providing further insights into the effect of the low-frequency variants (e.g. on whether these low-frequency variants have any particular features). On balance, given that the reviewers did have positive words about the potential relevance of the study, we have decided to offer you the chance to address the issues raised in a major revision.
Without repeating all the points listed below, some of the more fundamental issues are the following:
-The broader relevance of the main findings and conclusions for a general audience, beyond yeast, needs to be better supported. It is important to clarify the main contribution and significance of the study for quantitative genetics.
-Further insights into the observed effect of the low frequency variants would be required to enhance the impact of the study.
-All three reviewers raise several technical issues that need to be addressed in order to better support the main conclusions.
All issues raised by the reviewers need to be satisfactorily addressed. The reviewers make several constructive suggestions on how to address the issues raised. I have also included below the additional comments of the three reviewers from referee cross-commenting, as they provide helpful guidance on how to improve the study. We recognize that the requested revisions are substantial. As you may already know, our editorial policy allows in principle a single round of major revision, so it is essential to provide responses to the reviewers' comments that are as complete as possible. Please feel free to contact me in case you would like to discuss in further detail any of the issues raised or if you would like to share your revision plan with me. I would be happy to schedule a call.
On a more editorial level, we would ask you to address the following points:
-The keywords need to be reduced to 5.
-Please provide a .doc version of the manuscript text (including legends for the main figures) and individual production quality figure files for the main Figures (one file per figure).
-We have replaced Supplementary Information by the Expanded View (EV format). In this case, all additional figures can be included in a PDF called Appendix. Appendix figures should be labeled and called out as: "Appendix Figure S1, Appendix Figure S2... Appendix Table S1..." etc. Each legend should be below the corresponding Figure/Table in the Appendix. Please include a Table of Contents in the beginning of the Appendix. For detailed instructions regarding expanded view please refer to our Author Guidelines: .
-Tables S1-S9 and Datafile 1 should be provided as EV Datasets (either as .xls files or .zip folders). Please provide one file per EV Dataset. Please include the description of each EV Dataset in the dataset file itself, i.e. in a separate tab for .xls files or as a README.txt file in .zip folders.
-Please provide a "standfirst text" summarizing the study in one or two sentences (approximately 250 characters), three to four "bullet points" highlighting the main findings and a "synopsis image" (550px width and max 400px height, jpeg format) to highlight the paper on our homepage.
-All Materials and Methods need to be described in the main text. We would encourage you to use 'Structured Methods', our new Materials and Methods format. According to this format, the Materials and Methods section should include a Reagents and Tools Table (listing key reagents, experimental models, software and relevant equipment and including their sources and relevant identifiers) followed by a Methods and Protocols section in which we encourage the authors to describe their methods using a step-by-step protocol format with bullet points, to facilitate the adoption of the methodologies across labs. More information on how to adhere to this format as well as downloadable templates (.doc or .xls) for the Reagents and Tools Table can be found in our author guidelines: . An example of a Method paper with Structured Methods can be found here: .
-Please include a "Disclosure and Competing Interests Statement" in the main text.
-Please include a Data availability section describing how the data, code etc. have been made available. This section needs to be formatted according to the example below:
The datasets and computer code produced in this study are available in the following databases:
-Chip-Seq data: Gene Expression Omnibus GSE46748 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46748)
-Modeling computer scripts: GitHub (https://github.com/SysBioChalmers/GECKO/releases/tag/v1.
-When you resubmit your manuscript, please download our CHECKLIST (https://bit.ly/EMBOPressAuthorChecklist) and include the completed form in your submission. *Please note* that the Author Checklist will be published alongside the paper as part of the transparent process (https://www.embopress.org/page/journal/17444292/authorguide#transparentprocess).
If you feel you can satisfactorily deal with these points and those listed by the referees, you may wish to submit a revised version of your manuscript. Please attach a covering letter giving details of the way in which you have handled each of the points raised by the referees. A revised manuscript will be once again subject to review and you probably understand that we can give you no guarantee at this stage that the eventual outcome will be favorable.

Maria
Maria Polychronidou, PhD
Senior Editor
Molecular Systems Biology
------------------------------------------------------
We realize that it is difficult to revise to a specific deadline. In the interest of protecting the conceptual advance provided by the work, we recommend a revision within 3 months (9th Jan 2024). Please discuss the revision progress ahead of this time with the editor if you require more time to complete the revisions. Use the link below to submit your revision: https://msb.msubmit.net/cgi-bin/main.plex
IMPORTANT: When you send your revision, we will require the following items:
1. the manuscript text in LaTeX, RTF or MS Word format
2. a letter with a detailed description of the changes made in response to the referees. Please specify clearly the exact places in the text (pages and paragraphs) where each change has been made in response to each specific comment given
3. three to four 'bullet points' highlighting the main findings of your study
4. a short 'blurb' text summarizing in two sentences the study (max. 250 characters)
5. a 'thumbnail image' (550px width and max 400px height, Illustrator, PowerPoint or jpeg format), which can be used as 'visual title' for the synopsis section of your paper.
6. Please include an author contributions statement after the Acknowledgements section (see https://www.embopress.org/page/journal/17444292/authorguide)
7. Please complete the CHECKLIST available at (https://bit.ly/EMBOPressAuthorChecklist). Please note that the Author Checklist will be published alongside the paper as part of the transparent process (https://www.embopress.org/page/journal/17444292/authorguide#transparentprocess).
8. When assembling figures, please refer to our figure preparation guideline in order to ensure proper formatting and readability in print as well as on screen: https://bit.ly/EMBOPressFigurePreparationGuideline See also figure legend guidelines: https://www.embopress.org/page/journal/17444292/authorguide#figureformat
9. Please note that corresponding authors are required to supply an ORCID ID for their name upon submission of a revised manuscript (EMBO Press signed a joint statement to encourage ORCID adoption). (https://www.embopress.org/page/journal/17444292/authorguide#editorialprocess) Currently, our records indicate that the ORCID for your account is 0000-0002-6606-6884. Please click the link below to modify this ORCID: Link Not Available
10. At EMBO Press we ask authors to provide source data for the main manuscript figures. Our source data coordinator will contact you to discuss which figure panels we would need source data for and will also provide you with helpful tips on how to upload and organize the files.
The system will prompt you to fill in your funding and payment information. This will allow Wiley to send you a quote for the article processing charge (APC) in case of acceptance. This quote takes into account any reduction or fee waivers that you may be eligible for. Authors do not need to pay any fees before their manuscript is accepted and transferred to the publisher. EMBO Press participates in many Publish and Read agreements that allow authors to publish Open Access with reduced/no publication charges. Check your eligibility: https://authorservices.wiley.com/author-resources/Journal-Authors/openaccess/affiliation-policies-payments/index.html
As a matter of course, please make sure that you have correctly followed the instructions for authors as given on the submission website.
*** PLEASE NOTE *** As part of the EMBO Press transparent editorial process initiative (see our Editorial at https://dx.doi.org/10.1038/msb.2010.72), Molecular Systems Biology publishes online a Review Process File with each accepted manuscript. This file will be published in conjunction with your paper and will include the anonymous referee reports, your point-by-point response and all pertinent correspondence relating to the manuscript. If you do NOT want this File to be published, please inform the editorial office at msb@embo.org within 14 days upon receipt of the present letter.
Reviewer #1:
This manuscript reports on a large diallel eQTL experiment in yeast in conjunction with long-read sequencing to assess the relative impacts of structural vs SNP variation on the expression variation. The inclusion of long-read sequencing is key to this work. While the results are intriguing, there is an issue of missed genomic variation (cytoplasmic) as well as issues surrounding the mapping of eQTL for accessory genes that need to be clarified. I understand that the crossing design was non-reciprocal, which limits the ability to investigate any cyto-nuclear interactions. However, is there no information on the mitochondrial genetic diversity in this panel that would allow an investigation of how the cytoplasmic genomic variation may be influencing the results? There are only at most 26 cytoplasmic haplotypes and it would be possible to test how these 26 haplotypes are influencing the nuclear expression in a manner similar to the GWAS analysis. To fully understand what the nuclear variation means for the expression of these genes, it seems necessary to provide some indication of how the cytoplasmic variation may be shaping the same expression patterns. For example, what is simply the ratio of nuclear to cytoplasmic heritability on expression patterns?
In Figure 2C, there are some distant large effect eQTL. In plants, it is becoming clear that these are actually genes that have moved laterally in the genome such that they are no longer in the reference position. And this ends up leading to a misidentification as distant because the gene is now in the "distant" position. With the long-read sequencing was there any evidence of lateral gene movement that could explain these large effect distant eQTL? And it could lead to the report of accessory genes having larger effect distant eQTL.
I'm somewhat confused by the GWAS of the accessory genes. For 422 of these 708 accessory genes it was stated that they were variably present but it isn't clear what accessory means for the other 286 genes.
From my understanding for accessory genes, there should be no expression in some of the genotypes as they are not present. Yet a majority of these do not find an eQTL? Is this because if a gene had no expression in a genotype, that the given genotype was discarded from the GWA (both SNP and SV)? From my understanding of the pangenome terminology, calling a gene an accessory indicates that it has presence/absence variation. This would indicate that in genotypes with absence there should be zero expression and lead to a large bimodal distribution for this gene's transcript abundance that should identify a large local (possibly distant) eQTL. Yet this does not seem to have occurred from the reported analysis. This suggests a false-negative error rate where eQTL are being missed for these genes. One option is that the expression data was normalized to remove the zero expression genotypes. In other fungi like Botrytis cinerea this issue was due to unmapped SVs but the long-read genome sequence should eliminate this possibility. Some explanation for this lack of local cis-eQTL for genes that by default should have a local large-effect cis-eQTL should be provided.
In the SV analysis, variants present in only one of the 26 parents were excluded due to a concern about false positive signals. Was a similar methodology used for the SNPs? I could not quite tell from the methods or the results descriptions. If the goal is to compare SVs to SNPs, then it would seem that the same filtering approach should be used. Especially as the false detection issue should be similar.

Reviewer #2:
Tsouris et al investigated whether rare SNPs and rare SVs contribute to transcriptomic variation in populations. They did so by analyzing all-by-all hybrids from 26 parental isolates which were sequenced by several technologies, including long-read sequencing. They exploit the fact that shared ancestry is naturally broken up by recombination over evolutionary time to then perform variance partitioning and, more importantly, GWAS for eQTL detection. They find that distant eQTLs have lower effect size than local eQTLs, that rare eQTLs have lower effect size than common variants, despite being more likely detected by their eQTL-detection approach, and finally that SVs have lower effect size than SNPs. Overall the paper investigates important questions that have been posed for decades, and resolves some of these. There are some expected results, but some which I was very surprised about. In my opinion, the fact that low-frequency variants have much lower effect on transcription should be discussed more heavily in light of the fact that low-frequency variants typically have more effect on growth. The authors do recognize this but make no further effort to explore this. This is very unexpected to me (see point 5). I have a few thoughts:
1) The authors filter singletons from the SNP and SV data. This makes sense for various reasons, as all singletons would be linked together. But throughout the reading I was wondering whether the authors can estimate how much heritability the singletons contribute to. It seems perhaps the authors could do so by considering the singletons as a single locus and partitioning the variance accordingly.
2) I'm unable to find what the authors claim are SVs to the detail that I think is required. Is there a minimal size for the SV? Would an insertion of 1 nucleotide be an SV? I think this is important to specify since it allows us to interpret better why SVs have low effect size.
3) The authors should segregate the SVs from Ty-related elements with others and show the effect size of these two categories separately. I don't expect a Ty element to have much effect on anything unless it hops inside a gene or a promoter (and I believe the transposase usually avoids these regions), but I would expect a gene deletion to have massive effect on that gene's expression. This isn't something that seems to be captured here. This is somewhat interesting because SVs have been shown to be very important in altering phenotypes of yeast strains (from pure missing pathways like sugar assimilation, or cation removal by Ena1). In those cases, the effect size was massive, so I'm curious how the authors can reconcile this (maybe an artifact of growing in rich media?).
4) Is there a distant/local relevant analysis with SVs that is possible? I wasn't able to find it here, but again it would be the first thing I would verify if some gene duplicated.
5) The fact that rare variants are enriched in 'causative phenotypic changes' is not surprising given the previous QTL papers on rare variants. However, what is surprising to me is the fact that their effect size is generally very low. This goes against the finding that these should have larger effect on growth and it's also not what we would expect purely from a statistical perspective since detectability is usually related to the effect (and to some extent to frequency of the variant in the pool). This led me to ponder what exactly was being plotted in 3D. Are the same variants plotted several times? I can see that there are only 104 low-frequency eQTL so I couldn't understand the n = 1205 as that is the number of low-frequency variants in the whole pool. Are rare variants a random sampling of all the SNPs? Are they enriched in non-synonymous, synonymous, non-coding, specific gene function etc? I'm sure this is discussed in cited papers but a small mention of them here might re-contextualize some of the results discussed here.
6) The data description of '2,002,550 transcriptomic measurements' is very vague as it can relate to the number of transcriptomes rather than the number of genes with at least 1 read in all samples. I would recommend rephrasing.
7) There is a typo in Figure 3 (Frequeny instead of Frequency).
8) The color scheme for Figure 4a/b can be improved. The use of gradients for distinct categories is unusual and unfortunately it is impossible for me to tell which green is which part of the pie.

Reviewer #3:
Tsouris et al sequence the genomes of a panel of yeast hybrids and compare them to previously published transcriptome data to explore the genetic basis of variations in mRNA abundance. While the design is classical, the approach employed by the authors allows them to address important questions with fewer confounding effects or better power than some other studies. In particular, the experimental design used here allows the authors to test the contribution of both rare(r) SNPs and structural variants to variation in mRNA abundance. Doing so, they help explore some of the causes of the missing heritability, a well-known problem in the quantitative genetics field. The main finding of the authors, that both rarer SNPs and structural variants contribute to variation in mRNA abundance but with different magnitudes, is not entirely novel. Indeed, the significant contribution of rarer SNPs is quite well-established and has been the leading theme of several publications, including in yeast. The contribution of structural variation to traits is perhaps less well quantified in the published literature, but the fact that the authors detect relatively few associations of this type means that the generality and robustness of conclusions are somewhat questionable also in the current submission. The measured mRNA abundances correspond to one snapshot in time (optical density ~0.3) for a single environment, and it is quite possible that other time points or other environments would have resulted in different conclusions. There are very few efforts by the authors to extract a deeper understanding of the biology that underlies the observed patterns, and the study does not go beyond the quantification of statistical associations. For example, we do not learn much about what properties of low-frequency variants make them disproportionately likely to contribute to mRNA abundances. The authors do not trace down candidate causative SNPs nor do they validate some of these, or the contribution of candidate structural variants, which would typically be done in influential yeast GWAS papers. This could have revealed interesting biology and would have inspired confidence that detected associations are real. In fact, as the mRNA abundance data was taken from a recent publication by the authors, the main new work in this paper is the long-read sequencing and the GWAS. While competently performed here, long-read sequencing of many yeast genomes has been done since Yue 2017 and is now quite standard. The technical paragraph on the characteristics of structural variants in yeast therefore adds little new, and is not put into the perspective of what has been reported in previous studies. It would thus seem better suited to be in an extended Methods section. The Discussion has a helpful section on study limitations, but otherwise reiterates results and adds few perspectives or thoughts on why or how yeast, and other organisms, have evolved to produce the observed patterns of the genotype-phenotype maps. Overall, I see the submission as a mostly incremental advancement of the quantitative genetics field. I find it hard to justify its publication in MSB.
Minor comments: The authors harvest cells at around OD 0.3, which is interpreted as cells being in log-phase growth. But this really depends on the starting OD. Moreover, given the microcultivation format used, it's not evident, in absence of data, that a log-phase will occur at all. In this cultivation scale cell populations often start experiencing nutrient restrictions before all cells have exited the lag-phase. Probably, many cell populations are at or close to their peak growth rate at OD = 0.3, but the physiological state of cell populations at the time of harvest will probably differ a bit. These differences in physiological state may underlie some part of the associations observed, e.g. in terms of deviations of accessory genes. This is not to say that the associations are irrelevant, but as these differences in physiological state are likely to be temporary, it does cast some doubt on the extent to which conclusions can be generalized across the life cycle of yeast, across environments and across species. The authors do exclude singletons from their analysis as these have a too-pronounced confounding population structure. But it wasn't clear to me to what extent they also account for the population structure of other rare variants with similar presence in a few strains.
--------------
Additional comments from referee cross-commenting
Reviewer #1:
I would agree that the significance is somewhat hard to derive from the way that the manuscript is written. This is partly because it is almost exclusively about yeast with little discussion or introductory material laying out the broader picture across a wider array of organisms. This makes it a bit of effort for a generalist/non-yeast reader to have to work to place it in a broader context. This could be at least partly fixed by reworking to focus on a general quantitative genetics audience.
I think the conundrum of large effect rare variants controlling low amounts of variation is simply an issue of estimating r2 in a population. The rare variant aspect will simply relegate the total variance to a lower realm. I am less concerned about not tracking down individual causative genes and talking about them. At some level that would decrease the general biologists' interest as the specific gene would be unlikely to be broadly interesting across organisms. The phenomenology is where the general interest would lie.
I agree that there is some significant confusion about the SVs to expression and effect size.
Reviewer #2:
Similar to Reviewer #1, I do not find a need for tracking down the causative genes or understanding the specifics of yeast biology for these quantitative genetics questions and I also do not think it is a fatal flaw to not be able to generalize to the whole cell cycle of an organism.
As for the conundrum that I have, I see the point that r2 for rare variants is low if it's variance explained of the total population variation. However, I did not interpret the authors' analysis to be r2. The authors specifically talk about effect size, which I interpret to be the actual difference in expression level. I'm pasting the relevant lines in their manuscript: ("low-frequency variants have a lower effect size compared to the common variants"), which directly contradicts another statement ("suggesting that low-frequency variants have a large impact on transcript abundance variation"). Perhaps this is something the authors can simply address by making their interpretation clearer. I do not like the analysis as presented. The fact that rare variants have lower population r2 is not really surprising.
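The distinction drawn above between effect size and variance explained (r2) can be made concrete with a minimal sketch. It assumes a single additive biallelic variant with genotype frequencies in Hardy-Weinberg proportions, which a diallel hybrid panel need not satisfy. The function name and all numbers are hypothetical and are not taken from the manuscript. Under that assumption, a variant with allele frequency p and per-allele effect beta contributes a variance of 2p(1-p)beta^2, so a rare variant can have the larger effect size and still explain less variance.

```python
def variance_explained(freq: float, beta: float) -> float:
    """Variance contributed by one additive biallelic variant.

    Assumes Hardy-Weinberg genotype proportions (an illustrative
    assumption, not the manuscript's model): Var = 2 * p * (1 - p) * beta^2.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * freq * (1.0 - freq) * beta ** 2


# Hypothetical values chosen only for illustration.
rare = variance_explained(freq=0.04, beta=2.0)    # large effect, rare allele
common = variance_explained(freq=0.40, beta=1.0)  # smaller effect, common allele

print(f"rare variant:   beta=2.0, variance contributed={rare:.4f}")
print(f"common variant: beta=1.0, variance contributed={common:.4f}")
```

In this toy case the rare variant has twice the per-allele effect of the common one yet contributes less variance, which is why it matters whether "effect size" in the manuscript refers to the per-allele difference in expression or to variance explained.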
I do agree with Reviewer #3 that the main novelty stems from the long-read sequence, but the SV analysis could be improved as I mentioned in the original review.
Finally, I would also agree that the significance of the work is hard to derive as written. However, I do believe that the broader quantitative genetics questions are important. Whether the advance is 'sufficient' is not really something I can comment on. The insight is perhaps limited, but I do not believe that the experiments have been performed before. While I could have guessed some of these results, I could not know before I read this and what I did not know was not yeast-specific.
Reviewer #3:
I agree with reviewers 1 and 2 on the importance of the research question, but I remain cautious about the novelty of the submission. Four years ago, in eLife, both Bloom et al and Fournier et al reported a disproportionate contribution of low-frequency variants to yeast fitness traits across a range of environments. This study uses a different experimental design and looks at mRNA abundances but does so in a single environment. It reports essentially the same conclusion.
There are also several publications reporting a, perhaps surprisingly, limited contribution of SVs to trait variation. Their lack of comprehensiveness when cataloguing SVs leaves some room for false negatives, and this, to me, is the most important motivation for the current study. But I am not yet convinced that the authors make a major advance here, as their final, boiled-down list of SVs seems to be quite far from comprehensive and their conclusions are ultimately based on a handful of associations. The SV suggestions of reviewer 2 are prudent and may go some way to address this.
Beyond confirming previous reports that low-frequency variants contribute much and SVs much less to variation in mRNA abundance, I am not sure that the authors tell us much new. I would have expected some efforts on the side of the authors to shed further light on their primary observations. Are, e.g., low-frequency variants particular in any way, in terms of what genetic features/functions they affect, how they affect them or where in the genome the affected features are situated? Asking for confirmation of causality is perhaps taking it too far, but at the very least I would expect a revised paper to do further statistical analysis along these lines, and to tell us a bit more about how the authors think around the why's, when's and how's of what seems to be important properties of the genotype-phenotype map.
------------

Reviewer #1:
This manuscript reports on a large diallele eQTL experiment in Yeast in conjunction with long-read sequencing to assess the relative impacts of structural vs SNP variation on the expression variation.The inclusion of long-read sequencing is key to this work.While the results are intriguing, there is an issue of missed genomic variation (cytoplasmic) as well as issues surrounding the mapping of eQTL for accessory genes that need to be clarified.I understand that the crossing design was non-reciprocal which limits the ability to investigate any cyto-nuclear interactions.However, is there no information on the mitochondrial genetic diversity in this panel that would allow an investigation of how the cytoplasmic genomic variation may be influencing the results.There are only at most 26 cytoplasmic haplotype and it would be possible to test how these 26 haplotypes are influencing the nuclear expression in manner similar to the GWAS analysis.To fully understand what the nuclear variation means on the expression on these genes, it seems necessary to provide some indication of how the cytoplasmic variation may be shaping the same expression patterns.For example, what is simply the ratio of nuclear to cytoplasmic heritability on expression patterns?
[R] The reviewer raised an important question about how cytoplasmic genome variation could impact gene expression.Unfortunately, our experimental design does not allow us to answer this for several reasons.First, unlike the nuclear genome, the mito genome in a hybrid does not correspond to the combined genotypes of the parents due to limited heteroplasmy and mitochondrial recombination in yeast (PMC4196626).As a result, the mito genotype cannot be easily inferred for our hybrid panel.Even using the RNAseq reads, the potential recombination breaks cannot be reliably detected as all the intergenic regions are not transcribed.Second, our diallel panel is consistent with batch crossed hybrids, where each individual is likely to have a different mitotype.For these reasons, we are unable to infer the mitochondrial genotypes for the hybrids and therefore cannot reliably estimate the contribution of mitochondrial genome variation to gene expression variation.
In Figure 2C, there are some distant large effect eQTL.In plants, it is becoming clear that these are actually genes that have moved laterally in the genome such that they are no longer in the reference position.And this ends up leading to a mis-identification of distant because the gene is now in the "distant" position.With the long-read sequencing was there any evidence of lateral gene movement that could explain these large effect distant eQTL?And it could lead to the report of accessory genes having larger effect distant eQTL.
[R] To check for this kind of potential lateral gene movements, we checked all trans eQTL hotspots (Figure 2B) genes and their localization in the long-read sequencing assemblies.All ORFs that overlapped with any eQTL hotspots (>10 associations for a given SNP) are located at the expected chromosomes and positions in the assemblies.We also checked the SNPs associated with top trans-eQTL with large effects (Figure 2C).In this case, because it's hard to localize single SNPs in the long-read assemblies, we examined whether there is any SVs that overlapped with these SNPs.We did not find any SVs at the same position as these associated SNPs.Therefore, the distant eQTL we observed here are unlikely to be due to lateral gene movements.I'm somewhat confused by the GWAS of the accessory genes.For 422 of these 708 accessory genes it was stated that they were variably present but it isn't clear what accessory means for the other 286 genes.
[R] We are sorry about the confusion. The 708 accessory genes are the total number of accessory genes that are present in this panel according to previous annotations across the 1,011 natural isolates (PMC6784862). However, 286 of these 708 genes are present in all parental strains and throughout the hybrid panel. This mostly concerns the "Ancestral" category, where the genes are absent in only a relatively small number of isolates in the 1,011 population. We redefined and clarified this in the revised manuscript. We now consider only the 422 genes as accessory in this hybrid panel. Changes are found on page 6, lines 23-33.
18th Dec 2023 1st Authors' Response to Reviewers
From my understanding for accessory genes, there should be no expression in some of the genotypes as they are not present. Yet a majority of these do not find an eQTL? Is this because if a gene had no expression in a genotype, that the given genotype was discarded from the GWA (both SNP and SV)? From my understanding of the pangenome terminology, calling a gene an accessory indicates that it has presence/absence variation. This would indicate that in genotypes with absence there should be zero expression and lead to a large bimodal distribution for this gene's transcript abundance that should identify a large local (possibly distant) eQTL. Yet this does not seem to have occurred from the reported analysis. This suggests a false-negative error rate where eQTL are being missed for these genes. One option is that the expression data was normalized to remove the zero expression genotypes. In other fungi like Botrytis cinerea this issue was due to unmapped SVs but the long-read genome sequence should eliminate this possibility. Some explanation for this lack of local cis-eQTL for genes that by default should have a local large-effect cis-eQTL should be provided.
[R] For the 422 accessory genes that are relevant to this hybrid panel, the majority (307/422) correspond to introgressions from a closely related Saccharomyces species, namely S. paradoxus. In these cases, both copies of the gene are present, albeit in a heterozygous configuration where one copy is from S. cerevisiae and the other copy is from S. paradoxus. Based on our other analysis of the same panel (https://doi.org/10.1016/j.xgen.2023.100459) and transcriptomics data across the natural population of 1,011 isolates (https://doi.org/10.1101/2023.05.17.541122), most of the introgressed genes have integrated into the S. cerevisiae regulatory circuits and are not particularly enriched for cis-regulatory variation.
The remaining accessory genes mostly correspond to the "Unknown" category, where the origin of the ORFs and/or their locations are not clear. There is only a small number of genes for which we are certain that they show presence/absence variation. These correspond to the horizontal gene transfer (HGT) genes (8 genes), which are present only in certain parents and are therefore either completely absent in some hybrids or present as a single copy. For this category, eQTL were mapped for 5 out of 8 HGT genes.
Overall, for the accessory genes, we found eQTL for 121/307 "Introgression" genes, 29/99 "Unknown" genes and 5/8 HGT genes. Globally, we tend to find proportionally more genes with any eQTL among the accessory genes than among the core genes (1552/5770). Genes with presence/absence variation have the highest fraction with any detected eQTL. The effect sizes for genes with presence/absence variation are also significantly higher than for other categories:
The lack of local cis-eQTL is due, firstly, to an overall low number of genes with presence/absence variation across the dataset. Secondly, genes in the HGT and Unknown categories were originally annotated based on short-read sequencing data and therefore do not have a chromosomal location assigned. These eQTL are therefore not classified as either local or distant.
We rewrote the text for the accessory gene analyses in the revised ms to clarify these points. Changes are found on page 6, lines 19-21.
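The proportions behind the statement that accessory genes carry eQTL more often than core genes can be recomputed from the counts quoted in this response. The short Python sketch below does only that arithmetic; the dictionary layout and the function name `eqtl_fraction` are illustrative choices, not part of the authors' pipeline.

```python
# Counts quoted in the response, as (genes with at least one eQTL, genes in category).
counts = {
    "Introgression": (121, 307),
    "Unknown": (29, 99),
    "HGT": (5, 8),
    "Core": (1552, 5770),
}


def eqtl_fraction(with_eqtl, total):
    """Fraction of genes in a category with at least one detected eQTL."""
    return with_eqtl / total


fractions = {name: eqtl_fraction(*pair) for name, pair in counts.items()}

# Print categories from the highest to the lowest fraction.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {frac:.3f}")
```

With these counts the HGT genes, the only category with certain presence/absence variation, come out highest, and the core genes lowest, which matches the ordering described in the response.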
In the SV analysis, variants present in only one of the 26 parents were excluded due to a concern about false positive signals. Was a similar methodology used for the SNPs? I could not quite tell from the methods or the results descriptions. If the goal is to compare SVs to SNPs, then it would seem that the same filtering approach should be used. Especially as the false detection issue should be similar.
[R] Yes, the same filtering was applied to both SNPs and SVs. Descriptions of the filtering can be found in the Methods section.
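As an illustration of the filter discussed here (removing variants carried by only one of the 26 parents), a minimal Python sketch follows. It assumes a hypothetical presence/absence encoding with one row per variant and one column per parent; the function name `drop_singletons` and the toy data are not from the manuscript, and the authors' actual implementation may differ (for example in how variants carried by all but one parent are handled).

```python
def drop_singletons(genotypes):
    """Keep variants carried by at least two parents.

    genotypes: list of rows, one row per variant and one column per parent;
    an entry is 1 if that parent carries the non-reference allele, else 0.
    Variants carried by a single parent (singletons) are dropped; in this
    sketch, variants carried by no parent are dropped as well.
    """
    kept = []
    for row in genotypes:
        carriers = sum(1 for g in row if g > 0)
        if carriers >= 2:
            kept.append(row)
    return kept


# Toy example with 5 parents instead of 26.
toy = [
    [1, 0, 0, 0, 0],  # singleton: removed
    [1, 1, 0, 0, 0],  # shared by two parents: kept
    [0, 0, 0, 0, 0],  # absent everywhere: removed
    [1, 1, 1, 1, 1],  # present in all parents: kept
]
filtered = drop_singletons(toy)
```

Applying the same function to the SNP matrix and to the SV matrix is what makes the two variant classes comparable, which is the point raised by the reviewer.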

Reviewer #2:
Tsouris et al investigated whether rare SNPs and rare SVs contribute to transcriptomic variation in populations. They did so by analyzing all-by-all hybrids from 26 parental isolates which were sequenced by several technologies, including long-read sequencing. They exploit the fact that shared ancestry is naturally broken up by recombination over evolutionary time to then perform variance partitioning and, more importantly, GWAS for eQTL detection. They find that distant eQTLs have lower effect sizes than local eQTLs, that rare eQTLs have lower effect sizes than common variants, despite being more likely to be detected by their eQTL-detection approach, and finally that SVs have lower effect sizes than SNPs. Overall the paper investigates important questions that have been posed for decades, and resolves some of these. There are some expected results, but some which I was very surprised about. In my opinion, the fact that low-frequency variants have much lower effect on transcription should be discussed more heavily in light of the fact that low-frequency variants typically have more effect on growth. The authors do recognize this but make no further effort to explore this. This is very unexpected to me (see point 5).

I have a few thoughts:
1) The authors filter singletons from the SNP and SV data. This makes sense for various reasons, as all singletons would be linked together. But throughout the reading I was wondering whether the authors can estimate how much heritability the singletons contribute to. It seems perhaps the authors could do so by considering the singletons as a single locus and partitioning the variance accordingly.
[R] We reran the hglm model, dividing the variant matrix into singletons, all SNPs without singletons, and SVs without singletons. The results are presented here:
On average, SNPs without singletons account for 0.202 of the genome-wide heritability, compared to 0.062 for all singletons and 0.021 for SVs without singletons.
2) I'm unable to find what the authors define as SVs at the level of detail that I think is required. Is there a minimal size for the SV? Would an insertion of 1 nucleotide be an SV? I think this is important to specify since it allows us to interpret better why SVs have low effect size.
[R] SVs are defined as variants that impact at least 50 bp. We added more descriptions in the revised text (page 8, lines 23-24).
3) The authors should segregate the SVs from Ty-related elements with others and show the effect size of these two categories separately. I don't expect a Ty element to have much effect on anything unless it hops inside a gene or a promoter (and I believe the transposase usually avoids these regions), but I would expect a gene deletion to have massive effect on that gene's expression. This isn't something that seems to be captured here.
[R] We segregated the SV-eQTL according to their types, i.e. deletion (DEL), insertion (INS), contraction (CONTR) and duplication (DUP), and compared the effect sizes of Ty-related and non-Ty-related variants. The results are presented here:
Despite the overall low number of SV-eQTL, we do observe that deletions tend to have larger effect sizes than other types of variants. Moreover, Ty-related variants tend to have smaller effect sizes than non-Ty-related variants for deletions and insertions. The same comparison cannot be performed for contractions and duplications due to the low number of associated variants in these categories.
We included these results in the revised ms (page 9, lines 28-32 and Figure S4D).
This is somewhat interesting because SVs have been shown to be very important in altering phenotypes of yeast strains (from pure missing pathways like sugar assimilation, or cation removal by Ena1). In those cases, the effect size was massive, so I'm curious how the authors can reconcile this (maybe an artifact of growing in rich media?).
[R] As the reviewer pointed out, there are many known examples of SVs with large effect sizes in yeast, including the MAL genes for maltose assimilation, SUL1 for sulfate transport, CUP1 for copper resistance, and the ENA genes for sodium stress. However, all these cases are specific to a type of stress or growth condition, and those genes are simply not expressed in the rich media in which our experiment is performed. We believe the use of rich media is indeed why we do not observe such large effect size SV-eQTL.
4) Is there a distant/local relevant analysis with SVs that is possible? I wasn't able to find it here, but again it would be the first thing I would verify if some gene duplicated.
[R] There is only one local eQTL among the 20 SV-eQTL detected, which is an insertion (Ty-related) and has the largest effect size (0.039). Overall, duplication events are relatively rare across all SV types and many of these do not overlap with an entire ORF. Moreover, due to the nature of the diallel hybrids, most of these variants are present in a heterozygous state, so the effects could be further buffered. A copy number variant (CNV) type of analysis based on short-read sequencing of the hybrid panel might be more appropriate to see the effect of deletion or duplication of genes on gene expression.
5) The fact that rare variants are enriched in 'causative phenotypic changes' is not surprising given the previous QTL papers on rare variants. However, what is surprising to me is the fact that their effect size is generally very low. This goes against the finding that these should have larger effect on growth and it's also not what we would expect purely from a statistical perspective since detectability is usually related to the effect (and to some extent to frequency of the variant in the pool).
[R] The reviewer raised an important point. This is a complicated question and we think there are currently not enough data to directly draw a link between expression traits and growth traits. One of our hypotheses is that previous QTL analyses of growth traits focused on a diverse set of growth conditions, whereas the expression traits are measured in rich media with no stress. However, more data are needed, for example RNAseq in different stress conditions, to really explain this apparent difference between growth traits and expression traits.
This led me to ponder what exactly was being plotted in 3D. Are the same variants plotted several times? I can see that there are only 104 low-frequency eQTL so I couldn't understand the n = 1205 as that is the number of low-frequency variants in the whole pool.
[R] This was indeed a mistake in the plot: the eQTL were accidentally plotted multiple times. The corrected numbers are 2534 eQTL associated with common variants and 504 associated with low-frequency variants. The 104 low-frequency eQTL referred to the number of unique loci associated, not to the total number of eQTL. We clarified all these numbers and corrected the figure in the revised ms (page 7, lines 13-16).
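The distinction drawn in this response, between the total number of eQTL (variant-gene associations) and the number of unique associated loci, can be made concrete with a small Python sketch. The records below are invented for illustration, and the function name `summarize` is hypothetical; a single variant associated with several genes contributes several eQTL but only one locus.

```python
from collections import Counter

# Hypothetical association records: (variant_id, gene_id, frequency_class).
associations = [
    ("snp_A", "gene_1", "low"),
    ("snp_A", "gene_2", "low"),
    ("snp_A", "gene_3", "low"),
    ("snp_B", "gene_1", "low"),
    ("snp_C", "gene_4", "common"),
    ("snp_C", "gene_5", "common"),
]


def summarize(records):
    """Return {frequency_class: (total eQTL, unique associated loci)}."""
    total = Counter()
    loci = {}
    for variant, _gene, freq in records:
        total[freq] += 1                          # every association is one eQTL
        loci.setdefault(freq, set()).add(variant)  # each variant is counted once
    return {freq: (total[freq], len(loci[freq])) for freq in total}


summary = summarize(associations)
```

In the toy data the low-frequency class has 4 eQTL but only 2 unique loci, the same kind of gap as between the 504 eQTL and 104 unique loci quoted in the response.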
Are rare-variants a random sampling of all the SNPs? Are they enriched in non-synonymous, synonymous, noncoding, specific gene function etc? I'm sure this is discussed in cited papers but a small mention of them here might re-contextualize some of the results discussed here.
[R] We annotated the full SNP matrix using SnpEff (PMC3679285). The results are summarized in the following table:
Across the full matrix, there is no difference in terms of variant types between common and low-frequency variants. In terms of associated SNPs, the same trend is observed:
Overall, low-frequency variants do appear to be a random sampling of all SNPs. There is no significant difference in terms of mutation types between common and low-frequency variants that are associated with any eQTL. We added some discussion of this point in the revised ms (page 7, lines 24-36; page 10, lines 22-31).
6) The data description of '2,002,550 transcriptomic measurements' is very vague as it can relate to the number of transcriptomes rather than the number of genes with at least 1 read in all samples.I would recommend rephrasing.
[R] We rephrased this sentence in the revised ms.
7) There is a typo in Figure 3 (Frequeny instead of Frequency).
[R] We corrected this in the revised ms.
8) The color scheme for Figure 4a/b can be improved. The use of gradients for distinct categories is unusual and unfortunately it is impossible for me to tell which green is which part of the pie.
[R] We changed the color scheme for this figure in the revised ms.
Reviewer #3:
Tsouris et al sequence the genomes of a panel of yeast hybrids and compare that to previously published transcriptome data to explore the genetic basis of variations in mRNA abundance. While the design is classical, the approach employed by the authors allows them to address important questions with fewer confounding effects or better power than some other studies. In particular, the experimental design used here allows the authors to test the contribution of both rare(r) SNPs and structural variants to variation in mRNA abundance. Doing so, they help explore some of the causes of the missing heritability, a well-known problem in the quantitative genetics field.
The main finding of the authors, that both rarer SNPs and structural variants contribute to variation in mRNA abundance but with different magnitudes, is not entirely novel. Indeed, the significant contribution of rarer SNPs is quite well-established and has been the leading theme of several publications, including in yeast. The contribution of structural variation to traits is perhaps less well quantified in the published literature, but the fact that the authors detect relatively few associations of this type means that the generality and robustness of conclusions are somewhat questionable also in the current submission. The measured mRNA abundances correspond to one snapshot in time (optical density ~0.3) for a single environment, and it is quite possible that other time points or other environments would have resulted in different conclusions.
[R] We agree that changing the culture conditions would possibly lead to condition-specific conclusions. But here we are focusing on mid-log growth cultures to have the same point of comparison as our previous population-level RNAseq study (Caudal et al. 2023). Further studies are needed to see the specific effect of culture conditions on expression variation.
There are very few efforts by the authors to extract a deeper understanding of the biology that underlies the observed patterns, and the study does not go beyond the quantification of statistical associations. For example, we do not learn much about which properties of low-frequency variants make them disproportionately likely to contribute to mRNA abundances.
[R] We annotated the full SNP matrix using SnpEff and analyzed the distribution of variant types (non-synonymous, synonymous, intron variants, nonsense variants etc.) across the full matrix as well as among associated SNPs. There are no significant differences between common and low-frequency variants for any of those features, suggesting that low-frequency variants are a random sampling of all SNPs (see more details in the response to reviewer 2, point 5). These comparisons are now included in supplementary tables 10 and 11. We agree with the reviewer that the biology behind these observations is very important, but we are unable to address it yet with the current dataset. We added results and discussion for this point (page 7, lines 24-36; page 10, lines 22-31).
The authors do not trace down candidate causative SNPs nor do they validate some of these, or the contribution of candidate structural variants, which would typically be done in influential yeast GWAS papers. This could have revealed interesting biology and would have inspired confidence in that detected associations are real.
[R] It would indeed be interesting to trace down some of the causative candidates. But as also pointed out by the other reviewers, our goal here is to have a global view of the impact of different variant types on gene expression. Nevertheless, the next step could be to focus on some specific cases to further explore specific biological questions.
In fact, as the mRNA abundance data was taken from a recent publication by the authors, the main new work in this paper is the long-read sequencing, and the GWAS. While competently performed here, long-read sequencing of many yeast genomes has been done since Yue 2017 and is now quite standard. The technical paragraph on the characteristics of structural variants in yeast therefore adds little new, and is not put into the perspective of what has been reported in previous studies. It would thus seem better suited to be in an extended Methods section.
[R] We think this paragraph is useful for the reader to contextualize the number and types of SVs in our diallel panel specifically. It is not intended to give an overview of the SVs in a natural population.
The Discussion has a helpful section on study limitations, but otherwise reiterates results and adds few perspectives or thoughts on why or how yeast, and other organisms, have evolved to produce the observed patterns of the genotype-phenotype maps. Overall, I see the submission as a mostly incremental advancement of the quantitative genetics field. I find it hard to justify its publication in MSB.
Minor comments: The authors harvest cells at around OD 0.3, which is interpreted as cells being in log-phase growth. But this really depends on the starting OD. Moreover, given the microcultivation format used, it's not evident, in absence of data, that a log-phase will occur at all. In this cultivation scale cell populations often start experiencing nutrient restrictions before all cells have exited the lag-phase. Probably, many cell populations are at or close to their peak growth rate at OD = 0.3, but the physiological state of cell populations at the time of harvest will probably differ a bit. These differences in physiological state may underlie some part of the associations observed, e.g. in terms of deviations of accessory genes. This is not to say that the associations are irrelevant, but as these differences in physiological state are likely to be temporary, it does cast some doubt on the extent to which conclusions can be generalized across the life cycle of yeast, across environments and across species.
The authors do exclude singletons from their analysis as these have a too-pronounced confounding population structure.But it wasn't clear to me to what extent they also account for the population structure of other rare variants with similar presence in a few strains.
[R] For the variant matrix, we only removed the singletons. To account for population structure, we used LMM-based GWAS with the genotype matrix as kinship, which should account for such effects.
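To make concrete how a genotype matrix can serve as kinship in an LMM-based GWAS, the Python sketch below computes one common form of genotype-derived relatedness matrix: the mean-centered cross-product of allele dosages divided by the number of variants. This is an assumption chosen for illustration; the response does not state which kinship estimator or software the authors used, and the function name `kinship` and the toy data are hypothetical.

```python
def kinship(genotypes):
    """Mean-centered genotype cross-product, K = Z Z^T / m.

    genotypes: list of individuals, each a list of allele dosages
    (0, 1 or 2) for the same m variants. Z is the dosage matrix with
    each variant (column) centered on its mean across individuals.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(ind[j] for ind in genotypes) / n for j in range(m)]
    centered = [[ind[j] - means[j] for j in range(m)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / m for k in range(n)]
        for i in range(n)
    ]


# Toy example: individuals 0 and 1 are identical, individual 2 differs.
toy = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
]
K = kinship(toy)
```

In the toy matrix, the two identical individuals share the same positive kinship value with each other as with themselves, while their kinship with the third individual is negative. In an LMM, such a matrix defines the covariance of the random effect, so that associations driven purely by shared ancestry, including variants present in only a few related strains, are down-weighted.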
29th Jan 2024 2nd Authors' Response to Reviewers
All editorial and formatting issues were resolved by the authors.
30th Jan 2024 2nd Revision - Editorial Decision
30th Jan 2024
Manuscript number: MSB-2023-12009RR
Title: Diallel panel reveals a significant impact of low-frequency variants on gene expression variation
Dear Joseph,
Thank you again for sending us your revised manuscript. We are now satisfied with the modifications made and I am pleased to inform you that your paper has been accepted for publication. Your manuscript will be processed for publication by EMBO Press. It will be copy edited and you will receive page proofs prior to publication. Please note that you will be contacted by Springer Nature Author Services to complete licensing and payment information.
You may qualify for financial assistance for your publication charges, either via a Springer Nature fully open access agreement or an EMBO initiative. Check your eligibility: https://www.embopress.org/page/journal/17444292/authorguide#chargesguide
Should you be planning a Press Release on your article, please get in contact with embo_production@springernature.com as early as possible in order to coordinate publication and release dates.
If you have any questions, please do not hesitate to contact the Editorial Office. Thank you for your contribution to Molecular Systems Biology.

Maria
Maria Polychronidou, PhD
Senior Editor
Molecular Systems Biology
Please note that it is Molecular Systems Biology policy for the transcript of the editorial process (containing referee reports and your response letter) to be published as an online supplement to each paper. If you do NOT want this, you will need to inform the Editorial Office via email immediately. More information is available here: https://www.embopress.org/transparentprocess#Review_Process
