Haplotypes-based genetic analysis: benefits and challenges

The increasing availability of Single Nucleotide Polymorphisms (SNPs) discovered by Next Generation Sequencing will enable a range of new genetic analyses in crops that were not possible before. Concomitantly, researchers will face the challenge of handling large data sets at the whole-genome level. Grouping thousands of SNPs into a few hundred haplotype blocks reduces the complexity of the data, requiring fewer statistical tests and lowering the probability of spurious associations. Owing to the strong genome structure present in breeding lines of most crops, the deployment of haplotypes could be a powerful complement to improve the efficiency of marker-assisted and genomic selection. This review briefly describes the commonly used approaches to construct haplotype blocks and cites examples in animals and crops where haplotype-based dissection of traits has proven beneficial. Some important considerations when working with haplotypes in crops are reviewed at the end.


Introduction
Advances in Next Generation Sequencing (NGS) technologies through whole-genome (Berkman et al., 2012; Chia et al., 2012), transcriptome (Cavanagh et al., 2013), reduced-representation (Elshire et al., 2011; Poland et al., 2012) and/or exome sequencing (Winfield et al., 2016) have led to new levels of Single Nucleotide Polymorphism (SNP) discovery. Hence, a paradigm shift from marker-based to sequencing-based genotyping of breeding populations and diversified germplasm panels has been observed in the post-genome sequencing era. These developments have facilitated the development of high-density maps, the identification of Quantitative Trait Loci (QTL) and the discovery of new genes in several crops, thus assisting the breeding process (Sehgal et al., 2016, 2017; Singh et al., 2016; Pandey et al., 2017; Su et al., 2017). Polyploid crops such as wheat have benefited especially from these advances, as marker number and density were major gaps in conducting in-depth genetic analyses. Dense sets of SNPs now available from different marker platforms [90K Illumina iSelect, Genotyping-by-Sequencing (GBS), Diversity Array Technology Sequencing (DArTseq), high-density Affymetrix Axiom® genotyping array] have significantly upgraded the genetic toolkit available in wheat. As a result, rapidly growing numbers of breeding lines are being genotyped at low cost (Poland, 2015). In addition, the whole genome sequence (>15 Gb) of wheat is now available, obtained by combining next generation (short Illumina reads) and third generation sequencing data (long Pacific Biosciences reads), which will make cloning of genes feasible (Shi, Ling, 2018).
With the upsurge in dense marker data sets from different genotyping platforms, leading to more markers than observations, scientists will face the challenge of handling large data sets at the whole-genome level for both reliable gene discovery and genomic predictions. New approaches will therefore be required to handle such cumbersome data and make analysis easier. Constructing haplotypes from SNPs is one option for dealing with bulky datasets. Being multi-allelic, haplotypes are more informative than SNPs and allow a more powerful and less exhaustive genome-wide scan. In this review, we first define what haplotypes are and what approaches are available to construct them. We then cite examples in animals and crops where haplotype-based analyses have yielded better results than SNPs in Genome-Wide Association Studies (GWAS), Genomic Prediction (GP) and candidate gene identification.

What are haplotype blocks?
A haplotype block defines a region in the genome that comprises a set of neighboring SNPs whose phased alleles are likely to be inherited together, with little chance of contemporary recombination (Fig. 1). Three main approaches are used to construct haplotype blocks: (1) user-defined length, (2) sliding-window, and (3) linkage disequilibrium (LD). Any of the three can be used, depending on the skills of the user and/or the objective of the research. The user-defined fixed length of haplotype blocks (2 to 15 bp) is the easiest approach; however, the resulting haplotypes do not reflect any biological phenomenon such as LD (Gabriel et al., 2002) or shared evolutionary history (Templeton et al., 2005). The sliding-window approach is the most widely used and has been applied intensively to build haplotypes in GWAS for quantitative or qualitative traits. In this method, the genomic region under study is divided into windows of either uniform or variable size (Tang et al., 2009), and a multiple-marker association test is performed for each window. This approach is easy to use; however, when adjacent SNPs are in strong LD, the windows provide redundant information, making the sliding-window approach no more informative than a single SNP. Similarly, when LD patterns vary over large genomic regions, it is difficult to determine the window size for a genome-wide scan. The LD-based approaches are the most advantageous because they focus directly on the detection of historical recombination in a given population, and LD coefficients are easy to visualize.
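To make the LD-based idea concrete, the following minimal Python sketch groups adjacent SNPs into a block whenever their pairwise r² reaches a threshold. It is a deliberately simplified greedy rule, not the confidence-interval method of Gabriel et al. (2002) or any published algorithm. The function names, the 0/1 coding of phased alleles, and the threshold of 0.8 are assumptions made for illustration only.

```python
from typing import List, Sequence


def r_squared(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Pairwise LD (r^2) between two phased bi-allelic SNPs coded 0/1 per gamete."""
    n = len(snp_a)
    p_a = sum(snp_a) / n                                  # frequency of allele 1 at SNP a
    p_b = sum(snp_b) / n                                  # frequency of allele 1 at SNP b
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n   # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic SNP has undefined LD; treated as 0 here (assumption)
    d = p_ab - p_a * p_b                                  # LD coefficient D
    return d * d / denom


def ld_blocks(snps: List[Sequence[int]], threshold: float = 0.8) -> List[List[int]]:
    """Greedy grouping: extend the current block while adjacent SNPs have r^2 >= threshold."""
    if not snps:
        return []
    blocks = [[0]]
    for i in range(1, len(snps)):
        if r_squared(snps[i - 1], snps[i]) >= threshold:
            blocks[-1].append(i)   # strong LD with the previous SNP: same block
        else:
            blocks.append([i])     # LD breaks down: start a new block
    return blocks


# Hypothetical toy data: four SNPs scored on six phased gametes.
toy_snps = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
]
toy_blocks = ld_blocks(toy_snps)  # SNP indices grouped into blocks
```

In this toy case the first two SNPs are in complete LD, as are the last two, while LD between the second and third SNP is weak, so two blocks result. Real data would also require phasing and handling of missing calls, which the sketch omits.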
Today most genomic analyses such as GWAS or GP use bi-allelic SNP markers. However, SNPs can be combined into short, multi-allelic haplotypes to overcome the bi-allelic limitation and to perform a more powerful and less exhaustive genome scan. Haplotype blocks use information on multiple markers jointly, so local epistatic interactions can be modelled naturally, and the reduced number of parameters enables a range of genomic analyses including GWAS, GP, and/or detection of selection signatures. Further, haplotype blocks can be coded in a simple numeric (binary) form for use in R scripts or Java-based programs. Figure 2 shows how a haplotype block composed of two adjoining SNPs and having four alleles (AC, GT, AT and GC) can be converted to a simple binary 1-0 format.
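The binary coding described above can be sketched in a few lines of Python. The example assumes one haplotype allele per line (as in inbred or fully homozygous material) and creates one 0/1 indicator column per distinct allele. The function name and the sample lines are hypothetical, and the exact layout of Figure 2 may differ.

```python
from typing import Dict, List


def encode_haplotype_block(haplotypes: List[str]) -> Dict[str, List[int]]:
    """Return one 0/1 indicator column per distinct haplotype allele in the block."""
    alleles = sorted(set(haplotypes))  # e.g. ['AC', 'AT', 'GC', 'GT']
    return {
        allele: [1 if h == allele else 0 for h in haplotypes]
        for allele in alleles
    }


# Hypothetical block of two adjoining SNPs scored on five lines.
lines = ["AC", "GT", "AT", "GC", "AC"]
coded = encode_haplotype_block(lines)

for allele, column in coded.items():
    print(allele, column)
```

Each column can then be tested like a bi-allelic marker (presence or absence of that haplotype allele), which is what allows standard GWAS or GP software to accept multi-allelic haplotype blocks.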

Case studies in animals and humans
Haplotype-based GWAS are common in animals and humans (Grapes et al., 2004; Hayes et al., 2007; Calus et al., 2009; Shim et al., 2009; Khankhanian et al., 2015; Jónás et al., 2016; Sato et al., 2016). These studies have generated a plethora of evidence that multi-allelic haplotypes significantly improve the power and robustness of association compared with individual SNPs. A common observation in SNP-based GWAS is the large gap between the variance explained by the identified SNP associations and the total variance, termed the 'missing heritability'. J. Yang et al. (2010) showed that part of the 'missing heritability' could be attributed to a lack of LD between SNP markers and causative variants. Combining neighboring SNPs into haplotype blocks is a simple way to generate more complete LD. Haplotype-based methods have been shown to reduce the heritability gap in many cases compared with SNP-based methods when both were applied to the same dataset. P. Khankhanian et al. (2015) investigated the genetic basis of Multiple Sclerosis (MS), a complex genetic disorder in humans controlled by a major histocompatibility complex (MHC) on the short arm of chromosome 6. Haplotypes of various lengths (from 1 up to 15 contiguous SNPs) were constructed at each of the 110 previously identified MS-associated genomic regions. The results based on haplotypes outperformed those using individual SNPs by at least three orders of magnitude. Moreover, when 932 MS-associated haplotypes (identified from 102 genomic regions) were included as independent variables in a logistic linear model, the amount of MS heritability explained was 38 %, whereas with individual SNPs it was 29 %.
Simulations based on the LD and population history of livestock have shown that haplotypes can provide greater QTL detection power and mapping accuracy than single markers (Hayes et al., 2007; Calus et al., 2009). The use of haplotypes has also led to the discovery of new genetic regions of interest that had not been identified by SNP-based GWAS (Lu et al., 2003; Hagenblad et al., 2004; Shim et al., 2009). W. Barendse (2011) showed that haplotype analysis improved the evidence for candidate genes for intramuscular fat (IMF) percentage in cattle: for the five genes that showed some evidence of association with IMF, haplotypes explained around 80 % more of the phenotypic variance than individual SNP analyses. Further studies in animal breeding have also accumulated evidence that integrating haplotypes or haplotype-tagged QTL in genomic selection models can improve GP accuracies for complex traits (Boichard et al., 2012; Cuyabano et al., 2014, 2015a; Jónás et al., 2016; Hess et al., 2017; Jiang et al., 2018).

Case studies in crops
Although only a few case studies have been reported in crops, results have been encouraging towards haplotype-based analyses. [...] Coordinated Agricultural Project. Three associations were found for heading date, two of which were detected by haplotype analyses only. Further, the authors determined the effect of three sets of QTL simulations. The power of individual SNP-based analysis was superior to that of haplotypes when the causal SNP was present in the genotyping data. In the absence of the causal SNP, haplotype-based GWAS was more powerful in detecting QTL than SNPs. In the latter case, however, the method used to construct haplotype blocks affected the power of the GWAS. Y. Ma et al. (2016) studied the effect of marker preselection on prediction accuracy for plant height and yield per plant in soybean. The three strategies tested were (a) a random SNP sampling method (RSM), (b) haplotype block analysis-based sampling (HBA), and (c) an even SNP sampling method (ESM). They found that for grain yield, prediction accuracy increased by approximately 4 % with the HBA approach compared with RSM and ESM. Y. Lu et al. (2012) conducted comparative LD mapping using SNPs and haplotype blocks to identify QTL for plant height and biomass under drought stress in maize. They used a 10 kb sliding-window approach, accounting for the average length of LD, to construct haplotype blocks. Using haplotype-based LD mapping, three and 12 significant haplotypes were identified for plant height and biomass, respectively, of which six contained at least one SNP that was also significantly associated with the specific trait in SNP-based LD mapping. The haplotype-based analysis explained higher phenotypic variation (on average 2.9 %) than SNPs for both traits.
A few genetic studies have attempted to model the effect of interactions between haplotypes (epistasis) on quantitative traits in crops. Some examples include the vernalization response in barley (Cockram et al., 2007) and chlorophyll content in rapeseed (Qian et al., 2016).

Studies in wheat (published and ongoing)
In wheat, very few studies have so far conducted haplotype-based genetic analyses. K. Voss-Fels et al. (2017) explored molecular interactions connecting root and shoot development and growth in European elite wheat germplasm to investigate the plant's demand for water and nutrients along with its ability to access them. They mapped two highly significant haplotypes for root biomass in close proximity to a major locus known to affect spike development. They concluded that strong selection for a haplotype variant controlling heading date has possibly eliminated a specific combination of two flanking, highly conserved haplotype variants whose interaction confers increased root biomass. Breeders could reverse this consequence of selection to recover root diversity that may be useful under stress environments.
N'Diaye et al. (2017) conducted a SNP- and haplotype-based GWAS of semolina and pasta color in elite durum wheat lines. They combined SNPs within a window size of 5.3 cM (based on average LD decay) on the same chromosome to form haplotype blocks. Haplotype-based GWAS resulted in an increase in the phenotypic variance explained (50.4 % on average) and in the allelic effect (33.7 % on average) compared with SNP-based GWAS.
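A window-based grouping of this kind can be sketched as follows. The code bins markers on the same chromosome into consecutive, non-overlapping windows of fixed genetic length. Whether N'Diaye et al. (2017) used non-overlapping bins or another windowing rule is not stated here, so the binning rule, the function name, and the marker names and positions are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Marker = Tuple[str, str, float]  # (marker name, chromosome, position in cM)


def window_blocks(
    markers: List[Marker], window_cm: float = 5.3
) -> Dict[Tuple[str, int], List[str]]:
    """Group markers into consecutive windows of `window_cm` on each chromosome."""
    blocks: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    # Sort by chromosome, then by position, so markers stay in map order in each block.
    for name, chrom, pos in sorted(markers, key=lambda m: (m[1], m[2])):
        window_index = int(pos // window_cm)  # 0 for [0, 5.3), 1 for [5.3, 10.6), ...
        blocks[(chrom, window_index)].append(name)
    return dict(blocks)


# Hypothetical markers on two chromosomes.
example_markers = [
    ("snp1", "1A", 0.4),
    ("snp2", "1A", 3.9),
    ("snp3", "1A", 6.0),
    ("snp4", "1B", 1.2),
]
example_blocks = window_blocks(example_markers)
```

Markers are never merged across chromosomes because the chromosome is part of the block key. A window holding a single marker is simply an ordinary SNP.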
In the past decade, CIMMYT has adopted various high-throughput genotyping platforms, including the 20K and 90K Illumina iSelect SNP arrays, the Breeders' 35K Axiom® array (Affymetrix), and DArTseq GBS. As a result, large data sets have been generated on different sets of germplasm. Several SNP-based GWAS have been performed (reviewed in Dreisigacker et al., 2019) and haplotype-based GWAS has been tested initially. A recent example is the haplotype-based quantification of exotic (landrace, synthetics, etc.) genome imprints in pre-breeding germplasm (Singh et al., 2018). A set of 984 pre-breeding lines (PBLs) generated by a three-way cross (exotic/elite1//elite2) was genotyped with DArTseq and phenotyped for a range of agronomic traits under stress environments. The LD approach identified 361 and 367 haplotype blocks in PBLs and exotics, respectively. Block-by-block comparison on each chromosome revealed that 58 (16 %) of the blocks identified in PBLs were exotic-specific. Further, a rare and favorable haplotype (GT) was identified on chromosome 6D that minimized grain yield (GY) loss under heat stress without penalty under irrigated conditions.
A large GWAS using haplotypes and individual SNPs was performed for GY and the superiority index Pi (a measure of GY stability) using a large set of advanced bread wheat lines (4,302), which were genotyped with GBS markers and phenotyped under contrasting (irrigated and stress) environments (unpublished work). The average R² (variation explained) was 6.1 to 9.9 % higher with the haplotype-based GWAS than with the individual SNP-based GWAS for GY and Pi (Sehgal et al., personal communication). We further explored whether integrating haplotype-tagged QTL for GY as fixed variables in prediction models improved prediction accuracy. The model including the haplotype-based GWAS results as fixed effects increased prediction accuracy by up to 9 to 10 %, whereas the increase was only 4 to 5 % with SNP-tagged QTL. Similarly, haplotype-based GWAS conducted for thousand-grain weight identified four major loci in CIMMYT germplasm; all four loci showed higher p values than the associated individual SNPs on chromosomes 4A and 6A.
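Treating haplotype-tagged QTL as fixed variables can be written as a linear mixed model. The formulation below is a generic sketch under assumed notation. The text does not specify which prediction model was actually fitted, and all symbols are introduced here for illustration only.

```latex
\mathbf{y} = \mathbf{1}\mu + \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},
\qquad
\mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma^{2}_{u}),
\qquad
\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{I}\sigma^{2}_{\varepsilon})
```

Here y is the vector of GY phenotypes, μ is the overall mean, X contains the binary-coded alleles of the haplotype blocks declared significant in the GWAS with fixed effects β, Z contains the remaining genome-wide markers with random effects u, and ε is the residual. Setting X to SNP-tagged QTL instead gives the SNP-based counterpart for comparison.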

Considerations and challenges
Due to the growing availability of SNP datasets in crops, the use of haplotype-based approaches for genomic analyses is likely to increase markedly. However, the power of analyses using haplotypes vs. SNPs must be evaluated on a case-by-case basis, as risk factors are common to both approaches. For example, under certain disease models (simple Mendelian or complex multi-gene additive or epistatic inheritance) and certain LD patterns, one method outperforms the other, so different QTL architectures and LD patterns interact with marker characteristics to influence power in GWAS. Similarly, bottlenecks are known to increase LD and shift allele frequency spectra toward higher minor allele frequencies. Hence, after a bottleneck, SNPs are more likely to be in LD with QTL and haplotypes might provide little advantage. Marker ascertainment, a characteristic of SNP chips, is another important consideration. In the standard method of developing a SNP chip or array, a small SNP discovery panel is used, which means that low-frequency mutations often go undetected and SNPs occurring at intermediate to high frequencies dominate such chips or arrays. This over-sampling of mutations at intermediate frequencies results in lower levels of LD than if SNPs were selected randomly. For GP, haplotype-based prediction approaches are favored only if alleles at QTL are more closely linked to the haplotype than to individual SNPs. Finally, map order errors can play a significant role in determining the safest and best approach for analysis. For example, SNP analysis is unlikely to be affected by ordering errors and hence is the best approach when map order is doubtful.