Second-Generation Linkage Maps for the Pacific Oyster Crassostrea gigas Reveal Errors in Assembly of Genome Scaffolds

The Pacific oyster Crassostrea gigas, a widely cultivated marine bivalve mollusc, is becoming a genetically and genomically enabled model for highly fecund marine metazoans with complex life-histories. A genome sequence is available for the Pacific oyster, as are first-generation, low-density, linkage and gene-centromere maps mostly constructed from microsatellite DNA markers. Here, higher density, second-generation, linkage maps are constructed from more than 1100 coding (exonic) single-nucleotide polymorphisms (SNPs), as well as 66 previously mapped microsatellite DNA markers, all typed in five families of Pacific oysters (nearly 172,000 genotypes). The map comprises 10 linkage groups, as expected, has an average total length of 588 cM, an average marker-spacing of 1.0 cM, and covers 86% of a genome estimated to be 616 cM. All but seven of the mapped SNPs map to 618 genome scaffolds; 260 scaffolds contain two or more mapped SNPs, but for 100 of these scaffolds (38.5%), the contained SNPs map to different linkage groups, suggesting widespread errors in scaffold assemblies. The 100 misassembled scaffolds are significantly longer than those that map to a single linkage group. On the genetic maps, marker orders and intermarker distances vary across families and mapping methods, owing to an abundance of markers segregating from only one parent, to widespread distortions of segregation ratios caused by early mortality, as previously observed for oysters, and to genotyping errors. Maps made from framework markers provide stronger support for marker orders and reasonable map lengths and are used to produce a consensus high-density linkage map containing 656 markers.


Illumina
GoldenGate bead assays regression maps maximum likelihood maps distortion of segregation ratios genotyping error The Pacific oyster Crassostrea gigas is one of the most widely cultivated marine species, having been introduced for this purpose from Asia to all continents but Antarctica (Mann 1979;Food and Agriculture Organization 2015). Global annual production is uncertain, owing to taxonomic confusion in reports from China and other countries (Wang et al. 2008), but is conservatively estimated as 555,914 metric tons in 2013 (Food and Agriculture Organization 2015), down from 4.6 million metric tons of cupped oysters grouped under the Pacific oyster moniker in 2006. Because of its commercial importance, a number of breeding programs have been initiated over the years (Hershberger et al. 1984;Langdon et al. 2003;Hedgecock and Davis 2007;Dégremont et al. 2010), substantial genetic and genomic resources have been developed [genetic lines; expressed sequence tag (EST) collections; transcriptomes; bacterial artificial chromosome libraries; Hedgecock et al. 2005;Curole and Hedgecock 2007], and recently the genome has been sequenced (Zhang et al. 2012). The Pacific oyster is also becoming a model species for research on environmental resilience (Applebaum et al. 2014).
Linkage maps are essential tools useful for mapping quantitative-trait loci (QTL; Lander and Botstein 1989;Doerge 2002), marker-assisted selection, positional cloning, and genome assembly. Yet, construction of a single, high-density consensus linkage map for a species, particularly for one such as the Pacific oyster, which has a high load of early lethal or highly deleterious mutations (Launey and Hedgecock 2001;Plough and Hedgecock 2011), is a challenging goal.
Previously published linkage maps for the Pacific oyster were based on microsatellite DNA markers (Hubert and Hedgecock 2004;Hubert et al. 2009;Plough and Hedgecock 2011), on dominant amplified fragment-length polymorphism (AFLP) markers or combinations of AFLP and microsatellite markers (Li and Guo 2004;Guo et al. 2012), and more recently (Sauvage et al. 2010;Zhong et al. 2014) on microsatellite markers together with single-nucleotide polymorphisms (SNPs). Although adequate for mapping viability QTL (Plough and Hedgecock 2011;Plough 2012), these first-generation maps have rather low marker densities. Here, higher density, second-generation linkage maps are constructed for the Pacific oyster, based largely on assays of a fixed set of 1536 coding (exonic), SNPs in five mapping families.
Variability in marker orders and intermarker distances among individual family maps and between different mapping methods presents formidable challenges to the construction of a consensus linkage map for this species. Harmonizing results from regression (RG) and maximum likelihood (ML) mapping methods, we identify three factors behind uncertainties in marker orders and inflated map distances: (i) markers segregating from only one parent, (ii) markers linked to chromosomal regions impacted by strong, early viability selection, and (iii) genotyping errors. In the end, a single consensus map, constructed from the most reliable set of individual framework maps, is achieved. SNP linkage information has an unexpected bearing on the current genome assembly, insofar as it implies that a large proportion of genome scaffolds may be incorrectly assembled.

Mapping families
Five full-sib families of Pacific oysters were used for linkage mapping. Two were F 2 families, 2 · 10 (n = 90) and 51 · 35 (n = 108), each derived from an intercross of full-sib F 1 hybrids, which, in turn, were produced by crosses of partially inbred lines (i.e., 2, 10; 51, 35; Hedgecock and Davis 2007). Three other families, F12, F20, and F45 (each, n = 46), were produced by crosses between adult oysters taken in 2006 from the wild population in Dabob Bay, Washington (L. V. Plough, G. Shin, and D. Hedgecock, unpublished data); these random-bred, full-sib families were the G 0 foundation stock for development of inbred lines (Hedgecock and Davis 2007).

EST library sequencing and transcriptome assembly
To construct libraries of expressed DNA sequences, RNA was extracted from 2-yr-old oysters, from G 3 inbred lines 51 and 35, after we subjected small groups to different stressful conditions. The pedigree of all oysters was confirmed by typing microsatellite DNA markers on the inbred oysters and tissue samples from their parents (Curole and Hedgecock 2007). RNA was pooled within the inbred lines and made into coded cDNA libraries. Four additional larval cDNA libraries, two from 2-hr to 12-hr-old larvae and two from 15-hr to 30-hr-old larvae, were also constructed.
These libraries were Sanger-sequenced by the U.S. DOE Joint Genome Institute (JGI; larval libraries CBSI, CBSN, CBSO, and CBSP; and adult libraries CCBN, CCTP, and CCTS; GenBank accessions HS109651-HS248985). The adult libraries contributed 126,331 (90.7%) of the 139,335 EST sequences obtained from these libraries. Additional sequence was generated from a putative, nonredundant subset of 11,904 cDNA clones, which were pooled, their inserts fragmented,and sequenced by Illumina Genome Analyzer methods (SRA accession SRR2119182). Further sequence information came from two other larval cDNA libraries, CFPP and CFPS, which were made, respectively, from 6-dand 18-d-old larvae of a reciprocal hybrid between inbred lines 35 and 51; these libraries were sequenced by 454 GS FLX Titanium methods (Gen-Bank accessions SRX032364 and SRX032365). Sequences from all libraries were assembled, using miraEST (Chevreux et al. 2004), into a C. gigas transcriptome comprising 52,095 unigenes.
SNP discovery, marker development, and genotyping The JGI ESTs (145,600 sequences) were compared with the transcriptome assembly, using BLAST (blastn, cutoff 1e-50) and the best hit for each EST was aligned with the EST sequence using CLUSTAL. Alignments of cDNA sequences from the two inbred lines 51 and 35 were combined to generate a SAM file for all ESTs. Separately, reads from two additional Illumina RNA-Seq libraries (SRP061799) were compared with the assembled transcriptome, using BWA (Li and Durbin 2009; http://bio-bwa.sourceforge.net/). The three SAM output files were combined, using the pileup command in SAMtools ; http://www.htslib.org/) to find 126,392 sites with potential variants (whether true SNPs or read errors). Candidate SNPs were sorted according to the amount of supporting sequence evidence and whether they were segregating within or between the two inbred lines; each alternative allele had to be supported by two or more sequences. Additionally, the unigene contigs were mapped on a draft version of the oyster genome, so that intron-exon boundaries could be identified. The list of high-quality SNPs was further refined to remove candidates located within 50 nucleotides of an intron-exon boundary, so that a 100-nt GoldenGate assay probe centered on the SNP site (Shen et al. 2005) would be contained entirely within a single exon. A total of 4122, 100-nt probes were selected, after these steps, for design of GoldenGate probes.
From the list of potential probes, we chose the top 1536 SNPs for the final GoldenGate bead array (Shen et al. 2005), which was used to genotype four 96-well plates containing the five mapping families. Subsequently, 27 SNPs were found to have duplicates in the array, 6 SNPs were present in triplicate, and 1 SNP was present in quadruplicate; these repeats were removed from the mapping data sets and the subsequent tally of SNPs and genome scaffolds mapped. Although present in earlier assemblies of the genome, which enabled identification of exon boundaries, seven of the assayed SNPs were not present in the final genome assembly; these retain the "USC" reference numbers from the transcriptome assembly. Otherwise, SNPs were assigned names based on the genome scaffold number and nucleotide position (see Genomic locations and annotations of mapped, coding SNPs).
To determine genotyping error in the GoldenGate bead assay, nine individual samples were run in duplicate on plates processed at the same time. Discounting for missing data and duplicated markers, an average of 1437 SNPs were compared in duplicate samples, a total of 12,937 genotypes; no discrepancies between duplicated genotypes were found.
In addition to SNPs assayed by the GoldenGate assay, we genotyped 66 microsatellite DNA markers that have been mapped in previous studies Hedgecock 2004, Hubert et al. 2009;Plough andHedgecock 2011, Plough 2012). We also typed 17 SNPs in families F45 and 51 · 35, which were identified early by the SNP discovery pipeline but genotyped on a Sequenome platform (L. V. Plough, G. Shin, and D. Hedgecock, unpublished data). Across families and markers, a total of 179,263 genotypes were determined in this study and contributed to the linkage maps to be described. Complete genotype data sets for the five families are given in Supporting Information, Table S1, Table S2,  Table S3, Table S4, and Table S5.

Linkage analyses
Linkage mapping calculations were made with JoinMap 4.1 (Van Ooijen 2006), using the cross-pollinated coding of genotypes to accommodate the multiple mating types observed in the full-sib families (Table 1; here, but not in Table S1, Table S2, Table S3, Table S4, and Table S5, coding of markers segregating from only one parent is reversed from JoinMap convention, to reflect our long-standing practice of writing controlled crosses as ♂·♀). Markers missing data for more than 10% of individuals in a family were excluded. Initial groupings of markers were based on independence log of the odds (LOD) tests in steps from a LOD of 1.0 (independence P threshold, 0.001; recombination threshold, 0.25) to a LOD of 10.0 (independence P threshold, 0.0001; recombination threshold, 0.05). The independence LOD is less likely than the linkage LOD test to find spurious linkage, when segregation distortion, such as that observed for oysters (Bierne et al. 1998;Launey and Hedgecock 2001;Plough andHedgecock 2011), is present (Van Ooijen 2006). The final grouping LOD is the maximum threshold value passed by all markers in at least one test of independence with other markers in the group. Previously mapped microsatellite markers (Hubert and Hedgecock 2004;Plough and Hedgecock 2011) were used to associate the resulting groupings with the 10 identified linkage groups (LG). In each family, sets of markers with identical genotypes were identified in subsequent mapping steps. Since identical markers map to the same location, we used only one marker from each set of identical markers, to increase the efficiency of linkage calculations, and we subsequently positioned initially excluded identical markers at the locus where their representative mapped.
As observed previously for the Pacific oyster (Launey and Hedgecock 2001;Plough and Hedgecock 2011;Plough 2012), we find widespread distortions of Mendelian segregation ratios in most linkage groups, as reflected by highly significant x 2 goodness-of-fit tests. The extent of such distortions on maps is summarized by the proportions of markers on a LG that show departures from expected Mendelian segregation ratios at a nominal significance threshold of a # 0.01. A two-way analysis of variance on arcsine square root transformations of these proportions is used to assess variance among families and linkage groups.
Both RG (Stam 1993) and ML (Jansen et al. 2001) mapping methods, as implemented in JoinMap 4.1, were used to construct linkage maps. Initial RG linkage maps were made with all markers, using default settings (maximum recombination frequency of 0.4, minimum LOD of 1.0, goodness-of-fit jump threshold for removing loci of 5.0), Kosambi's mapping function, three-rounds of fitting (a second round attempts to add loci removed in the first round, with a third round forcing all markers onto the map), and a ripple performed after each marker addition. Kosambi's mapping function is justified by evidence for crossover interference (Hubert et al. 2009). Occasionally, insufficient linkage or failure to determine linkage phase with default settings required relaxation of thresholds. In three cases (LG 2 in 51 · 35 and LG 8 and LG 10 in F20, the latter two requiring relaxation of linkage thresholds), single markers that appeared to cause poor fit or excessive map lengths were removed before proceeding with mapping. Nearestneighbor fit-the sum of the absolute values of the differences between observed pairwise and calculated map distances (in centiMorgans, cM) of each marker to its nearest informative neighbors on either side-is used as an indicator of the quality of the map order.
The ML mapping method uses multipoint, ML-based algorithms to determine optimal intermarker distances and marker order and simulated annealing to provide statistical support for marker order. Distance in the ML method is calculated with the Haldane mapping function, which assumes no cross-over interference and may overestimate dis-tances when interference is present. We used the JoinMap default spatial sampling thresholds (0.10, 0.05, 0.03, 0.02, and 0.01) and three rounds of optimization per sample. Default parameters for map-order optimization were used except that chain length was increased from 10,000 to 20,000. Parameters for multipoint estimation of recombination frequency were generally increased (length of burn-in chain, from 10,000 to 20,000; number of Monte Carlo EM cycles, from 4 to 10; chain length per EM cycles, from 1000 to 5000) to ensure convergence.
We attempted to resolve discrepancies in map orders and marker distances between RG and ML maps by focusing, first, on biparentally segregating framework markers and then on a set of well-spaced anchor loci. When suitable congruence between RG and ML maps for anchor loci was achieved, we applied the resulting order as a fixed order to the mapping of framework markers and all markers. Maps of framework markers were used to construct consensus maps and to compare pairwise recombination rates between male and female parents.
The length of the genome was estimated for each family by two methods. First, we calculated s, the average spacing of markers, as the sum of the lengths of the 10 linkage maps within each family divided by the total number of mapped intervals (i.e., the total number of markers minus the number of linkage groups, 10). Genome length was then estimated as the sum of linkage group lengths plus 20s, to account for intervals beyond the most distal markers on both ends of the 10 linkage groups (following Fishman et al. 2001). We also estimated genome length for each family as the sum of map lengths multiplied by (m + 1)/(m 2 1), where m is the number of markers per linkage group (Chakravarti et al. 1991). Finally, for each estimate of map length, we estimated genome coverage as 1e -2sn/L , where s is average spacing, n is total number of markers assigned to linkage groups, and L is the total map length (Bishop et al. 1983).
A consensus map was made, using MergeMap (Wu et al. 2011); owing to the explicit statistical support for marker order that the ML method provides, we merged family ML framework maps. Prior to merging, we removed poor-fitting markers from some family maps to improve agreement in map lengths and marker-orders between RG and ML maps. The G 0 families did not have enough bi-parentally segregating markers to provide framework maps for LG 9. In addition, we removed LG 2 and LG 7 maps for G 0 families F20 and F45 from the final consensus map, because they fit a preliminary merged map poorly; likewise, we removed F12 and F20 framework maps from the final consensus map for LG 10. Families were weighted in MergeMap, roughly by the number of individuals typed, such that the G 0 full-sib families were weighted 1.0 and the two F 2 families were weighted 2.0. The inverse Haldane mapping function was used to convert ML map units to recombination fractions, which were then converted into Kosambi map units prior to submitting to MergeMap. The consensus maps produced by MergeMap still appeared to have inflated distances, compared to the individual maps, so, after adjusting individual map origins to the appropriate consensus-map position, we performed n linear regressions of individual maps onto consensus maps in order to adjust final distances on each consensus linkage map.

Genomic locations and annotations of mapped, coding SNPs
The SNP-containing sequences were aligned with the reference genome using BWA-MEM (H. Li; http://arxiv.org/abs/1303.3997) and each SNP location was identified using a custom script. Loci were named (e.g., 3-G98282A) according to the genome scaffold to which they aligned (3 in the example), followed by the nucleotide position within that scaffold (98,282 in the example); the position is preceded by the base in the reference genome (G) and followed by the substitution observed in that strand (A).

Data availability
Accession numbers to sequences used for SNP discovery are given in "EST library sequencing and transcriptome assembly" and in "SNP discovery, marker development and genotyping." Supporting information at http://www.g3journal.org/content/suppl/2015/08/04/ g3.115.019570.DC1 contains detailed descriptions and links to Excel Table S1, Table S2, Table S3, Table S4, Table S5, Table S6, and Table S7, containing the genotype-input files for JoinMap, the mapping of SNPs to scaffolds and to linkage groups, and the consensus linkage map.
Tables 6 and 7 provide a roadmap to parameters used in sequential steps of the linkage mapping analyses.

Polymorphism of markers and preliminary assignment to
LGs Each of the mapped 1172 SNPs or microsatellite DNA markers can be classified into one of five possible parental mating types (♂·♀), np · nn, ll · lm, hk · hk, ef · eg, and ab · cd, depending on whether parent-pairs have two, three, or four alleles. The distribution of markers across these categories (Table 1) reflects the types of markers assayed and levels of marker polymorphism in the populations from which the parents were drawn. There is a preponderance of two-allele types (96.2%), as expected for the bulk of bi-allelic SNP markers. Across the three G 0 families, proportions of two-allele mating types are homogeneous (0.38 for np · nn, 0.40 for ll · lm, and 0.22 for hk · hk; x 2 = 0.65, 4 df, P = 0.96), but the two F 2 families have many more hk · hk mating types, reflecting derivation from a cross of inbred grandparents (i.e., hh · kk). The two F 2 families differ significantly from each other, however, with 51 · 35 having a much greater relative proportion of hk · hk mating types than 2 · 10 (x 2 = 89.8, 2 df, P , 0.0001); this difference is consistent with the initial selection of SNPs fixed for different alleles n b Includes previously mapped microsatellite DNA markers (Hubert and Hedgecock 2004;Plough and Hedgecock 2011). c Calculated as the total length of all linkage groups divided by the total number of map intervals, which is the total number of markers minus the number of linkage groups. d Calculated following Fishman et al. (2001). e Calculated following Chakravarti et al. (1991). f Calculated following Bishop et al. (1983), using genome length 1. g Calculated following Bishop et al. (1983), using genome length 2.
in the 51 vs. 35 inbred parent lines. Microsatellite DNA markers account for all of the ab · cd mating types and most of the ef · eg mating types. Some SNPs fall into the ef · eg category, owing to inference of a shared null allele (i.e., apparent aa · cc mating types produce progeny types, aa, cc, ac, and blanks; such crosses were re-interpreted as aØ · cØ, where Ø is a non-amplifying null allele, yielding aØ, cØ, ac, and ØØ progeny). In keeping with their inbred history, the two F 2 families differ noticeably from the three, random-bred G 0 families, in having almost 100 fewer markers mapped (526 vs. 623) and many fewer ab · cd mating types (1.0 vs. 18.7, on average; Table 1). Markers are initially grouped through pairwise G-tests of independent segregation. Summing across all five mapping families, a total 1085 coding SNPs are grouped, together with enough previously mapped microsatellite DNA markers, 66 in total, to establish consistency with previous linkage maps (Hubert and Hedgecock 2004;Hubert et al. 2009;Plough and Hedgecock 2011). Markers are provisionally mapped, using the regression mapping method, yielding 10 linkage groups in each family, as expected, with total lengths ranging from 508 to 690 cM and an average total map length of 588 cM (Table 2). With a mean of 584 markers per map, including identical loci, these maps have an average marker spacing of 1.04 cM. The 50 linkage groups range in size from 28.7 to 198.4 cM, with an average of 59 cM. Only five linkage groups (one each for LGs 1, 3, 4, 7, and 8) are greater than 85 cM in length, and these appear to be outliers with respect to map lengths obtained in the other families (Table 2); we return to these excessively long linkage groups in a later section. Genome lengths are estimated for each family by two methods (Fishman et al. 2001;Chakravarti et al. 1991) and range from 524 to 744 cM, with averages of 609 and 623 cM, respectively (Table 2). Despite variability in genome lengths among families, genome coverages are 85% or 86% in all cases (Table 2).
Most SNPs in the GoldenGate assay (823 of 1085, 75.9%) are mapped in two or more families (Table 3), lending confidence to LG assignments; of these, only two (1255-G326792C and 43940-T130339C) map to different linkage groups in different families. Close inspection of the mapping support for these markers did not uncover any explanation, as it did in a dozen similar cases for which data transcription errors were discovered and corrected. These two markers are dropped from further linkage analyses.

Mapping of genome scaffolds
All but seven of the 1085 mapped SNPs can be located on genome scaffolds (Zhang et al. 2012; Table S6). Of the 618 scaffolds with mapped SNPs, 358 contain one SNP, and 260 contain two or more SNPs (Table 4). Of the scaffolds with two or more SNPs, 100 (38.5%) contain SNPs that unexpectedly map to two or more linkage groups (Table 5). Cross-classifying scaffolds by the number of SNPs they contain and by the number of linkage groups to which they map reveals a highly significant, positive association between the two factors (contingency x 2 = 31.84, 3 d.f., for four levels of SNPs per scaffold-2, 3, 4, and .4 SNPs-and two classes of mapping-to one linkage group or to more than one linkage group; P , , 0.0001). The median length of the 100 scaffolds that map to two or more linkage groups, 740 kbp, is significantly longer than the median length of the 160 scaffolds that map to a single linkage group, 403 kbp ( Figure 1A), and the more SNPs per scaffold, the longer the average multiply-mapped scaffold compared to the average singly-mapped scaffold ( Figure 1A inset). Average distance between adjacent SNPs that map to different linkage groups is 273 kbp (n = 118), whereas the average distance between adjacent SNPs that map to the same linkage group is 58.2 kbp (n = 337; t = 11.14, P , , 0.0001, two-tailed test of difference in sample means with unequal variances). Numbers of ambiguous bases (coded as N in the published genome), between adjacent SNPs on a scaffold that either do or do not map to the same linkage group, are also significantly different ( Figure  1B). The distribution of the number of Ns between adjacent SNPs mapping to the same linkage group has a mode at zero and a median of 11, whereas between adjacent SNPs that map to different linkage groups, the distribution has a mode at nearly 40,000 and a median just over 25,000 ambiguous bases.
Finally, we consider the number of "breakpoints" necessary to account for the order with which same-scaffold SNPs are mapped to linkage groups. For example, a scaffold with three SNPs mapping to two linkage groups (A and B) can have one breakpoint (LG-assignment orders, AAB or ABB) or two breakpoints (ABA or BAB); of the 27 scaffolds in this category (Table 5), 26 have the minimum of one breakpoint and only one has two breakpoints. Similarly, of the 13 scaffolds with four SNPs mapping to two linkage groups, all 13 have one breakpoint (6, AAAB and 7, AABB) and none has an order requiring two breakpoints (AABA) or three breakpoints (ABAB). Indeed, across the 60 informative combinations in Table 5 (i.e., cases in which the number of SNPs . number of LG assignments), only two cases require more than the minimum number of breakpoints.

Comparison of RG and ML maps
Provisional linkage maps made by the RG method generally provide poor fits to observed recombination rates among markers (Table 6, nonidentical loci only). Threshold LOD values for grouping range from 3 to 10, reflecting differences in the number of progeny genotyped, such that the two larger F 2 families have average LOD values near 9 and the three smaller G 0 families have average LOD values near 5. In seven of 50 cases (five families · 10 linkage groups), default thresholds for linkage had to be relaxed to permit all markers to be added. For 44 of 50 maps, linkage of all markers could only be achieved by forcing remaining markers onto a third-round RG map, sometimes resulting in large contributions to total x 2 or negative distances to existing mapped markers; on average, 10.6 makers were forced onto the 44 linkage n n groups, with a range from one to 42 markers. Maximum values for the nearest-neighbor fit statistic range from a low of 0.2 cM, for LG 2 in 2 · 10, to 621.3 cM, for LG 10 in the same family; indeed, 36 of 50 fit statistics exceed the estimated lengths of their linkage groups (Table 6). Maximum nearest neighbor fit is significantly greater for forced than for unforced maps (252.6 cM vs. 74.0 cM, F 1,48 = 4.612, P = 0.037).
The ML method yields better fitting maps, in most but not all cases; the maximum nearest-neighbor fit statistic per linkage group exceeds map length in only three of 50 cases. However, the ML maps are invariably longer than those produced by the regression method ( Table  6). Nine of the 50 ML maps have lengths of 5000+ or 10,000+ cM, owing to markers that have recombination frequencies with flanking markers of 0.5 in one parent but not in the other parent (the ML map is an average of the two parental maps). Often, these markers are segregating from only one parent (mating types, np · nn or ll · lm in Table  1). Excluding these excessively long maps, the average ML map is still three times longer than the corresponding regression map. The orders of markers produced by these maps are, nevertheless, strongly supported by the simulated annealing algorithm implemented in JoinMap (Jansen et al. 2001), except for closely linked markers. Agreement between the order of markers on RG and ML maps is assessed by the proportion of variance explained by a linear fit (r 2 ) of RG marker rank order on ML marker rank order. These r 2 values range from 0.01 to 1.0, with 24 of 50 exceeding 0.75 but 8 falling below 0.1 (Table 6) Distortion of Mendelian segregation ratios is widespread over these maps, as observed in previous studies (Launey and Hedgecock 2001;Li and Guo 2004;Sauvage et al. 2010;Plough and Hedgecock 2011;Plough 2012;Guo et al. 2012). The extent of this distortion is represented in Table 6 and Table 7 by the proportion of markers on each linkage group with a x 2 goodness-of-fit test probability less than or equal to 0.01. In the vast majority of cases, markers with distorted segregation ratios are clustered on linkage maps, again, as observed in previous studies (op. cit.). The mean proportion of distorted markers on a linkage group, 0.31, does not vary among families but does vary among linkage groups (F 9,36 = 2.404, P = 0.03), ranging from a low of 0.034 on LG 1 to highs of 0.56 on LG 10 and 0.61 on LG 3. There is no significant relationship between proportions of markers distorted and either map lengths or maximum nearest-neighbor fit statistics across linkage groups; but certain distorted markers clearly do contribute to poor fit and inflated map lengths (examples given below).
Having observed effects of markers segregating from only one parent on map orders and lengths, we proceed with linkage maps of framework markers segregating from both parents (mating types hk · hk, ef · eg, and ab · cd in Table 1). For five framework maps, we exclude obviously poorly fitting markers, which are microsatellite DNA markers in seven of 17 cases. Thresholds for regression mapping are relaxed in only three of 47 cases (Table 7; the total number of maps is reduced from 50 because families F12, F20, and F45 had too few two biparentally segregating markers on LG 9 for map construction). Only 11 of 47 regression-method maps are third-round maps, which force an average of 3.2 markers onto those maps. Maximum nearest-neighbor fit still ranges widely, from 0.1 cM to 615.5 cM, but only 10 of 47 fit statistics exceed the estimated lengths of their LGs. As in the analyses of all markers, ML maps generally fit the data on biparentally segregating markers better than do RG maps, although the ML maps are still longer in each case. Excluding two maps that exceed 5000 cM, the average ML map is 1.6 times as long as its regression-method counterpart. The r 2 for the linear fit of RG marker order to ML marker order ranges from 0.09 to 1.0, with 29 of 47 being 1.0 and 36 exceeding 0.75. These framework maps were used to make consensus maps (see Consensus linkage maps).
The plausibility of marker positions for two ML maps, for all markers and for biparentally segregating framework markers, are presented in supplementary information ( Figure S1 and Figure S2), to illustrate factors influencing map lengths and marker orders. In the first example, LG 1 for family 51 · 35, segregation ratios conform to Mendelian expectations, so selection is not a factor. However, markers segregating from only one parent cause moderately large nearest-neighbor fits (cf. Figure S1A and Figure S1B with Figure S1C) and, at the termini of both parental maps, expansions of map lengths from about 70 cM to more than 200 cM. Still, agreement in the rank order of 66 markers between the RG and ML maps is fairly good, with only a few outliers (Figure 2A). Regression of RG on ML marker order is greatly improved for 32 framework markers ( Figure 2B). In the second example (79 markers grouped for LG 10 of family 2 · 10), severely distorted Mendelian ratios are associated with large jumps in length and regions of very poor fit on ML parental maps ( Figure S2A and Figure S2B). Single-parent segregations also contribute to regions of poor fit and length expansion. Regression of RG on ML marker orders is quite poor for all 79 markers ( Figure 3A). In attempting to construct the framework marker map, 12 markers with highly distorted segregation ratios had to be removed to achieve reasonable congruence of the remaining 24 marker orders on the RG and ML framework maps ( Figure 3B). Still, this reduced ML framework map, though unaffected by selection, has one region of poor fit ( Figure S2C) and is 2.4· as long as the comparable RG map (Table 7).

Differences in recombination between sexes
To assess differences in recombination between males and females, we plot sums of adjacent recombination frequencies-the parameter optimized in ML mapping-for maternal against paternal ML framework linkage maps (Figure 4). For 32 maps, the sum of adjacent recombination frequencies for the female map exceeds that for the male map, while for 14 maps, the opposite is true; however, assuming an intercept of zero, the slope of the regression, is not significantly different than 1.0.

Consensus linkage maps
To build a consensus linkage map for the Pacific oyster, we used individual family ML framework maps, with the exceptions described in Materials and Methods. The final consensus linkage map has 656 markers, including 49 previously mapped microsatellite DNA markers, and spans, after adjustment, 890 cM (Table S7).

DISCUSSION
More than 1100, coding SNPs are placed on a second-generation genetic linkage map for the Pacific oyster Crassostrea gigas, along with 66 microsatellite DNA markers to establish coherence with first-generation maps. On average, the map has a spacing of 1 cM and covers 86% of a genome that is estimated, here, to be~616 cM in total length. This observed map length corresponds well with both a cytological estimate of map length in the eastern oyster Crassostrea virginica (1.2 crossovers per bivalent · 50 cM · 10 chromosomes = 600 cM; Longwell et al. 1967) and with genome-size estimates for the Pacific oyster, from either flow cytometry or assembly of DNA sequences (637 Mb and 559 Mb, respectively; Zhang et al. 2012), if the 1 cM % 1 Mb rule of thumb from human genetics (http://ghr.nlm.nih.gov/glossary=centimorgan) applies to the oyster.
In contrast to what Hubert and Hedgecock (2004) reported, no significant differences in map distances are detected across 46 maleand female-parent maps, suggesting no marked difference in underlying male and female recombination rates in this much larger data set. On the other hand, the distribution of SNPs among parents appears to reflect both the breeding history of the mapping families used (F 2 vs. wild-caught) and an ascertainment bias stemming from the initial identification of SNPs fixed for different alleles in the inbred parent lines of the 51 · 35 F 2 family.
This second-generation map is a substantial improvement over previously published, first-generation maps for this species, which were either low-density maps based on microsatellite DNA markers or on small numbers of SNPs or higher-density maps based on dominant, largely nontransferable AFLP markers. A consensus map based on three families and 100 microsatellite DNA markers (Hubert and Hedgecock 2004;Hubert et al. 2009) had average lengths of 616 cM and 770 cM for male-and female-based maps, respectively, average spacing of 8210 cM and genome coverage of 70-80%. Li and Guo (2004)  EST-derived SNPs typed in three F 2 families, which spanned 1016 cM, had an average spacing of 12.7 cM, and covered 73.6% of the genome. Most first-generation maps are much longer than the second-generation map reported here and, of course, have much lower densities of markers. All of these previous reports note significant distortion of segregation ratios for a substantial proportion of markers, rendering mapping difficult in some cases (Sauvage et al. 2010).

Implications for genome scaffold assembly
Perhaps, the most surprising finding in this study was the frequency with which markers on the same genome scaffold map to different LGs. Nearly 40% of genome scaffolds with two or more SNPs map to different linkage groups, which strongly suggests that a substantial fraction of the genome scaffolds reported by Zhang et al. (2012) are incorrectly assembled. The likelihood of misassembly appears to rise with scaffold length, and additional statistical comparisons of inter-marker distances and numbers of ambiguous bases between adjacent SNPs support this inference. Adjacent SNPs mapping to different LGs are 4.7 times farther apart than adjacent SNPs mapping to the same LG, whereas sequences between adjacent SNPs that do or do not map to the same LG have median numbers of 11 and 25,161 ambiguous bases, respectively. Because the minimum number of breakpoints required to explain the order of mapped SNPs is found in 58 of 60 informative cases, our data suggest that misassembled scaffolds comprise large blocks (contigs or super-contigs, see Figure 1 in Zhang et al. 2012) that map to different linkage groups and are separated by long stretches of ambiguous bases. These large scaffolds should be shattered at potential breakpoints and reassembled using linkage information or long-range DNA sequences. That 50% of the genome assembly is found in the largest 401 scaffolds (N50 size of 401.3 kb; Zhang et al. 2012) will likely prove to be an overestimate. LG, linkage groups; LOD, log of the odds; RG, regression; ML, maximum likelihood. a Grouping LOD is the threshold passed by all markers in at least one test of independence with other markers in the group. For two values marked by an asterisk, the group was formed from smaller groups passing the LOD threshold indicated. b RG indicates default regression mapping settings in JoinMap; "rlx" indicates relaxed linkage thresholds for regression mapping. RGFO uses a fixed order of anchor markers established by ML mapping and agreement of ML and RG marker orders. c MLFO uses a fixed order of anchor markers.
This conclusion about scaffold assemblies contrasts sharply with the fact that 95% of ESTs were successfully mapped to the draft genome assembly (see Table S6 in Zhang et al. 2012). Evidently, at the level of contigs and genes, the assembly of the Pacific oyster genome is quite complete and useful, but at the level of scaffolds, the assembly contains many errors. Because the ordering of contigs and genes within scaffolds may not always be accurate, inferences about the clustering of gene families (e.g., Paps et al. 2015) should be regarded as provisional until the assembly of genome scaffolds is either confirmed or improved. High density linkage maps are presently being constructed, using genotypingby-sequencing methods (Elshire et al. 2011); these maps should provide critical information for improving future assemblies of the Pacific oyster genome, as they have in other species (e.g., Dalloul et al. 2010;Dodgson et al. 2010).

Uncertainties in linkage maps
In the course of this study, three factors emerged as evident causes of discrepancies in map orders and lengths between RG and ML methods and among families. The first, mentioned above and illustrated by two examples ( Figure S1, Figure S2, Figure 2, and Figure 3), is the effect of markers segregating from only one parent, which consistently contribute to inflated ML map lengths and poor agreement between RG and ML maps. Across all 47 linkage groups, the general effect of single-parent markers is reflected in the average improvement in agreement of RG and ML marker orders between maps made with all markers and maps made only from bi-parentally segregating framework markers (mean r 2 for all-marker maps, 0.67; mean r 2 for framework maps, 0.85; paired sample t = 3.17, P = 0.0014). For this reason, only framework maps are merged into consensus linkage maps.
The second factor causing uncertainty in map construction is distortion of Mendelian inheritance ratios, owing to strong viability selection, as documented in several previous studies of oysters (Bierne et al. 1998;Launey and Hedgecock 2001;Plough and Hedgecock 2011;Plough 2012). In this study, 45 of 50 linkage groups show evidence of viability selection (Table 2), yet the effect of viability selection on map construction is not a general one. Across all linkage groups, we could find no significant relationship between the proportions of markers distorted and the lengths of maps, the fit of maps to data, or the agreement of RL and ML marker orders. Furthermore, counterexamples to a general effect of viability selection on linkage mapping are evident-linkage groups with no evidence of selection but with uncertainties in map orders and lengths (e.g., Figure S1 and Figure 2) or linkage groups with most or even all markers showing distorted segregation ratios but having well-behaved maps (LG 3, LG 7, and LG 10 for family 51 · 35, Table 6 and Table 7). On the other hand, distorted markers do often contribute to poor fit and extended map lengths (Figure S2 and Figure 3). This is especially the case, if markers segregating from one or the other parent fall into regions of segregation distortion.
Mode of selection appears to make a difference, because massively distorted but well-behaved LGs occur only in the F 2 families, in which associative overdominance-the apparent advantage of heterozygotes, owing to selection against recessive deleterious mutations linked to alternative homozygotes-is a consequence of their breeding history, as is the excess of hk·hk mating types in these families (Table 1; Plough and Hedgecock 2011). For example, on the ML framework map for LG 10 in family 51 · 35, all 42 markers are highly significantly distorted, but nearest-neighbor fit averages only 0.085 cM and the RG and ML maps are highly correlated (r 2 = 0.93; Table 7). The relative fitness of heterozygotes compared to the most abundant homozygotes across the ML map for this linkage group averages 2.03, reaching a peak at 10.3, where genotypic proportions for SNP 1385-A193329G are zero hh, 103 hk, and 5 kk. In this case, the symmetry of the selection on parental types appears not to affect the fit of observed and calculated recombination frequencies and marker orders. In contrast, a framework ML map of 36 markers for the same linkage group, LG 10, in family 2 · 10, has two blocks of distorted markers (12 markers altogether, which had to be removed to produce the framework map recorded in Table 7 and Figure 3), but nearest-neighbor fit averages 281 cM and orders of markers on the RG and ML maps are poorly correlated (r 2 = 0.61). The average relative fitness of heterozygotes compared to the most abundant homozygotes at each marker across this ML map is 0.75, reaching a nadir of~0.3, where one homozygote disappears entirely. Here, the gross asymmetry and partial dominance of viability selection is clearly associated with inflation of ML map distances and incongruence of RG and ML marker orders. Thus, the number, location, and phase of loci under viability selection produce idiosyncratic effects on map lengths and marker orders. Detailed analyses of the location and effects of viability mutations in the G 0 families are reported elsewhere (L. V. Plough, G. Shin, and D. Hedgecock, unpublished data).
Distortion of Mendelian segregation ratios is commonly observed in mapping studies, especially in plants and marine molluscs, although the phenomenon may generally be underreported, because distorted markers often are discarded from data sets before linkage analysis. Distortion of segregation ratios can arise in interspecific or intersubspecific crosses, because of Dobzhansky-Muller incompatibilities (e.g., Fishman et al. 2001;Wang et al. 2005;Foley et al. 2013), but our focus, here, is on distortions of segregation ratios arising from genotype-dependent mortality of progeny from intraspecific crosses of Pacific oysters (Launey and Hedgecock 2001;Plough and Hedgecock 2011). Viability loci causing early mortality in oysters appear to act independently of one another, as no statistically significant epistatic interactions are detected (Plough and Hedgecock 2011;Plough 2012;L.V. Plough, G. Shin, and D. Hedgecock, unpublished data).
Advice on what to do about segregation distortion in linkage mapping studies ranges from discard distorted markers, to exercise great caution in interpreting results from commonly used computer packages (Liu 1998), to include such markers, so as to retain genetic and potentially biologically important information (Shah et al. 2014;Truong et al. 2014). Theoretical studies on how segregation distortion can bias estimates of recombination date back to Bailey (1949) and Allard and Alder (1960) but yield few generalizations regarding the practice of linkage mapping, particularly now, when dense linkage maps are enabled by high-throughput DNA sequencing. Estimates of recombination are affected by the form of selection (allelic vs. zygotic), by the dominance of viability genes, and by the dominance of markers (Lorieux et al. 1995;Liu 1998), which suggests that impacts of segregation distortion are likely to vary within and among linkage maps, owing to variation in the underlying modes of selection. Hackett and Broadfoot (2003) concluded that segregation distortion does not substantially impact map length or marker order, but they simulated only one distorted marker per linkage group, in a doubled-haploid mapping population, which may underestimate the impact of segregation distortion in other mapping populations or species with large numbers of viability loci, such as the oyster. Marker density also appears to be a factor, for example, marker orders in low-density, first-generation linkage maps for the Pacific oyster were reasonably consistent across families and studies (Plough and Hedgecock 2011;Plough 2012;Guo et al. 2012), despite widespread distortion of segregation ratios.
Efficient estimators of maps that take into account distortion or even simultaneously estimate fitness parameters are either computationally burdensome for dense maps or, when made computationally efficient, have so far been developed only for mapping populations common in plant genetics, i.e., F 2 , doubled-haploid, BC, RIL, and multiparent advanced generation intercross (i.e., MAGIC) populations (Wu et al. 2008;Lorieux 2012;Shah et al. 2014). In this study, we have left distorted markers in all analyses, discarding them only from the minority of framework maps in which they prevent a reasonable fit between observed and expected recombination rates or good agreement of marker orders from the regression and maximum likelihood mapping methods. For future linkage or QTL mapping studies with oysters, doubts about marker orders and distances, owing to selection and segregation distortion, can largely be eliminated by taking samples for genotyping in the mid to late larval stages (in addition to the life stage or time-point of interest), since much of the selective mortality occurs later, during metamorphosis (Plough and Hedgecock 2011).  A third factor likely contributing to discrepancies in map order and lengths is genotyping error (Hackett and Broadfoot 2003). Microsatellite genotypes for families F12, F20, and F45 were scored in duplicate reactions for at least 5% of individuals and an average of 17 markers, in each family (L. V. Plough, G. Shin, and D. Hedgecock, unpublished data), revealing an average error rate of 2.5%. Such an error rate has a small effect on map accuracy, when marker densities are low (~10 cM), as in previous linkage maps for the Pacific oyster based on microsatellite DNA markers (Plough and Hedgecock 2011;Plough 2012;L.V. Plough, G. Shin, and D. Hedgecock, unpublished data), but can impair the ordering of markers, when the density of markers is 1-2 per cM, as in this study (Hackett and Broadfoot 2003;Ball et al. 2010). That microsatellite markers frequently have poor nearest-neighbor fits is most likely explained by their relatively large genotyping error in the context of dense, second-generation SNP maps. The genotyping error of SNP markers, on the other hand, was extremely low (zero observed in 12,937 trials). Still, an observed frequency of zero is compatible with an error rate as high as 3 in 12,937, which yields a 5% chance of observing no errors in 12,937 trials of a binomial distribution. Given this potential error rate and the large number of genotypes determined in this study, nearly 172 thousand, the total number of genotyping errors could have ranged from 19 to 61 (99.9% confidence limits), with an average of 40 errors, and therefore could have contributed to otherwise unexplained problems in mapping.
A second-generation consensus linkage map for the Pacific oyster The consensus linkage map made with MergeMap (Wu et al. 2011) appears too long (890 cM), compared to the estimated length of the genetic map (616 cM) and the size of the genome (637 Mb) estimated by flow cytometry (Zhang et al. 2012). Still, this map likely represent the best consensus ordering of markers and is presented here for future reference (Table S7).