Modern technologies and algorithms for scaffolding assembled genomes

The computational reconstruction of genome sequences from shotgun sequencing data has been greatly simplified by the advent of sequencing technologies that generate long reads. In the case of relatively small genomes (e.g., bacterial or viral), complete genome sequences can frequently be reconstructed computationally without the need for further experiments. However, large and complex genomes, such as those of most animals and plants, continue to pose significant challenges. In such genomes, assembly software produces incomplete and fragmented reconstructions that require additional experimentally derived information and manual intervention in order to reconstruct individual chromosome arms. Recent technologies originally designed to capture chromatin structure have been shown to effectively complement sequencing data, leading to much more contiguous reconstructions of genomes than previously possible. Here, we survey these technologies and the algorithms used to assemble and analyze large eukaryotic genomes, placed within the historical context of genome scaffolding technologies that have been in existence since the dawn of the genomic era.


PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1006994 | June 5, 2019

Background
The increased availability and lower cost of DNA sequencing have revolutionized biomedical research. Thousands of humans have been sequenced to date, and genome sequencing is increasingly used in clinical practice, particularly in the context of cancer [1,2]. Despite the long length of sequences generated by third-generation sequencing technologies (tens of thousands of base pairs), the automated reconstruction of entire genomes continues to be a formidable computational task, in no small part because of genomic repeats, which are ubiquitous features of eukaryotic genomes [3]. Recently, new genomic technologies have been developed that can "bridge" across repeats or other genomic regions that are difficult to sequence or assemble. We refer here to technologies originally developed as tools for interrogating the structure of genomes by cross-linking adjacent genomic segments and capturing these adjacencies through sequencing. These technologies are increasingly used to help improve genome assemblies by "scaffolding" together large segments of the genome. We survey here recent advances in this field, placed within the context of the technologies and algorithms that have been used for scaffolding throughout the entire genomic revolution. Note that our primary focus is on complex eukaryotic genomes. We conclude with a survey of recent projects that demonstrate the effective combination of sequencing and scaffolding technologies to generate high-quality genome reconstructions.

Sources of information for genome scaffolding
Broadly speaking, any type of information that hints at the relative location of genomic segments along a chromosome can be used to drive the scaffolding process. In most cases, the information used derives from genomic technologies specifically designed to interrogate the structure of chromosomes, though indirect inferences based on evolutionary arguments have also been used effectively in genome scaffolding. Fig 2 shows the extent to which different sequencing technologies can provide the linkage information for scaffolding. This linkage information can span anywhere from several hundreds to tens of thousands of base pairs (Illumina, Pacific Biosciences, and Oxford Nanopore) to hundreds of thousands of base pairs (linked reads and optical maps) to millions of base pairs (Chicago and Hi-C). The genomic range spanned by the linking information is directly tied to the effectiveness of a particular technology to resolve certain classes of repeats [19]: to be effective, links must be longer than the length of a repeat but short enough so that they do not span multiple repeat units (which would increase the computational complexity of the scaffolding or repeat resolution process). Given the broad range of the lengths of repeats in most organisms, best results are usually obtained from a mixture of technologies or from technologies that yield data spanning a broad range of distances (such as third-generation sequencing reads, optical maps, or Hi-C links). Sequencing reads can be viewed as providing linking information that spans any distance within the length of the reads; as such, they provide valuable information for repeats of a broad range of sizes, up to the length of the reads.
Fig 2. Reads and optical maps derived from the NA12878 sample (DNA from a human individual sequenced as part of the 1000 Genomes Project) were mapped to the GRCh38 human genome reference. The histograms represent the following: Illumina: the separation between natively generated paired-end reads (SRX1049855); PacBio: the length of the reads generated by the Pacific Biosciences technology (SRX1607993); Oxford Nanopore: the length of the reads generated by the Oxford Nanopore technology (https://github.com/nanopore-wgs-consortium/NA12878); optical maps: the length of the fragments mapped by the BioNano nanocoding technology (from BioNano website); linked reads: the span of the region covered by reads originating from the same DNA fragment, as generated by the 10X Genomics technology (SRX1392293); Chicago: the separation between read pairs generated by the Chicago chromosome conformation capture protocol (SRX1423027); and Hi-C: the separation between read pairs generated by the Hi-C chromosome conformation capture protocol (SRX3651893). https://doi.org/10.1371/journal.pcbi.1006994.g002
We structure our presentation around the key types of information that can be used to organize genomic contigs into chromosome-wide scaffolds (Table 1).

Physical mapping
Physical-mapping technologies attempt to estimate the location of specific loci along genomic chromosomes. The loci can be short DNA segments that are unique within the genome, as in the case of sequence-tagged sites (STSs) [20], or the recognition sequence of a restriction enzyme, as in the case of restriction mapping and optical mapping. The approximate location of the markers along chromosomes can be identified through a number of techniques, from fluorescence in situ hybridization (FISH) [21] to the analysis of the random breakage of DNA exposed to X-rays (radiation hybrid mapping [22]) to direct measurement of restriction fragment sizes, as performed in restriction mapping [23]. The original use of restriction enzymes to map chromosomes resulted in unordered information (simply the list of sizes of the restriction fragments generated from the molecule), which had to be converted into an ordered map through a complex computational process. Optical maps are an enhancement of restriction mapping that provides the order of the fragments in addition to their sizes, based on imaging the fluorescence of DNA molecules immobilized on a glass slide [24] or within a nanochannel [25] (the latter technology is called nanocoding).
Physical-mapping data are among the earliest sources of information used to order genomic contigs along a chromosome [26-28]. The computational approach used to perform this task simply involves comparing experimentally derived maps to theoretical (in silico) maps generated from the sequenced contigs (Fig 3). This process is easiest when the landmarks being compared are distinguishable from each other (as is the case for STS and radiation hybrid maps) and substantially more complex and error prone for restriction maps, in which all the landmarks are identical in sequence, e.g., in the context of optical mapping [29].
The experimental maps themselves are often the result of assembling a collection of DNA fragments (clones) that have been mapped separately, leading to similar analytical challenges as those encountered in genome assembly [30]. In the case of unordered restriction maps, the assembly process is guided by the probability that two clones overlap, which is computed by taking into account the number of restriction fragments shared by the clones [31]. The pairwise overlap probabilities are used to assemble the clones into a chromosome-wide structure using a heuristic assembly algorithm (fingerprinted contigs [FPC]), which also allows for manual intervention to inspect and correct the resulting layout. Ordered restriction maps, as generated by optical or nanocoding mapping, can be aligned using variants of dynamic programming alignment algorithms [32,33]. In SOMA [34], fragment-sizing errors are penalized through a chi-squared scoring function, and a variant of a scheduling algorithm is used to determine the layout of contigs with ambiguous mappings. The runtime of the alignment algorithm used in SOMA scales with the fourth power of the number of restriction fragments, making the approach impractical for large genomes. Two recent approaches address this limitation. TWIN [35] relies on an extension of the FM-index [36] to speed up the alignment process, whereas Maligner [37] indexes the reference map by simulating the effect of common mapping errors such as false cuts or missed restriction sites.
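The comparison between an experimental map and an in silico map can be illustrated with a minimal Python sketch. It is not the algorithm of SOMA, TWIN, or Maligner: the function names, the ungapped sliding comparison, and the assumption that the sizing error has a standard deviation proportional to fragment length (`rel_sd`) are all hypothetical simplifications chosen for illustration.

```python
def in_silico_digest(sequence, site):
    """Fragment lengths obtained by cutting `sequence` at the start of every
    occurrence of the recognition sequence `site`."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments (e.g., a site at position 0).
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def sizing_chi_square(observed, expected, rel_sd=0.05):
    """Chi-squared style penalty for fragment-sizing errors.
    ASSUMPTION: the standard deviation of a measured fragment is
    rel_sd * expected size; real error models differ."""
    if len(observed) != len(expected):
        raise ValueError("maps must contain the same number of fragments")
    return sum(((o - e) / (rel_sd * e)) ** 2 for o, e in zip(observed, expected))


def best_ungapped_placement(optical_map, contig_map, rel_sd=0.05):
    """Slide the contig's in silico map along the optical map and return
    (offset, score) of the lowest-penalty placement, or None if it does not fit.
    Missed or false cuts are NOT modeled here; practical aligners handle them
    with dynamic programming."""
    best = None
    for offset in range(len(optical_map) - len(contig_map) + 1):
        window = optical_map[offset:offset + len(contig_map)]
        score = sizing_chi_square(window, contig_map, rel_sd)
        if best is None or score < best[1]:
            best = (offset, score)
    return best
```

For example, digesting a contig with the EcoRI site `GAATTC` yields its in silico fragment sizes, and `best_ungapped_placement` then reports where along the optical map those sizes fit best. In practice the partial fragments at the two ends of a contig would be excluded from the comparison.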

Subcloning
Subcloning involves breaking up the genome into large fragments that are then sequenced separately, retaining the connection between the sequencing reads generated from the same fragment (we refer to them as "linked reads" subsequently). The assembly process can then be run for each fragment separately, and the resulting assemblies can be merged together to reconstruct the full genome sequence. Initially, subcloning relied on bacterial artificial chromosomes (BACs) grown in Escherichia coli, yielding fragments in the range of ~100 kbp in length. The two ends of each BAC clone were sequenced first in order to construct a clone map [38] representing the relative relationship between individual BACs along the genome. From this information, a minimal tiling path was identified in order to decide which fragments would be fully sequenced. This strategy was used effectively in the early days of genomics, most notably during the public effort to sequence the human genome [39].
Recently, new technologies have been developed that perform the subcloning process in vitro. The technology from 10x Genomics partitions large DNA fragments into droplets, within which the DNA is sheared and sequencing libraries are constructed. The DNA within each droplet is tagged with a droplet-specific barcode. These barcoded DNA molecules are then sequenced, and a postprocessing algorithm parses the barcodes to group the reads originating from the same large DNA fragment [40]. The Illumina TruSeq Synthetic Long Read (TSLR) technology [41] is based on fragmenting DNA into large segments of about 10 kb in size, which are distributed into pools such that each pool contains a relatively small number of fragments (~200-300). Each pool is processed separately and barcoded with a unique barcode prior to sequencing.
When the original fragments have been sequenced deeply enough (which is usually the case for the TSLR technology and most applications of BACs), the pooled reads can be assembled together in order to create complete reconstructions of the individual fragments, effectively generating long and highly accurate sequencing reads. Several approaches have also been developed that rely on the unassembled pooled reads to guide the scaffolding process, approaches that can be effective even at low depths of sequencing coverage. Fragscaff [42], a method that was originally designed for contiguity-preserving transposition sequencing data, creates links between the ends of contigs that are represented in the set of reads from the same pool. Within the resulting graph, Fragscaff then identifies a minimum spanning tree by using as edge weights the number of pools shared between contigs. The longest path within this tree is selected as the scaffold backbone. ARCS [43] maps the linked reads to the assembled contigs and constructs intercontig links by identifying pairs of contigs whose ends share sequences from the same read pool. Scaffolding is then performed with the LINKS scaffolder [44], a tool originally developed for scaffolding assemblies with the help of long-read data. A similar approach is used by ARKS [45], a tool that relies on k-mer matches instead of sequence alignment to infer the assignment of the linked reads to assembled contigs. In Supernova, Weisenfeld and colleagues [46] relied on pool-specific barcodes to construct an adjacency graph in which the nodes are the initial set of contigs or scaffolds and edges denote the number of pool-specific barcodes shared between scaffolds. In this graph, Supernova finds linear paths consistent with the links provided by barcodes and selects the highest-scoring path as the backbone of the final scaffold. Linked reads have also been used to identify errors in the assembly. Tigmint [47] flags as misassemblies regions of contigs where the depth of coverage by read pools (as inferred from the mapping of reads to assemblies) is lower than expected.
[...] [55] are effective in reconstructing genomic contigs from long-read data, to achieve high-quality assemblies with only long-read data, the genome needs to be sequenced at considerably high coverage, incurring significant costs. A more cost-effective strategy involves supplementing a short-read assembly with a relatively low-coverage set of long-read data, effectively representing a collection of sparsely sampled genomic subclones. SSPACE-LongRead [56] was one of the earliest methods able to leverage long reads for scaffolding. This approach used BLASR [57] (an aligner tuned for the high error rates of third-generation sequencing technologies) to align contigs to the long reads in order to infer orientation, ordering, and the distance between contigs. SMSC [58] and BIGMAC [59] both start by using the long-read data to identify potential errors in the assembly, break the contigs at the boundaries of these errors, and then scaffold the resulting data using linking information inferred from the long reads. Because the alignment of long-read data is computationally intensive, LINKS [44] proposes an alignment-free approach that extracts pairs of k-mers separated by a predefined distance within long reads and then treats these as if they were paired short reads, relying on traditional scaffolding approaches. Unicycler [60] operates directly on the assembly graph generated by the SPAdes assembler [61] from short reads, using the long-read data to disambiguate paths through the graphs and generate longer, more accurate contigs. npScarf [62] leverages the real-time generation of data by nanopore sequencing devices to iteratively develop and improve the scaffolding of a genome as more data become available. The scaffolding algorithm operates in a greedy fashion, linking contigs together as soon as sufficient support is available in the set of reads and breaking prior links if new ones that contradict them have stronger support. This iterative greedy process can be stopped by the user once a sufficiently good assembly is generated, allowing the dynamic selection of the depth of sequencing depending on the actual quality of the resulting reconstruction.
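The barcode-sharing idea behind the linked-read scaffolders described earlier in this section (links between contig ends that are covered by reads from the same pool) can be sketched as follows. This is an illustration only, not the implementation of ARCS, Fragscaff, or Supernova; the input layout, the `end_size` window, and the `min_shared` threshold are hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def barcode_links(read_hits, contig_lengths, end_size=1000, min_shared=2):
    """Count barcodes shared between contig ends.

    read_hits: iterable of (barcode, contig, position) for mapped linked reads.
    Returns {((contig_a, end_a), (contig_b, end_b)): shared_barcode_count}.
    ASSUMPTION: a read supports a contig end if it maps within `end_size`
    bases of that end; thresholds are illustrative.
    """
    barcodes_at_end = defaultdict(set)
    for barcode, contig, pos in read_hits:
        if pos < end_size:
            barcodes_at_end[(contig, "head")].add(barcode)
        if pos >= contig_lengths[contig] - end_size:
            barcodes_at_end[(contig, "tail")].add(barcode)

    links = {}
    for a, b in combinations(sorted(barcodes_at_end), 2):
        if a[0] == b[0]:
            continue  # ignore links between the two ends of one contig
        shared = len(barcodes_at_end[a] & barcodes_at_end[b])
        if shared >= min_shared:
            links[(a, b)] = shared
    return links
```

The resulting weighted links form the graph on which a scaffolder would then search for a backbone path; which ends are linked also hints at the relative orientation of the two contigs.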

Paired-read technologies
By far, the most common source of information for scaffolding is technologies that yield information about the relative placement of pairs of reads along the genome being sequenced. Most commonly, this information is derived by carefully controlling DNA shearing prior to sequencing in order to obtain fragments of uniform sizes and by tracking the link between DNA sequences "read" from the same fragment. Multiple protocols have been developed to generate read-pairing information, and different names are commonly used to reflect the experimental source: paired-end reads (pairings natively generated by Illumina sequencing instruments, usually short range, ~300-500 bp) and mate-pair or jumping libraries (pairing information derived with the help of additional experimental assays, usually spanning thousands to tens of thousands of base pairs). Here, we use these terms interchangeably, as the information being generated is the same: pairs of reads with an approximately known relative distance and orientation. The information provided by mate pairs can be used to link the contigs and produce scaffolds [63] or to guide the assembly process itself, allowing the effective resolution of repeats [19,64].
The algorithms for using mate-pair data in scaffolding genomes all follow a similar workflow. First, mate pairs whose ends map to different contigs are used to link together the corresponding contigs. Second, the pairwise linkage information is used to orient and order contigs with respect to each other. Third, the size of the gap between adjacent contigs is estimated from the experimentally determined size of the mate pairs, and a linear layout of the contigs along a scaffold is generated (Fig 4). Because contig orientation and ordering are computationally hard problems [7], scaffolders implement different greedy heuristics. Scaffolders such as MIP [65], SOPRA [66], and SCARPA [67] use integer programming to find the optimal orientation and ordering of contigs. Bambus [68] and SSPACE [69] use multiple libraries in a hierarchical manner to perform scaffolding, starting from libraries with smaller insert sizes (which are more accurate and yield a simpler problem) and progressively expanding scaffolds using libraries with larger insert sizes. OPERA-LG [70] uses a branch and bound search to determine the relative placement of contigs along the chromosome. The authors show that the size of the search space is bounded by the ratio between the library and contig size, implying that the branch and bound heuristic is efficient for the data typically encountered in practical applications despite a theoretically exponential complexity. The scaffolder inGAP-sf [71] merges information from both the assembly graph and paired-read data to construct scaffolds. In addition, this scaffolder introduces a statistical model for estimating the support for a link between two contigs, information that is used in constructing the scaffold. A number of tools have also been developed that use RNA sequencing (RNA-seq) data for scaffolding. 
Because of the long lengths of eukaryotic introns, such approaches can yield long-range genomic connections using standard short-length paired-end sequencing protocols, with the caveat that scaffolding is only effective in genic regions. Tools developed specifically for such data include RNAPATH [72], L_RNA [73], Rascaf [74], AGOUTI [75], and P_RNA [76].
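The first and third steps of the mate-pair workflow described above (linking contigs and estimating gap sizes) reduce to simple arithmetic once orientation is fixed. The sketch below assumes a forward-reverse library with a known mean insert size and assumes both contigs are already in forward orientation with the first contig to the left of the second; real scaffolders must consider all orientation combinations and the variance of the library. The function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_gaps(pairs, contig_lengths, insert_size, min_links=2):
    """Estimate the gap between linked contigs from mate pairs.

    pairs: iterable of (contig_a, start_a, contig_b, end_b), where read 1 maps
    forward on contig_a starting at start_a and read 2 maps in reverse on
    contig_b ending at end_b (exclusive coordinate).
    ASSUMPTION: insert_size is the outer distance between the two reads, so
        insert_size = (len(contig_a) - start_a) + gap + end_b.
    Returns {(contig_a, contig_b): (number_of_links, mean_gap)}.
    """
    per_link = defaultdict(list)
    for contig_a, start_a, contig_b, end_b in pairs:
        if contig_a == contig_b:
            continue  # pair falls within a single contig: no linking information
        used_on_a = contig_lengths[contig_a] - start_a
        per_link[(contig_a, contig_b)].append(insert_size - used_on_a - end_b)
    return {link: (len(gaps), mean(gaps))
            for link, gaps in per_link.items() if len(gaps) >= min_links}
```

A negative mean gap would suggest that the two contigs overlap, which is one reason practical tools treat these estimates as soft constraints.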
In the context of repeat resolution, the orientation and distance constraints imposed by paired reads limit the number of possible traversals of the graph through a repeat region and can link together the unique genomic regions surrounding each instance of a repeat. Assemblers such as Velvet [77], ABySS [78], ABySS 2.0 [79], and IDBA-UD [80] use paired-end information to guide the walk through the assembly graph. SPAdes [81] and metaSPAdes [82] use the ratio of expected to observed numbers of mate pairs connecting two nodes [83] in the de Bruijn graph to check if the path traverses through a repetitive region. Wetzel and colleagues [19] explored the extent to which mate pairs can be used to resolve repetitive regions in prokaryotic genomes and showed that mate-pair libraries are most effective if tuned to the structure of the assembly graph.

Chromosomal contact data
A special type of paired-read data is generated by techniques recently developed to study the three-dimensional structure of chromosomes inside a cell [84]. These techniques, collectively referred to as chromosome conformation capture (3C), generate pairwise linking information between reads that originate from genomic regions that are physically adjacent in a cell. Unlike mate-pair data, the distance and the relative orientation between the paired reads are not known a priori.
The two most commonly used protocols for capturing chromosome conformation are Hi-C [84] and Chicago [85]. In the Hi-C protocol, DNA in the cell nucleus is cross-linked and cut with a restriction enzyme. This process generates fragments of DNA that are distally located but physically associated with each other. The sticky ends of these fragments are biotinylated and then ligated to form a chimeric circle. The resulting circles are sheared and processed into sequencing libraries in which individual templates are chimeras of the physically associated DNA molecules. The Chicago protocol from Dovetail Genomics starts not with cells but with purified DNA so that any biologically associated interactions are eliminated. Artificial nucleosomes with random specificity are then used to condense the DNA into chromatin, which is then processed through the standard Hi-C protocol. The result is a collection of fragments that is enriched for sets of paired reads that capture long-range interactions between segments of DNA that were in contact within the artificial chromatin.
Because Hi-C and Chicago protocols do not provide estimates of the distance between the paired reads, the data can only be used to estimate the relative order and orientation of contigs and not the size of the gaps separating them. The scaffolding process starts by filtering the data to eliminate artifacts such as reads aligning to multiple locations or chimeric reads derived from the ligation junctions. Several tools have been developed for this purpose, including HiCUP [86], HiC-Pro [87], Juicer [88], Juicebox [89], and HiFive [90]. These tools align reads to the assembly using standard alignment programs [91,92] and filter the alignments to remove experimental artifacts, yielding the "true" alignments, which imply the contact information. The number of paired reads linking two genomic regions (contact frequency) strongly correlates with the one-dimensional distance between the corresponding regions, thereby yielding an estimate of the relative placement of these segments within a genome. Furthermore, the contact frequency is much higher within a chromosome than across chromosomes, making it possible to infer chromosome structure directly from the genome assembly. Most of the algorithms developed to use Hi-C data for scaffolding use these properties to group contigs into chromosome-specific bins and then orient and order the contigs within each chromosome by maximizing the concordance with the experimentally derived contact frequencies. A major confounding factor in using Hi-C data for scaffolding is the nonrandom association between topological domains [93]. DNA-to-DNA interactions within the nucleus are organized in a domain structure in which interactions are much stronger within a domain than across domains. As a result, the Hi-C contact patterns exhibit a modular structure that can confound the estimate of distance between contigs during the scaffolding process.
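As a concrete illustration of the contact frequencies discussed above, the following sketch counts filtered Hi-C read pairs that link different contigs and normalizes the counts by contig length. The normalization (contacts per megabase squared) is an assumption made for illustration; published tools use different schemes, for example based on the number of restriction sites per contig.

```python
from collections import Counter


def contact_frequencies(pairs, contig_lengths):
    """Length-normalized contact frequency between pairs of contigs.

    pairs: iterable of (contig_of_read1, contig_of_read2) taken from filtered
    Hi-C alignments. Pairs within one contig are ignored.
    ASSUMPTION: counts are divided by the product of the two contig lengths
    (in Mb) so that long contigs do not dominate purely because of their size.
    """
    counts = Counter()
    for c1, c2 in pairs:
        if c1 != c2:
            counts[tuple(sorted((c1, c2)))] += 1
    return {
        (a, b): n / ((contig_lengths[a] / 1e6) * (contig_lengths[b] / 1e6))
        for (a, b), n in counts.items()
    }
```

Contig pairs with high normalized frequency are candidates for belonging to the same chromosome and for lying close together along it.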
DNATri [94] and LACHESIS [95] were the earliest methods developed to use Hi-C datasets for scaffolding. DNATri relies on a limited-memory Broyden-Fletcher-Goldfarb-Shanno optimization algorithm to identify the placement of contigs that best matches the contact frequencies derived from the Hi-C data. LACHESIS first clusters the contigs into chromosome groups using hierarchical clustering, matching a user-specified number of chromosomes. Then, it orders and orients contigs in each chromosome group/cluster separately by formulating the problem as identifying the "trunk" of a minimum spanning tree of the graph that encodes the Hi-C links between contigs. GRAAL [96] models the Hi-C data by distinguishing between cis-contacts (occurring within the same molecule) and trans-contacts (occurring across molecules). The contact frequencies for the former are distance dependent, whereas those for the latter are drawn from a uniform probability distribution. The contigs are ordered and oriented to maximize the fit with this modeled data using a Metropolis optimization algorithm [97]. SALSA [98] relies on Hi-C data to correct misassemblies in the input contigs and then orients and orders the contigs using a maximal matching algorithm [98]. 3D-DNA [99] also corrects the errors in the input assembly and then iteratively orients and orders uniquely assembled contigs (unitigs) into a single megascaffold. This megascaffold is then broken into a user-specified number of chromosomes, identifying chromosomal ends based on the Hi-C contact map. Putnam and colleagues [85] proposed a method called Hi-Rise that was specifically designed for handling Chicago libraries (based on artificial chromatin). They rely on a likelihood function that matches the characteristics of these data and use dynamic programming to identify a layout of contigs that maximizes the fit with the experimental data. Recently, Zhang and colleagues developed ALLHIC [100], an approach for scaffolding polyploid genomes using Hi-C data. This approach relies upon the LACHESIS algorithm applied to Hi-C data that have been filtered to remove contacts that connect across haplotypes, thereby yielding haplotype-specific scaffolds.
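The spanning-tree "trunk" formulation mentioned above can be sketched in a few lines. This is a simplified illustration rather than the LACHESIS implementation: it builds a maximum spanning tree on raw contact counts (equivalent to a minimum spanning tree on their reciprocals), takes the trunk to be the path containing the most contigs, ignores orientation, and assumes that all contigs belong to one connected chromosome group.

```python
from collections import defaultdict, deque


def spanning_tree_trunk(contacts):
    """Order contigs along the longest path of a maximum spanning tree.

    contacts: {(contig_a, contig_b): contact_count}.
    ASSUMPTIONS: stronger contacts mean closer contigs; the trunk is the tree
    path with the most contigs; the contact graph is connected.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Kruskal's algorithm, strongest contacts first.
    tree = defaultdict(list)
    for (a, b), _ in sorted(contacts.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree[a].append(b)
            tree[b].append(a)

    def farthest_path(start):
        """Breadth-first search; returns the path from the farthest node to start."""
        prev = {start: None}
        queue = deque([start])
        last = start
        while queue:
            last = queue.popleft()
            for nxt in tree[last]:
                if nxt not in prev:
                    prev[nxt] = last
                    queue.append(nxt)
        path, node = [], last
        while node is not None:
            path.append(node)
            node = prev[node]
        return path

    if not tree:
        return []
    one_end = farthest_path(next(iter(tree)))[0]
    return farthest_path(one_end)  # two BFS passes give the tree's longest path
```

Contigs that hang off the trunk as side branches are not placed by this sketch; a practical method needs a further step to reinsert them.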

Practical considerations
The scaffolds generated with the help of the data described previously simply organize contigs along a genome without specifying the actual DNA sequence represented within the gap between adjacent contigs. Once the relative location of contigs is known, however, it is frequently easy to reconstruct the sequence within the gaps, a process that is commonly referred to as gap filling. Most commonly, mate-pair information is used to identify which reads could be placed within a gap, and then those reads are assembled to fill in the sequence within the gap, extending or even joining the adjacent contigs. Variants of this process are included in virtually all genome assemblers, e.g., ABySS [78], ALLPATHS-LG [101], and EULER [102], and several stand-alone solutions were also developed: GapFiller [103], SOAPdenovo GapCloser [104], and Sealer [105]. The latter approach relies on Bloom filters [106] to reduce memory usage, thereby enabling gap filling in large draft genomes. When long reads are available, gap filling can be performed with the help of reads that could not previously be incorporated in the assembly. The relatively higher quality of the contig sequences allows gap filling software to identify alignments that were missed during the assembly process. This principle is used by PBJelly [107] and GMCloser [108], approaches specifically developed for Pacific Biosciences data. GMCloser relies on a likelihood ratio test to determine the quality of alignments and remove poor-quality alignments that could lead to misassemblies.
Each of the data types used for scaffolding contains errors and has specific biases. Incorrect insert size estimates in mate-pair data can lead to ordering and gap estimation errors in scaffolds [109]. Hi-C data cannot provide accurate orientation information at small genomic distances, yielding small inversions within the scaffolds [110]. Optical mapping data have fairly low resolution and contain many errors, including incorrect estimates of fragment sizes and missed or spurious restriction cuts [111]. To reduce the impact of such errors on the ultimate quality of the assemblies, most studies rely on a combination of data sources. Long-read Pacific Biosciences data scaffolded with the help of nanocoding maps from BioNano Genomics were used by Pendleton and colleagues [112] to assemble a human genome and by Du and colleagues [113] to reconstruct the indica rice genome. Short-read data (Illumina) combined with 10x Genomics linked-read data and with optical maps were used by Mostovoy and colleagues [40] to generate a high-quality, haplotype-phased de novo assembly of a human genome. A combination of Pacific Biosciences, BioNano Genomics nanocoding maps, and Hi-C data was used to reconstruct the genome of the domestic goat [110]. Pacific Biosciences, BioNano Genomics nanocoding maps, and 10x Genomics linked-read data were effective in reconstructing a haplotype-phased version of the genome of a human individual [114]. A complex mixture of technologies, including BAC-based sequencing through short-read Illumina data, genetic map information, Hi-C data, and nanocoding maps, was used by Mascher and colleagues [115] to reconstruct the barley genome. Most recently, Jiao and colleagues [116] combined BioNano Genomics nanocoding maps, Hi-C data, and Pacific Biosciences long-read data to produce high-quality reconstructions of the genomes of three relatives of Arabidopsis thaliana. Recently, Moll and colleagues [117] used the model legume Medicago truncatula as a basis for critically evaluating the impact of different technologies and their combinations on the quality of the resulting genome reconstruction. Their analysis revealed the effectiveness of Dovetail Chicago data followed by the use of BioNano Genomics nanocoding mapping information in improving the original assembly generated from Pacific Biosciences reads.

Leveraging synteny: A missed opportunity
Synteny refers to the colocalization of genes or genomic loci along a chromosome. Although the DNA sequence itself may diverge significantly during evolution, related organisms often preserve synteny and gene order. The conservation of synteny can thus be used to help order contigs along a chromosome by inferring their placement based on the location within a related genome of the orthologs of the genes found in the contigs. Despite the rapid increase of complete and draft genomes in public databases, the use of synteny information in genome reconstruction has not been widely adopted. Here, we provide a brief overview of the approaches that have been developed in this field in hopes of spurring increased use of these data in genome projects.
Synteny-based methods first map contigs onto the reference genomes using a whole-genome aligner [118,119]. The orientation and ordering of contigs are then inferred from the alignment data (Fig 3). Methods such as OSLay [120], ABACAS [121], Mauve Aligner [122], fillScaffolds [123], r2cat [124], and CAR [125] use only one reference genome for scaffolding draft assemblies. They are primarily based on mapping the assembly to a complete or incomplete reference genome and attempt to identify the ordering and orientation of contigs that are most consistent with the reference genome.
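In its simplest form, single-reference scaffolding amounts to sorting contigs by where they align on the reference and recording the strand of each alignment. The sketch below is a deliberately naive illustration and does not reproduce any of the tools listed above; in particular, keeping only the longest alignment per contig is a hypothetical simplification, and the tuple layout of the input is assumed.

```python
def order_by_reference(alignments):
    """Lay out contigs along a reference genome.

    alignments: iterable of (contig, ref_chrom, ref_start, strand, aligned_len)
    as might be parsed from a whole-genome aligner's output.
    ASSUMPTION: each contig is placed using only its longest alignment.
    Returns {ref_chrom: [(contig, strand), ...]} sorted by reference position.
    """
    best = {}
    for contig, chrom, start, strand, length in alignments:
        if contig not in best or length > best[contig][3]:
            best[contig] = (chrom, start, strand, length)

    layout = {}
    by_position = sorted(best.items(), key=lambda kv: (kv[1][0], kv[1][1]))
    for contig, (chrom, _start, strand, _length) in by_position:
        layout.setdefault(chrom, []).append((contig, strand))
    return layout
```

A contig placed on the "-" strand would be reverse-complemented in the final scaffold. Any true rearrangement between the two genomes is silently forced into the reference order by this procedure.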
The main challenges in using synteny data to guide genome reconstruction involve reconciling true differences with the reference genome as well as handling incomplete reference genomes. Such approaches may lead to mistakes if the reference genome is rearranged with respect to the genome being assembled or if the two genomes are too distant in phylogenetic terms. To address such challenges, MeDuSa [126] and Ragout [127] use multiple reference genomes along with the phylogenetic tree of these genomes as a reference to scaffold contigs. MeDuSa models the problem of using multiple reference genomes for scaffolding as an instance of the maximum weight path cover problem [128], which is known to be NP-hard (a class of computational problems for which no polynomial-time algorithms are known), and proposes a greedy heuristic to find an approximate solution. Ragout [129] and Ragout 2 [130] represent the target and reference genomes as a multicolored breakpoint graph with nodes representing the conserved synteny blocks and edges representing the adjacency of these blocks. In this graph, Ragout finds the missing adjacencies by solving a half-breakpoint state parsimony problem on the given phylogenetic tree and then orients and orders synteny blocks to reconstruct the target genome. Multi-CAR [131] starts by processing each reference genome separately using CAR. Multi-CAR then reconciles the different contig orderings by constructing a graph in which nodes are contigs and edges are the adjacencies given by different reference genomes. A maximal weight perfect matching [132] within this graph defines the final set of scaffolds.
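To make the path cover formulation more tangible, the sketch below applies a generic greedy heuristic to adjacency weights derived from several reference genomes. It is not MeDuSa's algorithm: the weight of an adjacency is simply the number of references that support it, contig orientation is ignored, phylogenetic distances are not used, and each reference is assumed to provide a single linear ordering of contigs.

```python
from collections import Counter


def greedy_path_cover(reference_orders):
    """Greedy approximation of a maximum weight path cover.

    reference_orders: list of contig orderings, one list per reference genome.
    ASSUMPTION: an adjacency's weight is the number of references supporting it.
    Returns a list of paths (scaffolds), each a list of contig names.
    """
    weights = Counter()
    contigs = set()
    for order in reference_orders:
        contigs.update(order)
        for a, b in zip(order, order[1:]):
            weights[tuple(sorted((a, b)))] += 1

    parent = {c: c for c in contigs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Accept the heaviest edges first, keeping every contig at degree <= 2
    # and refusing edges that would close a cycle.
    neighbors = {c: [] for c in contigs}
    for (a, b), _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(neighbors[a]) < 2 and len(neighbors[b]) < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            neighbors[a].append(b)
            neighbors[b].append(a)

    # Walk each path starting from one of its endpoints.
    paths, seen = [], set()
    for start in sorted(contigs):
        if start in seen or len(neighbors[start]) == 2:
            continue
        path, prev, node = [], None, start
        while node is not None:
            path.append(node)
            seen.add(node)
            onward = [n for n in neighbors[node] if n != prev]
            prev, node = node, (onward[0] if onward else None)
        paths.append(path)
    return paths
```

Conflicting adjacencies suggested by a single divergent reference are outvoted whenever the remaining references agree, which is the intuition behind using several references at once.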

Perspective
As we have shown, technological advances on both the experimental and computational sides have dramatically improved our ability to reconstruct the genomes of complex eukaryotic organisms, including repeat-rich plants such as rice and barley. Most of the technologies used today are evolved versions of approaches developed decades ago during the dawn of the genomic era. Although read lengths have increased, the accuracy of optical and nanocoding maps has improved, and subcloning approaches are now performed in vitro without the need for culturing the DNA in an E. coli host, the fundamental properties of the data being generated have not changed in a meaningful way. The exceptions are the technologies used to interrogate the structure of chromosomes through sequencing, namely Hi-C and related approaches. The paired reads generated by these technologies no longer provide per-pair distance constraints; rather, distance information can only be reconstructed from the frequency of "contacts" between distant sections of the genome being reconstructed. In return, however, these technologies provide much longer-range linking information than any other technology. Just as long-read technologies have dramatically advanced our ability to reconstruct genomes, the long-range linking information provided by Hi-C and similar technologies has made it possible to link together complete chromosome arms [99,110].
In the near future, it is likely that many previously intractable genomes will be reconstructed with the help of long-read sequencing data coupled with paired-read information from chromosome conformation capture technologies, augmented by short-read and short mate-pair technologies aimed at resolving the small-scale structure of genomes. This opportunity is particularly relevant to scientists studying the complex genomes of plants [133] or insects [134] for which few genomic resources are currently available.
As we have already mentioned, a largely unused source of information is the sequences of the many genomes already sequenced and deposited in public databases. This vast source of data can be a valuable addition to the other types of genomic data being used in genome reconstruction, particularly in projects aiming to more densely sample particular regions of the tree of life (e.g., the genomes of cereal crops [133,135]).
In our review, we have omitted a discussion of experimental challenges or cost, in part because our primary focus has been on algorithmic considerations and in part because of the rapid changes in technologies that would make any cost estimates obsolete even before the ink has dried on the paper. In general, such practical aspects are insufficiently discussed in current literature, and the community would benefit from a review focused on the technical challenges and costs of the available technologies.
Generating haplotype-phased, chromosome-scale assemblies of eukaryotic genomes from a mix of sequencing technologies has been the driving force behind the development of newer genome assembly methods. Koren and colleagues [136] proposed a method called "trio binning" that uses short and accurate Illumina reads from two parental genomes to partition long and noisy reads from an offspring into haplotype-specific sets of reads; each haplotype is then assembled independently.
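The partitioning step can be sketched as follows, under strong simplifying assumptions: exact k-mer matching, no sequencing errors, no reverse-complement handling, and a plain majority vote. The function names `kmers` and `bin_reads` and the default `k` are hypothetical and chosen for illustration; this is not the implementation described in [136].

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_reads(maternal_reads, paternal_reads, child_reads, k=4):
    """Partition offspring reads by parent-specific k-mers (illustrative)."""
    maternal = set().union(*(kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(kmers(r, k) for r in paternal_reads))
    # k-mers seen in exactly one parent act as haplotype markers.
    maternal_only = maternal - paternal
    paternal_only = paternal - maternal

    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in child_reads:
        read_kmers = kmers(read, k)
        m = len(read_kmers & maternal_only)
        p = len(read_kmers & paternal_only)
        # Assumption: simple majority vote; ties stay unassigned.
        if m > p:
            bins["maternal"].append(read)
        elif p > m:
            bins["paternal"].append(read)
        else:
            bins["unassigned"].append(read)
    return bins

# Toy input: the parents differ at a single base.
toy_bins = bin_reads(
    maternal_reads=["ACGTACGTAA"],
    paternal_reads=["ACGTTCGTAA"],
    child_reads=["GTACGT", "GTTCGT", "ACGT"],
)
```

Reads that carry no parent-specific k-mer, such as the third toy read, cannot be assigned to either haplotype by this rule.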
It is conceivable that in the very near future, further developments in genomic technologies will make the automatic reconstruction of mammalian genomes possible. Recent advances in nanopore sequencing devices are already yielding longer reads than all prior technologies, potentially leading to the ability to assemble complete eukaryotic genomes from nanopore data alone. Rather than the end of a road, such developments will create the opportunity for scientists to tackle even harder challenges, such as the complete reconstruction of individual haplotypes, particularly in the context of polyploid genomes or heterogeneous mixtures such as tumors or microbial communities. Some progress is being made in haplotyping human genomes with the help of pedigree information (specifically trios comprising two parents and a child) [136]; however, solving the more complex problems posed by mixtures and polyploidy will require further developments both in genomic technologies, such as those outlined in our review, and in the design of algorithms and tools able to effectively leverage the information provided by these technologies.