YaHS: yet another Hi-C scaffolding tool

Abstract Summary We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity. Availability and implementation YaHS is implemented in C and licensed under the MIT License. The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. Supplementary information Supplementary data are available at Bioinformatics online.

1 Supplementary methods

Assembly error correction
To detect potential assembly errors for a contig, we first count the Hi-C read pairs spanning each position. Instead of using all read pairs, we require the mapping distance of two reads to fall under a certain threshold in order to reduce the influence of background noise. The distance threshold is estimated from data, such that 80% of read pairs are included. If this estimate is smaller than 1Mb, we use 1Mb instead. For each contig we then calculate the median value of the number of read pair counts crossing each site, and break the contig at positions where counts drop below a tenth of the median (except at the ends of the contig). The new contigs are subject to further rounds of error corrections until no errors are detected. See Fig. M1A.

Notation
Consider a genome assembly of N contigs. Denote by a n the size of nth (n = 1, ..., N ) contig. For a predefined resolution l, each contig n is split into b n chunks of size l, where b n = an l . The Hi-C read pair count between chunks i and j is denoted by c i,j , and the distance of separation of chunk i to j by d i,j = δ(i, j) = |i − j|, where i, j ∈ 1, ..., b n . The definition of c i,j and d i,j can be easily extended to a pair of contigs m and n, e.g., δ(i, j) = b m − i + j when m is followed by n in their positive orientations, and we have i ∈ 1, ..., b m and j ∈ 1, ..., b n . For convenience, a chunk pair (i, j) is referred to as a cell, and a cell for a chunk pair within a contig or between two contigs is respectively referred to as an intra-or inter -cell. We should note that, the subscripts for contigs are dropped here for convenience, e.g., the intra-and inter -cell Hi-C count should be c n,i,j and c m,n,i,j , respectively. See Fig. M1B.

The expected Hi-C count of a cell
The expected Hi-C count of a cell (i, j) is considered a function of chunk distance d i,j . The cells with the same distances form a band in the Hi-C contact map, which we assume have the same expected values denoted by E d . We use intra-cells to estimate this value E d = M ed{c n,i,j : n ∈ 1, ..., N, δ(i, j) = d}, where the function M ed calculates the median value of a data set. We also explored using the average values as expectations. For contig n, the total Hi-C count of band d is written as C n,d = i,j:δ(i,j)=d c i,j , and the total cell count of band d is written as B n,d = i,j:δ(i,j)=d 1 = b n − d, where i, j ∈ 1, ..., b n . Then the expected Hi-C count of cell (i, j) with δ(i, j) = d is averaged across all contigs and calculated as E d = N n=1 C n,d / N n=1 B n,d . We found that the average values performed less well compared to the medians for using as expectations.

The score for contig joining
We consider the joining of a contig pair m and n in all four possible orientations and the one with the largest score is selected as the final orientation. Without loss of generality, in this subsection, we assume the joining of the negative strand of m followed by the positive strand of n. The joining score is calculated as the weighted sum of the normalised Hi-C counts of the inter -cells between the two contigs. The Hi-C count of inter -cell (i, j) (chunk i from contig m and chunk j from contig n), where i ∈ 1, ..., bm 2 and j ∈ 1, ..., bn 2 , is normalised by the expected value and written as w i,j = c i,j /E δ(i,j) . Similar to the intra-cell calculation, the total cell count of band d is denoted by B m,n,d = i,j:δ(i,j)=d 1. Intuitively, we want the Hi-C signals to fill the Hi-C contact map band to band from inner to outer, and therefore define a filling rate up to band d as f m,n,d = i,j:δ(i,j)<d w i,j / d−1 k=1 B m,n,k . The joining score of contig m and n is finally calculated as s m,n = i,j: , is a distance threshold calculated from data. This is either the maximum value of d such that the total number of intra-cells on band d is no less than 30 or the minimum value of d such that the number of Hi-C read pairs fall into bands with distance no greater than d constitute at least 99% of the total number of Hi-C read pairs, whichever is smaller. The aim of setting an upper limit D is to ensure that a sufficient number of cells are used to estimate a sensible median (the 30 value), with the 99% threshold only rarely applying when the long range Hi-C signal is so weak as to not be usefully informative. The default values of these parameters have proved robust over a wide variety of data sets, but users who wish to explore alternative values can set them with command line options --D-min-cells and --D-mass-frac.

Remove background noise
Background noise is considered random Hi-C contacts between loci which could be either due to library preparation or read mapping issues. We see noise levels vary significantly between different species and Hi-C data types. (B) Contigs a and b are partitioned into eight and ten chunks respectively, and the Hi-C signals are consequently assigned to 36 and 55 intra-cells and 80 inter-cells (only consider the upper triangular part). The color of a cell indicates the Hi-C contact frequency (darker colors for higher frequencies). In general the contact frequency decreases as the distance of separation (d) between chunks increases. If a and b are neighboring contigs on a scaffold then cells with the same separation, e.g. those cells shaded with gray slashes (d = 3), should have similar contact frequencies whether they are intra-or inter-cells. Specifically, inter-cells within the green rectangle are used to calculate the joining score of contig a and b in this orientation. Here the separation threshold for contig joining score calculation is set as 6, which in the program is estimated from the data. Scores for each of the four possible pairwise orientations are calculated and compared to determine the orientation of the join. (C) A scaffolding graph for seven contigs. Numbers on edges are joining scores. The path indicated by red edges is used to assemble a scaffold after graph simplification. We estimate a global value for noise level from the data using inter -cells during the joining score calculation. More precisely, after finishing Hi-C counting for each pair of contigs, we choose the orientation with the least Hi-C counts from the four possibilities and record the counts for Hi-C contacts and cells. The average of cell Hi-C count across all contig pairs (most of which will be on different chromosomes) is used as the estimation for noise level, denoted by e. With the noise level included, the inter -cell Hi-C count normalisation is then written as w i,j = (c i,j − e)/(E δ(i,j) − e).

Consider Hi-C library restriction enzymes
The restriction enzymes used in the Hi-C library are optionally considered to remove biases on cell Hi-C counts caused by the variations in cutting site densities. We scan the input contigs to obtain the total number of cutting sites, denoted by R, and also to record the positions on contigs. The per base cutting site density is calculated as r = R/G, where G is the total number of base pairs of the input contigs. For each chunk along a contig, we calculate the relative per base cutting site density as r n,i = Rn,i lr , where R n,i is the number of cutting sites in chunk i of contig n. With the restriction enzymes considered, the cell Hi-C counts are normalised by the cutting site densities before any calculation mentioned above and written as c i,j = c i,j /r i r j , dropping implied subscripts for contigs.

Scaffolding graph construction and pruning
A scaffolding graph is constructed by using the contigs as nodes (two nodes for each contig for forward and reverse strands) and the joins between them as edges which are weighted by the joining scores (Fig. M1C). We implement nine heuristic operations for graph simplification.
(1) Filter by edge weight: edges with weights smaller than a predefined threshold or likely random contig joins are filtered out. To check if an edge is a likely random join, we find the 99th quantile, denoted by q, of a binomial distribution with W trials and a probability of w for success on each trial, where W is the number of inter -cells between the two contigs and w is the average of the normalised Hi-C counts for inter -cells across all contig pairs. The edge is considered a random join and filtered out if pW < q, where p is the joining score.
(2) Trim edges for tips: a tip is a node with only one edge -either incoming or outgoing -and the edge connects to a node on a contiguous path. The edge connecting the tip is removed. See Fig. M2A.
(3) Trim edges for blunt ends: a blunt end is a node with only incoming or outgoing edges and each edge connects to a node on a contiguous path. Edges connecting the blunt end are all removed. See Fig. M2B.
(4) Solve repeats: a repeat is a chunk of sequence which occurs multiple times and can be placed in multiple locations. Reuse of repeats is error prone so we skip them and leave scaffolds with gaps. A repeat in the graph is identified as a node with multiple incoming and outgoing edges, where for each pair of incoming and outgoing nodes, there exists an edge connecting them. Edges connecting the repeat are all removed. See Fig. M2C.
(5) Remove transitive edges: suppose three contigs are placed on a scaffold in a linear order, then as well as there being Hi-C contacts between the first two contigs and between the last two contigs, it is also common to see Hi-C contacts between the fist and the third contig. In the graph, we call the edge connecting the first and the third node a transitive edge, which is removed. See Fig. M2D.
(6) Remove bubbles: a bubble structure in the graph involves a source and a sink node where there exists multiple paths between them and no incoming edges from or outgoing edges to other parts of the graph except at the source and the sink. It is not possible to decide which path to select. However, we can skip the bubble structure by directly joining the source and the sink and introduce a gap between them. See Fig. M2E.
(7) Resolve ambiguous orientations: an ambiguously orientated contig pair will end up with a diamond structure in the graph, e.g., the forward strand of the first contig points to both the forward and reverse strands of the second contig which points back to the reverse strand of the first contig (companion edges). Ambiguous edges are removed if the diamond structure can be resolved, e.g., if the first contig is the only incoming node for the forward strands of the second contig while there exists a third contig pointing to the reverse strand of the second contig, then the edge connecting the forward strand of first contig and the reverse strand of the second contig along with its companion edge are removed. See Fig. M2F.
(8) Trim weak edges: an edge is considered weak and removed if there exists at least another edge with a larger weight in both outgoing edges from the source node and incoming edges to the sink node. See Fig. M2G.
(9) Remove ambiguous edges: a set of at least two edges are considered ambiguous and removed if they share the same source or sink node. Once removed the ambiguous edges, all paths extracted from the graph traversal are unambiguous which are used to assemble scaffolds. See Fig. M2H.

The default scaffolding process
YaHS runs a hierarchical joining process with multiple rounds of scaffolding at decreasing resolutions (increasing chunk sizes). In each round, the scaffolds generated in the previous round are used as the input except for the first round when the contigs are used. By default, the program runs at 15 resolutions: 10Kb,20Kb,50Kb,100Kb,200Kb,500Kb,1Mb,2Mb,5Mb,10Mb,20Mb,50Mb,100Mb,200Mb, and 500Mb. The number of rounds of scaffolding is determined by the program itself depending on the size the genome and the contiguity of the constructed scaffolds. The scaffolding process ends as soon as the scaffold N50 is smaller than 10 times the value of the resolution. This parameter is exposed to users. For example, users may start with a higher resolution (e.g., 1Kb) for highly fragmented genome assemblies. Most times, users do not need change this parameter. Table S1: Genome assembly results for the 15 DToL/VGP species. The cells of smaller L50 and L90 between YaHS and SALSA2 were coloured in green. Pin hic was not taken into account for this colouring as it was sometimes too aggressive to make contig joins which led to obvious scaffolding errors (Fig. S3-S17).