PhageTerm: a tool for fast and accurate determination of phage termini and packaging mechanism using next-generation sequencing data

The worrying rise of antibiotic resistance in pathogenic bacteria is leading to a renewed interest in bacteriophages as a treatment option. Novel sequencing technologies enable description of an increasing number of phage genomes, a critical piece of information to understand their life cycle, phage-host interactions, and evolution. In this work, we demonstrate how it is possible to recover more information from sequencing data than just the phage genome. We developed a theoretical and statistical framework to determine DNA termini and phage packaging mechanisms using NGS data. Our method relies on the detection of biases in the number of reads, which are observable at natural DNA termini compared with the rest of the phage genome. We implemented our method with the creation of the software PhageTerm and validated it using a set of phages with well-established packaging mechanisms representative of the termini diversity, i.e. 5′cos (Lambda), 3′cos (HK97), pac (P1), headful without a pac site (T4), DTR (T7) and host fragment (Mu). In addition, we determined the termini of nine Clostridium difficile phages and six phages whose sequences were retrieved from the Sequence Read Archive. PhageTerm is freely available (https://sourceforge.net/projects/phageterm), as a Galaxy ToolShed and on a Galaxy-based server (https://galaxy.pasteur.fr).


Phage DNA extraction and sequencing
Escherichia coli phages. Phage Lambda, HK97, T7, P1 and T4 were propagated on E. coli strain MG1655 in LB supplemented with CaCl2 (5 mM). A thermosensitive variant of bacteriophage Mu (MuCts62) was induced from strain RH8816, a derivative of strain MC4100 1 . Strain RH8816 was grown at 30°C to OD~0.5 and induced at 42°C for 2 hours.
Phage capsids were isolated using a PEG precipitation step followed by cesium chloride (CsCl) gradient purification. First, solid NaCl was added to the phage lysate to a final concentration of 1M and incubated on ice for 1h, followed by centrifugation at 11,000g for 10 min at 4°C to remove the cellular debris. PEG8000 was added to a final concentration of 10% w/v for 1h on ice followed by centrifugation at 11,000g for 10 min at 4°C. The pellet was resuspended in PBS (1X) and incubated for 1h at room temperature. An equal volume of chloroform was then added followed by centrifugation at 3000g for 15 min at 4°C. The aqueous phase which contains the phage particles was recovered and further purified by ultracentrifugation with a gradient of cesium chloride (CsCl). A volume of 3 ml of each of two cesium chloride solutions (density of 1.3 g/ml-0,41 g/ml of water and density of 1.5 g/ml-0,68 g/ml of water) and a volume of 2ml of a CsCl solution (density of 1.6 g/ml-0,82 g/ml of water) was added sequentially from more to less dense with a pipette into the bottom of a polyallomer tube. Phage suspensions up to 3 ml were then layered on the top of the gradient, and tubes were centrifuged at 100,000 g for two hours in a Beckman SW41 swinging bucket rotor. During this time, a roughly linear gradient of CsCl was formed in the tubes. Phages concentrated as a visible band were collected after puncturing the tube with a needle under the corresponding band. A second step of equilibrium ultracentrifugation in cesium chloride solution (density of 1.5 g/ml) was performed at 150,000 g with Beckman SW60 rotor. Phages were then dialyzed twice against water for 20 min and overnight against Tris 10 mM and NaCl 150 mM. Concentrated samples of phage lysates were treated with DNase and RNase for 30 min at 37°C, DNase was inactivated by EDTA (pH 8.0, 5 mM) for 10 min at 65°C and then proteinase K (0,5 mg/ml) and SDS (0,5 %) was added for 30 min at 55°C. Phage DNA was then purified using a typical protocol with phenol/chloroform/isoamylic alcohol followed by precipitation with acetate sodium (0,3M) and ethanol 100%. Pellets were washed with ethanol 70% and resuspended in TE (1X). Phage DNA was sequenced using the TruSeq library preparation kit on a HiSeq device.
Clostridium difficile phages. The NGS sequences of 8 Clostridium phages are reported: three temperate Siphoviridae phages (phiCD24-1, phiCD111, and phiCD146), and five temperate Myoviridae phages (phiCD481-1, phiCD505, phiCD506, phiMMP01 and phiMMP03). Phage DNA was isolated from 10 mL of purified phage lysate using the QIAGEN Lambda Mini Kit as described previously 2 . Since no suitable propagating host could be identified, we extracted phage DNA directly from a crude induction lysate. The quality of the DNA preparations was verified on agarose gel and the absence of contaminating bacterial DNA was verified by PCR using primers targeting the triose phosphate isomerase gene 3 . Single-end multiplex TruSeq libraries were created according to Illumina's specifications. Sequencing was performed on the Illumina HiSeq 2000 platform of the Pasteur Institute (Paris, France). The 110 bp sequencing reads (12 to 19 Mb, >1000X) were assembled into a single contig for each phage using Velvet 4 . Finally we used the MicroScope workflow 5 to generate an automatic functional annotation for putative coding sequence (CDS), which was then manually curated.

Implementation of the method from Li et al.
This approach is based on the computation and interpretation of three specific ratios R1, R2 and R3. The first ratio, is computed as follow: the highest starting frequency found on either the forward or reverse strands is divided by the average starting frequency, R1 = (highest frequency/average frequency). Li's et al. have proposed three possible interpretation of the R1 ratio. First, if R1 < 30, the phage genome does not have any termini and is either circular or completely permuted and terminally redundant. The second interpretation for R1 is when 30 ≤ R1 ≤ 100, suggesting the presence of preferred termini with terminal redundancy and apparition of partially circular permutations. At last, if R1 > 100 that is an indication that at least one fixed termini is present with terminase recognizing a specific site. The two other ratios are R2 and R3 and the calculation is done in a similar manner. R2 is calculated using the two highest frequencies (T1-F and T2-F) found on the forward strand and R3 is calculated using the two highest frequencies (T1-R and T2-R) found on the reverse strand. To calculate these two ratios, we divide the highest frequency T1 by the second highest frequency T2. So R2 = (T1-F / T2-F) and R3 = (T1-R / T2-R). These two ratios are used to analyze termini characteristics on each strand taken individually. Li et al. suggested two possible interpretations for R2 and R3 ratios combined to R1. When R1 < 30 and R2 < 3, we either have no obvious termini on the forward strand, or we have multiple preferred termini on the forward strand (when 30 ≤ R1 ≤ 100). If R2 > 3, it is suggested that there is an obvious unique termini on the forward strand. The same reasoning is applicable for the result of R3, used to analyze the termini on the reverse strand. Combining the results for ratios found with this approach, it is possible to make the first prediction for the viral packaging mode of the analyzed phage. A unique obvious termini present at both ends (both R2 and R3 > 3) reveals the presence of a cos mode of packaging. The headful mode of packaging pac is concluded when we have a single obvious termini only on one strand.

Software implementation
PhageTerm is written in Python 2.7, the source code and documentation are freely available at https://sourceforge.net/projects/phageterm under a GPL license (GPL 3.0). PhageTerm is compatible on three operating systems (Linux, Mac OS X, and Windows) in command line (require Python 2.7.X with the following library: matplotlib 1.3.1, numpy 1.9.2, pandas 0. 19.1, sklearn 0.18.1, scipy 0.18.1, statsmodels 0.6.1 and reportlab 3.3.0.). For direct graphical interface operation, a Galaxy wrapper version is available at https://galaxy.pasteur.fr.

Inputs and process
PhageTerm works with two mandatory inputs: (i) a file containing the phage sequence in Fasta format (Reference Sequence); and (ii) a file containing the sequencing reads in fastq format (Reads File). Optional inputs can also be chosen: the phage name to be used as an output prefix (default: Phagename), the length of the seed used for reads in the mapping process (default: 20), a surrounding value for peaks merging (default: 20), a file containing paired-end reads in fastq format (Paired File), the host genome sequence in fasta format (Host Sequence), phage upper limit coverage (default: 250) and the number of processor cores you want to use (default: 1).

Performance
The PhageTerm software was optimized to use multiple core and to minimize the memory consumption. During the whole process, the memory footprint remained low (< 1Gb). In our experiments, PhageTerm processed reads at a rate of more than 1 million reads per CPU-hour (3.5GHz Intel Xeon E5 with 12MB L3 cache).

How NGS reads quantity affect results ?
To test the reliability of the two methods, results were analysed with coverage values from 50 to 500 (Table S3). PhageTerm was able to accurately determine the termini of most phages down to a coverage of 50. Only a few C. difficile phages required a coverage of 250 or more to determine the right termini. PhageTerm can thus still perform very well with low sequence coverage.    Figure S1. COS phages with cohesive ends (e.g. λ -5', HK97 -3') Figure S2. DTR Phage with direct terminal repeats (e.g. T7) Figure S3. Headful phages with a prefered terminus (e.g. P1) Figure S4. Mu-like phages with host termini (e.g. Mu)