Genome Organization of Escherichia Phage YD-2008.s: A New Entry to Siphoviridae Family

Malaysia is a megadiverse country, and its biodiversity includes rich microbial communities. Phages are a major component of microbial communities, yet the number of phages discovered so far is only a minute fraction of their population in the biosphere. Given the large number of phages still waiting to be discovered, a new bacteriophage designated Escherichia phage YD-2008.s was isolated using Escherichia coli ATCC 11303 as the host. Phage YD-2008.s possesses an icosahedral head 57 nm in diameter and a long, flexible, non-contractile tail 107 nm in length, which places it in the family Siphoviridae of the order Caudovirales. Genomic sequence analysis showed that the phage YD-2008.s genome is linear dsDNA of 44,613 base pairs with a G+C content of 54.6%. Sixty-two open reading frames (ORFs) were identified in the full genome using the annotation software Rapid Annotation using Subsystem Technology (RAST). Of these ORFs, twenty-eight code for functional proteins, thirty-two are classified as hypothetical proteins and two are unidentified. Although the majority of the putative proteins show high amino acid similarity to those of phages from the genus Hk578likevirus of the family Siphoviridae, phage YD-2008.s remains distinct. It is therefore a new addition to the family Siphoviridae and to the growing list of viruses in the International Committee on Taxonomy of Viruses (ICTV) database.


INTRODUCTION
Escherichia coli is one of the best-studied Gram-negative bacteria. E. coli strains are commonly found and are important to both microbiology and biotechnology (Abeles & Pride 2014). Part of their importance in microbiology lies in serving as hosts for many phages. In fact, the best-studied groups of phages are those infecting E. coli (Dessel et al. 2005). Phages are not pathogenic to humans, animals or plants. They are highly specific to bacteria; in other words, bacteria are the natural prey of bacteriophages (Duck & Park 2012; Klumpp & Loessner 2013). Because of this host specificity, phages are excellent agents for controlling bacterial populations and for maintaining bacterial diversity (Chang & Kim 2011). In phage therapy, phages show very promising results against multidrug-resistant bacteria (Chang & Kim 2011; Cannon et al. 2013; Klumpp & Loessner 2013). Phages are the most abundant biological entities in the biosphere, with an estimated 10^31 particles (Ackermann 1998; Li et al. 2010), and, being nanometre-sized, they have served as useful model systems in a wide range of molecular research. Phages are also ubiquitous and can be found wherever their hosts reside, which makes them as diverse as their hosts (Ackermann 1998; Valera et al. 2014; Yu et al. 2016). Although phages are vast in number, only about 6,300 have been discovered and examined under the electron microscope. Over 96% of discovered phages are tailed phages of the order Caudovirales, which thus form the largest group of prokaryotic viruses (Ackermann 1998; Ackermann & Prangishvili 2012; Yu et al. 2016). Tailed phages are divided into three major families: Siphoviridae (57.3%), Myoviridae (24.8%) and Podoviridae (14.2%), as reported by Ackermann and Prangishvili (2012).
The abundance of phages in the biosphere facilitates the isolation and propagation of novel phages. However, fully characterising a phage isolate down to the genomic level is costly and time consuming (Valera et al. 2014). Nevertheless, phage genome sequencing is vital to advance our understanding of the evolution of phages and their bacterial hosts. The development of high-throughput next generation sequencing (NGS) technology in 2005 has therefore given virologists the chance to identify more of these yet-to-be-revealed phages (Forrester & Hall 2014; Valera et al. 2014). As of August 2015, more than 1500 phage genomes had been completely sequenced and deposited in the National Center for Biotechnology Information (NCBI) database, compared with only ~400 phage genomes deposited in NCBI GenBank in 2007 (Savalia et al. 2008). Without doubt, many more phages are waiting to be discovered and characterised. This paper describes a new phage designated Escherichia phage YD-2008.s. The phage was isolated from goat faeces using E. coli ATCC 11303 as the host, and its genome was fully sequenced on the MiSeq system (Illumina). The full genome sequence of phage YD-2008.s is available from NCBI GenBank under accession No. KM896878.1 or NC_027383.1.

Isolation and Propagation of Phage
Fresh goat faecal samples were collected from a goat farm in Kampung Batu Putih, Balik Pulau, Penang, Malaysia (GPS coordinates: latitude 5° 24' 18.515'' N; longitude 100° 12' 17.254'' E). Phage isolation was carried out according to a previous report, with modifications. Briefly, 20 g of goat faeces was mixed with 100 ml of TS buffer (8.5 g NaCl and 1 g tryptone per liter). A 10 ml volume of overnight E. coli ATCC 11303 culture was mixed with 10 ml of the faecal suspension and added to 20 ml of double-strength LB broth (20 g bactotryptone, 10 g yeast extract and 10 g NaCl per liter). The mixture was incubated at 37°C with shaking at 160 rpm for 24 hours to enrich the phage population. Subsequently, large debris was removed by passing the mixture through filter paper, and the filtrate was centrifuged at 4000 x g for 15 minutes. The supernatant was passed through a 0.45 μm syringe filter (Minisart, Sartorius Stedim Biotech, Germany) and stored at 4°C. This filtrate was used in the standard soft agar overlay method (Pearson 2013) with E. coli ATCC 11303 as the host for phage isolation. Plaques (clear zones) that formed on the soft agar overlay were due to lysis of the bacterial cells infected by the phages. The phages were then purified by the single plaque purification method as described previously (Jones & Portnoy 1994).

Transmission Electron Microscope (TEM)
Phage observation by transmission electron microscopy (TEM) (Philips CM12 equipped with analysis system, Philips Electron Optics) was carried out according to Ackermann (2007), with modifications. A drop of phage sample (approximately 5 x 10^10 pfu/ml) was applied onto a carbon-coated grid (400 mesh copper grid) and left for three to five minutes. A drop of 2% methylamine tungstate was then applied to negatively stain the phage sample, and after one minute the prepared sample was ready to be viewed under the TEM.

Phage Genome Extraction
The phage genome was extracted using the phenol:chloroform technique as reported previously (Sellvam & Arip 2012). A volume of 2 µl each of RNase A and DNase I, at a final concentration of 1 mg/mL, was added to the phage sample and incubated at 37°C for 30 min. After incubation, the phage genome was extracted with phenol:chloroform:isoamyl alcohol solution (25:24:1). The phage genome was then precipitated by adding ice-cold isopropyl alcohol and incubating at -20°C overnight. After the overnight incubation, the mixture was centrifuged at 14,000 rpm for 20 minutes at room temperature. The supernatant was decanted and the pellet was washed with 0.7 mL of ice-cold 70% ethyl alcohol. Following centrifugation at 14,000 rpm for 10 min, the pellet was air-dried, dissolved in 30 µl of nuclease-free water and stored at -20°C.

Complete Genome Sequencing
The phage genome was sequenced by next generation sequencing (NGS) using Illumina MiSeq technology. Briefly, the Nextera XT DNA sample preparation kit (Illumina) was used to prepare the purified genome sample, and sequencing was performed with paired-end 2 x 150 bp reads on the MiSeq system (MiSeq reagent; Nano kit v2, 300 cycles). The data were analysed using the Assembly workflow of the MiSeq Reporter for quality control purposes.

Bioinformatics Analysis
The paired-end reads obtained by Illumina MiSeq sequencing were assembled into a full genome sequence using the de novo assembly tool of CLC Genomic Workbench 6.0 (CLC Bio, Denmark), and the assembly was also compared using Geneious R8. The length of the phage YD-2008.s genome, the form of its DNA, its GC content and its nucleotide (A, T, C, G) composition were obtained from CLC Genomic Workbench 6.0. Open reading frames (ORFs) were predicted using GeneMark.hmm v3.25 (Lee et al. 2013) and confirmed with the RAST server (Chang & Kim 2011). Nucleotide and protein sequences were searched with the NCBI BLAST (Basic Local Alignment Search Tool) tools to identify homologues, as described in previous work (Pan et al. 2013; Yu et al. 2016). The complete genome sequence of phage YD-2008.s was subjected to BLASTn against the non-redundant nucleotide collection (nr/nt) (https://blast.ncbi.nlm.nih.gov/Blast; accessed on 15 August 2014). The function of each predicted ORF was cross-checked against the non-redundant protein databases using BLASTX and BLASTP, as well as BLAST against the phage protein database, with an E-value cut-off of 10^-5 (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx; https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp; accessed in September 2014) (Dessel et al. 2005; Chang & Kim 2011; Lee et al. 2013; Pearson 2013).
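The E-value filtering step described above can be illustrated with a minimal Python sketch. It assumes the BLAST results have been saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`) and that a hit is retained when its E-value is at or below 10^-5, which is the usual convention; the function name `best_hits` and the example rows are hypothetical and are not results from this study.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return, per query, the lowest-E-value hit with evalue <= max_evalue.

    Assumption: hits at or below the threshold are kept (hypothetical
    helper for illustration, not the authors' pipeline).
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(OUTFMT6_FIELDS, row))
        evalue = float(hit["evalue"])
        if evalue > max_evalue:
            continue  # hit is not significant at the chosen cut-off
        current = best.get(hit["qseqid"])
        if current is None or evalue < float(current["evalue"]):
            best[hit["qseqid"]] = hit
    return best


# Hypothetical example rows, invented purely for illustration.
EXAMPLE = "\n".join([
    "orf_A\tsubject_1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "orf_A\tsubject_2\t80.0\t180\t36\t0\t1\t180\t5\t184\t2e-60\t220",
    "orf_B\tsubject_3\t35.0\t60\t39\t0\t1\t60\t10\t69\t0.5\t30.1",
])

if __name__ == "__main__":
    for query, hit in best_hits(EXAMPLE).items():
        print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

In this toy input, the hypothetical query `orf_B` has no hit below the threshold, which corresponds to the "no significant similarity" outcome for an ORF.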

Nucleotide Sequence Accession Number
The complete genome sequence of Escherichia phage YD-2008.s was deposited in GenBank under accession No. KM896878.1 or NC_027383.1.

Isolation and Morphological Study of Escherichia phage YD-2008.s
Phage YD-2008.s was isolated from goat faeces in Penang, Malaysia, using Escherichia coli ATCC 11303 as the host. The phage possesses the typical features of the family Siphoviridae of the order Caudovirales. Phages in this family have a capsid about 50-60 nm in diameter and a long non-contractile tail that can reach up to 750 nm in length (Matsuzaki et al. 2005; King et al. 2012). The transmission electron micrographs show that phage YD-2008.s has an icosahedral capsid 57 nm in diameter and a long, flexible, non-contractile tail 107 nm in length (Fig. 1).

Basic Genomic Identity of Escherichia phage YD-2008.s
The complete genome of phage YD-2008.s was successfully sequenced using the NGS method. Assembly of the paired reads with the de novo assembly tool (CLC Bio 6.0, Denmark) indicates that the phage YD-2008.s genome is linear double-stranded DNA. The genome consists of 44,613 base pairs with a nucleotide composition of A (22.2%), T (23.2%), G (26.8%) and C (27.8%). Overall, it has a G+C content of 54.6%, which is slightly higher than the G+C content found in other E. coli phages (50-51%) (Matsuzaki et al. 2005; Li et al. 2010; Chang & Kim 2011). However, the G+C content of phage YD-2008.s is very similar to that of phages from the genus Hk578virus, a member of the family Siphoviridae.
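As a worked illustration of how the G+C content follows from the nucleotide composition, the short Python sketch below computes base percentages from a sequence string and checks the reported figures (G 26.8% + C 27.8% = 54.6%). The function names and the toy sequence are hypothetical; in the study itself the values were obtained from CLC Genomic Workbench.

```python
from collections import Counter


def base_composition(sequence):
    """Return the percentage of A, T, G and C among unambiguous bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ATGC")
    if total == 0:
        raise ValueError("sequence contains no A/T/G/C bases")
    return {base: 100.0 * counts[base] / total for base in "ATGC"}


def gc_content(sequence):
    """Return the G+C content of a sequence as a percentage."""
    composition = base_composition(sequence)
    return composition["G"] + composition["C"]


# Percentages reported in the text for the 44,613 bp genome.
REPORTED = {"A": 22.2, "T": 23.2, "G": 26.8, "C": 27.8}

if __name__ == "__main__":
    toy = "ATGCGCGGCCTA"  # toy sequence, not part of the phage genome
    print(round(gc_content(toy), 1))
    print(round(REPORTED["G"] + REPORTED["C"], 1))
```

Applied to the full genome sequence (for example, the record deposited under KM896878.1), the same function would be expected to reproduce the reported composition, provided ambiguous bases are absent or ignored.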
Bioinformatic analysis of the full genome of phage YD-2008.s identified a total of 62 ORFs, of which 96.8% encode functional or hypothetical proteins. The ORFs range from 39 to 1139 amino acids (aa) in length, with an average of 216 aa. The majority of the ORFs (62.9%) are located on the bottom strand, and the remaining 37.1% are encoded on the upper strand. The function of each coding DNA sequence (CDS) was examined with BLASTP and re-confirmed with NCBI BLAST against the phage protein database. Among the sixty-two ORFs, 28 putative proteins were annotated with known functions. Thirty-two were conserved hypothetical proteins that share similarity with registered phage proteins of unknown function in the NCBI database. The genome of phage YD-2008.s also contains two ORFs with no hits (no significant similarity found) against the NCBI non-redundant protein database (BLASTP) and the BLAST phage protein database. These two ORFs could code for new phage proteins or might be derived from the host genome. The complete details of each predicted ORF of phage YD-2008.s are listed in Table 1, with a schematic representation shown in Fig. 2. The majority of the ORFs show confident hits against proteins of phages from the genus Hk578virus, with 76-100% amino acid identity at an E-value cut-off of 10^-5. Hence, Escherichia phage YD-2008.s could be grouped with the members of the genus Hk578virus as another new phage in the family Siphoviridae.
Table 1 notes: * no significant similarity found; ** predicted functions of the deduced hypothetical/putative proteins were assigned against the following databases: RAST, NCBI and GeneMark.
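The ORF statistics quoted above can be cross-checked with simple arithmetic. The Python sketch below uses only the counts stated in the text (62 ORFs: 28 with known function, 32 conserved hypothetical, 2 without hits); the per-strand counts of 39 and 23 are not stated in the text and are inferred here from the reported percentages, so they should be read as an assumption.

```python
TOTAL_ORFS = 62

# Category counts as stated in the text.
CATEGORY_COUNTS = {
    "known function": 28,
    "conserved hypothetical": 32,
    "no hits": 2,
}

# Assumption: strand counts inferred from the reported 62.9% / 37.1% split.
STRAND_COUNTS = {"bottom": 39, "upper": 23}


def percent(count, total=TOTAL_ORFS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    with_hits = (CATEGORY_COUNTS["known function"]
                 + CATEGORY_COUNTS["conserved hypothetical"])
    print("ORFs with database hits (%):", percent(with_hits))
    print("ORFs with known function (%):",
          percent(CATEGORY_COUNTS["known function"]))
    for strand, count in STRAND_COUNTS.items():
        print(strand, "strand (%):", percent(count))
```

Under these counts, the 60 ORFs with database hits correspond to the reported 96.8%, and the 28 ORFs with known function correspond to roughly 45% of the CDS.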

Sequence Analysis of Predicted Proteins of Escherichia phage YD-2008.s
A total of 45% of the CDS in the full genome of phage YD-2008.s show sequence similarity to known proteins with known molecular functions. These predicted proteins can be categorised into five main functional groups: structural, DNA replication and recombination, lysis, packaging, and additional functions including host interaction and nucleotide metabolism. For the structural proteins, phage YD-2008.s encodes all the essential proteins typically found, including tail fiber protein, tail assembly protein, minor tail protein, tail length tape measure protein 1, major tail protein, structural protein and major head protein. Most of the genes coding for structural components are arranged in the middle of the genome, and all of them lie on the bottom strand in reverse orientation. The major capsid protein (ORF33) and the major tail protein (ORF27) have 366 and 214 amino acids, respectively, whereas the tail fiber protein (ORF19) is the largest of all the predicted ORFs, with 1139 amino acids.
The phage encodes proteins such as DNA polymerase, helicase, endonuclease and helicase-primase for the DNA replication/recombination process. It also encodes a few proteins categorised under nucleotide metabolism, including DNA cytosine C5 methyltransferase, DNA N-6 adenine methyltransferase, phosphoesterase and transposase. The proteins for DNA replication/recombination and nucleotide metabolism that are encoded in the phage genome are produced with the aid of the host's machinery, since phages are known to be host-dependent (Savalia et al. 2008; Li et al. 2010; Chang & Kim 2011).
In principle, phages need to lyse their host cells to release their progeny virions from the bacterial cell as the final step of the virus life cycle. For this purpose they need to encode lysis proteins (Li et al. 2010; Gan et al. 2013; Klumpp & Loessner 2013). To accomplish this function, phage YD-2008.s has genes encoding lysozyme, a holin-like class I protein and a holin-like class II protein. These lysis genes lie next to each other in the genome. Furthermore, these proteins support the view that phage YD-2008.s is a lytic phage. The phage genome also encodes packaging proteins, such as head morphogenesis protein, terminase large subunit and terminase small subunit, all of which are needed for the assembly of viral particles.
In addition, phage YD-2008.s possesses a host interaction protein annotated as a superinfection exclusion protein. This protein prevents secondary infection by other phages of closely related families. It can block the entry of DNA from other phages or modify the entry receptor on the host cell. Hence, the infected host is dominated by only one specific phage at a time (Chumby et al. 2012; Lee et al. 2013). The predicted ORFs that code for functional phage proteins are summarised by protein cluster in Table 2.