Dataset of 130 metagenome-assembled genomes of healthy and diseased broiler chicken caeca from Pakistan

This article presents metagenome-assembled genomes (MAGs) of prokaryotic organisms originating from chicken caeca. The samples originate from two groups of broiler chickens, one infected with Newcastle Disease Virus (NDV) and one uninfected control group, with four birds per group. Both groups were raised on commercially available antibiotic-free feed under a semi-controlled setup. Binning of the samples identified 130 MAGs with ≥50 % completion and ≤10 % contamination. The data presented include sequences in FASTA format, tables of functional annotation of genes, and data from two different approaches for phylogenetic tree construction using these MAGs. Major biogeochemical cycles at community level, including the carbon, sulfur, and nitrogen cycles, are also presented.



Value of the Data
• These data provide information about bacterial and archaeal genomes in the caecum of both healthy and NDV-infected broiler chickens.
• The functional potential of the genomes will be beneficial for developing intervention strategies that modulate the microbiome.
• The data are applicable to comparative genomic studies of 130 different prokaryotic candidates.
• The data will help to expand the knowledge of microbe-microbe and host-microbe interactions.

Data Description
The structure of the repository is shown in Fig. 1. For a total of 130 metagenome-assembled genomes (MAGs), the following files are provided (x replaces the MAG number) in the FINAL_MAGs main directory:
• bin.x.fasta.gz: recovered genomic sequence
• bin.x.gene.gz: recovered genomic sequences for genes
• bin.x.faa.gz: recovered protein sequences for genes
• bin.x.gff.gz: detailed annotation of MAGs including different types of features and their locations
The METABOLIC_result.xlsx in the METABOLIC_Annotations main directory is a spreadsheet that contains 6 sheets:
• "HMMHitNum": Presence or absence of custom HMM profiles within each MAG, the number of times the HMM profile was identified within a MAG, and the ORF(s) that represent the identified protein.
• "FunctionHit": Presence or absence of sets of proteins which were identified and displayed as separate proteins in the sheet titled "HMMHitNum". For each MAG, the functions are identified as "Present" or "Absent".
• "KEGGModuleHit": Annotation of each MAG with modules from the KEGG database organized by metabolic category. For each MAG, the modules are identified as "Present" or "Absent".
• "KEGGModuleStepHit": Presence or absence of modules from the KEGG database within each MAG separated into the steps that make up the module. For each MAG, the module steps are identified as "Present" or "Absent".
• "dbCAN2Hit": The dbCAN2 annotation results against all MAGs (CAZy numbers and hits). For each MAG, there are two distinct columns, which show the number of times a CAZy was identified and what ORF(s) represent the protein.
• The sixth sheet reports peptidase hits: for each MAG, there are two distinct columns, which show the number of times a peptidase was identified and what ORF(s) represent the protein.
The GTDB-Tk files in the METABOLIC_Annotations main directory are provided as:
• gtdtbtk.ar122.classify.tree: Tree in Newick format for MAGs classified as archaea
• gtdtbtk.ar122.summary.tsv: Taxonomic classification of MAGs classified as archaea at different ranks
• gtdtbtk.bac120.classify.tree: Tree in Newick format for MAGs classified as bacteria
• gtdtbtk.bac120.summary.tsv: Taxonomic classification of MAGs classified as bacteria at different ranks
The Nutrient_Cycling_Diagrams directory is the sub-directory of METABOLIC_Figures of the METABOLIC_Annotations main directory. It contains the following files for each MAG (x replaces the MAG number), where in the diagrams a red arrow designates presence of a pathway step and a black arrow means absence:
• bin.x.draw_sulfur_cycle_single.PDF
• bin.x.draw_nitrogen_cycle_single.PDF
• bin.x.draw_other_cycle_single.PDF
• bin.x.draw_carbon_cycle_single.PDF
In addition, the folder contains the summary diagrams for pathways at a community scale. Two sequential transformation diagrams, Sequential_transformation_01.pdf and Sequential_transformation_02.pdf, summarize and visualise the numbers of MAGs and the coverages that were putatively involved in the sequential transformation of both important inorganic elements and organic compounds. Metabolic_Sankey_diagram.pdf shows the function fractions that are contributed by various microbial groups in a given community.
The Functional_network_figures directory is the sub-directory of METABOLIC_Figures of the METABOLIC_Annotations main directory. It contains diagrams representing metabolic connections of biogeochemical cycling steps at both the phylum level and the whole community level.
The GTOTREE_files main directory contains the following files:
• Universal_Hug_et_al.tre: Phylogenetic tree for MAGs in Newick format recovered using the 16-gene SCG set.
• Universal_Hug_et_al_gain_file.txt: Phylogenetic gain (absolute and in percentages) calculated for each MAG against all other MAGs, serving as a means to ascertain novelty.
• Bacteria_and_Archaea.tre: Phylogenetic tree for MAGs in Newick format recovered using the 25-gene SCG set.
• Bacteria_and_Archaea_gain_file.txt: Phylogenetic gain (absolute and in percentages) calculated for each MAG against all other MAGs.
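As an illustration of how the per-MAG sequence files described above might be inspected, the following is a minimal Python sketch that summarises one gzipped FASTA file. The function name `fasta_stats` and the example path are assumptions made for illustration only; the path simply follows the bin.x.fasta.gz naming pattern with x replaced by a MAG number.

```python
import gzip


def fasta_stats(path):
    """Return (number of contigs, total bases, GC fraction) for a gzipped FASTA file.

    Hypothetical helper for illustration; it is not part of the published dataset.
    """
    n_contigs = 0
    total_bp = 0
    gc_count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_contigs += 1  # every header line starts a new contig
            else:
                seq = line.upper()
                total_bp += len(seq)
                gc_count += seq.count("G") + seq.count("C")
    gc_fraction = gc_count / total_bp if total_bp else 0.0
    return n_contigs, total_bp, gc_fraction


# Example call (assumed path, following the naming pattern described above):
# print(fasta_stats("FINAL_MAGs/bin.1.fasta.gz"))
```

The function reads the file line by line, so it should cope with large MAGs without loading a whole genome into memory.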

Experimental Design, Materials and Methods
Metagenomic microbial analysis was carried out on the caeca of broiler chickens, including one group challenged with Newcastle disease virus and one uninfected healthy control group (4 samples per group). Both groups were raised on commercially available antibiotic-free feed under a semi-controlled setup. The experiment was conducted for 8 weeks. Birds in the challenged group were orally challenged with Newcastle disease virus (Strain ID: chicken/broiler-25days/Sargodha/2020/PRI, provided by Poultry Research Institute, Rawalpindi, Pakistan) in the 6th week at the calculated dose, i.e. the embryo infectious dose 50 (EID50), and caecal samples were collected from the diseased birds on the 3rd day post-challenge. Birds were euthanized and caecal samples were collected aseptically and maintained at -80 °C. DNA was extracted using an extraction kit (Invitrogen PureLink™ Microbiome DNA Purification Kit), followed by a quality check on a NanoDrop spectrophotometer. Shotgun sequencing was performed on Illumina TruSeq, ensuring ∼20M reads per sample for 8 samples using 2 × 100 bp reads, at the Glasgow Polyomics sequencing facility.
Adapter-trimmed reads were provided by the sequencing facility. These reads were further trimmed using Sickle v1.200 [1], trimming reads where the average Phred quality dropped below 20 and retaining paired-end reads only if their length after trimming was greater than 50 bp. This gave a total of 163,357,856 quality-trimmed reads from all samples. We then collated all the forward and reverse reads and performed a co-assembly of all samples using MEGAHIT with the parameters --k-list 27,47,67,87 --kmin-1pass -m 0.95 --min-contig-len 1000 [2]. This resulted in a total of 202,665 contigs comprising 1,396,298,223 base pairs (bp), with a maximum contig length of 344,753 bp, an average length of 6,890 bp and an N50 of 16,582 bp. We then used the MetaWRAP pipeline [3] and binned the contigs using three different binning algorithms: metaBAT2 (198 bins) [4], MaxBin2 (167 bins) [5], and CONCOCT (253 bins) [6]. On these bins, we applied CheckM [7] to assess their completion as well as contamination. Within the MetaWRAP framework, the bins from the three binners were consolidated, retaining only bins with ≥50 % completion and ≤10 % contamination, to give a final set of 130 bins or metagenome-assembled genomes (MAGs), with the completion and contamination ranking given in Fig. 2. For these bins, we obtained a mean genome completion of 82.90 % and a mean contamination of 1.79 %. The summary statistics of these MAGs are given in Table 1.
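To make the assembly statistic and the bin retention criteria above concrete, the following is a minimal Python sketch of the N50 calculation and of the ≥50 % completion, ≤10 % contamination filter. The function names `n50` and `keep_bin` are hypothetical and this is not the code used by MEGAHIT, CheckM, or MetaWRAP; it only illustrates the definitions under the thresholds stated in the text.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def keep_bin(completion, contamination,
             min_completion=50.0, max_contamination=10.0):
    """Apply the retention thresholds from the text (values are percentages)."""
    return completion >= min_completion and contamination <= max_contamination


# Toy example: contig lengths 2, 3, 4, 5, 6 sum to 20; the contigs of
# length 6 and 5 together hold 11 bases (more than half), so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))

# A bin at the reported mean values (82.90 % complete, 1.79 % contaminated) passes.
print(keep_bin(82.90, 1.79))
```

As a consistency check on the reported figures, 1,396,298,223 bp divided by 202,665 contigs gives roughly 6,890 bp, matching the stated average contig length.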
To infer the phylogeny of the MAGs, we used GToTree [16]. The software provides several single-copy gene (SCG) sets depending on the resolution of the domains and the taxonomic rank of interest. We used two SCG sets: a 25-gene Bacteria and Archaea SCG set (recovered phylogeny for 81 MAGs) and a 16-gene SCG set (recovered phylogeny for 70 MAGs) by [17] that covers all major domains of life. To see which MAGs are novel, we used the Genome Tree Toolkit ( https://github.com/donovan-h-parks/GenomeTreeTk ) to calculate the phylogenetic gain of each MAG against the rest of the tree, with higher values potentially identifying novel species. We calculated these values for each MAG in the trees recovered for both the 25-gene Bacteria and Archaea SCG set and the 16-gene SCG set from [17].
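The idea behind a per-MAG phylogenetic gain can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Genome Tree Toolkit implementation: it assumes that the gain of a single leaf is its terminal branch length, that the percentage is taken relative to the total branch length of the tree, and that leaf names are unquoted and contain no colons, commas, or parentheses. The function name `leaf_gain` is hypothetical, and the values in the published gain files may be defined differently.

```python
import re

# Pattern for a branch length such as 0.5, 12, or 1.2e-3.
_NUM = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
# Any branch length in the tree (terminal or internal).
_ANY_BRANCH = re.compile(r":(" + _NUM + r")")
# A leaf is a name that directly follows "(" or "," and precedes ":length".
_LEAF_BRANCH = re.compile(r"[(,]\s*([^():,;\s]+):(" + _NUM + r")")


def leaf_gain(newick):
    """Map each leaf name to (terminal branch length, percent of total tree length).

    Simplifying assumption: a leaf's gain equals its terminal branch length.
    """
    total = sum(float(x) for x in _ANY_BRANCH.findall(newick))
    gains = {}
    for name, length in _LEAF_BRANCH.findall(newick):
        absolute = float(length)
        percent = 100.0 * absolute / total if total else 0.0
        gains[name] = (absolute, percent)
    return gains


# Toy tree with total branch length 1.0 + 2.0 + 0.5 + 1.5 = 5.0
for name, (absolute, percent) in leaf_gain("((A:1.0,B:2.0):0.5,C:1.5);").items():
    print(name, absolute, percent)
```

A regular-expression scan is used here only to keep the sketch dependency-free; a dedicated tree library would be the more robust choice for real Newick files.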

Table 1
Summary statistics of MAGs recovered using the MetaWRAP pipeline with the ≥50 % completion and ≤10 % contamination criteria based on the CheckM software. The GTDB-Tk classification is also shown, along with the percentage gain (PG) scores for each MAG (where the MAG was included in the resulting tree) using the 25-gene Bacteria and Archaea SCG set and the 16-gene SCG set from [17]. The highlighted bins represent the MAGs for which either of the SCG sets worked successfully, making them part of the phylogenetic trees.

Fig. 1. Repository structure diagram. There are three archives that provide the sequencing data and annotation for the MAGs. The yellow rounded-corner nodes represent directories or compressed directories, whilst the grey nodes represent files. The ellipses represent a repeat of such files for each MAG.

Fig. 2. Completion (A) and contamination (B) statistics of bins recovered using the original software (metaBAT2, MaxBin2, and CONCOCT), which were later refined by MetaWRAP using the ≥50 % completion and ≤10 % contamination criteria to obtain a final set of 130 MAGs.

Fig. 3. Major biogeochemical cycles recovered for 130 MAGs: (A) Carbon Cycle, (B) Nitrogen Cycle, (C) Sulfur Cycle, and (D) Other Cycles. Each arrow represents a single transformation. Beside each arrow is given the number of MAGs that can conduct the reaction and the metagenomic coverage (as a percentage of total MAGs).

Fig. 4. Schematic figure of sequential metabolic transformations of inorganic compounds recovered for 130 MAGs. X-axes describe individual sequential transformations indicated by letters. The two panels describe the (A) number of MAGs and (B) genome coverage (proportion of total MAGs) involved in a certain sequential process.

Fig. 5. Schematic figure of sequential metabolic transformations of organic compounds recovered for 130 MAGs. X-axes describe individual sequential transformations indicated by letters. The two panels describe the (A) number of MAGs and (B) genome coverage (proportion of total MAGs) involved in a certain sequential process.
