Data for constructing insect genome content matrices for phylogenetic analysis and functional annotation

Twenty-one fully sequenced and well-annotated insect genomes were used to construct genome content matrices for phylogenetic analysis and functional annotation of insect genomes. To examine the role of the e-value cutoff in ortholog determination, we used scaled e-value cutoffs and a single linkage clustering approach. The present communication includes (1) a list of the genomes used to construct the genome content phylogenetic matrices, (2) a NEXUS file with the data matrices used in phylogenetic analysis, (3) a NEXUS file with the Newick trees generated by phylogenetic analysis, (4) an Excel file listing the Core (CORE) genes and Unique (UNI) genes found in five insect groups, and (5) a figure showing a plot of consistency index (CI) versus the percentage of unannotated genes that are apomorphies in the data set for gene losses and gains, together with bar plots of gains and losses for four CI cutoffs.


Twenty-one whole insect genomes were filtered using a single linkage clustering approach to generate presence/absence matrices for phylogenetic analysis. Lists of gene gains and losses were obtained for specified nodes in the phylogenetic tree using phylogenetic reconstruction approaches. These gene lists were then characterized for functional significance using the websites listed below.

Data source location
See Supplemental Table 1 as described in the Appendix A section of this paper.

Data accessibility
Data are within this article.

Value of the data
These data should allow any researcher to (1) obtain raw genome sequences from 21 insect taxa for phylogenetic analysis, (2) reconstruct phylogenies from the presence/absence matrices for comparison with other methods of phylogenetic reconstruction, (3) compare specific phylogenetic hypotheses generated from the presence/absence matrices of insect genomes with those from other methods, and (4) compare the FlyBase annotations that we determined to be part of the CORE genome, or unique (UNI) to terminal groups in our phylogenetic analysis, with other gene lists that might be of significance to insect evolution.

Data
The data were obtained from the websites listed in Supplemental Table 1 and processed to generate genome content (gene presence/absence) matrices for phylogenetic and functional analysis. Several such matrices were generated by this process and are included with this paper in Supplemental Table 2. The trees generated from phylogenetic analysis of these matrices are in Supplemental Table 3.
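As an illustration of what a genome content matrix looks like, the sketch below builds a presence/absence matrix from gene clusters and writes it as a NEXUS DATA block. It is a minimal example under stated assumptions, not the code used to produce Supplemental Table 2: the function names, the `taxon|gene` identifier convention, and the taxon labels are all hypothetical.

```python
def presence_absence_matrix(clusters, taxa):
    """Return {taxon: string of 0/1}, with one column per gene cluster.

    Assumption: each gene identifier has the hypothetical form 'taxon|gene'.
    """
    rows = {}
    for taxon in taxa:
        rows[taxon] = "".join(
            "1" if any(gene.split("|")[0] == taxon for gene in cluster) else "0"
            for cluster in clusters
        )
    return rows


def to_nexus(rows):
    """Format a {taxon: '0101...'} mapping as a NEXUS DATA block."""
    ntax = len(rows)
    nchar = len(next(iter(rows.values())))
    width = max(len(taxon) for taxon in rows)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={ntax} NCHAR={nchar};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    lines += [f"  {taxon.ljust(width)}  {states}" for taxon, states in rows.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical clusters and taxon labels, for illustration only.
example_clusters = [["agam|g7", "dmel|g1"], ["dmel|g2"]]
example_rows = presence_absence_matrix(example_clusters, ["dmel", "agam", "amel"])
print(to_nexus(example_rows))
```

Each column is one gene cluster and each row is one genome, so a cluster found in every taxon yields a column of all 1s.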

Experimental design and methods
The experimental design followed the methods outlined in Rosenfeld et al. [3] and involved the generation of phylogenetic trees to determine specific genes and gene families that have been gained and lost in insect evolution. Lists of gene gains and losses for five major insect groups (Insecta, Hemiptera, Holometabola, Diptera and Hymenoptera) were generated, and the functional significance of these lists was assessed.
The analysis involved the following steps:
(1) Assembly of 21 insect genomes into a searchable database.
(2) Ortholog determination of genes from these genomes and construction of phylogenetic matrices consisting of presence/absence data.
(3) Phylogenetic analysis of the genome content data (presence/absence matrices).
(4) Character reconstruction of the gains and losses of different genes and gene families for the five insect groups (Insecta, Hemiptera, Holometabola, Diptera and Hymenoptera).
(5) Functional characterization of the genes that are gained and lost in the five insect groups listed above.
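The single linkage clustering used for ortholog determination in step (2) can be sketched as follows. This is a minimal illustration, not the pipeline of Rosenfeld et al. [3]: it assumes the similarity search results are already available as (query, subject, e-value) triples, it takes a single fixed cutoff rather than reproducing the scaled cutoffs, and the function name and gene identifiers are hypothetical.

```python
from collections import defaultdict


def single_linkage_clusters(hits, evalue_cutoff):
    """Cluster genes so that two genes share a cluster whenever a chain of
    hits with e-value <= evalue_cutoff connects them (single linkage)."""
    parent = {}

    def find(gene):
        # Union-find lookup with path halving; registers unseen genes.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        if evalue <= evalue_cutoff and root_q != root_s:
            parent[root_q] = root_s  # merge the two clusters

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hits, for illustration only.
example_hits = [
    ("dmel|g1", "agam|g7", 1e-50),
    ("agam|g7", "amel|g3", 1e-30),
    ("dmel|g2", "amel|g9", 1e-2),
]
print(single_linkage_clusters(example_hits, 1e-10))
print(single_linkage_clusters(example_hits, 1.0))
```

Running the example with a stricter and a looser cutoff shows why the cutoff matters: the weak hit between `dmel|g2` and `amel|g9` joins the two genes into one cluster only under the looser value, which changes the resulting presence/absence columns.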
The specific methods used in the five steps listed above utilized Phylogenetic Analysis Using Parsimony (PAUP*; [4]) to generate genome content trees. Three methods were used for the phylogenetic analyses: Maximum Parsimony with unweighted characters, Maximum Parsimony with Dollo weighting, and Maximum Likelihood (using the binGAMMA model). Presence and absence were reconstructed on the phylogenetic trees with PAUP* [4] using the "apolist" command.
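To make the consistency index (CI) used to filter gains and losses concrete, the sketch below counts the minimum number of state changes for one presence/absence character on a rooted binary tree using Fitch (unordered) parsimony, then computes CI as the minimum conceivable number of changes divided by the number observed on the tree. It is an illustration only and does not reproduce PAUP*'s implementation or the Dollo analysis; the tree, taxon labels, and function names are hypothetical, and the state mapping is assumed to cover exactly the tips of the tree.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character (Fitch parsimony).

    tree   : nested 2-tuples with tip names (str) as leaves
    states : {tip name: character state}, e.g. 0 = absent, 1 = present
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        shared = left & right
        if shared:
            return shared
        steps += 1  # no shared state: one change is required here
        return left | right

    state_set(tree)
    return steps


def consistency_index(tree, states):
    """CI = (number of distinct states - 1) / observed steps on the tree."""
    observed = fitch_steps(tree, states)
    if observed == 0:
        return None  # invariant character: CI is undefined
    minimum = len(set(states.values())) - 1
    return minimum / observed


# Hypothetical five-taxon tree and gene presence/absence pattern.
example_tree = ((("A", "B"), "C"), ("D", "E"))
example_states = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
print(fitch_steps(example_tree, example_states))
print(consistency_index(example_tree, example_states))
```

A gene whose presence pattern requires a single change on the tree has CI = 1, whereas each additional independent gain or loss lowers the CI, which is why a CI cutoff can be used to retain only the less homoplastic characters.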