Data in support of genome-wide identification of lineage-specific genes within Caenorhabditis elegans

Two sets of lineage-specific genes (LSGs) were identified using BLAST: Caenorhabditis elegans species-specific genes (SSGs, 1423) and Caenorhabditis genus-specific genes (GSGs, 4539). The data contained in this article show that SSGs and GSGs differ significantly in their evolution and that most of them were formed by gene duplication and integration of transposable elements (TEs). Subsequent analysis of temporal expression and protein function shows that many SSGs and GSGs are expressed and that genes involved in sex determination, specific stress, immune response, and morphogenesis are the most represented. The data are related to the research article “Genome-wide identification of lineage-specific genes within Caenorhabditis elegans” in Journal of Genomics [1].


More specific subject area: Genomics

Value of the data
The data in our study show the genetic features of SSGs and GSGs and their expression profiles at different developmental stages.
The data from the origin analysis of SSGs and GSGs indicate that these genes were mainly generated by gene duplication and exaptation from TEs.
The data derived from in silico protein function prediction suggest that SSGs and GSGs may be involved in specific stress, morphogenesis, and immune response, which indicates that these genes might be relevant to some essential processes for adapting to extreme environments.
1. Data, materials and methods

Sequence data
The proteomes of 58 vertebrate and 21 invertebrate species, excluding nematode species, were downloaded from Ensembl. The genomes and proteomes of nine Caenorhabditis species and 12 other nematode species were downloaded from WormBase. The UniProtKB protein data were downloaded from EBI (ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/). All of the expression data (ESTs and mRNAs) of Caenorhabditis elegans were downloaded from UCSC (http://hgdownload.cse.ucsc.edu/downloads.html). The RNA-Seq data of C. elegans were obtained from EBI (http://www.ebi.ac.uk/ena/) and NCBI (http://www.ncbi.nlm.nih.gov/sra/).

Homolog search
The two sets of lineage-specific genes (LSGs), namely SSGs and GSGs, within C. elegans were identified through a pipeline (Fig. 1) based on a homolog search using BLAST [2] with an E-value cutoff of 10^-5 [3-6] and the scoring matrices BLOSUM62, PAM70, and PAM30. Because proteins of different sizes are scored with different matrices, we divided the 26,150 C. elegans proteins into three sets for the BLASTP analyses: proteins less than 30 aa in length, proteins between 30 and 70 aa, and proteins more than 70 aa. Details of the identified LSGs are provided in Supplementary File 1.
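The length-based partition and the hit-based classification can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The source lists three matrices and three length classes but does not state which matrix was paired with which class; the pairing below (PAM30 for the shortest proteins, PAM70 for the middle class, BLOSUM62 for the longest), the treatment of proteins of exactly 30 or 70 aa, and the function names are all assumptions.

```python
def choose_matrix(protein_length):
    """Pick a BLASTP scoring matrix from protein length (assumed pairing)."""
    if protein_length < 30:
        return "PAM30"
    if protein_length <= 70:  # boundary handling is an assumption
        return "PAM70"
    return "BLOSUM62"


def classify_gene(hit_species, caenorhabditis_species, focal="C. elegans"):
    """Classify one C. elegans protein from the species of its significant hits.

    hit_species: species names with a BLASTP hit passing the E-value cutoff.
    caenorhabditis_species: names of the Caenorhabditis species searched.
    """
    others = set(hit_species) - {focal}
    if not others:
        return "SSG"  # no homolog outside C. elegans
    if others <= set(caenorhabditis_species):
        return "GSG"  # homologs only within the genus
    return "EC"       # homologs outside the genus


genus = {"C. elegans", "C. briggsae", "C. remanei"}
print(choose_matrix(25), classify_gene({"C. briggsae"}, genus))
```

Keeping the matrix choice in a single function makes it easy to substitute a different pairing if the actual one differs from this assumption.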

Characterization of SSGs, GSGs, and ECs
The genetic feature information for the LSGs was downloaded from Ensembl using BioMart (http://www.ensembl.org/). We used Perl scripts to calculate the gene length, protein length, exon number, and GC content. In addition, we tested for significant differences between SSGs, GSGs, and evolutionarily conserved genes (ECs) using one-way ANOVA. The gene length, protein length, exon number, and GC content are provided in Table 1.
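As an illustration of the two calculations named above, the following sketch computes GC content and the F statistic of a one-way ANOVA. It is written in Python rather than Perl and is not the authors' script; the function names and the choice to ignore ambiguous bases such as N are assumptions. Only the F statistic is computed; obtaining a p-value would additionally require the F distribution.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


print(gc_content("GGCCAATT"))
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```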

Origin analysis of LSGs
We downloaded the genomic positions of TE sequences from the UCSC Genome Browser and those of paralogs from Ensembl, and compared them with the positions of SSGs and GSGs to identify the LSGs containing TE sequences and the LSGs having paralogs. If a lineage-specific gene completely overlapped with TEs, we considered it to have been generated by exaptation from TEs; if it completely overlapped with paralogs, we considered it to have arisen from gene duplication. The results are shown in Table 2.
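The positional comparison can be sketched as a simple interval test. This is an illustrative sketch under stated assumptions, not the authors' code: "complete overlap" is read here as the gene interval lying entirely inside a single annotated interval, TEs are checked before paralogs, and the interval format, function names, and return labels are hypothetical.

```python
def is_covered(gene, features):
    """True if the gene interval lies entirely inside any one feature interval.

    Intervals are (chrom, start, end) with 1-based inclusive coordinates.
    """
    chrom, start, end = gene
    return any(c == chrom and s <= start and end <= e for c, s, e in features)


def infer_origin(gene, te_intervals, paralog_intervals):
    """Assign a putative origin mechanism to one lineage-specific gene."""
    if is_covered(gene, te_intervals):        # checked first (assumed order)
        return "exaptation_from_TE"
    if is_covered(gene, paralog_intervals):
        return "gene_duplication"
    return "undetermined"


tes = [("I", 1000, 5000)]
paralogs = [("II", 200, 900)]
print(infer_origin(("I", 1200, 1800), tes, paralogs))
```

For genome-scale input, a sorted interval list or an interval tree would avoid the linear scan used here.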
For retroposition, we followed a previous study [7] and developed a pipeline to screen for chimeric genes. We first mapped the C. elegans protein sequences onto the genome using TBLASTN (E-value ≤ 10^-3) [8]. We then analyzed the TBLASTN results and aligned the best selected matches with each protein using GENEWISE [9]. Only proteins having multiple exons were selected for subsequent analyses. In parallel, we extracted and merged adjacent homology matches (distance < 40 bp) from the TBLASTN results, requiring that the merged target sequences have significant similarity to the query at the amino acid level (> 30%) and be more than 50 aa in length. After these steps, similarity searches of the merged sequences against the multi-exon proteins were conducted using FASTA [10], and the closest matches were selected as candidate parental-retrogene proteins. We then verified the absence of introns from these putative retrogenes with 10,000 bp flanking regions via GENEWISE, and screened out the sequences with scores over 35. We checked the match of each parental-retrogene sequence, requiring that the matching part of the parental gene contain at least two introns, and confirmed that the alignment exceeded 40%.
Fig. 1. Procedure for identifying lineage-specific genes within C. elegans. BLASTP was primarily used in this pipeline with an E-value less than 1e-5. C. elegans proteins are marked in orange, the proteins of species other than C. elegans in green, and the resulting SSGs, GSGs, and ECs (evolutionarily conserved genes) in blue. "Hit" indicates that a C. elegans protein has hits in BLASTP, whereas "No hit" indicates that a C. elegans protein has no hit.
The positions of the retrogenes were then compared to the gene positions of the annotated genes on WormBase. We defined an annotated gene with a specific amount of overlap (coding sequence > 50 bp) with a retrogene as a chimeric gene. If the overlap exceeded 90 bp, the chimeric genes were considered possible false positives derived from the parental genes or their flanking regions. We then aligned the recruited coding sequences of the retrogenes to their parental genes, extended by 10,000 bp flanking regions, to ensure the reliability of these chimeric genes. The complete information regarding the retrogenes and chimeric genes identified is shown in Supplementary File 2.
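The step of merging adjacent homology matches, and the overlap measurement used when comparing retrogenes with annotated genes, can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes the distance threshold is read as "less than 40 bp", that distance means the number of bases between the end of one match and the start of the next, and that only matches on the same sequence and strand are merged; the tuple layout and function names are hypothetical.

```python
def merge_adjacent_hits(hits, max_gap=40):
    """Merge hits on the same sequence and strand that are < max_gap bp apart.

    hits: iterable of (chrom, strand, start, end), 1-based inclusive.
    """
    merged = []
    for chrom, strand, start, end in sorted(hits):
        if merged:
            m_chrom, m_strand, m_start, m_end = merged[-1]
            gap = start - m_end - 1  # bases strictly between the two hits
            if (chrom, strand) == (m_chrom, m_strand) and gap < max_gap:
                merged[-1] = (m_chrom, m_strand, m_start, max(m_end, end))
                continue
        merged.append((chrom, strand, start, end))
    return merged


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


hits = [("I", "+", 230, 300), ("I", "+", 100, 200), ("I", "+", 400, 450)]
print(merge_adjacent_hits(hits))
print(overlap_bp(100, 200, 150, 260))
```

The similarity and length requirements on the merged sequences would be applied afterwards, once each merged region has been realigned to its query.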

Fig. 2. The proportion of LSGs expressed at different developmental stages (panels: SSGs and GSGs). The vertical axis represents the proportion of genes with read support, whereas the horizontal axis represents the different developmental stages.

Transcriptional analysis
After cleaning with SeqClean (https://sourceforge.net/projects/seqclean/), the 381,408 transcript sequences were mapped to the C. elegans genome using BLAT [11] with the default parameters. The following criteria were then imposed to obtain high-quality and clearly mapped transcripts: mapping length ≥ 150 bp, identity ≥ 98%, coverage within mapping ≥ 97%, and coverage within whole transcript ≥ 75%. When a transcript was mapped to multiple genomic loci, we discarded the ambiguous ones (difference in BLAT scores < 2%) and retained the best matches from the rest. Furthermore, the genomic positions of the LSGs were compared with the mapped positions of ESTs and full-length cDNAs. A gene that overlapped by more than 100 bp with the ESTs was considered likely to be expressed. We then tested for significant differences between SSGs, GSGs, and ECs using a t-test.
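The filtering criteria can be expressed as a small predicate plus an ambiguity check. This is a sketch under stated assumptions rather than the authors' implementation: the thresholds are read as "at least" 150 bp, 98%, 97%, and 75%, a transcript is treated as ambiguous when its two best passing alignments differ by less than 2% of the top score, and the dictionary keys and function names are hypothetical.

```python
def passes_filters(aln):
    """Apply the four quality thresholds to one BLAT alignment.

    aln: dict with 'mapped_len' (bp) and 'identity', 'aligned_cov',
    'transcript_cov' as fractions in [0, 1]; key names are hypothetical.
    """
    return (aln["mapped_len"] >= 150 and aln["identity"] >= 0.98
            and aln["aligned_cov"] >= 0.97 and aln["transcript_cov"] >= 0.75)


def best_unambiguous(alignments, min_rel_diff=0.02):
    """Return the top-scoring passing alignment, or None if none or ambiguous.

    Scores are assumed to be positive BLAT scores stored under 'score'.
    """
    kept = sorted((a for a in alignments if passes_filters(a)),
                  key=lambda a: a["score"], reverse=True)
    if not kept:
        return None
    if len(kept) > 1:
        top, second = kept[0]["score"], kept[1]["score"]
        if (top - second) / top < min_rel_diff:
            return None  # two loci score too similarly to choose between
    return kept[0]


good = dict(mapped_len=200, identity=0.99, aligned_cov=0.98,
            transcript_cov=0.80, score=195)
print(best_unambiguous([good, dict(good, score=150)])["score"])
```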
To further analyze the expression profile, we downloaded RNA-Seq data for different embryo stages, including 1-cell, 2-cell, 4-cell, 28-cell, early, and late embryos, with the accession codes SRX004864, SRX092477, SRX092371, SRX085219, SRX092479, and SRX004865, respectively [12]. We also downloaded data for the larva stages, including L1 larva, L2 larva, L3 larva, L4 larva, and L4 male larva, with the accession codes SRX004867, SRX001872, SRX001875, SRX001874, and SRX004868, respectively [12], and for the adult stages, including young adult, adult hermaphrodite, and adult male, with the accession codes SRX001873, SRX191947, and SRX191950, respectively [12,13]. TopHat [14] and HTSeq [15] were then used to map the reads of each time point independently to the C. elegans genome and to calculate the read number per gene. The expression levels of LSGs were normalized to reads per kilobase per million mapped reads (RPKM). Using the RPKM values, we compared the number of expressed SSGs and GSGs at each developmental stage to observe their overall expression levels; the proportion of expressed genes at each stage is shown in Fig. 2. After screening out the SSGs and GSGs expressed only at the larva and adult stages, we examined the possible motifs contained in their proteins using InterProScan. The complete information regarding SSGs and GSGs, including the RPKM values, is shown in Supplementary Files 3 and 4. InterProScan results for the predicted functions are provided in Supplementary File 5.
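The RPKM normalization and the per-stage proportion of expressed genes can be sketched as follows. The RPKM formula is the standard one; treating a gene as expressed when it has at least one mapped read is an assumption based on the "read support" wording of Fig. 2, and the function names and example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def proportion_expressed(read_counts, min_reads=1):
    """Fraction of genes with at least min_reads mapped reads at one stage.

    read_counts: dict mapping gene identifier to read count for that stage.
    """
    if not read_counts:
        return 0.0
    expressed = sum(1 for c in read_counts.values() if c >= min_reads)
    return expressed / len(read_counts)


# Hypothetical counts for four genes at a single stage
stage_counts = {"geneA": 0, "geneB": 3, "geneC": 1, "geneD": 0}
print(rpkm(500, 2000, 10_000_000))
print(proportion_expressed(stage_counts))
```

Running `proportion_expressed` once per stage and per gene set (SSGs, GSGs) gives the kind of values plotted on the vertical axis of Fig. 2.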

Functional assignment and categorization of SSGs and GSGs
We downloaded the current developmental and functional descriptions of C. elegans from WormBase (http://www.wormbase.org/). Based on the descriptions, functional assignments of some SSGs and GSGs were obtained and clustered into different categories. A table summarizing the gene numbers of each class description of SSGs and GSGs is provided in Supplementary File 6. In addition, a table of the gene class descriptions of SSGs and GSGs is provided in Supplementary File 7. For the LSGs without annotations, we employed the ProtFun 2.2 server (http://www.cbs.dtu.dk/services/ProtFun/) to predict their cellular roles and gene ontology categories. The functions of LSGs predicted using ProtFun are provided in Supplementary File 8.

RT-PCR experiments
Total RNA from a mixed-stage population of wild-type C. elegans N2, which was washed three times with M9 buffer solution, was isolated using a Trizol Reagent kit (Invitrogen, Carlsbad, CA, USA). We then treated the RNA with Recombinant DNase I (TaKaRa, Japan) to eliminate genomic DNA contamination before reverse transcription. We designed 16 pairs of unique primers to amplify the target sequences and conducted RT-PCR experiments with a subset of the LSGs identified in our study. The gel image is shown in Fig. 3, and the RT-PCR primer information is provided in Supplementary File 9.