Data on the evolutionary history of the V(D)J recombination-activating protein 1 – RAG1 coupled with sequence and variant analyses

RAG1 protein is one of the key components of the RAG complex, which regulates V(D)J recombination. Only a few studies have addressed the evolutionary history, detailed sequence features and mutational hotspots of RAG1. Herein, we present the datasets used in our recent comprehensive study of RAG1, which was based on sequence, phylogenetic and genetic variant analyses (Kumar et al., 2015) [1]. Protein sequence alignment helped in characterizing the conserved domains and regions of RAG1. It also aided in identifying an ancestral RAG1 in the sea urchin. Human genetic variant analyses revealed 751 mutational hotspots, located in both the coding and the non-coding regions. For further analysis and discussion, see Kumar et al., 2015 [1].


Retrieved from public databases
Data format: Analyzed data
Experimental factors: RAG1 sequences were retrieved from the Ensembl and/or NCBI databases.
Experimental features: RAG1 proteins were aligned with the MUSCLE tool and the alignment was edited in GeneDoc. RAG1 variants were analyzed with SIFT, PolyPhen and rSNPbase.
Data source location: Germany
Data accessibility: Data are with this article

Value of the data
Protein sequence analysis data reveal that SpRAG1L shares only 19-20% identity with vertebrate RAG1, which helped us in identifying an ancestral RAG1 protein in the sea urchin. This approach can be used to detect the origins of other proteins.
Protein sequence alignment locates two major domains and several regions of RAG1 and suggests that these fragments are conserved from sea urchin to human. This hints at evolutionary conservation of protein domains between the protein of interest and its ancestors.
Data on the genetic variant analysis suggest that the human RAG1 gene has 751 variants. Of these, 267 are missense variants that change amino acids in human RAG1, including 140 deleterious mutations. These variant data serve as the mutational hotspots within the coding region of human RAG1. Assessing mutational hotspots for any protein is critically important for understanding its function and its roles in disease.
Additionally, 284 non-coding variants were identified, 94% of which are regulatory in nature; such variants are often called regulatory SNPs (rSNPs). These data are a source of information on the regulatory implications of regions flanking any given gene.

Table 1 lists all RAG1 sequences used in Kumar et al. [1]; these sequences were used to construct the protein sequence alignment of RAG1 (Fig. S1). This protein alignment is the basis for Figs. 2 and 3 and Tables 1-5 of Kumar et al. [1]. Details of human RAG1 variants are summarized in Table S1 and regulatory SNPs in Table S2. These two supplementary tables are the primary data for the variant analyses described in Fig. 4 and Tables 2-5 of Kumar et al. [1].
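
Identity values such as the 19-20% quoted for SpRAG1L are computed from pairs of aligned sequences. The following is a minimal sketch of one common definition, in which identical residues are counted over the alignment columns where neither sequence has a gap. The article does not state which denominator was used, so this choice is an assumption; the function name `percent_identity` and the toy fragments are hypothetical and are not RAG1 sequence.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences carry a residue.

    Assumption: gapped columns are excluded from the denominator. Other
    definitions (full alignment length, shorter sequence length) give
    different values for the same alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns with a residue in both sequences
    identical = 0  # compared columns where the residues match
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy aligned fragments for illustration only (not real RAG1 sequence).
toy_a = "MKT-LLVA"
toy_b = "MRTALL-A"
print(round(percent_identity(toy_a, toy_b), 1))
```

In practice the two strings would be rows taken from the multiple alignment (Fig. S1), so that the comparison uses the same column assignments as the published alignment.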

Experimental design, materials and methods
Using the BLAST homology detection tool [2], we extracted the RAG1 gene from vertebrate genomes listed in either Ensembl release 77 [3] or NCBI. To ensure the accuracy of gene structures, we combined the gene predictions of Ensembl [3] and the AUGUSTUS tool [4]. We used human RAG1 as the standard sequence for mapping and numbering intron positions, with the suffixes a-c indicating their location as reported previously [5].

We aligned selected RAG1 protein sequences using the MUSCLE tool [6] and manually adjusted the alignment with the GeneDoc tool [7]. We reconstructed a phylogenetic tree with the maximum likelihood method, based on the JTT matrix-based model [8], with 1000 bootstrap replicates. We imported all consensus trees into MEGA 6 software [9], where we edited and visualized them as required. To detect the orthologs of the RAG1 gene, we analyzed micro-synteny across different genomes using two genome browsers, namely the NCBI Map Viewer [10] and the Ensembl genome browser [11,12].

Furthermore, we obtained human RAG1 variants from the 1092 human genomes of 14 different populations available in the 1000 Genomes Project [13]. We assessed the impact of missense variants on the human RAG1 protein using the SIFT [14] and PolyPhen-2 [15] tools, as described previously [16-19]. We determined the regulatory nature of non-coding variants using rSNPbase, a database that provides reliable and comprehensive regulatory annotations [20]; such variants are called regulatory SNPs (rSNPs).
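
As an illustration of how missense variants can be sorted by a predicted-impact score, the sketch below labels variants from their SIFT scores and tallies the result. It assumes the cutoff documented for SIFT, where a score of 0.05 or lower is predicted deleterious; the article does not state the cutoff used or how SIFT and PolyPhen predictions were combined, so this is not a reconstruction of the authors' procedure. The variant identifiers and scores are hypothetical placeholders, not entries from Table S1.

```python
from collections import Counter

# Assumption: standard SIFT cutoff (score <= 0.05 is predicted deleterious).
SIFT_CUTOFF = 0.05


def classify_by_sift(score: float, cutoff: float = SIFT_CUTOFF) -> str:
    """Label a missense variant from its SIFT score (range 0 to 1)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "deleterious" if score <= cutoff else "tolerated"


def tally_predictions(records):
    """Count predicted classes for (variant_id, sift_score) pairs."""
    return Counter(classify_by_sift(score) for _, score in records)


# Hypothetical records for illustration only.
toy_records = [
    ("toy_variant_1", 0.00),
    ("toy_variant_2", 0.05),
    ("toy_variant_3", 0.30),
]
counts = tally_predictions(toy_records)
print(counts["deleterious"], counts["tolerated"])
```

A tally of this kind over the scored missense variants is the sort of summary that underlies a count of deleterious mutations; a consensus rule across several predictors would need an additional, explicitly stated criterion.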