Dataset of genome identification and characterization of microsatellite markers loci in Atriplex atacamensis and Atriplex deserticola

In this work, we partially sequenced the genomes of two Atriplex species (A. deserticola Phil. and A. atacamensis Phil.) using Illumina technology (HiSeq 2500 paired-end system) and a de novo assembly strategy. Raw data for A. deserticola and A. atacamensis are available from NCBI BioProject under accessions PRJNA495747 and PRJNA495763, respectively. A total of 127086 and 134984 microsatellite, or simple sequence repeat (SSR), markers were identified in A. deserticola and A. atacamensis genomic DNA, respectively. Predicted putative genes in the A. deserticola and A. atacamensis sequences are also presented in this article.


Data
Raw partial genome sequencing data for A. deserticola and A. atacamensis were produced by de novo sequencing on an Illumina HiSeq 2500 system. The data were then quality trimmed, filtered and assembled (assembly statistics are presented in Tables 1 and 2).
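As an illustration of how summary statistics for an assembly can be derived from contig lengths, the following minimal sketch computes the contig count, total length, longest contig and N50. Which statistics Tables 1 and 2 actually report is not stated here, so the choice of N50 is an assumption, and the function names and toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running sum of
    lengths, taken from longest to shortest, reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def assembly_summary(lengths):
    """Collect a few common assembly statistics (hypothetical helper)."""
    lengths = list(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest_bp": max(lengths, default=0),
        "n50_bp": n50(lengths),
    }


# Toy contig lengths, not taken from the dataset.
toy_lengths = [100, 200, 300, 400, 500]
print(assembly_summary(toy_lengths))
```

In practice the lengths would be read from the assembled FASTA files rather than typed in by hand.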
Dinucleotide to hexanucleotide repeat microsatellite sequences were identified for A. deserticola and A. atacamensis (Table 3). However, only SSRs with a repeat motif size of 2 to 8 bp and a length of ≥12 bp were considered. This includes dinucleotide repeats ≥6, trinucleotide repeats ≥4, and tetra-, penta-, hexa-, hepta- and octanucleotide repeats ≥3. We analyzed the distribution of A. deserticola and A. atacamensis SSR data with regard to motif length, type and number of repeats (Tables 4 and 5, Fig. 1). Primer pairs were designed from the flanking sequences of di- to octanucleotide microsatellites of A. deserticola and A. atacamensis (S1 and S2 Tables).
Specifications Table
Subject area: Plant biology
More specific subject area: Plant genomic sequencing and bioinformatics
Type of data: Tables, graphs and
Value of the data
The identified genomic sequences are an important resource for genetic, genomic and evolutionary studies and will support Atriplex conservation and breeding programs.
These newly developed microsatellite loci of A. atacamensis and A. deserticola can be useful for molecular studies, including the construction of linkage maps, QTL mapping and association mapping for these two species.
Predicted genes of A. deserticola and A. atacamensis can be compared with the known genome sequences of similar or closely related organisms to identify key similarities or differences and to investigate the function of a particular gene.
The partial gene sequences and SSR markers obtained from high-throughput next-generation sequencing (NGS) of A. atacamensis and A. deserticola genomic DNA constitute the first platform for genetic and molecular studies of these plant species.
The contig and singleton A. deserticola and A. atacamensis genomic sequences were analyzed with AUGUSTUS software [1,2], using A. thaliana as the model organism, to predict putative genes (Table 6). For functional annotation, the potential coding regions were analyzed with WEGO [3], leading to consistent gene annotations, gene names, gene products and Gene Ontology (GO) numbers (Fig. 2 and S3 Table).
The graph is based on a total of 127086 and 134984 SSR markers detected in non-redundant genomic DNA of A. deserticola and A. atacamensis, respectively. Di, tri, tetra, penta, hexa, hepta and octa refer to dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, hexanucleotides, heptanucleotides and octanucleotides, respectively.
Table 6. Putative genes found in partial sequences of A. deserticola and A. atacamensis predicted by AUGUSTUS software.

Inc., Valencia, CA, USA), following the manufacturer's protocols. DNA quality and quantity were checked by agarose gel electrophoresis and by spectrophotometric measurement of UV absorbance at 260 and 280 nm, with the 260/280 and 260/230 absorbance ratios determined on an Infinite M200 Pro NanoQuant (Tecan Group US, Inc., Morrisville, NC, USA).
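The absorbance ratios mentioned above are simple quotients of the readings at each wavelength. The sketch below shows the arithmetic. The acceptance thresholds (about 1.8 for 260/280 and about 2.0 for 260/230) are commonly quoted rules of thumb for DNA and are an assumption here, not values stated in this article. The function names and example readings are hypothetical.

```python
def purity_ratios(a230, a260, a280):
    """Compute the two absorbance ratios commonly used to judge DNA purity."""
    return {"260/280": a260 / a280, "260/230": a260 / a230}


def looks_pure(ratios, min_260_280=1.8, min_260_230=2.0):
    """Apply rule-of-thumb thresholds (assumed, not taken from the article)."""
    return ratios["260/280"] >= min_260_280 and ratios["260/230"] >= min_260_230


# Hypothetical absorbance readings for one sample.
ratios = purity_ratios(a230=0.5, a260=1.0, a280=0.54)
print(ratios, looks_pure(ratios))
```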

Next-generation sequencing
The Illumina paired-end library was prepared with the Illumina TruSeq DNA PCR-Free 350 bp Library Preparation Kit (Illumina, San Diego, CA, USA). The library was sequenced on an Illumina HiSeq 2500 sequencer (Macrogen Inc., Seoul, Korea) using the TruSeq Rapid SBS Kit or the TruSeq SBS Kit v4 (Illumina, San Diego, CA, USA). Reads were 126 nt long and were sequenced from both ends of each fragment.
Raw data of A. deserticola and A. atacamensis are available from NCBI-Bioproject, PRJNA495747 and PRJNA495763 accessions, respectively.

In silico identification of putative SSRs and primer design
We analyzed perfect and imperfect SSRs. The contig sequences obtained in FASTA files were screened for a repeat motif size range of 2–6 bp and a length of >12 bp. This included dinucleotide repeats ≥6, trinucleotide repeats ≥4, and tetra-, penta-, hexa-, hepta- and octanucleotide repeats ≥3, using the MIcroSAtellite identification software (MISA) [7,8]. The program allows direct primer design with Primer3 [9] by searching for microsatellite repeats and primer annealing sites in the flanking regions (S1 and S2 Tables).
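To make the repeat thresholds concrete, the following minimal sketch scans a sequence for perfect SSRs with regular expressions, using the minimum repeat counts listed above for di- to octanucleotide motifs. It is not the MISA implementation and does not handle imperfect or compound repeats. The function name, the `MIN_REPEATS` table and the toy sequence are hypothetical.

```python
import re

# Minimum number of repeat units per motif size (bp), as listed in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3}


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs built from a shorter unit (e.g. "ATAT" or "AA"),
            # so each repeat is reported once under its shortest motif.
            if any(
                size % k == 0 and motif == motif[:k] * (size // k)
                for k in range(1, size)
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence: (AT)6 and (AGC)4 embedded in flanking bases.
toy = "GGC" + "AT" * 6 + "CCG" + "AGC" * 4 + "T"
print(find_perfect_ssrs(toy))
```

Start positions are 0-based. Flanking sequence for primer design could then be taken from either side of each reported interval.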

Putative A. deserticola and A. atacamensis gene prediction
Putative genes were predicted with AUGUSTUS software [1,2] from the contig and singleton genomic sequences of A. deserticola and A. atacamensis. The program is based on a hidden Markov model and is used for the ab initio prediction of protein-coding genes in eukaryotic genomes. Arabidopsis thaliana (L.) Heynh. was used as the model organism. WEGO (Web Gene Ontology Annotation Plot) software was then used to functionally annotate the potential coding sequences or predicted genes [3]. The predicted genes were inspected manually to maximize the accuracy of gene prediction. The genes encoding predicted proteins were scored against the NCBI non-redundant (NR), UniProt and GO databases. Matches were selected with an E-value ≤ 1e-5 and ≥40% sequence identity.
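The match selection criteria can be sketched as a simple filter over tabular search results. The example below assumes the hits are available in the 12-column tab-separated layout produced by BLAST+ with `-outfmt 6` (percent identity in column 3, E-value in column 11). The search tool and output format are not stated in this article, so that layout is an assumption, and the function name and example rows are hypothetical.

```python
import csv
import io

MAX_EVALUE = 1e-5     # E-value cutoff from the text
MIN_IDENTITY = 40.0   # percent identity cutoff from the text


def filter_hits(handle, max_evalue=MAX_EVALUE, min_identity=MIN_IDENTITY):
    """Keep rows that pass both the E-value and percent identity cutoffs."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity = float(row[2])   # column 3: percent identity
        evalue = float(row[10])    # column 11: E-value
        if evalue <= max_evalue and identity >= min_identity:
            kept.append((query, subject, identity, evalue))
    return kept


# Made-up rows: g2 fails on identity, g3 fails on E-value.
example = io.StringIO(
    "g1\tsubjA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "g2\tsubjB\t35.0\t150\t97\t1\t1\t150\t5\t154\t1e-20\t90\n"
    "g3\tsubjC\t60.0\t80\t32\t0\t1\t80\t1\t80\t0.01\t30\n"
)
print(filter_hits(example))
```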