Draft genome sequence data of maqui (Aristotelia chilensis) and identification of SSR markers

Maqui (Aristotelia chilensis [Molina] Stuntz) is a small dioecious tree belonging to the Elaeocarpaceae family. Maqui fruit has high antioxidant activity, attributed to its elevated anthocyanin and polyphenol content. Here we describe draft genome sequence data of maqui (A. chilensis). The genomic sequence datasets were obtained using the Illumina NextSeq platform. Nucleotide sequences of the raw reads and the assembled draft genome are available at NCBI's Sequence Read Archive under BioProject PRJNA544858. In addition, a total of 210,067 microsatellite or simple sequence repeat (SSR) markers were identified.


Data
Here we describe raw sequence reads, an assembled draft genome and an SSR analysis obtained from genomic DNA of maqui (A. chilensis). Both the raw data and the assembled draft genome are available at NCBI's Sequence Read Archive under BioProject PRJNA544858 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA544858). The genomic DNA was obtained from fresh leaves of maqui. Using a library with a 300 bp insert size and paired-end-tag DNA sequencing on the Illumina NextSeq 550 platform, around 187 million 2 × 151 bp reads were generated. Quality trimming and filtering of the data with FastQC v0.11.5 removed reads containing more than 5% unknown nucleotides, low-quality reads (reads in which more than 50% of the bases had a Q-value ≤ 20), all unpaired reads and short reads (<35 bp). After this step, 95.87% of the total reads were suitable for genome assembly (Table 1). A draft genome of maqui was obtained through de novo assembly using MaSuRCA software [1] (see Table 2).
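The filtering criteria listed above can be written as a simple per-read predicate. The following is a minimal sketch in Python, not the pipeline actually used for this dataset: the function names, the in-memory `(sequence, qualities)` representation and the reading of the low-quality cutoff as "Q-value of 20 or lower" are all assumptions made for illustration.

```python
def passes_filter(seq, quals, max_n_frac=0.05, low_q=20,
                  max_low_frac=0.50, min_len=35):
    """Return True if one read survives the per-read criteria.

    seq   -- read sequence as a string
    quals -- list of Phred quality scores, one per base (assumed same length)
    All thresholds mirror the text; the parameter names are hypothetical.
    """
    # Short reads (<35 bp) are discarded; this also guards against empty reads.
    if not seq or len(seq) < min_len:
        return False
    # Reads with more than 5% unknown nucleotides (N) are discarded.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # Reads with more than 50% low-quality bases are discarded
    # (assumption: "low quality" means Q <= 20).
    low = sum(1 for q in quals if q <= low_q)
    if low / len(quals) > max_low_frac:
        return False
    return True


def filter_pairs(pairs):
    """Keep a read pair only if both mates pass, so no unpaired reads remain."""
    return [(r1, r2) for r1, r2 in pairs
            if passes_filter(*r1) and passes_filter(*r2)]
```

Requiring both mates to pass is one simple way to satisfy the "all unpaired reads" criterion, since a surviving read whose mate was discarded would otherwise become unpaired.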
The final genome assembly had a total length of 326 Mb, comprising 58,451 scaffolds, with a mean coverage of 140X. The scaffold N50 of this assembly was 13.2 kb, and unclosed gap regions represented 0.08% of the assembly. In addition, the G + C content of the genome assembly, excluding gaps, was estimated to be 35.13%. The assembled draft genome was constructed using 343,326,678 (95.68%) of the raw sequence reads.
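For readers less familiar with these summary statistics, the sketch below shows how scaffold N50, G + C content excluding gaps, and the gap fraction are conventionally computed from a set of scaffold sequences. It is an illustrative Python example with hypothetical function names, not the code that produced the values reported here.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def gc_content(seqs):
    """Percent G + C over unambiguous bases only (gap characters such as N are ignored)."""
    gc = at = 0
    for s in seqs:
        s = s.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


def gap_fraction(seqs):
    """Percent of assembly positions that are unclosed gaps (assumed to be written as N)."""
    total = sum(len(s) for s in seqs)
    gaps = sum(s.upper().count("N") for s in seqs)
    return 100.0 * gaps / total if total else 0.0
```

With scaffold lengths of 2, 3, 4, 5 and 6 bp (20 bp in total), the two longest scaffolds together reach 11 bp, which passes the halfway point, so `n50` returns 5.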
To check the draft genome, raw transcriptomic sequence reads from maqui were downloaded from the NCBI database (BioProject PRJNA255387) and mapped to the draft genome using the HISAT2 alignment program [2]; 93.61% of the filtered RNA sequences were mapped.
The assembled draft genome of maqui was used to identify microsatellite sequences or simple sequence repeats (SSRs) (Table 3). Dinucleotide to hexanucleotide repeat microsatellite sequences, with repeat motif sizes ranging from 2 to 6 bp and a length ≥12 bp, were considered. This includes dinucleotide repeats ≥6, trinucleotide repeats ≥4, and tetra-, penta- and hexanucleotide repeats ≥3. A total of 210,067 perfect SSR markers were identified in maqui (Table 3). Among the identified SSRs, dinucleotide motifs (54.87%) were the most common, followed by tetranucleotide (17.73%) and trinucleotide motifs (15.7%) (Table 4). We also examined the distribution of maqui microsatellites with regard to motif length, motif type and number of repeats (Fig. 2). A total of 111,531 primer pairs were designed from flanking sequences of di- to hexanucleotide microsatellites of maqui (A. chilensis) and are available in Table S1.

Value of the data
Data of raw sequence reads and the assembled draft genome of maqui (Aristotelia chilensis) contribute to establishing a genomic platform for this plant species.
Draft genome data can facilitate the identification of the molecular mechanisms that underlie the properties of maqui products, and thereby help to improve them by classical and/or biotechnological approaches.
The draft genome data will accelerate functional genomics research in this species.
The newly developed SSR marker dataset of maqui should be a useful tool to assess its genetic diversity and understand its genetic structure, facilitating the implementation of an effective conservation system for its natural populations.

Genomic DNA extraction
Genomic DNA of maqui (A. chilensis) was extracted as described by Bastias et al. [4], using the DNeasy Plant Mini Kit (Qiagen) following the manufacturer's instructions.

DNA sequencing
Paired-end-tag de novo DNA sequencing was performed on the Illumina NextSeq 550 platform. Approximately 187 million 2 × 151 bp reads were generated from a library with a 300 bp insert size.

Table 1. Dataset of maqui (A. chilensis) reads obtained by Illumina NextSeq 550 sequencing before and after filtering.

Genome assembly
De novo assembly of the clean reads was then performed to generate contigs and scaffolds. We used MaSuRCA (http://www.genome.umd.edu/masurca.html) [1] with an optimized k-mer length of 85, calculated with KmerGenie [6]. Assembly statistics were obtained with QUAST (Quality Assessment Tool for Genome Assemblies) [7].

Assessing genome assembly completeness with benchmarking universal single-copy orthologs (BUSCO)
The completeness of the assembled A. chilensis genome was assessed with BUSCO [3] against the embryophyta database, which consists of 1375 orthologs constructed from 60 species.

Identification of putative SSRs and primer design
We analyzed perfect SSRs. The contig sequences, in FASTA format, were screened with PERF software [8] for repeat motifs of 2–6 bp with a total length of ≥12 bp. This includes dinucleotide repeats ≥6, trinucleotide repeats ≥4, and tetra-, penta- and hexanucleotide repeats ≥3. The program allows for direct primer design using Primer3 [9] by searching for microsatellite repeats and primer annealing sites in the flanking regions.
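To make the screening criteria concrete, the following minimal Python sketch scans a sequence for perfect SSRs using the minimum repeat numbers stated above (6 for dinucleotides, 4 for trinucleotides, 3 for tetra- to hexanucleotides). It is an illustration only and does not reproduce the PERF implementation. The function names, the tuple output format and the decision to skip motifs that are themselves repeats of a shorter unit (for example AA or ATAT) are assumptions of this sketch.

```python
# Minimum number of repeats per motif size, as stated in the text.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_reducible(motif):
    """True if the motif is a repeat of a shorter unit (e.g. 'AA', 'ATAT', 'ACGACG')."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(seq):
    """Return sorted (start, end, motif, repeats) tuples, 0-based with end exclusive."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        i = 0
        # Stop once the shortest qualifying SSR would no longer fit.
        while i + size * min_rep <= len(seq):
            motif = seq[i:i + size]
            # Skip motifs with ambiguous bases and motifs that belong
            # to a shorter motif class (assumption of this sketch).
            if set(motif) - set("ACGT") or is_reducible(motif):
                i += 1
                continue
            # Count uninterrupted tandem copies of the motif.
            reps = 1
            while seq.startswith(motif, i + reps * size):
                reps += 1
            if reps >= min_rep:
                hits.append((i, i + reps * size, motif, reps))
                i += reps * size  # continue after this SSR
            else:
                i += 1
    return sorted(hits)
```

With these thresholds every reported SSR is at least 12 bp long, because 2 × 6, 3 × 4 and 4 × 3 all equal 12, while penta- and hexanucleotide SSRs start at 15 and 18 bp. For example, `find_perfect_ssrs("GG" + "AT" * 6 + "CC")` reports a single (AT)6 repeat, while the same sequence with only five AT copies yields no hit.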