Data of SSRs primers for high-throughput genotyping-by-sequencing (SSR-Seq) based on the partial genome assembly of Eugenia klotzschiana (Myrtaceae)

The neotropical fruit plant Eugenia klotzschiana Berg. is endemic to South America and occurs in the Brazilian savannah, a biome threatened by intensive agriculture. The species is included in the Brazilian list of Plants for the Future. The fruits of E. klotzschiana have high nutritional value and antioxidant activity and are consumed in natura or processed into juice or jelly. However, the fruits are harvested predominantly from native areas, and further studies are needed before large-scale commercialization. Nuclear genomic data and population genetic tools are still scarce for the species. Here, we provide data on the first partially assembled genome of E. klotzschiana (211 Mbp, ∼75.16% genome coverage, N50 = 3,407, and 46.8% BUSCO completeness), the raw Illumina sequencing reads, and two sets of primers for high-throughput genotyping-by-sequencing of microsatellites (SSR-Seq) identified in the nuclear genome. These genomic resources are fundamental for conservation strategies for this species and for the development of a future breeding program.


Table
Subject: Biology
Specific subject area: Genomics, Horticultural science
Type of data: Whole-genome sequence raw data, partial genome assembly, and primers designed and evaluated in silico for candidate microsatellite markers for high-throughput genotyping-by-sequencing (SSR-Seq)
How the data were acquired: High-throughput DNA sequencing (Illumina MiSeq)
Data format: Raw sequence reads (fastq), partial genome assembly (fasta), and primers (

Value of the Data
• This dataset provides the first partial genome sequence for Eugenia klotzschiana Berg. (Myrtaceae, Myrteae). The partially assembled genome of E. klotzschiana can be used as an initial reference in future studies in evolutionary biology, comparative genomics, and plant breeding. It can also be used in the development of molecular markers for large-scale genotyping, mainly in population genetics and genomics studies.
• The genomic data made available here are a partial genome assembly for Eugenia (∼75.16% of the mean Eugenia genome size). Our assembly is the most contiguous and complete genome (considering the number of genes) obtained so far for the genus.
• The primers designed for SSR high-throughput genotyping-by-sequencing are genetic tools that were validated in silico. After proper laboratory testing, they have the potential to be useful in many population genetics applications, such as the genetic characterization of entire populations with less sequencing effort on the Illumina platform. They therefore enable fast and concise analysis of genetic diversity in E. klotzschiana.

Data Description
The neotropical species Eugenia klotzschiana Berg. (Myrtaceae, Myrteae), popularly known as "pêra-do-cerrado", is an important genetic resource from the Brazilian Cerrado. The species is on the Brazilian list of Plants for the Future [1] as a priority for cultivation and marketing because of its nutritional and antioxidant value. The fruits of E. klotzschiana are used in the production of jellies and juices and have high economic value. Although E. klotzschiana has no record of  ( Fig. 1 ). This assembly of E. klotzschiana shows greater completeness and less missing data than that of E. uniflora, making it a reference genomic resource for the genus Eugenia. However, the level of genome fragmentation is 23.4%, indicating that new sequencing approaches are needed to obtain a more complete genome. Physical measurements of the genome size of E. klotzschiana are still lacking, but the genome size predicted from the k-mer distribution is 282 Mb, meaning that the assembly provided in this work covers approximately 75.16% of the expected genome size. The E. klotzschiana assembly has a genome depth of 12.98. The predicted genome size is similar to that of eight other species of the subtribe Eugeniinae, which average 258.74 Mbp as estimated by flow cytometry [2]. The partial genome is available at NCBI under accession number JANUWG000000000, and the raw data are available at NCBI under accession number SRR21159655. MISA identified 14,287 microsatellite regions, comprising 11,662 dinucleotides, 1,767 trinucleotides, 505 tetranucleotides, 193 pentanucleotides and 160 hexanucleotides. Additionally, using QDD, we isolated 9,087 sequences containing microsatellite regions, of which 327 were retained after filtering and used for primer design. We provide two sets of primers for SSR-Seq that were tested only in silico.
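As a quick consistency check on the MISA counts reported above, the per-class totals can be tallied and expressed as proportions. The sketch below is illustrative only: the names `ssr_counts` and `ssr_share` are assumptions introduced here, and the numbers are those given in the text.

```python
# Counts of microsatellite regions per motif class, as reported in the text.
ssr_counts = {
    "dinucleotide": 11662,
    "trinucleotide": 1767,
    "tetranucleotide": 505,
    "pentanucleotide": 193,
    "hexanucleotide": 160,
}

# Total number of SSRs across all motif classes.
total = sum(ssr_counts.values())

# Share of each motif class among all identified SSRs, in percent.
ssr_share = {name: 100 * count / total for name, count in ssr_counts.items()}

if __name__ == "__main__":
    print(f"total SSRs: {total}")
    for name, share in ssr_share.items():
        print(f"{name}: {share:.1f}%")
```

The per-class counts sum to the reported total of 14,287, and dinucleotide repeats account for the large majority of the identified regions.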
The first set of primers (Ekl_SSR-Seq-1) is composed of 44 primers with a minimum cross-dimerization energy of -7 ΔG ( Table 2 ), of which 25 primer pairs show a minimum cross-dimerization energy greater than -6 ΔG and constitute the second set (Ekl_SSR-Seq-2; Table 3 ). The primers can be used on different Illumina sequencing platforms by adding platform-specific tags: 5′-TCGTCGGCAGCGTCAGATGTGTATAAGACAG-3′ for the forward primers and 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGACAG-3′ for the reverse primers. SSR-Seq can accelerate genetic studies in natural populations because it allows rapid identification of polymorphism. Validating these sets of SSR-Seq primers in natural populations will make it possible to investigate polymorphism in both the size and the nucleotide sequence of the repeat units. SSR-Seq can therefore help resolve size homoplasy in microsatellite regions [3], which arises when mutations are misinterpreted because alleles are identical by state rather than identical by descent [4]. In addition, combining high-throughput sequencing with SSR-Seq makes it possible to multiplex samples in a single sequencing run using individual-specific barcodes, which can be applied to population genetics to increase the amount of data and reduce costs [5].
Table 2. First set of microsatellite primers (Ekl_SSR-Seq-1) with minimum cross-dimerization > -7 ΔG for Eugenia klotzschiana high-throughput genotyping-by-sequencing.

DNA Extraction and Sequencing
Leaves were collected from a single adult specimen of E. klotzschiana in a natural population in Senador Canedo, Goiás, Brazil (16°37′32.197″ S, 49°4′22.696″ W), with the voucher UFG0020120 deposited at Herbarium UFG. The samples were also registered in the National System for the Management of Genetic Heritage and Associated Traditional Knowledge (SisGen) under the authorization number A7D3EC4. Leaf tissues were dehydrated on silica gel and stored in a -80 °C freezer, followed by total DNA extraction with the CTAB protocol [6]. DNA integrity was assessed by 1% agarose gel electrophoresis, and the DNA was quantified using the Qubit 2.0 fluorometer (ThermoFisher). The genomic library was constructed with the SureSelectQXT kit (Agilent Technologies). Sequencing was performed on the Illumina MiSeq platform using the MiSeq Reagent V3 kit (600 cycles) in 2×300 bp paired-end mode.

Gene Annotation, SSR Identification and Primer Multiplex Construction
Two approaches were used to characterize microsatellite regions. The first used the MIcroSAtellite identification software (MISA) [10,11], available at https://webblast.ipk-gatersleben.de/misa/, to identify and characterize SSRs with minimum repeat numbers of 10, 8, 6, 6 for dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide motifs, respectively. Afterwards, the QDD software was used to select candidate regions for designing microsatellite primers in the context of SSR-Seq. In QDD, the minimum repeat numbers were the same as those applied in MISA [12]. Primers were designed with Primer3 as implemented in QDD, using the following parameters: (i) PCR product size between 120 and 200 bp; (ii) primer size (minimum - optimal - maximum) of 18 bp - 20 bp - 23 bp; (iii) melting temperature (minimum - optimal - maximum) of 48 °C - 55 °C - 62 °C; (iv) primer GC content (minimum - optimal - maximum) of 20% - 50% - 80%; (v) maximum melting temperature difference of 1 °C [12,13]. Primers could be designed for 9,087 sequences; primers meeting any of the following criteria were then removed: (i) trinucleotide SSR motif; (ii) SSR with repeats containing 3 or more adenines in a row (AAA*); (iii) SSR motif with 100% AT content; (iv) SSR located less than 20 bp from the primers; (v) SSR in the context of transposable elements. After filtering, 327 SSRs with a primer pair remained and were used to define a set of primers for multiplexing in the FastPCR software [14]. The single multiplex of 99 primer pairs resulting from the FastPCR analysis was then evaluated by in silico PCR using openPrimeR [15] on Docker ( https://hub.docker.com/r/mdoering88/openprimer/ ). The assembled contigs were used as sequence templates, and the designed primers were placed in a fasta file with forward and reverse sequences to be analyzed separately.
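The five exclusion criteria listed above can be sketched as a simple predicate. This is a minimal illustration and not the authors' pipeline: the function name `passes_ssr_filters`, its arguments, and the assumption that the SSR-to-primer distance and the transposable-element context are already known for each locus are all hypothetical. Criterion (ii) is read here as "the repeat tract contains three or more consecutive adenines", which is one possible interpretation.

```python
def passes_ssr_filters(motif: str, n_repeats: int,
                       dist_to_primer_bp: int,
                       in_transposable_element: bool) -> bool:
    """Return True if an SSR locus survives all five exclusion criteria."""
    motif = motif.upper()
    # Build the repeat tract so adenine runs spanning motif copies are seen.
    repeat_tract = motif * n_repeats

    if len(motif) == 3:               # (i) trinucleotide motif
        return False
    if "AAA" in repeat_tract:         # (ii) 3 or more adenines in a row
        return False
    if set(motif) <= {"A", "T"}:      # (iii) motif with 100% AT content
        return False
    if dist_to_primer_bp < 20:        # (iv) SSR closer than 20 bp to primers
        return False
    if in_transposable_element:       # (v) SSR within a transposable element
        return False
    return True
```

For example, under these assumptions `passes_ssr_filters("AG", 12, 25, False)` keeps the locus, whereas an `AT` repeat, an `AAAG` repeat, or any locus less than 20 bp from its primers is discarded.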
Two sets of primers were tested: the first with a minimum cross-dimerization energy of -7 ΔG (Ekl_SSR-Seq-1, Table 2 ) and the second including only primers with a minimum cross-dimerization energy of -6 ΔG (Ekl_SSR-Seq-2, Table 3 ), which is a subset of the first set. For both sets, the main settings for the evaluation and in silico PCR followed the program's defaults with slight differences: the optimal primer size was set to 18-22 bp, up to 5 mismatches were allowed between the primer sequence and the template, and mismatches were forbidden in the last 6 bp of the primer's 3′ end. After the evaluation, primers that bound to regions other than the target, primer pairs in which only one primer orientation bound the target region, and primers that did not fulfill the desired physicochemical properties were discarded. The remaining primers were used to build the two multiplex sets presented in Tables 2 and 3 . For genotyping-by-sequencing on Illumina platforms, specific tags must be added to the primers: 5′-TCGTCGGCAGCGTCAGATGTGTATAAGACAG-3′ for the forward primers and 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGACAG-3′ for the reverse primers.
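Adding the platform tags amounts to prepending each tag to the 5′ end of the corresponding locus-specific primer. The sketch below shows one way to do this; the function name `add_illumina_tags` and the example primer sequences are hypothetical, and the tag sequences are copied as quoted in the text, so they should be checked against the current Illumina documentation before oligos are ordered.

```python
from __future__ import annotations

# Platform tags as quoted in the text (written 5'->3').
FORWARD_TAG = "TCGTCGGCAGCGTCAGATGTGTATAAGACAG"
REVERSE_TAG = "GTCTCGTGGGCTCGGAGATGTGTATAAGACAG"


def add_illumina_tags(forward_primer: str, reverse_primer: str) -> tuple[str, str]:
    """Prepend the platform tags to the 5' end of a locus-specific primer pair."""
    fwd = forward_primer.strip().upper()
    rev = reverse_primer.strip().upper()
    for seq in (fwd, rev):
        # Reject empty strings and anything that is not plain A/C/G/T.
        if not seq or set(seq) - set("ACGT"):
            raise ValueError(f"unexpected primer sequence: {seq!r}")
    return FORWARD_TAG + fwd, REVERSE_TAG + rev


if __name__ == "__main__":
    # Placeholder sequences for illustration only, not primers from Tables 2 or 3.
    tagged_fwd, tagged_rev = add_illumina_tags("ACGTACGTACGTACGTAC",
                                               "TTGGCCAATTGGCCAATT")
    print(tagged_fwd)
    print(tagged_rev)
```

Keeping the tags as module-level constants makes it straightforward to swap in different platform-specific sequences without touching the primer tables.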

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Data on the draft genome assembly of Eugenia klotzschiana (Myrtaceae) (Original data) (NCBI and in the paper).