Draft genome of a biparental beetle species, Lethrus apterus

Background The lack of an understanding about the genomic architecture underpinning parental behaviour in subsocial insects displaying simple parental behaviours prevents the development of a full understanding about the evolutionary origin of sociality. Lethrus apterus is one of the few insect species that has biparental care. Division of labour can be observed between parents during the reproductive period in order to provide food and protection for their offspring. Results Here, we report the draft genome of L. apterus, the first genome in the family Geotrupidae. The final assembly consisted of 286.93 Mbp in 66,933 scaffolds. Completeness analysis found the assembly contained 93.5% of the Endopterygota core BUSCO gene set. Ab initio gene prediction resulted in 25,385 coding genes, whereas homology-based analyses predicted 22,551 protein coding genes. After merging, 20,734 were found during functional annotation. Compared to other publicly available beetle genomes, 23,528 genes among the predicted genes were assigned to orthogroups of which 1664 were in species-specific groups. Additionally, reproduction related genes were found among the predicted genes based on which a reduction in the number of odorant- and pheromone-binding proteins was detected. Conclusions These genes can be used in further comparative and functional genomic researches which can advance our understanding of the genetic basis and hence the evolution of parental behaviour. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07627-w.


Background
Sociality among insects is highly diverse ranging from simple interactions to complex hierarchical societies [34]. However, the social behaviours and their genetical background have been investigated mainly in eusocial species with well-ordered colonies (e.g. [67,81]). Therefore, the literature lacks studies on the molecular basis of social behaviour in simpler societies such as subsocial species [34]. Among these insects, diverse forms of parental care occur, including guarding for the eggs, food provision, protecting the freshly hatched offspring, and even biparental care with division of labour between the parents [43]. These behaviours appeared independently in 13 different insect orders, including 15 families of Coleoptera such as Scarabaeidae and Silphidae. Insect species with parental care are feasible subjects of sociogenomics aiming to understand the interaction between behaviour and genes regulating parental behaviour [63]. Therefore, knowing the mechanistic principles of parental care among subsocial insects could lead to a better understanding of social evolution [15]. Another pathway to inferring the origin of sociality is via comparative analysis of genomes of many organisms with analogous parental behaviours [13].
Beetles have evolved an extraordinary variety of life history strategies [50], including very diverse social and reproductive behaviours, from aggregation through creating nests to biparental care [9]. Despite of the fact that beetles represent the most diverse animal order, only a handful of coleopteran genomes have been published to date, including model species like the red flour beetle (Tribolium castaneum), a burying beetle, Nicrophorus vespilloides and important pests like the Colorado potato beetle (Leptinotarsa decemlineata) and the small hive beetle (Aethina tumida) [49]. However, of the sequenced beetle species, only N. vespilloides and Onthophagus taurus have parental care.
Lethrus apterus Laxmann 1770 (Coleoptera: Geotrupidae; Fig. 1) is one of the few insect species that has biparental care [12]. These beetles are only active during their breeding season, which lasts from early March to the beginning of June, and outside of this period the adults spend their time in a diapause in the soil [19]. At the beginning of their breeding season, adults choose mates with whom they excavate underground nests [64]. After this, a division of labour can be observed with females collecting leaves for each offspring in separate underground chambers while males guard the entrance of the nest from intruders, e.g. other L. apterus individuals or predators [65]. At the end of the reproductive period, adults dig themselves into the soil while the hatching larvae consume the stored leaves [33].
In this study, we report the draft de novo genome of L. apterus which is the first published genome in the family Geotrupidae. We performed functional annotation, and searched directly for potential parental behaviour regulator genes in the genome. Additionally, we investigated the single nucleotide polymorphism variants distribution based on samples from eight populations in Hungary.

Results and discussion
Assembly quality and completeness The final assembly of Lethrus apterus comprised 66,933 scaffolds with an N50 value of 8902 bp ( Table 1). The total length of the genome was estimated to be 286.93 Mbp, comparable with other beetle genomes published to date, and with the estimated size by GenomeScope (252.49 Mbp, Fig. 2). The GC content of the final assembly is 31.66%, similar to other beetle genomes. GenomeScope results showed a relatively low heterozygosity rate (0.148%) and low percentage of unique sequences (55.8%) which was probably caused due to the combined dataset (see Materials and methods) (Fig. 2). An additional peak was formed by the high frequency (0.864%) of duplicated k-mers which predicts high proportion of repeats in the genome [80]. High ratio of repeat regions together with short reads sequencing can lead to fragmented genome assemblies as repeats are often longer than the reads [54]. Therefore, low contiguity of our assembly, even after merging assemblies generated by diverse applications, is likely due to the high repeat sequence content to which the lack of a closely related reference genome contributed as well. Nevertheless, the L. apterus assembly has a high gene completeness, since 93.5% of BUSCOs from the Endopterygota database were detected ( Table 2).
The distribution of contigs analysed for their coverage and GC content state space resulted in scaffolds separated into two groups according to their read coverage (Fig. 3). Both groups contained genes identified by the BUSCO analysis. The quotient of the coverage of the two groups was 3/4 and the proportion of males and females in the combined sample (see Materials and methods) was 1:1, suggesting that the group of scaffolds with lower coverage could reflect sequences from the chromosomes. This is further supported by Fig. 4 showing that the lower coverage group of scaffolds are only present in males. This suggests an XY or X0 sex determination system in Lethrus apterus.

Genome annotation
RepeatMasker was used to identify repetitive elements in the assembly of Lethrus apterus. Results showed that a high proportion (36.44% of bases) of the genome contains interspersed repeats, most of which (71.46%) could not be classified as known repeats ( Table 3). The most abundant repeats were A-rich sequences with low complexity. Only three of the next 10 repeat classes showed significant matches with the NCBI nt database. One matched with an inverted repeat in the pannier region of the harlequin ladybeetle (Harmonia axyridis), the other two had hits with uncharacterised genome regions of the mountain pine beetle (Dendroctonus ponderosae) and the ringlet butterfly (Aphantopus hyperantus).
Ab initio gene prediction resulted in 25,385 sequences whereas homology based prediction found 22,551. After merging and filtering the two gene sets, 34,392 remained from which 20,734 were functionally annotated by Inter-ProScan or Diamond using different databases. The annotated gene set had a 1425 bp mean CDS length, 475 amino acids mean protein length and contained 5.10 exons and 4.10 introns per gene on average.
Based on the functional annotation, the potential sex chromosome related genes coded mostly proteins necessary for cell maintenance, including housekeeping genes and mitochondrial proteins, however, some transposons and retrotransposons were found. In addition, proteins involved in the innate immune responses, circadian rhythm and memory were also identified.

Annotation of reproductive behaviour related genes
Based on a literature search, 23 candidate genes were found in 21 research articles ( Table 4). Of these, 19 genes were found in coleopteran species and were stored in NCBI. All 19 candidate genes had significant hits with the predicted genes of Lethrus apterus, sequences of the hits can be found as Additional file 1. Compared to the other examined beetle species, L. apterus had lower number of hits of odorant-binding and pheromonebinding proteins. These molecules play a significant role in recognition of the signals of the environment, such as food resources or recognition of conspecifics [21]. The loss of these genes may be a great starting point of a research on the evolution of olfactory perception among dung beetles, however, we should note that the low   number of hits could be caused by the fragmented genome hence further investigation would require the involvement of other data, such as RNA sequencing. In addition, two of the 19 genes, namely troponin C, and octopamine receptor were found among genes located on potentially sex chromosomes and thus may serve as targets for future research on gene regulation of reproductive behaviour.
Fragmented genome assembly can lead to overprediction of paralogous genes, especially in case of gene families with high number of similar members [44]. The results of the candidate gene search, however, showed that the number of hits in genes of Lethrus apterus and the related species were similar, suggesting that the low contiguity of the assembly did not influence the gene prediction.

Comparison with other coleopteran species
Fourteen coleopteran proteomes available on NCBI and predicted Lethrus apterus genes were used to perform comparison of orthologous genes by Orthofinder. Based on the results, 357,992 genes (95.9% of the total number of genes) were assigned to 23,528 orthogroups. All species were present in 4754 orthogroups of which 44 included only single-copy genes. 9618 orthogroups were speciesspecific from which 607 (consisting of 1664 genes) were specific to L. apterus. Of the predicted genes, 436 were not assigned to any orthogroups. Finally, 42.1% of the orthogroups contained L. apterus genes. Our phylogenetic results are in line with those relationships described in [88,89]. Based on the species trees reconstructed with two independent methods, Nicrophorus vespilloides appeared to be the sister taxa of Lethrus apterus + Onthophagus  taurus. Monophyly of these groups received a high statistical support in all of our analyses (Fig. 5). This branching not only marks the divergence of Staphylinoidea and Scarabeoidea, but also separates those only three species in our dataset that have biparental care. Further studies are now needed to more precisely decipher the origin of biparental care among beetles.

Variant calling
The three approaches used (see Materials  Based on the principal component analysis, the samples forming our Susa dataset were grouping together which supports our theory that there should be low variation between these populations ( Figure S1). One interesting finding was that one population, namely the Debrecen population was separated from all of the other populations. This clustering can serve as great basis of future studies thusthese variant loci can serve as a resource for future population genomics analysis in Lethrus apterus.

Conclusion
We have reported the genome of Lethrus apterus, the first genome in the family Geotrupidae. Although the assembly is highly fragmented, probably due to the repetitive nature of the genome, it has a high level of gene completeness. This beetle species is a good model for division of labour during parental care. All genes related to reproductive and parental behaviours published so far were located in the final genome assembly, therefore, its use for future investigation for the genetic architecture of parental care among insects should be pursued. Further, potential sex chromosome related genes were identified which can be useful for

Sampling and DNA extraction
In the spring of 2016, 32 individuals were collected from eight locations (2 males and 2 females from each location) across Hungary. Samples were taken from thorax muscles from each individual. Each tissue sample was ground in a clean mortar using pestle and liquid nitrogen and when grinding was finished 430 μl of extraction buffer was added. According to Gilbert et al. [23], the buffer contained 3 mM CaCl2, 2% sodium dodecyl sulphate (SDS), 40 mM dithiotreitol (DTT), 250 μg/ml proteinase K, 100 mM Tris buffer pH 8 and 100 mM NaCl (final concentrations). The solution was suspended into a clean 1.5 ml microcentrifuge tube and 50 μl RNase was added. Suspensions were vortexed and centrifuged for 30 s at 400 G. Samples were incubated overnight at 56°C until no tissue clumps were visible. After the mixtures were cooled down at room temperature, 0.5 volume of 7.5 M ammonium-acetate was added to each sample and vortexed, and then put in a − 20°C freezer for 10-15 min. After this, 1 volume of chloroform:isoamylalcohol 24:1 was added to each sample and the tubes were vortexed for 2 s. Samples were centrifuged for 5 min at 18000 G and the supernatant was carefully transferred to a clean tube. This step was repeated once more. Two volumes of ice cold 96% ethanol were added and samples were mixed by inverting, then incubated on − 20°C overnight. Tubes were centrifuged at 18000 G for 10 min and the liquid phase was carefully removed from the pellets, then 500 μl 70% ethanol was added. After inverting, samples were centrifuged at 5000 G for 3 min, then the pellets were dried at room temperature. Finally, 50-100 μl of 10 mM Tris-EDTA (pH = 9.0) buffer was added to dissolve the pellets.

DNA library preparation and sequencing
The sequencing library was prepared using 300 ng of DNA fragmented with a Bioruptor sonicator (Diagenode) with high mode 10 cycles setting (30 s on/30 s off). For the library preparation the TruSeq® Nano DNA LT Sample Preparation Kit (Illumina) was used following the manufacturer's protocol. The individually barcoded libraries were diluted to 10 nM and two pools of 16 samples each were prepared for sequencing. Paired-end sequencing of 125 bp reads was performed on a HiSeq2500 machine at the EMBL GeneCore Facility (Heidelberg).
Individual sequence coverage was on average low (~10x), therefore, the reads of 12 individuals collected in neighbouring regions (Susa, Uraj and Ózd) were combined to achieve a higher coverage that would facilitate genome reconstruction. The base of the sample choice was their close spatial proximity, hence less variation was expected between them. Hereafter, we refer to this dataset of combined samples as the Susa dataset.

Quality control and filtering
The read quality of the Susa dataset was checked with FastQC (v0.11.5, [3]). The proportion of bases with a base quality of 30 or higher was above 90% in all samples. Trimmomatic (v0.36, [8]) was used to eliminate any adapter sequences from the reads, with at least 97% of the reads passing the trimming process. Next, the Susa reads were combined into two FASTQ files, one for the forward and one for the reverse sequence, and because these reads derived from different individuals, Musket (v1.1, [41]) was used to reduce the variation introduced by combining samples. After this correction, the quality of the dataset was checked with SGA-PreQC (v0.10.15, [72]). Results showed 20% were PCR duplicates, high mean quality score (>Phred 35), low sequencing error rate (< 5 × 10 − 5 ), a peak of 51-mer distribution at count 70, and also predicted high repeat branch frequency similar to the human and oyster genomes. Additionally, GenomeScope (v1.0 [80];), which works very well on short reads, was run on the 17-mer count histogram of the Susa dataset after Musket produced using Jellyfish (v2.2.6 [47];) to check heterozygosity and repeat region contant.

Genome assembly
MEGAHIT (v1.1.1, [38]) was used (minimum k-mer size 51, maximum k-mer size 111, k-mer step size 20) to generate the initial assembly of the Susa reads and contig length was increased through running three cycles of SSPACE (v3.0, [6,7]) alternating with GapFiller (v1.10, [7,54]). Another assembly of the Susa reads was run using SOAPdenovo2 (v2.04, [46]) with the default k-mer size. These two assemblies were then merged with gam-nsg (v1.1b, [79]) to improve contig sizes. The final assembly was obtained by removing contaminant and duplicated contigs as reported by GenBank Contamination Screen and the potentially poor quality contigs shorter than 501 bp as short contigs can bias downstream analyses, e.g. mislead the gene prediction methods and the variant calling.
To assess the completeness of our assemblies, BUSCO (v4.1.2, [71]) was run to identify the proportion of the Endopterygota gene set that was found in the Lethrus genome. For this, the Tribolium castaneum gene set (tri-bolium2012) was used for algorithm training. The distribution of coverage and GC along the contigs was also analysed in R statistical environment (v3.5.2 [61];).

Genome annotation
The distribution of repetitive sequences was analysed with RepeatModeler (version open-1.0.8, [73]) and RepeatMasker (version open-4.0.7, [74]). First, a de novo repeat library was built using RepeatModeler. RepeatMasker was used for repeat analysis based on RepeatModeler output and Repbase (version 20170127).
For gene prediction in the soft masked genome assembly, ab initio and homology-based prediction methods were combined with the BRAKER pipeline (v2.1.5 [26,27];). Ab initio prediction was carried out using Augustus (v3.3.3 [75,76];). For homology-based prediction, hints for GeneMark-EP (GeneMark-ES Suite v4.57_lic [45];) were generated with ProtHint (v2.4.0 [10];) based on the endopterygota_db10. Augustus was then trained with these sequence hints with default settings and used for prediction. This was followed with another round of training and prediction based on the first iteration results and GeneMark hints. Predicted coding sequences containing premature stop codons and shorter than 50 amino acids were filtered out.

Analysis of reproductive behaviour related genes
To identify genes potentially involved in parental behaviour, a literature search was carried out on 12th May 2020 on Web of Science using the following "Topic" keywords: (insect* AND ("reproductive behavior" OR "reproductive behaviour" OR "parental care") AND (gene OR genes)). This resulted in 96 articles from which a list of candidate genes was extracted on the basis of the genes being identified as playing a role in insect reproductive behaviour. NCBI was searched and filtered for complete and reliable protein sequences (not partial, predicted, putative, hypothetical, unknown or uncharacterised) of coleopteran species. These sequences were used as query for searching homologous proteins among the predicted genes of Lethrus apterus using Diamond (in sensitive mode, with e-value set to 10 − 5 and query coverage to 40%). Besides, the same search was run against the proteomes of the following species: Tribolium castaneum which has a well annotated genome; Nicrophorus vespilloides that is a model species for parental care among insects; Onthophagus taurus, the most closely related species with an available genome; Asbolus verrucosus which has a highly fragmented genome, therefore can be a scale for the accuracy of our result from the assembly of L. apterus.

Variant identification
Cleaned reads for each individual sample were aligned to the final assembled genome using BWA mem (v0.7.15-r1140, [39]). The resulting sam files were converted to sorted and indexed bam files with samtools (v1.2, [40]). Since base correction might cause bias in the number of variants, three different methods were used for variant calling which was followed by rigorous filtering steps. The first set of variants were called with the samtools mpileup bcftools call pipeline (bcftools v1.3, [40]). Variants were also identified with GATK (v3.6.0.g89b7209, [48]) and freebayes (v1.1.0-3-h961e5f3, [22]). To run GATK, picard was used to mark duplicates, create a sequence directory of the reference genome, add to read group information and build a bam index. Then GATK Haplotypecaller was run for each sample separately and the results were combined with GenotypeGVCFs. For freebayes, all samples were called simultaneously.
Freebayes produced the highest number of polymorphisms (see Results), therefore, we filtered them to remove loci which had at least one variant with a missing genotype, a minimum alternate count of less than two, a minimum coverage of less than four or a maximum coverage higher than 25. After this filtering, loci which had variants with Phred based genotype likelihood less than 20 were marked as missing, i.e. they were removed from the data of the given sample. From these loci only those which were also called by GATK and samtools were retained. From the filtered loci found by all three methods, those positioned in low complexity (identified using dustmasker v1.0.0, [53]) or repetitive regions were removed. From the remaining variants, only SNPs were retained.
Additional file 1. Amino acid sequences of genes that are potentially involved in the regulation of reproductive behaviours in Lethrus apterus.
Additional file 2: Figure S1. Principal component analysis plot of samples from eight populations of Lethrus apterus. The first two components (PCs) are plotted and the sample names are included. Eigenvalues are shown in the bottom right corner. The red ellipse includes the samples forming the Susa dataset.