MEHunter: transformer-based mobile element variant detection from long reads

Abstract
Summary
Mobile genetic elements (MEs) are heritable mutagens that significantly contribute to genetic diseases. The advent of long-read sequencing technologies, capable of resolving large DNA fragments, offers promising prospects for the comprehensive detection of ME variants (MEVs). However, achieving high precision while maintaining recall remains challenging, mainly because MEV signatures vary in length, resemble one another in content, and are often obscured by the noise in long reads. Here, we propose MEHunter, a high-performance MEV detection approach utilizing a fine-tuned transformer model adept at identifying potential MEVs with fragmented features. Benchmark experiments on both simulated and real datasets demonstrate that MEHunter consistently achieves higher accuracy and sensitivity than the state-of-the-art tools. Furthermore, it is capable of detecting novel, potentially individual-specific MEVs that have been overlooked in published population projects.
Availability and implementation
MEHunter is available from https://github.com/120L021101/MEHunter.


Introduction
Mobile genetic element variants (MEVs) account for approximately 25% of structural variations (SVs) in the human genome (Gardner et al. 2017), encompassing elements such as long interspersed nuclear element 1 (L1), Alu, and SINE-VNTR-Alu (SVA) elements. Active MEVs act as insertional mutagens that can alter genetic traits, potentially disrupting gene function and leading to various genetic disorders (Kojima et al. 2023).
Long-read sequencing technologies represent a significant advancement over traditional next-generation sequencing (NGS) by offering extended sequence lengths. This capability enhances genome-spanning ability, thereby providing a detailed resolution of SVs across a broad spectrum of scales and types (Porubsky and Eichler 2024). Characterized by their variable lengths and often only partial or fragmented sequence components, MEVs present a particular challenge for detection. Current algorithms such as rMETL (Jiang et al. 2019), Palmer (Zhou et al. 2020), and xTea (Chu et al. 2021), while effective in some contexts, can still struggle because they do not adequately parse the nuanced sequence content that is crucial for accurately identifying MEVs. This underscores the need for more refined detection methods capable of addressing the complex nature of MEVs.
Herein, we introduce MEHunter, a long-read-based tool that detects mobile element insertions and deletions (MEIs/MEDs) using a fine-tuned transformer model. MEHunter not only enhances the accuracy of MEV detection but also provides researchers with the flexibility to focus on specific MEV events according to their study objectives. MEHunter represents a significant step forward in the detection of MEVs, promising to open new possibilities in genomic and clinical studies.

Materials and methods
MEHunter identifies MEVs through the following four steps. i) MEHunter utilizes a modified version of cuteSV (Jiang et al. 2020) to precisely and exhaustively identify generic SV characteristics (e.g. loci, signatures, and genotypes) from the Binary Alignment Map (BAM) files; ii) MEHunter clusters the extracted features of the SVs and employs abPOA (pyabpoa v1.4.3) (Gao et al. 2021) to build the consensus sequence for each cluster; iii) MEHunter uses the consensus sequences along with known ME sequences as inputs for a lightweight, modified Smith-Waterman (SW) algorithm to achieve first-round MEV classification; iv) for the remaining unclassified consensus sequences, MEHunter uses minimap2 (Li 2018) as a preclassifier to exclude completely unrelated sequences and applies a fine-tuned DNABERT2 (Zhou et al. 2023) to enhance the detection of potential MEVs. Please also refer to Supplementary Figs S1 and S2 for schematic illustrations and consult the Supplementary Notes for more detailed information on the implementation of MEHunter.
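To make step iii) more concrete, the following is a minimal sketch of a first-round classification based on the textbook Smith-Waterman local alignment score. It is not MEHunter's modified SW implementation: the scoring constants, the `min_fraction` threshold, the function names, and the toy "ME library" sequences are all hypothetical and chosen only for illustration.

```python
# Illustrative scoring scheme (assumed values, not taken from MEHunter).
MATCH, MISMATCH, GAP = 2, -3, -4


def smith_waterman_score(query, target):
    """Return the best local alignment score using a linear gap penalty."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (MATCH if q == t else MISMATCH)
            score = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def classify_consensus(consensus, me_library, min_fraction=0.8):
    """Return the best-matching ME family, or None if no family scores high enough.

    The score is normalized by the maximum achievable score for the shorter
    of the two sequences; min_fraction is a hypothetical cutoff.
    """
    best_family, best_frac = None, 0.0
    for family, me_seq in me_library.items():
        max_possible = MATCH * min(len(consensus), len(me_seq))
        if max_possible == 0:
            continue
        frac = smith_waterman_score(consensus, me_seq) / max_possible
        if frac > best_frac:
            best_family, best_frac = family, frac
    return best_family if best_frac >= min_fraction else None


# Toy sequences for demonstration only; these are not real ME consensus sequences.
toy_library = {
    "Alu_toy": "GGCCGGGCGCGGTGGCTCA",
    "L1_toy": "TTTTAAAACCCCGGGGTTAA",
}
print(classify_consensus("GGCCGGGCGCGGTGGCTCA", toy_library))
print(classify_consensus("ATATATATATATATATAT", toy_library))
```

In this sketch, a consensus sequence that fails the cutoff returns `None`, which corresponds loosely to the "remaining unclassified" sequences that step iv) passes on to minimap2 and the fine-tuned DNABERT2 model.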

Results and discussion
To evaluate performance in identifying MEVs, we conducted a comprehensive comparative analysis of MEHunter, rMETL (v1.0), Palmer (v2.0.0, termed Palmer2), and xTea (v0.1.0) on both simulated and real long-read datasets. Palmer2 was excluded from the comparison due to its relatively lower computational efficiency. Moreover, it consistently failed to report the MEVs of the SVA class, as the program crashed.

Assessment on simulated datasets
PacBio HiFi-like and Oxford Nanopore Technologies (ONT)-like long-read sequencing datasets at four sequencing depths (5×, 10×, 20×, and 30×) were simulated using an in silico diploid human genome. The genome includes 20 000 MEVs comprising Alu, SVA, and L1 elements, alongside 5000 ordinary SVs, as detailed in Section 2.1 of the Supplementary Notes. For MEHunter, rMETL, Palmer2, and xTea, default parameters were utilized to call MEVs, except for adjustments to the number of supporting reads (specified by the -s parameter), as outlined in Supplementary Tables S1 and S2.
Overall, MEHunter exhibited exceptional performance, achieving F1 scores exceeding 99.42% for both MEIs and MEDs across various 30× sequencing datasets (as shown in Fig. 1a and detailed in Supplementary Tables S1 and S2). This marks at least a 16.48% improvement over the scores attained by rMETL, Palmer2, and xTea. Notably, MEHunter also maintained consistent genotyping accuracy, with F1 scores surpassing 98.43%. Moreover, MEHunter obtained the lowest false discovery rates (FDRs) on the simulated datasets. It is also worth noting that Palmer2 and xTea only detect MEIs and cannot report the corresponding genotypes.
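For readers less familiar with the metrics quoted above, the sketch below shows the standard definitions of precision, recall, F1 score, and FDR in terms of true positives, false positives, and false negatives. The function name and the counts in the usage example are hypothetical and are not taken from the benchmark; how calls are matched to the ground truth is described in the Supplementary Notes and is not reproduced here.

```python
def detection_metrics(tp, fp, fn):
    """Compute standard detection metrics from call counts.

    tp: calls that match a ground-truth MEV
    fp: calls with no matching ground-truth MEV
    fn: ground-truth MEVs that were not called
    """
    called = tp + fp
    truth = tp + fn
    precision = tp / called if called else 0.0
    recall = tp / truth if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fdr = fp / called if called else 0.0  # FDR = 1 - precision when called > 0
    return {"precision": precision, "recall": recall, "f1": f1, "fdr": fdr}


# Made-up counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, fn=10)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, a tool can only reach a high F1 when both are high, which is why it is a useful single-number summary alongside the FDR.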
Furthermore, it is well-documented that the ability to detect MEVs diminishes with reduced sequencing depth.
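One simple way to see why depth matters is to ask how often a variant collects enough supporting reads to be called at all. The sketch below assumes that the number of reads supporting a variant follows a Poisson distribution with mean equal to depth times allele fraction. This is an idealized model introduced here for illustration only, not the one used by MEHunter or the other tools; the function name and the threshold of three supporting reads are likewise hypothetical.

```python
import math


def prob_enough_support(depth, min_support, allele_fraction=0.5):
    """Probability of observing at least min_support variant-supporting reads.

    Assumes (for illustration) Poisson-distributed support with
    mean = depth * allele_fraction; 0.5 models a heterozygous variant.
    """
    lam = depth * allele_fraction
    p_below = sum(
        math.exp(-lam) * lam**k / math.factorial(k) for k in range(min_support)
    )
    return 1.0 - p_below


# Compare the four simulated depths with a hypothetical support threshold of 3.
for depth in (5, 10, 20, 30):
    print(depth, round(prob_enough_support(depth, min_support=3), 4))
```

Under this toy model, the chance of reaching a fixed support threshold drops sharply at low depth, which is consistent with adjusting the number of required supporting reads per coverage level.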

Figure 1. Benchmarking the performance of MEV detection on simulated (sim-) and real (real-) long-read sequencing data. (a) Evaluations across varying coverages of simulated PacBio HiFi and ONT long reads. (b) Evaluations across different coverages of authentic PacBio HiFi and ONT long reads for the HG00731 human individual. (c) Distribution of MEV identification rates for the HG00731 sample, categorized by the presence of SVs shared among 32 individuals. (d) Benchmark results of MDRs for trio data (HG00731, HG00732, and HG00733). (e) Benchmarking results for elapsed time and memory footprint using 15× HG00731 PacBio HiFi data. In the figure, "N" and "N-GT" indicate the statistics without and with genotyping, respectively.