A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Mun, Seyoung; Kim, Songmi; Lee, Wooseok; Kang, Keunsoo; Meyer, Thomas J.; Han, Bok-Ghee; Han, Kyudong; Kim, Heui-Soo

doi:10.1038/s12276-021-00586-y

Download PDF

Article
Open access
Published: 08 April 2021

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Seyoung Mun^1,2,3,
Songmi Kim^1,2,
Wooseok Lee³,
Keunsoo Kang⁴,
Thomas J. Meyer^5,6,
Bok-Ghee Han⁷,
Kyudong Han ORCID: orcid.org/0000-0001-6791-2408^2,3,4 &
…
Heui-Soo Kim⁸

Experimental & Molecular Medicine volume 53, pages 615–630 (2021)Cite this article

2970 Accesses
6 Citations
2 Altmetric
Metrics details

Subjects

Abstract

Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.

A pangenome reference of 36 Chinese populations

Article Open access 14 June 2023

Building a Chinese pan-genome of 486 individuals

Article Open access 30 August 2021

The complete and fully-phased diploid genome of a male Han Chinese

Article Open access 14 July 2023

Introduction

As next-generation sequencing (NGS) technologies developed by Illumina¹, Pacific Biosciences², and Ion Torrent³ have been commercialized, genomic research on a variety of species, including humans, has accelerated⁴. Studies related to various genetic diseases, including cancers and phenotypic traits, have yielded many accurate human individual genome datasets^5,6,7,8. The integration of these genomic datasets, which are mostly comprised of whole-genome resequencing and whole-exome sequencing data, has allowed the exploration of a wide range of complex variation types, such as single-nucleotide polymorphisms (SNPs), insertions and deletions (INDELs), inversions, rearrangements, copy number variations (CNVs), and structural variations mediated by transposable elements (TEs)^9,10,11.

It is well established that TEs account for at least 50% of the human genome¹². They are among the key factors contributing to structural variations and are classified as transposons (cut-and-paste) or retrotransposons (copy-and-paste) on the basis of their mechanisms of mobilization^13,14,15,16. Among TEs, non-long terminal repeat (non-LTR) retrotransposons, such as Alu elements, long interspersed elements (LINEs), SINE-R/VNTR/Alu-like (SVA) elements, and LTR retrotransposons, including human endogenous retroviruses (HERVs), are still retrotranspositionally competent in the human genome^17,18,19,20. Comparative genomic studies of primate genomes show that TEs have played a significant role in shaping primate genomic architecture^21,22,23,24. The fact that some TEs are currently active in the human genome indicates their potential to structurally alter the human genome. Examples are known whereby TEs have influenced the expression and structure of gene products by supplying alternative splice donor/acceptor and transcription binding sites^25,26. As recognition of the importance of TEs has grown, many researchers have studied the relationship between TE-associated genetic variations and the risks of carcinogenesis^27,28,29 and other genetic disorders^30,31. Furthermore, new methods using high-throughput NGS methods, such as TE capture sequencing and bioinformatic techniques, have allowed more detailed determination of TE contributions to the human genome^32,33. Despite considerable progress in our understanding of the biology and distribution of TEs and their impact on the structure, function, and evolution of the human genome, a great deal of uncertainty still remains concerning differences between individuals. As many previous studies used only short sequence reads for sequence mapping and repeat assembly, their ability to interrogate certain types of variations was limited. These challenging tasks include the identification of private structural variations, high copy number gene families, INDELs of intermediate size, and especially TEs, whose high copy number and sequence identity can cause problems for both mapping and assembly^34,35. To overcome these limitations in genome analysis based on short reads, a practical method for the de novo assembly of large genomes from short sequence reads using SOAPdenovo2 was developed. This method employs short-read sequencing in parallel with multiple insert-size libraries, cost-effectively enabling not only the construction of de novo genomes but also the accurate detection of structural variations. The SOAPdenovo2 assembler consists of six systematic steps, as follows: read-error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) read mapping, scaffold construction, and gap closure. In the DBG-based sequence assembly method, it is very important to choose a proper k-mer size for the determination of optimized contigs. The use of a large k-mer size can solve the problem of short repetitive sequences when the sequencing depth is sufficient and can improve the quality of the resulting contig assembly. Complementarily, the use of small-sized k-mers can reduce some negative effects on assembly related to sequencing error and heterozygosity. Utilizing the advantages of both large and small k-mers during assembly, the SOAPdenovo2 assembler provides an effective solution for large genome assembly and identification of TE-associated structural variations (TASVs)^36,37.

Here, we report the de novo-assembled draft genome and genetic information for a Korean individual (KPGP9) generated using the SOAPdenovo2 assembler with multiple insert-size libraries (Supplementary Fig. S1). Using this pipeline, the draft genome for KPGP9 was constructed from raw sequencing data, yielding a genome of ~2.86 Gb in length, with 105.16X coverage and a scaffold N50 size of 13.23 Mb. Here, we provide a detailed description of the TASVs identified from this draft genome and show a total of 271 TASVs that have contributed to the generation of an appreciable proportion of genomic variation between human individuals.

Materials and methods

DNA sample preparation and genome sequencing

The blood sample was donated from a Korean male (KPGP9) following the Korean ethical guidelines for human genome research. We extracted genomic DNA using the QIAamp DNA Blood Mini Kit (Qiagen, Germany) according to the manufacturer’s instructions. Libraries of various insert sizes were constructed using paired-end (insert size: ~170, 500, and 800 bp) and mate-pair (insert size: 2, 5, 10, 20, and 40 kb) methods. Size selection for the intended libraries was conducted using a Covaris S2 Ultrasonicator system (Covaris, USA). The final libraries were sequenced on an Illumina Hi-Seq 2500 platform using the TruSeq Paired-End Cluster Kit v3 (Illumina, USA). A total of 315.47 Gb of sequence data were produced using 9 lanes. The read lengths of paired-end and mate-pair libraries were 90–100 bp and 49 bp, respectively (Supplementary Fig. S1 and Supplementary Table S1).

Raw data filtering

Prior to genome assembly, we carried out sequence filtering to reduce assembly error due to low-quality reads. The reads with low quality were filtered out as follows: (i) reads with multiple poly-A sequences or ambiguous bases (represented by the letter N); (ii) reads with ≥40% low-quality bases (base quality ≤7) in paired-end libraries (~170, 500, and 800 bp) or mate-pair libraries (2, 5, 10, 20, and 40 kb); (iii) reads containing adaptor sequences; (iv) cases in which read1 and read2 overlapped by ≥10 bp; (v) low-quality bases trimmed using SICKLE (https://github.com/najoshi/sickle); and (vi) PCR duplications that were identified using PICARD (http://picard.sourceforge.net/index.shtml). Thus, we used only high-quality sequence reads for further analysis.

De novo assembly of the KPGP9 genome

The corrected reads were used for de novo assembly using the De Bruijn graph assembly algorithm implemented in SOAPdenovo ver. 2.04 with default parameters. The resulting assembled genome had a total length of 2.86 Gb and an N50 scaffold length of 13.23 Mb³⁸ (Table 1). First, sequence reads from short-insert libraries were used for the construction of a De Bruijn graph. In the contig step of SOAPdenovo2 assembly, we excluded erroneous connections such as clip tips, low-coverage links, small repeats, and merged bubbles with the following parameters: “contig -g inputGraph -R -M 1 -D”. Second, we realigned the qualified reads to contig sequences to measure the amount of shared paired-end relationships between contigs, followed by the construction of the scaffolds stepwise from the short-insert size to long-insert-size paired-ends. Finally, we used paired-end reads to close the gaps within the constructed scaffolds³⁶. This application of de novo genome assembly combined with stringent filtering and conditioning of the data allowed us to construct a high-quality KPGP9 draft genome. Alignment of large-scale scaffold blocks to the human reference genome was carried out using algorithms in SyMAP software (http://www.agcol.arizona.edu/software/symap/, v4.2), which is one of the tools for detecting orthologous segments of the genome^39,40. The large-scale scaffold regions were finally visualized using the webtool idiographica (http://www.ncrna.org/idiographica/)⁴¹. The de novo-assembled genome data for KPGP9 and the coordinates of scaffolds aligned to the human reference genome are available to download at ftp://210.102.196.176.

Table 1 Summary of KPGP9 de novo assembly statistics.

Full size table

Repeat annotation

Tandem repeats in the KPGP9-assembled scaffolds were identified using Tandem Repeats Finder13 (http://tandem.bu.edu/trf/trf.html, version 4.04), with parameters set to “2 7 7 80 10 50 2000 -f -m -h –d”⁴². To screen TEs in the genome, a homology-based repeat element search was implemented using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker, v4.0.6) and Repbase, a database of known repeats (http://www.girinst.org/repbase)^43,44. All repeats were sequentially categorized into their relevant types.

Data mining and manual inspection of TASVs

We sorted events by the ages of repeat subfamilies, reasoning that the youngest TE subfamilies are most likely to be still active in the human genome. We extracted sequences for LINE-1 (L1HS), Alu (10 AluY subfamilies), HERV (HERV-K), and SVA (SVA A to F) elements, along with approximately 2–4 kb of flanking region from the KPGP9 scaffolds on either side of the TE. Sequence extraction was performed by a custom Perl script. The extracted candidate sequences were mapped to the human reference genome. Initially, a total of 19,224 putative TASVs (14,548 insertion and 4676 deletion candidates) were identified by comparison of KPGP9 to the human reference genome. Candidate insertions less than 100 bp in size were filtered out, but candidate deletions were included in the downstream analysis without size selection.

To isolate precise TASVs, we first filtered out uncertain candidates containing either ambiguous sequences (represented by the letter N) or small insertion sizes (<100 bp) or lacking target site duplications (TSDs), which are the universal hallmark of retrotransposons generated at both ends of most TE insertions⁴⁵. In this way, we were able to avoid assembly bias and systematic errors to effectively screen putative TASV insertions^34,35. The flanking sequences were utilized as queries for BLAST-Like Alignment Tool (BLAT) (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start) searches against the human genome assembly (hg19; February 2009 freeze)⁴⁶. We examined the alignment of the flanking sequences with the corresponding regions in other nonhuman primate genomes, including chimpanzee (panTro5; May 2016), gorilla (gorGor5; May 2016), and orangutan (ponAbe2; July 2007 freeze), to ensure that the TASV candidates were unique to the KPGP9 genome. Repeat elements in the flanking sequences of TASV candidates from the KPGP9 genome and its counterparts from the human reference genome were annotated with RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker)^43,47. In this way, we filtered out the TASV insertions previously reported and shared with the KPGP9 genome. To refine the set of TASV deletion candidates, we initially identified deletion regions in scaffolds by comparing them with the human reference genome (hg19) (Supplementary Fig. S1). Next, the flanking sequences on either side of these deletion breakpoint loci were subjected to RepeatMasker analysis (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker)⁴⁷. Next, we selected and analyzed only those candidates anchoring repeat elements in both flanking sequences of the breakpoint. Candidates that were unrelated to the TE-mediated deletion were filtered out according to internal sequence variations in the element, incompletely assembled regions (i.e., N stretches), or variations caused by simple repeats. Manual inspection was performed to identify the mechanism of each using alignments with their flanking sequences using BioEdit software v.7.0.5.3⁴⁸.

Experimental validation of precise TASVs

To validate all TASVs, we conducted PCR amplification and DNA sequencing as described below. Oligonucleotide primers for PCR amplification of each locus were designed using Primer3 software⁴⁹. PCR amplification of each locus was performed with initial denaturation at 95 °C for 10 min; 35 cycles of 30 s of denaturation at 95 °C, 30 s at the annealing temperature, and 1 to 4 min of extension at 72 °C (depending on the expected size of the PCR product); and a 5 min final extension at 72 °C. Each PCR was performed in a 25 μL reaction containing 10 μL of 2X Lamp Pfu DNA Polymerase (BioFact, South Korea), forward and reverse oligonucleotide primers (10 pmol/μL each), and more than 20 ng of target template DNA. PCR confirmation analyses were conducted in two different human genomic DNA panels. One DNA panel purchased from the Coriell Institute for Medical Research was composed of 80 human individuals (20 individuals from each of four different populations: Asian, South American, European, and African American). The other panel was composed of 38 genomic DNA samples that were extracted from whole blood of healthy Korean donors. KPGP9 DNA was used as a control in every round of experiments. Each DNA sample was qualified and quantified by using a spectrometer (TITERTEK, Germany). The PCR products were loaded on a 2% agarose gel for electrophoresis, stained with EcoDye™ Staining Solution (Biofact, South Korea), and visualized using UV fluorescence. For several candidates involving stretches of ambiguous sequence (Ns) or otherwise identified as uncertain amplicons, the PCR products were confirmed with Sanger DNA sequencing using an ABI 3500 Genetic Analyzer (Applied Biosystems, USA). We gratefully acknowledge the Center for Bio-Medical Engineering Core Facility at Dankook University for providing equipment, including a Sanger sequencer.

Comparison analysis with the public genomic variant database

In comparing the TASVs with large-scale genomic variants, we used public data for the AK1 genome⁵⁰, The Cancer Genome Atlas (TCGA)⁵¹, and the 1000 Genomes SV dataset⁸, downloaded from dbVar (estd219)⁵². BEDTools (v2.27.0) software⁵³ was used to identify the overlap between TASVs and genomic features such as CNVs, TE insertions, and deletions from the public database. To verify the precision of the TASV position from the KPGP9 genome, the predicted insertion and deletion sites in other data were compared, and the margin of error that results from comparing the corresponding TASV sites with AK1 and TCGA was calculated.

Inferring clinical impacts and visualizing the identified TASVs from the KPGP9 genome

To observe risk genes associated with phenotypic abnormalities and disease, all genes associated with TASV insertion or deletion events were searched in the DisGeNET⁵⁴ and OMIM databases⁵⁵. We analyzed sequence homology for TASV insertions harbored in the UTR regions of the genes through a search of the miRBase database⁵⁶. The ideogram figure indicating the positions along chromosomes of common SNPs in three Koreans and our TASV insertions and deletions were created using the webtool idiographica (http://www.ncrna.org/idiographica/)⁴¹.

Alu and L1 densities in genic region

To measure the Alu and L1 densities within the human genes, we used RefSeq gene data from the UCSC human genome (hg19). We extracted all repeat information from the RepeatMasker track using the UCSC genome table browser. Alu and L1 insertions in genic regions were extracted by comparison with the locations of human reference genes. The composition ratio of Alu and L1 elements in genes was calculated based on length (Alu and L1/gene length).

Ethical approval

This study was approved by the Central Ethics Committee on Research involving human subjects of the Genome Research Foundation (GRF), an Institutional Review Board (IRB) certified by the Korean office of Human Research Protection (IRB number 20101202-001).

Results

Sequencing and de novo assembly of a Korean individual genome

We generated sequencing data from total of nine libraries, including three paired-end (PE) libraries and six mate-paired (MP) libraries, with the Illumina HiSeq 2500 platform. The raw sequence data yielded by these PE and MP libraries consisted of 201.27 Gb and 114.2 Gb, respectively (Supplementary Table S1). A total of 284.24 Gb of high-quality reads (94.7X coverage) covered approximately 99.23% of the human reference genome (hg19; http://genome.ucsc.edu/), as measured by mapping-based methods (Supplementary Table S2). De novo assembly of the KPGP9 genome was performed using the De Bruijn graph assembly algorithm with SOAPdenovo ver. 2.04^36,38. This highly accurate de novo assembly resulted from a total of 1,439,891 contigs and 31,505 scaffolds, including the longest scaffold of length 75,412,104 bp (Table 1). A total of 555 large-scale scaffold blocks were used for genome alignments (Supplementary Table S3 and Supplementary Fig. S2). As with this result, we constructed the de novo draft genome with sufficiently long scaffolds for the detection of TASVs from the Korean KPGP9 genome. We investigated the composition of the repeat elements using RepeatModeler 1.0.8 (Supplementary Table S4).

Computational data mining of TE-associated structural variations (TASVs)

To investigate TASVs in the KPGP9 genome, we focused on two different structural variations derived from TE insertions and TE-mediated genomic deletions. We initially extracted all TE-containing sequences within the scaffolds corresponding to four retrotransposon families (LINE-1, Alu, LTR retrotransposon, and SVA elements), along with their 5’ and 3’ flanking sequences. A total of 14,548 TASV insertions and 4676 TASV deletion candidates were initially selected from all TE-containing genomic regions. As described in the Materials and Methods, uncertain insertion candidates were excluded. After BLAT analysis, we found 352 nonreference TE insertions, 97 of which were previously reported in the Database of Retrotransposon Insertion Polymorphisms (dbRIP; http://falcon.roswellpark.org:9090)⁵⁷. Thus, our analysis of TASV insertion candidates identified a total of 255 novel, nonreference TE insertions in the KPGP9 genome (Table 2).

Table 2 Summary of TASV insertions in the KPGP9 individual genome.

Full size table

For genomic deletions by retrotransposons, we focused on two mechanisms known to produce human genomic rearrangements: nonallelic homologous recombination (NAHR) and nonhomologous end joining (NHEJ), both of which can be mediated by retrotransposons^24,58,59. NAHR between LINE-1 or Alu repeats has also emerged as a major driver of copy number variation (CNV) in primate genomes^23,24. DNA double-strand breaks (DSBs) are typically one of the most dangerous types of DNA damage in the human genome and can be repaired by TEs, especially LINE-1 and Alu elements, using either NHEJ or homologous recombination (HR)⁶⁰. In previous studies, it has been reported that retrotransposon-mediated genomic deletions may be linked to certain cancers and genetic disorders and have contributed to the evolution of genetic differences among and between human populations⁶¹. After a comparative analysis of the KPGP draft genome and the human reference genome, 1355 and 3321 retrotransposon-mediated deletion candidates associated with LINE-1 and Alu elements, respectively, were detected. After manual inspection using BLAT and multiple sequence alignment analysis, the numbers of deletions associated with L1 and Alu were 40 and 240, respectively. Thus, we categorized each event as belonging to one of four categories: 131 NAHR-Alu recombination-mediated deletions (ARMDs), 109 NHEJ-ARMDs, seven NAHR-L1 recombination-associated deletions (L1RADs), and 33 NHEJ-L1RADs (Table 3).

Table 3 Summary of TASV deletions in the KPGP9 individual genome.

Full size table

Experimental validation for TASVs

To experimentally validate our computationally detected TASV insertions and deletions, we designed target-specific primers for a total of 255 insertion and 280 deletion candidates. PCR amplification was performed in a panel consisting of material from 80 non-Korean individuals (20 individuals from each of four populations) and 39 Korean individuals, including KPGP9 DNA (Supplementary Table S5). Several loci containing poly(N) stretches or displaying unexpected amplicon sizes were resequenced via Sanger sequencing. By applying the processes mentioned above, a total of 65 insertion (25.39%) and 128 deletion (45.71%) candidates were eliminated as false positives. A further eight insertion (3.12%) and 23 deletion candidates (8.12%) failed to produce any amplicons. Of the original 255 TASV insertion candidates, we were able to experimentally confirm 182 (71.37%) as authentic (Table 2 and Supplementary Data S1). In addition, 89 (31.78%) out of 280 TASV deletion candidates were confirmed as authentic (Table 3 and Supplementary Data S2). All TASV loci inspected through this approach are listed in Supplementary Table S6 and Supplementary Table S7. In summary, 271 TASVs accounting for 54.6% of the 496 TASV candidates identified through manual inspection were authentic (Supplementary Fig. S3). Our results once again indicated the importance of experimental validation in NGS-based human TASV analysis (Supplementary Fig. S4).

Among our set of 182 authentic TASV insertions, we experimentally validated 155 Alu, four L1 and six LTR retrotransposons, and two SVA insertions that exhibit the hallmarks of retrotransposition expected of classic insertions^62,63. Meanwhile, the remaining 15 TASV insertions were found to be non-classical Alu insertions (NCAIs), which were inserted into the genome via an alternative pathway involved in DSB repair. These NCAIs were characterized by the absence of TSDs and several other structural features (e.g., truncated Alu and absence of the poly-A tail) and were accompanied by deletions of the preinsertion sequence^64,65. Of the six LTR retrotransposon insertions, four were found to be solitary LTRs caused by unequal homologous recombination between the two LTRs, but the site was designated as empty in the human reference genome (the Human DNA panel we analyzed showed the empty and solitary LTR insertion alleles; with more experimental evidence, we would expect three different types of genotypes (i.e., trimorphism): absence of the HERV element, presence of the HERV element, and presence of a solitary LTR)⁶⁶. The remaining two LTR retrotransposon insertions were HERV-K insertions with internal sequence deletions⁶⁷. The homologous loci of these insertion sites in the human reference genome contained only solitary LTRs (Supplementary Data S1 and Supplementary Table S6). We investigated the associations of TASV insertion with disease and found that 11 TASV insertions were located in the intragenic regions of 10 genes closely related to 16 hereditary diseases identified in recent studies (Table 4).

Table 4 Clinical implication of genes-associated with novel TASVs in the KPGP9 genome.

Full size table

In addition, we categorized 89 TASV deletions based on their deletion mechanisms. A total of 51 NAHR-ARMDs, 26 NHEJ-ARMDs, two NAHR-L1RADs, and 10 NHEJ-L1RADs were validated as new genomic deletions (Supplementary Data S2 and Supplementary Table S7). We experimentally confirmed that all TASV deletions were polymorphic in our panel of five different human populations, including 38 Korean individuals (e.g., Supplementary Fig. S5). We believe these results support the concept that data mining of TASVs within subpopulation DNA panels can be an effective strategy for pursuing population-specific genomic research⁶⁸. In addition, our analysis of TASVs has demonstrated that these events have contributed to the genetic polymorphism observed in human populations.

Contribution of TASVs in genomic variation

As previously suggested, TEs can cause genetic diversity and instability through their insertions into genic regions or through homologous recombination between two TE copies with high sequence similarity in the human genome. To characterize these processes in our de novo genome, we ascertained the contributions of identified TASVs to genomic variations and structural alterations, which are associated with genes containing TASVs⁶⁹. A total of 182 TASV insertions by Alu, L1, SVA elements, and LTR retrotransposons are responsible for a gain of 64,232 bp of sequence within the KPGP9 genome relative to the reference genome. A total of 170 of these consisted of Alu insertions, contributing 49,608 bp of sequence, with an average length of 292 bp (range: 111 to 598 bp). A majority of the Alu insertions are members of the AluYa5 subfamily, which accounted for 26,333 bp of the sequence gain. We also found 15 NCAIs, which include insertion of the TE sequence and also genomic deletions at each insertion site. These insertions contain truncated Alu sequences with no TSDs. In total, these NCAI events resulted in the deletion of 3369 bp of genomic sequences. However, these are not simple deletions, as TE sequence is also inserted during such events. For example, one of our NCAI insertions resulted in the addition of 139 bp of sequence in the form of a truncated AluYb9 element, but also the deletion of 3270 bp of pre-integration sequence, for a net sequence loss of 3131 bp. The remaining events, consisting of four L1HS, two SVA elements, and six LTR retrotransposon insertions, contributed sequence gains of 6399, 2275, and 5950 bp, respectively, in the KPGP9 genome (Table 2).

To investigate the genetic impact of TASV insertions, we identified 80 of 182 TASV insertions (43.95%) as existing in intragenic regions by comparing the insertion positions with the RefSeq Gene database⁷⁰. Of these 80 TASV insertions, 78 are located in the intronic regions of 78 genes. Interestingly, the remaining two intragenic TASV insertions are Alu elements that have inserted into the 3′ untranslated regions (UTRs) of two genes. One inserted into the signal-regulatory protein beta 1 (SIRPB1) gene, which is a member of the immunoglobulin superfamily⁷¹. The other inserted into the xenotropic and polytropic retrovirus receptor 1 (XPR1) gene, which is known to be a phosphate exporter molecule⁷². We confirmed that these two Alu insertions into the SIRPB1 and XPR1 genes are polymorphic on our panel of five human populations, including 38 Korean individuals (Fig. 1). It was previously found that TEs, especially Alu insertions in gene regions, may be considered a potential resource for some microRNAs (miRNAs) as a rare genetic event that provides miRNA target sites for transcriptional regulation. We therefore also investigated the influence of these insertions on miRNA binding sites. Both AluYa5 insertions in the SIRPB1 and XPR1 genes were found to contain a potential binding site for the miRNA hsa-miR-619 (MIMAT0026622) in a search for homologous miRNA sequences. In recent studies, it has been demonstrated that the majority of mature hsa-miR-619-5p binding sites are located in the 3’UTRs (214 sites) of 201 human genes⁷³ and derived from the Alu element⁷⁴. This indicates that the embedded Alu insertions at the 3′ UTR in both genes may act as a translation inhibitor interacting with miRNA, a function that is highly conserved among Alu sequences⁷⁵.

**Fig. 1: Novel *Alu*Ya5 insertion events in the 3’UTRs of the *SIRPB1* and *XPR1* genes.**

We also identified 89 TASV deletions that removed a total of 82,772 bp of sequence in the KPGP9 genome. Among those occurring between Alu elements and L1s, the 51 NAHR-ARMD events accounted for a large portion of the total genomic sequence deleted (42,909 bp) from the KPGP9 genome (Table 3). Most (75%) TASV deletions were less than 1 kb in length, and only four deletion events were >5 kb (Fig. 2a). The largest genomic deletion we observed was 13,698 bp in size and was caused by an NAHR-ARMD event between two AluSz6 insertions in the intronic region of the ankyrin repeat domain 30B (ANKRD30B) gene. In addition, we investigated homologous reference sequences in the recombination breakpoints of NAHR deletions and in the microhomology sequences of NHEJ deletions. The length of homologous sequence stretches between Alu-Alu recombinants ranged from 3 to 283 bp, with 43 bp being the average. Three positions along the Alu elements participating in these events, including the Alu recombination “hotspots” previously identified based on Alu consensus sequences (AluS, AluJ, and AluY families), were found to have slightly higher recombination frequencies (Supplementary Fig. S6)²¹. We also found microhomologies ranging from 1 to 18 bp in size, which implied a microhomology-mediated end joining mechanism in 15 out of 26 Alu NHEJ-ARMD events⁷⁶. Among a total of 10 NHEJ-L1RAD events, we verified homologous sequences associated with deletions in only seven events. All details of our homologous sequence and microhomology investigations at TASV deletion loci are detailed in Supplementary Data S2 and Supplementary Table S7.

**Fig. 2: Size distribution of TASV deletion events and *Alu* subfamily composition involved in ARMD events.**

To verify which subfamilies are most commonly involved in TASV deletions, we examined the distribution of Alu and L1 subfamilies among these events. A total of 29 Alu subfamilies were associated with our ARMD events (14 AluS, 11 AluY, 2 AluJ, FLAM, and FRAM). We further found that 91 AluS insertions were among the 154 Alu insertions associated with deletions, including ~54% NAHR-ARMD and ~67% NHEJ-ARMD events. This observation of an overrepresentation of AluS insertions in ARMDs is in accord with the results of previous studies. It is perhaps unsurprising that the most successful Alu subfamily in terms of copy number (AluS) has also been involved with the most ARMD events²¹. Previous studies also found that the AluJ subfamily contributed less to the detected ARMD events, an observation thought likely to be driven by the fact that the older AluJ copies had accumulated more point mutations, thereby lowering their chances of being involved in recombination events²³. Consistent with these expectations, we also observed substantially fewer contributions of AluJ insertions to our detected deletion events (Fig. 2b). We investigated all Alu copy interaction combinations located in the homologous upstream and downstream regions of each ARMD breakpoint and found that 20 NAHR-ARMDs and 10 NHEJ-ARMDs appeared to have occurred between AluS copies. The AluS-AluY combinations were also found at high frequency, while we identified only a small number of ARMD events (4 of 77) associated with AluJ copies (Fig. 2c). For L1RADs, members of 16 different L1 subfamilies, including L1HS, were involved in L1RAD events in various combinations, as shown in Supplementary Table S7. Of our 89 authentic TASV deletion events, 40 occurred in the intronic regions of 40 genes. These included 24 NAHR-ARMDs, 11 NHEJ-ARMDs, one NAHR-L1RAD, and four NHEJ-L1RADs. The ~51% of authentic TASV deletions occurring in intronic regions are responsible for the loss of 42,538 bp of sequence.

Comparison with the published genomic databases

With careful bioinformatics identification of TASVs followed by rigorous experimental validation, we confirmed that TASVs contribute to genetic differences within different ethnic groups. The majority of TASVs in this study are polymorphic among human populations. We first compared and analyzed all TASVs in a thorough search of the Database of Genomic Structural Variation (dbVar) in NCBI⁷⁷. This database provides detailed information on human SVs from multiple studies, including the 1000 Genomes Project (1000 GP). The results showed that 26 out of a total of 182 TASV insertions and 24 out of 89 TASV deletions were previously observed but not experimentally confirmed (Supplementary Table S6 and Supplementary Table S7). Additionally, we compared and analyzed the population allele frequencies of 26 confirmed insertions and 24 deletions, which averaged 0.279 and 0.496, respectively. In particular, AluYc3_4 (ALU_umary_ALU_7707) had an average allele frequency of 0.17 in five superpopulations (African, American, East Asian, European, South Asian) and was observed only in East Asians (0.04), Africans (0.047), and South Asians (0.011). NHEJ-L1RAD_6 (UW_VH_1123) and NHEJ-L1RAD_7 (YL_CN_LWK_752) were confirmed to be dominant in all but the African population (Supplementary Table S8). Next, a comparative analysis was conducted to evaluate their distribution in recently published studies focusing on TASV insertions, such as the Korean AK1 genome⁵⁰ and TCGA⁵¹, and how accurate the identification was. Of the 10,077 insertion events found in the AK1 genome, we selected 1097 TE-associated insertion events, which are newly identified cases (707 Alu, 177 L1, 18 HERV, and 195 SVA) and compared with TASV insertions in the KPGP9 genome. Only 24 events are shared with AK1. Furthermore, among a total of 7724 TE-associated insertion events from TCGA data, 87 TASV events were shared in both. Compared to previous studies, we have confirmed that a total of 94 TASV insertions (approximately 51 %) were found uniquely in the KPGP9 genome (Fig. 3 and Supplementary Table S9). To analyze the exact insertion site between the given data, we investigated the error of the insertion sites based on the manually investigated TSD. Observing the error of the expected location of the TASV insertions, which that were shared among the three datasets, AK1 had an average margin of error of only 4 bp, but in TCGA, AK1 had a margin of error of more than 15 bp in both upstream and downstream sequences (Fig. 3). NGS analysis with short-read sequencing data is straightforward to explore for a wide range of genome variants because they generate sufficient query bulk data in a short period and have a guaranteed price advantage. However, there are still difficulties in determining exact INDEL sites, identifying the full-length sequences of insertion variations, and providing base-level resolution from NGS data. Therefore, validation tests with PCR and Sanger sequencing are needed. AK1, on the other hand, was highly efficient in identifying convoluted and repetitive TEs because it used a newly assembled genome generated using single-molecule real-time sequencing, physical mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches.

**Fig. 3: Comparisons of TASV insertions among three genome datasets.**

Discussion

In summary, the aim of this study was to establish a de novo-assembled Korean genome and to identify new TASVs through comparative analysis with the human reference genome. The use of the SOAPdenovo2 assembler based on the parallel assembly method with various insert-size libraries has enabled de novo construction of this Korean genome, which was then suitable for the identification of precise TASVs. Through this, we verified that many genomic variations associated with TEs contribute to polymorphisms in an individual Korean genome relative to the reference genome and a population panel, and we report the detailed genomic information of these TASVs.

Recent studies into the mechanisms of genetic diseases and cancer have underlined the clinical relevance of TEs, which can alter sequences through insertional mutagenesis and recombination-mediated deletion and modify the epigenetic regulation of nearby loci^69,78. To study the potential impact of TEs in our de novo genome on the genic loci associated with disease, we were able to identify 10 genes bearing 10 Alu insertions and one SVA_F insertion. These loci are associated with the 16 diseases listed in Table 4. Of note, two AluYa5 insertions are observed in the intronic region of the EPB41 gene, which encodes a red blood cell cytoskeleton protein (protein 4.1) and is associated with the autosomal dominant hereditary disease elliptocytosis⁷⁹. From previous studies, it is well known that approximately one million Alu insertions represent more than 10.6% of the human genome and that the density of Alu insertions in the genome is 306 per megabase (Mb)^80,81. Here, we estimated the exact composition of Alu and L1 elements in genic regions based on UCSC human RefSeq Gene data (hg19)⁸⁰. A total of 27,767 gene locations were involved in this calculation, and we found that 20,972 and 17,340 genes were associated with at least one Alu or L1 element, respectively. The densities of Alu and L1 elements in genic regions were 14.91% and 11.81%, respectively (Supplementary Fig. S7). The EPB41 gene contains a markedly higher density (31%) of Alu insertions. We speculate that this high density of Alu insertions could lead to genomic instability by providing many sites of high sequence similarity between which ARMD may occur in the EPB41 gene. In addition, the PRPH2 and MTPAP genes were shown to have high densities of Alu insertions (also over 30%). PRPH2 is associated with retinitis pigmentosa (RP)⁸², while MTPAP is associated with spastic ataxia-4 (SPAX4)⁸³. Interestingly, one AluYb8 insertion (chrX: 41391772-41391772) that is only commonly polymorphic in the Korean population (Fig. 4) was inserted into the CASK gene, which is closely linked to mental retardation and microcephaly with pontine and cerebellar hypoplasia (MICPCH)^84,85. A previous study found that an NAHR event between an AluJb insertion in intron 6 and an AluSq2 insertion in intron 8 was responsible for an 11,218 bp deletion in a MICPCH patient⁸⁶. Based on this association, we wondered whether the novel AluYb8 insertion could be a potential risk factor for CASK gene mutation, at least in the Korean population, because it contributes one possible site to participate in an ARMD event. Additionally, we investigated disease-associated genes that were also associated with our TASV deletions. As before, this was accomplished by using DisGeNET and OMIM, and we found three NAHR-ARMDs, one NHEJ-ARMD, and one NHEJ-L1RAD to be associated with five disease-associated genes (FA2H⁸⁷, EHMT1⁸⁸, INSR⁸⁹, SERPINF2⁹⁰, and PCCB⁹¹). Analysis of the TE density in these genes showed that the five TASV deletions occurred in regions that contained higher densities of the involved TE families than expected from average Alu and L1 densities (Alu: 14.91%; L1: 11.81%) in genic regions, based on UCSC human RefSeq gene data (hg19) (Table 4). All details of TASV insertion loci are annotated in Supplementary Data S1 and Supplementary Table S6.

**Fig. 4: Polymorphic *Alu* insertions in Asian populations.**

Despite retrotransposition and TE-mediated deletion having been noted as one of the major drivers of genomic structural variation in humans and other nonhuman primates, analyzing the TEs underlying these polymorphisms has proven challenging with high-throughput resequencing methods. As TEs are highly repetitive and abundant, detecting novel TASVs with usual strategies for mapping to the reference genome is difficult. In attempts to overcome this difficulty, various software programs for detecting TEs in WGRS data have been developed, including Tangram⁹², Mobile Element Locator Tool (MELT)⁹³, and RetroSeq⁹⁴. These tools detect novel TE insertions, including Alu, L1, SVA, and LTR transposons, by searching first for discordant read pairs and split reads in WGRS data at the genomic locations of TEs in the reference. The ability of these broadly applicable software programs has been proven by benchmarking on 1000 GP⁹⁵ and TCGA data⁹⁶.

The advent of these methods has enabled meaningful analyses of TASVs using high-throughput sequencing data in studies involving population genetics, human diseases, and clinical genomics. However, limitations remain. It can be expensive to generate the appropriately deep coverage of the whole genome required to identify these events with confidence. Furthermore, it is difficult to detect the actual breakpoint in the nonreference TE insertion region due to the interval between two aligned mate reads. The complete and interior sequences of repetitive TE insertions cannot be determined by methods that use only short-read mapping. To overcome these difficulties, it is necessary to include sequencing validation with traditional Sanger technology or synthetic long read sequencing technologies that allow the construction of synthetic long reads from short sequencing reads such as the Moleculo method⁹⁷. In the studies using AK1⁵⁰ and TCGA data⁵¹, it was possible to identify even more events because these datasets did not focus only on the most recently inserted youngest TEs but all events associated with structural variants caused by TEs. In particular, analysis of TCGA genome data of 200 normal subjects revealed approximately 800 TASV insertions per person on average. It is true that the computational framework using the short-read mapping approach, as mentioned in the results, has a wide margin of error in finding insertion locations, and it is challenging to search for TSDs and their alternative mechanisms accurately. Given these differences in analysis, our study also identified new TASV insert events in the KPGP9 individual genomes and provided more detail for each event by checking the potential mechanisms in sequence.

Although our analysis of a de novo Korean-assembled genome created from a combination of PE and MP libraries with a variety of insert sizes does not provide a large number of TASVs compared to previous studies, we have fully validated and characterized each TASV with respect to features such as TSDs, mechanism, exact size, and complete sequence. Our meticulous analysis of TASVs based on their nucleotide sequence enabled us to determine the likely mechanisms responsible for these structural variations in the KPGP9 genome. These included events derived from classical insertion, NCAI, NAHR, and NHEJ mechanisms. We speculate that these TASVs may influence the regulation of several genes. The mechanisms by which such influence could occur include direct mutation or epigenetic alteration of the enhancer region of these genes²⁶, as well as the introduction of novel splice sites by insertions into the intronic regions_ENREF_100⁹⁸. We believe that as future studies produce larger numbers of high-quality phased genomes, the genetic and phenotypic relationships of TASV events between ethnic groups or between normal and patient groups will be further validated as an important source of variation.

Data availability

The data can also be accessed through BioProject accession number PRJNA408178 for whole-genome sequence data. All raw sequence data are available at the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) with BioSample accession number SAMN07680091 and accessible through SRA accession number SRR6059039-6059046. The detailed sequence data for the TASVs are available in Supplementary Data S1 and S2.

References

Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Article CAS PubMed PubMed Central Google Scholar
McCarthy, A. Third generation DNA sequencing: pacific biosciences’ single molecule real time technology. Chem. Biol. 17, 675–676 (2010).
Article CAS PubMed Google Scholar
Rothberg, J. M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011).
Article CAS PubMed Google Scholar
Zhang, J., Chiodini, R., Badr, A. & Zhang, G. The impact of next-generation sequencing on genomics. J. Genet. Genomics 38, 95–109 (2011).
Article PubMed PubMed Central Google Scholar
Office, I. D. A. C. & Committee, I. I. D. A. Analysis of five years of controlled access and data sharing compliance at the International Cancer Genome Consortium. Nat. Genet. 48, 224–225 (2016).
Article CAS Google Scholar
Tomczak, K., Czerwinska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. 19, A68–A77 (2015).
Google Scholar
Forbes, S. A. et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783 (2017).
Article CAS PubMed Google Scholar
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
Article CAS PubMed PubMed Central Google Scholar
Handsaker, R. E., Korn, J. M., Nemesh, J. & McCarroll, S. A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet. 43, 269–276 (2011).
Article CAS PubMed PubMed Central Google Scholar
Shen, Y. et al. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res. 20, 273–280 (2010).
Article CAS PubMed PubMed Central Google Scholar
Xing, J. et al. Mobile elements create structural variation: analysis of a complete human genome. Genome Res. 19, 1516–1526 (2009).
Article CAS PubMed PubMed Central Google Scholar
Tang, W., Mun, S., Joshi, A., Han, K. & Liang, P. Mobile elements contribute to the uniqueness of human genome with 15,000 human-specific insertions and 14 Mbp sequence increase. DNA Res. 25, 521–533 (2018).
Article CAS PubMed PubMed Central Google Scholar
Kazazian, H. H. Jr. Mobile elements: drivers of genome evolution. Science 303, 1626–1632 (2004).
Article CAS PubMed Google Scholar
Cordaux, R. & Batzer, M. A. The impact of retrotransposons on human genome evolution. Nat. Rev. Genet. 10, 691–703 (2009).
Article CAS PubMed PubMed Central Google Scholar
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Article CAS PubMed Google Scholar
Ewing, A. D. Transposable element detection from whole genome sequence data. Mob. DNA 6, 24 (2015).
Article PubMed PubMed Central CAS Google Scholar
Cordaux, R., Hedges, D. J. & Batzer, M. A. Retrotransposition of Alu elements: how many sources? Trends Genet. 20, 464–467 (2004).
Article CAS PubMed Google Scholar
Brouha, B. et al. Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl Acad. Sci. USA 100, 5280–5285 (2003).
Article CAS PubMed PubMed Central Google Scholar
Shin, W. et al. Human-specific HERV-K insertion causes genomic variations in the human genome. PLoS ONE 8, e60605 (2013).
Article CAS PubMed PubMed Central Google Scholar
Ostertag, E. M., Goodier, J. L., Zhang, Y. & Kazazian, H. H. Jr. SVA elements are nonautonomous retrotransposons that cause disease in humans. Am. J. Hum. Genet. 73, 1444–1451 (2003).
Article CAS PubMed PubMed Central Google Scholar
Han, K. et al. Alu recombination-mediated structural deletions in the chimpanzee genome. PLos Genet. 3, 1939–1949 (2007).
Article CAS PubMed Google Scholar
Mills, R. E. et al. Recently mobilized Transposons in the human and chimpanzee Genomes. Am. J. Hum. Genet. 78, 671–679 (2006).
Article CAS PubMed PubMed Central Google Scholar
Sen, S. K. et al. Human genomic deletions mediated by recombination between Alu elements. Am. J. Hum. Genet. 79, 41–53 (2006).
Article CAS PubMed PubMed Central Google Scholar
Han, K. et al. L1 recombination-associated deletions generate human genomic variation. Proc. Natl Acad. Sci. USA 105, 19366–19371 (2008).
Article CAS PubMed PubMed Central Google Scholar
Buckley, R. M. & Adelson, D. L. Mammalian genome evolution as a result of epigenetic regulation of transposable elements. Biomolecular Concepts 5, 183–194 (2014).
Article CAS PubMed Google Scholar
Slotkin, R. K. & Martienssen, R. Transposable elements and the epigenetic regulation of the genome. Nat. Rev. Genet. 8, 272–285 (2007).
Article CAS PubMed Google Scholar
Belancio, V. P., Roy-Engel, A. M. & Deininger, P. The impact of multiple splice sites in human L1 elements. Gene 411, 38–45 (2008).
Article CAS PubMed PubMed Central Google Scholar
Bratthauer, G. L., Cardiff, R. D. & Fanning, T. G. Expression of LINE-1 retrotransposons in human breast cancer. Cancer 73, 2333–2336 (1994).
Article CAS PubMed Google Scholar
Lin, C. et al. Nuclear receptor-induced chromosomal proximity and DNA breaks underlie specific translocations in cancer. Cell 139, 1069–1083 (2009).
Article CAS PubMed PubMed Central Google Scholar
Beck, C. R., Garcia-Perez, J. L., Badge, R. M. & Moran, J. V. LINE-1 elements in structural variation and disease. Annu. Rev. Genom. Hum. G 12, 187–215 (2011).
Article CAS Google Scholar
O’Donnell, K. A. & Burns, K. H. Mobilizing diversity: transposable element insertions in genetic variation and disease. Mobile DNA 1, https://doi.org/10.1186/1759-8753-1-21 (2010).
Baillie, J. K. et al. Somatic retrotransposition alters the genetic landscape of the human brain. Nature 479, 534–537 (2011).
Article CAS PubMed PubMed Central Google Scholar
Witherspoon, D. J. et al. Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11, 410 (2010).
Article PubMed PubMed Central CAS Google Scholar
Treangen, T. J. & Salzberg, S. L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2012).
Article CAS Google Scholar
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
Article CAS PubMed PubMed Central Google Scholar
Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).
Article CAS PubMed PubMed Central Google Scholar
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1, 18 (2012).
Article PubMed PubMed Central Google Scholar
Luo, R. B. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 4, https://doi.org/10.1186/s13742-015-0069-2 (2015).
Soderlund, C., Bomhoff, M. & Nelson, W. M. SyMAP v3.4: a turnkey synteny system with application to plant genomes. Nucleic Acids Res. 39, e68 (2011).
Article CAS PubMed PubMed Central Google Scholar
Soderlund, C., Nelson, W., Shoemaker, A. & Paterson, A. SyMAP: A system for discovering and viewing syntenic regions of FPC maps. Genome Res. 16, 1159–1168 (2006).
Article CAS PubMed PubMed Central Google Scholar
Kin, T. & Ono, Y. Idiographica: a general-purpose web application to build idiograms on-demand for human, mouse and rat. Bioinformatics 23, 2945–2946 (2007).
Article CAS PubMed Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Article CAS PubMed PubMed Central Google Scholar
Smit AHR, G. P. RepeatMasker software. RepeatMasker Open-4.0.6. (1996–2010).
Bao, W., Kojima, K. K. & Kohany, O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
Article PubMed PubMed Central Google Scholar
Kazazian, H. H. Jr. & Moran, J. V. The impact of L1 retrotransposons on the human genome. Nat. Genet. 19, 19–24 (1998).
Article CAS PubMed Google Scholar
Kent, W. J. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
CAS PubMed PubMed Central Google Scholar
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Hall, T. A. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic acids Symp. Ser. 41, 95–98 (1999).
CAS Google Scholar
Untergasser, A. et al. Primer3–new capabilities and interfaces. Nucleic Acids Res. 40, e115 (2012).
Article CAS PubMed PubMed Central Google Scholar
Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).
Article CAS PubMed Google Scholar
Helman, E. et al. Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing. Genome Res. 24, 1053–1063 (2014).
Article CAS PubMed PubMed Central Google Scholar
Phan, L. et al. dbVar structural variant cluster set for data analysis and variant comparison. F1000Res. 5, 673 (2016).
Article PubMed Google Scholar
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
CAS PubMed PubMed Central Google Scholar
Pinero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
Article CAS PubMed Google Scholar
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
Article CAS PubMed Google Scholar
Zhong, X., Heinicke, F. & Rayner, S. miRBaseMiner, a tool for investigating miRBase content. RNA Biol. 16, 1534–1546 (2019).
Article PubMed PubMed Central Google Scholar
Wang, J. et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).
Article PubMed PubMed Central CAS Google Scholar
Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002).
Article CAS PubMed Google Scholar
Bentley, J., Diggle, C. P., Harnden, P., Knowles, M. A. & Kiltie, A. E. DNA double strand break repair in human bladder cancer is error prone and involves microhomology-associated end-joining. Nucleic Acids Res. 32, 5249–5259 (2004).
Article CAS PubMed PubMed Central Google Scholar
de Smith, A. J. et al. Small deletion variants have stable breakpoints commonly associated with Alu elements. PLoS ONE 3, https://doi.org/10.1371/journal.pone.0003104 (2008).
Callinan, P. A. & Batzer, M. A. Retrotransposable elements and human disease. Genome Dyn. 1, 104–115 (2006).
Article CAS PubMed Google Scholar
Feng, Q., Moran, J. V., Kazazian, H. H. Jr. & Boeke, J. D. Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell 87, 905–916 (1996).
Article CAS PubMed Google Scholar
Cost, G. J., Feng, Q., Jacquier, A. & Boeke, J. D. Human L1 element target-primed reverse transcription in vitro. EMBO J. 21, 5899–5910 (2002).
Article CAS PubMed PubMed Central Google Scholar
Srikanta, D. et al. An alternative pathway for Alu retrotransposition suggests a role in DNA double-strand break repair. Genomics 93, 205–212 (2009).
Article CAS PubMed Google Scholar
Callinan, P. A. et al. Alu retrotransposition-mediated deletion. J. Mol. Biol. 348, 791–800 (2005).
Article CAS PubMed Google Scholar
Mun, S., Lee, J., Kim, Y. J., Kim, H. S. & Han, K. Chimpanzee-specific endogenous retrovirus generates genomic variations in the chimpanzee genome. PLoS ONE 9, e101195 (2014).
Article PubMed PubMed Central CAS Google Scholar
Hughes, J. F. & Coffin, J. M. Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution. Proc. Natl Acad. Sci. USA 101, 1668–1672 (2004).
Article CAS PubMed PubMed Central Google Scholar
Foster, M. W. & Sharp, R. R. Race, ethnicity, and genomics: social classifications as proxies of biological heterogeneity. Genome Res. 12, 844–850 (2002).
Article CAS PubMed Google Scholar
Arman, E., Katzir, N., Rechavi, G. & Givol, D. Transposable elements and cancer. Curr. Top. Microbiol. Immunol. 132, 90–97 (1986).
CAS PubMed Google Scholar
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61–D65 (2007).
Article CAS PubMed Google Scholar
Kharitonenkov, A. et al. A family of proteins that inhibit signalling through tyrosine kinase receptors. Nature 386, 181–186 (1997).
Article CAS PubMed Google Scholar
Legati, A. et al. Mutations in XPR1 cause primary familial brain calcification associated with altered phosphate export. Nat. Genet. 47, 579–581 (2015).
Article CAS PubMed PubMed Central Google Scholar
Atambayeva, S. et al. The Binding Sites of miR-619-5p in the mRNAs of Human and Orthologous Genes. BMC Genomics 18, 428 (2017).
Article PubMed PubMed Central CAS Google Scholar
Qin, S., Jin, P., Zhou, X., Chen, L. & Ma, F. The role of transposable elements in the origin and evolution of microRNAs in human. PLoS ONE 10, e0131365 (2015).
Article PubMed PubMed Central CAS Google Scholar
Chen, L. L. & Yang, L. ALUternative regulation for gene expression. Trends Cell Biol. 27, 480–490 (2017).
Article CAS PubMed Google Scholar
Truong, L. N. et al. Microhomology-mediated end joining and homologous recombination share the initial end resection step to repair DNA double-strand breaks in mammalian cells. Proc. Natl Acad. Sci. USA 110, 7720–7725 (2013).
Article CAS PubMed PubMed Central Google Scholar
Lappalainen, I. et al. DbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 41, D936–D941 (2013).
Article CAS PubMed Google Scholar
Hancks, D. C. & Kazazian, H. H. Jr Roles for retrotransposon insertions in human disease. Mob. DNA 7, 9 (2016).
Article PubMed PubMed Central CAS Google Scholar
Niss, O. et al. Genotype-phenotype correlations in hereditary elliptocytosis and hereditary pyropoikilocytosis. Blood Cells Molecules Dis. 61, 4–9 (2016).
Article CAS Google Scholar
Rosenbloom, K. R. et al. The UCSC genome browser database: 2015 update. Nucleic Acids Res. 43, D670–D681 (2015).
Article CAS PubMed Google Scholar
Deininger, P. Alu elements: know the SINEs. Genome Biol. 12, 236 (2011).
Article CAS PubMed PubMed Central Google Scholar
Wang, X. et al. Comprehensive molecular diagnosis of 179 Leber congenital amaurosis and juvenile retinitis pigmentosa patients by targeted next generation sequencing. J. Med. Genet. 50, 674–688 (2013).
Article CAS PubMed Google Scholar
Crosby, A. H. et al. Defective mitochondrial mRNA maturation is associated with spastic ataxia. Am. J. Hum. Genet. 87, 655–660 (2010).
Article CAS PubMed PubMed Central Google Scholar
Dimitratos, S. D., Stathakis, D. G., Nelson, C. A., Woods, D. F. & Bryant, P. J. The location of human CASK at Xp11.4 identifies this gene as a candidate for X-linked optic atrophy. Genomics 51, 308–309 (1998).
Article CAS PubMed Google Scholar
Hayashi, S. et al. Novel intragenic duplications and mutations of CASK in patients with mental retardation and microcephaly with pontine and cerebellar hypoplasia (MICPCH). Hum. Genet. 131, 99–110 (2012).
Article CAS PubMed Google Scholar
Moog, U. et al. Phenotypic spectrum associated with CASK loss-of-function mutations. J. Med. Genet. 48, 741–751 (2011).
Article CAS PubMed Google Scholar
Pierson, T. M. et al. Exome sequencing and SNP analysis detect novel compound heterozygosity in fatty acid hydroxylase-associated neurodegeneration. Eur. J. Hum. Genet. 20, 476–479 (2012).
Article CAS PubMed Google Scholar
Kleefstra, T. et al. Further clinical and molecular delineation of the 9q subtelomeric deletion syndrome supports a major contribution of EHMT1 haploinsufficiency to the core phenotype. J. Med. Genet. 46, 598–606 (2009).
Article CAS PubMed Google Scholar
Hojlund, K. et al. A novel syndrome of autosomal-dominant hyperinsulinemic hypoglycemia linked to a mutation in the human insulin receptor gene. Diabetes 53, 1592–1598 (2004).
Article CAS PubMed Google Scholar
Favier, R., Aoki, N. & de Moerloose, P. Congenital alpha(2)-plasmin inhibitor deficiencies: a review. Br. J. Haematol. 114, 4–10 (2001).
Article CAS PubMed Google Scholar
Rodriguez-Pombo, P. et al. Human propionyl-CoA carboxylase beta subunit gene: exon-intron definition and mutation spectrum in Spanish and Latin American propionic acidemia patients. Am. J. Hum. Genet. 63, 360–369 (1998).
Article CAS PubMed PubMed Central Google Scholar
Wu, J. et al. Tangram: a comprehensive toolbox for mobile element insertion detection. BMC Genomics 15, 795 (2014).
Article CAS PubMed PubMed Central Google Scholar
Gardner, E. J. et al. The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology. Genome Res. 27, 1916–1929 (2017).
Article CAS PubMed PubMed Central Google Scholar
Keane, T. M., Wong, K. & Adams, D. J. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics 29, 389–390 (2013).
Article CAS PubMed Google Scholar
Genomes Project, C. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
Article CAS Google Scholar
Cancer Genome Atlas Research, N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Article CAS Google Scholar
Li, R. et al. Illumina synthetic long read sequencing allows recovery of missing sequences even in the “Finished” C. elegans genome. Sci. Rep. 5, 10814 (2015).
Article CAS PubMed PubMed Central Google Scholar
Feschotte, C. Transposable elements and the evolution of regulatory networks. Nat. Rev. Genet. 9, 397–405 (2008).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This work was supported by the Research Program funded by the intramural research fund of the Korea National Institute of Health, KCDC (2013-E72002-00 and 2013-E72002-01). We especially thank Dr. Jungnam Lee for her support and insightful commentary during preparation of the manuscript.

Author information

Authors and Affiliations

Department of Nanobiomedical Science, Dankook University, Cheonan, 31116, Republic of Korea
Seyoung Mun & Songmi Kim
DKU-Theragen Institute for NGS analysis (DTiNa), Cheonan, 31116, Republic of Korea
Seyoung Mun, Songmi Kim & Kyudong Han
Center for Bio‑Medical Engineering Core Facility, Dankook University, Cheonan, 31116, Republic of Korea
Seyoung Mun, Wooseok Lee & Kyudong Han
Department of Microbiology, Dankook University, Cheonan, 31116, Republic of Korea
Keunsoo Kang & Kyudong Han
CCR Collaborative Bioinformatics Resource, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Thomas J. Meyer
Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
Thomas J. Meyer
Center for Genome Science, National Institutes of Health, Korea Centers for Disease Control and Prevention, Osong, 28160, Republic of Korea
Bok-Ghee Han
Department of Biological Sciences, Pusan National University, Busan, 46283, Republic of Korea
Heui-Soo Kim

Authors

Seyoung Mun
View author publications
You can also search for this author in PubMed Google Scholar
Songmi Kim
View author publications
You can also search for this author in PubMed Google Scholar
Wooseok Lee
View author publications
You can also search for this author in PubMed Google Scholar
Keunsoo Kang
View author publications
You can also search for this author in PubMed Google Scholar
Thomas J. Meyer
View author publications
You can also search for this author in PubMed Google Scholar
Bok-Ghee Han
View author publications
You can also search for this author in PubMed Google Scholar
Kyudong Han
View author publications
You can also search for this author in PubMed Google Scholar
Heui-Soo Kim
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Bok-Ghee Han, Kyudong Han or Heui-Soo Kim.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Mun, S., Kim, S., Lee, W. et al. A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome. Exp Mol Med 53, 615–630 (2021). https://doi.org/10.1038/s12276-021-00586-y

Download citation

Received: 09 February 2019
Revised: 26 January 2021
Accepted: 27 January 2021
Published: 08 April 2021
Issue Date: April 2021
DOI: https://doi.org/10.1038/s12276-021-00586-y

This article is cited by

Transposable element-mediated rearrangements are prevalent in human genomes
- Parithi Balachandran
- Isha A. Walawalkar
- Christine R. Beck
Nature Communications (2022)
A comprehensive analysis of gorilla-specific LINE-1 retrotransposons
- Soyeon Jeon
- Songmi Kim
- Kyudong Han
Genes & Genomics (2021)

Subjects

Abstract

Similar content being viewed by others

A pangenome reference of 36 Chinese populations

Building a Chinese pan-genome of 486 individuals

The complete and fully-phased diploid genome of a male Han Chinese

Introduction

Materials and methods

DNA sample preparation and genome sequencing

Raw data filtering

De novo assembly of the KPGP9 genome

Repeat annotation

Data mining and manual inspection of TASVs

Experimental validation of precise TASVs

Comparison analysis with the public genomic variant database

Inferring clinical impacts and visualizing the identified TASVs from the KPGP9 genome

Alu and L1 densities in genic region

Ethical approval

Results

Sequencing and de novo assembly of a Korean individual genome

Computational data mining of TE-associated structural variations (TASVs)

Experimental validation for TASVs

Contribution of TASVs in genomic variation

Comparison with the published genomic databases

Discussion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Supplementary information

Supplementary Information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Transposable element-mediated rearrangements are prevalent in human genomes

A comprehensive analysis of gorilla-specific LINE-1 retrotransposons

Search

Quick links