LUSTR: a new customizable tool for calling genome-wide germline and somatic short tandem repeat variants

Abstract

Background

Short tandem repeats (STRs) are widely distributed across the human genome and are associated with numerous neurological disorders. However, the extent to which STRs contribute to disease is likely underestimated because of the challenges of calling these variants in short-read next-generation sequencing data. Several computational tools have been developed for STR variant calling, but none fully addresses all of the complexities associated with this variant class.

Results

Here we introduce LUSTR, which is designed to address some of the challenges associated with STR variant calling by enabling more flexibility in defining STR loci, allowing for customizable modules to tailor analyses, and expanding the capability to call somatic and multiallelic STR variants. LUSTR is a user-friendly and easily customizable tool for targeted or unbiased genome-wide STR variant screening that can use either predefined or novel genome builds. Using both simulated and real data sets, we demonstrated that LUSTR accurately infers germline and somatic STR expansions in individuals with and without diseases.

Conclusions

LUSTR offers a powerful and user-friendly approach that allows for the identification of STR variants and can facilitate more comprehensive studies evaluating the role of pathogenic STR variants across human diseases.

Background

Short tandem repeats (STRs), also known as microsatellites, are DNA sequences composed of either identical (perfect) or highly similar (imperfect) short repetitive units (Supplementary Fig. 1) [1]. By definition, the repeated unit is usually shorter than 6 bp [2]. STRs are typically flanked by patternless sequences. Since their first characterization in vivo, STRs have been found throughout the genomes of both prokaryotes and eukaryotes [3,4,5]. Under the common definition of STRs, more than 3% of the human reference genome consists of STR sequences, and about 90% of known human genes contain at least one STR locus within the protein-coding regions [2, 6].

STR variants include both nucleotide and length changes, resulting in both mismatches and repeat insertion/deletions (rINDELs). The slippage model first proposed by Kornberg is one widely accepted mechanistic model explaining the high mutation rate at STRs compared to non-STR regions [2, 7, 8]. This model posits that the length of the STR repeat sequence can either expand (increase repeat number) or contract (decrease repeat number) due to a mispairing of the repetitive sequence in the nascent strand to the template strand during DNA replication. This mispairing creates a loop in either the nascent or template strand thus leading to a larger or smaller tandem repeat number in the newly formed DNA strand. In most cases STRs vary by only a single repeat addition or subtraction, but in some cases the STR loci can expand or contract by several thousand repeats [9, 10]. Such length variations may cause structural disruption and result in altered gene expression when they happen within protein coding or non-coding regulatory regions [11,12,13]. The majority of research into the biological relevance of STRs focuses on the impact of the size of STRs, or the total number of repeated DNA units on each allele at the STR locus [9, 10]. Pathogenic STR expansions cause multiple severe human neurological disorders, including Huntington disease, amyotrophic lateral sclerosis (ALS), fragile X syndrome, and Friedreich ataxia [14,15,16,17,18]. Interestingly, the length of the expansion has been shown to vary in different tissues and cells within the same individual which gives rise to mosaicism [18,19,20]. In fact, mosaicism has been reported in both clinical cases and mouse models for multiple disease associated STR loci [21,22,23,24,25,26,27,28]. In addition to the contribution of STRs in disease, the high variation rate of STRs also provides polymorphic DNA markers in every individual. 
Thus, STRs can also be important targets in kinship determination and identity verification when a reliable genotyping method is available [29,30,31].

The unique properties of STRs make the genotyping of these sites extremely challenging. Historically, STR genotyping was done using repeat-primed polymerase chain reaction (RP-PCR) and Southern blotting; however, these approaches are inefficient and require advance knowledge of the target site [32,33,34]. Genome sequencing technologies offer the potential for a more efficient and more cost-effective way to genotype STRs genome-wide and without bias. Short-read sequencing has been adopted more widely because application of the emerging long-read sequencing technologies is still limited by cost and high sequencing error rates [35]. Although small STR expansions or contractions can be identified via standard variant calling pipelines as small insertion-deletion variants, the robustness and accuracy of the genotype can be significantly affected by the structural complexity of the STRs, especially when the variant size exceeds the sequenced read lengths [36]. Efforts have been made to develop computational tools specifically for STR realignment and variant calling [37,38,39,40,41,42,43,44,45,46], but significant challenges still exist. Many of the STR calling pipelines require the user to provide target STR loci with inflexible input requirements. A recently developed tool, ExpansionHunter Denovo, does not require information on STR loci and allows for an unbiased screen. To detect signals of expansions, ExpansionHunter Denovo uses only read pairs in which one read maps to the flanking region and the other maps entirely within the repeated sequence. This approach applies only to long expansions, limiting the ability to genotype specified STR loci when they have no or only small size variations [47]. Furthermore, to our knowledge there are very limited options to detect mosaicism at STR loci, which has been observed in some individuals [20].
While the link between somatic mutations and cancer and neurological disorders has been well established, the full contribution of somatic STR variants in disease is yet to be revealed [20, 48, 49]. Given the high mutability of STR variants, post-zygotically acquired pathogenic STR expansions and contractions, which would give rise to mosaicism, may be more involved in disease risk than currently appreciated [25, 28, 50,51,52,53,54,55,56,57,58,59,60,61].

Here we have developed a novel STR variant calling tool, LUSTR (LU developed STR toolkit), for short read next generation sequencing which offers accurate germline and somatic STR calling in a highly user-friendly format.

Implementation

We designed the LUSTR pipeline to provide an accurate, robust, and easy-to-use method to call germline and somatic STR variants from short read next-generation sequence data. The pipeline is divided into several modules (Fig. 1), each described below. The Perl scripts used for the LUSTR pipeline are available on GitHub (https://github.com/JLuGithub/LUSTR/releases/tag/STR).

Fig. 1

LUSTR pipeline and modules. LUSTR distinguishes itself from other existing pipelines or tools in the following aspects: (1) A “finder” module to standardize extraction of genomic STR regions to be genotyped. The “finder” module aims to simplify the information required to target specific STRs, diminish the impact of imperfect input, and provide flexibility to allow easier user-customized target lists, ranging from unbiased compilations for genome-wide scans to a small number of targeted STR sequences. (2) Instead of directly processing mapped reads (.bam files) obtained from alignment pipelines not necessarily optimized for STRs, LUSTR de novo remaps the raw reads (.fastq files) to STR references defined in the “finder” module, with parameters adjusted specifically for STR calling to enhance the performance. We provide an “extractor” module to retrieve reads from the .bam file if raw reads are not available. This remapping step and the “finder” module, indicated by a dashed rectangle, are unique to LUSTR and are not available in other STR calling pipelines. (3) LUSTR implements a flexible two-step strategy for STR genotyping, separating the local realignment step in the “realigner” module to incorporate reads that may have been discarded during the mapping process, and a freestanding calling step in the “caller” module which processes the realignment results to estimate the genotypes for each STR. This modular approach allows for precise tracking of reads through realignment, which is critical for debugging and performance evaluation, and allows easier implementation of necessary updates or incorporation of project-specific optimization. (4) The “realigner” module applies both flanking-guided and repeat-guided realignment to ensure both accuracy and sensitivity. (5) The “caller” module allows fractional multiallelic STR genotyping results amenable to the calling of germline or somatic expansions or contractions. (6) LUSTR minimizes the prerequisites and only requires pre-installation of samtools and bwa.

Finder module

The purpose of this module is to identify the genomic coordinates to extract the repeat and flanking sequences for the STRs the user seeks to genotype. There is no limit to the number of STR sites that can be interrogated. Since the exact sequence of an STR may vary due to the presence of mismatches in some of the repetitive sequences or incomplete repeats (Supplementary Fig. 1), providing exact STR boundaries can be difficult and imprecise. Therefore, in addition to the repeat unit, LUSTR requires only the approximate position of the targeted STR, which can merely include sufficient repeats as seeds to initiate the search. Using this information LUSTR searches the reference sequence for both perfect and imperfect repeats around the given positions, periodically extends the repeats, and automatically determines the boundaries between flanking and repeat sequences using default or user-defined parameters that specify how permissive the user wants to be regarding the extent of mismatch and gaps (Supplementary Fig. 2). The LUSTR-defined genomic coordinates, sequences associated with the targeted STRs, and the parameters used to generate the list will then be carried to the following modules.
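The seed-and-extend idea described above can be illustrated with a short Python sketch. This is not LUSTR's Perl implementation: the function name `extend_repeat`, the omission of gaps, and the treatment of the stop score as a maximum allowed drop below the best running score are simplifying assumptions made for illustration. Only the match, mismatch, and stop values (+2, -5, -30) are taken from the default settings reported later in the text.

```python
def extend_repeat(seq, start, unit, match=2, mismatch=-5, stop=-30):
    """Extend rightward from `start`, comparing `seq` to a cyclic copy of `unit`.

    Returns the end index (exclusive) of the best-scoring extension.
    Gaps are ignored in this simplified sketch (assumption).
    """
    score = best = 0
    best_end = start
    for i in range(start, len(seq)):
        expected = unit[(i - start) % len(unit)]
        score += match if seq[i] == expected else mismatch
        if score > best:
            best, best_end = score, i + 1
        elif score - best <= stop:
            break  # score fell too far below the best seen: stop extending
    return best_end
```

Because the boundary is placed at the best-scoring position rather than at the first mismatch, an isolated mismatch inside the repeat does not end the extension, whereas a run of mismatches in the flanking sequence does.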

RefCreator module and extractor module

Given the unique requirements for the alignment of sequencing reads at STR loci, LUSTR requires de novo mapping of raw reads to STR loci. Based on the sequences determined by the “Finder” module using the user-defined parameters (Supplementary Fig. 1), the “RefCreator” module creates separate references from the flanking and the repeat sequences, as well as artificial references composed of perfect repetitive units of target STRs. If the original raw reads (.fastq) are unavailable, LUSTR provides the “Extractor” module to pull all of the raw reads from bam files using a single command regardless of the way the bam files are sorted. Alternatively, users can choose samtools or other existing tools to prepare the raw reads after the bam files are sorted by read ID. The mapping of the raw reads to STR references can then be done by existing tools such as bwa with appropriate parameters for STRs (defined in the user manual), to provide primary alignments as sam or bam files for the following LUSTR modules. Quality control can be applied either before or after the mapping to reduce false signals in the subsequent steps. Note that this de novo mapping step, as well as the “Finder” module, are unique to LUSTR to increase calling accuracy.
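As a rough illustration of the reference layout described above, the following Python sketch writes separate FASTA records for the two flanks, the repeat, and an artificial perfect repeat. The function name `make_references`, the record naming scheme, and the `copies` parameter are hypothetical and are not taken from the LUSTR scripts or manual.

```python
def make_references(name, upstream, repeat, downstream, unit, copies=50):
    """Return FASTA text with separate flank, repeat, and artificial records.

    Record names and the number of artificial copies are assumptions.
    """
    records = {
        f"{name}|up": upstream,
        f"{name}|repeat": repeat,
        f"{name}|down": downstream,
        f"{name}|artificial": unit * copies,  # perfect tandem copies of the unit
    }
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records.items())
```

Keeping flanks and repeats as separate records lets a general-purpose mapper report which part of an STR a read supports before any STR-specific realignment is applied.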

Realigner module

LUSTR then uses the “Realigner” module to map any unmapped reads and to map the unmapped portions of partially mapped reads from the previous step. Specifically, when the majority of the read is from a flanking sequence, the “Realigner” module will try to align the remaining part to the repeat sequence using the periodic Smith-Waterman algorithm. When the majority of the read is from a repeat sequence, the “Realigner” module will try to align the remaining part to the flanking sequence using the regular Smith-Waterman algorithm. Reads with non-contiguous realignment will be presented as split portions of the read belonging to the upstream flanking, repeat, and downstream flanking regions of an STR. To analyze each STR in the subsequent step, all realigned reads are categorized according to the STR regions they map to, allowing for single reads to map to multiple different locations if homologous sequences exist. Paired-end reads that cannot be mapped to the same target STR(s) are discarded.
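The two alignment modes can be sketched in Python as follows. The first function is a standard Smith-Waterman local alignment score with a linear gap cost. The second is only a stand-in for a periodic Smith-Waterman: it tiles the repeat unit past the query length and reuses the regular algorithm, which is an assumption made for brevity and not necessarily how LUSTR implements the periodic variant. The function names and the reuse of the +2/-5/-7 scores are likewise illustrative.

```python
def smith_waterman(query, target, match=2, mismatch=-5, gap=-7):
    """Best local alignment score of `query` against `target` (linear gap cost)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def periodic_score(query, unit, **scores):
    """Approximate a periodic alignment by tiling `unit` past the query length."""
    copies = len(query) // len(unit) + 2
    return smith_waterman(query, unit * copies, **scores)
```

Tiling with two extra copies ensures that a query starting at any phase of the repeat unit can still be aligned end to end.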

Caller module

In the last step, the “Caller” module collects the information from the alignment procedures described above and lists each potential repeat size at the STR locus that is supported by at least one read. Alleles with repeat sizes short enough to be supported by spanning reads will be determined directly, while the size of long repeats (those exceeding read length) will be estimated by taking the ratio of the number of reads realigned to the flanking and the repeat regions. The quality of the calls can then be determined by inspection of the number of realigned reads and the randomness of their distribution at the STR loci following default or user-provided thresholds. By categorizing pairs supporting each of the potential alleles, the “Caller” module estimates the fraction of each allele, allowing for the possibility of somatic STR variants. Considering the complexity of STRs, the “Caller” module returns the genotyping results in plain text format, which can be easily converted to VCF or other file formats if needed. Furthermore, the “Caller” module also integrates an option to narrow down the STR candidates by generating a list of alleles meeting user-customized thresholds for several features, such as the expansion size, call quality, and allele fraction. Additionally, when bias is detected between upstream and downstream flanking sequences, the “Caller” module will also provide a warning message for users to investigate potential off-targets or complex mutations close by.
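The ratio-based size estimate and the allele fraction estimate can be sketched in Python under a simple uniform-coverage assumption: if reads are sampled evenly along the locus, the number of bases realigned to a region scales with the length of that region. The function names and the use of base counts rather than read counts are assumptions for illustration, and LUSTR's actual estimator may differ.

```python
def estimate_repeat_length(repeat_bases, flank_bases, flank_length):
    """Estimate repeat length (bp) from realigned base counts.

    Under uniform coverage (assumption), bases per region scale with region
    length, so repeat_length / flank_length ~= repeat_bases / flank_bases.
    """
    if flank_bases <= 0:
        raise ValueError("no flanking coverage; size cannot be estimated")
    return flank_length * repeat_bases / flank_bases


def allele_fractions(pair_counts):
    """Convert per-allele supporting pair counts into fractions that sum to 1."""
    total = sum(pair_counts.values())
    return {allele: n / total for allele, n in pair_counts.items()}
```

For example, with 2000 bp of flanking sequence, 60,000 flanking bases, and 18,000 repeat bases, the estimated repeat length is 2000 × 18,000 / 60,000 = 600 bp.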

Results

Application of LUSTR in simulated short reads sequencing datasets

We first tested how well LUSTR performs the local realignment using the “realigner” module, as this step is critical for accurate genotyping and estimating the number of variant alleles present. Simulated reads were generated from the STR locus in the human C9ORF72 gene (Table 1). The C9ORF72 STR contains tandemly repeated GGGGCC sequences (or GGCCCC on the forward strand), whose expansion is well-studied and known to be associated with ALS (Supplementary Fig. 1). We simulated individual libraries of C9ORF72 STR alleles with different repeat sizes as follows: (Library 1) an allele with the original repeat size (62 bp by the default parameters of the LUSTR Finder module), (Library 2) an expanded allele with twice the original number of repeats, (Library 3) an expanded allele with four times the original number of repeats, which exceeds standard short read lengths, (Library 4) a contracted allele with half the original number of repeats, and (Library 5) an allele missing the repeats entirely. Twenty thousand raw reads with lengths of 150 nucleotides were generated in pairs for each library, randomly from the 2 × 1000 bp flanking sequences and the repeat regions. Note that with these settings, the repeat region of the C9ORF72 STR in Library 3 could not be fully spanned by any read because of the read length limitation. To simulate sequencing errors, we allowed mismatches, insertions, and deletions (indels) at each nucleotide position at a rate of 0.5%. Raw simulated pairs were then processed following the LUSTR pipeline. The realignment annotations of flanking and repeat lengths produced by the LUSTR “realigner” module were compared to the records kept during the generation of the raw reads, and the repeat size estimations by the “caller” module were then compared to the expectation (Table 1). 
Notably, LUSTR showed high specificity in all libraries and successfully excluded all pairs that were not generated in the forward-reverse pattern (true negative) without calling any positive signals incorrectly (false positive). Among the remaining pairs, LUSTR also exhibited high sensitivity > 99% by successfully retrieving most of the positive pairs (true positive) and missing only a few pairs in certain libraries (false negative). The false negative calls arose because of the mismatches or INDELs that occasionally occurred within correlated reads, which rendered the realignment scores below the threshold and triggered them to be discarded. Moreover, LUSTR annotated > 99% of the true positive pairs identically to the way they were generated, with only a few pairs annotated imperfectly. We found most of the misannotated pairs were due to simulated sequencing errors at the exact boundary between the flanking and repeat regions, which resulted in one nucleotide shifts in the annotation results. These results show that LUSTR was both sensitive and specific to realigning raw reads to the STR loci.
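A read simulation of the kind described above could look like the following Python sketch. The fragment (insert) size of 400 bp, the equal split among substitutions, insertions, and deletions, and all function names are assumptions. The text specifies only the 150-nucleotide read length, the forward-reverse orientation, and the 0.5% per-position error rate.

```python
import random


def add_errors(read, rate=0.005, rng=random):
    """Apply substitutions, insertions, and deletions at `rate` per base."""
    out = []
    for base in read:
        if rng.random() < rate:
            kind = rng.choice(("sub", "ins", "del"))
            if kind == "sub":
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif kind == "ins":
                out.append(base + rng.choice("ACGT"))
            # "del": emit nothing for this base
        else:
            out.append(base)
    return "".join(out)


def revcomp(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def simulate_pair(template, read_len=150, insert=400, rng=random):
    """Draw one forward-reverse pair from a random fragment of `template`."""
    start = rng.randrange(len(template) - insert + 1)
    fragment = template[start:start + insert]
    return (add_errors(fragment[:read_len], rng=rng),
            add_errors(revcomp(fragment[-read_len:]), rng=rng))
```

Passing a seeded `random.Random` instance as `rng` makes a simulated library reproducible, which helps when realignment annotations are compared against the recorded origin of each read.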

Table 1 Performance of LUSTR in genotyping simulated short reads sequencing libraries

We next tested the ability of LUSTR to estimate the size of STRs from short reads (Fig. 2a). We simulated homozygous C9ORF72 STR references with different repeat sizes along with 2 × 1000 bp flanking sequences, and randomly generated forward-reverse 150-nucleotide pairs from each of them. Mismatches or INDELs were allowed at each nucleotide position at a rate of 0.5% to imitate expected sequencing errors. To test the robustness of LUSTR under low sequencing depth, we generated the libraries under different average coverages varying from 1 to 100X. Each condition was repeated 10 times independently, and the raw pairs in each simulated library were processed by LUSTR up through the “caller” module. The individual size variation estimated by LUSTR for each library is shown in Fig. 2a, and the average of each condition was compared to the expectation. We also calculated the square of the correlation coefficient (r²) to summarize the ability of LUSTR to call expected sizes under different coverage conditions. LUSTR successfully estimated the STR size variation in libraries with sequencing depth as low as 5X (r² = 0.74), and its estimations became more accurate as sequencing depth increased (r² = 0.97 at 30X, Fig. 2a). LUSTR was even able to make an accurate estimation when the STR repeat sizes were close to the simulated read lengths (150 bp, variation +15). These results indicated that LUSTR robustly estimates STR sizes with high accuracy.
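For reference, the squared correlation coefficient used as the summary statistic can be computed as in the short Python sketch below. The function name is illustrative, and the inputs are assumed to be the expected values and the per-condition averages of the estimates.

```python
def r_squared(expected, observed):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in observed)
    return sxy * sxy / (sxx * syy)
```

Note that r² measures linear association only: estimates that are consistently scaled or offset relative to the expectation would still give a value near 1, so the comparison of averages against the expected line in Fig. 2 remains informative.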

Fig. 2

LUSTR is robust in tests with simulated libraries. To test the performance of LUSTR in size and allele fraction estimations, we generated simulated reads from the C9orf72 locus including 2 × 1000 bp flanking regions and the repeats of (a) homozygous alleles with different expanded or contracted repeat sizes (ranging from -10.3 to +1000), and (b) heterozygous alleles with one reference allele and one expanded allele (+100 repeats), mixed at different fractions. Reads 150 nucleotides in length were generated in pairs with an error rate of 0.5% including mismatches, insertions, and deletions, under different average coverages ranging from 1 to 100X. Each combination was repeated 10 times as a group. The number of failed libraries in each group, which were due to low coverage and occurred mainly in the 1X coverage condition, is indicated by red shading. For successfully called libraries, we examined the estimated repeat size variants (a) and then estimated the fraction of the reference allele (b). The observed and expected values are shown for each scenario evaluated. We compared the average result in each group (indicated by a black solid line) with the expectation (indicated by a blue dotted line) and calculated the square of the correlation coefficient (r²). Among the sizes evaluated, we specifically tested the repeat size variations for the deletion allele (-10.3), the reference allele (0), and the allele with a repeat size close to the read length (+15) in Fig. 2a. For size estimation (a), LUSTR showed robust performance starting from 5X coverage and came very close to the expectations from as low as 10X coverage. For fraction estimation (b), LUSTR required higher coverage, but still exhibited reliable estimates showing the expected allelic ratio with only 10X coverage. This result showed that LUSTR robustly infers both repeat size and allele fraction even for low-coverage libraries.

The estimation of STR allele fraction has not been explored to any great extent with existing STR calling tools but is essential for somatic variant analysis. Therefore, we further tested the ability of LUSTR to accurately determine STR allele fraction (Fig. 2b). We simulated heterozygous C9ORF72 STR references composed of two alleles along with 2 × 1000 bp flanking sequences: one with a normal C9ORF72 STR repeat size (62 bp including 18 bp perfect repeat units), and the other with a very large expansion in the range commonly found in humans with ALS or FTD (about 100 repeat units longer than the reference). The normal C9ORF72 STR allele fraction was then varied from 10 to 90%. Raw pairs of 150 nucleotides were randomly generated in a forward-reverse pattern under different average coverages varying from 1 to 100X. Randomly generated substitutions or INDELs at a rate of 0.5% were incorporated to account for expected sequencing errors. Simulations for each allelic fraction evaluated were repeated 10 times and independently processed by LUSTR. The estimated allelic fraction of the original C9ORF72 STR allele in each library is shown in Fig. 2b. The average of each condition was compared to the expectation. We found that the estimation of STR allele fraction required higher sequencing depth compared to that required for non-mosaic STR sizing. Although LUSTR exhibited a correlation between the estimation averages and the expectations starting from 10X coverage (r² = 0.56), it did not return a reliable estimation for individual libraries until 30X (r² = 0.77) or 50X (r² = 0.88) coverage (Fig. 2b). These results indicate that LUSTR is able to successfully estimate the fractions of STR alleles in deeply sequenced short-read libraries, although the performance, as expected, could be affected by insufficient realigned reads when sequencing depth was low.

Identification of known STR variants from publicly-available sequence data using LUSTR

We next tested the ability of LUSTR to correctly identify STR variants in a database with benchmarking variant calls defined by the Genome in a Bottle Consortium (GIAB). GIAB integrates multiple short and linked read sequencing datasets to provide benchmark calls for human genomes and provides a valuable source for the optimization and validation of bioinformatics tools [62]. We downloaded the available MGISEQ (150-nucleotide read length) and BGISEQ (100-nucleotide read length) paired-end short-read libraries for the Ashkenazim trio and the Chinese trio from GIAB. In addition to the analysis for each individual library, we also generated and analyzed merged libraries when the same individual was sequenced multiple times or across multiple sequencing lanes (Tables 2 and 3, Supplementary Table 1). We then selected 13 STR loci that were known to be associated with neurological disorders and thus have been used to validate existing STR calling software [42]. The raw pairs of each merged and individual library were processed by LUSTR using the default settings, and the genotype calls for the listed STR loci were compared with variant calling files (VCFs) provided by GIAB.

Table 2 Performance of LUSTR and ExpansionHunter in identifying STR variants reported in the GIAB database (Ashkenazim Trio)
Table 3 Performance of LUSTR and ExpansionHunter in identifying STR variants reported in the GIAB database (Chinese Trio)

Across all the Ashkenazim and Chinese trio libraries and the 13 loci, there were a total of 54 opportunities to compare the genotype provided by GIAB to that called by LUSTR (Tables 2 and 3). For 48 out of the 54 comparisons (88.9%), the predominant allele(s) identified by LUSTR matched that of the benchmark GIAB calls. Among the concordant calls, LUSTR also detected two contracted STR variants at the ATN1 and HTT loci for the son of the Ashkenazim trio at low levels, one with a contraction of five repeat units at 5% allele fraction and the other with a contraction of nine repeat units at 4% allele fraction (indicated as -5 and -9 in Tables 2 and 3, respectively). Although these small fraction alleles were not called by GIAB, they were supported by some reads realigned to the loci (Supplementary Fig. 4). This could be due to sequencing errors that generated a small fraction of reads artificially revealing the variants, or it could indicate the real presence of somatic STR variants at these loci. A minor fraction of reads supporting an allele of +12.7 was also detected at the ATXN3 locus in the Ashkenazim trio, compared to the expected +13. This minor discrepancy was likely attributable to sequencing errors or slight interpretation differences. The six discordant calls (11.1%) were either small differences in repeat count unlikely to alter the interpretation (e.g., for the Ashkenazim trio LUSTR called -1/+1 repeats for the two ATXN1 alleles whereas GIAB reported 0/+1) or due to reads supporting variant alleles being absent in specific libraries (e.g., ATXN7 in the mother of the Ashkenazim trio and in the child of the Chinese trio) (Supplementary Table 1a and b and Supplementary Fig. 3).

For the 50 instances without GIAB calls (NAs), LUSTR called 34 genotypes identical or close to the reference (68%), which likely explains the absence of calls in GIAB VCFs. In addition to observing a high rate of calling concordance, there were several cases where LUSTR detected a genotype that was not called by GIAB. For example, at the DMPK locus, LUSTR called a genotype of -15, -9 in two sequencing runs for the mother of the Ashkenazim trio that was not reported by GIAB (Tables 2 and 3). The reason this was not called in GIAB is unclear. However, in all cases, there was clear sequence read evidence supporting the presence of these alleles (Supplementary Fig. 5). The LUSTR genotype calls in the child also followed a Mendelian inheritance pattern which further supports the accuracy of the calling (Tables 2 and 3).

Since ExpansionHunter provides curated information and input format for these 13 STR loci, we also ran ExpansionHunter (ver 4.0, default settings) and compared the results in all of the eight merged libraries to further evaluate the performance of LUSTR. Among all 104 comparisons from both the Ashkenazim and Chinese trios, LUSTR produced 94 calls (90.4%) identical or equivalent to the ExpansionHunter results, including those at the loci referred to above where LUSTR called genotypes significantly different from the GIAB database (Tables 2 and 3, Supplementary Fig. 5). Among the 10 calls that were discordant between LUSTR and ExpansionHunter, there were instances where ExpansionHunter was able to reveal a hidden allele missed by LUSTR (e.g. ATXN7 in the mother from the Ashkenazim trio), but also instances where LUSTR showed more convincing results on inspection of the raw reads (e.g. HTT in the father from the Ashkenazim trio) (Tables 2 and 3). These results collectively support that LUSTR can accurately genotype STR variant alleles in short-read sequencing libraries.

LUSTR was accurate and robust in calling mosaic STR variants in the in silico mixture libraries

We next tested the ability of LUSTR to call mosaic STR variants by mixing short reads from real data libraries in silico. We selected two MGISEQ sequenced libraries with equal sequencing lengths, one from the father of the Ashkenazim trio and the other from the child of the Chinese trio. We generated an in silico mixture by randomly selecting varying proportions of reads (Table 4) from the two libraries. The mixed libraries were then processed by LUSTR for the 13 tested STR loci shown in Tables 2 and 3. To validate the performance, we first assumed the STR genotypes of the two samples by integrating both GIAB calls and reliable LUSTR calls in the previous tests. We then estimated the expected STR allele fractions in the mixture libraries by assuming that both original samples had homozygous or heterozygous germline genotypes at these loci (i.e., 100% or 50% variant allele frequency, Table 4). The expected calls were then compared with the calls by LUSTR. In the mixed library consisting of a 1:2 ratio of the two genomes, LUSTR successfully called the alleles with fractions very close (< 10%) to expected for six out of the 13 STRs (ATN1, ATXN3, C9ORF72, CBL, DMPK, and HTT) (46.2%, Table 4). For 5 STRs (ATXN2, ATXN10, CACNA1A, JPH3, and PPP2R2B), LUSTR called allelic fractions deviating greater than 10% from expectation. This could be due to read bias in the original samples or sampling error (Table 4, Supplementary Table 1a and b). LUSTR missed STR alleles for ATXN1 and ATXN7, but these were due to missing or low-quality reads supporting the non-dominant alleles in the original libraries (Table 4, Supplementary Table 1).
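The expected allele fractions in a mixture follow directly from the assumed germline genotypes and the mixing proportions, as the Python sketch below illustrates. Alleles are written as repeat-unit differences from the reference, as in the tables. The function name and input layout are illustrative assumptions, and each sample is assumed to be diploid with both alleles equally represented in its reads.

```python
from collections import defaultdict


def expected_fractions(genotypes, weights):
    """Expected allele fractions in a mixture of diploid samples.

    genotypes: one (allele, allele) tuple per sample.
    weights: relative read contribution of each sample (e.g. [1, 2]).
    """
    total = sum(weights)
    fractions = defaultdict(float)
    for (first, second), weight in zip(genotypes, weights):
        for allele in (first, second):
            fractions[allele] += 0.5 * weight / total
    return dict(fractions)
```

For a 1:2 mixture in which the first sample is homozygous reference (0/0) and the second is heterozygous (0/+4), the expected fraction of the +4 allele is 0.5 × 2/3 = 1/3, and the reference allele accounts for the remaining 2/3.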

Table 4 Ability of LUSTR to estimate allele fraction by in silico mixture of samples

To further test the ability of LUSTR under more extreme conditions, we then mixed the two samples by an approximate ratio of 1:10 to mimic mosaic STR alleles with fractions as low as 5 or 10% (Table 4). Considering that such low fractions were made by selecting only a few reads from the sample, we performed three replicates to reduce the impact of sampling error that can occur during the mixing (Table 4). LUSTR successfully called the alleles with expected fractions in at least one of the three replicates for 6 out of the 13 STRs (ATN1, ATXN3, C9ORF72, CBL, DMPK, and PPP2R2B) (46.2%, Table 4). Notably, LUSTR was able to call the minor alleles with low fractions (5 or 10%) for ATXN3, CBL, DMPK, and PPP2R2B with very close estimations (< 10%) (Table 4). At the HTT locus, LUSTR called the correct fraction in one of three mixtures but flagged the call as not being reliable. This suggests that allowing more permissive calling may be needed to capture mosaic STRs. At the JPH3 locus, LUSTR estimated allelic fractions that did not align well with expectation (> 10% difference) (Table 4). The reason for this is unclear but is likely due to allelic bias from the Chinese trio Son library as shown by original LUSTR calls (Supplementary Table 1a and b). LUSTR consistently missed the minor alleles for ATXN1, ATXN2, ATXN7, ATXN10, and CACNA1A (Table 4) due to the loss of all reads supporting that allele when randomly sampling from the non-dominant genome.

While noisier than germline calling, these results collectively support the ability of LUSTR to accurately call mosaic STR variant alleles with variant allele fractions as low as 5%. We note however that the accuracy will be greatly influenced by read depth at the locus, as is the case for calling of any allele with low representation in a genome.

Identification of undiagnosed STR expansions in subjects by unbiased whole genome scan using LUSTR

We next tested the ability of LUSTR to identify clinically significant STR expansions using an unbiased whole genome scan in samples harboring known pathogenic STRs. We collected raw whole genome sequence data (short-read paired-end sequencing) from three individuals with presumed genetic disorders sequenced as part of the Undiagnosed Disease Network (UDN). These subjects were genetically undiagnosed, but all had STR expansion variants that may explain their phenotype (Table 5). We were blind to the specific phenotypes or genotypes while performing the scans so as not to bias the analyses. Two libraries were sequenced for subject 1 and subject 2, and four libraries were sequenced for subject 3 (Table 5). We also collected the libraries from the unaffected parents and siblings for subject 1 and subject 2 to determine inheritance (Table 5). To prepare for the whole genome scan, we used Tandem Repeats Finder [37] to obtain the basic information of STRs across the whole human genome reference (build 37). We ran Tandem Repeats Finder using the recommended settings (match/mis/gap/PM/PI/minscore = 2/-5/-7/80/10/50), selected those STRs located within 1000 bp of known genes as defined by the UCSC genome annotation database (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/), and retrieved a customized list of 162,840 STRs. We then applied the LUSTR “Finder” module to retrieve the standardized STR sequences for these 162,840 loci using default settings (match/mis/gap/stop = +2/-5/-7/-30) and generated reference sequences by using the “RefCreator” module.
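The gene-proximity filter applied to the Tandem Repeats Finder output can be sketched in Python as follows. The tuple layouts, function names, and the linear scan over genes are assumptions for illustration; a genome-wide list would more realistically use sorted intervals or an interval tree.

```python
def near_gene(str_start, str_end, genes, max_dist=1000):
    """True if [str_start, str_end] lies within `max_dist` bp of any gene."""
    return any(
        str_start <= gene_end + max_dist and str_end >= gene_start - max_dist
        for gene_start, gene_end in genes
    )


def filter_strs(strs, genes_by_chrom, max_dist=1000):
    """Keep (chrom, start, end) STRs within `max_dist` bp of a known gene."""
    return [
        s for s in strs
        if near_gene(s[1], s[2], genes_by_chrom.get(s[0], []), max_dist)
    ]
```

Extending each gene interval by `max_dist` on both sides and testing for overlap treats STRs inside a gene and STRs just outside it in the same way.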

Table 5 Unbiased whole genome scan by LUSTR for known STR expansions in undiagnosed subjects

Raw read libraries of the three subjects were mapped to the references generated by LUSTR using bwa mem. All bam files from each individual library as well as merged bam files for each subject were then processed blindly by the LUSTR “Realigner” and “Caller” modules against the customized STR list. The parallel processing function provided by LUSTR was applied to reduce the processing time for calling. We set thresholds for the “Caller” module to call all STR loci with alleles expanded by more than 100 bp relative to the human genome reference, allelic fractions larger than 5%, and variant sites called by more than 15 realigned pairs, not counting repeat-only pairs, with at least medium calling quality as determined by the LUSTR “Caller” module. Note that such settings can be relaxed to reduce the risk of false negatives and to capture mosaicism. The STR expansions fulfilling the quality control metrics were then checked to assess whether they were detected in both individual libraries of subject 1 or subject 2, or in at least three individual libraries of subject 3. Following these steps, we identified 86 candidate STR expansions for subject 1, 78 candidates for subject 2, and 33 candidates for subject 3 (Table 5).
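As a rough illustration of the thresholding and library-consistency check described above, the sketch below filters per-library calls and keeps loci that pass in a minimum number of libraries. The dictionary keys, the quality labels, and both function names are hypothetical and do not reflect the actual output format of the LUSTR “Caller” module; `realigned_pairs` is assumed to already exclude repeat-only pairs.

```python
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}  # assumed labels


def passes_thresholds(call, min_expansion_bp=100, min_fraction=0.05,
                      min_pairs=15, min_quality="medium"):
    """Apply the four thresholds quoted in the text to a single call."""
    return (call["expansion_bp"] > min_expansion_bp
            and call["allele_fraction"] > min_fraction
            and call["realigned_pairs"] > min_pairs
            and QUALITY_RANK[call["quality"]] >= QUALITY_RANK[min_quality])


def consistent_candidates(calls_by_library, min_libraries):
    """Return loci whose calls pass the thresholds in at least min_libraries libraries."""
    counts = {}
    for calls in calls_by_library.values():
        # Count each locus at most once per library.
        passing = {c["locus"] for c in calls if passes_thresholds(c)}
        for locus in passing:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(locus for locus, n in counts.items() if n >= min_libraries)


calls_by_library = {
    "lib1": [
        {"locus": "GLS", "expansion_bp": 900, "allele_fraction": 0.45,
         "realigned_pairs": 40, "quality": "high"},
        {"locus": "locusX", "expansion_bp": 150, "allele_fraction": 0.30,
         "realigned_pairs": 22, "quality": "medium"},
    ],
    "lib2": [
        {"locus": "GLS", "expansion_bp": 870, "allele_fraction": 0.50,
         "realigned_pairs": 35, "quality": "high"},
        {"locus": "locusX", "expansion_bp": 150, "allele_fraction": 0.30,
         "realigned_pairs": 9, "quality": "low"},
    ],
}
print(consistent_candidates(calls_by_library, min_libraries=2))
```

The values in the example are invented for illustration only. Under this scheme, subjects 1 and 2 would use `min_libraries=2` and subject 3 would use `min_libraries=3`.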

Among the 86 candidates for subject 1, 49 STR expansions were also detected with similar or larger sizes in subjects 2 and 3 and were assumed to be either benign polymorphisms or sequencing artifacts. Among the remaining 37, we focused on the 20 candidates with high calling quality for primary investigation (Tables 5 and 6). We next examined the detailed features of these 20 candidates to rank their priority based on the likelihood that they contribute to the individual’s phenotype. Distinct from the previously excluded 49 candidates that passed the threshold and were also called in subjects 2 and 3, many of these 20 candidates were either called in only one of subjects 2 or 3, or called with a smaller expansion or a low allele fraction that did not pass the threshold in subjects 2 and 3. This may indicate non-specificity, but could also indicate potential genetic penetrance. We decided to keep them on the list, but took this into consideration when determining priority (Table 6). Another important feature considered for the 20 candidates was the reference size of each candidate STR, since the estimation for STR expansions with reference sizes longer than the read length was more likely to be affected by sequencing randomness and off-target repeats than the estimation for those with shorter sizes (Table 6). We also investigated other features such as the locations of the candidates relative to the affected genes, the potential for off-target alignment or the presence of mutations in the flanking sequences, and the number of called alleles, which could indicate complex situations requiring further examination (Table 6). Among all these candidates, the STR expansion at the GLS gene, a known pathogenic STR, was deemed the most likely candidate in subject 1 (Table 6). We also identified STR expansions at ARHGAP28 and other loci with high priorities that may also be worthy of further consideration (Table 6).
Once unblinded, we found that the GLS expansion was indeed the suspected pathogenic variant identified for subject 1.

Table 6 Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 1

We applied a similar procedure to subjects 2 and 3 and narrowed down the candidate lists to 21 and 1 high-quality STR calls, respectively (Supplementary Table 2). However, we could only deem the TCF4 STR expansion a possible candidate for subject 2, and no possible candidates were identified for subject 3. After unblinding, we found that both subjects harbored likely pathogenic RFC1 STRs. The RFC1 STR variants in the two subjects included a replacement of the repetitive “AAAAG” with “AAGGG”, a 1-bp shift, and the expansion (AA + AAAAG × 11 + AAAAAG → AAA + AAGGG × 10 + AAGAAAAAG → AAA + AAGGG × n + AAGAAAAAG). This explains why LUSTR, when searching for “AAAAG” repeats under the default settings, gave expansion signals at the RFC1 locus for the two subjects only with very low realignment coverage and low calling quality, which did not happen for the parents and sibling (Supplementary Table 3). To evaluate whether LUSTR is flexible enough to detect this complex RFC1 expansion, we first tried reducing the mismatch penalty. More pairs were realigned, but the calling qualities did not improve enough for successful detection, because changing the penalty alone did not help retrieve repeat-dominant reads (Supplementary Table 3). However, by applying a customized alternative RFC1 STR reference with “AAGGG” repeats, the RFC1 expansions were successfully detected with high coverage and quality for both subjects 2 and 3 (Supplementary Table 3). Moreover, by combining the results from the two RFC1 STR references, LUSTR genotyped an “AAGGG” expansion allele in subject 1 inherited from the mother, as well as four individuals carrying “AAAAG” expansion alleles in the families of subject 1 and subject 2 (Supplementary Fig. 6). These cases exemplify the challenges of STR calling but also demonstrate the flexibility of LUSTR for customization upon user-specified settings.
Developing LUSTR to call non-reference STR sequences de novo is an area for future development of the software.
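A small worked example may help show why reads carrying “AAGGG” repeats are hard to realign against an “AAAAG” reference. The sketch below scores one read repeat unit against every rotation of the reference unit without gaps. It assumes the match and mismatch values quoted above for the default settings (+2 and -5) also apply at this step, and the function `best_unit_score` is purely illustrative rather than part of LUSTR.

```python
def best_unit_score(read_unit, ref_unit, match=2, mismatch=-5):
    """Best ungapped score of one read repeat unit against any rotation of ref_unit."""
    if len(read_unit) != len(ref_unit):
        raise ValueError("units must have the same length")
    best = None
    for shift in range(len(ref_unit)):
        # Rotating the reference unit accounts for phase shifts such as
        # the 1-bp shift described for the RFC1 locus.
        rotated = ref_unit[shift:] + ref_unit[:shift]
        score = sum(match if a == b else mismatch
                    for a, b in zip(read_unit, rotated))
        best = score if best is None else max(best, score)
    return best


print(best_unit_score("AAGGG", "AAAAG"))               # default penalties
print(best_unit_score("AAGGG", "AAAAG", mismatch=-2))  # relaxed mismatch penalty
print(best_unit_score("AAGGG", "AAGGG"))               # customized reference unit
```

Under these assumptions, the best rotation still leaves two mismatches per 5-bp unit, giving a per-unit score of -4 with the default penalties, +2 with a mismatch penalty of -2, and +10 against a matching “AAGGG” unit. The small positive score under a relaxed penalty is consistent with the observation that more pairs could be realigned without the calls becoming reliable, whereas a customized reference restores full-strength matches.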

Discussion

Besides the utility of STRs in kinship determination and identity verification, STRs have attracted significant attention for their role in human neurological disorders. Genome-wide sequencing offers tremendous potential to identify STRs that may contribute to disease. Despite the recent progress made in calling STR variants in short read sequence data, there is an on-going need for improvements to make calling more user friendly and interpretable [20, 35, 63].

The LUSTR pipeline described here builds on the advantages of several different existing STR variant calling tools [37,38,39,40,41,42,43,44,45]. LUSTR specifically aims to provide an alternative choice to benefit users with varied conditions or in need of more flexible input requirements (Supplementary Table 4). LUSTR adopts the strategy of realigning as many reads as possible to each STR locus in order to make STR calling as sensitive and accurate as possible. It also enables the detection of deviations in allele frequencies that may indicate mosaicism, which has hardly been addressed to date by existing STR callers (Supplementary Table 4). LUSTR follows the classic pipeline of mapping, local realignment, and then STR calling. However, distinct from other existing tools, LUSTR requires a de novo mapping from the raw reads to STR-specific references generated in the pipeline, rather than directly processing bams from whole genome mapping. Although this may increase running time and storage requirements, the design aims to improve the sensitivity to specifically call STRs, and reflects the idea that STR mutation should be considered a unique type of variation that requires a pipeline distinct from those designed to call SNVs and INDELs. In our tests running LUSTR along with the existing STR variant calling tools, LUSTR and ExpansionHunter showed consistent calls in most cases (> 90%, Tables 2 and 3). For the discordant loci, neither LUSTR nor ExpansionHunter showed a significant overall advantage over the other, indicating that each tool has pros and cons under different conditions. As for running speed, a single process for whole genome STR genotyping by LUSTR takes days to finish, varying within a range of about seven days depending on several factors including sequencing depth, the size of the target STR list, and running platform conditions.
This running speed, mostly dictated by the Realigner module, is slower than ExpansionHunter or GangSTR when simple target inputs are supplied, but comparable when off-target information is provided [42, 45]. Moreover, LUSTR allows for parallel processing, which will greatly increase the running speed (Supplementary Table 4). In the local realignment step, LUSTR uses the periodic Smith-Waterman algorithm to solve the challenges of imperfect repeats and sequencing errors that happen within STR repetitive regions. While this approach increases sensitivity for long expansions with an expected trade-off in specificity, we note that parameters in the Finder module and subsequent calling step can be altered to favor specificity over sensitivity. New optional modules are under development to further reduce noise and enhance specificity to benefit certain situations such as cohort-level association analyses.
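To make the idea of aligning a read against a repeating pattern more concrete, the sketch below scores a read against a repeat motif using ordinary Smith-Waterman local alignment on a tiled copy of the motif. This is a simplified stand-in, not the periodic Smith-Waterman algorithm with modifications used in LUSTR: a true periodic implementation wraps the motif around inside the recurrence instead of tiling it. The scoring values (+2/-5/-7) are borrowed from the default settings quoted earlier, and the function name `tiled_repeat_score` is hypothetical.

```python
def tiled_repeat_score(read, motif, match=2, mismatch=-5, gap=-7):
    """Best local alignment score of read against a tiled repeat of motif.

    Tiling approximation of periodic alignment: the motif is repeated often
    enough to cover the read in any phase, then standard Smith-Waterman with
    a linear gap penalty is applied.
    """
    copies = len(read) // len(motif) + 2
    target = motif * copies

    prev = [0] * (len(target) + 1)
    best = 0
    for r in read:
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if r == t else mismatch)
            # Local alignment: scores are floored at zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


perfect = "CAG" * 10
one_error = "CAG" * 5 + "CTG" + "CAG" * 4
print(tiled_repeat_score(perfect, "CAG"))    # 30 matches
print(tiled_repeat_score(one_error, "CAG"))  # 29 matches and 1 mismatch
```

A perfect 30-bp CAG read scores 60, and a single substitution lowers the score to 53 while the alignment still spans the whole read, which is the tolerance to imperfect repeats and sequencing errors that the realignment step relies on.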

Long read sequencing technologies that have recently emerged will likely improve STR variant calling. LUSTR is designed for short read sequencing data, which remains much more commonly used due to the cost and accuracy limitations of current long read sequencing technologies. Even when long read sequencing becomes more economical and accurate, there will still be large numbers of genomes sequenced with short reads, for which short read STR variant callers will still be needed. Newer tools have been developed that incorporate algorithms compatible with long reads to address this emerging need [46]. Another future development of LUSTR will focus on ensuring compatibility of the caller with long read sequencing data.

Both the local realignment and the variant calling steps are widely acknowledged as critical factors required for accurate STR variant calling [38,39,40,41,42,43,44,45,46]. However, the importance of STR sequence definition is often underestimated when STR target list customization is required, which is another important feature where LUSTR will provide an improved experience compared to other existing tools (Supplementary Table 4). The repeat regions of STRs often contain partial or imperfect repetitive sequences, natural SNVs and short INDELs, as well as sequencing errors introduced during the establishment of the reference. Therefore, the boundaries of STRs may vary largely according to different definition rules, making it difficult for users to precisely define the STR regions of the genome. Furthermore, inconsistent rules applied to STR boundary definition and local realignment may lead to aberrant calls. One solution is to provide an STR list with the optimal format [42, 64]. Although such a list can be updated and expanded following newly emerging clinical discoveries, this approach limits the ability of the user to add new STR loci of interest or loci that arise in new releases of the reference genome. Beyond the widely used GRCh37/38 genomes, there have been several new genome references, such as the Telomere-to-Telomere genome reference (T2T), the Han Chinese genome reference (HG00514), and the Japanese genome reference (JG2) [65,66,67,68]. To apply STR analysis using these novel genomes or even non-human references, users need to be able to easily switch between genomes and add novel STR loci of interest. LUSTR fulfills this need with the Finder module, which allows for flexible input and a standardized approach for determining STR boundaries. By automatically applying the same set of parameters in the subsequent local realignment, it allows easy customization of STR lists and also makes it possible to apply unbiased genome-wide scans for STR variants.

STR analysis can also be challenging when certain loci share homologous flanking sequences with other STRs with identical or similar repetitive units. These loci can result in off-target mapping and ultimately inaccurate STR calling. To address this, LUSTR provides warning messages to indicate signals for potential off-targets or complex mutations close to the STR boundaries that may affect on-target mapping. With these sites flagged, users can investigate and determine whether a call may have arisen due to mapping errors, and apply the option provided by LUSTR to process only the non-homologous side when necessary. LUSTR also takes steps to minimize the potential issues arising from mapping by giving flexibility to adjust alignment approaches. For example, in our analyses we noticed that bwa mem automatically adjusted the mapping depth and generated more off-target hits when the reads were mapped to a small target STR list compared to a whole genome scan [69]. One strategy that LUSTR allows is to use a larger STR list for bwa mem, and then focus on the small list for subsequent local realignment and calling. Also, the de novo mapping design of the LUSTR pipeline provides the flexibility for users to easily apply alternative approaches. Alternative mapping methods such as bwa aln or bowtie may work better in situations where target STRs have homologous loci [69,70,71], and these can be easily incorporated into the LUSTR calling pipeline. Furthermore, LUSTR splits the local realignment and variant calling steps apart and provides intermediate output in plain text format. This design provides an intermediate checkpoint for users to track the performance and allows modifications to be made with ease if desired.

LUSTR exhibited great potential on both simulated and real data sets. In addition to calling concordant genotypes for > 85% of GIAB benchmark calls, LUSTR also successfully identified several STR variants that were not identified in the GIAB published variant calls (VCFs). The variants called by LUSTR were further supported by examining Mendelian inheritance rules, visual inspection of the raw sequence reads (Supplementary Fig. 5), and independent calls by ExpansionHunter (Tables 2 and 3). While the exact reason for the overlooked calls is unclear, the difference between LUSTR and GIAB calls may highlight the importance of applying STR-specific variant calling tools instead of modified INDEL calling methods as were used for GIAB calling [62]. In evaluating subject samples with expected pathogenic STR variant loci, LUSTR demonstrated its ability to apply a whole genome scan to identify disease-causing STR expansions. Using the parallel processing option available in LUSTR, the realignment for the whole genome 168k STRs can be done within days, and the calling step based on the realignment results can be finished within minutes. Among all of the 168k STRs called genome-wide in subject 1, LUSTR successfully identified the expected target, an expansion in the GLS gene, among a small list of high-quality candidate calls. Given that such a result was obtained using only three individuals to filter non-specific STR variants, it supports the utility of LUSTR to identify clinically meaningful variants when only a small cohort is available. Furthermore, the performance that LUSTR showed in both detection sensitivity and noise exclusion in just three samples suggests that LUSTR will offer a powerful tool to facilitate large-scale association studies looking for STRs that are associated with disease risk [11].
Besides the GLS STR expansion recognized with top priority, LUSTR also detected several other candidate loci that may contribute to the individual's phenotype. This feature makes it possible to use LUSTR to identify oligo- or polygenic risk factors associated with disease [72,73,74,75].

While the term “de novo STR mutation” usually indicates the situation in which the progeny carries a new STR mutation or pathogenic size expansion not inherited from the parents, the term “novel STR mutation” can be confusing and is often used either in clinical diagnosis or in annotation to distinguish from the term “known STR mutation” [44, 76]. Clinically “novel” STR expansions can refer to newly identified causative expansions without a previous clinical report. Reference-related “novel” STRs, however, are repeats that are not present in the reference but appear in individuals. Such reference “novel” STRs are challenging to detect by traditional pipelines, especially when no prior knowledge is available. As they attract more attention, several recent tools have been developed to be compatible with “novel” STRs [47, 76, 77]. Aiming for more informative genotyping of each given STR locus, LUSTR focuses on known STRs with repeats available in the reference, with a module for novel STRs to be added in future updates. Alternatively, when the information is available from clinical reports or from other STR variant calling tools, users can easily customize the list for novel STRs of interest. This limitation of LUSTR explains why it missed the RFC1 calls in the UDN subjects but was able to detect them later with simple running modifications. The RFC1 expansion in the two UDN subjects is inherited from an expansion and replacement activity localized to an Alu element, after a nucleotide switch from “AGA” to “GGG” as well as a single nucleotide shift, rendering the reference repetitive unit changed from “AAAAG” to “AAGGG” [78, 79]. It was equivalent to a novel STR expansion and thereby escaped detection by LUSTR when the reference “AAAAG” repetitive unit was expected, with extremely low numbers of reads able to be realigned.
Such a coverage warning can alert users to the potential existence of this type of STR mutation, and this can be addressed in future updates of LUSTR. However, by simply applying a customized “AAGGG” RFC1 STR reference or modifying the run with a lower mismatch penalty to allow the realignment of “AAGGG” to “AAAAG”, LUSTR was able to detect the expansion. Furthermore, with such modifications LUSTR identified the inheritance of an “AAAAG” expansion allele and the carriers of a heterozygous “AAGGG” expansion allele in the families of the two UDN subjects (Supplementary Fig. 6), allowing for further investigation into their potential unrevealed contributions to the phenotype, which so far are suggested to be benign [78,79,80]. This case illustrates the flexibility of LUSTR when applied to complex situations encountered with novel STRs.

Conclusions

In summary, LUSTR is a reliable and powerful tool for both germline and somatic STR variant calling, and we expect its application to contribute to studies evaluating the role of STR mutations in disease.

Software availability and requirements

Project name: LUSTR

Project home page: https://github.com/JLuGithub/LUSTR

Operating system: Linux

Programming language: Perl

Other requirements: samtools, mapping software such as bwa or bowtie

License: GNU GPL

Any restrictions to use by non-academics: licence needed

Method

LUSTR script

The code for each of the LUSTR modules was written in Perl. The regular Smith-Waterman algorithm was applied to the local realignment of short sequences to STR flanking regions. A periodic Smith-Waterman algorithm with modifications was applied to the recognition of STR repeat sequences. The sizes of STR repeats and allele fractions were estimated by calculating the ratios between the counts of reads with and without flanking sequences. The core concept equations are listed below, with modifications applied in practice to allow for random sequencing bias. Equations 1 and 2 are first applied to judge the existence of an allele with repeat length longer than the sequencing read length. Upon the detection of a signal, Equation 3 is used to estimate the size of the repeat region for the allele. The fraction of each allele is then determined by the combination of Equations 4, 5, 6 and 7. The calling reliability was determined by the counts of reads categorized into different patterns and flanking-repeat length distributions under the parameters provided by users. Future updates of the LUSTR script will include applications of probabilistic methods to the repeat size estimation, statistical methods to the reliability determination, and functions to incorporate de novo STR variants and long read sequencing libraries.

$$E_{n+1}=1\quad\text{if } \left(O_{n+1}>O_n\right)\ \text{and}\ \left(O_n\geq\sum_{k=1}^{n}\frac{2S_kC_k}{L-S_k}\right)$$
(1)
$$E_{n+1}=0\quad\text{otherwise}$$
(2)
$$\frac{2LR_1}{O_{n+1}-\sum_{k=1}^{n}\frac{2S_kC_k}{L-S_k}}+L\ \leq\ S_{n+1}\ \leq\ \frac{2L(R_1+R_2)}{O_{n+1}-\sum_{k=1}^{n}\frac{2S_kC_k}{L-S_k}}+L\quad(\text{if } E_{n+1}=1)$$
(3)
$$\sum_{i=1}^{n+1}F_i=1$$
(4)
$$\frac{F_i}{F_j}=\frac{C_i(L-S_j)}{C_j(L-S_i)}\quad(1\leq i,j\leq n)$$
(5)
$$\frac{F_{n+1}}{F_i}=\left(O_{n+1}-\sum_{k=1}^{n}\frac{2S_kC_k}{L-S_k}\right)\cdot\frac{L-S_i}{2C_iL}\quad(1\leq i\leq n,\ \text{if } E_{n+1}=1)$$
(6)
$$\frac{F_{n+1}}{F_i}=0\quad(1\leq i\leq n,\ \text{if } E_{n+1}=0)$$
(7)

$L$ indicates the sequencing read length (bp).

$n$ indicates the number of alleles with repeat sizes (bp) that can be directly detected by reads containing sequences from both flanking regions.

$E_{n+1}$ indicates the existence of the allele (allele $n+1$) with repeat size longer than the sequencing read length.

$S_i$ ($1 \leq i \leq n$) indicates the repeat size of allele $i$ directly detected by reads; $S_{n+1}$ indicates the repeat size of allele $n+1$, which is longer than the sequencing read length and thus needs to be estimated.

$C_i$ ($1 \leq i \leq n$) indicates the number of reads containing sequences from both flanking regions and a repeat region with size of $S_i$, thus belonging to allele $i$.

$O_n$ indicates the number of reads containing sequences from only one flanking region, and from the pairs with any repeat length ≤ the maximum from $S_1$ to $S_n$; $O_{n+1}$ indicates the number of all of the reads containing sequences from only one flanking region.

$F_i$ ($1 \leq i \leq n$) indicates the fraction of allele $i$; $F_{n+1}$ indicates the fraction of allele $n+1$, whose repeat size is longer than the sequencing read length.

$R_1$ indicates the number of reads containing only repeat sequences but not from repeat-only pairs; $R_2$ indicates the number of reads containing only repeat sequences and from repeat-only pairs.
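The core concept in Equations 1 to 7 can be written out as a short function. The sketch below is an illustrative Python transcription only: LUSTR itself is written in Perl and applies further modifications for random sequencing bias that are not reproduced here. The function name `estimate_alleles` and its return layout are assumptions, every directly observed repeat size is assumed to be shorter than the read length, and at least one read count is assumed to be non-zero.

```python
def estimate_alleles(L, S, C, O_n, O_n1, R1, R2):
    """Apply the core Equations 1 to 7.

    L        : read length (bp)
    S, C     : repeat sizes and spanning-read counts of the n observed alleles
    O_n, O_n1: one-flank read counts (O_n and O_{n+1} in the text)
    R1, R2   : repeat-only read counts as defined in the text
    Returns (E, S_range, fractions); S_range is None when E == 0.
    """
    # Shared term: sum of 2*S_i*C_i / (L - S_i) over the observed alleles.
    expected = sum(2 * s * c / (L - s) for s, c in zip(S, C))

    # Equations 1 and 2: evidence for an allele longer than the read length.
    E = 1 if (O_n1 > O_n and O_n >= expected) else 0

    # Equation 5 implies F_i is proportional to C_i / (L - S_i) for i <= n.
    weights = [c / (L - s) for s, c in zip(S, C)]

    if E == 1:
        excess = O_n1 - expected  # strictly positive whenever E == 1
        # Equation 3: lower and upper bounds on the long allele's repeat size.
        S_range = (2 * L * R1 / excess + L, 2 * L * (R1 + R2) / excess + L)
        # Equation 6, rewritten on the same scale as the weights above.
        weights.append(excess / (2 * L))
    else:
        S_range = None
        weights.append(0.0)  # Equation 7

    total = sum(weights)  # Equation 4: fractions must sum to 1
    fractions = [w / total for w in weights]
    return E, S_range, fractions


print(estimate_alleles(L=100, S=[20], C=[40], O_n=20, O_n1=120, R1=10, R2=10))
```

With a read length of 100 bp, one observed 20-bp allele supported by 40 spanning reads, 20 and 120 one-flank reads for the two counts, and 10 reads in each repeat-only category, the sketch reports a long allele (E = 1) with an estimated size between 120 and 140 bp and allele fractions of 0.5 each. These input values are invented for illustration.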

Data processing

The running of LUSTR and the processing of the short read sequencing libraries were done in Linux, with SAMTOOLS 1.14 pre-installed. The mapping of reads to STR references was done by BWA MEM version 0.7.

Data generation

The simulated data in the performance test of LUSTR was generated by an in-house Perl script. STR references with expected repeat sizes were prepared, and then read pairs were generated in random directions from the STR references. The pattern of each read was recorded to evaluate the performance of LUSTR calling. To imitate sequencing errors, each nucleotide in a read was altered, deleted, or followed by an inserted base with a given probability.
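The simulation procedure can be sketched as follows. This is not the in-house Perl script: the fragment model, the equal split between substitutions, deletions and insertions, and all function and parameter names are assumptions made for illustration.

```python
import random


def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def add_errors(read, rate, rng):
    """Alter, delete, or insert after each base with probability `rate`."""
    out = []
    for base in read:
        if rng.random() < rate:
            kind = rng.choice(("substitute", "delete", "insert"))
            if kind == "substitute":
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif kind == "insert":
                out.append(base)
                out.append(rng.choice("ACGT"))
            # kind == "delete": the base is simply skipped
        else:
            out.append(base)
    return "".join(out)


def simulate_pair(reference, read_len, insert_size, error_rate, rng):
    """Draw one read pair from a random position and strand of the reference."""
    start = rng.randrange(len(reference) - insert_size + 1)
    fragment = reference[start:start + insert_size]
    if rng.random() < 0.5:
        fragment = reverse_complement(fragment)  # random direction
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    return add_errors(read1, error_rate, rng), add_errors(read2, error_rate, rng)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
flank = "ACGTTGCAAGGCTTAACCGG" * 5
reference = flank + "CAG" * 30 + flank  # hypothetical STR reference
pairs = [simulate_pair(reference, 50, 200, 0.01, rng) for _ in range(5)]
for r1, r2 in pairs:
    print(len(r1), len(r2))
```

Recording the true origin and repeat content of each simulated read, as described above, would only require returning `start` and the strand alongside the read sequences.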

The mixed library data was generated by an in-house Perl script.

Availability of data and materials

The raw read libraries and variant calling result files from the GIAB project were downloaded from ftp://ftp-trace.ncbi.nih.gov/giab/ftp/. The raw read libraries from the UDN project were obtained directly from the NIH Undiagnosed Diseases Program through collaboration.

Abbreviations

STRs: Short tandem repeats

INDELs: Insertions or deletions

ALS: Amyotrophic lateral sclerosis

PCR: Polymerase chain reaction

GIAB: Genome in a Bottle Consortium

VCFs: Variant calling files

UDN: Undiagnosed Disease Network

T2T: Telomere-to-Telomere genome reference

References

  1. Tautz D, Schlötterer C. Simple sequences. Curr Opin Genet Dev. 1994;4(6):832–7. https://doi.org/10.1016/0959-437x(94)90067-1. PMID: 7888752.

  2. Fan H, Chu JY. A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics. 2007;5(1):7–14. https://doi.org/10.1016/S1672-0229(07)60009-6. PMID: 17572359; PMCID: PMC5054066.

  3. Hamada H, Petrino MG, Kakunaga T. A novel repeated element with Z-DNA-forming potential is widely found in evolutionarily diverse eukaryotic genomes. Proc Natl Acad Sci U S A. 1982;79(21):6465–9. https://doi.org/10.1073/pnas.79.21.6465. PMID: 6755470; PMCID: PMC347147.

  4. Tautz D, Renz M. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 1984;12(10):4127–38. https://doi.org/10.1093/nar/12.10.4127. PMID: 6328411; PMCID: PMC318821.

  5. van Belkum A, Scherer S, van Alphen L, Verbrugh H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol Mol Biol Rev. 1998;62(2):275–93.

  6. Madsen BE, Villesen P, Wiuf C. Short tandem repeats in human exons: a target for disease mutations. BMC Genomics. 2008;12(9):410. https://doi.org/10.1186/1471-2164-9-410. PMID: 18789129; PMCID: PMC2543027.

  7. Kornberg A, Bertsch LL, Jackson JF, Khorana HG. Enzymatic synthesis of deoxyribonucleic acid, XVI. Oligonucleotides as templates and the mechanism of their replication. Proc Natl Acad Sci U S A. 1964;51(2):315–23. https://doi.org/10.1073/pnas.51.2.315. PMID: 14124330; PMCID: PMC300067.

  8. Strand M, Prolla TA, Liskay RM, Petes TD. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature. 1993;365(6443):274–6. https://doi.org/10.1038/365274a0. Erratum in: Nature. 1994 Apr 7;368(6471):569. PMID: 8371783.

  9. Weber JL, Wong C. Mutation of human short tandem repeats. Hum Mol Genet. 1993;2(8):1123–8. https://doi.org/10.1093/hmg/2.8.1123. PMID: 8401493.

  10. Ellegren H. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat Genet. 2000;24(4):400–2. https://doi.org/10.1038/74249. PMID: 10742106.

  11. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, Daly MJ, Price AL, Pritchard JK, Sharp AJ, Erlich Y. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–9. https://doi.org/10.1038/ng.3461. Epub 2015 Dec 7. PMID: 26642241; PMCID: PMC4909355.

  12. Sun JH, Zhou L, Emerson DJ, Phyo SA, Titus KR, Gong W, Gilgenast TG, Beagan JA, Davidson BL, Tassone F, Phillips-Cremins JE. Disease-associated short tandem repeats co-localize with chromatin domain boundaries. Cell. 2018;175(1):224-238.e15. https://doi.org/10.1016/j.cell.2018.08.005. Epub 2018 Aug 30. PMID: 30173918; PMCID: PMC6175607.

  13. Hannan A. Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet. 2018;19:286–98. https://doi.org/10.1038/nrg.2017.115.

  14. Fu YH, Kuhl DP, Pizzuti A, Pieretti M, Sutcliffe JS, Richards S, Verkerk AJ, Holden JJ, Fenwick RG Jr, Warren ST, et al. Variation of the CGG repeat at the fragile X site results in genetic instability: resolution of the Sherman paradox. Cell. 1991;67(6):1047–58. https://doi.org/10.1016/0092-8674(91)90283-5. PMID: 1760838.

  15. Kremer B, Almqvist E, Theilmann J, Spence N, Telenius H, Goldberg YP, Hayden MR. Sex-dependent mechanisms for expansions and contractions of the CAG repeat on affected Huntington disease chromosomes. Am J Hum Genet. 1995;57(2):343–50. PMID: 7668260; PMCID: PMC1801544.

  16. Mirkin SM. Expandable DNA repeats and human disease. Nature. 2007;447(7147):932–40. https://doi.org/10.1038/nature05977. PMID: 17581576.

  17. La Spada AR, Taylor JP. Repeat expansion disease: progress and puzzles in disease pathogenesis. Nat Rev Genet. 2010;11(4):247–58. https://doi.org/10.1038/nrg2748. PMID: 20177426; PMCID: PMC4704680.

  18. McMurray CT. Mechanisms of trinucleotide repeat instability during human development. Nat Rev Genet. 2010;11(11):786–99. https://doi.org/10.1038/nrg2828. Erratum in: Nat Rev Genet. 2010 Dec;11(12):886. PMID: 20953213; PMCID: PMC3175376.

  19. Pearson CE, Nichol Edamura K, Cleary JD. Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet. 2005;6(10):729–42. https://doi.org/10.1038/nrg1689. PMID: 16205713.

  20. Depienne C, Mandel JL. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet. 2021;108(5):764–85. https://doi.org/10.1016/j.ajhg.2021.03.011. Epub 2021 Apr 2. PMID: 33811808.

  21. Lavedan C, Hofmann-Radvanyi H, Shelbourne P, Rabes JP, Duros C, Savoy D, Dehaupas I, Luce S, Johnson K, Junien C. Myotonic dystrophy: size- and sex-dependent dynamics of CTG meiotic instability, and somatic mosaicism. Am J Hum Genet. 1993;52(5):875–83. PMID: 8098180; PMCID: PMC1682032.

  22. Anvret M, Ahlberg G, Grandell U, Hedberg B, Johnson K, Edström L. Larger expansions of the CTG repeat in muscle compared to lymphocytes from patients with myotonic dystrophy. Hum Mol Genet. 1993;2(9):1397–400. https://doi.org/10.1093/hmg/2.9.1397. PMID: 8242063.

  23. Ashizawa T, Dubel JR, Harati Y. Somatic instability of CTG repeat in myotonic dystrophy. Neurology. 1993;43(12):2674–8. https://doi.org/10.1212/wnl.43.12.2674. PMID: 8255475.

  24. Telenius H, Kremer B, Goldberg YP, Theilmann J, Andrew SE, Zeisler J, Adam S, Greenberg C, Ives EJ, Clarke LA, et al. Somatic and gonadal mosaicism of the Huntington disease gene CAG repeat in brain and sperm. Nat Genet. 1994;6(4):409–14. https://doi.org/10.1038/ng0494-409. Erratum in: Nat Genet. 1994 May;7(1):113. PMID: 8054984.

  25. Helderman-van den Enden AT, Maaswinkel-Mooij PD, Hoogendoorn E, Willemsen R, Maat-Kievit JA, Losekoot M, Oostra BA. Monozygotic twin brothers with the fragile X syndrome: different CGG repeats and different mental capacities. J Med Genet. 1999;36(3):253–7. PMID: 10204857; PMCID: PMC1734321.

  26. Fortune MT, Vassilopoulos C, Coolbaugh MI, Siciliano MJ, Monckton DG. Dramatic, expansion-biased, age-dependent, tissue-specific somatic mosaicism in a transgenic mouse model of triplet repeat instability. Hum Mol Genet. 2000;9(3):439–45. https://doi.org/10.1093/hmg/9.3.439. PMID: 10655554.

  27. Gonitel R, Moffitt H, Sathasivam K, Woodman B, Detloff PJ, Faull RL, Bates GP. DNA instability in postmitotic neurons. Proc Natl Acad Sci U S A. 2008;105(9):3467–72. https://doi.org/10.1073/pnas.0800048105. Epub 2008 Feb 25. PMID: 18299573; PMCID: PMC2265187.

  28. McGoldrick P, Zhang M, van Blitterswijk M, Sato C, Moreno D, Xiao S, Zhang AB, McKeever PM, Weichert A, Schneider R, Keith J, Petrucelli L, Rademakers R, Zinman L, Robertson J, Rogaeva E. Unaffected mosaic C9ORF72 case: RNA foci, dipeptide proteins, but upregulated C9ORF72 expression. Neurology. 2018;90(4):e323–31. https://doi.org/10.1212/WNL.0000000000004865. Epub 2017 Dec 27. PMID: 29282338; PMCID: PMC5798652.

  29. Hearne CM, Ghosh S, Todd JA. Microsatellites for linkage analysis of genetic traits. Trends Genet. 1992;8(8):288–94. https://doi.org/10.1016/0168-9525(92)90256-4. PMID: 1509520.

  30. Bruford MW, Wayne RK. Microsatellites and their application to population genetic studies. Curr Opin Genet Dev. 1993;3(6):939–43. https://doi.org/10.1016/0959-437x(93)90017-j. PMID: 8118220.

  31. Butler JM. Genetics and genomics of core short tandem repeat loci used in human identity testing. J Forensic Sci. 2006;51(2):253–65. https://doi.org/10.1111/j.1556-4029.2006.00046.x. PMID: 16566758.

  32. Warner JP, Barron LH, Goudie D, Kelly K, Dow D, Fitzpatrick DR, Brock DJ. A general method for the detection of large CAG repeat expansions by fluorescent PCR. J Med Genet. 1996;33(12):1022–6. https://doi.org/10.1136/jmg.33.12.1022. PMID: 9004136; PMCID: PMC1050815.

  33. Buchman VL, Cooper-Knock J, Connor-Robson N, Higginbottom A, Kirby J, Razinskaya OD, Ninkina N, Shaw PJ. Simultaneous and independent detection of C9ORF72 alleles with low and high number of GGGGCC repeats using an optimised protocol of Southern blot hybridisation. Mol Neurodegener. 2013;8:12. https://doi.org/10.1186/1750-1326-8-12. PMID: 23566336; PMCID: PMC3626718.

  34. Akimoto C, Volk AE, van Blitterswijk M, Van den Broeck M, Leblond CS, Lumbroso S, Camu W, Neitzel B, Onodera O, van Rheenen W, Pinto S, Weber M, Smith B, Proven M, Talbot K, Keagle P, Chesi A, Ratti A, van der Zee J, Alstermark H, Birve A, Calini D, Nordin A, Tradowsky DC, Just W, Daoud H, Angerbauer S, DeJesus-Hernandez M, Konno T, Lloyd-Jani A, de Carvalho M, Mouzat K, Landers JE, Veldink JH, Silani V, Gitler AD, Shaw CE, Rouleau GA, van den Berg LH, Van Broeckhoven C, Rademakers R, Andersen PM, Kubisch C. A blinded international study on the reliability of genetic testing for GGGGCC-repeat expansions in C9ORF72 reveals marked differences in results among 14 laboratories. J Med Genet. 2014;51(6):419–24. https://doi.org/10.1136/jmedgenet-2014-102360. Epub 2014 Apr 4. PMID: 24706941; PMCID: PMC4033024.

  35. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21(1):30. https://doi.org/10.1186/s13059-020-1935-5. PMID: 32033565; PMCID: PMC7006217.

  36. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. https://doi.org/10.1101/gr.107524.110. PMID: 20644199; PMCID: PMC2928508.

  37. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573–80. https://doi.org/10.1093/nar/27.2.573. PMID: 9862982; PMCID: PMC148217.

  38. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. https://doi.org/10.1101/gr.135780.111. Epub 2012 Apr 20. PMID: 22522390; PMCID: PMC3371701.

  39. Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S, Balasubramanian S, Bodén M. Inferring short tandem repeat variation from paired-end short reads. Nucleic Acids Res. 2014;42(3):e16. https://doi.org/10.1093/nar/gkt1313.

  40. Kojima K, Kawai Y, Misawa K, Mimori T, Nagasaki M. STR-realigner: a realignment method for short tandem repeat regions. BMC Genomics. 2016;17(1):991. https://doi.org/10.1186/s12864-016-3294-x. PMID: 27912743; PMCID: PMC5135796.

  41. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–2. https://doi.org/10.1038/nmeth.4267. Epub 2017 Apr 24. PMID: 28436466; PMCID: PMC5482724.

  42. Dolzhenko E, van Vugt JJFA, Shaw RJ, Bekritsky MA, van Blitterswijk M, Narzisi G, Ajay SS, Rajan V, Lajoie BR, Johnson NH, Kingsbury Z, Humphray SJ, Schellevis RD, Brands WJ, Baker M, Rademakers R, Kooyman M, Tazelaar GHP, van Es MA, McLaughlin R, Sproviero W, Shatunov A, Jones A, Al Khleifat A, Pittman A, Morgan S, Hardiman O, Al-Chalabi A, Shaw C, Smith B, Neo EJ, Morrison K, Shaw PJ, Reeves C, Winterkorn L, Wexler NS, US–Venezuela Collaborative Research Group, Housman DE, Ng CW, Li AL, Taft RJ, van den Berg LH, Bentley DR, Veldink JH, Eberle MA. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 2017;27(11):1895–903. https://doi.org/10.1101/gr.225672.117. Epub 2017 Sep 8. PMID: 28887402; PMCID: PMC5668946.

  43. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, Hicks B, Heckerman D, Och FJ, Caskey CT, Venter JC, Telenti A. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes. Am J Hum Genet. 2017;101(5):700–15. https://doi.org/10.1016/j.ajhg.2017.09.013. PMID: 29100084; PMCID: PMC5673627.

  44. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, Davis M, Lamont P, Clayton JS, Laing NG, MacArthur DG, Oshlack A. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19(1):121. https://doi.org/10.1186/s13059-018-1505-2. PMID: 30129428; PMCID: PMC6102892.

  45. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47(15):e90. https://doi.org/10.1093/nar/gkz501. PMID: 31194863; PMCID: PMC6735967.

  46. Wang X, Huang M, Budowle B, Ge J. TRcaller: a novel tool for precise and ultrafast tandem repeat variant genotyping in massively parallel sequencing reads. Front Genet. 2023;14:1227176. https://doi.org/10.3389/fgene.2023.1227176. PMID: 37533432; PMCID: PMC10390829.

  47. Dolzhenko E, Bennett MF, Richmond PA, Trost B, Chen S, van Vugt JJFA, Nguyen C, Narzisi G, Gainullin VG, Gross AM, Lajoie BR, Taft RJ, Wasserman WW, Scherer SW, Veldink JH, Bentley DR, Yuen RKC, Bahlo M, Eberle MA. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol. 2020;21(1):102. https://doi.org/10.1186/s13059-020-02017-z. PMID: 32345345; PMCID: PMC7187524.

  48. Martincorena I, Campbell PJ. Somatic mutation in cancer and normal cells. Science. 2015;349(6255):1483–9. https://doi.org/10.1126/science.aab4082. Epub 2015 Sep 24. Erratum in: Science. 2016 Mar 4;351(6277). pii: aaf5401. doi: 10.1126/science.aaf5401. PMID: 26404825.

  49. Benjamin D, Sato T, Cibulskis K, Getz G, Stewart C, Lichtenstein L. Calling somatic SNVs and Indels with Mutect2. bioRxiv. 2019. https://doi.org/10.1101/861054.

  50. Manley K, Shirley TL, Flaherty L, Messer A. Msh2 deficiency prevents in vivo somatic instability of the CAG repeat in Huntington disease transgenic mice. Nat Genet. 1999;23(4):471–3. https://doi.org/10.1038/70598. PMID: 10581038.

  51. Matsuura T, Sasaki H, Yabe I, Hamada K, Hamada T, Shitara M, Tashiro K. Mosaicism of unstable CAG repeats in the brain of spinocerebellar ataxia type 2. J Neurol. 1999;246(9):835–9. https://doi.org/10.1007/s004150050464. PMID: 10525984.

  52. van den Broek WJ, Nelen MR, Wansink DG, Coerwinkel MM, te Riele H, Groenen PJ, Wieringa B. Somatic expansion behaviour of the (CTG)n repeat in myotonic dystrophy knock-in mice is differentially affected by Msh3 and Msh6 mismatch-repair proteins. Hum Mol Genet. 2002;11(2):191–8. https://doi.org/10.1093/hmg/11.2.191. PMID: 11809728.

  53. Kennedy L, Evans E, Chen CM, Craven L, Detloff PJ, Ennis M, Shelbourne PF. Dramatic tissue-specific mutation length increases are an early molecular event in Huntington disease pathogenesis. Hum Mol Genet. 2003;12(24):3359–67. https://doi.org/10.1093/hmg/ddg352. Epub 2003 Oct 21. PMID: 14570710.

  54. Gomes-Pereira M, Fortune MT, Ingram L, McAbney JP, Monckton DG. Pms2 is a genetic enhancer of trinucleotide CAG.CTG repeat somatic mosaicism: implications for the mechanism of triplet repeat expansion. Hum Mol Genet. 2004;13(16):1815–25. Epub 2004 Jun 15. PMID: 15198993.

  55. Kovtun IV, Thornhill AR, McMurray CT. Somatic deletion events occur during early embryonic development and modify the extent of CAG expansion in subsequent generations. Hum Mol Genet. 2004;13(24):3057–68. https://doi.org/10.1093/hmg/ddh325. Epub 2004 Oct 20. PMID: 15496421.

  56. Matsuura T, Fang P, Lin X, Khajavi M, Tsuji K, Rasmussen A, Grewal RP, Achari M, Alonso ME, Pulst SM, Zoghbi HY, Nelson DL, Roa BB, Ashizawa T. Somatic and germline instability of the ATTCT repeat in spinocerebellar ataxia type 10. Am J Hum Genet. 2004;74(6):1216–24. https://doi.org/10.1086/421526. Epub 2004 May 4. PMID: 15127363; PMCID: PMC1182085.

  57. Rindler PM, Clark RM, Pollard LM, De Biase I, Bidichandani SI. Replication in mammalian cells recapitulates the locus-specific differences in somatic instability of genomic GAA triplet-repeats. Nucleic Acids Res. 2006;34(21):6352–61. https://doi.org/10.1093/nar/gkl846. Epub 2006 Nov 16. PMID: 17142224; PMCID: PMC1669776.

  58. Kovtun IV, Liu Y, Bjoras M, Klungland A, Wilson SH, McMurray CT. OGG1 initiates age-dependent CAG trinucleotide expansion in somatic cells. Nature. 2007;447(7143):447–52. https://doi.org/10.1038/nature05778. Epub 2007 Apr 22. PMID: 17450122; PMCID: PMC2681094.

  59. Shelbourne PF, Keller-McGandy C, Bi WL, Yoon SR, Dubeau L, Veitch NJ, Vonsattel JP, Wexler NS, US-Venezuela Collaborative Research Group, Arnheim N, Augood SJ. Triplet repeat mutation length gains correlate with cell-type specific vulnerability in Huntington disease brain. Hum Mol Genet. 2007;16(10):1133–42. https://doi.org/10.1093/hmg/ddm054. Epub 2007 Apr 4. PMID: 17409200.

  60. Libby RT, Hagerman KA, Pineda VV, Lau R, Cho DH, Baccam SL, Axford MM, Cleary JD, Moore JM, Sopher BL, Tapscott SJ, Filippova GN, Pearson CE, La Spada AR. CTCF cis-regulates trinucleotide repeat instability in an epigenetic manner: a novel basis for mutational hot spot determination. PLoS Genet. 2008;4(11):e1000257. https://doi.org/10.1371/journal.pgen.1000257. Epub 2008 Nov 14. PMID: 19008940; PMCID: PMC2573955.

  61. Goula AV, Berquist BR, Wilson DM 3rd, Wheeler VC, Trottier Y, Merienne K. Stoichiometry of base excision repair proteins correlates with increased somatic CAG instability in striatum over cerebellum in Huntington’s disease transgenic mice. PLoS Genet. 2009;5(12):e1000749. https://doi.org/10.1371/journal.pgen.1000749. Epub 2009 Dec 4. PMID: 19997493; PMCID: PMC2778875.

  62. Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg L, Truty R, McLean CY, De La Vega FM, Xiao C, Sherry S, Salit M. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019;37(5):561–6. https://doi.org/10.1038/s41587-019-0074-6. Epub 2019 Apr 1. PMID: 30936564; PMCID: PMC6500473.

  63. Cao MD, Balasubramanian S, Bodén M. Sequencing technologies and tools for short tandem repeat variation detection. Brief Bioinform. 2015;16(2):193–204. https://doi.org/10.1093/bib/bbu001. Epub 2014 Feb 6. PMID: 24504770.

  64. Halman A, Dolzhenko E, Oshlack A. STRipy: a graphical application for enhanced genotyping of pathogenic short tandem repeats in sequencing data. Hum Mutat. 2022;43(7):859–68. https://doi.org/10.1002/humu.24382. Epub 2022 Apr 21. PMID: 35395114; PMCID: PMC9541159.

  65. Via M, Gignoux C, Burchard EG. The 1000 Genomes Project: new opportunities for research and social challenges. Genome Med. 2010;2(1):3. https://doi.org/10.1186/gm124. PMID: 20193048; PMCID: PMC2829928.

  66. Hickey G, Heller D, Monlong J, Sibbesen JA, Sirén J, Eizenga J, Dawson ET, Garrison E, Novak AM, Paten B. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 2020;21(1):35. https://doi.org/10.1186/s13059-020-1941-7. PMID: 32051000; PMCID: PMC7017486.

  67. Takayama J, Tadaka S, Yano K, Katsuoka F, Gocho C, Funayama T, Makino S, Okamura Y, Kikuchi A, Sugimoto S, Kawashima J, Otsuki A, Sakurai-Yageta M, Yasuda J, Kure S, Kinoshita K, Yamamoto M, Tamiya G. Construction and integration of three de novo Japanese human genome assemblies toward a population-specific reference. Nat Commun. 2021;12(1):226. https://doi.org/10.1038/s41467-020-20146-8. PMID: 33431880; PMCID: PMC7801658.

  68. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen NC, Cheng H, Chin CS, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376(6588):44–53. https://doi.org/10.1126/science.abj6987. Epub 2022 Mar 31. PMID: 35357919; PMCID: PMC9186530.

  69. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics. 2013. https://doi.org/10.48550/arXiv.1303.3997.

  70. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4. PMID: 19261174; PMCID: PMC2690996.

  71. Oliva A, Tobler R, Llamas B, Souilmi Y. Additional evaluations show that specific BWA-aln settings still outperform BWA-mem for ancient DNA data alignment. Ecol Evol. 2021;11(24):18743–8. https://doi.org/10.1002/ece3.8297. PMID: 35003706; PMCID: PMC8717315.

  72. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273(5281):1516–7. https://doi.org/10.1126/science.273.5281.1516. PMID: 8801636.

  73. Altmüller J, Palmer LJ, Fischer G, Scherb H, Wjst M. Genomewide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet. 2001;69(5):936–50. https://doi.org/10.1086/324069. PMID: 11565063; PMCID: PMC1274370.

  74. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53. https://doi.org/10.1038/nature08494. PMID: 19812666; PMCID: PMC2831613.

  75. Ibanez L, Farias FHG, Dube U, Mihindukulasuriya KA, Harari O. Polygenic risk scores in neurodegenerative diseases: a review. Curr Genet Med Rep. 2019;7:22–9. https://doi.org/10.1007/s40142-019-0158-0.

  76. Dashnow H, Pedersen BS, Hiatt L, Brown J, Beecroft SJ, Ravenscroft G, LaCroix AJ, Lamont P, Roxburgh RH, Rodrigues MJ, Davis M, Mefford HC, Laing NG, Quinlan AR. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. bioRxiv. 2021.11.18.469113. https://doi.org/10.1101/2021.11.18.469113.

  77. Fearnley LG, Bennett MF, Bahlo M. Detection of repeat expansions in large next generation DNA and RNA sequencing data without alignment. Sci Rep. 2022;12(1):13124. https://doi.org/10.1038/s41598-022-17267-z. PMID: 35907931; PMCID: PMC9338934.

  78. Cortese A, Simone R, Sullivan R, Vandrovcova J, Tariq H, Yau WY, Humphrey J, Jaunmuktane Z, Sivakumar P, Polke J, Ilyas M, Tribollet E, Tomaselli PJ, Devigili G, Callegari I, Versino M, Salpietro V, Efthymiou S, Kaski D, Wood NW, Andrade NS, Buglo E, Rebelo A, Rossor AM, Bronstein A, Fratta P, Marques WJ, Züchner S, Reilly MM, Houlden H. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019;51(4):649–58. https://doi.org/10.1038/s41588-019-0372-4.

  79. Rafehi H, Szmulewicz DJ, Bennett MF, Sobreira NLM, Pope K, Smith KR, Gillies G, Diakumis P, Dolzhenko E, Eberle MA, Barcina MG, Breen DP, Chancellor AM, Cremer PD, Delatycki MB, Fogel BL, Hackett A, Halmagyi GM, Kapetanovic S, Lang A, Mossman S, Mu W, Patrikios P, Perlman SL, Rosemergy I, Storey E, Watson SRD, Wilson MA, Zee DS, Valle D, Amor DJ, Bahlo M, Lockhart PJ. Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am J Hum Genet. 2019;105(1):151–65. https://doi.org/10.1016/j.ajhg.2019.05.016. Epub 2019 Jun 20. PMID: 31230722; PMCID: PMC6612533.

  80. Currò R, Salvalaggio A, Tozza S, Gemelli C, Dominik N, Galassi Deforie V, Magrinelli F, Castellani F, Vegezzi E, Businaro P, Callegari I, Pichiecchio A, Cosentino G, Alfonsi E, Marchioni E, Colnaghi S, Gana S, Valente EM, Tassorelli C, Efthymiou S, Facchini S, Carr A, Laura M, Rossor AM, Manji H, Lunn MP, Pegoraro E, Santoro L, Grandis M, Bellone E, Beauchamp NJ, Hadjivassiliou M, Kaski D, Bronstein AM, Houlden H, Reilly MM, Mandich P, Schenone A, Manganelli F, Briani C, Cortese A. RFC1 expansions are a common cause of idiopathic sensory neuropathy. Brain. 2021;144(5):1542–50. https://doi.org/10.1093/brain/awab072. PMID: 33969391; PMCID: PMC8262986.

Acknowledgements

We thank Michael Guo from the University of Pennsylvania for assistance generating data to allow comparison of LUSTR with other currently available STR callers. We thank the Undiagnosed Diseases Network for providing sequenced libraries and the information required to perform blinded whole genome screening. The full list of Undiagnosed Diseases Network members can be found in Additional file 8.

Undiagnosed Diseases Network

Maria T. Acosta, Margaret Adam, David R. Adams, Raquel L. Alvarez, Justin Alvey, Laura Amendola, Ashley Andrews, Euan A. Ashley, Carlos A. Bacino, Guney Bademci, Ashok Balasubramanyam, Dustin Baldridge, Jim Bale, Michael Bamshad, Deborah Barbouth, Pinar Bayrak-Toydemir, Anita Beck, Alan H. Beggs, Edward Behrens, Gill Bejerano, Hugo J. Bellen, Jimmy Bennett, Beverly Berg-Rood, Jonathan A. Bernstein, Gerard T. Berry, Anna Bican, Stephanie Bivona, Elizabeth Blue, John Bohnsack, Devon Bonner, Lorenzo Botto, Brenna Boyd, Lauren C. Briere, Gabrielle Brown, Elizabeth A. Burke, Lindsay C. Burrage, Manish J. Butte, Peter Byers, William E. Byrd, John Carey, Olveen Carrasquillo, Thomas Cassini, Ta Chen Peter Chang, Sirisak Chanprasert, Hsiao-Tuan Chao, Ivan Chinn, Gary D. Clark, Terra R. Coakley, Laurel A. Cobban, Joy D. Cogan, Matthew Coggins, F. Sessions Cole, Heather A. Colley, Heidi Cope, Rosario Corona, William J. Craigen, Andrew B. Crouse, Michael Cunningham, Precilla D’Souza, Hongzheng Dai, Surendra Dasari, Joie Davis, Jyoti G. Dayal, Esteban C. Dell’Angelica, Patricia Dickson, Katrina Dipple, Daniel Doherty, Naghmeh Dorrani, Argenia L. Doss, Emilie D. Douine, Dawn Earl, David J. Eckstein, Lisa T. Emrick, Christine M. Eng, Marni Falk, Elizabeth L. Fieg, Paul G. Fisher, Brent L. Fogel, Irman Forghani, William A. Gahl, Ian Glass, Bernadette Gochuico, Page C. Goddard, Rena A. Godfrey, Katie Golden-Grant, Alana Grajewski, Don Hadley, Sihoun Hahn, Meghan C. Halley, Rizwan Hamid, Kelly Hassey, Nichole Hayes, Frances High, Anne Hing, Fuki M. Hisama, Ingrid A. Holm, Jason Hom, Martha Horike-Pyne, Alden Huang, Sarah Hutchison, Wendy Introne, Rosario Isasi, Kosuke Izumi, Fariha Jamal, Gail P. Jarvik, Jeffrey Jarvik, Suman Jayadev, Orpa Jean-Marie, Vaidehi Jobanputra, Lefkothea Karaviti, Shamika Ketkar, Dana Kiley, Gonench Kilich, Shilpa N. Kobren, Isaac S. Kohane, Jennefer N. Kohler, Susan Korrick, Mary Kozuira, Deborah Krakow, Donna M. Krasnewich, Elijah Kravets, Seema R. Lalani, Byron Lam, Christina Lam, Brendan C. Lanpher, Ian R. Lanza, Kimberly LeBlanc, Brendan H. Lee, Roy Levitt, Richard A. Lewis, Pengfei Liu, Xue Zhong Liu, Nicola Longo, Sandra K. Loo, Joseph Loscalzo, Richard L. Maas, Ellen F. Macnamara, Calum A. MacRae, Valerie V. Maduro, AudreyStephannie Maghiro, Rachel Mahoney, May Christine V. Malicdan, Laura A. Mamounas, Teri A. Manolio, Rong Mao, Kenneth Maravilla, Ronit Marom, Gabor Marth, Beth A. Martin, Martin G. Martin, Julian A. Martínez-Agosto, Shruti Marwaha, Jacob McCauley, Allyn McConkie-Rosell, Alexa T. McCray, Elisabeth McGee, Heather Mefford, J. Lawrence Merritt, Matthew Might, Ghayda Mirzaa, Eva Morava, Paolo Moretti, John Mulvihill, Mariko Nakano-Okuno, Stanley F. Nelson, John H. Newman, Sarah K. Nicholas, Deborah Nickerson, Shirley Nieves-Rodriguez, Donna Novacic, Devin Oglesbee, James P. Orengo, Laura Pace, Stephen Pak, J. Carl Pallais, Christina G.S. Palmer, Jeanette C. Papp, Neil H. Parker, John A. Phillips III, Jennifer E. Posey, Lorraine Potocki, Barbara N. Pusey Swerdzewski, Aaron Quinlan, Deepak A. Rao, Anna Raper, Wendy Raskind, Genecee Renteria, Chloe M. Reuter, Lynette Rives, Amy K. Robertson, Lance H. Rodan, Jill A. Rosenfeld, Natalie Rosenwasser, Francis Rossignol, Maura Ruzhnikov, Ralph Sacco, Jacinda B. Sampson, Mario Saporta, Judy Schaechter, Timothy Schedl, Kelly Schoch, Daryl A. Scott, C. Ron Scott, Elaine Seto, Vandana Shashi, Jimann Shin, Edwin K. Silverman, Janet S. Sinsheimer, Kathy Sisco, Edward C. Smith, Kevin S. Smith, Lilianna Solnica-Krezel, Ben Solomon, Rebecca C. Spillmann, Joan M. Stoler, Kathleen Sullivan, Jennifer A. Sullivan, Angela Sun, Shirley Sutton, David A. Sweetser, Virginia Sybert, Holly K. Tabor, Queenie K.-G. Tan, Amelia L. M. Tan, Arjun Tarakad, Mustafa Tekin, Fred Telischi, Willa Thorson, Cynthia J. Tifft, Camilo Toro, Alyssa A. Tran, Rachel A. Ungar, Tiina K. Urv, Adeline Vanderver, Matt Velinder, Dave Viskochil, Tiphanie P. Vogel, Colleen E. Wahl, Melissa Walker, Stephanie Wallace, Nicole M. Walley, Jennifer Wambach, Jijun Wan, Lee-kai Wang, Michael F. Wangler, Patricia A. Ward, Daniel Wegner, Monika Weisz Hubshman, Mark Wener, Tara Wenger, Monte Westerfield, Matthew T. Wheeler, Jordan Whitlock, Lynne A. Wolfe, Kim Worley, Changrui Xiao, Shinya Yamamoto, John Yang, Zhe Zhang, Stephan Zuchner

Funding

This work was funded by R01-NS094596.

Author information

Contributions

JL wrote the LUSTR scripts, performed the tests on simulated and real datasets, and analyzed and interpreted the results of the simulations and analyses. CT, DRA and the UDN provided the genomic sequence data from individuals with genetic diseases and collaborated in the blinded whole genome analyses. WL, YL and BV collaborated on the optimization and application of LUSTR. CAMM and MBH were involved in the design and building of the LUSTR scripts. ELH oversaw all aspects of the work and wrote the manuscript with JL. All authors read, edited, and approved the final manuscript.

Corresponding authors

Correspondence to Jinfeng Lu or Erin L. Heinzen.

Ethics declarations

Ethics approval and consent to participate

Genomic sequence data used in this study were either publicly available or obtained from individuals who consented, under the guidance of local institutional review boards, to the use of their de-identified data for developing analytical tools to analyze and interpret genomic data.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Figure 1.

Structure of C9orf72 STR. We show here the reference sequence surrounding an STR within C9orf72 as a typical example of the complexities of STR structure. This STR has been reported to be associated with amyotrophic lateral sclerosis (ALS) and contains GGCCCC repeats. It is located on chromosome 9, and the genomic location (build 37) is shown in the figure. The approximate boundaries between the repeat and flanking regions are indicated. This figure shows how allowing incomplete repeats and tolerating repeat mismatches can greatly influence how one defines the repeat region that will be interrogated in the downstream models to infer genotype. *Note that the algorithm is agnostic to strand. For this C9orf72 STR, inputting CCGGGG from the reverse strand will be treated as equivalent to indicating CCCCGG from the forward strand.
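The strand equivalence noted above can be illustrated with a short Python sketch. This is not LUSTR code (LUSTR is distributed as Perl scripts); the function names are hypothetical, and treating rotations of a repeat unit as equivalent is an assumption added here for illustration, since the caption only states that the algorithm is agnostic to strand.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def canonical_unit(unit: str) -> str:
    """Pick one representative for a repeat unit.

    All rotations of the unit and of its reverse complement are generated,
    and the alphabetically smallest string is returned. Rotation handling is
    an assumption of this sketch, not a documented LUSTR behavior.
    """
    unit = unit.upper()
    rc = reverse_complement(unit)
    candidates = [s[i:] + s[:i] for s in (unit, rc) for i in range(len(s))]
    return min(candidates)


# The reverse-strand unit CCGGGG and the forward-strand unit CCCCGG
# collapse to the same representative.
print(canonical_unit("CCGGGG"))  # CCCCGG
print(canonical_unit("CCCCGG"))  # CCCCGG
```

Under this convention, GGCCCC (the unit used in the caption) maps to the same representative because it is a rotation of CCCCGG.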

Additional file 2: Supplementary Figure 2.

Determination of the repeat sequence of the C9orf72 STR by LUSTR applying the periodic Smith-Waterman algorithm. We show here, as an example, how the LUSTR finder module determines the repeat sequence of the C9orf72 STR by applying the periodic Smith-Waterman algorithm, searching for GGCCCC repetitive sequences using the default settings: match/mismatch/gap/stop = 2/-5/-7/-30. Starting from the seed sequence (two GGCCCC repeats, highlighted in yellow), the finder module aligns the reference periodically to GGCCCC in both the upstream and downstream directions and records the best score at each nucleotide. Scores above 0 are reset to 0, and routines with a score below the stop limit are blocked from further extension. In this case, the extension stops when the best score falls below -30 (highlighted in orange), and the repeat sequence is bounded by the farthest nucleotides with a score of 0 (highlighted in green).
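The extension rule described in this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the LUSTR finder module itself: it extends in the downstream direction only, the function name `extend_downstream` and the dynamic-programming layout (one score per position in the repeat unit) are hypothetical, and only the scoring values 2/-5/-7/-30 are taken from the caption.

```python
NEG_INF = float("-inf")


def extend_downstream(ref, start, unit, match=2, mismatch=-5, gap=-7, stop=-30):
    """Extend a repeat downstream of a seed that ends just before ref[start].

    Returns the index one past the farthest base whose best score is 0.
    """
    k = len(unit)
    # prev[p]: best score of a path whose last aligned unit position is p.
    prev = [NEG_INF] * k
    prev[k - 1] = 0  # the seed ended on the last base of the unit
    end = start
    for i in range(start, len(ref)):
        cur = []
        for p in range(k):
            # Reference base aligned to unit[p] (match or mismatch).
            sub = prev[p - 1] + (match if ref[i] == unit[p] else mismatch)
            # Reference base treated as an insertion relative to the unit.
            ins = prev[p] + gap
            cur.append(min(0, max(sub, ins)))  # scores above 0 reset to 0
        # Unit base skipped (deletion in the reference); two passes cover
        # the wrap-around from the last unit position back to the first.
        for _ in range(2):
            for p in range(k):
                cur[p] = max(cur[p], cur[p - 1] + gap)
        # Paths below the stop limit are blocked from further extension.
        cur = [s if s >= stop else NEG_INF for s in cur]
        best = max(cur)
        if best == NEG_INF:
            break
        if best == 0:
            end = i + 1
        prev = cur
    return end


perfect = "GGCCCC" * 5 + "T" * 10
print(extend_downstream(perfect, 12, "GGCCCC"))  # 30

one_mismatch = "GGCCCC" * 2 + "GGCCTC" + "GGCCCC" * 3 + "T" * 10
print(extend_downstream(one_mismatch, 12, "GGCCCC"))  # 36
```

In the second example the single mismatch costs 5, and three subsequent matches bring the score back to 0, so the extension continues through the interruption and stops only in the T-rich flank.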

Additional file 3: Supplementary Figure 3.

Average read coverage of 13 STR loci in GIAB trios. Average read coverage in the GIAB trio libraries for the 13 STR loci tested in this study. Reads from each individual or merged library were first mapped to the whole human genome with bwa mem. The coverage of each nucleotide within each STR locus region (the repeat region plus 50 bp of flanking sequence on each side) was calculated with samtools depth, and the average coverage of each STR locus was then computed. STRs with failed or allele-missing calls in certain libraries are indicated in red.
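The averaging step can be sketched in Python as below. This is an illustrative sketch rather than the authors' script: the function name and the example coordinates are hypothetical, and the only assumption about the input is the standard three-column, tab-separated output of samtools depth (chromosome, 1-based position, depth).

```python
def mean_coverage(depth_lines, chrom, repeat_start, repeat_end, flank=50):
    """Average per-base depth over a repeat region plus flanks on each side.

    depth_lines: iterable of "chrom<TAB>pos<TAB>depth" strings.
    Positions absent from the input count as zero depth, because the sum is
    divided by the full region length.
    """
    lo = max(1, repeat_start - flank)
    hi = repeat_end + flank
    total = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        c, pos, depth = fields[0], int(fields[1]), int(fields[2])
        if c == chrom and lo <= pos <= hi:
            total += depth
    return total / (hi - lo + 1)


# Hypothetical depth records, for illustration only.
records = ["chr9\t100\t10", "chr9\t101\t20", "chr9\t500\t99"]
print(mean_coverage(records, "chr9", 100, 101, flank=0))  # 15.0
print(mean_coverage(records, "chr9", 100, 101, flank=1))  # 7.5
```

Dividing by the region length rather than by the number of reported positions matters because samtools depth omits zero-coverage positions unless it is run with the -a option.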

Additional file 4: Supplementary Figure 4.

Reads realigned to ATN1 and HTT STR loci from the son of GIAB Ashkenazim trio. Raw sequences of the reads realigned to the two loci were collected from the libraries sequenced for the Ashkenazim son. Gaps are indicated, and mismatched nucleotides are marked in red. Reads are categorized according to their repeat sizes. Interestingly, besides the dominant alleles, LUSTR identified one read directly supporting the -5 allele at ATN1 STR locus, and one read directly supporting the -9 allele at HTT STR locus. These reads might indicate potential small fraction somatic STR variants, but further confirmation is needed to exclude the possibility of random sequencing error.

Additional file 5: Supplementary Figure 5.

Reads supporting the STR alleles called by LUSTR but not revealed in GIAB database. Raw sequences of the reads realigned to (a) ATXN3 STR locus in father and son from the Ashkenazim trio, (b) DMPK STR locus in mother and son from the Ashkenazim trio, (c) DMPK STR locus in mother from the Chinese trio, and (d) PPP2R2B STR locus in father and mother from the Chinese trio. Gaps are indicated, and mismatched nucleotides are marked in red. Reads are categorized according to their repeat sizes.

Additional file 6: Supplementary Figure 6.

Potential inheritance of RFC1 STR alleles in the families of UDN subject 1 and 2. The genotypes of RFC1 STR alleles identified by LUSTR are shown for the pedigrees of UDN families of subject 1 and subject 2, for whom nuclear family members were available. The reference RFC1 STR allele (AAAAG wt, marked in blue) has two mutant types, AAAAG expansion (marked in orange and not known to be associated with disease) and AAGGG expansion (marked in red). The alleles were confirmed by checking the raw reads in sequenced libraries.

Additional file 7: Supplementary Table 1.

a Performance of LUSTR in identification of STR variants in GIAB database (Ashkenazim Trio). b Performance of LUSTR in identification of STR variants in GIAB database (Chinese Trio).

Supplementary Table 2. a Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 2. b Evaluation of candidate STR expansions by LUSTR unbiased whole genome scan for subject 3.

Supplementary Table 3. a RFC1 expansion calls by LUSTR with alternative references for subject 2. b RFC1 expansion calls by LUSTR with alternative references for subject 3.

Supplementary Table 4. Comparison among LUSTR, ExpansionHunter, and GangSTR.

Additional file 8.

Full list of Undiagnosed Diseases Network members.

Additional file 9.

A .zip file including the following LUSTR scripts and files, which can also be downloaded from https://github.com/JLuGithub/LUSTR: LUSTR_Finder.pl, LUSTR_RefCreator.pl, LUSTR_Extractor.pl, LUSTR_Realigner.pl, LUSTR_Caller.pl, README.txt, README.md, README_detail.txt, QuickGuide.txt, LICENSE.txt, testdata/test_genome_hg19_chr9_27571483_27575544.fa, testdata/test_pairendreads_C9orf72_ref70exp30.fastq, testdata/test_STRinfo.txt.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lu, J., Toro, C., Adams, D.R. et al. LUSTR: a new customizable tool for calling genome-wide germline and somatic short tandem repeat variants. BMC Genomics 25, 115 (2024). https://doi.org/10.1186/s12864-023-09935-9

  • DOI: https://doi.org/10.1186/s12864-023-09935-9

Keywords