Duplications and Retrogenes Are Numerous and Widespread in Modern Canine Genomic Assemblies

Abstract Recent years have seen a dramatic increase in the number of canine genome assemblies available. Duplications are an important source of evolutionary novelty and are also prone to misassembly. We explored the duplication content of nine canine genome assemblies using both genome self-alignment and read-depth approaches. We find that 8.58% of the genome is duplicated in the canFam4 assembly, derived from the German Shepherd Dog Mischka, including 90.15% of unplaced contigs. Highlighting the continued difficulty in properly assembling duplications, less than half of read-depth and assembly alignment duplications overlap, but the mCanLor1.2 Greenland wolf assembly shows greater concordance. Further study shows the presence of multiple segments that have alignments to four or more duplicate copies. These high-recurrence duplications correspond to gene retrocopies. We identified 3,892 candidate retrocopies from 1,316 parental genes in the canFam4 assembly and find that ∼8.82% of duplicated base pairs involve a retrocopy, confirming this mechanism as a major driver of gene duplication in canines. Similar patterns are found across eight other recent canine genome assemblies, with metrics supporting a greater quality of the PacBio HiFi mCanLor1.2 assembly. Comparison between the wolf and other canine assemblies found that 92% of retrocopy insertions are shared between assemblies. By calculating the number of generations since genome divergence, we estimate that new retrocopy insertions appear, on average, in 1 out of 3,514 births. Our analyses illustrate the impact of retrogene formation on canine genomes and highlight the variable representation of duplicated sequences among recently completed canine assemblies.


Introduction
Gene duplication is a major driver of evolution, but correctly resolving duplications in genome assemblies remains challenging.Gene duplication is a form of structural variation, which refers to chromosomal alterations that are typically larger than 1,000 bases (1 Kb) in size, and includes other changes such as insertions, deletions, inversions, and other copy number changes (Freeman et al. 2006).Copy number variants (CNVs) that affect genes are a key mechanism in the evolution of novel gene functions and the emergence of gene families.Duplicated genes are associated with a number of core biological processes, including development, transcriptional regulation, and environmental response (Conrad and Antonarakis 2007).
Gene duplications arise through two means: the reverse transcription and integration of RNA transcripts, or the duplication of DNA segments.Duplications on the DNA-level can arise via whole genome duplications, in which the entire genome is replicated.Whole genome duplications have occurred multiple times in evolutionary history and have contributed to increased genome complexity in vertebrates (Dehal and Boore 2005).Alternatively, DNA-level duplications may be smaller and affect only a segment of the genome.These segmental duplications may encompass genes, giving rise to additional copies of genes and the expansion of related gene families.Segmental duplications are typically defined as nonrepetitive sequences longer than 1 Kb that are found in more than one copy in a haploid genome (Abdullaev et al. 2021).This definition differentiates duplications from common repeats, such as Long INterspersed Element (LINEs) and Short INterspersed Element (SINEs), which proliferate throughout genomes through distinct mechanisms.
Segmental duplications are further divided into three types based on the relative locations of duplicates: tandem duplications, in which duplications appear adjacent to one another; intra-chromosomal dispersed duplications, which is when duplications occur on the same chromosome but with an intervening sequence; and inter-chromosomal dispersed, when duplication copies occur on separate chromosomes.The intervening sequence between related duplicates can include genes and is susceptible to additional duplication or deletion (Rody et al. 2017).DNA duplications likely first arise as a result of replication-based errors including break-induced replication and template switching mechanisms (Payen et al. 2008).Once formed, additional copies can be gained or lost due to nonallelic homologous recombination (Du et al. 2012).No matter their type, when these copies differ in count between individuals or populations, they are referred to as CNVs.
Gene duplications may also be caused by a retrotransposition-associated mechanism, in which mRNAs are reverse transcribed and inserted into the genome at new locations (Esnault et al. 2000).The resulting duplications, known as retrocopies, lack introns and show the hallmarks of retrotransposition, including a poly(A) tail and flanking target-site duplications (Kaessmann et al. 2009).While retrocopies refer to any derivation of a gene transcript from retrotransposition, whereas the term retrogene implies functionality, both terms will be used interchangeably within this study (Casola and Betran 2017).The retrocopies may be full-length or truncated at their 5′ end.Over time, mutations in retrocopies may accumulate that disrupt gene function creating pseudogenes (Tutar 2012).However, retrogenes may be functional; several have been documented to have associations with important biological processes, including immune function, metabolism, environmental response, physical appearance, and disease, in both humans and canines (Brown et al. 2017;Casola and Betran 2017).Gene conversion and changes in pseudogene sequence can also give rise to actual and novel function, altering the name from pseudogenes into functional retrogenes.Retrocopies commonly occur in dogs (Batcher et al. 2022;Meadows et al. 2023) and confound the interpretation of genetic variation (Bianchi et al. 2023).
Novel developments in sequencing technology have allowed for the rapid and thorough analysis of thousands of samples.Despite this, systematically identifying and analyzing duplications remains challenging.To investigate CNVs in samples, the most common method of analysis is to look for an increase in read-depth, typically obtained from whole genome sequencing, relative to a reference genome assembly.Duplications can also be found by aligning a genome reference to itself to create a genome selfalignment that identifies sequences present multiple times (Išeric´ et al. 2022).When constructing such genome selfalignments, it is critical to first filter out common repeats since the proliferation of repeats overwhelms the alignment process.
There are several challenges associated with properly assembling duplicated regions.Assembly errors may lead to collapsed duplications in which duplicated sequences are only represented once in an assembly (Salzberg and Yorke 2005).This commonly occurs at the end of assembled contigs due to a failure to correctly assemble sequences spanning the duplication.Such assembly errors can also give rise to chimeric contigs that have been incorrectly assembled as a result of misjoining nonadjacent segments (Pan and Lonardi 2019).False duplications can also arise due to misassembly.This can occur when contigs created from divergent alleles are mistakenly represented as being duplicated (Alkan et al. 2011;Hartasanchez et al. 2018;Ko et al. 2022).Correctly assembling duplications is further complicated by the diploid nature of most analyzed genomes.As a result of challenges in correctly assembling duplicated sequences, the duplication content of a genome assembly may not be an accurate representation of the true duplication content of the analyzed genome.Assemblies derived from short sequencing reads are plagued by incorrect representation of duplicates (Alkan et al. 2011).Longer sequencing reads coupled with new methodologies for haplotype-resolved assembly (Koren et al. 2018;Nurk et al. 2022) promise to alleviate these concerns resulting in a more accurate representation of genome content and structure.
The canine remains a powerful system for genetics and genomics.The domestic dog is in the unique position of having coevolved alongside humans, while simultaneously experiencing human-driven extreme artificial selection for over 10,000 years (Frantz et al. 2016).The first canine reference genome was released in 2005, derived from Tasha the Boxer (Lindblad-Toh et al. 2005).This was further refined into canFam3.1,released in 2014, also based on Tasha (Hoeppner et al. 2014).The canine reference genome was assessed for copy number variation across multiple studies.Initial studies estimated that ∼4.21% of the canine reference genome is composed of recent segmental duplication, including 841 genes in predicted segmental duplications (Nicholas et al. 2009).Array-CGH data showed that many duplications were copy number polymorphic (Chen et al. 2009).Additional studies of canine CNVs have linked these variations to conditions such as anterior cruciate ligament rupture (Binversie et al. 2020) and breedspecific morphological structure (Serres-Armero et al. 2021).Other studies have concluded that CNVs likely did not have major effects on canid domestication, with suggestions that dogs and wolves had similar proportions of CNVs loci (Serres-Armero et al. 2017;Pendleton et al. 2018).
Since the release of canFam3.1, multiple assemblies of dogs and wolves have been generated using long-read technology (Field et al. 2020;Edwards et al. 2021;Halo et al. 2021;Jagannathan et al. 2021;Player et al. 2021;Sinding et al. 2021;Wang et al. 2021;Field et al. 2022).These new assemblies have helped elucidate novel genes, regulatory elements, and variants, while revealing new aspects of canine genome biology.These assemblies have been created using a variety of sequencing technologies [Oxford Nanopore (ONT) and PacBio] and algorithms.However, limited analyses of the duplication content of these assemblies have been performed, with most descriptions focused on specific loci, such as the amylase region, or other regions of suspected misassembly (Arendt et al. 2014).To address the lack of analysis of duplications in these long-read genomes, we aim to characterize the duplication content of different recently published long-read canine genomes and assess patterns of gene duplication found throughout canines.We apply two computational analyses to nine recently released canine genome assemblies: genome assembly self-alignment, which involves the mapping of an assembly to itself in order to find matching duplicated segments (Išeric´ et al. 2022) and read-depth analysis, which searches for regions of unusually high coverage in Illumina sequencing data (Pendleton et al. 2018;Shen and Kidd 2020).We further perform a search for retrocopy insertions in each genome assembly.We find notable and large regions of duplicated sequence that are not properly represented in existing dog assemblies, describe polymorphic duplications that differ between canine assemblies, and identify retrocopies as a major driver of gene duplications in canine genomes.

Results
Duplications Identified by Self-alignment of the Mischka Genome Include Segments With Multiple Duplicated Copies The UU_Cfam_GSD_1.0 assembly (accession GCA_ 011100685.1),derived from a female German Shepherd Dog named Mischka, has an extensive annotation and serves as the reference genome used by the Dog10K consortium (Wang et al. 2019a;Wang et al. 2021;Meadows et al. 2023).This genome assembly was constructed using PacBio Sequel sequencing and consists of sequences assigned to 39 assembled chromosomes (chr1-38 and chrX), the mitochondria (chrM), and 2,158 unplaced contigs not localized to a chromosome.Due to the importance of this reference to ongoing research, we first performed a detailed analysis of duplications in this assembly.Using genome assembly self-alignment, we identified 491,661 pairwise duplicated segments of 1,000 bp or greater with at least 90% sequence identity in the Mischka assembly.Of these, 122,521 segments are located in the 38 canine autosomes or chromosome X, and 369,135 duplications are located on unplaced contig sequences.Of the duplicated segments on the assembled chromosomes, 50,629 have a paralog on an unplaced contig (41.03%).The largest single duplicated segment is nearly 1.8 Mb in length, located at chr8:74446738-76245570.
Many of the duplications detected on the assembled chromosomes are small; 46.62% (57,120/122,521) of duplicated segments are shorter than 2 Kb.Examination of the size distribution of duplicated segments smaller than 2 Kb revealed three notable peaks at approximately 1,250, 1,375, and 1,775 bp.Inspection of the duplication landscape on the University of California, Santa Cruz (UCSC) Genome Browser identified numerous highrecurrence duplications that contained many short alignments linking a locus with multiple paralogous segments (supplementary fig.S1, Supplementary Material online; Kent et al. 2002).We defined a high-recurrence segment as any area that contained at least four or more paralogous duplications smaller than 2.5 Kb and identified 1,492 highrecurrence duplications, totaling approximately 3.0234 Mb.Overall, this makes up 3.095% of duplicated base pairs located by self-alignment.The segment with the largest number of copies is found at chr1:100216700-100219200, with 159 intersecting duplications.These high-recurrence duplications appear to be distributed throughout the Mischka genome.
Because the same locus can be included in multiple duplications at varying levels of sequence similarity, we merged segments together, resulting in 6,040 duplicated intervals in the assembled chromosomes and 1,940 duplicated intervals in the unplaced contigs (supplementary fig.S2, Supplementary Material online, supplementary tables S1 and S5, Supplementary Material online).Duplications occupy 96.89 Mb, or about 4.12%, of the assembled chromosomes.In sharp contrast, about 90.15% (115.8Mb) of the sequence on unplaced contigs is duplicated.In total, including unplaced contigs, 8.58% of the Mischka assembly is duplicated according to genome assembly self-alignment.
Duplications are not distributed uniformly along canine autosomes.Of the 5,525 duplicated segments located on the autosomes, 193 (3.49%) overlap the first megabase of the beginning of the chromosome, while 172 (3.11%) are located within the last megabase of the autosomes.Relative to 1000 random permutations, this is a 2.5-fold increase for the first megabase (mean, 1.7%; max, 2.2%) and a 1.75-fold increase for the last megabase (mean, 1.6%; max, 2.2%).

Duplications Identified in Mischka Using Read-depth Are Not Uniformly Distributed
To complement duplications identified via genome assembly self-alignments, we identified duplications based on read-depth.Illumina sequencing reads derived from Mischka were aligned to the Mischka assembly and processed using fastCN (Pendleton et al. 2018).Repetitive sequences are masked in the assembly prior to alignment, and the read-depth profiles are used to estimate the absolute copy number in windows consisting of 1,000 unmasked positions.We define read-depth duplications as segments consisting of four or more consecutive windows with an estimated copy number exceeding 2.5 and with a minimum size of 10 Kb, resulting in 1,573 segmental duplications defined by read-depth (supplementary tables S2 and S6, Supplementary Material online).We find that ∼60% (30.6 Mb) of unplaced contigs that contain at least four windows are duplicated.In comparison, 56.21 Mb (1,033 segments) is duplicated on the assembled chromosomes, making up 2.39% of the 38 autosomes and X chromosomes.
The duplications identified by read-depth range in size from 10 Kb to 1.6 Mb and have a mean size of 48 Kb and a median of 21 Kb.We find that 57 (6.1%) duplications are present in the first megabase of the autosomes, while 36 (3.8%) are located in the last 1 Mb of each autosome.Relative to random permutations, we find a 3.19-fold increase in duplication content in the first megabase of autosomes (mean, 1.9%; max, 3.5%), and a 2.1-fold increase in the last megabase (mean, 1.8%; max, 3.2%).The median copy number of all the duplications was 3.62, and the mean was 4.85.The maximum copy number of a given duplication was 78.3, located at chr27:33386819-33407975.
Most chromosomes have between 1 and 5% of their sequence annotated as a duplication by read-depth; however, 8.85% of chromosome 26 is duplicated.Examination revealed that a 2.4 Mb region encompassing six segmental duplications, located at chr26:25,387,954-27,841,554, accounts for this high duplication percentage.This region has a median estimated copy number of 9.5.

Read-depth Analysis and Self-alignment Show Low Levels of Agreement
We compared the position of duplications identified by genome self-alignment and read-depth to search for concordance.To account for detection differences between the methods, we only considered duplications that were located on assembled chromosomes, greater than 15 Kb in size, have at least 95% identity, and were not located in the first megabase of an autosome.A total of 1,122 duplications identified by genome assembly self-alignment and 918 duplications identified by read-depth met these criteria (supplementary tables S1, S2, S5, and S6, Supplementary Material online).
We explored how many duplications in each set were found to be unique.By intersecting the self-alignment set and the read-depth set, we find that 528/1,122 selfalignment duplications (47.1%) and 504/918 read-depth duplications (54.9%) share any overlap.The union of these two sets contains approximately 77.78 Mb in total, 26.6 (34.2%) of which is duplicated in both sets.Selfalignment uniquely detects 29.88 Mb (38.41%) and readdepth uniquely detects 21.02 Mb (27.02%;Fig. 1b).Despite only 35% agreement between the two methods, both analyses show similar trends at a chromosomal level, including an increase in duplication count on chr26 (supplementary fig.S3, Supplementary Material online).

Duplication Content Varies Across Canine Assemblies
We expanded our self-alignment and read-depth analysis to eight additional recently published canine assemblies (Table 1, Fig. 1a) (Field et al. 2020;Edwards et al. 2021;Halo et al. 2021;Jagannathan et al. 2021;Player et al. 2021;Sinding et al. 2021;Wang et al. 2021;Field et al. 2022).Most assemblies used some method of PacBio sequencing, predominantly Sequel, but the China and Yella assemblies used ONT sequencing.We note that China and Wags are different Basenjis sequenced using different technologies by the same study.While the authors of that paper state that China is the better assembly, we exclude China from the readdepth analysis due to a lack of available Illumina data (Edwards et al. 2021).Notably, among the genomes is mCanLor1.2,a Gray wolf genome that serves as an outgroup to the other breed dog genomes (Sinding et al. 2021).
Canine assemblies have between 1.8% (Tasha) and 5.6% (mCanLor1.2) of their genome identified as duplicated by genome assembly self-alignment (Table 1).Assemblies with a low duplication content (below 2.5%) include Sandy, Tasha, and Zoey, whereas assemblies with a high percentage of duplications (above 4%) include Mischka, mCanLor1.2,Nala, Wags, and Yella.A similar divide in duplication content among assemblies remains after filtering for duplications above 15 Kb and increasing the sequence identity minimum to 95%.By analyzing readdepth, we see a slight change, where mCanLor1.2,Nala, and Yella have the highest fraction of their genome annotated as duplicated.The number and duplication status of unplaced contigs differs widely among the assemblies (supplementary table S1, Supplementary Material online).The highest duplicated fraction is found in Mischka (90%) and Zoey (70%), with the others having <50% of their unplaced contigs duplicated.
Additionally, we explored 13,817 protein-coding genes for their duplication status as determined by read-depth.Illumina data from each sample was aligned to the Mischka assembly; China and Nala are removed from this analysis, due to a lack of Illumina data and strong bias based on GC composition, respectively (supplementary fig.S4, Supplementary Material online).The copy number of each gene was then estimated using read-depth.We identify a total of 1,068 genes as duplicated in at least one canine; 222 of those genes are considered duplicated in all seven canines.Further analysis indicated that the most highly duplicated genes have no annotated function and are not associated with a named gene (supplementary table S7, Supplementary Material online).We performed a gene ontology (GO) enrichment analysis of the 423 genes identified as duplicated based on read-depth from Mischka.Enriched functional categories include "intermediate filament organization" (GO:0045109), "signal transduction" (GO:0007165), "positive regulation of cellular Duplications and Retrogenes in Modern Canine Genomic Assemblies process" (GO:0048522), "multicellular organismal process" (GO:0032501), and "negative regulation of cellular process" (GO:0048523) (supplementary table S9, Supplementary Material online) (Ashburner et al. 2000;Thomas et al. 2022;Gene Ontology Consortium et al. 2023).
To explore the impact of PacBio HiFi reads on representing duplications, we compared duplications identified by assembly self-alignment and read-depth in the mCanLor1.2assembly.Using the same filtering criteria as used in Mischka, we find that a total of 115.3 Mb is duplicated in the union of self-alignment and read-depth, a 1.5-fold increase over Mischka.We also find an increase in congruency between the two methods, as 60.07 Mb is duplicated in both the self-alignment set and the readdepth set, making up 52.11% of the total base pairs, a notable increase from the 34.2% overlap found in Mischka (Fig. 1c).

Retrocopies Proliferate Throughout Canine Genome Assemblies and Show Hallmarks of Retrotransposition
While exploring self-alignment duplications, we found that high-recurrence duplications in the Mischka assembly often corresponded to the exons of transcripts found in other species, suggesting the presence of gene retrocopies (Fig. 2a).To confirm this, we identified candidate retrogenes in the Mischka assembly using BLAT.From an initial set of 19,972 multiexon protein-coding genes, we identified 3,892 retrocopies from 1,316 parental genes, which includes 66 retrocopies that retained the full open reading frame of the parental gene (Table 2, supplementary table S4, Supplementary Material online).There are 3,658 (94%) of these retrocopies located on assembled chromosomes.We found an enrichment of retrocopies on chrX, with a 1.18-fold increase over expectation from 1,000 random permutations (supplementary fig.S5, Supplementary Material online).Additionally, retrocopies that retained the full open reading frame of their parental gene had much greater sequence identity with the parent gene (median, 99.92%) versus those that did not (median, 96.14%; P-value, 3.81*10 −48 [supplementary fig.S6, Supplementary Material online]).
By intersecting high-recurrence duplications with annotated retrocopies, we determine that 1,080/1,492 (72.39%) high-recurrence duplications segments overlap with a retrocopy; in turn, 1,078/3,658 (29.47%) retrocopies on assembled chromosomes intersect with highrecurrence duplications.On a base-pair level, retrocopies located in highly duplicated segments make up a total of 4.35 Mb.Thus, multiple pairwise alignments among retrocopies derived from the same parent gene account for the highly duplicated regions identified by genome self-alignment.
We additionally investigated how many retrocopies retained the signature hallmarks of retrotransposition, which are the presence of target-site duplications (TSDs) and poly(A) tails (Szak et al. 2002;Fig. 2b).Of the 3,496 retrocopies on assembled chromosomes in the Mischka genome that passed filtering, 3,442 (98.46%) show at least one of the two hallmarks (Fig. 2c).In total, 1,387 (39.67%)only have TSDs of at least 5 bp, 94 (2.69%)only have poly(A) tails, and 1,961 (56.09%) have both signatures.The mean length of poly(A) tails is 11.91 bases, and the mean length of TSDs is 11.19 bases, with 1082 having TSDs 10 bp or longer (Fig. 2d and e).All 66 retrocopies that retain the parental open reading frame (ORF) have TSDs, while 20 also have a poly(A) tail.
Another known hallmark of retrotransposition is the presence of a consensus LINE-1 endonuclease recognition sequence, commonly read as TTTT/AA, where "/" marks the cleavage site.We searched for the presence of the expected consensus sequence across retrocopies and recovered the expected endonuclease recognition sequence (Fig. 2f) (Feng et al. 1996;Jurka 1997;Morrish et al. 2002;Flasch et al. 2019).We find that approximately 40% of retrocopies annotated with a TSD 5 bp or longer contain the expected cut site sequence.
When analyzing only the retrocopy loci that are dimorphic among assemblies that passed filters (n = 212), thus permitting a more precise definition of event Listed in each row is the name of the sample used for each assembly, the sample type, the N 50 in megabases as reported by each publication, and the main sequencing approach used to construct the assembly as reported in each publication.Aside from Mischka/canFam4 at the top, the rest of the assemblies are ordered by their contig N 50 .boundaries based on comparison with the pre-integration empty site, we find a confirmed TSD of at least 5 bp or greater, with an average TSD length of 15.1 bp.Additionally, loci that possessed a confirmed TSD also possessed an endonuclease cleavage site (supplementary fig.S7, Supplementary Material online).
We expanded the retrocopy analysis to all other canine assemblies, using protein-coding genes converted from Mischka coordinates as queries.Retrocopy counts are broadly similar across assemblies (Table 2, supplementary table S8, Supplementary Material online).Mischka's slightly greater number of retrocopies appears to be accounted for by the presence of unplaced contigs, many of which correspond to duplicated sequences (supplementary table S1, Supplementary Material online).Similar trends of identified TSDs and poly(A) tails are observed across the other assemblies, with over 98% of all identified retrocopies showing at least one hallmark (supplementary table S3, Supplementary Material online).When viewing retrocopies with stricter hallmark calls, based on 10 bp TSDs, poly(A)s, and an adjacency requirement that the TSD and poly(A) be less than 5 bp separated, approximately 60% of retrocopies retain evidence of any hallmark (supplementary fig.S9, Supplementary Material online).We also specifically analyzed retrocopies that retain the parental ORF for the presence of hallmarks (supplementary table S4, Supplementary Material online).Over 96% of fullORF retrocopies show at least one hallmark, usually the presence of a target-site duplication.
The top ten retrocopy producing parent genes were determined for each assembly (Fig. 3).There were only 11 parent genes present across the top ten lists of all assemblies, highlighting a high level of shared parental genes across the assemblies.Many of the top annotated genes are previously known retrocopy producers in human and/or canine (GAPDH, HNRNPA1, EEF1A1, HMGN2, and the RPL genes [Marquez et al. 2003;Breen 2008;Liu et al. 2009;Gao et al. 2019;Ciomborowska-Basheer et al. 2021]).One gene, BTF3, was unique to the Labrador Retriever genome, Yella, but retrocopies from this gene are present at slightly decreased frequency in the other canine assemblies.Additionally, there were several instances where we could not determine a unique progenitor among related parental genes.Most notable among this group are LOC100686340 and LOC100687749, which are present in the top ten retrocopy producer lists of all canine assemblies except Yella and Wags, the latter of which only has LOC100686340 annotated.
To explore the function of common retrocopy progenitors, we performed a GO enrichment analysis of the most common parent genes in the Mischka assembly.There is one significant enriched GO category among the 50 parent genes in the Mischka assembly that had at least 10 retrocopies, "cytoplasmic translation" (GO:0002181), which likely derives from the high number of ribosomal genes (RPL and RPS).There are 5 RPL genes in the top 10 parent genes, and 15 RPL or RPS genes in the top 50.We expanded this analysis to the 117 genes that had at least 5 retrocopies, which identified several enriched categories, including "ribosomal small subunit assembly" (GO:0000028) and "cytoplasmic translation" (GO:0002181) (supplementary table S9, Supplementary Material online) (Ashburner et al. 2000;Thomas et al. 2022;Gene Ontology Consortium et al. 2023).
Finally, we filtered retrocopies for any overlap with segmental duplications identified by read-depth.We find that the Mischka assembly has a high prevalence of retrocopies present in segmental duplications, with 466 compared to the range of 62 to 263 (avg, 126) found in other assemblies.We note that 162/466 (34.7%) are present on unplaced contigs, further highlighting Mischka's duplication bias toward unplaced contigs.No other canine assembly has any retrocopies mapped to unplaced contigs, indicating that Mischka retrocopies are frequently present in segmental duplications that get assembled onto unplaced contigs, in line with the Column 2 gives the number of protein-coding genes considered as potential gene parents in each assembly.Retrocopies in column 3 include all chromosomes, including mitochondrial and unplaced contigs, and column 4 contains only the assembled chromosomes, chr1-38 + X.The total number of retrocopies and the number that retain the open reading frame of the parental gene are shown.
increased duplication rate present in the self-alignment data.

Estimating the Rate of Retrocopy Insertion
To identify the rate of retrocopy insertion, we compared five canine assemblies from diverse dogs (Mischka, Nala, China, Sandy, and Zoey) against mCanLor1.2.Only retrocopies located on the autosomal chromosomes that did not overlap a read-depth segmental duplication were considered.First, we assessed which retrocopies were shared across these six assemblies.After applying strict filtering, we found a total of 3,377 unique retrocopies across the six assemblies, 3,111 of which were located in all six canines (92.12%); 46 were unique to the mCanLor1.2assembly, and the largest categories tended to be singletons (supplementary fig.S8, Supplementary Material online, supplementary table S10, Supplementary Material online).Mischka and Nala, both German Shepherds, have the fewest singletons, reflecting their similarity relative to the other assemblies.
We also searched for the presence of retrocopies in other Canines.We searched for 3,658 retrocopies from Mischka in the fox (VulVul2.2) and dhole (GWHAAAC00000000) genome assemblies (Kukekova et al. 2018;Wang et al. 2019b).In total, 1923 retrocopies (52.2%) were found in the fox and/or the dhole genomes, 468 retrocopies (12.79%) were present in both the fox and the dhole, 549 were unique to the fox (15%), and 906 were unique to the dhole (24.76%) (supplementary table S11, Supplementary Material online).Consistent with their more recent origin, none of the 66 Mischka retrocopies with intact ORFs were found in either the fox or dhole assemblies.We also find that sequence similarity of retrocopies unique to Mischka (median, 96.75%) versus those found in either the fox or the dhole (median: 95.72%) tended to be slightly higher (P-value: 6.02*10 −44 ) (supplementary fig.S6, Supplementary Material online).
To estimate the rate of retrocopy insertions, we first focus on the Mischka genome.We estimated the number of generations since genomic divergence from the mCanLor1.2wolf genome.Excluding duplicated segments and short tandem repeats, there are 4,219,105 autosomal single nucleotide differences between the Mischka and mCanLor1.2genome assemblies.Using the wolf mutation rate of 4.5*10 −9 per generation (Koch et al. 2019), and with an analyzed genome size of 2,022,154,146 bp, we calculated that 231,827 generations have elapsed since genome divergence between Mischka and mCanLor1.2.By comparing the retrocopies between Mischka and mCanLor1.2,we identified 48 retrocopies that are unique to Mischka, and 79 that are unique to mCanlor1.2.Retrocopies that were present uniquely to Mischka had much greater sequence similarity to their parent gene (mean, 99.92%) versus those shared in the mCanLor1.2assembly (mean: 96.14%) (P-value, 9.396*10 −29 ) (supplementary fig.S6, Supplementary Material online).By dividing the average amount of unique retrocopies Retrocopies presented here are limited to the assembled chromosomes.The number of retrocopies from each parent is given in parentheses.The same genes are colored identically to one another to quickly determine the presence of reoccurring genes.Only 11 parent genes are represented across all nine assemblies, with retrocopies from BTF3 found in the other assemblies at slightly decreased frequencies that removed it from the top ten.
over the number of generations, we estimate that the rate of retrocopy insertion as 2.74*10 −4 per generation, or ∼1 in 3,650 births, a higher rate than expected.Insertion rates estimated by comparing multiple canine assemblies, including Mischka, to the mCanLor1.2wolf genome range from 2.74*10 −4 per generation to 3.1*10 −4 per generation, or from 1 in 3,226 to 1 in 3,650 (avg: 1/3,514 births; Table 3).When considering a calibration based on the lower-bound and higher-bound wolf single nucleotide variant (SNV) mutation rate, which was defined as a range of 2.6 × 10 −9 to 7.1 × 10 −9 (Koch et al. 2019), we estimate the range of novel insertion in Mischka is 1.58*10 −4 per generation to 4.32*10 −4 per generation (1/2,315 to 1/ 6,329 births).

Discussion
We analyzed the duplication content of nine recent canine assemblies using two methods: genome assembly selfalignment and read-depth analysis, identifying 1.74 to 7.05% of each genome as duplicated.We found that duplications were enriched in the first and last megabases of canine autosomes.Since canine autosomes are acrocentric (Breen 2008), this suggests an enrichment of duplications in pericentromeric and subtelomeric regions, as found in other species (Linardopoulou et al. 2005).
The assembly self-alignment and read-depth approaches identified a range of duplication content in each assembly, ranging from 17.27 to 109.78 Mb by self-alignment and 41.17 to 166.47 Mb by read-depth.Overlap analysis shows a generally poor level of agreement, with most duplicated base pairs only identified by one of the methods.Thus, each of the assemblies offers an incomplete representation of the true duplication content in each sample.The assemblies differ in how they represent partially resolved duplications.For example, in the Mischka assembly, unresolved duplications were often retained as separate unplaced contigs that were not assigned to a chromosomal location.Other assemblies report few or no duplications on unplaced contigs.This difference may have implications in downstream analyses that utilize these assemblies.
Our initial analysis of duplications in the Mischka assembly revealed the presence of many short, high-recurrence duplications.Further investigation showed that these segments corresponded to multiple related retrogene insertions.To explore this further, we annotated 1,263 to 1,316 retrocopies across all assemblies and found that a large majority displayed the hallmarks of retrotransposition such as poly(A) tails and target-site duplications.When observing retrocopies with stronger evidence of hallmarks (10 bp TSDs, poly(A)s, and adjacency), we see a reduced count of retrocopies with hallmark evidence, down to 60%.We also note 49 to 66 retrogenes per assembly appear to maintain intact parental ORFs.Whether these intact This table describes the five assemblies retained for retrocopy comparison against mCanLor1.2,the Greenland Wolf Assembly.Column 1: Assembly Name.Column 2 contains the total sum of retrocopies between the assembly and mCanLor1.2.Columns 3 and 4 show how many retrocopies are unique to the assembly and to mCanLor1.2,respectively.Column 5 shows the number of generations elapsed since divergence between the assembly in mCanLor1.retrogenes are expressed or have any functional impact requires further investigation.We also note that our definition of retrogene boundaries is based on the existing annotation of genes in canines.This annotation of untranslated regions (UTRs) in these gene models is often incomplete, resulting in an underestimation of the size of retrogene insertions (Fig. 2b), and underestimating the count of retrocopies with TSDs and/or poly(A) tails.We also find that 53% of retrocopies from Mischka are present in the dhole or fox assemblies, implying an origin prior to wolf speciation.However, it should be noted that the dhole and red fox assemblies are highly fragmented (n = 29,680 and 82,424 contigs, respectively), limiting the ability to interpret sharing among the two genomes (Kukekova et al. 2018;Wang et al. 2019b).
Additionally, in the Mischka assembly, we find that 466 retrocopies are located within larger segmental duplications with over half of those retrocopies being present on unplaced contigs.This suggests a pattern where a retrocopy inserts into a locus, which is then subsequently duplicated, perhaps repeatedly, throughout the genome.A similar pattern of insertion-duplication has been described for retrogenes that may have functional effects in elephants and horses (Sulak et al. 2016;Batcher et al. 2023).
We take advantage of retrocopy comparisons between dog and wolf assemblies to estimate an average retrogene insertion rate.Methods for estimating mutation rates from patterns of sequence differences can be divided into three broad categories (Segurel et al. 2014).First, rates can be estimated by sequencing trios and counting the number of new mutations observed in children that are absent in their parents.This estimate is the least affected by natural selection, as only lethal mutations should be unobserved in the offspring.However, a large number of individuals may need to be analyzed, particularly for mutation types that arise at a slow rate.This method is also extremely sensitive to false positive calls (in the children) and false negative calls (in the parents).Second, rates can be estimated from the distribution of variants identified in multiple individuals using population genetic models.The expected distribution of differences among individuals is a function of multiple population genetic parameters.If some parameters are known, an estimate of the mutation rate can be obtained.For example, if the effective population size is known, or can be estimated using other data, then the mutation rate can be estimated based on the distribution of segregating sites in the sample.Similarly, mutation rates can be estimated from differences between chromosome segments that are inherited identical-by-descent (IBD) if the number of generations separating the IBD segments can be determined.Related methods have been applied over both recent and ancient time scales (Lipson et al. 2015;Palamara et al. 2015;Narasimhan et al. 2017;Tian et al. 2019).However, these methods may require models of how population sizes change over time, are sensitive to missing rare variants, and can be affected by natural selection.The third class of estimates takes a phylogenetic approach and estimates the mutation rate based on the number of differences that have occurred along two sampled lineages.The divergence time between sampled lineages can be estimated from species separation times inferred from the fossil record.However, for closely related groups, much of the lineage divergence time occurs as coalescence in the ancestral population, making estimates of the ancestral population size critical when a fossil calibration is used.As in population-based estimates, an alternative is to calibrate phylogenetic estimates based on a sequence divergence time estimated using a previously known rate for a different type of mutation.
In this study, we utilize phylogenetic comparisons between dog and wolf genomes to estimate the rate of retrocopy formation.We calibrate these estimates using a single nucleotide mutation rate inferred from sequencing a wolf pedigree.This mutation rate, 4.5*10 −9 /bp/gen, implies a divergence between assembled dog genomes and the mCanLor1.2Greenland wolf of 231,827 generations, or 695,481 years, assuming 3 years per generation (Skoglund et al. 2015).This deep divergence may seem inconsistent with the domestication of dogs in the past 15,000 to 30,000 thousand years (Skoglund et al. 2015;Botigue et al. 2017;Bergstrom et al. 2020).Dogs are thought to have been domesticated from a now-extinct progenitor population of wolves somewhere in Asia and the ancestral wolf population may have had a large effective size.When correcting for differences in the assumed mutation rate, a previous Bayesian coalescent analysis of dog and wolf divergence estimated that the ancestral population size of new-world and old-world wolves was quite large, 143,000 individuals (Fan et al. 2016).This implies a mean coalescence of two lineages in this ancestral population of 286,000 generations, which, along with the population separation time, is broadly consistent with our phylogenetically inferred estimate.We note that these estimates of population size may be biased by the presence of admixture from coyotes in some of the new-world wolf samples used to create the estimate (Fan et al. 2016;Sinding et al. 2018) or by unmodeled population structure among ancestral wolves.
We note that our initial annotation of retrocopies in each assembly (Table 2) applies conservative filters of alignment quality to reduce false positive annotations.However, this induces false negative calls across assemblies.For the rate estimate, we corrected this by assessing the presence of retrocopies called in any assembly among all other assemblies through an alignment of the flanking sequences (supplementary fig.S8, Supplementary Material online).Although our estimate of 1/3,514 births is affected by several assumptions including the absence of selection and of Duplications and Retrogenes in Modern Canine Genomic Assemblies GBE Genome Biol.Evol.16(7) https://doi.org/10.1093/gbe/evae142Advance Access publication 1 July 2024 a uniform rate, the estimate of the SNV mutation rate is likely the largest single source of uncertainty in our estimate.Utilizing the reported confidence interval of the SNP mutation rate (2.6*10 −9 to 7.1*10 −9 ) implies a retrogene insertion rate of between 1/2,226 and 1/6,075 births.
The inferred retrocopy insertion rate in canines is substantially higher than that found in humans; between humans and chimps, one new gene retrocopy insertion was estimated for every 6,000 births, based on the 1,000 Genomes Project (Ewing et al. 2013).However, this estimate is calibrated based on an assumed human effective population size of 10,000 which is largely consistent with an assumed human mutation rate of ∼2.5*10 −8 bp/gen (Nachman and Crowell 2000).Adjusting this for the human mutation rate obtained from pedigree sequencing, ∼1.3*10 −8 bp/gen (Jonsson et al. 2017) decreases the estimated rate of retrocopy insertion in humans to ∼1/11,500 births or about three times lower than the canine estimate.However, it is noted that long-read data enables a higher retrocopy insertion detection rate (Feng and Li 2021) so this may be an underestimate of the true human insertion rate.Other observations are consistent with a high rate of retrogene formation in dogs.A recent survey suggests that polymorphic retrogene insertions are 13-fold more common in dogs than in humans (Batcher et al. 2022).The most well-studied retrogenes in dogs are derived from the FGF4 gene which cause the short-leg phenotype found in some breeds such as corgis and beagles (Brown et al. 2017).FGF4 retrogenes have occurred multiple times independently, with seven independent copies identified to date (Batcher et al. 2020;Batcher et al. 2022).Additionally, dogs have a high rate of SINE polymorphism (Wang and Kirkness 2005;Halo et al. 2021).Both SINE and retrogene insertions require the activity of LINE-1 elements.In contrast to SINEs, which do not encode any proteins, LINE-1 s are autonomous retroelements that encode proteins required for their own retrotransposition (Beck et al. 2011).LINE-1 encoded proteins predominantly mobilize the RNA that encodes them through a process known as cispreference (Wei et al. 2001).However, one or more LINE-1 encoded proteins can rarely act to mobilize other RNAs such as SINEs or cellular mRNAs (Dewannieux et al. 2003).We speculate that the high rate of SINE and retrogene insertion found in dogs may result from a shared mechanism that involves a reduced level of cis-preference for the proteins encoded by the canine LINE-1 element.

Assemblies Considered
Our analysis involves nine published long-read canine assemblies (Field et al. 2020;Edwards et al. 2021;Halo et al. 2021;Jagannathan et al. 2021;Player et al. 2021;Sinding et al. 2021;Wang et al. 2021;Field et al. 2022).For simplicity, we refer to them by name.Assembly accessions and other details are provided in Table 1.

Genome Assembly Self-alignment
Genome assembly self-alignment was performed using Brisk Inference of Segmental duplication Evolutionary stRucture (BISER; v1.2.3) which was developed to detect segmental duplications at low identity across multiple genome assemblies (Išeric´ et al. 2022).BISER defines a segmental duplication with the following criteria: below a certain error threshold based on shared sequence identity ɛ (∼15% for 75% shared sequence identity), longer than 1,000 bp, and paralog sequences may overlap up to a certain amount of base pairs, also defined by the error threshold.Before alignment, repetitive regions in the genome are soft-masked, prohibiting the initialization of alignments in low-complexity and highly repetitive sequences, such as LINEs, SINEs, and simple repeats.Duplications involving the mitochondrial chromosome were removed from the analysis since nuclear mitochondrial insertions represent a distinct class of variation (Verscheure et al. 2015), and two categories of duplications were defined: one where the paralogous sequence between duplication copies is at least 90% identical, and a subset where the identity rate is at least 95%.These sets are further divided based on their placement on assembled chromosomes (the 38 canine autosomes and the X chromosome) versus unplaced contigs in each assembly.Because BISER identifies duplicated segments as pairs of sequences and duplication pairs often overlap each other, we used BEDTools merge (v2.30) to create a nonredundant set of duplicated intervals (Quinlan 2014).

Distribution of Duplications Along the Genome
We analyzed the distribution of duplications located on the 38 canine autosomes and the X chromosome by permutation.The distribution of duplications was compared against 1,000 random permutations using BEDTools shuffle (v2.30), limiting placement to the assembled chromosome sequence (Quinlan 2014).This was performed three times, on the self-alignment duplications, the read-depth duplications, and on the retrogene set to look for chromosomal enrichment.

Locating Highly Duplicated Segments
Short segments with a high recurrent duplication level were identified using the duplication pairs found by selfalignment located on the assembled chromosomes.All duplications smaller than 2.5 Kb were located, and their coordinates rounded to the nearest hundred (i.e. a coordinate of 2,349 would become 2,300, and a coordinate of 3,175 would become 3,200), to account for minor difference in alignment endpoints.Duplications were merged into segments, and segments that contained at least four duplications were labeled as highly recurrent.

Read-depth Analysis
Read-depth analysis was applied to Illumina data from the assembled individuals using fastCN (Pendleton et al. 2018).First, a version of the assembly was created in which elements identified by RepeatMasker, tandem repeat finder (Benson 1999), WindowMasker (Morgulis et al. 2006), and over-represented 50-mers were masked.Next, all placements within an edit distance of two for 36-bp segments of each Illumina read were determined using mrsFAST (v3.4.1;Hach et al. 2010).The resulting depth profiles were tabulated, corrected for local GC content per fastCN, and converted to copy number estimates based on a set of control regions.For copy number normalization, we excluded reported deletions, duplications, and insertions (Wang et al. 2021), duplicated intervals from a preliminary self-alignment of the Mischka genome, duplicated segments identified by a preliminary fastCN analysis, and all chromosomes other than chr1 to 38.Copy number profiles were calculated based on the mean normalized depth tabulated in windows containing 1,000 unmasked positions.Duplicated regions are defined as at least four consecutive windows whose copy numbers all exceed 2.5 and have a total combined size equal to or greater than 10 Kb.An overall copy number was determined by taking the median copy number of all windows within a duplicated region.Masking of the additional genomes was performed as described earlier.Normalization controls were converted from Mischka coordinates using minimap2 (v2.26;Li 2018).China was omitted from these analyses because, while it does have BGI-seq short read sequencing data, we opted to only utilize Illumina-based data.Nala was omitted due to extreme GC bias observed in the associated Illumina data (supplementary fig.S4, Supplementary Material online).

Comparison of Duplications Identified Using Self-alignment and Read-depth Approaches
To compare the results found by self-alignment and the results found by read-depth, both sets of duplications (BISER and read-depth) were filtered to retain only (i) duplications on the assembled chromosomes; (ii) duplications greater than 15 Kb; (iii) duplications with at least 95% identity; and (iv) duplications that are not located in the first megabase of any autosome.Analysis was limited to assembled chromosomes due to differences in the treatment of unplaced contigs between assemblies.The read-depth analysis has a mean window size of 3,822 bp, with four windows of an average size equaling approximately 15 kb.We utilized this metric to filter our read-depth and our self-alignment sets to match this new criterion.The first megabase of each autosome is excluded because, as a result of masking, there are no informative read-depth windows in these regions.Self-alignment duplications were intersected with 1 Kb unique sequence read-depth windows, and a self-alignment duplication was confirmed supported if the windows had a median copy number of at least 2.5.Read-depth duplications were counted as supported if they overlapped with any self-alignment duplication.The methods described above for the Mischka assembly were expanded to the eight other canine genome assemblies with control region coordinates using liftoff (v1.6.3 [Li 2018;Shumate and Salzberg 2021]).In the gene analysis, read-depth windows are intersected with the protein-coding genes annotated in Mischka.
For the gene set of choice, we used a subset of 13,817 genes identified in Mischka, with the restriction that genes must be protein-coding, must be located on the 38 autosomes or the X chromosome, and that a gene must fully encompass at least three windows, where each window represents 1 Kb of nonrepetitive DNA sequence used by fastCN.A table was built to observe copy numbers across the genome, where a row represents a canine sample and a column represents a pre-determined window in the Mischka coordinates, with entries being the copy number.
Copy numbers for genes were determined as the median of all copy numbers in windows encompassed by a gene.A canine assembly is considered to have a duplicated gene if the copy number exceeds 2.5.

Retrogene Analysis
We searched for candidate retrocopies in the Mischka assembly based on the annotation of 19,972 protein-coding genes used by the Dog10K consortium (Meadows et al. 2023).A single transcript was selected for each gene, with priority given to the longest isoforms and the fewest number of exons.We removed any genes with an intron less than 50 bp, to account for many uninvestigated genes with single base-pair introns, and any gene that only had a single exon.
We constructed cDNA sequences using SAMtools faidx (v1.13) to extract the sequence of each exon (Danecek et al. 2021).Using BLAT, cDNA sequences were mapped against the genome with repetitive 11 mers filtered (−ooc option) (v.35) (Kent 2002).The resulting alignments were examined for hits that could be retrocopies.Alignments were removed if they were <100 bp, overlapped the genomic coordinates of the parent gene, had <90% sequence identity with the parent gene, or did not cross an exon-exon boundary of the parent gene.The same analysis procedure was applied to the other eight genome assemblies after converting the genes annotated in Mischka to the coordinates of each assembly using liftoff based on minimap2 alignments (v2.26 [Li 2021;Shumate and Salzberg 2021]); 1000 random permutations were conducted in the same manner described in the section Distribution of duplications along the genome.Due to paralogous genes, parent genes could not be resolved for a few retrocopies in each canine assembly.

Conducting GO Analysis
In several instances, large lists of genes were submitted to the GO database in order to find functional links (Ashburner et al. 2000;Thomas et al. 2022;Gene Ontology Consortium et al. 2023).These were submitted to the GO Enrichment Analysis (Powered by Panther) under the "Biological Process" tab to Canis lupus familiaris.A list of the genes submitted and their results can be found in supplementary table S9, Supplementary Material online.We utilized the 2024-01-17 release (10.5281/zenodo.10536401).

Examination of Retrogene Hallmarks
Retrogene containing loci from all nine canines were explored for the presence of retrogene hallmarks: TSDs, poly(A) tails, and LINE-1 endonuclease cleavage sites.Before detection proceeded, loci within 100 bp of reference gaps were removed with BEDTools (v2.26.0) window with parameters -w 100 and -u (Quinlan 2014).Additionally, any loci within the first or last 100 base pairs of each chromosome as well as loci which were comprised of multiple segments did not undergo hallmark detection.For each locus, a symmetrical number of base pairs were extracted upstream and downstream of the retrogene with SAMtools faidx (v1.13) (Danecek et al. 2021).
To detect TSDs, the upstream and downstream fasta files were aligned to each other using the water function within the EMBOSS module (v6.6.0;Rice et al. 2000).Water is a modified version of a Smith-Waterman local alignment and reports the highest scoring alignment only (Smith and Waterman 1981).The scoring matrix used was a custom matrix that used +2 for all matches and −6 for all mismatches (−1,000 for any alignment containing "N").Gap opening and gap extension penalties were both set to 10.Alignment output containing "N" as a base pair was discarded, as were final alignments shorter than five base pairs in length.If alignments started within 5 bp of the start of the upstream query or 5 bp of the end of the downstream query, the flanking distance was increased by 5, and the alignment was repeated.
Poly(A) detection followed TSD detection.The original flanking distance worth of base pairs were again extracted beyond the 3′ end of the retrogene with SAMtools faidx (v1.13).From there, strings of A or T, depending on retrogene orientation, were detected within the region.Detection started at the 5′ most base in the reference regardless of retrocopy orientation.Poly(A)s were terminated when a nonhomopolymeric nucleotide was detected and less than 4 of the next 5 nucleotides would be homopolymeric.Poly(A)s were then trimmed until they started and ended with 3 homopolymeric nucleotides in a row.Poly(A)s were not extended into previously detected TSDs.If multiple putative poly(A) tails were detected, the poly(A) closest to the TSD was reported if one was present.If no TSD was detected, the Poly(A) closest to the retrogene was reported.The minimum poly(A) tail size was 5 bp.
To assess if insertions possess a LINE-1 endonuclease cleavage site, loci which possess a detected TSD as determined above had the first five base pairs of TSD sequence extracted along with the two base pairs before the TSD (in retrocopy orientation) using SAMtools faidx (v1.13).If the retrocopy is in the same orientation as the chromosome, the sequence is reverse-complimented so that all cleavage site sequences are reported on the minus strand.Cleavage sites were then aggregated on a sample-by-sample basis and plotted using matplotlib (Hunter 2007) and logomaker (v0.8) (Tareen and Kinney 2020).
Various extract distances were trialed ranging from 10 to 2,000 bp with 60 bp being chosen to maximize concordance between TSDs and poly(A) (concordance defined as the number of loci in which the TSD and poly(A) are not seperated by five or more bases).While many of the samples had maximum concordance at 200 bp of flanking sequence, we chose 60 bp of flanking sequence as there was an observable diminishing return of concordance and we wanted to minimize the size of the flanking sequences.Though this method is limited in its ability to detect TSDs/ poly(A) combinations that are more than 60 bp combined, as well as hallmarks present in retrocopies containing unannotated UTR sequence, these limitations are thought to be minor as the vast majority of retrogenes contained at least one hallmark.Additional filtering was performed on the above data to identify the effects of more restrictive parameters.Specifically, from the above dataset, we removed any TSD or poly(A) shorter than 10 bp.Furthermore, for a locus to be counted as having both hallmarks, the TSD and poly(A) cannot be separated by 5 or more bases.
To assess fullORF loci, the list of retrocopies with full open reading frames was intersected with the loci which had their hallmarks identified (with 60 bp of flanking distance) in the previous step using bedtools intersect (v2.26).
Polymorphic loci were investigated for an alternate pattern of TSDs or endonuclease cleavage sites.For each locus, an additional 5 kb of flanking sequence of a retrocopy (or lack thereof, referred to as the "empty site") was extracted using samtools faidx (v1.13).At each locus, canines with starting coordinates longer than 20,000 bp were excluded from the analyses.The extracted sequences were aligned to one another using the Alignment with Gap Excision program (AGE) using the parameter both (v0.4;Abyzov and Gerstein 2011).If an insertion was detected, the inserted sequence was used as a query to blat against the cDNA of the retrocopy of interest (v35).Blat commands followed as such: "blat -minIdentity = 85 extracted_insertion_sequence cDNA_file blat_output_file."The corresponding longest blat hit must be greater than 75% of the length of the insertion.Individual canine comparisons underwent separate processing if the blat alignment was of insufficient length, the maximum insertion size was less than 80 bp, or there was a complex insertion with structural variation present in both the query and the reference at the insertion site.These hits were not investigated in downstream steps.For remaining comparisons AGE also detects locations of TSDs and is more stringent than our previous method, with detected TSDs being exactly at the boundary of the detected insertion and requires perfect sequence identity between upstream and downstream TSDs (Abyzov and Gerstein 2011).For loci that possessed a TSD, an endonuclease cleavage site was determined by extracting the last two bases before the TSD and the first 5 bp of the TSD (in retrocopy orientation) using samtools faidx (v1.13).Cleavage sites were reverse complemented if the orientation of the retrocopy was forward.Logo plots and histograms of this dataset were generated using logomaker(v0.8)and matplotlib (Tareen and Kinney 2020).At each locus, only one confirmed filled site was processed.Priority was given to canines with at least 5 bp of TSD sequence in the following order: mCanLor, then Mischka, Sandy, Nala, China, Zoey.

Estimating the Insertion Rate of Retrocopies
We selected contiguous genome assemblies from diverse dog samples to search for the insertion rate of retrocopies, including Mischka, Nala, China, Sandy, and Zoey relative to the mCanLor1.2assembly.We discounted all retrocopies that overlap a read-depth segmental duplication.For each canine retrocopy set, pairs of 500 bp flanks were extracted from each retrocopy and aligned to mCanLor1.2using minimap2 (v2.26) with the -c and -x asm5 options (Li 2018).Retrocopies whose flanks did not map were discounted entirely from the analysis, due to an inability to determine the presence of sharing between samples.For all retrocopies, if a retrocopy's presence across all six assemblies had not yet been determined in the initial mapping, both the flanks and the intervening sequence were derived from an assembly that did have the sequence (in order, priority was given to mCanLor, then Mischka, Sandy, Nala, China, and Zoey in that order) and mapped to all dogs to determine whether or not the retrocopy was truly present.If the flanks were present and the intervening sequence was also present, the retrocopy was marked as shared.If only the flanks were present, the retrocopy was marked as absent in that dog, and if flanks did not map, the entire retrocopy was discounted, once again due to the inability to determine sharing.The remaining retrocopies were totaled for presence or absence in each assembly and converted into an upset plot using UpSet (v.0.9.0; Lex et al. 2014).
Then, a retrocopy insertion rate was estimated based on a comparison between each assembly and mCanLor1.2.For each independent comparison, a retrocopy was considered as long as it could be determined whether or not it was present in mCanLor1.2and the compared assembly, regardless of whether it failed to map in another assembly.This allowed for determination of retrocopies unique to mCanLor1.2versus a comparison assembly.
To identify sequence identity of retrocopies to the parent gene, retrocopy sequences were compared against the parent gene cDNA using BLAT (v.35) (Kent 2002).Sequence identity was determined as the number of matches over the number of matches plus mismatches.Retrocopies were then divided into categories for comparison: one group with fullORF retrocopies versus those that were not; one group with retrocopies that were also found in either the dhole or the fox versus those that were not; and one group with retrocopies unique to the Mischka assembly versus those that were also found in the mCanLor1.2.assembly.
To further estimate rates of differences between genome assemblies, we identified single nucleotide differences among assemblies.Autosomal assembly size was determined by called regions between an assembly and mCanLor, as determined by minimap2 (v2.20;Li 2018), with the additional flags of -c -cs, outputting differences in PAF format using the paftools.jscall command.Regions are filtered to remove duplicated segments identified by read-depth, duplications identified by selfalignment, and simple tandem repeats identified by Tandem Repeat Finder (Benson 1999).SNPs called by mini-map2 were then intersected with filtered rions.The resulting SNPs were used to estimate the number of generations that had passed since divergence between mCanLor1.2and the other genome assemblies.We utilize the wolf SNP mutation rate estimated by Koch et al. (2019)  This formula calculates the approximate number of generations since divergence.We also calculated a range for the generations based on the range of the wolf mutation rate, which was 2.6× 10 −9 and 7.1 × 10 −9 .We divide the average number of unique retrocopies detected between a given canine assembly and mCanLor1.2and divide by the estimated generational since divergence to get an estimate for the rate of retrocopy insertion, and additionally calculated a range of rate of insertion based on the generation range (Lindblad-Toh et al. 2005;Freedman et al. 2014;Larson and Bradley 2014).

Locating Ancestral Retrocopies
We selected 3,658 retrocopies from the Mischka assembly to search for in the red fox (Vulpes vulpes, VulVul2.2) and dhole (Cuon alpinus, GWHAAAC00000000) genome assemblies.It should be noted that both genome assemblies are composed of contigs, as opposed to chromosomes.We extracted 1 Kb flanks from all the retrocopies with samtools faidx (v1.13) and used minimap2 (v2.26) with the -c and -x asm20 options to align these flanks to the fox or dhole genome.We filtered out any retrocopies whose flank pairs either had at least one member missing, or whose flank pairs mapped to different contigs.We then took the intervening sequence between the flanks and removed any that were not at least 75% of the size of the original retrocopy in Mischka.With the remaining set, we extracted the intervening sequence with samtools faidx from their respective genomes and used minimap2 with the same options to map the putative retrocopies back to the Mischka assembly.If a candidate retrocopy from the fox or dhole did not overlap the original location in Mischka, it was discounted from the analysis.

Fig. 1 .
Fig. 1.Estimated duplication content of canine assemblies.Duplication content was determined based on read-depth and genome assembly selfalignment.a) The amount of each genome assembly detected as duplicated using each method is shown.Since the read-depth approach is best able to detect larger, more similar duplications, the regions detected by genome self-alignment were filtered to segments with 95% sequence identity at least 15 Kb in length.China lacks Illumina data and thus has no read-depth analysis.b) A Venn diagram showing the intersection of the two methods for the Mischka assembly and for the mCanLor1.2assembly.Values are plotted in Mb.

Fig. 2 .
Fig. 2. Hallmarks of retrotransposition in detected retrocopies.Analysis of retrocopy presence across canine assemblies is shown.a) A diagram depicting the key steps of retrocopy formation, including the presence of target-site duplications and poly(A) tails at the insertion, is shown.b) The annotated sequence of a retrocopy from the TBPL1 parent gene, which is inserted in chr9 in the Mischka assembly, is shown.Highlighted sequence depicts the aforementioned hallmarks as indicated.This retrocopy retains the parental ORF, shown in bold.The short purple highlighted sequence corresponds to UTR sequence included in the retrocopy that is not part of the existing TBPL1 annotation.c) The distribution of how many retrocopies possess target-site duplications (TSDs), poly(A) tails, both, or neither in each of the nine canine assemblies.d) A histogram of the TSD lengths found for retrocopies in the Mischka assembly.TSD lengths of 50 or greater have been grouped into one bin.e) A histogram of the lengths of poly(A) tails found for retrocopies in the Mischka assembly.Lengths of 50 or greater have been grouped into one bin.f) An examination of target-site cleavage sequence for retrocopies in Mischka for retrocopies with at least a 10 bp TSD (N = 1615).It shows the sequence composition spanning 5 bases before and 2 bases after the inferred cut site.

Fig. 3 .
Fig. 3.Most frequent retrocopy parental genes.Each column lists the 10 genes that give rise to the most identified retrocopies in each assembly.Retrocopies presented here are limited to the assembled chromosomes.The number of retrocopies from each parent is given in parentheses.The same genes are colored identically to one another to quickly determine the presence of reoccurring genes.Only 11 parent genes are represented across all nine assemblies, with retrocopies from BTF3 found in the other assemblies at slightly decreased frequencies that removed it from the top ten.
2. Column 6 shows the final determined insertion rate, calculated by dividing the average between Columns 3 and 4 over the generations in Column 5, shown as an insertion rate of X * 10 −4 , and presented as 1 in n births.Column 7 shows the range of generations elapsed, based on the upper and lower bounds of the wolf mutation rate.Column 8 shows the range of insertion rate, based on the generation range in Column 7.
of 4.5 * 10 −9 per generation () and the size of the analyzed genome after filtering to determine generations passed as:

Table 1
Canine assemblies investigated

Table 2
Retrocopies identified in each canine assembly

Table 3
Rate of retrocopy formation in canine assemblies