Evaluation of whole-genome enrichment and sequencing of T. pallidum from FFPE samples after 75 years

Summary
Recent developments in genomic sequencing have permitted the publication of many new complete genome sequences of Treponema pallidum pallidum, the bacterium responsible for syphilis, which has led to a new understanding of its phylogeny and diversity. However, few archived samples are available, because of the degradability of the bacterium and the difficulties in preservation. We present a complete genome obtained from a Formalin-Fixed Paraffin-Embedded (FFPE) organ sample from 1947, kept at the Strasbourg Faculty of Medicine. This is the preliminary, proof-of-concept study of this collection/biobank of more than 1.5 million FFPE samples and an evaluation of the feasibility of genomic analyses. We demonstrate here that even degraded DNA from fragile bacteria can be recovered from 75-year-old FFPE samples and therefore propose that collections such as this one can function as sources of biological material for genetic studies of pathogens, cancer, or even the historical human population itself.


Bonah
vincent.zvenigorosky@gmail.com

Highlights
FFPE samples can still be used for genomic studies 75 years after embedding
Whole-genome enrichment can be successful even with fragile degraded bacteria
A new avenue of research into symptom-strain correlation in diseases like syphilis
Archived FFPE samples and medical records constitute potential biobanks of samples

INTRODUCTION
The Institute of Pathological Anatomy of the Faculty of Medicine of the University of Strasbourg has preserved around 1.5 million formalin-fixed paraffin-embedded (FFPE) samples, used for diagnoses related to clinical practice and autopsies over the twentieth century. This exceptional collection is associated with thousands of medical, autopsy, histology, and other records, making the ensemble a unique historical and biological resource. Although the usability of FFPE as a source of genomic DNA has been demonstrated, 1,2 this collection is exceptionally old, and significant degradation was expected. 3 FFPE samples are, however, one of the only types of medical samples that remain a source of DNA decades after collection 4-6 and the only one that can be preserved at room temperature.
Our research focuses on syphilis because this sexually transmitted disease is both of historical interest and an ongoing public health concern. 7,8 The number of genetic studies of the pathogenic spirochaete bacterium responsible for syphilis in humans, Treponema pallidum pallidum (hereafter referred to as TPA), has increased in recent years, 9 motivated by the resurgence/persistence of the disease in the western hemisphere. TPA is an obligate parasite, 10 and therefore could not be grown in the lab until very recently, and not without difficulty, 11 a fact that also precluded the use of many genomic analysis techniques, especially before the advent of Next Generation Sequencing (NGS) technologies. 9 Lineage conservation in rabbits carries specific drawbacks, such as the appearance of new genomic variants that were not present in infected patients, 12 but a larger drawback is the long-term preservation of frozen samples, which has historically been rare because it was unjustified before the advent of molecular genetics.
The diagnosis of syphilis improved over time, 13 including the recent sequencing of a complete genome from an FFPE sample. 14 Over the history of syphilis, however, detection methods have been somewhat unreliable 15-18 and the diagnosis of syphilis singularly complex. 19,20 Syphilis was also historically over-diagnosed for a variety of cultural or social reasons: the prevalence of the pathology in certain groups led to different disorders and rashes being attributed to syphilis without adequate testing. After the discovery of penicillin and the dramatic improvement of treatment, over-diagnosis was also maintained by the perceived innocuousness (partially justified) of prescribing antibiotics even in the absence of positive evidence of the disease. 21,22
Three subspecies have been described within the Treponema pallidum species, which cause yaws and bejel (two tropical diseases) and syphilis, caused by TPA. The more recent evolution and diversification of this bacterium is due to five factors: (1) it is a persistent pandemic, (2) it has been left untreated in some regions or groups, (3) it has been historically treated with ineffective methods, (4) it has sometimes been treated with (or exposed to) macrolide antibiotics instead of derivatives of penicillin and (5) it coexists with the two aforementioned diseases in tropical regions and other sexually transmitted infections around the world. Consequently, aside from sequencing for direct medical applications (namely vaccine development) and simpler methods of detection, population genetics studies of TPA focus on three aspects of the genome: (1) the assignment to the major clusters (namely strains Nichols and SS14 23) and their evolution, (2) the evolution of the variant (point mutation) associated with macrolide resistance 24 and (3) the interactions between the three known subspecies. 25 We therefore performed analyses with the aim of detecting the bacterium, characterizing the strain and determining whether a sample contained macrolide-resistant TPA.

Histological positives
Of the 32 re-embedded blocks tested using the Warthin-Starry staining method, 26 nine showed the presence of spirochaete bacteria [Table 1], surrounded by inflammatory infiltrate, as observed under a microscope [Figure 1 and Supplementary Materials and Methods (Observed Spirochaete)]. The presence of fungi was noted in some of the newly made slides [Supplementary Materials and Methods], indicating that the paraffin blocks themselves were contaminated.

PCR positives
The targeted fragment was detected [Table 1] using capillary electrophoresis in 18 out of the 71 FFPE samples tested (25%), corresponding to 11 of 25 patients (44%) and this was confirmed using capillary sequencing in 10 out of 18 cases (determination of the sequence failed in 8 cases).
No results were obtained from three of the four cases in which paraffin blocks had undergone fixation using Bouin. However, the one positive obtained after Bouin fixation (case 6709) corresponded to the most successful of all samples, with four separate PCR positives.

SNP multiplex
Only cases 6709 and 6715 gave satisfactory results (11 and 12 SNPs respectively, fewer than 5 for all others) and both could be assigned to the SS14 strain [Table 1]. These samples were therefore selected for targeted enrichment and whole-genome sequencing.

Complete genomes
Enrichment and Illumina MiSeq sequencing of the two selected high-quality samples (6709 and 6715) produced around 7 million reads. We mapped sequenced reads to the SS14 TPA reference genome (GenBank accession number CP004011.1), but the majority of reads did not map to that reference. Unmapped reads accounted for 99.64% of reads for sample 6709 (4,217,078 out of 4,232,247, leaving 15,169 mapped reads) and 81.89% for sample 6715 (1,252,682 out of 1,529,671, leaving 276,989 mapped reads). This was the consequence of unusually high coverage (>10,000x) for 50 of the designed probes, almost exclusively covering the 23S rRNA gene, which is present in two copies. We used the Kraken program to characterize unmapped reads and detected contamination from both bacterial and human DNA. The most common bacterial contaminants belonged to the Burkholderia and Paraburkholderia genera and to Cutibacterium acnes (all around 10% of the reads in sample 6709 but less than 1% in sample 6715), as well as Schlegelella aquatica (around 1.1% in sample 6709). Out of the 50 over-represented regions, 24 corresponded specifically to Burkholderia and explained the majority of unmapped reads. Despite the exclusion of the human genome in our design, human sequences made up 36.2% of contaminants in sample 6709 and 93.6% in sample 6715. These did not, however, hamper the analysis in any detectable way. We could not recover enough DNA from the same FFPE samples to repeat the analysis using another panel excluding the now-known contaminants.
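The mapping percentages reported above follow directly from the read counts. As a minimal sketch (the function name is hypothetical and is not part of the pipeline used in this study), the arithmetic can be reproduced as follows:

```python
def unmapped_fraction(total_reads: int, unmapped_reads: int) -> float:
    """Return the fraction of reads that did not map to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= unmapped_reads <= total_reads:
        raise ValueError("unmapped_reads must lie between 0 and total_reads")
    return unmapped_reads / total_reads


# Read counts as reported in the text: (total reads, unmapped reads).
samples = {
    "6709": (4_232_247, 4_217_078),
    "6715": (1_529_671, 1_252_682),
}

for name, (total, unmapped) in samples.items():
    mapped = total - unmapped
    percent = 100 * unmapped_fraction(total, unmapped)
    print(f"sample {name}: {mapped} mapped reads, {percent:.2f}% unmapped")
```

The same check can be applied to any other sample once its total and unmapped read counts are known.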
The final consensus sequence for sample 6709 only covered 32.25% of the bacterial genome at a depth greater than 1x (the average depth was 0.71x). It was therefore not included in phylogenetic analyses. Sample 6715 yielded a near-complete sequence, with 99.66% of the genome covered at a depth greater than 1x and an average sequencing depth of 13.99x. The sequenced genome of the bacterium for case 6715 was deposited into the NCBI database (accession number CP115658) and both FASTA-format files are provided in Supplementary Materials (Supplementary Files SF1 and SF2). The variants associated with macrolide resistance were not found in either genome and the reliability of this result is supported by the good coverage of these regions for both samples.
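Breadth of coverage and average depth, the two figures quoted for each sample, can be derived from a list of per-position depths. The sketch below is illustrative only: the function name is hypothetical, the input is assumed to contain one depth value per reference position (including positions with zero depth), and the threshold is treated as inclusive (depth >= min_depth), which is an assumption because the text does not state whether its 1x threshold is inclusive.

```python
from typing import Iterable, Tuple


def coverage_summary(depths: Iterable[int], min_depth: int = 1) -> Tuple[float, float]:
    """Return (breadth, mean_depth) for a sequence of per-position depths.

    breadth is the fraction of positions with depth >= min_depth (assumed
    inclusive threshold); mean_depth is the average over all positions.
    """
    values = list(depths)
    if not values:
        raise ValueError("depths must contain at least one position")
    covered = sum(1 for d in values if d >= min_depth)
    breadth = covered / len(values)
    mean_depth = sum(values) / len(values)
    return breadth, mean_depth
```

Applied to a toy input such as `[0, 1, 2, 0, 5]`, three of five positions meet the threshold. Positions with zero depth must be present in the input, otherwise both figures are overestimated.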

Phylogeny
Our results [Figure 2] place the genome of Treponema recovered from case 6715 (Strasbourg_6715) in the SS14 cluster (as also indicated by the multiplex). More specifically, the genomes closest to the Strasbourg_6715 sequence are Mexico-A (Mexico, 1953 25,27), MD18Be (USA, 1998 28), and MD06B (USA, 2002 28). We computed pairwise differences (single indels and substitutions) between Strasbourg_6715, the aforementioned sequences, the SS14 reference, the Nichols reference and CW82, the Nichols sequence closest to the SS14 cluster. We find that Strasbourg_6715 is roughly equidistant from Mexico-A, MD18Be and MD06B on one side (182 pairwise differences on average) and the SS14 reference on the other (160 pairwise differences). The strains in the Mexico-A cluster are closer to one another (<87 differences) than they are to either Strasbourg_6715 or SS14 (159 to 209 and 99 to 127 differences, respectively). These pairwise differences are distributed throughout the genome [Figure S7] but a few regions of interest show higher variability between Strasbourg_6715, the Mexico-A cluster, SS14 and CW82: the two 23S ribosomal RNA regions, TprI, TprJ, and TprK.
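Counting pairwise differences between two aligned genomes can be sketched as below. This is not the procedure used in this study, whose implementation is not described here: the function name is hypothetical, both sequences are assumed to come from the same multiple alignment (equal length, gaps written as "-"), each mismatching column, including a gap against a base, is counted as one difference, and columns containing an ambiguous base "N" are skipped.

```python
def pairwise_differences(seq_a: str, seq_b: str, ignore: str = "N") -> int:
    """Count differing alignment columns between two aligned sequences.

    A substitution or a gap-versus-base column counts as one difference.
    Columns where either sequence has a character listed in `ignore`
    (by default the ambiguous base N) are not counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in ignore or b in ignore:
            continue  # uncalled position: no evidence either way
        if a != b:
            differences += 1
    return differences
```

Skipping uncalled positions matters for low-coverage consensus sequences, which would otherwise appear artificially distant from every other genome.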

Successful histological and genetic detection of TPA
Previous studies have demonstrated the possibility of extracting DNA from FFPE samples stored for up to 40 years. 5 The present work shows the possibility of confirming past diagnoses from FFPE samples older than 70 years, even when storage conditions are sub-optimal. We have detected the presence of DNA endogenous to the sample (human or bacterial) in most paraffin blocks and confirmed half (14/28) of the past syphilis diagnoses using either histological staining or PCR. Given the complexity of the staining method and the scarcity of the bacterium inside the tissue after the removal of the inflammation area, this is a very satisfactory result. This also opens the possibility of analyzing other markers, such as methylations, which have so far been characterized in 30-year-old FFPE samples 4 or RNA sequences, obtained after 20 years. 29
These results also exemplify the difficulties inherent in working with archived clinical samples. All paraffin blocks produced without standardization or automation had to be re-embedded in cassettes (up to four different cassettes for the larger ones, increasing the number of slides to be examined). Only after delineation of regions of interest for genetic analyses on H&E-stained slides, re-cutting, and re-embedding, could the blocks be re-stained. Moreover, the Warthin-Starry method itself is a time-consuming procedure without automation and calls for very specific expertise, especially on fragile archived samples.
The contamination of blocks and slides by fungi resulting from temperature variation and humidity did not affect the stains but highlighted the necessity for such invaluable material to be stored in an appropriate environment.

Ideal samples for DNA analysis
There does not appear to be a correlation between the dates of collection and the efficacy of genetic analyses. Positive samples range from 1947 to 1962 and negative samples range from 1950 to 1962. The age of the patients at death is not evidently correlated but the most successful samples corresponded to the younger patients: four positives for case 6709 (six weeks old), two positives for case 6715 (nine days old), three positives for case 7767 (43 years old) and two positives for case 9904 (39 years old). There is also no detectable correlation between the number of paraffin blocks/samples analyzed and the number of positives. Instead, there is great variability between cases and blocks corresponding to any single case.
Although there is no clear correlation between the age of individuals and the quality of the DNA extracted, the analysis of histological records confirms one of the likely explanations for the success or failure of individual analyses: fixation time in formalin. 30 Our records do not indicate the delay between sample collection and histological analysis per se, but they mention the date of autopsy and often the date at which the report was finished and signed. From this information, we can infer that after the autopsy of younger patients, histological analyses were usually performed sooner after death. This could be explained by the necessity of confirming a diagnosis after an untimely death and is especially true in cases 6709 and 6715, two very young infants, for which histological analyses were possibly intended to confirm the serological status of their respective mothers (it is unclear whether the detection of syphilis would have been accurate enough for these women at the time). Further study of the records could allow for better estimations of the duration of each preparatory step and help select the FFPE samples most likely to yield high-quality DNA.

Phylogeny
The most notable cluster related to sample 6715 is composed of three genomes from samples collected in 1953, 1998, and 2002, in Mexico and the USA. This group has been described as the Mexico-A cluster 25,31 or the SS14-like cluster 32 and is an outlier compared to other SS14 substrains. 33 Genome Strasbourg_6715 is as distant from this cluster as it is from other SS14 genomes, which places it closer to the divergence point between those two branches and makes it another outlier within the SS14 cluster. Based on these 7 accurately dated sequences, we can estimate a mutation rate of 0.58 mutations per genome per year.
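One common way to obtain a per-year rate from dated sequences is a least-squares regression of genetic distance against sampling year, the slope being the rate. The text does not state how the figure above was derived, so the following is only a sketch under that assumption; the function name and its inputs (one sampling year and one distance to a common reference point per genome) are hypothetical.

```python
from typing import Sequence


def rate_per_year(years: Sequence[float], distances: Sequence[float]) -> float:
    """Least-squares slope of distance against sampling year.

    `distances` are assumed to be differences per genome measured from a
    common reference point, so the slope is in differences/genome/year.
    """
    n = len(years)
    if n != len(distances) or n < 2:
        raise ValueError("need at least two (year, distance) pairs of equal length")
    mean_year = sum(years) / n
    mean_dist = sum(distances) / n
    sxx = sum((x - mean_year) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all sampling years are identical; slope is undefined")
    sxy = sum((x - mean_year) * (y - mean_dist) for x, y in zip(years, distances))
    return sxy / sxx
```

With only a handful of dated genomes, such a slope is sensitive to each individual point, so any estimate obtained this way should be read as indicative rather than precise.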

Perspectives
The lack of knowledge of potential contaminants led to the design of an enrichment panel that favored the contaminant over the organism of interest in certain genomic regions. While these results are somewhat specific to our collection, they certainly warrant caution in all cases and will permit the design of better-adapted panels for the sequencing of Treponema genomes or markers of other diseases in the future. Burkholderia bacteria in our FFPE samples likely originate from the direct environment of the autopsy, collection area or storage area, and Cutibacterium acnes could have been introduced by the examiners, the lack of proper sterilization of the instruments, the reuse of instruments or any manipulation of the bodies.
Treponema bacteria are especially fragile and PCR detection at different stages of the disease can be ineffective, even in living tissue or biological fluids. The fact that the present study was able to successfully sequence a full Treponema genome from 75-year-old FFPE material is an indication of the possibilities regarding other, more durable bacteria, as well as genetic markers of cancer, viral infection, or other infectious or non-infectious diseases.
It is also evident that the recovery of human DNA from the same samples would present far fewer difficulties and could be of many uses. This collection represents a cross-section of the entire Alsatian population from 1947 to the 2000s and includes a great number of different diseases and conditions.

Conclusion
We have shown that genetic material can be recovered from most FFPE samples from the 1940s onward and that bacterial genetic material can also be recovered if the histological treatment of the samples falls within certain parameters. In some cases, it is even possible to recover enough material to succeed in sequencing whole bacterial genomes. Complex histological staining can be applied to the material and spirochaete bacteria can be observed after 75 years. This study should serve to highlight the necessity of preserving knowledge of techniques now considered obsolete or unnecessarily complex, without which the analysis of these archived samples would have been more difficult.
Collections of archived FFPE samples associated with extensive technical and medical records are not commonplace, and the present study demonstrates that their value exceeds their historical significance. Such collections are potential sources of human, bacterial, viral, parasite, or tumor DNA (or RNA from more recent FFPE samples). Retrospective studies could be performed using hundreds or thousands of confirmed cases of any given pathology. It is, therefore, necessary to ensure the preservation of such records, by improving storage conditions, digitizing written records, transcribing the information into a usable format, and standardizing genetic and histological protocols.
The complete genome presented here represents the oldest TPA sequence obtained directly from a clinical sample and its analysis reveals the presence of the SS14-like strain earlier than previously described (and in Europe). It therefore appears essential to proceed with the analysis of all archived syphilis samples available in this collection or other collections, to better understand the phylogenetic history of the bacterium.

Limitations of the study
Our research was limited to retrieving syphilis cases associated with postmortem samples. This is because, unlike other pathologies that might lead to biopsies and sampling in living patients, tissue samples are not taken from living patients in suspected cases of syphilis. Diagnosis is based on serological samples. Samples from living patients, which might provide valuable insights, are generally not preserved.
The complete genome we have obtained (Strasbourg_6715) was retrieved from a sample corresponding to a case of congenital syphilis, in which the treponemal burden is higher than in other cases. Further analyses are therefore necessary to confirm that PCR positives obtained from adult cases could also lead to the successful sequencing of complete genomes.
The impact of fixatives used in sample preservation also remains largely unexplored. Although we initially assumed that the use of Bouin fixative could make subsequent analyses unfeasible, our study found that this was not the case.
The presence of fungi and other contaminants highlighted the complex degradation conditions of the FFPE samples. These samples were not preserved in well-controlled conditions (i.e., temperature, humidity, dust). This is crucial information for similar archives yet to be mobilized, because it confirms that conditions within such environments can significantly affect the integrity and analytical outcomes of the samples.
We compared Strasbourg_6715 to the five closest samples in the phylogeny and analysed the distribution of SNPs across the genomes. Our results show that they are not strictly concentrated in one region but that certain regions are overrepresented, most notably the 23S ribosomal RNA, an outer membrane protein, TprI, TprJ and TprK.

QUANTIFICATION AND STATISTICAL ANALYSIS
FastQ files generated by the Illumina MiSeq were treated using the EAGER pipeline v2.4.5. 34 The quality of sequenced reads was estimated using FastQC v0.11.9 35 and the adapters were removed using AdapterRemoval v2.3.2. 36 Sequenced reads were mapped against the TPA strain SS14 reference (CP004011.1) using BWA v0.7.17. 37 Duplicate reads were removed using markduplicates v2.26.0. 38 Consensus sequences were generated using Samtools Consensus (Samtools v1.16.1 39) under the 'output all bases' parameter and not showing insertions. The complete sequence for the sample for patient 6715 was aligned against 269 published sequences of the three subspecies of human T. pallidum (T. p. pallidum, T. p. pertenue, T. p. endemicum). These published sequences were collected from the NCBI database and represent all available complete genomes for which dating and strain information was available [Table S2]. A prior alignment of these sequences was performed against the same TPA SS14 reference (CP004011.1) using MAFFT v7.505 40 with a PAM of 1 and no iterative refinement. The sample from patient 6709 was excluded from these analyses.
The phylogeny was reconstructed using IQ-TREE v2.0.3 41 and 1000 bootstrap replications. We used TempEst v1.5.3 42 and the original sampling dates to root the tree. The phylogenetic tree was visualised using the ape package 43 for R. 44 The characterization of contaminating sequences was performed using Kraken. 45,46

Figure 1. Histological characterization of case 6715. Liver sample from a congenital syphilis case in 1947, fixed in formalin; red arrows indicate spirochaete bacteria.

Table 1. Summary of results