Clinical Annotation Reference Templates: a resource for consistent variant annotation

Annotating the impact of a variant on a gene is a vital component of genetic medicine and genetic research. Different gene annotations for the same genomic variant are possible, because different structures and sequences for the same gene are available. The clinical community typically use RefSeq NMs to annotate gene variation, which do not always match the reference genome. The scientific community typically use Ensembl ENSTs to annotate gene variation. These match the reference genome, but often do not match the equivalent NM. Often the transcripts used to annotate gene variation are not provided, impeding interoperability and consistency. Here we introduce the concept of the Clinical Annotation Reference Template (CART). CARTs are analogous to the reference genome; they provide a universal standard template so reference genomic coordinates are consistently annotated at the protein level. Naturally, there are many situations where annotations using a specific transcript, or multiple transcripts are useful. The aim of the CARTs is not to impede this practice. Rather, the CART annotation serves as an anchor to ensure interoperability between different annotation systems and variant frequency accuracy. Annotations using other explicitly-named transcripts should also be provided, wherever useful. We have integrated transcript data to generate CARTs for over 18,000 genes, for both GRCh37 and GRCh38, based on the associated NM and ENST identified through the CART selection process. Each CART has a unique ID and can be used individually or as a stable set of templates; CART37A for GRCh37 and CART38A for GRCh38. We have made the CARTs available on the UCSC browser and in different file formats on the Open Science Framework: https://osf.io/tcvbq/. We have also made the CARTtools software we used to generate the CARTs available on GitHub. We hope the CARTs will be useful in helping to drive transparent, stable, consistent, interoperable variant annotation.


Introduction
An integral component of next generation sequencing (NGS) gene analysis methods is the annotation of variation using the human reference genome as a baseline. By contrast, historical gene analysis methods, such as Sanger sequencing, can choose which sequences to use for variant annotation. The majority of the clinical community, and much of the clinical research community, use RefSeq NM transcripts as baseline sequences for variant annotation 1 . This can result in annotation inconsistencies for several reasons. Firstly, there are often several different NMs available for a particular gene and laboratories choose to use different NMs to annotate the same variant in different ways. Secondly, NMs are curated mRNA derived sequences and do not always match the reference genome, leading to annotation inconsistencies with NGS-based gene analyses. Thirdly, gene structure information, such as intron-exon boundary positions, is not included in the NMs. Instead this information is overlaid inconsistently, which can lead to different annotations of exon deletion/duplication variants, which are an important cause of disease 2,3 . ALMS1 provides an example of the variant annotation inconsistencies that can occur. Pathogenic variants in ALMS1 cause Alstrom syndrome (MIM 606844). There is only one RefSeq transcript for ALMS1, NM_015120.4. This transcript has two different 3 bp insertions, in exons 1 and 8, compared to reference genome build GRCh37. This means variants with the same genomic coordinates can have different annotations. For example, variant chr2:73717247C>T (GRCh37) is annotated as c.8164C>T; p.Arg2722Ter in ClinVar, OMIM and the medical literature 4,5 , but as c.8158C>T; p.Arg2720Ter in resources that use reference genome based transcripts for annotation such as ExAC, VEP or CAVA 6-8 . Adding further complexity, NM_015120.4 differs from build GRCh38 by only one 3 bp insertion, in exon 1.
So the same variant annotated on GRCh38 would have genomic coordinates of chr2:73490120C>T, and would be annotated as c.8164C>T; p.Arg2722Ter using NM_015120.4 but c.8161C>T;p.Arg2721Ter in resources using reference genome based transcripts for annotation.
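The ALMS1 numbers above can be checked with a little arithmetic. The following is a minimal sketch, assuming that every 3 bp insertion counted for a build lies upstream of the variant in the coding sequence (which is what the reported c. positions imply); the function names are illustrative and are not part of CARTtools.

```python
def codon_number(cds_position: int) -> int:
    """Return the 1-based codon that contains a 1-based CDS position."""
    return (cds_position + 2) // 3


def genome_based_position(nm_position: int, upstream_inserted_bases: int) -> int:
    """Shift an NM-based CDS position onto a genome-based transcript.

    Assumption: all inserted bases present in the NM but absent from the
    reference genome lie upstream of the position being converted.
    """
    return nm_position - upstream_inserted_bases


nm_position = 8164  # c.8164C>T on NM_015120.4

# GRCh37: the NM carries two 3 bp insertions relative to the genome.
grch37_position = genome_based_position(nm_position, 2 * 3)
# GRCh38: the NM carries one 3 bp insertion relative to the genome.
grch38_position = genome_based_position(nm_position, 1 * 3)

for label, pos in [("NM_015120.4", nm_position),
                   ("GRCh37-based", grch37_position),
                   ("GRCh38-based", grch38_position)]:
    print(f"{label}: c.{pos}, codon {codon_number(pos)}")
```

Under these assumptions the sketch reproduces the three annotations quoted in the text: c.8164 (codon 2722), c.8158 (codon 2720) and c.8161 (codon 2721).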
NGS-based gene analyses often use ENST transcripts as the baseline sequences for variant annotation. ENSTs always match the reference genome. However, similar to NMs, multiple ENSTs are available for many genes and laboratories may choose to use different ENSTs to annotate variation in a given gene. This can result in variant annotation differences between laboratories using different ENSTs and between laboratories using NMs for annotation and those using ENSTs. A further issue is that ENSTs are frequently updated, potentially compromising the stability of annotations, particularly if an ENST is retired. For example, compared with Ensembl release 91, Ensembl release 92 retired 314 transcripts, included 3,226 new transcripts, and 1,336 new version numbers for existing transcripts, predominantly due to changes in the untranslated region (UTR) and/or coding sequence (CDS). BMPR1A exemplifies the problems updates can inadvertently cause. NM_004329.2 is the only RefSeq transcript available for BMPR1A. Historically, this RefSeq NM was associated with ENST00000224764, which has been used to annotate BMPR1A variants in many publications, reports and databases 6,9 . However, ENST00000224764 is no longer available; links to it now state 'this identifier is not in the current EnsEMBL database', compromising integration of historical and current BMPR1A variant annotations.
Although it is usually possible to work out how different ENSTs and NMs relate to each other, it is difficult, time-consuming and rarely done. Instead people often assume annotations are consistent. If this is a misassumption, it can lead to downstream scientific and clinical errors, particularly in relation to clinical interpretations about variant pathogenicity. A common error of this type is the assumption that if a variant (annotated using an NM) is not present in the default presentation of ExAC (annotated using an ENST) it must be exceptionally rare, and hence more likely to be pathogenic. However, it is possible the relevant variant is present in ExAC but has a different annotation, because the ENST selected differs from the NM selected. BDNF is an example of this problem. BDNF is associated with 17 different NMs. In 2002, a BDNF variant, c.29C>T;p.Thr2Ile, annotated using NM_001709.4, was proposed to cause a severe condition called congenital central hypoventilation syndrome (CCHS) 10 . In ClinVar NM_170731.4 is used and the same BDNF variant is called p.Thr10Ile. Neither p.Thr2Ile, nor p.Thr10Ile appear in the default annotations in ExAC, which use ENST00000438929. This results in the variant, g.11:27680107G>A, being called p.Thr84Ile, which is present in 132 individuals. At this allele frequency (0.001) it would be a major cause of CCHS if it was a disease-causing variant. However, to our knowledge no one with CCHS and this variant has been reported since the original publication, and it is highly unlikely to be a pathogenic variant. OMIM have downgraded the variant from pathogenic to uncertain significance since we brought this issue to their attention (MIM 113505).
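The BDNF example can be made concrete with a small lookup. The sketch below is illustrative only: it uses the three transcript and protein-level annotations quoted above, and the helper names are hypothetical. It shows why a search by protein-level name can miss a variant that a search by genomic coordinate finds.

```python
from collections import defaultdict

# (transcript, protein-level annotation) -> genomic variant, as quoted in the text
ANNOTATIONS = {
    ("NM_001709.4", "p.Thr2Ile"): "g.11:27680107G>A",
    ("NM_170731.4", "p.Thr10Ile"): "g.11:27680107G>A",
    ("ENST00000438929", "p.Thr84Ile"): "g.11:27680107G>A",
}


def group_by_genomic_variant(annotations):
    """Collect every transcript-level name that refers to the same genomic variant."""
    grouped = defaultdict(list)
    for (transcript, protein), genomic in annotations.items():
        grouped[genomic].append((transcript, protein))
    return dict(grouped)


def names_on_transcript(annotations, transcript):
    """Protein-level names visible when only one transcript is displayed."""
    return {protein for (tx, protein) in annotations if tx == transcript}


grouped = group_by_genomic_variant(ANNOTATIONS)
default_view = names_on_transcript(ANNOTATIONS, "ENST00000438929")

# Searching the ENST-based view by the NM-based name finds nothing ...
print("p.Thr2Ile" in default_view)
# ... although the same genomic variant is present under another name.
print(grouped["g.11:27680107G>A"])
```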
Given the intrinsic differences in the widely used variant annotation systems it is essential that the transcripts used for variant calling are transparently provided and stably available. However, this often does not occur. Moreover, it is becoming increasingly challenging to provide this information on a gene-by-gene basis, because many analyses now generate variant calls from thousands of genes.
To address this important issue we here introduce the concept of the Clinical Annotation Reference Template (CART) and provide CARTs for GRCh37 and GRCh38 11 . The CARTs aim to provide standard, interoperable, stable gene templates for variant annotation that are based on the reference genome sequence, include the required structural information, and can be used either individually or as a set.
CARTs can be considered analogous to the reference genome; they provide a universal standard template so the reference genomic coordinates of a variant are consistently annotated at the protein level. Of course, there are many situations where annotations using a specific transcript, or all available transcripts are useful. The aim of the CARTs is not to impede or curb this practice. Rather, we propose that the CART annotation is always provided, as an anchor to ensure interoperability between different annotation systems and variant frequency accuracy. Additionally, annotations using other explicitly-named transcripts should also be provided where necessary or useful. To facilitate transparent, consistent variant annotations of panel/exome/genome tests the CARTs can be used as a set. For example, at the bottom of a clinical exome report, or in a publication, it could be stated that variants were called using the CART37A series, except where otherwise stated.
We hope the CARTs will be useful in helping to drive transparent, stable, consistent, interoperable variant annotations.

Datasets
We used the following datasets in the CART selection process.  The CART selection and generation process is described below. The process uses eight scripts which are described in detail in Extended Data File 2 11 . We have made the scripts available as CARTtools (see CART availability) 16 . They can be run in one command or each script can be used separately.

Algorithmic NM selection
For each gene we used both the APPRIS and RefSeq genomic alignment data to identify a single NM on which to base the CART. We call this the 'Algorithmic NM' (Figure 1). If a gene had an APPRIS principal isoform associated with only one correctly aligned NM it was selected as the Algorithmic NM. If a gene had multiple NMs associated with the APPRIS principal isoform we used a UTR selection process to select a single Algorithmic NM (Figure 1). The goal of the UTR selection process is to reduce the number of transcripts through a sequential selection process until only one NM remains. The UTR selection process includes three major criteria (A, B, C) and 10 minor criteria (A1-A3, B1-B4, and C1-C3) as described in Figure 1. The criteria are applied sequentially to all the available NMs associated with the APPRIS principal isoform until one NM is removed. The UTR selection process then restarts at A1 using the remaining NMs, until a single NM remains, which becomes the Algorithmic NM.
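The restart-after-each-removal control flow of the UTR selection process can be sketched as follows. This is a minimal illustration under stated assumptions, not the CARTtools implementation: the actual criteria A1-C3 are defined in Figure 1 and are not reproduced here, so the two criteria below (`utr5_len`, `utr3_len` and the "remove the unique shortest" rule) are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Optional

Transcript = Dict[str, object]
# A criterion inspects the remaining candidates and names one to remove, or None.
Criterion = Callable[[List[Transcript]], Optional[Transcript]]


def drop_unique_minimum(key: str) -> Criterion:
    """Placeholder criterion: remove the candidate with the strictly smallest value."""
    def criterion(candidates: List[Transcript]) -> Optional[Transcript]:
        values = sorted(t[key] for t in candidates)
        if values[0] == values[1]:  # tie for the minimum, so this criterion cannot decide
            return None
        return min(candidates, key=lambda t: t[key])
    return criterion


def sequential_selection(candidates: List[Transcript],
                         criteria: List[Criterion]) -> Optional[Transcript]:
    """Apply criteria in order until one candidate is removed, then restart."""
    remaining = list(candidates)
    while len(remaining) > 1:
        for criterion in criteria:
            to_remove = criterion(remaining)
            if to_remove is not None:
                remaining.remove(to_remove)
                break  # restart from the first criterion with the reduced set
        else:
            return None  # no criterion could separate the remaining candidates
    return remaining[0] if remaining else None


# Hypothetical candidates and criteria, for illustration only.
nms = [
    {"id": "NM_A", "utr5_len": 100, "utr3_len": 50},
    {"id": "NM_B", "utr5_len": 100, "utr3_len": 80},
    {"id": "NM_C", "utr5_len": 60, "utr3_len": 80},
]
criteria = [drop_unique_minimum("utr5_len"), drop_unique_minimum("utr3_len")]
selected = sequential_selection(nms, criteria)
print(selected["id"] if selected else None)
```

Returning `None` when no criterion can discriminate is a design choice of this sketch; how the published process resolves such cases is described in Figure 1 and Extended Data File 2.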
We did not select an Algorithmic NM if a) there were no NMs to select from, b) there were multiple NMs with the same genomic coordinates, c) one or more NMs assigned as principal by APPRIS had different CDS or d) we could not match the NCBI Gene ID to an HGNC ID. The Algorithmic NMs are given in Extended Data File 1 11 .

Community NM selection
We used RefSeqGene and ClinVar data to identify NMs used in the clinical diagnostic and clinical research communities. If the gene was in the RefSeqGene database we used the RefSeqGene NM(s) as the Community NM(s). If the gene was not in the RefSeqGene database we used the ClinVar NM(s) as the Community NM(s). If a gene was in neither database we did not select a Community NM.

CART associated NM selection
We used the Algorithmic and Community NMs to select the final CART associated NM. We used the Algorithmic NM as the CART associated NM if there was no Community NM or if the Algorithmic NM was a Community NM. We used the Community NM as the CART associated NM if there was no Algorithmic NM or if there was a single Community NM that differed from the Algorithmic NM. The CART associated NMs are given in Extended Data File 1 11 .
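The decision rules in this section and in the Community NM selection can be written as two small functions. This is a sketch under stated assumptions rather than the CARTtools code: the function names and the dictionary inputs are hypothetical, and the text does not say what happens when several Community NMs exist and none is the Algorithmic NM, so that case returns `None` here.

```python
from typing import Dict, List, Optional


def community_nms(gene: str,
                  refseqgene: Dict[str, List[str]],
                  clinvar: Dict[str, List[str]]) -> List[str]:
    """RefSeqGene NM(s) take precedence; otherwise ClinVar NM(s); otherwise none."""
    if gene in refseqgene:
        return refseqgene[gene]
    return clinvar.get(gene, [])


def cart_associated_nm(algorithmic: Optional[str],
                       community: List[str]) -> Optional[str]:
    """Choose the CART associated NM from the Algorithmic and Community NMs."""
    if algorithmic is not None and (not community or algorithmic in community):
        return algorithmic  # no Community NM, or the Algorithmic NM is a Community NM
    if len(community) == 1:
        return community[0]  # no Algorithmic NM, or a single differing Community NM
    return None  # not specified in the text (assumption: leave unresolved)


# Hypothetical identifiers, for illustration only.
print(cart_associated_nm("NM_000001.1", []))
print(cart_associated_nm("NM_000001.1", ["NM_000002.1"]))
```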

CART associated ENST selection
For each CART associated NM we next selected the closest matching ENST using the coordinates of the mapped NM (from RefSeq or UCSC if coordinates were not available from RefSeq) and Ensembl's ENST coordinates (Extended Data File 1 11 ). To be selected, the CDS genomic coordinates of the ENST had to be identical to the CART associated NM. If there was only one ENST with identical CDS to the CART associated NM, it was selected as the associated ENST. If the CDS matched but there were UTR differences between the CART associated NM and available ENSTs we used the following selection process to select a single ENST. We prioritised ENSTs with the same number of 5' UTRs as the CART associated NM. If none were available we prioritised ENSTs in which the 5' UTR genomic location encompassed the 5' UTR genomic location in the CART associated NM. If more than one ENST was available that matched these prioritisation criteria, or no ENST was available that matched the prioritisation criteria, we used the UTR selection process shown in Figure 1 to select a single CART associated ENST.
Using the above process, we were able to generate CARTs for 94% (18,000/19,171) of genes on GRCh37 and 96% (18,330/19,171) of genes on GRCh38. With respect to the differences between the CART associated ENST and the CART associated NM, all have identical CDS (by definition) and 16% (3,110) in GRCh37 and 17% (3,325) in GRCh38 also have identical UTRs (Figure 3A). The CARTs for GRCh37 and GRCh38 have identical CDS and UTR for 75% (14,350/19,171) of genes and identical CDS for 91% of genes (17,514/19,171) (Figure 3B).

The CARTs
The genomic coordinates of the associated ENST were used as the genomic coordinates of the CART. Each CART has a unique identifier (CART ID) defined as: CART<genomeBuild><series><CARTNumber> (Figure 2). The genomeBuild is the human reference genome build the CARTs are aligned to, for example 37 for GRCh37. The CART series represents the full set of stable templates that the template belongs to, for example A for series A. The CARTNumber is a unique template number starting at 10,001. We used the same CARTNumber if the genomic sequence of the CART template for the gene did not change between builds. Thus for KCNC3 the CART IDs are CART37A25530 and CART38A25530 because the sequence and structures of the UTR and CDS are identical on GRCh37 and GRCh38. If the genomic sequence of the CART changed between genome builds the CARTNumber also changes, with the new CARTNumber always being the next available CARTNumber. For example, the CARTs for UMPS are CART37A11618 and CART38A28332 because the UMPS 3' UTR is longer in GRCh38 than in GRCh37. The CARTtools output provides further details about the CARTs as shown in Extended Data File 1 11 .
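The CART ID structure lends itself to simple parsing. The sketch below assumes, from the examples in the text, that the genome build is numeric, the series is a single uppercase letter, and the CARTNumber is an integer of at least 10,001; the class and function names are illustrative and are not taken from CARTtools.

```python
import re
from typing import NamedTuple


class CartId(NamedTuple):
    genome_build: int  # e.g. 37 for GRCh37
    series: str        # e.g. "A"
    number: int        # CARTNumber, starting at 10,001


# Assumed pattern: CART<genomeBuild><series><CARTNumber>
_CART_ID = re.compile(r"^CART(\d+)([A-Z])(\d+)$")


def parse_cart_id(text: str) -> CartId:
    """Split a CART ID into genome build, series and CARTNumber."""
    match = _CART_ID.match(text)
    if match is None:
        raise ValueError(f"not a CART ID: {text!r}")
    build, series, number = match.groups()
    if int(number) < 10001:
        raise ValueError(f"CARTNumber below 10,001: {text!r}")
    return CartId(int(build), series, int(number))


def template_unchanged(id_a: str, id_b: str) -> bool:
    """True if two CART IDs share a CARTNumber, i.e. the template did not change."""
    return parse_cart_id(id_a).number == parse_cart_id(id_b).number


print(parse_cart_id("CART37A25530"))
print(template_unchanged("CART37A25530", "CART38A25530"))  # KCNC3
print(template_unchanged("CART37A11618", "CART38A28332"))  # UMPS
```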

Data availability
Underlying data
We have made the CARTs available on the UCSC browser. They can be found by searching for 'CART37A' or 'CART38A' or directly through the following links. For CART37A, the CARTs for GRCh37: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Rahman.team&hgS_otherUserSessionName=CART37A.
We have also made the CARTs 11 available in the following annotation file formats GFF2, GFF3, GenePred, GenBank, FASTA and CAVA database, so the CARTs can be easily integrated into popular variant annotation or analysis tools such as VEP 7 , SnpEff 17 , ANNOVAR 18 , Mutation Surveyor 19 and CAVA 6 . If a gene does not have a CART we provide Ensembl's 'canonical' ENST for that gene in the output files. Further information is available in the CARTtools documentation (Extended Data File 2 11 ) 16 .
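As an illustration of how one of these formats might be consumed, the following minimal sketch parses a single GFF3 feature line using only the standard nine tab-separated columns of the GFF3 specification. The example line, including its source, attribute names and coordinates, is hypothetical and is not copied from the CART files; the actual attributes used are described in the CARTtools documentation.

```python
from typing import Dict
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Dict[str, object]:
    """Parse one GFF3 feature line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    # Column 9 holds semicolon-separated tag=value pairs; values are URL-escaped.
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # GFF3 coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical feature line, for illustration only.
example = "19\tCART\texon\t100\t200\t.\t-\t.\tID=exon1;Parent=CART37A25530"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Parent"])
```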
The data files for CART37A and CART38A are available on the Open Science Framework (OSF): http://doi.org/10.17605/OSF.IO/TCVBQ 11 . Data are available under the terms of a CC0 1.0 Universal licence.

Extended data
Extended data files have been archived on Open Science Framework: http://doi.org/10.17605/OSF.IO/TCVBQ 11 . Data are available under the terms of a CC0 1.0 Universal licence.

Extended Data File 1. CART summary information.
Descriptions of the column headings are given on OSF.

Software availability
The latest release of CARTtools 16 is available at: https://github.com/RahmanTeamDevelopment/CARTtools/releases. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Summary
The authors address the issue of how to ensure coherent annotated transcript references, considering that their further utilization by variant annotation tools represents an important step in ensuring derived variants that can be consistently disseminated. A comprehensive motivation details why the utilization of current RefSeq NMs and Ensembl ENSTs could lead to transcript annotation inconsistencies for the same genomic variant, possibly leading to further downstream clinical and scientific errors. They tackle the aforementioned problem by proposing a selection algorithm to integrate information from both RefSeq and ENST transcript records into what they entitle as CARTs (Clinical Annotation Reference Templates). The authors hope that CARTs will be helpful for the community by offering the means to derive consistent variant annotations.
Issues to be addressed
-The definition of a CART seems to be ambiguous. A more formal definition would be required. How do the authors envision the use of CARTs in clinical practice? How do CARTs improve on utilizing genomic references with respect to interoperability?
-How to derive a CART description in a variant annotation step?
-The reasonings behind the selection processes employed in various parts of the algorithm are completely missing. This is essential in being able to properly judge the approach used.
-While the authors provide valuable examples in their motivation of why the utilization of RefSeqs and/or ENSTs is problematic, after describing CARTs derivation, they do not return to their examples to indicate if and how CART would actually help in those respects.
-It seems from the text that Figure 1 does not include all the cases when no Algorithmic NM was selected.
-Figures for other selection processes besides the algorithmic one would make the paper more readable and the concepts much easier to be understood.
-Could it be that the utilization of genomic references would totally alleviate the mentioned NMs and ENSTs issues? Would this make CART obsolete in a short time period?
-For a significant number of transcripts no CARTs were identified. What solutions do the authors suggest for those transcripts? Is there any relation between those CART-less transcripts and the transcripts for which RefSeq and ENSTs also give problems?
-What happens when the genomic reference annotations are updated? The GRCh38 employed by CART is p10, but now there is already p12 available.
-The CART series is not clearly defined. Does it ever change? If so, in what circumstances?
-Why was not the HGNC ID employed as the CARTNumber? It could possibly lead to further RefSeq -ENST unification and disambiguation.
-In the "Methods" section the provided information does not allow for an easy and straightforward replication. There are multiple files at the provided link locations and their match with the text is not clear. An exact file link and (maybe) some details about their content and what information was employed would further improve the replication process.
-In the introduction it is mentioned that transcripts are used for variant calling. Actually, in the field genomic references are used in variant calling, while transcripts are used for variant annotation.
-It seems like CARTs do not consider RefSeq NMs that do not have HGNC IDs.
-The provided GenBank files are not self-contained since they contain neither the RefSeq NM nor the ENST. These could be given in the source feature, which is missing, even though it is a mandatory feature for GenBank files (according to the information provided in the link referred to in the documentation file). In addition, there is no information about the chromosomal reference.
-It is worth mentioning that similar initiatives have been already proposed, i.e., LRG, MANE, and CHESS. A thorough comparison with the alternatives is required.
Is the rationale for creating the dataset(s) clearly described? Partly
the research and clinical community. Clinical pipelines are complex and not particularly responsive to change. The same could perhaps be said of research pipelines used to generate large amounts of sequencing data.
Major points:
-While this effort is necessary and important, other unification efforts do exist that should be commented on in the introduction or the discussion. How is this effort different from the Locus Reference Genomic (LRG) effort? Although the Matched Annotation from NCBI and EMBL-EBI (MANE) was released fairly recently, perhaps after this article was submitted, it deserves a mention too.
-More flow diagrams need to be added to Figure 1. Choosing the Community NM, Choosing the ENST, and Choosing the final NM (Community vs Algorithmic) should also be visualized.
-More information should be provided about the comparison of the Community vs the Algorithmic NMs. How often did they match?
-Only a small set of genes could not have a CART designated. Why was this the case? Did they fall in particular disease areas or was it just random? Please comment.
-One of the criticisms of LRG transcripts is their inflexibility when transcripts change. How flexible are CARTs? Is each CART stable for the genome build or could a new series of CARTs be generated before the release of a new genome build? Do CARTs always have to be generated in sets or could an update be done on a per-gene basis?
-Why were ENSTs with the same number of 5' UTRs as the CART-associated NM prioritized?
-Because the ENST was chosen second based on the CART-associated NM, how often was the canonical ENST chosen?
-Have you looked at GTEx and compared expression levels for your NM transcript choices?
Minor points:
-More general information about transcript curation efforts would be useful in the introduction.
-It would be useful to provide the list of the NM and ENST that make up each CART in case researchers would still like to take advantage of your thorough curation efforts, but would prefer to annotate using those transcripts so as not to change their pipelines.
-In the first paragraph of the introduction "By contrast, historical gene analysis methods, such as Sanger sequencing, can choose which sequences to use for variant annotation." is awkwardly worded and should be re-written.
-On page 3, second column, 3rd full paragraph, there is a typo: "The CARTs aim to provide standard, interoperable, stable gene templates for variant annotation that are based on the reference genome sequence, include the required structural information, and can be used either individually or as a set."
Is the rationale for creating the dataset(s) clearly described?