The UCSC Genome Browser database: 2023 update

Abstract The UCSC Genome Browser (https://genome.ucsc.edu) is an omics data consolidator, graphical viewer, and general bioinformatics resource that continues to serve the community as it enters its 23rd year. This year has seen an emphasis on clinical data, with new tracks and an expanded Recommended Track Sets feature on hg38 as well as the addition of a single cell track group. SARS-CoV-2 remains a focus, with regular annotation updates to the browser and continued curation of our phylogenetic sequence placing tool, hgPhyloPlace, whose tree has now reached over 12M sequences. Our GenArk resource has also grown, offering over 2500 hubs and a system for users to request any absent assemblies. We have expanded our bigBarChart display type and created new ways to visualize data via bigRmsk and dynseq display. Displaying custom annotations is now easier due to our chromAlias system, which eliminates the requirement for renaming sequence names to the UCSC standard. Users involved in data generation may also be interested in our new tools and trackDb settings, which facilitate the creation and display of their custom annotations.


INTRODUCTION
The University of California Santa Cruz (UCSC) Genome Browser (1) is an online resource for the genomics community providing data access and visualization, collaboration and support resources, and a suite of tools that are now standard in the field. With ever-increasing amounts of data being generated every year, tools like the UCSC Genome Browser and other browsers (2-6) play an increasingly important role in analysis and interpretation. Our resource serves over 1.4 million users per year across its primary site as well as its European and Asian mirrors. We also maintain near 100% uptime and update our software on a three-week cycle.
With regard to data access and visualization, we offer over 6000 tracks on the two latest human GRCh assemblies alone, GRCh38/hg38 and GRCh37/hg19. There are also over 200 assemblies available on the Genome Browser and over 2000 if GenArk (7) is included. We support over 30 data formats such as bed/bigBed, wig/bigWig (8), VCF (9) and GTF/GFF. This not only allows users to display their own annotations, but also to visualize data from a large number of sources in a single location. Nearly all data is available for extraction via bulk download, public MySQL server, RESTful API (10) or the Table Browser (11).
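As a minimal sketch of programmatic access through the RESTful API, the following Python snippet builds a track-data request URL. It assumes the public endpoint https://api.genome.ucsc.edu/getData/track with semicolon-separated parameters and a 0-based start with a 1-based end; the helper names `track_data_url` and `fetch_json`, and the choice of the knownGene track on chrM, are illustrative only.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed public API host; mirrors may expose a different base URL.
API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(genome, track, chrom, start, end):
    """Build a /getData/track URL for one region (0-based start, 1-based end)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    # Parameters are joined with semicolons, as in the API's documented examples.
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params)
    return f"{API_BASE}/getData/track?{query}"


def fetch_json(url, timeout=30):
    """Download and decode a JSON response (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Illustrative region: gene annotations on the hg38 mitochondrial sequence.
url = track_data_url("hg38", "knownGene", "chrM", 0, 16569)
print(url)
```

Calling `fetch_json(url)` would retrieve the annotations as a dictionary; it is kept separate so that the URL can be inspected or logged before any request is made.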
We also provide tools to facilitate scientific collaboration as well as support for the community. Immutable snapshots of annotations and locations can be shared via the sessions feature (My Data → My Sessions), custom data can be shared as custom tracks (My Data → Custom Tracks) and hubs (My Data → Track hubs), and user-generated hubs can be shared with the wider community by means of the Public Hub list. We also respond to over 600 mailing list questions per year, assisting users with topics such as how best to display their data, troubleshooting our tools, and generating chain files for lifting between assemblies.
Lastly, we provide and support many other tools and utilities. Some of the most popular tools not yet mentioned are BLAT (12) for placing sequences, In-Silico PCR for identifying PCR primers, and LiftOver, which provides a web interface for converting genomic coordinates between assemblies. Our hundreds of utilities (https://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads) can also be downloaded. These include file format creation tools such as bedToBigBed, command line versions of our web tools such as liftOver, and other resources. For users who may have sensitive data or poor connections, we offer various ways to mirror our software locally (Mirrors → Mirroring Instructions). For more information on what the Genome Browser has to offer, visit our training page (https://genome.ucsc.edu/training).

NEW AND UPDATED ANNOTATIONS
Over the last year we have added or updated over 50 annotation tracks on existing assemblies, added a new single-cell annotation group to hg38, made seven new or updated Public Hubs available, and added over 900 new assembly hubs via GenArk. We have also created hundreds of liftOver files, which allow coordinate lifting between assemblies and are all available on our download server. These include 36 files directly requested by users on our mailing list.

New clinical data
Twelve new tracks have been added to human assemblies in support of variant interpretation and clinical genomics. Some notable examples include DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) (13), which aggregates variant information from various sources, added to hg38; Orphanet (14), which provides comprehensive datasets related to rare diseases and orphan drugs from the Orphanet knowledge base; GenCC (The Gene Curation Coalition) (15), which aims to collect and standardize gene-disease validity annotations across various submitters; and dbSNP155 (16), which is the latest NCBI dbSNP release with over one billion variants. We have also continued to update our Microarray Probesets tracks, which now contain the positions of probes and targets of over 50 NGS arrays. A new Constraint Scores track is also available which hosts various mutation constraint annotations from different data providers. For a complete list of new and updated clinical tracks see Supplementary Table S1.
In order to better introduce the clinical resources available on hg38, we have expanded our Recommended Track Sets (17) (https://genome.ucsc.edu/goldenPath/newsarch.html#022222) feature to hg38 (Figure 1). As on hg19, this feature contains 4 sets of curated track configurations for different clinical applications.

New single-cell track group on hg38
We have added a new single-cell RNA-seq (scRNA-seq) track group to hg38 (Figure 2). It currently contains 14 scRNA-seq tracks, originally wrangled into our Cell Browser (https://cells.ucsc.edu) (18), covering major organs of the body, with each track comprising 2-19 individual mRNA expression tracks in barChart format. There are also two aggregate tracks: Tabula Sapiens (19), which contains data from the Tabula Sapiens Consortium providing an atlas of nearly 500 000 cells from 24 organs of 15 normal humans, and a Merged Cells track, which is an aggregate track created by the Genome Browser containing data from 12 papers covering 14 organs. A complete list of new single-cell tracks is available in Supplementary Table S2.

Gene set updates
This year we have added or updated 15 gene annotations for human and mouse. We continue to provide the latest GENCODE gene models (20), currently v41, which are always available on hg38, hg19 and mm39. We also archive these releases for reproducibility, having added v38-v41 during this period. The NCBI RefSeq gene models (21) on hg38 and hg19 have also been updated to NCBI releases 109.20211119 and 105.20220307, respectively. We have also added the 1.0 release of the Matched Annotation from NCBI and EMBL-EBI (MANE) project (22), which provides a set of high-confidence transcripts that are identically annotated between RefSeq and Ensembl/GENCODE. Lastly, we have updated select tables (kgXref, kgAlias) and our search files for the default hg19 gene annotation track UCSC Genes (knownGene) (23) so that new and updated gene symbols can be found.

Other new tracks
In addition to clinical, single-cell and gene tracks, we have added 8 new tracks to our vertebrate assemblies. These include a European Variation Archive (EVA) (24) track corresponding to EVA release 3 on 14 assemblies including mm39, providing novel variant data on these Browsers. There is also now a 241-way Cactus (25,26) comparative genomics alignment track on hg38 generated by the Zoonomia Project (27), which is the largest conservation track in the Genome Browser. We have also added various regulatory tracks and additional annotations from the GTEx Consortium (28). See Supplementary Tables S2-S4 for a full list of tracks and assemblies. We also continue to run our pipelines which automatically update annotations on 13 tracks, which can be seen in Supplementary Table S5.

SARS-CoV-2 genome browser updates
We continue to regularly update our SARS-CoV-2 assembly data, adding or updating 14 tracks over the last year. Among these tracks is our Variants of Concern (VOC) track, which we continue to update with the latest WHO-designated variants of concern. For a full list of updated tracks see Supplementary Table S4.
Curation has also been ongoing on the growing phylogenetic tree which supports our tool for placing SARS-CoV-2 sequences using UShER (29), hgPhyloPlace (https://genome.ucsc.edu/cgi-bin/hgPhyloPlace). The tree now contains over 12 million sequences, with updates occurring daily. A minimized version of the tree is included in the pangolin tool (30), used by public health departments worldwide to assign lineages to new sequences. The full tree including GISAID sequences cannot be redistributed due to GISAID restrictions (www.gisaid.org), but we offer download files for a public sequence tree with over 6 million sequences (https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/) (31). We have also recently expanded hgPhyloPlace for use with monkeypox (RefSeq NC_063383.1).

New hubs
This year we have added 7 new 'Public hubs', which are externally hosted and maintained annotations available to our users via the Track Data Hubs page (https://genome.ucsc.edu/cgi-bin/hgHubConnect). We continue to accept submissions from users looking to promote and share their data. These new hubs include the 2022 update of the popular ReMap Regulatory Atlas hub (32), which contains transcriptional regulator annotations on 6 model genomes, and a 605 species Mammal and Bird alignment using the Cactus aligner. For a full list see Supplementary Table S6.

NEW ASSEMBLY DATA
Over the last year we have updated the official patch sequences from the Genome Reference Consortium (GRC) for hg38 and mm10. The GRCh38/hg38 assembly has been updated to patch 13, and GRCm38/mm10 has been updated to patch 6. These updates contain both fix sequences and alternate haplotypes.

Genome Archive (GenArk)
With the continued drop in sequencing cost and increase in assembly quality, we have expanded the resources spent on rapid creation of browsers via assembly hubs based on GenBank (33) assembly accessions. This collection of in-house generated hubs, referred to as Genome Archive (GenArk; https://hgdownload.soe.ucsc.edu/hubs/), currently contains 2589 hubs. Over the last year alone we have added 904 new NCBI/VGP assemblies. There is now also a viral genomes category (https://hgdownload.soe.ucsc.edu/hubs/viral/index.html) containing 257 viral assemblies ready for display. In response to user demand, we have created an assembly request page (https://genome.ucsc.edu/assemblyRequest.html). This page allows users to search for most GenBank assemblies, currently containing 15 018 eligible browser candidates, and request a browser be created if one does not already exist. New browsers are typically ready in less than a week. For more information on GenArk, see our detailed four-part blog series on the topic (https://genome-blog.soe.ucsc.edu/blog/2021/11/23/genark-hubs-part-1/).
In anticipation of many high-quality genomes becoming available in the near future, T2T CHM13 v2.0 was the first human assembly to be elevated from hub to curated hub. Curated hubs, while still hubs, have all the support of native assemblies such as easier discovery and track search, API support, and the ability to add custom annotations without first having to connect to the hub. With this change to curated hub the assembly name was changed to Homo sapiens 1 (hs1). T2T CHM13 v2.0 (hs1) can be accessed directly from the Genomes dropdown menu. It is worth noting that, to users, curated hubs are functionally identical to native assemblies (e.g. hg19, hg38), and that in line with our reproducibility practice, all previous hubs and hub data will continue to exist.

NEW GENOME BROWSER SOFTWARE
Over the last year we have expanded functionality of the Genome Browser with small additions as well as new and updated displays and settings. By user request, we have added a comma-separated values option to the Table Browser output. The new setting can be toggled on the 'output field separator' tab and facilitates data download for use in other software such as Excel. It is now also possible to include Public Hub tracks in the Track Search (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt_tSearch=track+search) results by toggling on the feature in the advanced options (Figure 3).

New displays
bigBarChart. Two new settings have been added to the bigBarChart (https://genome.ucsc.edu/goldenPath/help/barChart.html) format to allow for additional customization of how the bars display: barChartBarMinWidth and barChartBarMinPadding. There is also a new feature for bigBarChart tracks that enables a facet display (Figure 4) in the item details page and track configuration page (https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chrX&g=tabulaSapiensTissueCellType). These facets allow for visualization and grouping of complex and expansive data, such as single cell data, into various categories and granularities based on associated metadata. The facets are enabled by adding the new trackDb settings barChartFacets and barChartStatsUrl (https://genome.ucsc.edu/goldenPath/help/barChart.html#example6).
bigRmsk. The bigRmsk track type (https://genome.ucsc.edu/goldenPath/help/bigRmsk.html) has been added for displaying repeat annotations generated by the RepeatMasker program. The display is optimized for showing repeat types, automatically changing based on the window size. The track includes item coloring based on the classification of the repeat, and the Full mode includes additional details such as length of unaligned repeat model sequence and context for where a repeat fragment originates (Figure 5A).
dynseq display. We have added support for the dynseq display (35) developed by the Kundaje lab (https://kundajelab.github.io/dynseq-pages/). This display scales the height of each nucleotide letter based on the signal value within a bigWig track (https://genome.ucsc.edu/goldenPath/help/bigWig.html#Ex4; Figure 5B).
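The idea behind the dynseq display can be illustrated with a short Python sketch that pairs each base with a letter height proportional to its signal value. This is not the Genome Browser's rendering code: the function name `dynseq_letter_heights`, the normalization by the largest absolute signal in the window, and the use of negative heights for negative signal are all assumptions made for illustration.

```python
def dynseq_letter_heights(sequence, signal, max_height=1.0):
    """Return (base, height) pairs with height proportional to the signal.

    Heights are normalized by the largest absolute signal value in the
    window (an assumption of this sketch), so the strongest position is
    drawn at max_height. Negative signal yields a negative height, which a
    renderer could draw below the baseline.
    """
    if len(sequence) != len(signal):
        raise ValueError("sequence and signal must have the same length")
    peak = max((abs(value) for value in signal), default=0.0)
    if peak == 0:
        # No signal anywhere in the window: every letter is flat.
        return [(base, 0.0) for base in sequence]
    return [
        (base, max_height * value / peak)
        for base, value in zip(sequence, signal)
    ]


# Illustrative window of four bases with per-base signal values.
heights = dynseq_letter_heights("ACGT", [0.5, 2.0, -1.0, 0.0])
for base, height in heights:
    print(base, height)
```

In this toy window the C, which carries the strongest signal, is drawn at full height, while the G is drawn at half height below the baseline.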

New hub features and TrackDb statements
In order to facilitate custom annotations and user content, we have expanded custom track as well as hub support and added 15 new trackDb settings with various functions (Table 1). When creating hub tracks for a genome that is included in GenArk, you can now designate the GCA/GCF identifier and the Genome Browser will automatically attach the matching GenArk assembly hub genome and display the data on it (https://genome.ucsc.edu/FAQ/FAQlink.html#genArkTrackHub). This harmonizes the system to function like native assemblies, such as hg19 and hg38, and removes the requirement of a multi-line genome stanza. Another new feature that builds upon hub annotations which are designated by the bigDataUrl setting is access to the extended case/color options. This means that when browsing the tracks display while displaying hub data, you can go to View → DNA in the top blue bar menu and select the 'extended case/color options' button. In that page you will be able to modify the DNA sequence in the window in various ways depending on the data tracks which are currently being displayed, such as adding a specific color for any part of the sequence covered by the annotations.
chromAlias. The chromAlias system provides an index of corresponding sequence names across different groups and consortiums. An example would be how UCSC names chromosomes with the 'chr' prefix while other groups such as Ensembl (36) list only the number: 'chr2' in UCSC corresponds to '2' in Ensembl. In the past users would have to modify the sequence names if they did not adhere to the UCSC convention, but that is no longer the case. chromAlias associations have been built for all native Genome Browsers as well as GenArk assemblies, looking for corresponding matches in GenBank, Ensembl, and RefSeq when available. Now, when custom annotations are attached and the sequence names do not match UCSC's, the chromAlias table is referenced and the annotations are displayed if a match is found. This support has also been extended to the bedToBigBed utility, which now optionally accepts a chromAlias file instead of a chrom.sizes file and will build the bigBed without any need for renaming sequences. These chromAlias files can be found on our download server, e.g. hg38 (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromAlias.bb).
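As a rough illustration of a hub that relies on a GenArk assembly, the Python sketch below renders a single-file hub in which the genome is designated only by its accession. The layout (a `useOneFile on` hub with one `genome` line followed by track stanzas) is our reading of the linked FAQ and should be checked against it; the helper name `genark_hub_text`, the placeholder accession GCF_000000000.1, and the example.org URL are hypothetical.

```python
import re

# GenBank (GCA) or RefSeq (GCF) assembly accession, e.g. GCF_000000000.1.
ACCESSION_PATTERN = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def genark_hub_text(hub_name, label, email, accession, tracks):
    """Render a one-file hub whose genome is named by a GCA/GCF accession."""
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    lines = [
        f"hub {hub_name}",
        f"shortLabel {label}",
        f"longLabel {label}",
        "useOneFile on",  # assumed: hub, genome and tracks share one file
        f"email {email}",
        "",
        # A single genome line stands in for the multi-line genome stanza.
        f"genome {accession}",
    ]
    for track in tracks:
        lines += [
            "",
            f"track {track['name']}",
            f"shortLabel {track['label']}",
            f"longLabel {track['label']}",
            f"type {track['type']}",
            f"bigDataUrl {track['url']}",
        ]
    return "\n".join(lines) + "\n"


# Placeholder values throughout; substitute a real accession and data URL.
hub_text = genark_hub_text(
    hub_name="exampleHub",
    label="Example hub",
    email="user@example.org",
    accession="GCF_000000000.1",
    tracks=[
        {
            "name": "exampleBed",
            "label": "Example bigBed",
            "type": "bigBed",
            "url": "https://example.org/example.bb",
        }
    ],
)
print(hub_text)
```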
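The lookup described above can be sketched in a few lines of Python. This is a simplified model rather than the Genome Browser implementation: the `ALIASES` table is a small hand-written excerpt for hg38 (the real associations live in the chromAlias files on the download server), and the helper names `build_alias_index` and `to_ucsc_names` are hypothetical.

```python
# Illustrative excerpt: UCSC name -> names used by Ensembl, RefSeq, GenBank.
ALIASES = {
    "chr1": {"1", "NC_000001.11", "CM000663.2"},
    "chr2": {"2", "NC_000002.12", "CM000664.2"},
    "chrM": {"MT", "NC_012920.1", "J01415.2"},
}


def build_alias_index(aliases):
    """Map every known name, including the UCSC name itself, to the UCSC name."""
    index = {}
    for ucsc_name, alternate_names in aliases.items():
        index[ucsc_name] = ucsc_name
        for alternate in alternate_names:
            index[alternate] = ucsc_name
    return index


def to_ucsc_names(bed_rows, index):
    """Rename sequences to the UCSC convention; collect rows with no match."""
    resolved, unmatched = [], []
    for chrom, start, end in bed_rows:
        ucsc_name = index.get(chrom)
        if ucsc_name is None:
            unmatched.append((chrom, start, end))
        else:
            resolved.append((ucsc_name, start, end))
    return resolved, unmatched


index = build_alias_index(ALIASES)
rows = [("2", 100, 200), ("NC_012920.1", 0, 50), ("chr1", 5, 10), ("scaffold_9", 1, 2)]
resolved, unmatched = to_ucsc_names(rows, index)
print(resolved)
print(unmatched)
```

Rows whose names have no alias are returned separately rather than dropped silently, mirroring the behavior described above in which annotations are displayed only when a match is found.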

TrackDb settings
We added 15 new trackDb settings to our Hub Track Database Definition document (https://genome.ucsc.edu/goldenPath/help/trackDb/trackDbHub.html). These include additional configurations for Hi-C track display, bigBarChart display and PSL display among others. See Table 1 for a full list and short description.

New and updated tools
We continue to add to and maintain our suite of over 300 command line tools. We provide email support through a public and a private mailing list where users can avail themselves of our expert and responsive staff. Access to the mailing lists can be found at https://genome.ucsc.edu/contacts.html, where there is also a link to an archive of previously answered questions from the public list.
In response to inquiries from our users, we released a module of content designed for use in the undergraduate classroom. This content features vignettes written by undergraduates to illustrate, using the Genome Browser, a variety of lessons in Molecular Biology, Genetics, Medicine, Population Biology and Evolution. This can be found at https://genome.ucsc.edu/training/education.

FUTURE PLANS
This coming year represents the first in our new 5-year planning cycle. A major goal during this time is evaluation and adoption of a pangenome graph data format. We will also be releasing a new site-wide search function and a track duplication feature. Work continues to expand hub support. Most new data will be created in big formats and new assemblies will be implemented as hubs instead of SQL databases (e.g. hs1). Along those lines, work will begin on a tool to facilitate hub development. Lastly, an emphasis on clinical genomics and single cell data will continue, with features such as Recommended Track Sets and the new single cell track group seeing updates throughout the year.

DATA AVAILABILITY
The UCSC Genome Browser (https://genome.ucsc.edu/) is freely available to all users. The only exceptions are the source code for the Genome Browser, Blat utility, liftOver utility and other utilities which are free for non-profit academic research and for personal use. A license is required for commercial use of these utilities or the source code.