scRepertoire: An R-based toolkit for single-cell immune receptor

Single-cell sequencing is an emerging technology in the fields of immunology and oncology that allows researchers to couple RNA quantification with other modalities, such as immune cell receptor profiling, at the level of an individual cell. A number of workflows and software packages have been created to process and analyze single-cell transcriptomic data. These packages allow users to take the vast dimensionality of the data generated in single-cell-based experiments and distill the data into novel insights. Unlike the transcriptomic field, there is a lack of software options that allow for single-cell immune receptor profiling. Enabling users to easily combine mRNA and immune profiling, scRepertoire was built to process data derived from 10x Genomics Chromium Immune Profiling for both T-cell receptor (TCR) and immunoglobulin (Ig) enrichment workflows and to subsequently interact with the popular Seurat R package. The scRepertoire R package and processed data are open source and available on GitHub, together with in-depth tutorials on the capabilities of the package.

The molecular resolution offered by single-cell sequencing (SCS) technologies has led to extensive investigations in the realms of developmental biology, oncology, and immunology. In terms of the latter field, SCS offers the ability to couple the exploration of transcriptomic heterogeneity in immune cells along a disease process with clonality 1. A number of methods for dimensional reduction of mRNA data, reviewed by Chen et al. 2, have been implemented into R packages to assist in the processing and analysis of SCS experiments. However, a gap exists in the current SCS R packages for the processing of V(D)J sequencing, descriptive statistics, clonal comparisons, and repertoire diversity.
With these limitations in mind, scRepertoire 3 was generated (Figure 1). Built using R, scRepertoire is a toolkit to assist in the analysis of immune profiles for both B and T cells, while interacting with the popular Seurat pipeline 4-6. scRepertoire also includes processed single-cell mRNA and V(D)J sequencing data of 12,911 tumor-infiltrating and peripheral-blood T cells derived from three renal clear cell carcinoma patients, which are characterized below to demonstrate the capabilities of the package.

Operation
System requirements for running scRepertoire 3 include the installation of R v3.5.1 and the Seurat R package (v3.1.2). The resources required by scRepertoire depend on the total number of single cells being processed, with a base estimate of 1 GB of random-access memory and a modern CPU.

Data
The isolation and processing of the 10x-Genomics-based single-cell mRNA and V(D)J Chromium sequencing data for immune cells have previously been described 7,8. In addition, T cells were identified using expression values for the canonical T cell markers CD3D, CD4, CD8A, and CD8B1, together with previous clustering. T cells were isolated and reclustered using the integration method from the Seurat R package (v3.1.2) with 20 principal components and a resolution of 0.5 4. All code used to generate the figures appearing in the manuscript is available at https://github.com/ncborcherding/scRepertoire.

Implementation
The scRepertoire package was built and tested in R v3.5.1. Analysis for scRepertoire was inspired by the bulk immune profiling tcR (v2.2.4) R package, without derivations in code 9. Clonotypes can be called using the combination of immune loci genes, a more sensitive approach, or the nucleotide/amino acid sequence of the complementarity-determining region 3 (CDR3). In addition to the base functions in R, data processing was performed using the dplyr (v0.8.3) and reshape2 (v1.4.3) R packages. Visualizations are generated using the ggplot2 (v3.2.1) and ggalluvial (v0.11.1) R packages, with color palettes derived from the colorRamps (v2.3) and RColorBrewer (v1.1.2) R packages. Diversity metrics are calculated using the vegan (v2.5-6) R package. Visual outputs of functions are stored as ggplot objects built from geometric or statistical layers, allowing users to easily modify presentation.
Figure 1. A general workflow for single-cell data analysis involving scRepertoire. The analysis starts with the single-cell immune and mRNA sequencing and Cell Ranger-based alignment with the 10x Genomics pipeline. With the TCR or Ig sequencing, scRepertoire can import the filtered overlapping DNA segments, or contigs. The alignments are filtered by cell type of interest and combined using the individual cell barcodes. Clonotypes can be called using the gene sequence of the immune receptor loci, CDR3 nucleotide sequence or CDR3 amino acid sequence. After clonotype assignment, more extensive clonotypic analysis can be performed at the individual sample level or across all samples. General outputs from scRepertoire can be imported into Seurat objects to visualize clonotype data overlaid onto the cell clustering. Likewise, metadata from the Seurat objects can be imported into scRepertoire to analyze clonotypes by assigned clusters.

Results
Clonal analysis
scRepertoire 3 can be used to call clonotypes using the CDR3 amino acid/nucleotide sequences, by gene usage, or by the combination of CDR3 nucleotide sequences and genes. Using the quantContig function, unique clonotypes can be visualized as raw values or scaled to the size of the library for samples or by type (Figure 2A). The total abundance of clonotypes can also be visualized by calling abundanceContig (Figure 2B), as can the relative abundance of clonotypes (Figure 2C). Additionally, the distribution of CDR3 nucleotide or amino acid sequence lengths for clonotypes can be visualized with lengthContig (Figure 2D).
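As a rough illustration of the clonotype definitions described above, the following base-R sketch builds a clonotype label per cell from gene usage, CDR3 nucleotide sequence, CDR3 amino acid sequence, or genes plus nucleotide sequence. It is not the scRepertoire implementation: the toy table, its column names, the shortened sequences, and the helper `call_clonotypes` are all assumptions made for the example.

```r
# Toy paired-chain table. Column names and the (shortened) sequences are
# illustrative assumptions, not the format used by scRepertoire.
contigs <- data.frame(
  barcode  = c("AAAC", "AAAG", "AACT", "AAGT"),
  tra_gene = c("TRAV1.TRAJ5", "TRAV1.TRAJ5", "TRAV2.TRAJ9", "TRAV1.TRAJ5"),
  trb_gene = c("TRBV7.TRBJ2", "TRBV7.TRBJ2", "TRBV5.TRBJ1", "TRBV7.TRBJ2"),
  tra_nt   = c("TGTGCA", "TGTGCC", "TGTGGG", "TGTGCA"),
  trb_nt   = c("TGCAGT", "TGCAGC", "TGCTTT", "TGCAGT"),
  tra_aa   = c("CA", "CA", "CG", "CA"),
  trb_aa   = c("CS", "CS", "CF", "CS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one clonotype label per cell, built by pasting the
# alpha- and beta-chain values for the chosen definition.
call_clonotypes <- function(df, by = c("gene", "nt", "aa", "gene+nt")) {
  by   <- match.arg(by)
  gene <- paste(df$tra_gene, df$trb_gene, sep = "_")
  nt   <- paste(df$tra_nt, df$trb_nt, sep = "_")
  aa   <- paste(df$tra_aa, df$trb_aa, sep = "_")
  switch(by,
         "gene"    = gene,
         "nt"      = nt,
         "aa"      = aa,
         "gene+nt" = paste(gene, nt, sep = "_"))
}

# Number of unique clonotypes under each definition.
n_unique <- sapply(c("gene", "nt", "aa", "gene+nt"),
                   function(m) length(unique(call_clonotypes(contigs, m))))
print(n_unique)
```

In this toy table the second cell differs from the first only by synonymous nucleotide changes, so the two cells share a clonotype under the gene and amino acid definitions but not under the nucleotide-based ones.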

Proportional analysis and diversity measures
More in-depth analysis of clonal architecture is also available. Within the framework of scRepertoire, clonal homeostasis, or the clonal space occupied by clonotypes of specific proportions, can be visualized with the clonalHomeostasis function (Figure 3A). Similarly, clonalProportion can be called to look at the proportion of clonal space occupied by specific clonotypes (Figure 3B). Overlap between the samples can be calculated and visualized with clonalOverlap, using either the overlap coefficient or Morisita index methods (Figure 3C). Measures of diversity across samples or groups can be quantified with the clonalDiversity function, demonstrating an overall reduction in clonal diversity in tumor samples (Figure 3D).
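For readers unfamiliar with the measures named above, the following base-R sketch computes the overlap coefficient, Morisita's overlap index, and a Shannon diversity index on toy clonotype vectors. It is only an illustration under stated assumptions: the function names and sample labels are hypothetical, the formulas are the standard textbook definitions, and the code does not reproduce the internals of clonalOverlap or clonalDiversity (which, per the Implementation section, relies on vegan for diversity metrics).

```r
# Overlap coefficient: shared unique clonotypes divided by the size of the
# smaller unique set.
overlap_coefficient <- function(a, b) {
  ua <- unique(a)
  ub <- unique(b)
  length(intersect(ua, ub)) / min(length(ua), length(ub))
}

# Morisita's overlap index on clonotype counts:
#   2 * sum(x * y) / ((Dx + Dy) * X * Y),
#   with Dx = sum(x * (x - 1)) / (X * (X - 1)).
# Returns NaN when neither sample contains an expanded clonotype.
morisita_index <- function(a, b) {
  clones <- union(a, b)
  x <- as.numeric(table(factor(a, levels = clones)))
  y <- as.numeric(table(factor(b, levels = clones)))
  X <- sum(x)
  Y <- sum(y)
  dx <- sum(x * (x - 1)) / (X * (X - 1))
  dy <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((dx + dy) * X * Y)
}

# Shannon diversity (natural log) of clonotype frequencies.
shannon <- function(a) {
  p <- as.numeric(table(a)) / length(a)
  -sum(p * log(p))
}

# Toy data: one clonotype label per cell (labels are hypothetical).
samples <- list(
  P1_blood = c("c1", "c2", "c3", "c4"),
  P1_tumor = c("c1", "c1", "c1", "c2")
)

# Pairwise overlap returned as a plain matrix rather than only as a plot.
overlap_matrix <- sapply(samples, function(b)
  sapply(samples, function(a) overlap_coefficient(a, b)))
diversity <- sapply(samples, shannon)

print(overlap_matrix)
print(diversity)
```

In the toy data the tumor sample is dominated by a single expanded clonotype, so its Shannon index is lower than that of the blood sample, in which every cell carries a distinct clonotype.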

Seurat interaction
After the processing and analysis of the TCR repertoire with the base features, the next step is using scRepertoire to interact with the single-cell mRNA data. The expression data for the 12,911 cells built into the package have already been clustered (Figure 4A), with a clear distribution of the clusters into peripheral-blood- versus tumor-predominant (Figure 4B). Using the combineSeurat function in scRepertoire, we can look at the clonotypic frequencies of cells that comprise the UMAP-based clusters (Figure 4C), with notable expansion in the C2, C3, and C6 clusters (Figure 4D). The C7 and C8 clusters also have a relatively high frequency. In addition to clonal distribution, we can also use highlightClonotypes to set specific sequences of clonotypes to be visualized (Figure 4E), with clonotype 1 referring to the amino acid sequence "CAVNGGSQGNLIF_CSAEREDTDTQYF" and clonotype 2 to the amino acid sequence "NA_CATSATLRVVAEKLFF". Interestingly, clonotype 2 is restricted to a subcluster of the C6 cluster (Figure 4E). After combining both the clonotype and expression data, interaction between categories, such as cluster label and clonotype frequency, can be visualized with the alluvialGraph function.

Conclusions
scRepertoire 3 is an R-based toolkit for the analysis of single-cell immune receptor profiling. The package is able to take the annotated filtered outputs from the 10x Genomics Cell Ranger platform and provide analysis across a number of modalities, including calling clonotypes, clonal space/homeostasis, clonal diversity, and repertoire overlap between samples. Outputs from scRepertoire can be combined with dimensional reduction strategies for single-cell RNA quantifications, allowing users to analyze mRNA and immune profiles together. Under the Creative Commons v4.0 license, the scRepertoire package is freely available from the GitHub repository and is extensively annotated to assist in implementation and modification.
Folder 'Data' contains all data required to run the vignettes described in the Results. This is also available on GitHub.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability
Source code is available from GitHub: https://github.com/ncborcherding/scRepertoire.
Currently the package includes all rendered figures in the vignettes folder. I would recommend removing these files, and also removing the `ggsave` function calls within the vignette to prevent these files being written while compiling the vignette. This will reduce the size of the package.
Several functions in the package use scoping assignment to assign variables in the global environment. This is considered bad practice and should be avoided in all cases.
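As a minimal sketch of the point about scoping assignment, the contrast below uses two hypothetical helpers (neither is taken from the scRepertoire source): one writes its result outside its own environment with `<<-`, the other simply returns it.

```r
# Discouraged pattern: `<<-` creates or overwrites `clone_counts` outside the
# function, so calling it can silently clobber a user's own variable.
# (Defined for contrast only; it is not called here.)
count_clones_global <- function(clonotypes) {
  clone_counts <<- table(clonotypes)
  invisible(NULL)
}

# Preferred pattern: return the result and let the caller decide where to
# store it.
count_clones <- function(clonotypes) {
  table(clonotypes)
}

clone_counts <- count_clones(c("c1", "c1", "c2"))
print(clone_counts)
```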
Extra files such as .DS_Store should be removed from the git repository and from the package. Use the .gitignore and .Rbuildignore files for this.
Avoid importing code within R functions, for example `require(ggplot2)` calls. Instead, document the dependencies using roxygen2, for example `@importFrom ggplot2 ggplot`.
To access data in a Seurat object, I highly recommend using the functions defined in Seurat for this purpose rather than accessing the slots directly. For example, use obj[[]] to access metadata rather than obj@meta.data and Idents(obj) rather than obj@active.ident.
In general the documentation of functions can be greatly improved. Try to include a text description of each function, a detailed description of the parameters, document the returned values, and include an executable example.
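To make the documentation and import suggestions concrete, here is a sketch of a roxygen2-documented function. The function `count_clonotypes` is hypothetical and written only for this illustration; the roxygen2 tags (`@param`, `@return`, `@examples`, `@importFrom`, `@export`) are standard, and `stats::setNames` is used so that the example runs in base R.

```r
#' Count cells per clonotype
#'
#' Tabulates how many cells carry each clonotype label.
#'
#' @param clonotypes Character vector with one clonotype label per cell.
#' @param scale Logical; if TRUE, return proportions instead of raw counts.
#'
#' @return A named numeric vector, sorted from most to least abundant.
#'
#' @examples
#' count_clonotypes(c("c1", "c1", "c2"))
#' count_clonotypes(c("c1", "c1", "c2"), scale = TRUE)
#'
#' @importFrom stats setNames
#' @export
count_clonotypes <- function(clonotypes, scale = FALSE) {
  counts <- sort(table(clonotypes), decreasing = TRUE)
  out <- setNames(as.numeric(counts), names(counts))
  if (scale) out / sum(out) else out
}

print(count_clonotypes(c("c1", "c1", "c2")))
```

With the dependency declared through `@importFrom`, no `require()` call is needed inside the function body.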
It is generally not advisable to overwrite functions in base R or other packages with variable names, for example the `call` variable in `clonalDiversity` overwrites the base R `call` function.
Replace code like `class(df)[1] == "Seurat"` with `inherits(x = df, what = "Seurat")`.
In plotting functions such as clonalOverlap, consider returning the ggplot object rather than printing the object. For example, replace `suppressWarnings(print(plot))` with `return(plot)`. This will allow users to modify the plot that is generated.
Some code sections are duplicated, for example L91:104 and L131:144 in seuratFunctions.R. Consider putting duplicated code into functions.
Imported functions should be added to the namespace. Documenting the imports using roxygen2 (as has been done for parameters and exports) will take care of this.
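To illustrate why `inherits()` is preferable to comparing the first element of `class()`, the sketch below uses toy S3 classes as stand-ins. `ToySeurat` and `ToySeuratSubclass` are assumptions made for the example and have no relation to the real Seurat class.

```r
# Toy objects: a base class and a subclass that extends it.
obj     <- structure(list(), class = "ToySeurat")
sub_obj <- structure(list(), class = c("ToySeuratSubclass", "ToySeurat"))

# Comparing the first class element misses the subclass.
first_class_check <- class(sub_obj)[1] == "ToySeurat"

# inherits() checks the whole class vector, so the subclass is accepted.
inherits_check <- inherits(x = sub_obj, what = "ToySeurat")

print(c(first_class = first_class_check, inherits = inherits_check))
```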
The highlightClonotypes function is somewhat redundant with existing functions in Seurat, i.e., the DimPlot function with the cells.highlight parameter.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
like the Recon package by Kaplinsky and Arnaout or Startrac package by Zhang and colleagues. Interaction with such packages would greatly increase scRepertoire's analytical effectiveness.
Similarly, the function abundanceContig(), especially in the unscaled form, is of little use per se. Interestingly, methods have recently been proposed for the comparative analysis of clone size distributions and could be easily incorporated into scRepertoire, adding considerable power to it (though an assessment of the required numerosity should be introduced).
The package claims to be designed for both TCR and BCR analysis, but the definition of clonality in B cells is slightly different from that in T cells due to isotype switching and somatic hypermutation following activation. Therefore, clonotype identity between two cells should be defined differently in BCR and TCR analysis.
Concerning overlap, besides the nice representation as a heatmap, it would be useful to be able to output the matrix itself rather than only the plot.
The 10x V(D)J method occasionally fails to reconstruct complete clonotypes, or it reconstructs putatively aberrant clonotypes (clonotypes with multiple beta chains). Currently scRepertoire does not allow filtering for specific chain compositions, but such a feature would be worth adding, together with a graphical visualization of the relative frequencies of chain compositions across clonotypes.
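One way such a chain-composition filter could look is sketched below in base R. This is an assumption-laden illustration rather than a proposal for the exact interface: the toy contig table, its column names, and the composition labels (for example "1TRA_1TRB") are all hypothetical.

```r
# Toy contig table: one row per reconstructed chain (hypothetical columns).
contigs <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAG", "AAAG", "AAAG", "AACT"),
  chain   = c("TRA",  "TRB",  "TRA",  "TRB",  "TRB",  "TRB"),
  stringsAsFactors = FALSE
)

# Count alpha and beta chains per cell barcode.
chain_counts <- table(contigs$barcode,
                      factor(contigs$chain, levels = c("TRA", "TRB")))

# Label each barcode with its chain composition, e.g. "1TRA_2TRB".
composition <- paste0(chain_counts[, "TRA"], "TRA_",
                      chain_counts[, "TRB"], "TRB")
names(composition) <- rownames(chain_counts)

# Relative frequency of each composition, suitable for a bar plot.
composition_freq <- prop.table(table(composition))

# Keep only cells with exactly one alpha and one beta chain.
keep     <- names(composition)[composition == "1TRA_1TRB"]
filtered <- contigs[contigs$barcode %in% keep, ]

print(composition_freq)
print(filtered)
```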
Paired 10x gene expression and V(D)J analyses are not guaranteed to reconstruct the information for the exact same pool of barcodes. The combineSeurat() function could therefore be improved by allowing the user to specify whether an inner or Seurat-sided join is to be performed, and by ensuring that the join is performed correctly.
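The difference between the two joining strategies can be shown with base R's `merge()`. The two toy tables below are assumptions standing in for Seurat metadata and V(D)J-derived clonotype calls, and the snippet does not reflect how combineSeurat() is currently implemented.

```r
# Toy expression-side metadata and V(D)J-side clonotype calls; only two of
# the three barcodes in each table are shared.
expr_meta <- data.frame(
  barcode = c("AAAC", "AAAG", "AACT"),
  cluster = c("C1", "C2", "C2"),
  stringsAsFactors = FALSE
)
vdj <- data.frame(
  barcode   = c("AAAC", "AACT", "AAGT"),
  clonotype = c("clone1", "clone2", "clone3"),
  stringsAsFactors = FALSE
)

# Guard against duplicated barcodes, which would multiply rows in the join.
stopifnot(!anyDuplicated(expr_meta$barcode), !anyDuplicated(vdj$barcode))

# Inner join: keep only barcodes present in both tables.
inner <- merge(expr_meta, vdj, by = "barcode")

# Expression-sided (left) join: keep every cell from the expression data and
# fill missing clonotypes with NA.
left <- merge(expr_meta, vdj, by = "barcode", all.x = TRUE)

print(inner)
print(left)
```

The inner join drops the cell without a clonotype call, whereas the expression-sided join keeps it with an NA clonotype, so the result still lines up with every cell in the expression object.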

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Integrative Biology, Cancer Immunology
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.