Vaeda computationally annotates doublets in single-cell RNA sequencing data

Abstract

Motivation: Single-cell RNA sequencing (scRNA-seq) continues to expand our knowledge by facilitating the study of transcriptional heterogeneity at the level of single cells. Despite this technology's utility and success in biomedical research, technical artifacts are present in scRNA-seq data. Doublets/multiplets are a type of artifact that occurs when two or more cells are tagged by the same barcode and therefore appear as a single cell. Because this introduces non-existent transcriptional profiles, doublets can bias and mislead downstream analysis. To address this limitation, computational methods to annotate and remove doublets from scRNA-seq datasets are needed.

Results: We introduce vaeda (Variational Auto-Encoder for Doublet Annotation), a new approach for computational annotation of doublets in scRNA-seq data. Vaeda integrates a variational auto-encoder and Positive-Unlabeled learning to produce doublet scores and binary doublet calls. We apply vaeda, along with seven existing doublet annotation methods, to 16 benchmark datasets and find that vaeda performs competitively in terms of doublet scores and doublet calls. Notably, vaeda outperforms other python-based methods for doublet annotation. Altogether, vaeda is a robust and competitive method for scRNA-seq doublet annotation and may be of particular interest in the context of python-based workflows.

Availability and implementation: Vaeda is available at https://github.com/kostkalab/vaeda, and the version used for the results we present here is archived at zenodo (https://doi.org/10.5281/zenodo.7199783).

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Single-cell RNA sequencing (scRNA-seq) continues to impact our understanding of diverse biomedical domains by providing high-resolution gene expression measurements at scale. Resulting datasets often comprise many thousands of cells or more; while they provide valuable insights, they are also limited by technical artifacts, like doublets/multiplets. Doublets or multiplets occur when two or more cells receive the same identifier during library construction and thus appear as one single cell. As such, doublets introduce non-existent expression profiles, which can lead to incorrect interpretation of data in downstream analysis. While there are experimental techniques (McGinnis et al., 2019b; Stoeckius et al., 2018; Zheng et al., 2017) that identify and annotate doublets, these methods are currently not typically employed (reasons include an increased experimental burden), and they are not available for pre-existing datasets.
Therefore, computational methods to identify doublets are needed. Current methods that address this challenge include the R packages scDblFinder (Germain et al., 2021) and doubletFinder (McGinnis et al., 2019a), the cxds, bcds, and hybrid methods of the scds approach (Bais and Kostka, 2020), the python package scrublet (Wolock et al., 2019), and solo (Bernstein et al., 2020), which is used from the command line. Many of these methods share core concepts in their approach, for example generating artificial doublets from observed data, which then form the basis of deriving doublet scores for computational doublet detection. Here, we propose vaeda (Variational Auto-Encoder for Doublet Annotation), a new python-based computational approach with a similar paradigm to existing methods; vaeda combines variational autoencoders (VAEs) and Positive-Unlabeled Learning [PU-Learning; Liu et al. (2003); Mordelet and Vert (2014)] to annotate doublets in scRNA-seq data. Similar to solo, vaeda uses a VAE to derive a low-dimensional representation of the input data. PU-Learning, on the other hand, is a learning framework designed for instances where there is a set of positively labeled examples and a set of unlabeled examples. In the context of doublet detection, the unlabeled examples are the input data, while the labeled examples are artificially generated doublets. Therefore, PU-Learning appears well-suited for doublet detection; however, to our knowledge, while PU-Learning has been used to identify cell-free droplets in scRNA-seq data by Yan et al. (2021), it has not been used to identify doublets.
In the following, we present the vaeda method in detail; we also apply it to 16 benchmark datasets with experimental doublet annotation (Xi and Li, 2021a) and show that vaeda can accurately annotate doublets in scRNA-seq data, performing well when compared to seven existing methods. In particular, vaeda integrates seamlessly into the scanpy (Wolf et al., 2018) workflow and drastically outperforms scrublet, the only other python-based doublet detection method. Furthermore, we assess robustness of performance for all methods we study; overall, we observe meaningful differences in how well the methods' doublet scores correlate with experimental doublet annotations. Nevertheless, we also find that performance differences in binary doublet calls are less pronounced. This is of particular interest because, in practice, binary doublet calls are what enable researchers to remove artifacts and improve their data quality.

Vaeda method for doublet annotation
The vaeda method for doublet annotation consists of several steps summarized in Figure 1. In the following, we describe each step in more detail.

Input data, doublet simulation and gene selection
The input data for vaeda is a raw, un-normalized count matrix X with n rows (one per cell barcode, encompassing singlets and doublets/multiplets) and p columns (one per gene). These data are then used to simulate artificial doublets as follows: An index pair (i, j) is sampled randomly and a doublet precursor x̃_d is created by adding the corresponding rows of X: x̃_d = x_i + x_j. Index pairs are retained for 'homotypic' simulated doublet exclusion after clustering. Next, to generate an artificial doublet x_d, its library size ℓ (i.e. the number of counts) is determined as a random sample of all library sizes in X (i.e. the row sums of X) that are at least as large as the larger library size of x_i and x_j. Then, x_d is determined as x_d = (ℓ · x̃_d) / Σ_k x̃_{d,k}. With this approach, we create a count matrix X′ of n simulated doublets, and vaeda proceeds with the augmented matrix X obtained by stacking X and X′ (and scaling and centering columns), and associated labels y indicating simulated doublets. Finally, genes (i.e. columns) of X are selected by: (a) removing columns (i.e. genes) that have zero values for more than 99% of rows (i.e. cells) and then (b) focusing on the p most variable columns (we choose p = 2000), where variability is measured by variance. Overall, this procedure yields an initial augmented expression matrix X ∈ R^(2n×p).
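As a concrete illustration, the simulation step above can be sketched in a few lines of numpy. The function name and signature here are hypothetical (this is not vaeda's actual API), but the logic follows the description: sum a random pair of cells, then rescale to a library size sampled from observed library sizes at least as large as the larger parent's.

```python
import numpy as np

def simulate_doublets(X, rng=None):
    """Sketch of vaeda-style doublet simulation (name/signature hypothetical).

    X : (n, p) raw count matrix (cells x genes).
    Returns an (n, p) matrix of simulated doublets plus the parent index
    pairs, which are kept for later homotypic-doublet exclusion.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    lib_sizes = X.sum(axis=1)                      # library size per cell
    pairs = rng.integers(0, n, size=(n, 2))        # random index pairs (i, j)
    precursors = X[pairs[:, 0]] + X[pairs[:, 1]]   # precursor x~_d = x_i + x_j
    doublets = np.empty_like(precursors, dtype=float)
    for d, (i, j) in enumerate(pairs):
        # sample a target library size among observed sizes that are at
        # least as large as the larger of the two parents' library sizes
        eligible = lib_sizes[lib_sizes >= max(lib_sizes[i], lib_sizes[j])]
        ell = rng.choice(eligible)
        doublets[d] = ell * precursors[d] / precursors[d].sum()
    return doublets, pairs
```

In practice the resulting matrix X′ would then be stacked under X, with labels y marking the simulated rows, before gene selection.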
2.1.2 Adjusting the number of simulated doublets
Next, to adjust the number of simulated doublets in the augmented count matrix, vaeda uses the following approach. First, an initial estimate for the number of doublets present in the input data is derived. To that end, the augmented data X is projected on its first 30 principal components. Next, a k nearest neighbor (knn) graph is constructed, where we set k = √(2n). Then, for each cell the fraction of its nearest neighbors that are simulated doublets is calculated as a (preliminary) doublet score {s_i}_{i=1}^{2n}. A score cutoff c is determined as the 25% quantile of the simulated doublets' scores. Finally, the number of doublets n̂_d in the original data is estimated as n̂_d = Σ_i 1[(1 − y_i) s_i ≥ c], the number of input cells with scores equal to or larger than the cutoff.
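A minimal numpy sketch of this preliminary scoring step (brute-force nearest neighbors on a small projection; function and variable names are hypothetical, not vaeda's API):

```python
import numpy as np

def preliminary_doublet_scores(P, y, k=None):
    """Sketch of the preliminary knn doublet score and doublet-count estimate.

    P : (2n, d) projection of the augmented data (n input cells stacked
        with n simulated doublets; vaeda uses the first 30 PCs).
    y : (2n,) labels, 1 for simulated doublets, 0 for input cells.
    Returns per-cell scores s and the estimated doublet count n_hat.
    """
    m = P.shape[0]
    if k is None:
        k = int(np.sqrt(m))                      # k = sqrt(2n)
    # brute-force knn via pairwise euclidean distances (fine for a sketch)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-neighbors
    nn = np.argsort(D, axis=1)[:, :k]
    s = y[nn].mean(axis=1)                       # fraction of simulated neighbors
    c = np.quantile(s[y == 1], 0.25)             # cutoff: 25% quantile of doublet scores
    n_hat = int(np.sum((y == 0) & (s >= c)))     # input cells scoring >= cutoff
    return s, n_hat
```

On well-separated toy data, input cells far from simulated doublets score near zero and do not contribute to n̂_d.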
Second, n̂_d ≤ n cells are randomly selected from the simulated doublets by sampling without replacement, where the sampling probability for doublet k is given by p_k = s_k / Σ_{j=n+1}^{2n} s_j. This results in X ∈ R^((n+n̂_d)×p) and y consisting of n zeros and n̂_d ones.
2.1.3 Low-dimensional representation by cluster-aware variational auto-encoding
We derive a low-dimensional representation of the augmented data using a cluster-aware autoencoder; this representation will then in turn form the basis of doublet annotation.
Clustering: First, we group cells in the augmented dataset X into clusters, using the Leiden algorithm (Traag et al., 2019). Specifically, the rows of X are centered and scaled to unit variance. After scaling, X is then projected onto its first 30 principal components. Next, for small datasets (1000 cells or less), a neighborhood graph (McInnes et al., 2018; Wolf et al., 2018) is computed, followed by Leiden clustering. For larger datasets, we pre-cluster the projected data using mini-batch k-means [similar to Hicks et al. (2021), with k set to 10% of the number of cells] and generate meta-cells (i.e. cluster centers). Again, the first 30 principal components of the meta-cells are used to compute a neighborhood graph, followed by Leiden clustering. This step creates cell annotations c = {c_i}_{i=1}^{n+n̂_d} with c_i = k if cell i is assigned to cluster k.

Homotypic doublet exclusion: Next, we remove 'homotypic' simulated doublets, which are defined as simulated doublets that are comprised of cells from the same cluster. We use the index pairs from the doublet simulation step to determine which simulated doublets to exclude. Specifically, doublet x_d is considered a 'homotypic' simulated doublet if for the index pair (i, j) comprising x_d, we have c_i = c_j.
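The homotypic-exclusion rule reduces to a one-line mask over the retained index pairs; a small sketch (names hypothetical):

```python
import numpy as np

def drop_homotypic(doublets, pairs, clusters):
    """Sketch of homotypic simulated-doublet exclusion.

    doublets : (m, p) simulated doublet matrix
    pairs    : (m, 2) parent index pairs from the simulation step
    clusters : (n,) cluster label of each input cell (e.g. from Leiden)
    Keeps only doublets whose two parents fall in different clusters.
    """
    keep = clusters[pairs[:, 0]] != clusters[pairs[:, 1]]
    return doublets[keep], pairs[keep]
```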
Cluster-aware autoencoder: The cluster-aware autoencoder then takes X (log-transformed and scaled as described previously) and c as input and consists of an encoder network, a decoder network, and a cluster classifier. Let an input instance (i.e. cell and label) be denoted by (x, c).
The encoder network consists of an input layer (p neurons), followed by a dense layer with 256 neurons, batch normalization and dropout (rate = 0.3) layers, and a TensorFlow Probability layer parameterizing a d-dimensional Normal distribution with diagonal covariance matrix (we use d = 5); d is the dimension of the latent space representation of the input data.

The decoder network consists of an input layer (d neurons), a dense layer of 256 neurons, followed by batch normalization and dropout (rate = 0.3) layers, and a TensorFlow Probability layer parameterizing a p-dimensional Normal distribution with diagonal covariance matrix, modeling the input data X.

The cluster classifier consists of an input layer (d neurons), batch normalization, and a fully connected layer with the number of neurons equal to the number of clusters present and with softmax activation, modeling each cell's cluster assignment. Let the cluster classifier's class assignments be denoted by f(x), and let θ denote the model's trainable parameters.
Fig. 1. Summary of the vaeda method. Input cells X are subjected to data augmentation, where artificial doublets are simulated, a preliminary doublet score s is derived, and an augmented dataset X is created. Next, a low-dimensional representation Z of X is derived, using a cluster-aware variational autoencoder. Finally, positive unlabeled learning is used to derive final doublet scores ξ for each input cell/barcode.

The loss function of vaeda is defined as

L(x, c; θ) = −log N(x; μ_d(x; θ), σ_d(x; θ)) + KL[ N(μ_e(x; θ), σ_e(x; θ)) ‖ N(0, I) ] + β · CCE(c, f(x; θ)),

where the first two terms represent the loss function of the auto-encoder, and the third term is the classification loss of the cluster classifier (categorical cross entropy). Specifically, μ_d(x; θ) and σ_d(x; θ) represent the output of the decoder's final layer and are parameters of a Normal distribution modeling the input data, while μ_e(x; θ) and σ_e(x; θ) represent the output of the encoder network's final layer. Therefore, the first term denotes the negative log likelihood of the input data (= reconstruction error), while the second term is the Kullback-Leibler divergence between the (probabilistic) low-dimensional representation of the input and a standard Normal distribution of appropriate dimensions (= regularization). CCE is the categorical cross entropy between the input's cluster label and the cluster classifier's output f(x; θ), and β is a parameter adjusting the scale between autoencoder loss and classifier loss (we set β = 20,000). Vaeda is trained using the Adamax optimizer with default options. After the third epoch, the learning rate (0.001) decays at a rate of 0.75. Ten percent of the input data is used as a validation set, and training stops if the validation loss has not improved for 20 epochs.
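To make the three loss terms concrete, here is a small numeric sketch of the per-cell loss under the diagonal-Gaussian assumptions described above. Function and argument names are hypothetical; the actual implementation uses TensorFlow Probability layers rather than explicit formulas.

```python
import numpy as np

def vaeda_style_loss(x, mu_d, sig_d, mu_e, sig_e, c_onehot, probs, beta=20000.0):
    """Numeric sketch of the cluster-aware VAE loss for one cell.

    x            : (p,) input cell (log-transformed, scaled)
    mu_d, sig_d  : (p,) decoder output; Normal model of x
    mu_e, sig_e  : (d,) encoder output; latent Normal
    c_onehot     : (K,) one-hot cluster label
    probs        : (K,) cluster-classifier softmax output
    """
    # term 1: negative log-likelihood of x under the decoder's diagonal Normal
    nll = 0.5 * np.sum(np.log(2 * np.pi * sig_d**2) + (x - mu_d) ** 2 / sig_d**2)
    # term 2: KL( N(mu_e, diag(sig_e^2)) || N(0, I) ), closed form
    kl = 0.5 * np.sum(sig_e**2 + mu_e**2 - 1.0 - np.log(sig_e**2))
    # term 3: categorical cross entropy of the cluster classifier
    cce = -np.sum(c_onehot * np.log(probs + 1e-12))
    return nll + kl + beta * cce
```

Note that with a standard-Normal latent (μ_e = 0, σ_e = 1) the KL term vanishes, and a perfectly confident correct classifier contributes (near-)zero cross entropy.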
Moving forward, we use the auto-encoder's low-dimensional representation of the input data X, which we denote Z ∈ R^((n+n̂_d)×d).

Doublet scoring by PU learning
To score cells as potential doublets, we use a positive unlabeled learning approach, with the rationale that doublet labels on the input data are not observed (unlabeled), whereas we have positive labels on the simulated doublets in the augmented, reduced-dimensional dataset (Z, y). Before we apply bagging PU learning, we augment (Z, y) by appending the preliminary doublet scores s that we used for the initial estimate of the number of doublets (n̂_d), so that we finally use d + 1 features per cell for doublet annotation. The PU bagging approach we use is summarized in Algorithm 1, which is adapted from the original publication (Mordelet and Vert, 2014). Unlabeled examples U are the first n rows of Z, whereas the positive examples P are the last n̂_d rows of Z. As classifier f(x), we use logistic regression, implemented as a neural network with an input layer (d + 1 neurons), a batch normalization layer, and a dense output layer with sigmoid activation. We determine the number of epochs for training as follows: The network is trained on the first fold for 250 epochs; then the KneeLocator function of the kneed python module (Satopaa et al., 2011) is used to find an inflection point in the loss curve, which is then used to determine the number of training epochs for all folds. Scores {ξ_i}_{i=1}^{n} returned by PU bagging for the rows of Z that represent input cells are the doublet scores reported by the vaeda method.
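The PU-bagging loop can be sketched as follows. This is a simplified stand-in, not vaeda's implementation: a plain gradient-descent logistic regression replaces the small neural network, and names, fold handling, and hyperparameters are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, epochs=250, lr=0.1):
    """Tiny logistic regression (gradient descent); stand-in classifier."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                                  # gradient of log loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return lambda Z: 1.0 / (1.0 + np.exp(-(Z @ w + b)))

def pu_bagging(U, P, K=5, rng=0):
    """Sketch of PU bagging (Mordelet and Vert, 2014).

    U : (n, d) unlabeled input cells; P : (m, d) positives (simulated doublets).
    Each fold U_k serves as tentative negatives; out-of-fold cells are scored.
    """
    rng = np.random.default_rng(rng)
    n = len(U)
    folds = np.array_split(rng.permutation(n), K)
    score, count = np.zeros(n), np.zeros(n)
    for k in range(K):
        idx = folds[k]
        Xk = np.vstack([P, U[idx]])                # train P against fold U_k
        yk = np.r_[np.ones(len(P)), np.zeros(len(idx))]
        f_k = fit_logreg(Xk, yk)
        oob = np.setdiff1d(np.arange(n), idx)      # score x in U \ U_k
        score[oob] += f_k(U[oob])
        count[oob] += 1
    return score / count                           # averaged out-of-fold scores
```

On toy data where a few unlabeled points lie in the positive region (hidden "doublets"), those points receive noticeably higher averaged scores than the rest of U.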

Doublet calling
In addition to doublet scores, vaeda also provides doublet calls as a binary prediction for whether a cell is a singlet or a doublet. Vaeda generates these calls by selecting a threshold t*, where all cells scoring above this threshold are called as doublets and all cells scoring below this threshold are called as singlets (e.g. Fig. 2B, Supplementary Fig. S3). The threshold is selected by the following minimization problem:

t* = argmin_t [ FNR(t) + FPR(t) + α · LL(n(t) | μ, σ)² ],

where FNR(t) is the fraction of simulated doublets called singlets at threshold t (a false negative rate), FPR(t) is the fraction of input cells called doublets at threshold t (a false positive rate), and n(t) is the number of input cells called doublets at threshold t. LL(n(t) | μ, σ) is the (Gaussian) log-likelihood of observing n(t) doublets given an expected number of doublets μ with standard deviation σ. If μ is not provided by the user, we use μ = n² · 10⁻⁵, motivated by the heuristic that n · 10⁻⁵ is a good approximation for the fraction of doublets in a dataset (e.g. see Germain et al., 2021). Via a binomial model, we arrive at a variance of μ(1 − μ/n) for the number of doublets, which determines the parameters of the log-likelihood term. We square the log likelihood to flatten its minimum, and the parameter α controls a trade-off between misclassification of simulated doublets and the number of expected doublets (we set α = LL(n(t_max) | μ, σ)⁻¹, where t_max is the largest possible threshold on the input data). We note that this approach is similar to that of Germain et al. (2021); however, the expected number of doublets is incorporated differently in our approach.
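A numeric sketch of this threshold search follows. Names are hypothetical, and one detail is our assumption: we take the absolute value of LL(n(t_max)) when forming α so that the likelihood penalty is positive; the paper's exact scaling may differ.

```python
import numpy as np

def call_doublets(scores_input, scores_sim, mu=None):
    """Sketch of vaeda-style doublet-call threshold selection.

    scores_input : doublet scores of input cells
    scores_sim   : doublet scores of simulated doublets
    Minimizes FNR(t) + FPR(t) + alpha * LL(n(t) | mu, sigma)^2 over
    candidate thresholds t, with mu = n^2 * 1e-5 by default.
    """
    n = len(scores_input)
    if mu is None:
        mu = n**2 * 1e-5                       # expected number of doublets
    var = mu * (1.0 - mu / n)                  # binomial-style variance

    def ll(x):                                 # Gaussian log-likelihood of x doublets
        return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

    ts = np.unique(np.r_[scores_input, scores_sim])
    t_max = ts[-1]
    # assumption: |.| keeps the penalty weight positive (text states LL^-1)
    alpha = 1.0 / abs(ll(np.sum(scores_input > t_max)))
    best_t, best_cost = None, np.inf
    for t in ts:
        fnr = np.mean(scores_sim <= t)         # simulated doublets called singlets
        fpr = np.mean(scores_input > t)        # input cells called doublets
        cost = fnr + fpr + alpha * ll(np.sum(scores_input > t)) ** 2
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, scores_input > best_t
```

For 1000 input cells, the default heuristic expects μ = 1000² · 10⁻⁵ = 10 doublets, so a score distribution with ten clear outliers yields exactly ten calls.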

Benchmark data and application of existing methods
Scrublet (Wolock et al., 2019), scDblFinder (Germain et al., 2021), doubletFinder (McGinnis et al., 2019a), and the cxds, bcds, and hybrid methods of scds (Bais and Kostka, 2020) were run with default parameters using code from the R package DoubletCollection (Xi and Li, 2021b). These methods also provide binary doublet calls in addition to continuous doublet scores, so we modified the package to access these calls for our doublet call analysis. Solo was run via the command line using parameters provided in the file model_json on solo's github page (https://github.com/calico/solo). We use 16 datasets from the benchmark study of Xi and Li (2021a), downloaded from Zenodo (https://zenodo.org/record/4062232#.X3YR9Hn0kuU).

Analyses with down-sampled datasets
For several analyses, we randomly down-sampled datasets 10 times to contain a fraction of cells (we used 95%), while preserving the original singlet to doublet ratio. Each doublet annotation method was then run on each down-sampled dataset, resulting in 10 annotation scores for each cell per method per dataset. Because of its longer running time, solo was only run on five sub-samples instead of 10 (for run times, see Supplementary Fig. S5). To test for a difference in performance between two methods on a specific dataset, we then used a Wilcoxon rank-sum test. To test for performance differences across all datasets, a paired Wilcoxon rank-sum test is used, where the pairing takes into account systematic differences between datasets. In cases of comparing solo to other methods, we only use the five repetitions where solo was applied.

Algorithm 1: PU learning, adapted from Mordelet and Vert (2014)
  Randomly split U into K folds {U_1, ..., U_K}
  for k = 1 ... K do
      Train classifier f_k to discriminate P against U_k
      for x ∈ U \ U_k do
          F(x) ← F(x) + f_k(x)
          n(x) ← n(x) + 1
      end
  end
  Return: score ξ(x) ← F(x) / n(x) for x ∈ U
To obtain a standard deviation of a performance measure averaged across datasets (e.g. Fig. 3), we estimated standard deviations for each dataset and then used standard error propagation.

Missed versus captured doublets
We characterize and contrast experimentally annotated doublets we call 'captured' and 'missed' (Fig. 4). We define a captured doublet as a doublet that is classified as a doublet by at least four (out of eight) methods, and a missed doublet as a doublet that is misclassified as a singlet by all of the methods. In this analysis, we use for each method the m top-scoring cells as computationally annotated doublets, where m is the number of experimentally annotated doublets (i.e. we do not use the methods' doublet callers). For each doublet, we then obtain the fraction of singlets amongst its k-nearest neighbors (k = 5). Figure 4 shows violin plots of these singlet fractions, stratified by missed versus captured doublets. Circles represent averages for each dataset.

Doublet detection with the vaeda method
Here, we present the vaeda method for computational annotation of doublets in single-cell RNA sequencing data. Vaeda resembles other doublet annotation methods (Bais and Kostka, 2020; Bernstein et al., 2020; Germain et al., 2021; McGinnis et al., 2019a; Wolock et al., 2019) in that artificially generated doublets (we also call them 'simulated' doublets) are used as a means to infer actual (or 'real') doublets in a dataset. Conditional on simulated doublets, this approach naturally leads to a statistical learning setup discriminating artificial doublets from input data, and classification scores are then used for doublet annotation. However, real doublets are present in the input data as well. With vaeda, we explicitly account for this by viewing the learning task as a Positive-Unlabeled (PU) learning problem, where simulated data is viewed as a positive set (P), while input data is assumed unlabeled. This approach differs from standard classification, which implicitly assumes that no doublets are present in the input.
Briefly, the vaeda method works as follows. After variable gene selection and generation of artificial doublets, clustering is performed and a cluster-aware VAE is used to learn a latent representation of both the input data and the artificial doublets. The learned latent projections, together with a knn feature encapsulating the fraction of simulated doublets in each cell's neighborhood, are then used as input for PU bagging (Mordelet and Vert, 2014) to derive doublet scores. Further on, vaeda can perform doublet calling. Similar to Germain et al. (2021), vaeda balances a heuristically expected number of doublets with false positive and false negative doublet calls, as quantified by the number of misclassified simulated doublets and input cells, respectively. See Section 2 for details and Figure 1 for an overview of vaeda. In the following, we assess vaeda on 16 benchmark datasets (Xi and Li, 2021a) where doublets have been experimentally annotated.

Ablation analysis of the vaeda method
In order to evaluate the relative importance of vaeda's different components we performed ablation analyses. Specifically, we assessed the following components of our method: (i) inclusion of the fraction of (simulated) doublets in a cell's neighborhood into the learning problem, versus not including them; (ii) classification algorithm: knn classifier versus logistic-regression classifier; (iii) excluding simulated doublets that might be homotypic, i.e. doublets simulated by combining two cells from the same cluster; (iv) type of low-dimensional representation: pca versus variational auto-encoder versus cluster-aware variational auto-encoder; (v) PU learning versus regular classification. We measured performance for every combination of these components, and results are summarized in Supplementary Figure S1.
We find that a combination of a cluster-aware autoencoder, homotypic doublet exclusion, PU learning with a logistic-regression-type classifier, and inclusion of the neighborhood doublet fraction as a feature yielded the best results (average area under the precision recall curve: 55.8%, see Fig. 3A). We also find that combinations including the neighborhood fraction of doublets and a logistic regression type classifier perform best, while methods without this feature using a logistic regression classifier perform worst. Combinations with a knn-type classifier perform in-between. Focusing on twelve high-performing combinations (i.e. with neighborhood doublet fraction and logistic regression classification), we find that combinations including PU learning perform better than those using regular classification (four of the top-six and two of the top-three combinations use PU learning). Results with regard to the low-dimensional embedding are a bit less clear; however, two of the top-three combinations use a cluster-aware VAE (one uses PCA), while only one of the worst-performing combinations uses this approach for dimension reduction (the other two use PCA and a regular auto-encoder). Overall, this analysis motivates our design of the vaeda method. While the biggest effects came from the neighborhood doublet fraction and the classifier type, we note that PU learning and the cluster-aware autoencoder increased average performance from 54.5% to 55.8%, a mild but noticeable improvement.

Fig. 2. Vaeda's latent representation preserves doublet annotation information and vaeda's caller correlates with actual doublet fractions. (A) The performance of a knn classifier predicting annotated doublets on the input data (left) and on vaeda's latent representation (right). (B) The cost function vaeda minimizes for doublet calling, shown for the HMEC-rep-MULTI dataset. (C) Vaeda-estimated doublets (x-axis) versus experimentally annotated doublets (y-axis) for 16 benchmark datasets; the line y = x is shown in black. (D) UMAP density plots of four datasets (rows) with all cells, experimentally annotated singlets and doublets (columns two and three), as well as vaeda-simulated doublets (column four). Darker areas denote regions of higher cell density.

Vaeda's latent representation preserves doublet annotation
Vaeda learns a latent space representation of the data by training a cluster-aware VAE. We quantitatively and qualitatively assessed how well this latent representation preserves experimental doublet annotation. First, to quantitatively assess whether the embedding preserves local label (= doublet) information, we followed the approach of Zhou and Troyanskaya (2021) and used a knn (k = 5) classifier trained on experimental doublet annotations to predict cell labels in both the input space and the latent space. We find that label accuracy in vaeda's latent space is 91.598% (averaged over datasets), approximately the same as in the input space (91.586%) (see Fig. 2A). This indicates that vaeda's latent representation accurately reflects local cell label information.
Next, to qualitatively assess vaeda's simulation of artificial doublets, we visualized simulated and experimentally annotated doublets using vaeda's latent representation. Plots for all benchmark datasets are shown in Supplementary Figure S2 (examples in Fig. 2D). Overall, we observe good agreement between experimentally annotated and simulated doublets. However, for some datasets (HMEC-rep-MULTI, cline-ch, mkidney-ch, pbmc-1B, and pbmc-1C), there exist doublet populations that are annotated but not covered well by simulated doublets. We also note that regions with high densities of singlets are typically distinct from high-density simulated doublet regions. Two exceptions are hm-6k and hm-12k, which both have small groups of simulated doublets overlapping annotated singlets. We note that experimental doublet annotations for these data do not consider homotypic doublets (i.e. doublets that occur when two cells of the same cell type combine). Figure 2D shows two examples with good overlap between annotated and simulated doublets (left), and HMEC-rep-MULTI and cline-ch (right).

Vaeda's doublet scores and doublet calling reflect experimental annotations
Doublet scores produced by the vaeda method correlate well with experimental doublet annotation. Briefly, across the 16 benchmark datasets, vaeda achieves an average area under the precision recall curve (AUPRC) of 55.8%, with AUPRCs for individual datasets reaching 97% and 95% (Fig. 3A). More details about vaeda's doublet score performance and a comparison with other methods are presented in the next section.

Fig. 4B. Density plots of the fraction of singlets in a doublet's neighborhood, stratified by whether the doublet is consistently missed or captured.
In addition to calculating doublet scores for each cell, vaeda can also classify cells as either a doublet or a singlet, as described in Section 2.1.5. Briefly, vaeda's doublet caller simultaneously minimizes the false positive rate, the false negative rate, and the squared log likelihood of the predicted number of doublets given an expected number of doublets (Fig. 2B, Supplementary Fig. S3). The expected number of doublets is derived from the observation that, across experiments, the probability of a cell being a doublet scales linearly with the total number of cells; also see Section 2.1.5. Figure 2C shows vaeda's doublet calling results, compared to the number of experimentally annotated doublets. The correlation between annotated and estimated doublet rates (r² = 0.59) is a noticeable improvement over using purely the expected number discussed above (r² = 0.52, Supplementary Fig. S4). Further on, estimates on datasets with reasonable doublet rates (less than 20%) appear more accurate, while the doublet fraction in two datasets with high doublet density (37.3% and 31.0%) is underestimated.
In summary, these results demonstrate that the vaeda method is a reasonably designed method that generates accurate doublet scores for scRNA-seq data; it is able to perform meaningful classification into singlet versus doublet cells, outperforming a reasonable baseline based on cell-numbers alone. Next, we compared vaeda with other computational methods for doublet annotation.

Comparison with other doublet annotation methods
To compare vaeda to other methods, we used 16 benchmarking datasets that have previously been used to assess doublet annotation performance by Xi and Li (2021a). We proceed by comparing vaeda with scDblFinder, doubletFinder, hybrid, bcds, cxds, solo, and scrublet, while library size was included as a baseline. We used average precision (AUPRC) as the main performance metric, because the datasets are imbalanced (typically the fraction of doublets is low, see Supplementary Table S1). Results are summarized in Figure 3A.
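For reference, average precision summarizes the precision-recall curve as the mean of the precision values at the ranks of the true positives. A minimal numpy sketch (the benchmark's exact implementation may differ):

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision: mean precision at the rank of each true positive.

    scores : per-cell doublet scores (higher = more doublet-like)
    labels : 1 for experimentally annotated doublets, 0 for singlets
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by score, descending
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                              # true positives up to each rank
    ranks = np.arange(1, len(labels) + 1)
    return (hits[labels == 1] / ranks[labels == 1]).mean()
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] place positives at ranks 1 and 3, giving precisions 1 and 2/3 and an average precision of 5/6.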
We find that vaeda outperforms all competing methods on four datasets; this is slightly worse than scDblFinder (five) but better than doubletFinder (two), solo (three), cxds (three), and scrublet (one). Hybrid and bcds did not outperform all other methods on any dataset. In terms of average performance, vaeda's AUPRC (averaged across datasets) is 55.8%, second only to the 56.4% of scDblFinder (the best method, on average) and ahead of the 53% of doubletFinder (the third-best method); see Figure 3A.

Vaeda performs competitively with existing methods
Performance results presented in the previous section were computed using all cells in each benchmark dataset. To study whether the performance differences we observed were robust and meaningful, we repeatedly sub-sampled cells in each benchmark dataset and observed the resulting empirical distribution of performance metrics. Figure 3B shows results using repeated (ten times) 95% down-sampling in terms of average AUPRC. Examining average performance, we find that vaeda outperforms all other methods except scDblFinder and doubletFinder. We observe a noticeable decrease in average performance between the top three methods (scDblFinder, vaeda, and doubletFinder) and the other methods. Similarly, bcds, scrublet, and cxds on average perform worse than the rest. We also quantitatively assessed performance differences using Wilcoxon rank-sum tests for pairs of methods on the 95% down-sampled datasets, aggregating across all 16 datasets (see Section 2.3); we find that (using continuous doublet scores and AUPRC) scDblFinder outperforms vaeda, and vaeda outperforms doubletFinder (Table 1). However, we note that even though scDblFinder outperforms vaeda on average, vaeda achieves significantly better performance on the HEK-HMEC-MULTI, HEK-orig-MULTI, and HMEC-rep-MULTI datasets, and there are no significant differences between vaeda and scDblFinder on the J293t-dm, hm-6k, and hm-12k datasets (Supplementary Table S2).
We also compared all pairs of doublet detection methods stratified by dataset, using paired Wilcoxon rank-sum tests to decide 'wins' (P ≤ 0.05, higher performance), 'ties' (P > 0.05), and 'losses' (P ≤ 0.05, lower performance). Supplementary Figure S8C shows the results. For each method, there are seven competitor methods and 16 datasets, yielding 7 × 16 = 112 comparisons. Only scDblFinder, vaeda, and doubletFinder are able to win more than half of their comparisons (83, 67, and 64, respectively). For the top-three methods (scDblFinder, vaeda, and doubletFinder), we also provide pairwise comparisons, stratified by dataset, in Supplementary Table S2.

Vaeda performs competitively at doublet calling
In practice, doublet annotation methods are used to filter putative/annotated doublets from scRNA-seq datasets. In addition to doublet scores, vaeda is equipped with a doublet caller that provides a binary label for this purpose. Here, we compare vaeda's doublet calling with that of other methods and find that it performs competitively. Specifically, we applied vaeda along with the other methods to the 16 benchmark datasets and recorded doublet calls. We then calculated the f1 score, mcc, precision, recall, and accuracy for the calls and averaged across datasets (Table 2, Supplementary Fig. S9). We see that vaeda performs well overall, with the second-highest f1 score; compared to scDblFinder, vaeda appears to trade off recall (52.5% versus 56.9%) for precision (59% versus 53.7%), but comes out a little behind overall. Interestingly, doubletFinder does not perform well, calling too few doublets overall (see Section 4). We also observe that there is noticeable spread in performance between different datasets (Supplementary Fig. S9).
To investigate whether method-specific doublet calling is better than selecting a cutoff from the expected number of doublets, given by the heuristic n² · 10⁻⁵ (see Sections 2 and 2.1.5), we recalculated performance metrics for all methods with this approach. Supplementary Table S3 shows the results, and Supplementary Table S4 shows the difference in performance. For most methods, performance differences between this approach and method-specific determination of the number of doublets are small; exceptions are scrublet and doubletFinder, both of which lose precision and gain recall as more doublets are called.

Note to Table 1: Paired Wilcoxon rank-sum tests were used to identify significant (P ≤ 0.05) performance differences between pairs of methods using ten 95% down-samplings of each dataset. The method with the better performance is indicated; entries with P > 0.05 indicate no significant difference.
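As an illustration of this cutoff heuristic, the sketch below computes the expected doublet count n² · 10⁻⁵ and calls the top-scoring cells as doublets. The helper names are hypothetical, and we truncate the expected count to an integer for simplicity.

```python
def expected_doublet_count(n_cells):
    """Heuristic expected number of doublets: n^2 * 1e-5."""
    return int(n_cells ** 2 * 1e-5)

def call_top_scoring(scores):
    """Call the k highest-scoring cells as doublets, with k taken
    from the heuristic applied to the total number of cells."""
    n = len(scores)
    k = min(expected_doublet_count(n), n)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    calls = [0] * n
    for i in order[:k]:
        calls[i] = 1
    return calls
```

For example, a dataset with 10 000 cells yields an expected 10 000² · 10⁻⁵ = 1000 doublets, i.e. a 10% doublet rate; the rate grows linearly with the number of cells loaded.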
To test for significant differences in doublet-calling performance, we again used Wilcoxon rank-sum tests, as before. Table 3 shows results comparing scDblFinder, vaeda, and doubletFinder for f1 score, mcc, precision, recall, and accuracy. We find that vaeda performs comparably to scDblFinder in terms of mcc, precision, and accuracy, and slightly worse in terms of f1 score and recall.

Consistently misclassified doublets may be homotypic
In a majority of the benchmarking datasets, there is a large subset of annotated doublets that are misclassified by every method (Fig. 4A, Supplementary Figs. S10 and S11). We suspect that these consistently misclassified, or 'missed', doublets are primarily homotypic doublets (i.e. doublets composed of two cells of the same type). Support for this hypothesis comes from the fact that the only two datasets without a substantial number of (consistently) missed doublets are the hm-6k and hm-12k datasets, where homotypic doublets are not annotated. To further test this idea, we measured the mixing between singlets, captured doublets, and missed doublets; in this analysis, captured doublets were annotated by at least four methods, and missed doublets were not annotated by any of the eight methods we applied. Mixing was measured as the average fraction of singlets in the neighborhoods of the missed/captured doublets, as described in Section 2.4. We found that missed doublets have higher mixing with singlets than captured doublets, supporting the idea that the missed doublets are homotypic (i.e. near singlets, presumably of the same type); see Figure 4B.
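The mixing statistic can be sketched as a k-nearest-neighbor computation. In this toy sketch, 2D coordinates stand in for the embedding actually used in Section 2.4, and the function name `mixing_with_singlets` is ours; a brute-force neighbor search is used for clarity.

```python
def mixing_with_singlets(queries, cells, is_singlet, k=3):
    """Average fraction of singlets among the k nearest neighbors
    of each query cell. Assumes query points are not contained in
    `cells`; `cells` are coordinate tuples, `is_singlet` booleans."""
    def dist2(a, b):
        # squared Euclidean distance (monotone, so fine for ranking)
        return sum((u - v) ** 2 for u, v in zip(a, b))
    fracs = []
    for q in queries:
        nn = sorted(range(len(cells)), key=lambda i: dist2(q, cells[i]))[:k]
        fracs.append(sum(1 for i in nn if is_singlet[i]) / k)
    return sum(fracs) / len(fracs)
```

Under this measure, a homotypic ('missed') doublet sitting inside a singlet cluster scores near 1, while a doublet surrounded by other doublets scores near 0, matching the trend reported in Figure 4B.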

Discussion
Here, we present vaeda, a new tool for computational doublet detection. The vaeda method follows a paradigm similar to other approaches, where doublet scores are produced using artificially generated doublets [e.g. Bais and Kostka, 2020; Germain et al., 2021, and others]. While auto-encoders have been used in the context of doublet detection before (Bernstein et al., 2020), vaeda is unique in that it combines a cluster-aware variational auto-encoder with a PU-Learning approach. We carefully studied the effect of both of these concepts and showed that, even though other design choices have greater impact, they do enable a noticeable increase in performance.
We assessed vaeda on 16 benchmark datasets and find that, overall, vaeda produces accurate doublet scores and is able to derive binary doublet predictions that reflect experimental annotations well. Further, we find that vaeda's latent representation is helpful in identifying datasets where simulated doublets agree well with experimental annotations, and cases where simulated and experimentally annotated doublets show more pronounced differences. We illustrate this with four examples in Figure 2D, with HMEC-rep-MULTI and cline-ch as examples where the distributions of simulated and annotated doublets differ. We note that vaeda, as well as other methods (see Fig. 3), does not perform particularly well on these data, indicating that improving our ability to simulate doublets might be a promising way to improve method performance.
In terms of assessing method performance, we note that ultimately the metrics most relevant in practice are those for binary predictions, rather than rank-based metrics like area under the ROC curve or average precision (i.e. area under the precision-recall curve). Such metrics are reported in Table 2, and we see that vaeda outperforms doubletFinder and performs comparably to scDblFinder in terms of mcc, precision, and accuracy; in terms of f1 score and recall, scDblFinder performs slightly better than vaeda. Therefore, we conclude that vaeda performs comparably to other state-of-the-art methods (scDblFinder and doubletFinder), while noting that scDblFinder has slightly better average performance in terms of f1 score and recall. We highlight, however, that the only other native python method is scrublet, which performs significantly worse than vaeda.
Interestingly, doubletFinder's doublet calling heuristic with default parameters does not perform well in Table 2, even though its doublet scores perform well overall. We show this in Supplementary Table S5, where we use doublet scores to call the annotated number of doublets across benchmark datasets; this analysis places doubletFinder among the top three methods in terms of f1 score, for example.
We do note that the benchmarking datasets used in this work, and in other doublet annotation manuscripts, are not entirely representative, as they do not exhaustively cover technical and biological diversity. To add flexibility to our doublet calling approach, the parameter l in our thresholding function (Section 2.1.5) is adjustable by the user, to take specific attributes of their data into account.
In summary, we have shown that vaeda is a state-of-the-art method for computationally annotating doublets in scRNA-seq data, and that it is the top choice for python workflows. It also provides a low-dimensional data representation that can complement other approaches and be useful for data analysis and visualization. It is therefore a useful tool for single-cell RNA sequencing data analysis.

Note to Table 3: Paired Wilcoxon rank-sum tests were used to identify significant (P ≤ 0.05) performance differences between methods' doublet calls; entries with P > 0.05 indicate no significant difference.