Scywalker: scalable end-to-end data analysis workflow for long-read single-cell transcriptome sequencing

Abstract Motivation Existing nanopore single-cell data analysis tools showed severe limitations in handling current data sizes. Results We introduce scywalker, an innovative and scalable package developed to comprehensively analyze long-read sequencing data of full-length single-cell or single-nuclei cDNA. We developed novel scalable methods for cell barcode demultiplexing and single-cell isoform calling and quantification and incorporated these in an easily deployable package. Scywalker streamlines the entire analysis process, from sequenced fragments in FASTQ format to demultiplexed pseudobulk isoform counts, into a single command suitable for execution on either server or cluster. Scywalker includes data quality control, cell type identification, and an interactive report. Assessment of datasets from the human brain, Arabidopsis leaves, and previously benchmarked data from mixed cell lines demonstrate excellent correlation with short-read analyses at both the cell-barcoding and gene quantification levels. At the isoform level, we show that scywalker facilitates the direct identification of cell-type-specific expression of novel isoforms. Availability and implementation Scywalker is available on github.com/derijkp/scywalker under the GNU General Public License (GPL) and at https://zenodo.org/records/13359438/files/scywalker-0.108.0-Linux-x86_64.tar.gz.


Introduction
Single-cell and single-nuclei (collectively called single-cell hereafter) transcriptome sequencing has revolutionized our understanding of processes in health and disease, especially in heterogeneous tissue like the human brain (Piwecka et al. 2023).Alternative isoforms, with variation in start sites, splicing of exons, or transcript ends, are highly prevalent in the transcriptome of complex eukaryotes (Wang et al. 2008, Park et al. 2018).However, the most commonly used singlecell short-read sequencing methods only result in data from the 5 0 or 3 0 ends of the transcripts.Even with short-read sequencing covering the entire gene length, there is no direct observation of full-length transcripts and only a partial reconstruction of the splice diversity.
In contrast, long-read sequencing methods from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) enable the sequencing of full-length transcripts and the identification and quantification of all isoform variations.Bulk long-read transcriptome sequencing consistently leads to the discovery of novel isoforms (Clark et al. 2020, Glinos et al. 2022, Heberle et al. 2023) but lacks the cell-type resolution and may miss isoforms from lowly abundant cell types.Therefore, there is great value in optimizing long-read singlecell transcriptomic methods.Initial strategies to deal with the higher error rate of long-read sequencing included combining short-read sequencing data from the same library (Gupta et al. 2018, Lebrigand et al. 2020), rolling circle amplification (Volden et al. 2018), or dimer nucleotide blocks for barcodes and UMIs (Philpott et al. 2021).Methods not relying on short-read sequencing data arose with decreasing nucleotidelevel error rates (Tian et al. 2021, You et al. 2023, Oxford Nanopore Technologies 2024).These methods were tested on relatively small datasets and, in our experience, did not scale to the large datasets (>10 000 cells per sample) currently produced.We developed scywalker to address this issue, creating a more scalable package for adequate analysis of this rich data source using novel methods and extensive parallelization while simultaneously improving the accuracy of the results and utility.In contrast to existing tools, scywalker can, in one command, provide analysis of multiple samples, generating ready-to-use per-cell expression counts and multi-sample per-cell-type summed pseudobulk counts, allowing easy comparison of gene and isoform expression over multiple samples and cell types.

Single-nuclei and single-cell transcriptome sequencing
Four fresh frozen human brain samples (BA10) were provided by the NeuroBiobank of the Born-Bunge Institute (IBB-Neurobiobank), Wilrijk (Antwerp), Belgium; ID: BB190113.The study was approved by the Ethics Committee of the University Hospital Antwerp and the University Antwerp (20/10/107).Nuclei isolation from human brain samples was performed using an adapted density gradient protocol (Habib et al. 2016).A different protease inhibitor (1× cOmplete EDTA-free protease inhibitor) was used, and the lysis time was reduced to 2 min.

Implementation of the scywalker pipeline
The scywalker pipeline (Supplementary Fig. S1) is implemented within the GenomeComb framework (github.com/derijkp/genomecomb/), a toolset for efficiently analyzing various types of sequencing data.

Scywalker droplet barcoding
All barcode files are merged, summarized, and sorted by read count.The top barcode (largest read count) is taken as a correct droplet barcode, and the remaining barcodes are searched for barcodes with one difference using costonly dynamic programming with a cost cut-off of 1.This search is precomputed in parallel in 50 batches.All hits will be assigned to the initial droplet barcode and removed from the list for further processing.This procedure is then repeated with the following (free) barcode in the list until it has fewer than (by default) 20 reads, or until (by default) 100 000 droplets have already been identified.If a whitelist of all potential droplet barcodes is given, only barcodes in this list will be processed.The resulting mapping of read barcodes to droplet/cell barcodes will be combined with the FASTQs and their barcoding files to generate FASTQs where the corrected (droplet) barcode and UMI are prepended before the read name and added as FASTQ comments.

Scywalker gene and isoform calling
The barcoded FASTQ files are aligned to the reference genome with minimap2 (Li 2018) using the splice preset in separate jobs for each file and coordinate sorted using a modified version of gnu-sort.The presorted alignments are merged using the gnu-sort mergesort, resulting in one sorted alignment file in bam format.
Genes and isoforms are then identified and counted based on the postprocessing of IsoQuant result, run highly parallelized (Prjibelski et al. 2023).Scywalker splits the alignments into smaller regions (default target 5Mbase, but splits can only happen in 250kbase regions without known genes) that are processed separately to improve efficiency and reduce peak memory use.These are run in separate jobs instead of threads (to allow distribution over a cluster).IsoQuant is run on each region alignment, and reads are analyzed as bulk data (command included in the Supplementary code).Scywalker then adapts the IsoQuant output: IsoQuant produces separate results for transcripts in the given reference set (known) and for predicted novel models not in the reference set.These are merged into one.In this process, the identifiers given to novel isoforms and genes by Isoquant are changed to a unique identifier based on the genomic location to avoid collisions when separately processed regions are later merged.The IsoQuant read_assignment file (for the reference calls) is also adapted by merging in the transcript_model_reads data (changing the assignment of reads supporting novel transcripts), putting information about detected polyA and classifications in separate columns, and adding information about the completeness of the match.The read_assignment files can be used later to calculate single cell counts because the cell barcodes are incorporated through the readnames.
Scywalker uses a custom algorithm to analyze organelle genomes.Only gene locations are loaded into memory to minimize memory use, while the numerous read alignments are loaded and analyzed individually.For each read alignment, the CIGAR string is parsed to define different alignment regions of the read, and the percent overlap with genes is calculated.If multiple overlaps exist, the read is assigned to the nonribosomal RNA gene with the highest pct overlap.A file similar to the IsoQuant read_assignment file is produced.In the assignment_events field of the file, one of the following is recorded to indicate how the match/read relates to a reference transcript: mono_exon_match if the alignments start and end coordinates differ a maximum of 3 bases from the known gene location, mono_exon_enclosed for alignments entirely within the gene (but not a match) and mono_-exon_overlap for alignments overlapping at least 5% of the gene.This information is later considered when calculating final read counts: reads with mono_exon_overlap are included in the total weighted count, while they are left out of unique counts.
After the regions are processed separately, the read assignment info is concatenated, sorted on the read name (which starts with barcode and UMI info), and processed per block of assignments concerning the same read (has the same barcode and UMI, or the same read name if those were not found).Reads uniquely assigned to one isoform are copied directly into the final read assignment file.Multiple assignments in a block are first filtered: assignments less informative than others (significantly shorter, not in a gene, while others match a gene/isoform) in the block are removed.Additional information that will serve as correction factors in the counting step is added to the read assignments: umicount (the number of reads having this same UMI), ambiguity (the number of isoforms supported by the read), gambiguity (the number of genes/locations supported), while information such as covered_pct (how much of the isoform is covered by the read) and inconsistency (level of inconsistency the alignment shows with the isoform structure) will also be used in providing different types of counts.Finally, the read assignments file is (re)sorted according to genomic location.
Scywalker returns different types of counts as separate fields in the results.These are calculated simultaneously by parsing the read assignments file and keeping a tally for each type of count and cell.A weight is added to the appropriate tallies for each read assignment.For the "weighted" isoform count, a weight of 1/(ambiguity � umicount) is added, while the "unique" count only takes reads into account that uniquely support one transcript, and "strict" only counts unique reads that cover ≥90% of the assigned transcript.Similar results are provided in the "aweighted," "aunique," and "astrict" columns, but limited to reads with a detected polyA tail.For the default gene "count" a weight of 1/(gambiguity � umicount) is added, including for reads mapping to introns of the genes (which is also the default for Cell Ranger), while "nicount" only counts reads matching an isoform.

Scywalker cell filtering and typing
Droplets that contain nuclei are determined based on the EmptyDrops algorithm (Lun et al. 2019).To improve cell clustering, novel genes detected by scywalker are excluded.Cells are clustered at a resolution of 0.5 using Seurat's K-nearest-neighbor graph-based approach based on the gene expression levels.The algorithm groups cells into communities based on similarity in their feature profiles.A high resolution would identify relatively large amounts of smaller clusters, while low values would result in relatively fewer clusters that group together more cells.In our case, the resolution is set slightly lower than the default in Seurat (0.8) to avoid over-clustering (identifying a large number of small communities) in this initial clustering step and focus only on capturing the large, overarching cell populations in the data.This will make their automatic annotation in the following step more robust.
Each cell is assigned to one of the cell types in a userprovided marker file using ScType (Ianevski et al. 2022) and scSorter (Guo and Li 2021).Users can instead opt to give only the tissue from which the sample originates, in this case, only ScType will be run using its general (human) marker database.The results are tab-separated group files (one for each typer) that link cell to cell type.These are then used together with the per-cell gene and isoform files to generate pseudobulk files giving gene and isoform counts for each cell type (in separate columns).When running on a project with multiple samples, Scywalker will make multi-sample, multicell-type count tables.Novel isoforms with the same junctions but differences at the ends from different samples will be merged into one line.The resulting pseudobulk files can be used directly for downstream analysis.Users can also inspect (recommended) and adapt the automatically assigned cell annotations, e.g. a more in-depth annotation with careful expert considerations.After recording adapted cell type annotations in a group file, a user can easily re-run scywalker's pseudobulk module to obtain pseudobulk matrices for the newly assigned populations.

Scywalker reporting
Alignment metrics from the minimap2 bam file are analyzed using cramino in spliced mode (De Coster and Rademakers 2023), additionally complemented with plots and metrics from identified nuclei and transcript assignment and summarized in an HTML report.The report provides key metrics of the sequencing and cell and gene identification, several interactive plots, such as a knee plot and quality control filters, and information on the read length, including a histogram and a comparison of read lengths against the number of exons detected per read.Furthermore, a histogram shows the number of genes identified per cell and the overlap between reads and known or novel genes, antisense, intergenic, or intronic intervals.Finally, the report includes the UMAP for cell type identification.

Comparison to existing methods
We compared the scywalker pipeline against the following existing workflows: BLAZE-FLAMES (You et al. 2023), respectively versions v1.1.0and v0.1, and the ONT wf-singlecell v0.2.8 (Oxford Nanopore Technologies 2024).Gene level UMI counts for BLAZE-FLAMES were obtained by summing transcript counts per gene.While scywalker outputs a UMI-corrected total cell read count table (independent of the transcript or gene level counts) as well, for accurate comparison of UMI counts per cell between workflows, we considered the cell level UMI counts that were obtained by summing the gene level UMI counts available in all compared workflows.Performance across different workflows was evaluated based on compute time and the correlation of the UMI counts per cell and per gene compared against the shortread quantification.For these correlation comparisons, we only considered the UMI counts per cell and per gene if they were higher than 1 in either of the compared workflows.Moreover, we only included the genes commonly available in the gene annotation references (i.e.GTF/GFF3 files) of the compared workflows, excluding rRNA genes from the comparisons for the plant samples.All software was run using standard parameters (code in Supplementary code), except a maximum edit distance (MAX_DIST) of 1 was used instead of the advised MAX_DIST ¼ 2 setting in match_cell_barcode step of the FLAMES workflow for the brain sample because of the higher accuracy of sequencing.We also tested with MAX_DIST ¼ 2, which resulted in a substantially lower correlation both for the cell UMI counts comparison (R ¼ 0.287 versus R ¼ 0.888, Supplementary Fig. S6) and for the gene/ cell UMI counts comparison (R ¼ 0.644 versus R ¼ 0.868, Supplementary Fig. S6).The correlation plots were generated in R (v4.2.0) using the following packages: ggplot2 (Wickham 2016), ggpairs [as implemented in GGally (Schloerke et al. 2024)], and ggpubr (Kassambara 2023).

Annotation of brain cell types
Scywalker's single-cell module creates a Seurat object using the retained cells after EmptyDrops filtering.It then follows the standard Seurat processing workflow with default parameters, except for a slightly lower resolution of 0.5 during clustering.Clusters were then annotated using scSorter and ScType, with the human brain marker genes for the major cell populations (excitatory neurons, inhibitory neurons, astrocytes, microglia, oligodendrocytes, OPCs, endothelial cells, and pericytes, Supplementary Table S2).These markers were selected from a list of literature-based genes based on their segregating capacity, which was determined using Garnett (Pliner et al. 2019).Markers were only used for annotation if they had an ambiguity of <3% in the human brain dataset.Counts were summed per cell population based on these annotations.

Differential transcript usage between neuronal and glial cells
We used the resulting pseudobulk counts (weighted) as a metric to assess scywalker's performance at the transcript level.Only transcripts detected in at least half of the samples within a given cell type were retained for subsequent analysis.Transcript counts of the four studied samples were summed for each cell type.The plots to visualize the performance were generated using R (v4.3.2) and the following packages: ggplot2 (Wickham 2016), ggpubr (Kassambara 2023), and ggupset.DTU analysis was conducted using the DRIMSeq R package (Nowicka and Robinson 2016), implementing a Dirichlet-multinomial model.We compared neuronal and glial cell types, with the analysis adjusted for individual variations and conditions (Alzheimer's Disease or neurologically normal).A two-stage statistical procedure with stageR (Van den Berge et al. 2017) accounted for multiple testing.In brief, the DRIMSeq gene P-values were analyzed in a screening stage to determine which genes show signs of DTU (gDTU).In gDTUs, the DRIMSeq transcript-level P-values were individually tested for DTU in a confirmation stage, resulting in a corrected P-value for each gene and each transcript of a significant gene.The differential transcript usage plots were made using the viz_transcripts tool included with scywalker, which is based on ggtranscript (Gustavsson et al. 2022) and ggplot2 (Wickham 2016).

Overview of the scywalker workflow
Scywalker is an integrated workflow for analyzing long-read single-cell sequencing data, currently tailored to the 10x Genomics microfluidics platform.Scywalker orchestrates a complete workflow from FASTQ to cell-type demultiplexed gene and isoform discovery and quantification.It consists of three main modules (Supplementary Fig. S1).
In the first step, 10x droplet barcodes are detected and assigned to all reads using a novel method: analogous to BLAZE and wf-single-cell, scywalker starts by sorting barcodes according to the number of reads they were found in.Unlike BLAZE and wf-single-cell, which select cell barcodes using a cut-off (based on the expected number of cells), scywalker detects droplet-barcodes (i.e.putative cell barcodes as well as empty droplets) using a stepwise procedure, to be used for the statistical identification of valid cell barcodes (Lun et al. 2019) later after gene quantification.The top barcode is selected as an accurate droplet barcode, and all barcodes with a single nucleotide difference are searched and added to this top barcode, eliminating them from further selection.This process is repeated for the remaining barcodes until the highest-ranked remaining barcode has 20 reads or fewer.As such, the software detects droplet barcodes well into the "empty drops" levels needed by the sophisticated EmptyDrops algorithm (Lun et al. 2019) to determine which barcodes represent real cells at a later stage.As identifying mismatched barcodes is integrated into the barcode detection, the module can immediately create barcoded FASTQ files where each read has its corrected droplet barcode and UMI sequence added.
In the second module, the droplet barcoded reads are aligned to the reference genome, and isoforms (and genes) are first detected and quantified using a method based on IsoQuant bulk analysis (Prjibelski et al. 2023) for the nonorganelle chromosomes.Gene counting for organelles is done separately using a specific method (see Section 2) because organelles often pose problems to standard isoform callers due to the different transcription structure (e.g.polycistronic) and extreme read counts.IsoQuant and organelle transcript identification produce files specifying the assignment of reads (and thus cell barcodes) to isoforms.This data is used together with the isoform output to produce the quantification of isoforms and genes per droplet barcode in the final step of this module.Based on extra information in the read assignment data (including detection of the polyA tail, completeness of the match, support for more than one isoform), different types of count are included in the result: weighted (1/n for reads supporting n isoforms), unique (counting only reads uniquely supporting the isoform), strict (only counting unique reads that are 90% complete), and analogous using only reads for which a polyA tail was detected.
Using various approaches initially developed for short-read single-cell analysis, the third module starts by filtering out lowquality and empty droplets from the output of the previous step (Lun et al. 2019), and the filtered results are processed using Seurat (Butler et al. 2018) for normalization, scaling, and clustering.The cell barcodes are automatically assigned to cell types using ScType and scSorter.Finally, pseudobulk counts per cell type are generated based on these assignments.Scywalker can also be used to generate pseudobulk results for user-supplied groupings.When analyzing multiple samples together, scywalker will also make multi-sample, multi-cell type count tables, matching novel isoforms with the same junctions that may have differences at the ends between different samples.
Scywalker supports scalable parallelization.In order to reduce memory use and improve performance, most steps are subdivided into smaller jobs (indicated by overlapping boxes in Supplementary Fig. S1), which are efficiently distributed over different processing cores, either on the same computer or over different computers in a cluster.Scywalker is distributed as a portable application directory that contains all dependencies and runs on any Linux system without installing or setting up environments.This facilitates workflow installation and execution, especially on clusters where root access is not always possible, the external network is not necessarily available, and systems may be heterogeneous.

Scywalker accurately quantifies cells and genes
We isolated single nuclei from four adult human brains and performed droplet barcoding and cDNA generation using 10x Chromium, aiming to get �10 000 nuclei per sample (hereafter mentioned as samples brain 1-4).The resulting libraries were sequenced on ONT PromethION (P24) and Illumina NovaSeq 6000 (run details see Supplementary Table S1).Short-read sequencing data is not needed for the scywalker analysis, but it provides a reference for evaluating the long-read analysis results on the cell and gene count level.We compared the UMI-corrected read counts per cell barcode found by scywalker in the long reads and Cell Ranger for the short read data.The correlation between the two (Supplementary Fig. S2) is very high (R ¼ 0.984, 0.993, 0.995, 0.994, respectively for brain1, brain2, brain3, and brain4), showing that scywalker accurately finds cell barcodes and assigns reads to them.Furthermore, scywalker also produces accurate results at the gene count per cell level when compared to the short-read data (Fig. 1).While some genes are only found in either short or long-read data here, these are, with a few exceptions, mostly confined to lower-expressed ones, and the overall correlation between short and long read data is high (R ¼ 0.934, 0.932, 0.931, 0.914, respectively for brain1, brain2, brain3, and brain4).

Comparison to other software
While we had successfully tested the wf-single-cell pipeline provided by ONT [formerly known as Sockeye (Oxford Nanopore Technologies 2024)] on smaller datasets, it could not handle the size of the single-nuclei brain dataset (>10 000 nuclei, >100M reads per sample).As an alternative, we tried the combination of BLAZE (You et al. 2023) for barcode discovery and FLAMES (Tian et al. 2021) for isoform analysis, which ran successfully but took over a week to complete.To properly compare scywalker to both wf-singlecell and BLAZE-FLAMES and show that scywalker also works on data from less accurate data from earlier iterations of the sequencing chemistry and basecaller versions (Guppy 3.1.5),we downloaded the scmixology2 (GSE154870) dataset, containing single-cell data of a mixture of equal proportions of five human lung adenocarcinoma cell lines sequenced with short and long reads (Tian et al. 2021, You et al. 2021).We used this smaller dataset (target 183 cells, 22M reads) to benchmark the different packages.
Scywalker shows a higher correlation (R ¼ 0.942) with the short read data than wf-single-cell (R ¼ 0.734) and BLAZE-FLAMES (R ¼ 0.65) at the cell identification level (Supplementary Fig. S3).The largest differences for scywalker were cells detected only in the ONT data, while the other two packages mainly missed cells found in the short read data.Moreover, at the level of gene counts per cell, scywalker also shows a higher correlation (R ¼ 0.89) than wfsingle-cell (R ¼ 0.765) and BLAZE-FLAMES (R ¼ 0.563) (Supplementary Fig. S4).Scywalker includes downstream analysis, including cell type identification.Scywalker found the expected five clusters and could assign them to the five cell lines using a custom marker set (Supplementary Fig. S5).Using the brain1 dataset for comparison also confirmed that the correlation of the cell (Supplementary Fig. S6) and gene quantification (Supplementary Fig. S7) compared with the short-read results was also higher for scywalker (R ¼ 0.984 and R ¼ 0.934) than for BLAZE-FLAMES (R ¼ 0.888 and R ¼ 0.868).A comparison of the run times for this smaller scmixology2 dataset, tested on a system with a 24-core EPYC 7443P and 512G memory, shows comparable results for the three tools, with BLAZE-FLAMES being the fastest (3h19), scywalker second (3h37) and wf-single-cell the slowest 7(5h41).The analysis of the larger brain1 sample using scywalker took 14h33, or around four times as long, on the same benchmark system.However, the analysis of this larger dataset using BLAZE-FLAMES took 172h06, or over 50 times as long as the smaller scmixology2, demonstrating the improved scalability of scywalker.

Differential usage of transcripts between brain cell types
Scywalker successfully assigned the different brain nuclei cell types (Fig. 2, Supplementary Fig. S8) for the four brain samples, generated pseudobulk files for both gene and isoform counts, and combined these in multi-sample, multi-celltype gene and isoform count files.
To evaluate the results of scywalker at the isoform level, we first compared the number of identified isoforms across cell types using the pseudobulk counts from the four brain samples.Excitatory neurons exhibited the highest number of identified isoforms, of which 11.3% were novel (i.e.not in the gencode v42 reference set), followed by inhibitory neurons and oligodendrocytes (Supplementary Fig. S9A).Endothelial cells, the cell type with the lowest cell count, also exhibited the fewest identified isoforms (Supplementary Fig. S9B).Additionally, we observed a correlation between the number of isoforms and the count of novel isoforms (R ¼ 0.99, P-value ¼ 1.9×10 − 5) (Supplementary Fig. S9B).Scywalker was also able to identify multiple transcripts per gene, as shown in Supplementary Fig. S9C.Among the identified isoforms, considered if present in a minimum of 2 samples in at least one cell type (188 806 isoforms), �40% of them were present in at least five distinct cell types, and 26% were neuron-specific (Supplementary Fig. S9D).
Next, we performed a differential transcript usage (DTU) analysis to assess the capability of scywalker to detect isoform expression variation.We compared the proportions of isoforms within each gene between neuronal cell types (excitatory and inhibitory neurons) and glial cell types (oligodendrocytes, astrocytes, OPCs, and microglia).We found 301 genes with DTU (gDTU) (FDR < 0.05), allowing the identification of neuron-specific isoforms of specific genes such as SEPTIN8 (gFDR ¼ 6.16×10 − 46, ENST00000378719.7 tFDR ¼ 1.09×10 − 57) (Fig. 3A).Septins are known to undergo extensive alternative splicing.In the case of SEPTIN8, it was previously suggested that some of its transcripts are ubiquitously expressed across tissues, while others show most expression in the central nervous system (Hall et al. 2005).Additionally, in 24 other genes, novel isoforms were found to be differentially used between neurons and glia, e.g. for PNKD (gFDR ¼ 2.21×10 − 16, novel transcript tFDR ¼ 1.38×10 − 9, Fig. 3B).

Scywalker extends beyond human nanopore sequencing
To showcase its broad applicability in the analysis of nonhuman species, we also tested scywalker on two large plant single-cell datasets.For this, protoplasts were prepared from mature A.thaliana leaves, followed by the generation of cellbarcoded cDNA using the 10x Chromium, obtaining �10 000 cells per sample.The cDNA was subsequently sequenced on both NovaSeq 6000 and ONT PromethION P24.The short read data (377M reads and 467M reads) was analyzed using Cell Ranger, while the long read data (220M and 230M reads) was analyzed with scywalker.As for the human brain samples, the short read (Cell Ranger) and long read (scywalker) UMI corrected read counts per cell showed excellent correlation (R ¼ 0.973 and R ¼ 0.988 respectively, for plant1 and plant2) (Supplementary Fig. S10).
Similarly, for the plant leaf samples, the UMI gene counts were compared for all genes (except rRNA genes) in cells that had more than a single count in at least one of the two datasets.Here, the correlation between scywalker and Cell Ranger counts is also high (R ¼ 0.89 and R ¼ 0.898 respectively for plant1 and plant2) (Supplementary Fig. S11), thus demonstrating that scywalker provides reliable single-cell expression metrics in human and plant long-read scRNAseq samples.
After data preprocessing and cell gene count estimation, scywalker performed follow-up single-cell analysis, including

Discussion
Long-read single-cell sequencing is essential for a complete transcriptome reconstruction and cell-type-specific alternative isoforms.Compared to short-read sequencing, long reads provide a better resolution at the gene level in organisms with incomplete or inaccurate annotation toward the transcript ends.At the transcript level, more reads are unambiguously assigned to a transcript, offering far better chances of detecting and reconstructing novel isoforms.
As a pilot experiment, we sequenced single-nuclei transcriptomes from four human brain samples on the PromethION P24 platform (ONT).In order to obtain a sufficient number of cells from cell types present at lower abundance and to reduce potential sampling errors at that level, we aimed to obtain >10 000 nuclei per sample.The runs generated, on average, >100 million reads per sample.Available analytical workflows did not scale well to this number of cell barcodes and reads.We developed the scywalker workflow, making use of novel algorithms and extensive parallelization, which could handle these data readily, analyzing one sample in <15 h on a single 24-core server, and is further scalable to high throughput by running on a cluster.Scalability is important as the field evolves toward datasets with more cells at higher depths as the technology matures.The initial implementation of the pipeline was tailored to nanopore sequencing of single-cell libraries from the 10× Genomics microfluidics system but also supports PacBio sequencing data, and is sufficiently flexible to be adapted to accommodate technologies that result in a different read structure, if the need arises.
Although long-read sequencing provides more extensive results at the transcript level, short-read sequencing is the current standard for single-cell gene quantification due to its lower cost, higher per-nucleotide accuracy, and wellestablished analysis tools.Therefore, we also sequenced the same libraries on an Illumina short-read platform to serve as a reference for benchmarking cell detection and gene quantification in the long-read data.We used the correlation between long-read and short-read data as a measure of the accuracy (even though it is probable that the long-read results are more accurate due to less multi-mapping for some genes).Using this approach, we showed that, combined with the latest improvements in nanopore sequencing accuracy,  scywalker's algorithms could generate single-cell gene counts that are very highly correlated (R > 0.9) with results from short reads and that these can also be successfully used for downstream analyses such as cell typing, thus obviating the need for dual sequencing.Even for the older and less accurate iterations of nanopore sequencing data, the correlation for the scywalker results was high (R ¼ 0.89), and the results were usable for downstream analysis.Regarding cell and gene level accuracy, Scywalker also outperformed the other tools (in the instances where they could be tested).In order to ensure the general applicability of scywalker outside of human (or even animal) data, we also generated two plant single-cell datasets (A.thaliana), which were analyzed successfully by scywalker, also showing very high correlation with their respective short-read results.
Isoform discovery in scywalker is based on IsoQuant, a proven tool showing good results in bulk long-read RNA-seq analysis (Prjibelski et al. 2023).In scywalker, the initial discovery is performed without taking cell barcodes into account, i.e. on bulk cDNA.This way, isoforms that have low expression in individual cells but are generally present are not discarded due to too low read counts.Based on the read assignments (to isoforms) and their barcodes, gene and isoform counts per cell are generated afterward.Due to experimental issues such as fragmentation, degradation, and premature incorporation of the template-switching oligonucleotide, even in long-read sequencing, most reads do not represent complete isoforms (e.g.Supplementary Fig. S13) and thus cannot always be assigned unambiguously to specific transcripts.Scywalker incorporates several ways of dealing with this ambiguity by calculating counts by (i) using reduced weight for ambiguous read assignments, (ii) counting only uniquely assigned reads, (iii) counting only (nearly) complete reads, the latter being, e.g.useful for assessing actual support for novel isoforms with different ends.One of the main goals of long-read sequencing is comparisons at the transcript level, and scywalker provides considerable utility toward that goal.The workflow includes determining cell types, generating pseudobulk data (per cell type), and creating a multi-sample pseudobulk count matrix.As illustrated in the results, this can be used for downstream analysis, such as differential isoform usage.
High-throughput nanopore single-cell sequencing has great potential for uncovering detailed isoform usage across cell types and samples.Scywalker unlocks this potential by providing a highly scalable pipeline capable of handling nanopore samples comprising more than 10 000 cells each and over 100 million reads per sample at volume while simultaneously obtaining excellent accuracy in cell identification and gene quantification.Furthermore, scywalker incorporates automated cell-type assignment, pseudobulk isoform count generation, and sample comparison in this single-command endto-end workflow.We also demonstrated that the results can be directly used to identify differential transcript usage, including the detection of novel transcripts.

Figure 1 .
Figure 1.Scywalker UMI counts per gene and cell compared to their respective short-read Cell Ranger results for the four human brain samples.Samplespecific Pearson correlation coefficients (R) are shown on the upper left corners of each panel.y-axis, scywalker UMI counts per gene and cell from longread sequencing data; x-axis, Cell Ranger UMI counts per gene and cell from short-read sequencing data.SRS, short-read sequencing; UMI, unique molecular identifier.

Figure 2 .Figure 3 .
Figure 2. UMAP plot generated by scywalker based on gene counts, showing cell-type assignments by ScType in different colors for the brain1 sample.