Deconvolution of the tumor-educated platelet transcriptome reveals activated platelet and inflammatory cell transcript signatures

Tumor-educated platelets (TEPs) are a potential method of liquid biopsy for the diagnosis and monitoring of cancer. However, the mechanism underlying tumor education of platelets is not known, and transcripts associated with TEPs are often not tumor-associated transcripts. We demonstrated that direct tumor transfer of transcripts to circulating platelets is an unlikely source of the TEP signal. We used CDSeq, a latent Dirichlet allocation algorithm, to deconvolute the TEP signal in blood samples from patients with glioblastoma. We demonstrated that a substantial proportion of transcripts in the platelet transcriptome are derived from non-platelet cells, and the use of this algorithm allows the removal of contaminant transcripts. Furthermore, we used the results of this algorithm to demonstrate that TEPs represent a subset of more activated platelets, which also contain transcripts normally associated with non-platelet inflammatory cells, suggesting that these inflammatory cells, possibly in the tumor microenvironment, transfer transcripts to platelets that are then found in circulation. Our analysis suggests a useful and efficient method of processing TEP transcriptomic data to enable the isolation of a unique TEP signal associated with specific tumors.


Introduction
Liquid biopsy has garnered increasing interest as a method of interrogating tumors before treatment, immediately after treatment to test for minimal residual disease, after treatment to assess for recurrence, and even before diagnosis as a screening method (1)(2)(3). Several liquid biopsy methods have been proposed and developed, including circulating tumor cells, circulating tumor DNA, and extracellular vesicles. These techniques may identify the presence of tumors, characterize their biological properties, and may be used to tailor therapy precisely to the molecular features of the cancer cells. Furthermore, liquid biopsy may offer advantages over direct tumor biopsy, beyond the obvious technical advantage that liquid biopsy is substantially less invasive. For example, direct biopsy of the tumor does not always reflect tumor heterogeneity if the biopsy procedure samples only part of the tumor, whereas liquid biopsy is more likely to account for tumor heterogeneity, potentially allowing for further tailoring of therapy (4).
For primary brain tumors, however, traditional liquid biopsy methods are not sufficiently effective (5). One report noted that less than 10% of patients with glioma had detectable circulating tumor DNA (6). Similarly, exosomes can aid in the detection of epidermal growth factor receptor variant III (EGFR-vIII), a molecular alteration found in about 30% of glioblastoma, but with specificity and sensitivity comparable to those of brain MRI (7). The inability to detect primary brain tumors using liquid biopsy presents an especially acute problem, because many primary brain tumors have high rates of recurrence (8). Glioblastoma, the most common and lethal primary brain tumor in adults, has a median survival of 21 months even with standard-of-care treatment, including resection, adjuvant radiotherapy with concurrent and adjuvant temozolomide, and tumor-treating fields (9). Tumor recurrence is generally considered to be inevitable, but is difficult to detect correctly due to the phenomenon of "pseudoprogression," in which subacute effects of radiotherapy and chemotherapy can imitate the appearance of an enlarging, recurrent tumor on MRI (10,11). This highlights the need for accurate methods to detect the presence or recurrence of high-grade brain tumors through methods that may complement imaging, such as liquid biopsy.
A recent report by Sol et al. demonstrated the feasibility of using "tumor-educated platelets" (TEPs) as a liquid biopsy for glioblastoma (12). Multiple studies have demonstrated that the platelet transcriptome in patients with cancers of multiple origins differs from that of healthy controls (13)(14)(15). Several mechanisms have been proposed for "tumor education," including the suggestion that the tumor transfers mRNA transcripts directly to platelets (14,16), or that patients with cancer have preferential upregulation of reticulated platelets with differing transcriptomes (17), or that the tumor produces factors that alter platelet mRNA splicing and thus modulate gene expression (18). However, none of these mechanisms has been definitively demonstrated, limiting the ability to precisely define the TEP signal and develop its use as an assay with clinical utility in detecting initial tumor burden, minimal residual disease, or tumor recurrence.
Studies of TEPs in the clinical sphere have focused on using supervised machine learning algorithms to provide a clinically utilizable computational model (12,17). For example, Sol et al. identified 203 genes that constitute a TEP signal that distinguishes platelets in patients with glioblastoma from platelets in healthy controls. However, these genes were not established as glioma-related. Furthermore, many high-ranking genes in the signal, such as CA1, CD163, and S100A12, are typically associated with non-platelet circulating cells, further raising the question of what mechanism drives the expression of these transcripts in platelets. In this study, we used an interpretable, unsupervised machine learning algorithm to elucidate the potential mechanism underlying tumor education of platelets.

Results
To test the hypothesis that platelets receive RNA transcripts directly from the tumor, we implanted tumors derived from the GS 8-11 glioma stem cell line into nude mice and drew platelet samples from these mice.
The platelet transcriptomes of these mice were compared to those of control mice using the BBSplit tool (19) to separate transcripts mapping to the human genome (hg38) from those mapping to the mouse genome (mm10) (Figure 1a). An average of 0.35% of transcripts from the platelets of the tumor-implanted mice were mapped to the human genome, compared to 0.53% of transcripts from the platelets of the control mice (Figure 1b). Based on these results, it appears unlikely that there is direct transfer of transcripts from the tumor to circulating platelets. Furthermore, the identities of transcripts mapped to the human genome were predominantly similar between the tumor-implanted and control mice (Figure 1c), suggesting that these are not true human transcripts, but reflect incorrect mapping of mouse transcripts to the human genome.
Additionally, differential expression testing using DESeq2 (20) showed that no transcript mapping to the human genome was significantly upregulated in the tumor-implanted mice, though several transcripts mapped to the mouse genome were significantly upregulated or downregulated in the tumor-implanted mice (Figure 1d).
Given that the direct transfer of RNA from the tumor to circulating platelets appears unlikely, we attempted to find alternative explanations for the observed changes in the platelet transcriptome associated with GBM.
Since many of the transcripts included in the TEP signature are not typically associated with the platelet transcriptome, we hypothesized that the platelet transcriptome sequencing data require further processing to remove contaminant transcripts that are not platelet-derived and to isolate the correct signature associated with the presence of tumor. We used the program CDSeq (21) to deconvolute the platelet transcriptome in an unbiased manner using an unsupervised machine learning algorithm based on latent Dirichlet allocation (see Methods). The algorithm takes the read counts from a group of RNA-seq experiments as input and deconvolutes them into read counts for individual cell types. We used published data of platelet transcriptomes of healthy controls and patients with GBM (12), with the entire dataset, containing both patients with GBM and healthy controls, used as input to the algorithm. Because the number of cell types must be provided as input but was not known a priori, we ran the algorithm for varying numbers of cell types from 2 to 60 (Supplementary Figures 1-2). Although the log-posterior of the output continued to increase with additional cell types, the number of cell types corresponding to non-platelets did not increase substantially after 10 cell types. The results of the deconvolution algorithm using 2-10 cell types are presented in Figure 2a. Annotation of cell types was performed by correlating the gene expression profile of each cell type with published annotated single-cell RNA-seq gene expression profiles (22) (Supplementary Dataset 1). Multiple related single-cell types (e.g., CD4+ T cells, CD8+ T cells, and other T cells) were frequently highly correlated with a bulk cell type, in which case the most general cell type annotation is shown. We note that when dividing the read counts into 2 cell types, the smaller fraction appears to correspond to non-platelets, and this non-platelet cell type continues to be subdivided further by increasing the number of cell types. Importantly, 21.9% of the total reads in these platelet samples were derived from non-platelet cell types, based on the division of the transcriptome into platelet and non-platelet fractions.
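The mixture model behind this kind of deconvolution can be illustrated with a small forward simulation. The sketch below is not CDSeq and performs no inference; it only shows, under stated assumptions, how bulk reads can be thought of as draws from cell-type-specific expression profiles weighted by per-sample proportions, and how a non-platelet read fraction follows once reads are attributed to cell types. The function names, toy sizes, and Dirichlet parameters are hypothetical, and the proportions are treated as read-level weights rather than cell counts.

```python
import random

def dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_bulk_sample(profiles, proportions, n_reads, rng):
    """Assign each read to a cell type, then to a gene from that type's profile.

    Returns counts[t][g]: reads of gene g that originated from cell type t.
    """
    n_types, n_genes = len(profiles), len(profiles[0])
    counts = [[0] * n_genes for _ in range(n_types)]
    for _ in range(n_reads):
        t = rng.choices(range(n_types), weights=proportions)[0]
        g = rng.choices(range(n_genes), weights=profiles[t])[0]
        counts[t][g] += 1
    return counts

def non_platelet_read_fraction(counts, platelet_types):
    """Share of all reads attributed to cell types not labeled as platelets."""
    total = sum(sum(row) for row in counts)
    non_plt = sum(sum(row) for t, row in enumerate(counts)
                  if t not in platelet_types)
    return non_plt / total

rng = random.Random(0)
n_types, n_genes = 3, 50  # toy sizes, chosen only for illustration
profiles = [dirichlet([0.5] * n_genes, rng) for _ in range(n_types)]
proportions = dirichlet([5.0] * n_types, rng)
counts = simulate_bulk_sample(profiles, proportions, 2_000, rng)
# Treat cell type 0 as the only platelet type in this toy example.
fraction = non_platelet_read_fraction(counts, platelet_types={0})
```

A deconvolution algorithm works in the opposite direction: it observes only the per-gene totals summed over cell types and infers the profiles and proportions.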
We arbitrarily chose to examine the results using 8 cell types because with more cell types, there was little change in the non-platelet cell types, and further divisions only served to increase the number of platelet cell types (Supplementary Figure 1a). The relative abundances of these cell types in all the samples are shown in Figure 2b. Notably, because the amount of RNA in platelets is substantially lower than that in other cell types in peripheral blood, the percentage of cells that were non-platelets in these samples was lower than the percentage of reads that were not platelet-derived. Clustering of the cell types based on gene expression profiles (Figure 2c) also demonstrated that cell types annotated as platelets (A, B, C, and D) were more similar to one another than to the other cell types and together formed a clustering branch distinct from the non-platelet cell types. Figure 2d shows a representation of the gene expression in each of the 8 cell types. The non-platelet cell types are seen clearly to be non-platelets, with the cell type annotated as "erythrocytes" (cell type E) having high levels of hemoglobin reads (HBB, HBA1, HBA2), the cell type annotated as "T cells" (cell type H) having an overrepresentation of T-cell receptor reads (T-cell receptor diversity domains and T-cell receptor alpha joining domains), and the cell type annotated as "monocytes" being enriched in LYZ, CD74, and DDX5. Cell type G likely represents a subset of leukocytes because, upon examination of the diagram in Figure 2a, the transcripts in cell type G trace back to a common larger cell type that includes other white blood cells; additionally, this cell type contains a high proportion of ribosomal protein RNAs (e.g., RPL7 and RPS18), which are considerably more abundant in non-platelet circulating cells than in platelets, as seen in single-cell data sets (Supplementary Figure 3). We also found that the proportion of non-platelet transcripts was considerably higher in samples derived from specific institutions (Supplementary Figure 4), suggesting that the presence of non-platelet transcripts in the sample might be technique-dependent. For example, insufficient isolation of platelets by centrifugation, lysis of non-platelet cells, and/or release of their RNA during preparation may lead to contamination of platelet RNA. This may also partially account for the batch effects reported in tumor-educated platelet collections between hospitals (23). Of note, the protocol used for the platelet samples analyzed in this study did not apply leukodepletion during platelet purification (23), whereas other protocols do include this step (24,25), suggesting that platelet samples prepared with leukodepletion may contain lower proportions of these cell types.
We then examined how the four platelet types were represented in the platelets of patients with GBM compared to platelets in healthy controls. A representation of the different platelet types in healthy controls and patients with GBM is shown in Figure 3a. We note that the largest differences were seen in platelet types A and D, where patients with GBM had significantly higher levels of platelet type A and significantly lower levels of platelet type D. We selected genes that constituted more than 0.01% of reads and were upregulated or downregulated at least 10-fold in cell type A compared to cell type D (Figure 3b). We then identified Gene Ontology (GO) biological processes that were differentially enriched in upregulated genes compared to downregulated genes (26)(27)(28), as shown in Figure 3c. Given that many of the GO terms are associated with platelet activation ("platelet activation," "positive regulation of platelet activation," "wound healing," "hemostasis") or with cytoskeletal element function, which is important in platelet shape regulation after activation, we infer that cell type A represents activated platelets, while cell type D represents quiescent platelets. For the same reason, it is unsurprising that cell type D contains increased levels of RGS10 and RGS18, both of which encode proteins associated with inhibition of platelet activation (29). Cell types B and C had intermediate levels of platelet activation-related genes and RGS10 and RGS18 (Supplementary Figure 5), suggesting that they represent platelets with intermediate levels of activation between platelet types A and D.
We refer to the transcriptomes of platelet types A and D as Plt activated and Plt quiescent, respectively. Given the concern that higher levels of activated platelets might reflect increased platelet production and higher levels of reticulated platelets, we divided the dataset into samples with high levels of circular RNAs (greater than the median value) and low levels of circular RNAs (less than the median value), using predicted circRNA levels from the PTESFinder algorithm (30). Since circRNA levels are increased in older platelet samples, due to preferential decay of linear mRNAs over circRNAs (31), this should be a surrogate marker for overall platelet age. There was no significant difference between the fraction of circRNAs in samples from controls versus samples from patients with GBM (p = 0.07). Furthermore, our results were similar for the entire dataset as a whole compared to either half of the dataset (Supplementary Figure 6), suggesting that the higher levels of activated platelets are not strictly a function of increased or decreased platelet production.
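The gene-selection criteria used above to contrast cell types A and D (more than 0.01% of reads and at least a 10-fold difference) can be sketched as a simple filter. This is a minimal illustration, not the study's code: the function name is hypothetical, the gene names and values are placeholders, and it assumes the abundance threshold is applied to the larger of the two cell-type fractions, which the text does not specify.

```python
def select_genes(profile_a, profile_d, min_fraction=1e-4, min_fold=10.0):
    """Split genes into those up- or downregulated in A relative to D.

    profile_a / profile_d map gene name -> fraction of that cell type's reads.
    A gene is considered only if it exceeds min_fraction (0.01%) in at least
    one of the two profiles (an assumption made for this sketch).
    """
    up, down = [], []
    for gene in sorted(profile_a.keys() & profile_d.keys()):
        a, d = profile_a[gene], profile_d[gene]
        if max(a, d) <= min_fraction:
            continue  # too rare in both cell types
        if a >= min_fold * d:
            up.append(gene)      # at least 10-fold higher in A
        elif d >= min_fold * a:
            down.append(gene)    # at least 10-fold higher in D
    return up, down

# Placeholder profiles, for illustration only (not data from the study).
profile_a = {"GENE1": 0.02, "GENE2": 0.0001, "GENE3": 0.01, "GENE4": 0.00005}
profile_d = {"GENE1": 0.001, "GENE2": 0.005, "GENE3": 0.008, "GENE4": 0.000001}
up, down = select_genes(profile_a, profile_d)
```

Comparing with multiplication rather than a ratio avoids division by zero when a gene is absent from one profile.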
To demonstrate that changes in the transcriptional profile of these subpopulations correlate with differences in the platelet phenotype, we performed single-cell RNA-seq with CITE-seq (32) on peripheral whole blood from one healthy control mouse and one mouse with an implanted non-small cell lung cancer (NSCLC) tumor. After identifying the subset of data corresponding to platelets and clustering the platelets based on gene expression, three clusters of platelets were identified, which we name P1, P2, and P3 (Figure 4a). P3 appears to be a sub-cluster of P2 and is predominantly observed in platelets from the tumor-bearing mouse. Platelets in cluster P3 had especially high levels of CD41 expression (Figure 4b-d), suggesting increased platelet activation (33)(34)(35). When examining the P1, P2, and P3 clusters at the transcriptomic level (Figure 4e), P2 and P3 together had an altered transcriptional profile compared to P1, with genes highly expressed in P2 and P3 including those seen at higher levels in Plt activated than in Plt quiescent, such as Myl6, Sh3bgrl3, and Actb. Cluster P1 contains higher levels of genes associated with megakaryocytes, including Rock2 (36), Daam1 (37), and Gata2 (38), as well as genes previously associated with "young" platelets such as H3f3b (39), suggesting that these represent young platelets that have more recently been released from megakaryocytes and are thus less likely to have been activated. P1 also contained higher total read counts (Figure 4f), which is consistent with this cluster representing younger platelets, since older platelets have undergone mRNA decay and, being anucleate, cannot synthesize new mRNA. P2 and P3 had higher levels of transcripts associated with platelet activation, including Itga2b, Sh3bgrl3, and Myl9; lower levels of Rgs18, which is associated with negative regulation of platelet activation; and lower levels of Nt5c3 and Tsc22d1, which were highly expressed in the Plt quiescent subtype (Figure 4g).
We then returned to the platelet samples collected from control patients and patients with GBM, analyzing the cell type distribution of genes that were differentially upregulated in the platelet transcriptome of patients with GBM. We selected genes with greater than 1.25 log2-fold upregulation. Importantly, these genes do not include transcripts known to be upregulated in GBM cells. These were then clustered using a hierarchical clustering algorithm (see Methods); the clustering results are shown in Figure 5a. We then identified the distribution of these transcripts by deconvolution into 30 cell types (Figure 5b). Some of these transcripts are heavily represented in platelet cell types, whereas others are mostly represented in non-platelet cell types, including red blood cells, monocytes, and T cells. Figure 5c shows the expression of these transcripts in each platelet cell type. GBM-associated transcripts that are mostly expressed in platelets with little expression in non-platelets are expressed in many platelet cell types, with an apparent preference for activated platelet cell types (those that are most similar to Plt activated). This suggests that these genes are upregulated in the platelet transcriptome of patients with GBM because these patients have increased levels of activated platelets, as noted previously.
When examining GBM-associated transcripts that are mostly not expressed in platelets, as shown in Figure 5b, those that are mostly expressed in monocytes or T cells do not appear in most platelet cell types, including most activated platelet cell types. Rather, they are preferentially expressed in one of the platelet cell types, corresponding to the most highly activated platelet type. This suggests that these genes are mostly expressed in non-platelets but are also found in a small subset of activated platelets, which may indicate that these transcripts originate in circulating non-platelet cells and are then transferred to a small fraction of platelets. Of note, genes that are upregulated approximately equally in platelets, monocytes, and T cells have especially high expression in this one platelet subtype, and then lower expression in other activated platelets, suggesting that these are genes expressed in activated platelets, but also in non-platelet cells, which may then transfer some of these transcripts to platelets. GBM-associated transcripts that are expressed in red blood cells appear to be expressed in all activated platelet cell types, similar to transcripts that are predominantly expressed in platelets. This likely reflects the fact that red blood cells adhere directly to platelets to promote aggregation and degranulation (40), such that activated platelet cell types contain transcripts originating from red blood cells that are bound to platelets. Only one gene, WFDC1, is expressed predominantly in platelets and not in other cell types, but it also appears only in the one activated platelet type where these other genes are expressed. Of note, this activated platelet type is specific to GBM-associated platelets and not other inflammatory conditions; analysis of the different platelet subtypes in samples from patients with a non-neoplastic inflammatory condition, multiple sclerosis, did not show elevation of the GBM-associated platelet subtype or elevation of the Plt activated cell type more generally (Supplementary Figure 7). This analysis suggests that the deconvolution method used here could be employed in a supervised manner to predict whether tumor-educated platelets from patients with GBM are present. Indeed, we were able to reanalyze these data using a supervised latent Dirichlet allocation algorithm (41), splitting the dataset into training (80%) and validation (20%) subsets and using the algorithm to predict whether a sample represents a control patient or a patient with GBM using cell type composition. The algorithm successfully differentiates GBM samples from control samples (training: AUC 0.84, validation: AUC 0.83) (Supplementary Figure 8), using only cell type composition as a predictor variable and without direct use of read counts as predictors (see Methods). This suggests that cell type composition, including the presence of activated platelets and platelets containing ingested mRNAs, could be directly used for sample prediction.
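The evaluation scheme described above (an 80%/20% split followed by AUC on each subset) can be sketched independently of the classifier. The code below is a minimal illustration using only the Python standard library; it does not implement supervised latent Dirichlet allocation, and the function names, labels, and scores are hypothetical placeholders standing in for whatever per-sample scores a model trained on cell type composition would produce.

```python
import random

def train_validation_split(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into training and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample receives a
    higher score than a randomly chosen negative sample (ties count as half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Placeholder labels (1 = GBM, 0 = control) and scores, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)

train_ids, validation_ids = train_validation_split(range(10))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a dataset of a few hundred samples; a rank-based computation would scale better.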
To address the concern that TEPs might not contain platelet-related reads due to the primary malignancy being protected by the blood-brain barrier, we repeated the analysis for a second malignancy, non-small cell lung cancer (NSCLC). Indeed, in NSCLC, we found a similar pattern of findings. We again found that a substantial fraction of reads in analyzed blood samples did not correspond to platelets but to other cell types, including erythrocytes, leukocytes, and monocytes (Supplementary Figure 9). When the samples were deconvoluted into 7 distinct cell types, 4 of these types corresponded to platelets. Two of these cell types were significantly enriched in samples from patients with NSCLC, while a third cell type was significantly decreased in these patients (Supplementary Figure 10). The NSCLC-enriched fractions had similar profiles to the GBM-enriched fraction of platelets, corresponding to the activated subtype, with high levels of ACTB, ITGA2B, MYL6, and GP1BB; the NSCLC-depleted fraction had high levels of RGS18, similar to the GBM-depleted fraction. Similar to GBM, the genes highly expressed in platelets from patients with NSCLC include those predominantly expressed in platelets, which are mostly seen in activated platelet subtypes, and those predominantly expressed in non-platelets, which are confined to 1-2 platelet subtypes (Supplementary Figure 11). Our analysis of the NSCLC platelet samples also allowed for comparison to samples drawn from patients with NSCLC metastatic to the brain (Supplementary Figure 12). Of note, these samples had elevated levels of one of the 4 platelet subtypes, similar to primary NSCLC samples, corresponding to activated platelets. Additionally, when these samples from patients with brain metastases were deconvoluted into 30 subtypes, a subtype that was highly enriched in samples from patients with primary NSCLC, with high levels of IFITM3, IFI27, TPM2, and HBG2 reads, was similarly enriched in samples from patients with brain metastases.

Discussion
Tumor education of platelets has eluded mechanistic interpretation, hampering further development and use of this liquid biopsy method. Our analysis suggests that circulating platelets likely do not directly receive mRNA transcripts from the tumor. Instead, the TEP signal is found in a minority of platelets whose mRNA can be separated using computational means from a much larger sample of platelet mRNA. Our results suggest that a non-negligible fraction, approximately 20%, of the "platelet" transcriptome is in fact derived from non-platelet cells, serving to contaminate the transcriptome. If these transcripts are removed, the remaining transcripts fall on a spectrum from quiescent to activated platelets, and the presence of GBM increases the proportion of activated circulating platelets.
Activated platelets have increased levels of transcripts, including ITGA2B, MYH9, and ACTB, which are associated with the process of platelet activation, and lower levels of transcripts, including RGS18, which are associated with decreased platelet activation. Furthermore, a small subset of activated platelets contains mRNAs that constitute biomarkers that represent the presence of tumor. The latter mRNAs appear to be those that are also found primarily in non-platelet cells, including erythrocytes, leukocytes, and myelocytes.
Such non-platelet cells may interact with activated platelets in the tumor microenvironment, transferring inflammatory markers that can then be directly detected in the transcriptome of circulating platelets (42). This explanation is consistent with the finding that none of the TEP markers are transcripts known to be upregulated within the tumor itself. Furthermore, previous reports have suggested that the TEP signal reflects activated platelets (17), but transcripts representative of platelet activation, such as ITGA2B, have not been identified by supervised machine learning algorithms as part of the signature used to classify platelets as tumor-educated or control. Our analysis suggests that platelet activation is permissive of the TEP signal, but not sufficient, and that the TEP signal is contained in only a small fraction of activated platelets. We do not see a clear difference in the signature between localized disease and metastatic disease (in the case of NSCLC); this may reflect that our analytic method is not sensitive enough to distinguish such a difference, or that the TEP signal is only produced in a small fraction of platelets such that extent of disease could not be expected to increase the intensity of the signal.
Our analysis also suggests that removal of non-platelet transcripts is important for the detection and interpretation of the TEP signal. Other reports have analyzed the presence of ribosomal protein transcripts or hemoglobin transcripts found in the platelet transcriptome associated with tumor or benign conditions (43)(44)(45). However, we suggest that most of these transcripts originate from non-platelet cells in the blood that are lysed during sample processing. On the other hand, the small subset of activated platelets with TEP signal-associated transcripts contains transcripts that were seen mostly in monocytes and lymphocytes.
That is, in addition to transcripts in the sample that are indeed derived from non-platelets, platelet samples from patients with GBM also contain platelet-derived transcripts that are likely to have originated in non-platelets. One possibility to explain this finding is that these transcripts may be transferred to platelets from monocytes in the tumor microenvironment, either actively or passively, and may represent a useful biomarker of tumor presence. WFDC1 is the only gene that is upregulated in GBM-associated platelets and is expressed predominantly in platelets but appears only in a small subset of platelets associated with the TEP signal. This suggests that WFDC1 transcripts may originate from other non-blood cells, perhaps in the tumor microenvironment, and are then transferred to this subset of platelets. Notably, endothelial expression of WFDC1 is found in the brain vasculature comprising the blood-brain barrier in mice, where it has been found to regulate inflammation and wound repair (46). At sites of blood-brain barrier disruption, such as gliomas, WFDC1 may be highly expressed among endothelial cells interacting with platelets.
Deconvolution of platelet transcriptomic data may serve as a sort of computational "filter" to isolate platelet mRNA, which can then be analyzed further for signals representative of tumor presence. One notable limitation of this technique is the substantial computational time required to employ this deconvolution algorithm, which, as a Gibbs sampling Markov chain Monte Carlo algorithm, is not parallelizable. Other deconvolution algorithms that use simpler techniques, such as non-negative matrix factorization, are expected to be quicker, but may also be less accurate in removing non-platelet transcripts. Furthermore, a notable advantage of the CDSeq algorithm is that no reference cell types are required, which is important because the composition of different platelet cell types is not well defined.
We were unable to ascertain the definitive origin of the TEP signal transcripts, although, as noted, they are not likely to be derived directly from tumor cells, as they are not considered representative transcripts of the tumor and are more likely to be derived from other non-tumor cells, perhaps from supporting cells in the tumor microenvironment. Bidirectional biomolecular transfer is well established between platelets and other cells, including endothelial cells and inflammatory cells (47). Thus, the increased activation of platelets in patients with GBM may benefit the tumor by allowing the transfer of RNAs to and from platelets. This may then be exploited for clinical benefit by detecting activated platelets to which mRNA has been transferred from cells in the tumor microenvironment. Because the tumor microenvironment differs between tumors, the biomolecules abundant in the microenvironment, as well as those most likely to be detected in circulating platelets, are expected to differ as well, as evidenced by reports showing different TEP signals for different cancers. Further studies using deconvoluted transcriptomic data from platelets in patients with various tumors may further elucidate the inflammatory markers unique to each tumor.
Overall, we find that the use of an unsupervised machine learning algorithm for deconvolution allows for further mechanistic insight into the nature of the tumor-educated platelet signal. Although complete insight into the origin of tumor-educated platelet transcripts cannot be provided by computational analysis alone, further experimental work may be able to trace transcripts, possibly originating from inflammatory cells, that are transferred to circulating platelets. The signature produced will benefit from additional validation in a prospective study of a patient population with GBM. This would allow for analysis of other patient markers that may correlate with inflammatory transcripts seen in the TEP data, such as erythrocyte sedimentation rate, and other clinical variables, such as smoking and medications, that might alter platelet states. The analytic pipeline used in this study may yield decontaminated transcriptomic data that better elucidate the role of these variables in platelet composition.

Sex as a biological variable
Sex was not considered as a biological variable in analysis of human samples, which included both male and female patients. Mouse experiments were performed exclusively on female mice, and it is unknown whether the findings are relevant for male mice.

Deconvolution with CDSeq
FASTQ files containing RNA-seq data were obtained from the GEO database under accession number GSE156902 (12). Raw FASTQ files were processed using Trimmomatic version 0.36 (48) for trimming and clipping sequence adapters. The reads were then aligned to the human reference hg38 genome using STAR (49), and the aligned reads were summarized using HTSeq 0.13.5 (50). The dataset for analysis of the GBM TEP profile included all normal samples and all GBM samples, excluding follow-up specimens. Levels of circRNA in each sample (discussed in Supplementary Figure 6) were predicted using PTESFinder (30). Aligned reads from the same specimen stored in separate files were combined. Genes with fewer than 400 reads across all the samples were removed from the analysis.
The resulting dataset contained 437 samples (89 from patients with GBM and 348 from healthy controls) and 20,367 genes. Mitochondrial reads were removed from the dataset. This dataset was then provided as input to CDSeq (21), using the parameters alpha = 5 and beta = 0.5. Results of deconvolution with varying values of alpha and beta were substantively similar (Supplementary Figure 13). To make the process computationally tractable to run with different numbers of cell types, we used the CDSeq data dilution module with a dilution factor of 10. We then ran CDSeq on the dataset for each number of cell types in the range of 2 to 60 to determine the ideal number of cell types. Each run was performed for 1000 Markov chain Monte Carlo steps.
To analyze the relationship between cell types in each run, we tracked the cell types to which the reads of a given gene/sample combination were assigned. For example, suppose the number of reads of gene g in sample s that are assigned to cell type t among T total cell types (t = 1, 2, 3, …, T) is given by $n^{(T)}_{g,s,t}$, so that the total number of reads of gene g in sample s is $\sum_{t=1}^{T} n^{(T)}_{g,s,t}$. Also denote the number of reads of gene g in sample s that are assigned to cell type t' among T + 1 cell types (t' = 1, 2, 3, …, T, T + 1) by $n^{(T+1)}_{g,s,t'}$. Now, we can estimate the number of reads of gene g in sample s assigned to cell type t among T cell types that are also assigned to cell type t' among T + 1 cell types, as

$$\frac{n^{(T)}_{g,s,t}\, n^{(T+1)}_{g,s,t'}}{\sum_{t''=1}^{T} n^{(T)}_{g,s,t''}}.$$

We then estimated the number of total reads of cell type t among T cell types that were also assigned to cell type t' among T + 1 cell types as the sum of this expression over every gene-sample combination, or

$$\sum_{g,s} \frac{n^{(T)}_{g,s,t}\, n^{(T+1)}_{g,s,t'}}{\sum_{t''=1}^{T} n^{(T)}_{g,s,t''}}.$$

This quantity represents transcripts that are preferentially assigned to cell type t over other cell types and that also constitute a substantial fraction of cell type t. For each cell type, we computed the proportion of each sample made up by the cell type and identified cell types preferentially found in GBM over normal samples using a t-test, with the Holm-Bonferroni method used to correct for multiple comparisons.
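The two computations in this paragraph can be sketched as follows. This is an illustrative sketch, not the study's code: `estimate_flow` assumes proportional allocation (within a gene/sample pair, the assignment in the T-type run is treated as independent of the assignment in the (T + 1)-type run), and `holm_bonferroni` is the standard step-down procedure applied to a list of p values. All names and toy numbers are hypothetical.

```python
def estimate_flow(reads_T, reads_T1):
    """Estimate reads shared between cell type t (T-type run) and t' (T+1-type run).

    reads_T[(g, s)]  : list of length T, reads of gene g in sample s per cell type
    reads_T1[(g, s)] : list of length T + 1, the same reads in the (T+1)-type run
    Assumption: proportional allocation within each gene/sample pair.
    """
    first = next(iter(reads_T))
    n_types_T, n_types_T1 = len(reads_T[first]), len(reads_T1[first])
    flow = [[0.0] * n_types_T1 for _ in range(n_types_T)]
    for key, n_T in reads_T.items():
        total = sum(n_T)
        if total == 0:
            continue  # no reads for this gene/sample pair
        n_T1 = reads_T1[key]
        for t in range(n_types_T):
            for u in range(n_types_T1):
                flow[t][u] += n_T[t] * n_T1[u] / total
    return flow


def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: all larger p values are also retained
    return reject


# Toy example: one gene/sample pair with 10 reads (numbers are made up).
flow = estimate_flow({("g1", "s1"): [6, 4]}, {("g1", "s1"): [5, 3, 2]})
rejected = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Each row of `flow` sums to the reads of the corresponding cell type in the T-type run, so the estimate only redistributes reads across the finer partition.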
Cell types were identified by comparing the transcriptome of each cell type with single-cell RNA-seq data from a set of healthy control patients included in a publicly available dataset of peripheral blood single-cell data (22). Samples from patients in this dataset with COVID-19, which was the subject of the study for which these data were obtained, were excluded. The distance between the distribution of gene expression associated with each bulk cell type and the distribution associated with the annotated single-cell types was computed using the Hellinger distance

$$H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{N} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$

where N is the total number of genes, numbered i = 1, …, N, and $p_i$ and $q_i$ are the expression of gene i in the bulk data and single-cell data, respectively. The single-cell type closest to a given bulk cell type was selected as the correct annotation. In practice, several single-cell types were nearly equally close in distance to each non-platelet bulk cell type but were of the same hematopoietic lineage; therefore, in these cases, the cell type was annotated using the hematopoietic lineage (see Results). Hierarchical clustering was performed on the bulk cell types using the Hellinger distance to compare cell types, as described above.
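As an illustration of the annotation step, the sketch below computes the standard Hellinger distance between two expression profiles and picks the nearest reference cell type. The function names, the normalization of each profile to sum to 1, and the toy profiles are assumptions for illustration only.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two expression profiles over the same genes.

    Each profile is normalized to sum to 1 before comparison, so the result
    lies between 0 (identical) and 1 (no shared expressed genes).
    """
    total_p, total_q = sum(p), sum(q)
    squared = sum((math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
                  for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def annotate(bulk_profile, reference_profiles):
    """Return the name of the reference cell type closest to bulk_profile."""
    return min(reference_profiles,
               key=lambda name: hellinger(bulk_profile, reference_profiles[name]))


# Toy reference with three genes (values are made up).
reference = {"platelet": [80, 15, 5], "monocyte": [5, 20, 75]}
label = annotate([70, 20, 10], reference)
```

When several reference types are nearly equidistant, as the text notes for non-platelet cell types, a single nearest-neighbor label is unstable, which is why a lineage-level annotation is the more cautious choice.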
The same single-cell data set was used to approximate the number of RNA molecules per cell for a given type by taking the mean number of reads among all cells of each type. For cell type G (unknown type with high ribosomal protein counts), the mean read count of all cells in the data set excluding erythrocytes and platelets was used.
We also examined transcripts that were differentially upregulated in the platelet transcriptomes of patients with GBM compared with healthy controls. Analysis was performed using DESeq2 (20), and transcripts with a log2-fold increase in expression greater than 1.25, at a significance level of p < 0.05, were selected.
These genes were then clustered together using hierarchical clustering of the distribution of the transcripts of each gene in the deconvolution algorithm, using 2 to 60 cell types; the distance between two genes g and g' was defined using the notation above. That is, genes that tended to be allocated to the same cell type across deconvolution attempts for different numbers of cell types were more likely to be clustered together. For each gene, we analyzed the cell types to which the transcripts were assigned when performing deconvolution into 30 cell types. We also analyzed the share of each platelet cell type transcriptome that was constituted by the transcripts of each of these genes, again when performing deconvolution into 30 cell types.
The above algorithms were also used to analyze an RNA-seq dataset of patients with non-small cell lung cancer (NSCLC), obtained under GEO accession number GSE89843 (17). Processing was performed as described above, and deconvolution and analysis were performed using a dataset of healthy controls from the aforementioned studies combined with NSCLC samples.
To run the LDA algorithm in a supervised fashion, we used the implementation of the supervised LDA algorithm in tomotopy 0.12 (51). We first used the standard unsupervised LDA algorithm with k = 2, alpha = 20, and beta = 0.5 to deconvolute the data into two cell types corresponding to platelets and non-platelets, running the algorithm for 2000 steps. We then used the platelet cell type from the resulting deconvolution as input for the supervised LDA (sLDA) algorithm, using k = 30 cell types, alpha = 0.1, and beta = 0.01, with a binary classifier set with µ = 0 and ν² = 1, and setting the response variables to 0 for control samples and 1 for GBM samples. This was run for a total of 1000 steps, and the inferred response value for each sample in the training and validation sets was used for class prediction.
To compare patient follow-up samples and samples from patients with multiple sclerosis (both included in GEO accession number GSE156902) to our data set of samples of healthy controls and initial samples from patients with GBM, we used a "fold-in" technique to directly compare these additional samples to the ones analyzed by the initial pipeline. We used the same processing pipeline as above to generate a dataset of aligned reads. We then downsampled these samples by a factor of 10 to match the dilution factor used by CDSeq for the original samples and to make the analysis computationally tractable. We then ran 1000 cycles of our own latent Dirichlet allocation algorithm including both the new samples and the already analyzed samples, but only changing the cell type assignments of the new sample reads while keeping the cell type assignments of the previously analyzed samples fixed. Thus, the new samples are "folded into" the previous analysis so that cell type annotations are identical and the new samples can be analyzed within […] reverse transcription using 10X Genomics and libraries were prepared as previously described (32).
Briefly, cDNA amplification was performed in the presence of an antibody oligo-specific primer to increase the yield of antibody-derived tags (ADTs). The amplified cDNA was separated by SPRI size selection into cDNA fractions containing mRNA-derived cDNA (>300 bp) and ADT-derived cDNAs (<180 bp). Sequencing libraries were generated from the mRNA and ADT cDNA fractions, which were quantified, pooled, and sequenced on an Illumina NovaSeq platform (Illumina).
CITE-seq data were processed using the Seurat v4 pipeline (55). Platelets were selected from the single-cell dataset by visual inspection of the UMAP plot of the total data set, selecting the cell cluster that contained high levels of platelet markers including Ppbp and Pf4, then removing cells with >2% ribosomal protein transcripts, >2% hemoglobin transcripts, >10% mitochondrial mRNA transcripts, or increased levels of Malat1. Platelet datasets from normal mice and mice with implanted tumors were integrated using the Seurat pipeline. The integrated data were then processed using the standard Seurat pipeline, with markers of Seurat clusters found using the FindConservedMarkers function.
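The per-cell exclusion thresholds above can be expressed as a simple filter. The sketch below is written in plain Python rather than Seurat and is only illustrative: the dictionary keys and the function name are hypothetical, and `max_malat1` is an assumed cutoff because the text gives no numeric threshold for Malat1.

```python
def passes_platelet_qc(cell, max_ribo=0.02, max_hb=0.02, max_mito=0.10,
                       max_malat1=None):
    """Return True if a cell passes the platelet quality filters.

    cell: dict with the fraction of transcripts that are ribosomal protein
    ("ribo"), hemoglobin ("hb"), or mitochondrial ("mito"), plus a normalized
    Malat1 expression value ("malat1").
    max_malat1 is a hypothetical cutoff, applied only when provided.
    """
    if cell["ribo"] > max_ribo or cell["hb"] > max_hb or cell["mito"] > max_mito:
        return False
    if max_malat1 is not None and cell["malat1"] > max_malat1:
        return False
    return True


# Toy cells (all values are made up).
cells = [
    {"id": "c1", "ribo": 0.01, "hb": 0.00, "mito": 0.05, "malat1": 0.0},
    {"id": "c2", "ribo": 0.03, "hb": 0.00, "mito": 0.05, "malat1": 0.0},  # ribosomal
    {"id": "c3", "ribo": 0.01, "hb": 0.00, "mito": 0.12, "malat1": 0.0},  # mitochondrial
    {"id": "c4", "ribo": 0.01, "hb": 0.01, "mito": 0.05, "malat1": 2.5},  # Malat1
]
kept = [c["id"] for c in cells if passes_platelet_qc(c, max_malat1=1.0)]
```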

Statistics
For analysis of rates of human transcripts in control and tumor-implanted mice, the mean was calculated and 95% confidence intervals were calculated using a bootstrap. Gene set enrichment analysis was performed using topGO (28), and p values were calculated using Fisher's exact test. Differential expression testing p values were calculated using a Wald test, using the standard DESeq2 methods. Differences in expression levels and read counts in CITE-seq data analysis were analyzed using a Wilcoxon test.
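A bootstrap confidence interval for a mean can be sketched as below. The text does not state which bootstrap variant was used; this sketch assumes the percentile bootstrap, and the function name, the number of resamples, and the toy values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n))
                   for _ in range(n_resamples))
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return statistics.fmean(values), means[lower_idx], means[upper_idx]


# Toy fractions for one group of four samples (values are made up).
mean, lower, upper = bootstrap_mean_ci([0.1, 0.2, 0.3, 0.4])
```

With only four observations per group, the percentile interval is bounded by the smallest and largest observed values, so it should be read as a rough indication of spread.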

Study approval
All animal procedures were reviewed and approved by the Institutional Animal Care and Use Committee.

Figure 1. Analysis of platelets from mice implanted with a human brain tumor. (A) Schematic of experiment. Mice are implanted with cells from the human GS 8-11 line, then mouse platelets are harvested from implanted mice and control mice for downstream RNA-seq. Created with BioRender. (B) Fraction of reads in each sample mapped to the human genome hg38 for control mice and mice implanted with GS 8-11 tumors (each n = 4). The large colored dot represents the mean value, with lines extending from the mean representing 95% confidence intervals calculated using a bootstrap. (C) The top 20 genes mapped to the human genome for platelets from control mice and platelets from mice implanted with GS 8-11 tumor. (D) Differential expression of mouse-aligned and human-aligned genes using DESeq2, with significantly altered expression (p < 0.05, Wald test) highlighted.

Figure 2. Deconvolution of the platelet transcriptome using CDSeq. (A) Results of the deconvolution algorithm, using varying numbers of cell types (ranging from 2 to 10 cell types). Each column represents the same set of reads deconvoluted into a different number of cell types. Flow from one column to the next represents an estimate of the repartitioning of reads into a larger number of cell types. For the remaining figures, we use the deconvolution into 8 cell types. (B) The proportion of each cell type with regard to total number of transcript reads (left) and total number of cells (right). (C) Clustering of cell types based on similarity. Distance is computed by taking the Spearman correlation coefficient between two cell type gene expression profiles and converting it from a value ranging from 1 to -1 to a distance in [0,1]. (D) Prominent genes found in each cell type, with darker color corresponding to a higher fraction of reads of the given gene in the given cell type.

Figure 3. Features of platelets found in controls and patients with GBM. (A) Boxplots showing fraction of cell types A, B, C and D in samples from healthy controls and samples from patients with GBM, after other cell types have been removed. Groups were compared via a Wilcoxon test. (B) Plots of genes in cell types A and D, with the x axis representing the estimated representation of the gene in the two cell types combined, and the y axis representing the ratio of expression in cell type A to cell type D. Genes in the green and red squares are taken to be upregulated and downregulated, respectively. (C) Gene Ontology terms overrepresented or underrepresented in genes that are upregulated in cell type A compared to genes that are downregulated in cell type A.

Figure 4. Analysis of CITE-seq data from mouse platelets. (A) UMAP plot of platelets collected from normal mice and tumor-implanted mice, with clusters P1, P2 and P3 highlighted. (B) CD41 surface expression in platelets. (C-D) Expression of CD41 in platelets from normal and tumor-implanted mice, and in clusters P1, P2 and P3, with p values from Wilcoxon test shown. (E) Gene expression heatmap for P1, P2, and P3 clusters, with genes highly expressed in P2/P3 vs P1 or vice versa shown. CD41 expression heatmap in P1, P2 and P3 clusters shown below gene expression heatmap. (F) Total read count per cell in clusters P1, P2, P3; p value using Wilcoxon test. (G) Dot plot showing expression of platelet activation related genes (Itga2b, Sh3bgrl3 and Myl9) and genes associated with decreased platelet activation (Nt5c3, Tsc22d1, Rgs18).

Figure 5. Genes upregulated in GBM-associated platelets. (A) Clustering dendrogram for genes upregulated in platelets of patients with GBM. (B) Distribution of these genes among cell types when deconvoluting into 30 cell types. Distribution of all other genes among cell types is shown on the right side. (C) Expression of these genes among platelet cell types. Platelet cell types are ordered vertically by finding the Hellinger distance of the platelet type gene expression from the Plt_activated transcriptome, so that platelet types at the top are most similar to Plt_activated.