Subcellular protein localisation of Trypanosoma brucei bloodstream form-upregulated proteins maps stage-specific adaptations

Background: Genome-wide subcellular protein localisation in Trypanosoma brucei, through our TrypTag project, has comprehensively dissected the molecular organisation of this important pathogen. Powerful as this resource is, T. brucei has multiple developmental forms and we previously only analysed the procyclic form. This is an insect life cycle stage, leaving the mammalian bloodstream form unanalysed. The expectation is that protein localisation would not change dramatically between life stages (remaining completely unchanged or shifting to analogous stage-specific structures). However, this has not been specifically tested. Similarly, which organelles tend to contain proteins with stage-specific expression can be predicted from known stage-specific adaptations but has not been comprehensively tested. Methods: We used endogenous tagging with mNeonGreen (mNG) to determine the subcellular localisation of the majority of proteins encoded by transcripts significantly upregulated in the bloodstream form, and compared these to the existing localisation data in procyclic forms. Results: We have confirmed the localisation of known stage-specific proteins and identified the localisation of novel stage-specific proteins. This gave a map of which organelles tend to contain stage-specific proteins: the mitochondrion for the procyclic form, and the endoplasmic reticulum, endocytic system and cell surface in the bloodstream form. Conclusions: This represents the first genome-wide map of life cycle stage-specific adaptation of organelle molecular machinery in T. brucei.


Introduction
Trypanosoma brucei is a unicellular eukaryotic parasite and, like any unicellular organism, adjusts its gene expression profile to adapt to different environments. As an obligate parasite, the environments it encounters are exclusively within the host and vector, and changes in gene expression profile give rise to the appropriate protein machinery to adapt the parasite to these niches. T. brucei has three main replicative life cycle stages: the procyclic form (PCF, fly midgut), the epimastigote form (EMF, fly salivary glands) and the bloodstream form (BSF, mammalian host bloodstream), although within these stages there is also additional specialisation 1,2. The PCF and BSF are readily grown in culture.
The PCF and BSF have many well characterised differences, including the BSF VSG surface coat and associated expression machinery 3, metabolic differences and associated remodelling of the mitochondrion 4, morphology, and morphogenesis adaptations 5,6, along with many more. However, genome-wide mapping of the global changes is broadly limited to the gene expression level, most extensively determined at the mRNA level 7-11, which does not correlate fully with protein abundance 12. Fewer studies consider later steps in protein production: translation (mRNA ribosome footprinting) 7,11 and protein abundance (quantitative proteomics) 13-15. Despite the comparative ease of culturing PCFs and BSFs and the powerful reverse genetic tools available, a huge number of genes with evidence for BSF upregulation are not characterised.
Here, we aim to address this using subcellular protein localisations. We have demonstrated the power of this approach in PCFs with the TrypTag genome-wide protein localisation project 16. This showed how informative localisation can be for holistic mapping of potential protein function, although naturally localisation does not determine specific molecular function. We also previously used high throughput tagging of BSF-upregulated genes to identify ESB1, necessary for transcription of the expression site containing the VSG gene along with expression site associated genes 17. However, our previous analysis of these BSF localisations was minimal, aiming only to identify expression site body components. Here, we present analysis of an extended version of this BSF localisation dataset as both evidence for how BSFs are adapted relative to PCFs and as a resource for the research community.

Cell culture
Bloodstream form Trypanosoma brucei brucei strain Lister 427 pJ1339 was grown in HMI-9 at 37°C with 5% CO₂ 18, maintained in log phase growth and at less than ~2×10⁶ cells/ml by regular subculture. To enable CRISPR/Cas9 genome modifications, this cell line expresses T7 RNA polymerase, Tet repressor, Cas9 nuclease and a puromycin drug selectable marker 17 and was maintained with periodic drug selection using 0.2 µg/ml Puromycin Dihydrochloride. Culture density was measured with a CASY model TT cell counter (Roche Diagnostics) with a 60 µm capillary and exclusion of particles with a pseudo diameter below 2.0 µm.

Electroporation and drug selection
For endogenous tagging of a protein, electroporation was used to transfect T. brucei with two linear DNA constructs: one from which a CRISPR sgRNA is transiently expressed, and one carrying the fluorescent protein and drug selectable marker, which has homology arms allowing homologous recombination into the target locus. Constructs for endogenous N or C terminal tagging were generated using long primer PCR from a pPOTv7 mNeonGreen (mNG) / blasticidin deaminase template, and PCR was used to generate DNA encoding sgRNA with a T7 promoter, both as previously described 19,20 (for primer sequences see Underlying data 21).
For DNA encoding the drug selectable marker and fluorescent protein, 0.2 mM dNTPs, 30 ng pPOT plasmid, 2 µM gene-specific forward and reverse primer and 1 unit HiFi Polymerase (Roche) were mixed in 1× HiFi reaction buffer with MgCl₂ and 3% v/v DMSO, in 50 µl total volume. PCR cycling conditions were 5 min at 94°C followed by 40 cycles of 30 s at 94°C, 30 s at 65°C, 2 min 15 s at 72°C followed by a final elongation step for 7 min at 72°C on a SimpliAmp Thermal Cycler (ThermoFisher).
For DNA encoding sgRNAs, 0.2 mM dNTPs, 2 µM of sgRNA scaffold primer (aaaagcaccgactcggtgccactttttcaagttgataacggactagccttattttaacttgctatttctagctctaaaac) and gene-specific primer and 1 unit HiFi Polymerase were mixed in 1× HiFi reaction buffer with MgCl₂, 50 µl total volume. PCR cycling conditions were 30 s at 98°C followed by 35 cycles of 10 s at 98°C, 30 s at 60°C, 15 s at 72°C on a SimpliAmp Thermal Cycler. 2 µl of each reaction were run on a 2% agarose gel to check for the presence of a product of the expected size. For gel images, please see the associated Zenodo deposition 21.
5 µg of DNA from the PCRs was purified by phenol chloroform extraction, resuspended in 10 µl water, then mixed with approximately 3×10⁷ cells resuspended in 100 µl of Roditi Tb-BSF buffer 22. Transfection was carried out using program X-001 of the Amaxa Nucleofector IIb (Lonza) electroporator in 2 mm gap cuvettes. Following electroporation, cells were transferred to 10 ml pre-warmed HMI-9 for 6 h, then 5.0 µg/ml Blasticidin S Hydrochloride was added to select for cells with successful construct integration. Healthy resulting populations were maintained with periodic drug selection using 0.2 µg/ml Puromycin Dihydrochloride and 5.0 µg/ml Blasticidin S Hydrochloride.

Amendments from Version 1
This version includes three primary changes: firstly, including specific numbers and percentages for qualitative statements in the text; secondly, including gene set sizes in more figures; and thirdly, including example images of localisations which are potentially suspect, specifically soluble fluorescent protein (mNG) and background autofluorescence localisations. An additional data table has been added to the underlying data (at Zenodo) containing the underlying raw data for the analyses in Figure 2 and Figure 3, and the Zenodo description has been significantly updated to aid readers in finding and accessing the raw data.
Any further responses from the reviewers can be found at the end of the article.

Selection of genes for tagging
BSF tagging was carried out in the T. brucei Lister 427 cell line, and we considered genes for tagging if they had a syntenic ortholog in T. brucei TREU927. Genes were selected for tagging as described in the main text using TrypTag PCF protein localisation data available up to 12th March 2018 and TriTrypDB version 36, with the following specific exclusion criteria to avoid tagging of large well-known gene families and genes encoding GPI-anchored proteins known to be refractory to N and C terminal tagging. VSG, the major BSF surface coat protein, was excluded by removing known (named) VSG genes and pseudogenes. In the interest of unbiased analysis, we ensured surface coat proteins characteristic of other life cycle stages were also excluded: EP procyclins, also called procyclic acidic repetitive proteins (PARPs), and brucei alanine rich proteins (BARPs). Known (named) invariant surface glycoproteins (ISGs) were excluded, with the exception of tagging controls ISG65 and GPI-PLC, and VSG expression site associated genes and related genes (ESAGs and GRESAGs) were excluded. Finally, ribosomal proteins, which we deemed unlikely to be of interest, were excluded.

Light microscopy
Cells were prepared for light microscopy by centrifugation to remove medium, followed by resuspension in FCS-free HMI-9 containing 1 µg/mL Hoechst 33342 before a second centrifugation and resuspension in a small volume (~20 µl) of FCS-free HMI-9. An equal volume of 0.04% (v/v) formaldehyde in FCS-free HMI-9 was added to lightly fix the cells 17,23. Images were captured on a DM5500 B (Leica) upright widefield epifluorescence microscope using a plan apo NA/1.4 63× phase contrast oil-immersion objective (Leica, 15506351) and a Neo v5.5 (Andor) sCMOS camera using MicroManager (version 1.4.18) 24.

Statistics
Statistical significance of change in localisation annotation term usage for the PCF and BSF upregulated gene sets was evaluated using the chi-squared test (using Excel version 2210, Microsoft), taking the annotation term usage in the genome-wide PCF set as the null hypothesis. Fold change in individual term usage was calculated as the ratio of term count in the PCF or BSF upregulated set to the term count in the genome-wide PCF set, e.g. count of axoneme annotation terms in the BSF upregulated gene set divided by count of axoneme annotation terms genome-wide in PCFs. This is an approximation for BSFs, as we do not know the genome-wide term usage in BSFs. Error was estimated using the standard error of proportion (SEP) for each annotation term (using Excel). Fold change in term usage was normalised (and SEP scaled appropriately) to the total number of annotation terms in each set, such that a fold change of unity indicates no bias in usage between sets.
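The calculation described above can be illustrated with a short sketch. This is not the Excel workbook used for the analysis: the function names and the toy counts are hypothetical, and scaling the SEP by dividing it by the reference proportion is only one reading of "scaled appropriately". The sketch also assumes every term has a non-zero count in the reference set.

```python
from math import sqrt


def chi_squared_statistic(observed, reference):
    """Pearson chi-squared statistic for observed term counts, with expected
    counts derived from the reference (genome-wide) term proportions."""
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    statistic = 0.0
    for term, ref_count in reference.items():
        expected = n_obs * ref_count / n_ref
        statistic += (observed.get(term, 0) - expected) ** 2 / expected
    degrees_of_freedom = len(reference) - 1
    return statistic, degrees_of_freedom


def normalised_fold_change(observed, reference):
    """Return {term: (fold_change, scaled_sep)}. Fold change is the ratio of
    term counts normalised to the total terms in each set, so it equals 1.0
    when a term is used in the same proportion in both sets."""
    n_obs = sum(observed.values())
    n_ref = sum(reference.values())
    result = {}
    for term, ref_count in reference.items():
        p_obs = observed.get(term, 0) / n_obs
        p_ref = ref_count / n_ref
        # Standard error of proportion for the observed set.
        sep = sqrt(p_obs * (1 - p_obs) / n_obs)
        # Assumption: the SEP is scaled by the same factor as the fold change.
        result[term] = (p_obs / p_ref, sep / p_ref)
    return result


# Invented counts for illustration only; these are not values from the dataset.
reference_counts = {"mitochondrion": 200, "endocytic": 50, "axoneme": 250}
upregulated_counts = {"mitochondrion": 10, "endocytic": 20, "axoneme": 20}

statistic, dof = chi_squared_statistic(upregulated_counts, reference_counts)
fold_changes = normalised_fold_change(upregulated_counts, reference_counts)
for term, (fold, error) in fold_changes.items():
    print(f"{term}: fold change {fold:.2f} +/- {error:.2f}")
```

The p-value follows from the statistic and degrees of freedom via the chi-squared survival function, which any statistics package or spreadsheet provides.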

Results and discussion
To enrich for proteins likely to have BSF-specific functions, we devised three gene tagging sets based on data available at the time (Figure 1). Set 1) 289 genes with mRNAs upregulated in BSFs, based primarily on mRNAseq from 7 but manually incorporating some genes identified as strongly upregulated in 8-11 but not in 7. Set 2) T. brucei-specific genes (defined as those which lack both an L. major Friedlin and T. cruzi Brener non-Esmeraldo ortholog) not already included in Set 1, which met one of two criteria based on TrypTag PCF tagging data available at the time: Set 2a) the 30 genes that had failed to give a convincing signal above background by both N and C terminal tagging, and Set 2b) the 21 genes which had a nucleoplasm or nucleolar localisation. The former were selected to test whether lack of PCF signal correlated with BSF stage-specific expression, and the latter as candidates for T. brucei-specific BSF nuclear structure adaptation potentially associated with antigenic variation/variant surface glycoprotein expression.
We prioritised N terminal tagging because this preserves the 3' untranslated region (UTR), suspected to confer most gene regulation in trypanosomes 25. However, when a protein had a predicted N terminal signal peptide, C terminal tagging was instead necessary. If we failed to generate a drug resistant population, we repeated construct generation and transfection at least once. The final success rate generating cell lines (for full listing see Underlying data 21) was 72.9% (Figure 3A), of which 76.6% had signal we manually classified as unlike background fluorescent signal (Figure 3B), i.e. a convincing subcellular localisation.
For the final analysis of these localisations, we re-analysed the gene sets based on the entire TrypTag PCF localisation dataset 16 and TriTrypDB version 59 26 (Figure 2). There were some changes: altered OrthoMCL sensitivity due to addition of new genomes (Figure 2B), additional PCF tagging repeats providing a strong convincing localisation where only weak signal was previously observed (Figure 2C), and changed PCF localisation annotation (e.g. from nucleoplasm to nuclear envelope, Figure 2D). We also defined a final criterion for upregulation in the BSF: transcripts significantly upregulated (p < 0.05, Student's T test) by mRNAseq in the BSF relative to the PCF (data from 7). However, overall, the gene sets well reflect their original purpose.
Figure 2. Post hoc analysis of the target gene sets for BSF tagging. Bar charts showing the proportion of genes in a gene set (x axis) which meet a particular criterion. Total number of genes in each set is shown above each bar. A. Proportion of genes at least 2.5-fold upregulated mRNA and p < 0.05 (two-tailed T test) from 7, for each target gene set; BSF upregulated, T. brucei-specific with PCF weak signal by N and C terminal tagging and T. brucei-specific proteins which localise to the nucleus in PCFs, in comparison to all T. brucei genes. B. Proportion of genes with no L. major and no T. cruzi ortholog in each target gene set. C. Proportion of genes with N and C terminal tagging data in PCFs from the TrypTag project for which both termini had weak, i.e. no strong localisation to an identifiable organelle. D. Proportion of genes annotated as localising to the nucleus, nucleoplasm or nucleolus by either N or C terminal tagging in PCFs from the TrypTag project. In each graph, the number of genes for which data is available in each group is shown at the top of each column.
Figure 3. [...] i.e. no strong localisation to an identifiable organelle in the BSF. C. Proportion of each target gene set for which strong localisation to an identifiable organelle was observed for either N or C terminal tagging in PCFs, in comparison to all genes with PCF data. Data from the TrypTag project. D. The proportion of BSF cell lines for each target gene set with strong localisation to an identifiable organelle which gave a similar localisation to either N or C terminal tagging in PCFs. In each graph, the number of genes for which data is available in each group is shown at the top of each column.
We observed convincing fluorescent signal in BSFs for many (164/289, 56.7%) tagged proteins in Set 1 (upregulated in BSFs at the mRNA level, Figure 3B). In this gene set, disproportionately many genes (39.1% vs. 18.4% genome-wide) were also T. brucei-specific (Figure 2B), and disproportionately few (42.0% vs. 76.3% genome-wide) had no convincing above-background localisation observed in the PCF (Figure 2C). We also observed a convincing fluorescent signal in BSFs for many (13/30, 43.3%) in Set 2a (T. brucei-specific genes with no detectable PCF signal, Figure 3B). Lack of fluorescent signal in the PCF tagging previously raised our suspicions that these genes may not be expressed in this life cycle stage, never expressed, or encode a non-functional, and therefore degraded, protein product. Similarly, failure to generate a PCF tagged cell line may indicate inaccurate sequence data for that locus or that the drug selectable marker cannot be expressed from that locus. This was an acute concern when the gene was T. brucei-specific and therefore had no evidence from evolutionary conservation for being functional. Our BSF localisation provides evidence that many of these genes (67/161, 41.6%) encode an expressed and likely functional protein (on the basis that the proteins were often targeted to a specific organelle), supporting proteomic analyses 15. As would be expected, fluorescent signal in a tagged cell line therefore broadly correlates with mRNA abundance across life cycle stages, and failure to observe a convincing localisation in PCFs is, as we previously proposed 16, at least partially predictive of stage-specific protein expression.
As described above, with the exception of Set 2b (the set of T. brucei-specific nuclear genes, which were selected based on a specific PCF localisation), our BSF tagging was of proteins disproportionately more likely to have no detected signal from PCF tagging (Figure 3C). However, when a PCF localisation was available it was likely to be similar to the BSF localisation we observed; overall, ~85% were manually classified as similar (Figure 3D). When dissimilar, the localisation observed in either the PCF or BSF was typically either a weak cytoplasmic signal or a cytoplasm, nuclear lumen and flagellar cytoplasm localisation (examples shown in Figure 4A). The former is simply background autofluorescence signal. The latter is the localisation we observed in PCFs for mNG when not fused to a protein. As we previously described for PCF tagging 27, these can arise from frame shifts, likely originating from stochastic errors in synthesis of the primers for tagging. Alternatively, they may be poorly tolerated fusion proteins that are truncated or partially degraded, leading to expression of effectively mNG alone. Overall, we therefore conclude that the vast majority of proteins differ only in expression level and not localisation. One, however, featured a clear change; see below.
For Set 1, the set of BSF upregulated genes, whether or not a PCF localisation was visible, the BSF localisation gave a much stronger signal. This was detectable because we used the same microscope, camera and image processing settings for PCFs and BSFs, making signal intensity in the images approximately quantitative. This includes proteins known or expected to be BSF-upregulated: pyruvate transporter 1, PT1 28; repressor of differentiation kinase 2, RDK2 29; flagellum adhesion protein 3, FLA3 30; and cytoskeleton associated protein CAP5.5V 31 (Figure 4B). However, it also includes novel or uncharacterised proteins localising to a range of different organelles (examples in Figure 4C). We also noted one clear example where protein localisation differed between the PCF and BSF. Tb927.11.1230 and its syntenic ortholog Tb427tmp.47.0026 localised to the distal axoneme (occasionally with weak proximal signal) in PCFs and the entire axoneme in BSFs (Figure 4D). PCF to BSF localisation differences have been previously observed, for example MCP6 and α-KDE1 32,33, but most notably the Tb927.11.1230/Tb427tmp.47.0026 localisation change is comparable to that of the flagellar protein FLAM8 (flagellar member 8) 34.
We noted that BSF-upregulated proteins often localised to membranous structures: the pellicular or flagellar membrane, the endoplasmic reticulum or the endocytic system (Figure 4B,C). We therefore tested for a bias in localisation annotation term usage relative to genome-wide usage in PCFs. Taking only the target genes for BSF tagging not selected based on a nuclear PCF localisation, i.e. excluding Set 2b, there was indeed a significant bias in term usage (p < 10⁻³⁰, chi-squared test). Normalised fold-change in usage of annotation terms revealed a strongly disproportionately high usage of terms associated with the surface membrane and the endo/exocytic system (pellicular and flagellar membrane, ER and endocytic). There were also weaker biases in BSFs for 1) general (nucleus, nuclear lumen) rather than specific (nucleoplasm, nucleolus) nuclear localisation annotations, 2) fewer mitochondrion and kinetoplast annotations, 3) more glycosome terms, and 4) more flagellum tip and flagellar connector-like 5,35 annotation terms (Figure 5A,B). The BSF cell surface therefore has the greatest adaptation between BSFs and PCFs, with this change plausibly supported and/or maintained by changes in the ER and endocytic system.
The converse analysis, taking genes upregulated in the procyclic form (p < 0.05, Student's T test, by mRNAseq in the PCF relative to BSF, data from 7) and analysing localisation annotation term usage relative to genome-wide usage in PCFs, also revealed a significant change (p < 10⁻³⁰, chi-squared test) in term usage, reflecting adaptation in the PCF. We identified 1) disproportionately high usage of mitochondrion and kinetoplast terms, 2) high usage of flagellar tip and flagellar connector terms, and 3) few glycosome terms. This speaks to the known upregulation of oxidative phosphorylation (mitochondrial) relative to glycolysis (glycosomal) as the major ATP source in the procyclic form and adaptation of the flagellum tip likely linked with new flagellum outgrowth 5, but limited other changes (Figure 5C,D).
Figure 5. Stage-specific organelle adaptation mapped using localisation term usage. A. Localisation annotation term usage, as the proportion of all annotation terms used for localisations, comparing all PCF (N and C terminal tagging) localisation terms to all BSF localisations described here, excluding the target gene set 2b; T. brucei-specific nuclear localising proteins. All localisation annotations for N and/or C terminal tagging, whichever were available, were included so long as they did not have the 'weak' or '<10%' modifiers. B. The data in A, except plotted as the ratio of term usage in BSF upregulated vs. total PCF, normalised to number of annotation terms in the BSF set. Error bars represent standard error of proportion. Grey hatched bars indicate too few (<3) BSF upregulated protein localisations for accurate fold change calculation. C. Analogous analysis of PCF upregulated genes from TrypTag data: Localisation annotation term usage, as the proportion of all annotation terms used for non-weak localisation, comparing all PCF localisation terms with those for proteins encoded by genes significantly upregulated at the mRNA level in PCFs. D. The data in C, except plotted as the ratio of term usage in PCF upregulated vs. PCF total term usage, normalised to number of terms in the PCF set. Error bars represent standard error of proportion. Grey hatched bars indicate too few (<3) PCF upregulated protein localisations for accurate fold change calculation.
In conclusion, we have mapped which organelles contain proteins upregulated in the T. brucei BSF and PCF life cycle stages (summarised in Figure 6), thus mapping where the molecular machinery responsible for their stage-specific adaptations likely acts in the cell. This includes uncharacterised proteins with little or no bioinformatic insight into likely function. Lack of fluorescent signal by endogenous tagging in the PCF was often predictive of BSF expression, confirming the power of the TrypTag genome-wide protein localisation resource as a protein expression level resource. We also showed that it is likely that a large majority of T. brucei proteins, when expressed, have similar localisations in BSFs and PCFs; the dominant adaptive process therefore appears to be change in expression level rather than change in localisation. We suggest that this also likely applies to other life cycle stages and the different life cycle stages of other trypanosomatid parasites.

The TrypTag project, which has generated images of mNeonGreen-tagged proteins in procyclic form (PCF) Trypanosoma brucei, has been immensely useful in determining protein subcellular localisation and, by inference, function. In this work the authors have extended the project to include the clinically relevant bloodstream form (BSF) for selected protein sets, with a focus on those that are BSF enriched at transcript level or did not generate distinct signal in the PCF, in addition to certain nuclear proteins. This is a laudable effort and I have no doubt that it has been performed to the highest technical standards and will be of great utility to the community. However, the overall presentation of the data is disappointing and the format of the available underlying data is unsuitable.
The manuscript is rather short, which for a data-set paper would be fine, but the brevity extends to a lack of specific details in the results and discussion. There are several times that we are told the rather vague "many" without the precision of hard numbers, percentages or readily available tabulated data to back up these statements. Identifying the results that support the assertions made should be made clearer and should not require a deep dive into the underlying data.
I find the format of the data presented in Figures 2 & 3 hard to interpret visually and would rather the data were tabulated for clarity, with an additional supplementary table identifying which genes have these characteristics, e.g. for Set 1 "BSF upregulated", ~75% of the 289 genes appear to match the threshold (>2.5-fold, P < 0.05) but their identity is unclear.
In the discussion regarding the differences in localisation of ~15% of proteins between PCF and BSF there is a statement that a "weak cytoplasmic signal" is simply background and "cytoplasm, nuclear lumen and flagellar cytoplasm localisation" is observed in PCF for unfused mNG. It would be useful to see example images of this type (in comparison to "genuine" cytoplasmic signal) and to have greater clarity as to which proteins these caveats relate to. The localisation of unfused mNG in BSF should also be presented and discussed as this is a critical control.
The BSF upregulated proteins with higher fluorescent signal in BSF than PCF "includes many novel or uncharacterised proteins localising to many different organelles". The identity and localisation of these proteins should be clearly given in a supplementary table.
Underlying data is opaque
Whilst all the underlying data is technically available, the data provided in the Zenodo link is difficult to navigate, and it is hard to gain any sort of high-level view of the data without investing time and effort. Combined with the lack of detail in the manuscript itself, it is very difficult to see summary data (i.e. a list of proteins tagged and whether the localisation could be determined), let alone compare with previous PCF tagging. Simply browsing images is difficult to the point of being technically challenging and most likely beyond everyone except the most dedicated and data-savvy.
It is essential that the authors provide summary data in a user-friendly format that can be easily viewed and interpreted by the casual reader to ensure maximum dissemination and impact of the work. I would also urge the authors to integrate these results into the existing TrypTag resource so that they are searchable and can be readily viewed, and to work to integrate them into the TriTrypDB functional genomic database.

Minor points:
1. The introduction states "However, genome-wide mapping of the global changes are broadly limited to gene expression level, most extensively determined at the mRNA level 7-11 which does not correlate fully with protein abundance 12. Few studies consider later steps in protein production: translation (mRNA ribosome footprinting) 13 and protein abundance (quantitative proteomics) 14." This is not an accurate reflection of the literature, as there are three SILAC quantitative proteomic studies alone that compare BSF and PCF protein abundance (PMID: 23090971 1 and references 14 & 26).
2. Reference 13 appears to be erroneous; it is certainly not an mRNA ribosome footprinting study.
3. The language in the manuscript would benefit from minor changes to grammar to ensure clarity, particularly in the abstract and introduction, i.e.: "Results: We have confirmed the localisation of known [stage-specific proteins] and identified the localisation of novel stage-specific proteins." "This showed how informative localisation can be for holistic mapping of potential protein function, although naturally [localisation] does not determine specific molecular function." "We also previously used high throughput tagging [of] BSF-upregulated genes to identify…"

No one doubts the importance of genome-wide localization studies. We can clearly see their power in the impact of the TrypTag project 1, which was finally published in Nature Microbiology, but whose resources we have long been able to use via the TrypTag or TriTrypDB platforms. Since the whole genome-scale analysis of subcellular localization was performed on the procyclic forms (PCF), one of the two easily cultured forms of T. brucei, it is obvious that future studies will also focus on the second form, the bloodstream stage (BSF).
The authors made a good selection of the initial sets of proteins of interest, namely i) genes that are upregulated (2.5 times, p < 0.05 according to the study by Jensen et al, & some others), ii) genes for which no localization in PCF was shown, and iii) genes for which nucleoplasmic or nucleolar localization was shown (actually, a rationale for this third group should be included, as it may not be obvious to many, including me; I just conclude that it is related to VSG).
Based on the localization data obtained, the authors identify organelles that are subject to stage-specific regulation. They also conclude that in most cases localization does not change when known for both life cycle stages, and that if it was not possible to localize the protein in PCF, it is most likely a BSF protein whose localization can be detected in these forms.
Overall, the manuscript is quite short, but one was more looking forward to the supplement to evaluate the study in general and for specific IDs. However, I cannot evaluate the supplemental data available from Zenodo. I downloaded everything, clicked on everything, but I did not see any complete tables or images. After opening the TrypTag website, I was sad to see that the data from this study is not yet available on this website.
I have two main criticisms, which I will address below: i) some vagueness of the text, ii) the lack of supplements and tables to support the figures shown.
Ad 1) The whole text is interspersed with statements like: 'we observed convincing fluorescent signal in BSF for many tagged proteins [link to Fig 3B, which is firstly in percentages, so you have to do the math, and secondly it is not even clear how many genes are 100%]'; '…for many in Set 2...', '…many of these genes encode...' '...it includes many novel or uncharacterized proteins...' One wonders how many is "many." Since the data sets were very discrete, I do not see why the authors cannot be more specific. The same is true for the cut-off for genes up-regulated in BSF; it is only mentioned once that the cut-off is 2.5 with a p-value < 0.05.
This ambiguity also relates to Figures 2 and 3, which are a mystery to me. For example, in Figure 2A the y-axis is in percent and labeled as BSF upregulated proteins. In the first column, 8171 is 100% (I guess), so I conclude that 10% are upregulated in BSF; that is about 800 proteins, but the authors selected 289 genes. In the second column, 289 is 100%; if so, I do not know what this column is supposed to show, and the same is true for columns 3 and 4.
It would be nice if the authors could redesign the diagrams to make them easier to understand (but it is possible that only I have this problem). Nonetheless, I would definitely recommend accompanying these graphs with supplemental Excel spreadsheets highlighting the gene IDs, including the gene description, the determined localization in BSF and PCF.
This will also help in determining for how many hypothetical proteins subcellular localizations were found. The authors mention this in the abstract, "we have confirmed the localization of known and identified the localization of novel stage-specific proteins" but practically, there is no table or graph (how many and where they are) for this result.
Ad 2) Lack of supplements and tables that would support the figures shown. This was mentioned above, but I would add that a table would also be useful for Figure 5. While it is nice to see the specific organelle adaptation, scientists will always want to look for their favorite gene IDs.
This is also related to the question of whether it is possible to make the data available on TrypTag; I understand that linking to TriTrypDB will take some time.

Comments on this article Version 1
Author Response 19 Apr 2023

Richard Wheeler
We thank the reviewers for their careful consideration of this work. Unfortunately, it is clear that a major theme from the reviewers' comments was concerns about data availability.
We are sorry that this is seen as an issue, especially as the main priority of this paper is to describe this open dataset. I assure the reviewers that all the microscopy and localisation annotation data is available at Zenodo, under record number: 7258722.
For example, the list of localisation annotations (and gene IDs tagged) is in the file called "localisations.tsv".
The current link for this is https://zenodo.org/record/7843865/files/localisations.tsv?download=1, but it is better to access it via the Zenodo website to ensure the latest version of the record is accessed. There was clearly frustration that images are not easily visible. Please understand that this is a non-trivial data management issue.
While it would be good to have all of the image data simply listed for download by gene ID, the total dataset is too large even for specialised data deposition websites like Zenodo. Hence the spread over multiple Zenodo entries.
To access the microscopy data you just need to look up the gene ID in "id_doi_index.tsv", go to the indicated Zenodo ID and download the microscopy data. The current link for this index is https://zenodo.org/record/7843865/files/id_doi_index.tsv?download=1. This is the best solution we have been able to set up. Unfortunately, scientific infrastructure is simply not well-suited to projects with this scale of microscopy data. Please note that the emphasis placed on the .zip file (which is only necessary to replicate our image management) by the Zenodo preview is out of our control (see https://github.com/zenodo/zenodo/issues/1477). We have updated the Zenodo description to try to help; however, there is little more we can do than this. We are not keen to change the paper text to explicitly describe the Zenodo deposition file structure, as the two are separately indexed entities and would ultimately lose synchrony.
We have decided that this data will not be made available on TrypTag.org, as the bloodstream form data is neither genome-wide nor carried out in the same strain: TrypTag.org was designed for genome-wide information and is a poor search/user interface for identifying data with sparse genome coverage. A dedicated bloodstream form website would be 95% empty.
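For readers who would rather script the index lookup described above (find the gene ID in "id_doi_index.tsv", then go to the indicated Zenodo record), the following is a minimal Python sketch. It is optional, since the lookup can be done by hand. The column names, gene IDs and DOIs in the sample are hypothetical placeholders and are not taken from the real file, so the function deliberately matches the gene ID against any column rather than assuming a layout.

```python
import csv
import io


def find_gene_rows(tsv_text, gene_id):
    """Return every row (as a dict keyed by the header) that holds gene_id in any field."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if gene_id in row.values()]


# Hypothetical stand-in for the contents of id_doi_index.tsv.
# Column names and values are placeholders, not the real file layout.
SAMPLE_INDEX = (
    "gene_id\tzenodo_doi\n"
    "GENE_A\t10.5281/zenodo.0000001\n"
    "GENE_B\t10.5281/zenodo.0000002\n"
)

# Print each matching row so the Zenodo identifier can be read off.
for row in find_gene_rows(SAMPLE_INDEX, "GENE_B"):
    print(row)
```

To use this on a downloaded copy of the index, read the file into a string (for example with `open(path, encoding="utf-8", newline="").read()`) and pass it in place of `SAMPLE_INDEX`.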
Adding the bloodstream form data to the existing procyclic form/TrypTag.org database is complex as the bloodstream form tagging was carried out in T. brucei Lister 427 instead of TREU927. There is no good or simple way to handle this with a mixed database as gene IDs are not synchronised between these strains. TriTrypDB is aware of this dataset, and we hope it will be available soon.
There were several requests for data to be included as supplemental figures/tables. However, Wellcome Open Research articles cannot have supplementary information of that type: "Extended data should be deposited in an approved repository and listed as part of the data availability statement." (https://wellcomeopenresearch.org/for-authors/articleguidelines/preparing-a-research-article/#supp) To address the various requests for a tabular format of the classes/criteria used through the paper we have added a new file to the Zenodo repository, named "geneselection.tsv", which contains this information.
Christine Clayton
These comments almost entirely concerned data access, which I hope we have clarified sufficiently above. There are two specific additional notes:
No coding ability is needed to access the bloodstream form tagging data. Just look up the gene ID in the id_doi_index.tsv, and then go to the appropriate Zenodo deposition and download the zip file.
Including TrypTag/procyclic form localisation data in the bloodstream form Zenodo deposition is an unnecessary duplication. TrypTag localisation annotations are available in that project's master Zenodo deposition, also named "localisations.tsv". Note that the TREU927 gene IDs identified as orthologs for our analysis are, however, provided for easy cross-reference.
Alena Zikova
Again, for queries related to data access, please see above.
Re. "candidates for T. brucei-specific BSF nuclear structure adaptation.": This, indeed, was motivated by the potential for nuclear architecture change associated with variant surface glycoprotein gene transcription. We have clarified this in the text.
Re. usage of "many": We have updated the text to include specific numbers and percentages at all such statements.
Re. Figure 2/3: We have updated the legend to indicate that the numbers at the top of each column indicate the total number of genes in each category on the x axis, with the y axis indicating the proportion of the proteins in that category matching the specified criterion. Note the totals vary depending on data availability. E.g. in Figure 3, the number is different for each successive panel as A is all tagging attempts, B is only successfully generated cell lines and D is only where both PCF and BSF tagging gave a cell line.
Regarding the number of upregulated genes in the "All" gene set in Figure 2A being much larger than the ~300 we analysed here: you are absolutely correct, but bear in mind that this is all genes, i.e. including 427-specific genes, VSGs, ISGs, (GR)ESAGs etc., which were excluded from our analysis as indicated in Figure 1.
Re. Figure 1: We have added the number of genes in each gene set to Figure 1. Note that these are the numbers indicated at the start of the results and for each set in Figure 3A.
We have also added this information to the geneselection.tsv file.
Re. Extended data for Figure 5: In this work, we do not do any additional mRNAseq to quantify transcript abundance; the data source we use for generating Figure 5 is Jensen et al. 2014. This is