HTGQC and shinyHTGQC: an R package and shinyR application for quality controls of HTG EdgeSeq protocols

Extraction-free HTG EdgeSeq protocols are used to profile sets of genes and measure their expression, and are frequently used to characterise tumours and their microenvironments. However, although positive and negative control genes are provided, little guidance is given on how to assess the technical success of each sample within a sequencing run. We developed HTGQC, an R package for the quality control of HTG EdgeSeq protocols. Additionally, shinyHTGQC is a shiny application for users without programming experience, providing an easy-to-use interface for data quality control and visualisation. Quality checks can be performed on the raw sequencing outputs, and samples are flagged as FAIL or ALERT based on the expression levels of the positive and negative control genes.

Availability & Implementation
The code is freely available at https://github.com/LodovicoTerzi/HTGQC (R package) and https://lodovico.shinyapps.io/shinyHTGQC/ (shiny application), including test datasets.


tumour and its microenvironment composition (Oncology Biomarker Panel and Precision Immuno-Oncology Panel) [8,9]. Formalin-fixed paraffin-embedding (FFPE) is a widely used method that allows tissue specimens to be preserved and easily stored for years. Therefore, the extraction-free HTG EdgeSeq panels are optimised for FFPE samples. Because they allow the reliable quantification of RNA expression from FFPE samples, these panels are of great interest to the biological research community [10].
Although the wet lab protocols are highly standardised, little indication is given concerning the analysis of the resulting sequencing data. A fundamental step is the assessment of the success of the sequencing runs and the quality of individual samples [11].
In this regard, HTG EdgeSeq panels provide four positive controls (spike-ins) and four negative controls (non-human genes). Before proceeding with the data analysis, normalisation and quality controls are critical for such custom-made panels.
The amount of data created by high-throughput machines is constantly increasing; however, the tools allowing biologists to independently perform basic data cleaning, visualisation and analysis are still mostly inadequate.
Here, we introduce HTGQC and shinyHTGQC, the first tools, to our knowledge, for the quality control of HTG EdgeSeq protocols. Since no well-defined pipeline is currently advised or in use for the analysis of this type of data, our standardised quality control tool can serve as a building block and starting point for downstream applications, such as differential expression and gene set enrichment analyses.

METHODS
We provide an easy-to-use tool, along with clean visualisations of the quality control, in the form of an R package (RRID:SCR_001905). Moreover, a Shiny application (RRID:SCR_001626) has been created to facilitate quality controls and data analysis for those unfamiliar with the R language [12] (Figure 1). The R package HTGQC is composed of two functions. The first one, readHTG(), processes the output from the EdgeSeq machine (a .xlsx file) with no need for preprocessing or manual curation. The main function, qualityCheck(), performs the quality control of the samples using both positive and negative controls.

Positive controls
For each sample, the percentage of reads allocated to the positive controls, relative to the total library size, is plotted. A table lists the samples that do not pass the quality control, with a sample flagged FAIL when this percentage exceeds 40%. Based on previous analyses, this percentage tends to be in the range of 0-5% for good-quality samples. Therefore, we added an ALERT flag when the percentage of reads allocated to positive controls exceeds 10%.
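The thresholds described above can be illustrated with a minimal Python sketch. This is not the code of the HTGQC R package: the function name, the "PASS" label, the strict "greater than" comparisons and the example read counts are all assumptions made for illustration; only the 10% and 40% thresholds come from the text.

```python
def positive_control_flag(positive_reads, library_size,
                          alert_pct=10.0, fail_pct=40.0):
    """Return (percentage, flag) for one sample.

    The 10% (ALERT) and 40% (FAIL) thresholds follow the text.
    The strict '>' comparisons and the 'PASS' label are assumptions.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    # Share of the sample's library taken up by positive-control reads.
    pct = 100.0 * positive_reads / library_size
    if pct > fail_pct:
        flag = "FAIL"
    elif pct > alert_pct:
        flag = "ALERT"
    else:
        flag = "PASS"
    return pct, flag


# Hypothetical samples: (reads on positive controls, total library size).
samples = {
    "S1": (30_000, 1_000_000),   # 3% of reads on positive controls
    "S2": (150_000, 1_000_000),  # 15%
    "S3": (500_000, 1_000_000),  # 50%
}
flags = {name: positive_control_flag(pos, lib)[1]
         for name, (pos, lib) in samples.items()}
```

In this sketch the FAIL threshold is checked before the ALERT threshold, so a sample above 40% is reported only as FAIL.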

Negative controls
If every sample in a run fails, all deviances may still fall within the ±2 standard deviation range, since all the samples then have a high number of reads allocated to negative controls. We have therefore added a second filter in which, similarly to the positive controls, a sample fails the quality control if the percentage of reads allocated to the negative controls exceeds 10% of its library size. Finally, each sample's quality control is flagged ALERT when either of the two filters returns ALERT, and FAIL when either filter returns FAIL.
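The reasoning behind the second filter can be illustrated with a small Python sketch. It is not the HTGQC implementation. In particular, measuring the deviance of each sample against the run mean is an assumption made only to show the failure mode. The function names, the "PASS" label, the precedence of FAIL over ALERT and the example percentages are likewise hypothetical.

```python
from statistics import mean, stdev


def outside_two_sd(values):
    """True for each value lying more than 2 sample SDs from the run mean.

    Assumption: deviance is taken relative to the mean of the run.
    """
    m = mean(values)
    s = stdev(values)
    return [abs(v - m) > 2 * s for v in values]


def negative_pct_fail(pcts, fail_pct=10.0):
    """True for each sample whose negative-control share exceeds 10%."""
    return [p > fail_pct for p in pcts]


def combine_flags(flags):
    """Combine per-filter flags; FAIL taking precedence is an assumption."""
    if "FAIL" in flags:
        return "FAIL"
    if "ALERT" in flags:
        return "ALERT"
    return "PASS"


# Hypothetical run in which every sample has about 30% of its reads
# allocated to negative controls (a complete failure of the run).
failed_run = [31.0, 29.5, 30.2, 30.8, 29.9, 30.5, 29.7, 30.4]

sd_flags = outside_two_sd(failed_run)      # relative filter
pct_flags = negative_pct_fail(failed_run)  # absolute 10% filter
```

Because all samples in this hypothetical run are similarly poor, none of them stands out from the others and the relative filter flags nothing, whereas the absolute 10% filter flags every sample.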

Output
The output consists of two tables and two plots. The tables list the samples that passed and failed the quality control, while the plots show the percentage of reads allocated to the negative controls and the deviance from the expected value for the negative controls. If the input data were not pre-processed, an additional option allows the cleaned data to be downloaded, ready for further analyses. The package has been trained and validated for the Oncology Biomarker and Precision Immuno-Oncology HTG panels. However, some options can be customised (i.e., the number of positive and negative control genes) to allow further applications.

Shiny application
The web-based shiny application shinyHTGQC implements the same quality control approach described above. This tool allows researchers with no knowledge of the R language to perform the same analyses.
shinyHTGQC only requires the user to upload the input data files and specify the control genes to be used. The application performs the quality check and lets the user visualise and download the results in a user-friendly way.

Data normalisation and analysis
Since shinyHTGQC will be used by researchers with little or no knowledge of bioinformatics, we also added features for visualising the effect of the quality control analysis.
The user is asked to choose a normalisation method: counts per million (cpm) or variance stabilizing transformation (vst). The default selection is cpm, as suggested by the authors.
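For readers unfamiliar with cpm, the standard definition (each raw count divided by the library size of its sample and multiplied by one million) can be sketched in Python as follows. This is a generic illustration, not the code used by shinyHTGQC. The function name, the data layout and the example counts are assumptions, and the text does not state whether control probes are included in the library size.

```python
def counts_per_million(counts_by_sample):
    """Convert raw counts to counts per million (cpm), sample by sample.

    counts_by_sample maps a sample name to a {gene: raw_count} mapping.
    Assumption: the library size is the sum of all counts given for
    the sample.
    """
    normalised = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        normalised[sample] = {
            gene: count * 1e6 / library_size
            for gene, count in counts.items()
        }
    return normalised


# Hypothetical raw counts for two samples with different library sizes.
raw = {
    "S1": {"GENE_A": 250, "GENE_B": 750},
    "S2": {"GENE_A": 2_500, "GENE_B": 7_500},
}
cpm = counts_per_million(raw)
```

The two hypothetical samples differ tenfold in sequencing depth but have the same relative composition, so their cpm values coincide. Removing depth differences in this way is the purpose of the normalisation.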
A principal component analysis (PCA) is plotted in the 'Analysis' tab of the shiny application, along with a heatmap visualisation of the samples (columns) and genes (rows).
To visualise the effect of removing samples flagged as FAIL/ALERT by the quality control, a further option allows the user to exclude these samples from the graphical representations.
The user can provide an annotation file that defines the groupings of the samples. In this case, the PCA will be coloured and the heatmap annotated accordingly.

DATA AVAILABILITY
Test datasets are available at https://github.com/LodovicoTerzi/HTGQC/tree/main/dataExample/. Both pre-formatted and unformatted test datasets are provided, along with a sample annotation dataset. Snapshots of the code are available in the GigaDB repository [13].