A transcriptomic dataset used to derive biomarkers of chemically induced histone deacetylase inhibition (HDACi) in human TK6 cells

Transcriptomic biomarkers facilitate mode of action analysis of toxicants by detecting specific patterns of gene expression perturbations. We identified an 81-gene transcriptomic biomarker of histone deacetylase inhibitors (HDACi) using whole transcriptome data sets of TK6 human lymphoblastoid cells generated by Templated Oligo-Sequencing (TempO-Seq) after 4 h of exposure to 20 reference compounds (10 HDACi and 10 non-HDACi) [1]. The biomarker, named TGx-HDACi, was derived using the nearest shrunken centroid (NSC) method and can distinguish HDACi from non-HDACi compounds based on the expression pattern across the 81 genes. The classification capability of TGx-HDACi was evaluated by NSC probability analysis of 11 external validation compounds (4 HDACi and 7 non-HDACi) with a probability cut-off of 90%. Thus far, TGx-HDACi has demonstrated 100% accuracy in classifying the reference and validation compounds as HDACi or non-HDACi. Of the 81 TGx-HDACi genes, 19 genes are part of the S1500+ gene panel containing 2753 genes, developed for toxicological assessments [2]. Herein, we assessed the classification performance of the biomarker with this reduced gene set to determine if TGx-HDACi can be applied to analyze S1500+ gene expression profiles. The 20 reference compounds and 11 validation compounds were correctly classified as HDACi or non-HDACi by the NSC probability analysis, principal component analysis, and hierarchical clustering based on the expression of the 19 genes, demonstrating 100% accuracy.


a b s t r a c t
Transcriptomic biomarkers facilitate mode of action analysis of toxicants by detecting specific patterns of gene expression perturbations. We identified an 81-gene transcriptomic biomarker of histone deacetylase inhibitors (HDACi) using whole transcriptome data sets of TK6 human lymphoblastoid cells generated by Templated Oligo-Sequencing (TempO-Seq) after 4 h of exposure to 20 reference compounds (10 HDACi and 10 non-HDACi) [1] . The biomarker, named TGx-HDACi, was derived using the nearest shrunken centroid (NSC) method and can distinguish HDACi from non-HDACi compounds based on the expression pattern across the 81 genes. The classification capability of TGx-HDACi was evaluated by NSC probability analysis of 11 external validation compounds (4 HDACi and 7 non-HDACi) with a probability cut-off of 90%. Thus far, TGx-HDACi has demonstrated 100% accuracy in classifying the reference and validation compounds as HDACi or non-HDACi. Of the 81 TGx-HDACi genes, 19 genes are part of the S1500 + gene panel containing 2753 genes, developed for toxicological assessments [2] . Herein, we assessed the classification performance of the biomarker with this reduced gene set to determine if TGx-HDACi can be applied to analyze S1500 + gene expression profiles. The 20 reference compounds and 11 validation compounds were correctly classified as HDACi or non-HDACi by the NSC probability analysis, principal component analysis, and hierarchical clustering based on the expression of the 19 genes, demonstrating 100% accuracy. Crown

Value of the Data
• These data demonstrate that TGx-HDACi, an 81-gene biomarker of HDACi, retains its ability to accurately classify chemicals when applied using a subset of 19 genes of the biomarker that are part of the S1500 + gene panel for toxicological analysis; this gene panel is commercially available for use on the TempO-Seq transcriptomics platform.
• The training set data and the 81-or 19-gene biomarker can be applied by researchers to efficiently analyze new and existing transcriptomic profiles of cultured human cells exposed to chemicals to identify potential HDACi agents and conditions. • The biomarker can be used to screen existing gene expression profiles of chemicals in databases to identify potential HDACi compounds and to analyze new gene expression data; the reduced gene set facilitates TGx-HDACi analysis of non-whole transcriptome gene expression profiles, which may not contain all 81 genes. • In addition, the complete TempO-Seq dataset from which TGx-HDACi was derived and validated contains whole transcriptome profiles of 31 compounds of diverse modes of action; these data could be leveraged to develop additional transcriptomic biomarkers to diversify the modes of action that can be detected in gene expression profiles of TK6 cells.

Data Description
Whole transcriptome expression profiles were generated by TempO-Seq in TK6 cells exposed to HDACi compounds and non-HDACi compounds of diverse mechanisms for 4 h (GSE164478). This dataset contains raw data files (FASTQ files) and the normalized transcriptomic profiles containing CPM values of the TGx-HDACi biomarker reference set (10 HDACi and 10 non-HDACi compounds) and the external validation set (4 HDACi and 7 non-HDACi). The list of compounds is presented Table 1    The nearest shrunken centroid (NSC) method was applied to the TGx-HDACi reference gene expression profiles to derive a panel of 81 genes whose expression pattern can distinguish between the two chemical classes; the shrunken centroids of the 81-gene biomarker are included in Supplementary Table 1. Table 2 presents the shrunken centroids of 19 genes that are part of both the TGx-HDACi biomarker and the S1500 + gene set. Figs. 1 and 2 show the principal component analysis (PCA) and hierarchical clustering, respectively, of the 19-gene expression profiles of the 20 reference compounds and the 11 validation compounds. These analyses show two distinct clusters formed by HDACi and non-HDACi classes.
A graphical depiction of the 19-gene expression profiles with the NSC probability analysis of all chemicals is presented in Fig. 3 . The heatmap shows the direction of change in each of the genes in response to the chemical treatment. Each chemical profile was assessed for its probability of membership in the HDACi class based on the expression across the 19 genes. Compounds with above 90% probability of belonging to the HDACi class were classified as such.

Chemical treatment and RNA extraction
TK6 cells were plated in 6-well plates at a density of 4 × 10 5 -5 × 10 5 cells/mL. The cells were exposed for 4 h to one concentration of each of the 20 TGx-HDACi reference compounds (10 HDACi and 10 non-HDACi) and the vehicle solvents, as well as 11 external validation compounds (4 HDACi and 7 non-HDACi), and the vehicle solvents (treatment solution concentration of 1% v/v in media). All chemicals and solvents and the concentrations used in the study are listed in Cho et al. [1] .
The concentration selection method is described in detail in Cho et al. [1] . Briefly, in a range finding experiment using qPCR and the 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide, or MTT, assay for cell viability, these concentrations induced the highest transcriptional responses after 4 h exposures, without reducing cell viability below 60% at the 24 h time point. After 4 h exposures to the selected concentrations, cells were harvested and total RNA was extracted using a RNeasy Mini Kit (Qiagen, Toronto, ON, Canada). The quality of each RNA sample was evaluated using a NanoDrop ND-100 spectrophotometer (Thermo Scientific, Burlington, ON, Canada) and an Agilent 2100 Bioanalyzer or an Agilent TapeStation (Agilent Technologies, Mississauga, ON, Canada). All RNA samples had A260/280 absorbance ratios of ≥ 2.0 and RNA integrity number (RIN) between 7.5 and 10.

Template Oligo-Sequencing library construction
Two 96-sample whole transcriptome TempO-Seq (BioSpyder, Carlsbad, CA) kits were used following the manufacturer's protocol to construct two sets of libraries: one set containing three replicates of each of the 20 reference compound treatments and the corresponding vehicle solvent treatments, and the other containing three replicates of each of the 11 validation compound treatments and the vehicle solvent treatments. Each set of libraries also contained two replicates of water only sample (negative control), human universal reference RNA, and human brain total RNA for quality control and assurance after sequencing. The libraries were sequenced on an Illumina NextSeq 500 using a 75 cycle flow cell.

TempO-Seq data processing
The BCL files were demultiplexed and converted to FASTQ files using bcl2fastq v. 2.20.0.42. Using pete. star. Script_v3.0 (BioSpyder), the reads were aligned and the feature counts specified in a Gene Transfer Format (GTF) file from the aligned reads were extracted. First, low signal ( < 0.3%) in the water only samples was confirmed. The two replicates of the human universal reference RNA and the human brain total RNA were compared within each batch and between the two batches of libraries to confirm that data generated from the two sequencing runs were comparable to each other. Both controls had a within-batch correlation of > 0.997. When compared between the two batches of libraries, correlation ranged from 0.713 to 0.794. Boxplots of total mapped reads for each sample and hierarchical clustering of all samples were observed to identify and remove outliers. A dissimilarity cut-off of > 0.2 was applied and three samples (one water vehicle control, one replicate of pracinostat, one replicate of tacedinaline) were removed from the study as a result.
The read counts in all samples were normalized as counts per million (CPM) using R 3.4.1 [3] . The CPM of each chemical treatment sample was normalized to the corresponding vehicle solvent sample. The three replicates of each chemical treatment were averaged to generate the whole transcriptome expression profile of each chemical to be used in HDACi biomarker derivation.

TGx-HDACi biomarker derivation and external validation
The nearest shrunken centroid (NSC) method was applied to the gene expression profiles of the 20 reference compounds to identify a group of genes that can distinguish HDACi from non-HDACi compounds within the reference set [4] . The NSC method was performed using the pamr function in the R statistical environment ( www.bioconductor.org ). The standard centroids of each of the HDACi and non-HDACi groups were computed by dividing the mean expression level of each gene by its within-class standard deviation.
The centroid shrinkage value was determined based on classification accuracy for the reference compounds in 10-fold cross-validation and the resulting number of genes, with a target range of 50-100 genes. With every increase in the shrinkage value, 10-fold cross validation was performed in which the reference compounds were divided into 10 groups and each of the groups was classified as HDACi/non-HDACi using the remaining nine groups as a classifier in the NSC probability analysis with a probability cut-off of 90%. A cross validation error of 95% was allowed to generate a gene panel in the target range for size. An 81-gene panel with 95% accuracy was identified using a shrinkage value of 3.449.
The 81-gene panel (named the TGx-HDACi biomarker) was used to classify an additional 11 compounds to assess its HDACi classification ability using NSC probability analysis (90% cut-off for the probability of HDACi class membership), principal component analysis (PCA), and hierarchical clustering. Validation compounds that clustered with the HDACi reference compounds were classified as HDACi in the PCA and hierarchical clustering. The PCA was performed using the prcomp in R ( www.r-project.org ) and hierarchical clustering was performed using the hclust function in R using average linkage and Euclidean distance.

Chemical classification using 19 genes of TGx-HDACi
Nineteen genes were identified that are part of both TGx-HDACi and the S1500 + gene panel (Mav et al. 2018). Using the HDACi class centroids of the 19 genes, the 20 reference compounds and the 11 validation compounds were classified as HDACi or non-HDACi in the NSC probability analysis, principal component analysis, and hierarchical clustering to assess its classification ability. Classification calls were made in the same manner as described earlier.

Ethics Statement
Not applicable.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.