RNA-sequencing (transcriptomic) data collected in liver and lung of male and female B6C3F1 mice exposed to various dose levels of 4-methylimidazole for 2, 5, or 28 days

The National Toxicology Program (NTP) reported that chronic exposure to varying dietary concentrations of 4-methylimidazole (4-MeI) increased lung tumors in female and male mice [1]. In this study, mice (male and female B6C3F1 mice) were either administered 4-MeI by oral gavage (0, 50, 100, 200, or 300 mg/kg/day) for 2 days or exposed for 5 and 28 days to 4-MeI in the diet (0, 150, 300, 1250, or 2500 ppm) and whole transcriptome (RNA-Sequencing) data from 4-MeI-exposed B6C3F1 mice to determine whether changes occurred in the target (lung) and nontarget (liver) tissues. This analysis was conducted to provide information with which to evaluate biological processes affected by exposure to 4-MeI, with a focus on identifying key events that could be used to propose a plausible mode of action (MoA) for mouse lung tumors [2].


Specifications
. Parameters for data collection FASTQ files were mapped to iGenomes UCSC mm10 reference, reads per sample counted with HTSeq. Body weight, food consumption, and clinical observations are reported in the study report appendices (Supplemental Data File 1).

Description of data collection
Male and female mice exposed to 4-MeI for 2, 5, and 28 days at four dose levels, plus vehicle-only controls. Eight animals per sex were exposed at each dose level and time point, with the six samples per condition (sex, dose, and time), yielding the best total RNA used for whole-transcriptome sequencing. Samples were sequenced by 75 bp paired-end reads, with four bar-coded pairs of reads per sample (Supplemental Data Files 2, 3, and 4

Value of the Data
• Gene expression data provide a powerful tool for identifying key molecular initiating and key events to inform a mode of action (MoA) for mouse lung tumors in mice exposed to high levels of 4-MeI [3,4] . • Gene expression data provide the ability to generate cellular mechanistic hypotheses relative to cellular biology preceding apical adverse endpoints [5] . • These data are useful in considering plausible MoAs for mouse lung tumors that occur in mice with chronic dietary exposure to 4-MeI.

Data Description
Tables 1 and 2 provide summary data of in-life animal observations, including food consumption and body-weight changes with exposure to 4-MeI in male mice and female mice, respectively. Supplemental Data File 1 provides the final report from the animal study, with complete descriptions of the experimental design, animal model, dose levels, and endpoints collected over the 28-day exposure period, as outlined in Tables 4 and 5 , below, in the Experimental Design, Materials, and Methods section. Table 3 provides the differential gene expression from feature count data, generated after normalization in DESeq2. Fold change was computed relative to time-specific, vehicle-only control animals. The data provided in this table are the number of differentially expressed genes Table 1 In-life animal observations: Mean feed consumption and body weight change in male mice exposed to 4-MeI (Supplemental Data File 1 provides complete raw data for both male and female mice).  Table 2 In-life animal observations: Mean feed consumption and body weight change in female mice exposed to 4-MeI (Supplemental Data File 1 provides raw data for both male and female mice). Abbreviations: SD = standard deviation, n/a = not applicable. 1 Feed consumption was calculated based on food consumed after 1 week/ body weights of mice divided by number of days exposed. 2 Body-weight gain determined from individual animal data based on difference between final mean body weight and initial mean body weight. * Statistically significant decrease compared to concurrent control (Dunnett's test p < 0.05).

Table 3
Differential gene expression from feature count data, generated after normalization in DESeq2. Fold change was computed relative to time-specific, vehicle-only control animals. Shown are the number of differentially expressed genes (DEGs) using an FDR-corrected p-value of < 0.05, an absolute value of fold change of > 1.2-fold (which had to be lowered from a standard FC of 1.5 due to limited signal), and both thresholds applied simultaneously. These data are derived from the RNA-Seq count tables in NCBI GEO accession GSE129622. While the liver has a larger proportion of genes in common between male and female mice than does the lung, there were still a large number of sex-specific DEGs, particularly in females, when a less stringent criterion of |FC| > 1.2 was used.
using a false discovery rate (FDR)-corrected p -value of < 0.05, an absolute value of fold change of > 1.2-fold (which had to be lowered from a standard fold change (FC) of 1.5 due to limited signal), and both thresholds applied simultaneously. Fig. 1 provides an example using Venn diagrams showing the distribution of differentially expressed genes (DEGs) between male and female mice exposed to the highest dose level of 4-MeI (300 mg/kg-d) for 2-days of oral gavage dosing. Both sexes exhibit maximal differential expression using any criteria selected. Genes shown in these Venn diagrams were identified as significant by both FDR < 0.05 and a |FC| > 1.2-fold. While the liver has a larger proportion of genes in common between male and female mice than does the lung, there were still a large number of sex-specific DEGs, particularly in females, when a less stringent criterion of |FC| > 1.2 was used. Fig. 2 provides an example of a map of the ontology enrichment for the lungs of mice exposed to 4-MeI where the genes are rank-ordered by fold change based on the selection of the top 500 genes up-regulated and top 500 genes down-regulated from the highest dose (2 days) or dietary exposure level (5 or 28 days). Fig. 3 provides an example of a map of the ontology enrichment for the liver of mice exposed to 4-MeI where the genes are rank-ordered by fold change based on the selection of top 500 upregulated and top 500 down-regulated genes from the dose level (2 days) or dietary exposure level (5 days).
Supplemental file 1 provides the final report from the study in which mice were exposed to 4-MeI by oral gavage for 2 days, and then via the diet for 5 or 28 days, prior to collection of liver and lungs for further processing. This report contains the in-life data from the animal study, such as the individual body weights and food consumption information, to summarize the data reported in Tables 1 and 2 . It also provide more details on the experimental animal study protocol.
Supplemental file 2 [2_Day2_DGE.xlsx], provides Day 2 liver and lung differential gene expression results (DESeq2) from male and female mice. Tables (four tabs in single Excel file) with gene identifiers include the Log 2 fold change, the standard error of the fold change, and the p-value and FDR-corrected p-value for each gene.
Supplemental file 3 [3_Day5_DGE.xlsx], provides Day 5 liver and lung differential gene expression results (DESeq2) from male and female mice. Tables (four tabs in single Excel file), with gene identifiers, include the Log 2 fold change, the standard error of the fold change, and the p-value and FDR-corrected p-value for each gene.
Supplemental file [4_Day28_DGE.xlsx] provides Day 28 liver and lung differential gene expression results (DESeq2) in male and female mice. Tables (four tabs in single Excel file), with Reactome ontology enrichment for lung using genes rank-ordered by fold change and selecting the top 500 upregulated and top 500 down-regulated genes from the highest dose level at each time point (2-, 5-, and 28 days). The red nodes (male mice) indicate categories enriched with down-regulated genes at 2 days, the green node (male mice) with down-regulated genes at 28 days, and the blue (female mice) with up-regulated genes at 5 days. The mustardcolor nodes (male mice) indicate categories simultaneously enriched for down-regulated genes at both 2 and 28 days. All colored nodes had a minimum of five elements and an enrichment FDR < 0.05 (Fisher's exact test). White nodes were not significantly enriched and are included for continuity of the ontology hierarchy. The shaded area highlights mitochondrial functional pathways and metabolism, including arachidonic acid metabolism. Fig. 3. Reactome ontology enrichment for liver using genes rank-ordered by fold change and selecting the top 500 upregulated and top 500 down-regulated genes from the maximum exposure concentration group. The red nodes (male mice) indicate categories enriched with up-regulated genes that occurred with administration of 4-MeI for 2 days, and the green node (female mice) with up-regulated genes that occurred after 5 days of dietary exposure. The mustardcolor node indicates categories simultaneously enriched for both males (2-day) and females (5-day). All colored nodes had a minimum of five elements and an enrichment FDR < 0.05 (Fisher's exact test). White nodes were not significantly enriched and are included for continuity of the ontology hierarchy. The shaded area highlights metabolism processes associated with amino acid metabolism, lipid metabolism, and glutathione metabolism. Temporal differences in the specific processes enriched in cell-cycle processes for each sex are indicated by the mustard-colored nodes downstream of "cell cycle" and "cell cycle, mitotic," indicating no overlap in enriched elements between the sexes for these different exposure times. Nodes enriched for males at 2 days (red nodes) were not enriched for female mice at 5 days (green nodes), and vice versa. gene identifiers, include Log 2 fold change, the standard error of the fold change, and the p-value and FDR-corrected p -value for each gene.

Animal husbandry
Male and female B6C3F1 mice (n = 120/sex, body weight: 16.3-27.6 g, and age: 9 weeks) (CRL International, Inc.) were acclimated for at least 14 days prior to study start; the study was conducted at Integrated Laboratory Services (ILS), Inc. All procedures were in compliance with the Animal Welfare Act Regulations, 9 CFR 1-4, and animals were handled and treated according to the Guide for the Care and Use of Laboratory Animals [6] .

Test substance
4-MeI was purchased from Sigma Aldrich (Lot Batch, MKBV5083V) and prepared in diet formulations (NTP 20 0 0 rodent diet, Zeigler, Gardners, PA) at CRL (Ashland, OH) and gavage dose formulations at ILS, Inc in sterile USP water at dose concentrations of 0 (water only as vehicle control), 5, 10, 20, and 30 mg/mL. Dose formulations were protected from the light and analyzed under conditions of use and found to be stable. Table 4 provides an outline of the study design. For oral gavage, a dose volume of 10 mL/kg was administered each day for 2 days. Dose formulations via feed were ad libitum for 5 and 28 days. Animals were evaluated twice daily and once on weekends for mortality/moribundity. Body weights were evaluated at the study start, weekly, and at termination. Food consumption (groups 6-15) was calculated from the start of exposure to the termination date for the 5-and 28-day exposure groups. Animals were euthanized approximately 6 h after the final dose for Table 4 The group number, number of animals per group, test substance and dose level, dose route, and day of study termination are identified. Six animals per group were analyzed only for changes in transcriptomics. (Note: animals 070 and 071 were found dead after dose administration on Day 0 and, thus, are absent from the data table  A total of six mice per condition (sex, dose, and time) ( Table 4 ) were used for the RNA-Seq experiment (the six animals with the best total RNA yield among the eight animals per exposure group). The raw data (FASTQ files in NCBI sequence read archive (SRA) as part of GEO accession GSE129622) consist of four pairs of paired-end read files per sample, while the mm10 feature count data in the NCBI GEO accession consist of total counts per genomic feature per sample (six tab-delimited text files, one for each sex and tissue at each of the three sample time points). The vehicle control (VC) was sterile water for the gavage study and untreated feed for the dietary study). 1440 1432 groups 1-5, and then on Day 5 or Day 28 for animals designated to the dietary study. All animals survived to scheduled termination except for two male mice in Group 5 (300 mg/kg-d, 2-day gavage) which were found dead following the first day of dosing. There were no clinical abnormalities associated with toxicity observed in any animals during the course of the study.

Study design
Of the eight animals sampled, the six with the highest yield and quality of recovered RNA were used for sequencing. The complete in-life report with clinical observational data is available as Supplemental Data File 1.
At the end of the study, the right lung was harvested for gene expression analysis. The lung was perfused with RNAlater, and a section of the left liver lobe was cubed and fully immersed in RNAlater ( ≥ 5 volumes) and stored at 2-8 °C for 1-30 days and then at -15 °C to -25 °C indefinitely. The left lung was perfused with 10% neutral buffered formalin (NBF), and the remaining liver lobe was immersed in NBF for 18-24 h and transferred to 70% histology-grade alcohol prior to paraffin embedding. RNA was extracted from liver and lung from each animal, cDNA libraries were prepared, transcriptomes were sequenced using next-generation sequencing, and FASTQ files were prepared prior to data analysis.

RNA sequencing
Sequencing was carried out using 1-to 2-μg total cellular RNA using Illumina standard procedures for their TrueSeq ® stranded mRNA HT kits. Sequencing was performed on an Illumina NextSeq 500, and binary base call (BCL) files were uploaded to Illumina BaseSpace for processing and FASTQ file generation. After preparation of the mRNA from eight animals per exposure group, the six samples with the highest yield were used for sequencing. Table 5 provides a summary of the samples of liver and lung collected from the male and female mice for the RNA-Seq experiment, including the day and 4-MeI concentrations and the number of biological replicates collected for analysis.

FASTQ file processing
The design of the sequencing experiment meant that each biological sample consisted of four barcoded sets of reads, with a pair of FASTQ files (forward and reverse reads) for each barcode set. To eliminate low quality reads from processing, each FASTQ file was processed by read trimming, where each sequence read was trimmed from both ends to eliminate all bases with a PHRED33 score (measure of quality of nucleobases generated by automated DNA sequencing) of less than 21 [7] . Also, any read of less than 65bp generated as a result of trimming was discarded. For the samples from 4-MeI exposed mice, this typically eliminated less than 0.5% of the total reads in any given FASTQ file, thus effectively only eliminating the few poor-quality reads in a sample.
After trimming, each pair of read files (FASTQ file) were mapped to the UCSC mm10 reference genome ( http://genome.ucsc.edu ) using the short-read mapping algorithm BOWTIE2 and indexed reference genomes provided publicly as a resource by Illumina's iGenome project [8] . Each mapped pair of FASTQ files produced a single SAM (sequence alignment map) format output file [9] . These were, in turn, sorted by genomic coordinates, and then merged into a single mapped read BAM (binary alignment map) file, to produce a single mapped read file for each biological sample.
Raw gene expression data were extracted from the BAM files by counting each pair of reads that maps to an annotated genomic feature in the reference mm10 using the Python tool HTSeq [10] . Once each biological sample was counted, the counts are merged into a single tab-delimited table for statistical processing. Both FASTQ files and the count tables as text files were deposited in the NCBI GEO expression database, accessible under accession GSE129622.

Differential gene expression analysis
The tabulated genomic feature count data was processed in DESeq2, a BioConductor package (ver. 3.4) in the open-source statistical language R (ver.3.3.2) [11] . DESeq2 uses a dispersion correction of the count data based on the negative binomial distribution and a maximum likelihood model to impute the prior data distribution for statistical testing. Empirical Bayesian statistics are applied to linear combinations of factors to test differential expression for multiple contrasts simultaneously. To avoid bias and unnecessary computation in the dispersion correction, the data set was pre-filtered to exclude any annotated genomic feature for which there were no counts in any biological sample. The final output of DESeq2 is a table of estimated Log 2 fold change, p-values for the defined contrasts tested, and Benjamini-Hochberg corrected false discovery p-values (FDR) [12] .
We determined the significance of differential expression using multiple thresholds, either singly or in combination. A statistical threshold of an FDR < 0.05 is a commonly used significance threshold in whole transcriptomic analysis. Additionally, some minimum magnitude of change in gene expression is typically applied as a selection criterion. In this study, with six replicates per dose and time, fold change thresholds of 1.2-fold, up-or down-regulated (|FC| > 1.2 fold), was applied, because the more commonly applied FC threshold of 1.5-fold was not sensitive enough. A fold change of 1.2-fold is equal to a Log 2 fold change of 0.263. Finally, the application of both a statistical threshold and the smaller FC criterion (FC = 1.2) (FDR < 0.05 & |FC| > 1.2 fold) permitted identification of a larger numbers of genes whose differential expression from controls was statistically significant, allowing a better opportunity to identify enriched pathways. The complete differential expression tables that provide the data described above are provided as Supplemental Data Files 2, 3, and 4 (Supplemental_2_Day2_DGE.xlsx, Supplemental_3_Day5_DGE.xlsx, and Supplemental_4_Day28_DGE.xlsx), for data from male and female mouse liver and lung following exposure to various concentrations of 4-MeI for 2 days, 5 days, and 28 days, respectively.
Reactome ontology enrichment was performed using an in-house software tool (GoFig-ureMaps) that performs a Fisher's exact test of over-representation of query genes relative to defined pathway elements. This software produces a graphical representation of the ontology enrichment in the context of the ontology hierarchy of cellular pathways, referred to as a bubblemap [ 13 , 14 ].

Ethics Statement
The Animal Study final report described in Supplemental Data File 1 was conducted at ILS. ILS has an Office of Laboratory Animal Welfare (OLAW) Assurance (A3490-01), and therefore, it uses the standards of the ILAR (2011) Guide for the Care and Use of Laboratory Animals [6] , the PHS (2015) Policy for Humane Care and Use of Laboratory Animals [15] , and USDA, (2020) Animal Welfare Act [16] .