mRNA expression data in breast cancers before and after consumption of walnut by women

This article contains supporting data for the research paper entitled: ‘Dietary walnut altered gene expressions related to tumor growth, survival, and metastasis in breast cancer patients: a pilot clinical trial’ [1] Hardman et al., 2019. Included are tables of all mapped genes and all unmapped loci whose expression was significantly changed in breast cancers by consumption of walnut for about 2 weeks. All gene networks that were identified by Ingenuity Pathway Analysis as modified are shown in Table 3. Files containing the raw reads, along with a shell script describing the complete data analysis pipeline, were deposited to the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) and can be obtained via accession number GSE111073. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111073.


Specifications table

Subject area
Breast cancer

More specific subject area
Diet supplementation with walnut

Type of data
Tables of modified genes and gene pathways

How data was acquired
Next-gen sequencing of mRNA from breast cancer tumors. Total RNA was extracted from biopsy or surgical specimens using the RNeasy Lipid Tissue kit (Qiagen.com). RNA sample quality was assessed on RNA Pico chips using an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA). One microgram of total RNA was used to construct RNA-Seq libraries using a TruSeq stranded total RNA library prep kit with Ribo-Zero (H/M/R) ribosomal RNA reduction (Illumina Inc., San Diego, CA) according to the kit's instructions. Twenty RNA-Seq libraries were clustered on an Illumina cBot and sequenced on a HiSeq 1500 platform, in a 2 x 50 base paired-end design yielding a minimum of 50 million reads per sample.
Reads were trimmed using Trimmomatic v0.36 [2] to remove low-quality base calls and adapter sequences, and then aligned to the human reference genome GRCh38 using HISAT v2.1.0 [3]. Resulting BAM files were sorted and indexed with SamTools v1.3.1 [3,4], and PCR and optical duplicate reads were marked using Picard tools v2.6.0. The numbers of reads mapping to each gene for each sample were counted using the R/Bioconductor package GenomicAlignments, v1.12.2 [5] and the Ensembl gene database for GRCh38, build 84 [6]. Differential gene expression was computed using DESeq2 version 1.10.0 [7], with a statistical model comparing the ratio of expression between surgery and biopsy specimens for the walnut-consuming group to the ratio of expression between surgery and biopsy specimens for the control group.

Data format
Raw expression reads were analyzed by Ingenuity Pathway Analysis to identify gene expressions in the tumor that were changed by walnut consumption.

Experimental factors
Women with breast cancer in the walnut group consumed 2 ounces of walnuts per day for 2-3 weeks; the control group did not consume walnut.

Experimental features
The data were obtained in a non-placebo, two-arm clinical trial. Women with breast lumps large enough for research and pathology biopsies were recruited and randomized to walnut-consuming or control groups. Immediately after biopsy collection, women in the walnut group began to consume two ounces of walnuts per day until follow-up surgery; the control group did not consume walnut. Pathology confirmed that lumps were breast cancer in all women who remained in the trial. At surgery, about two weeks after biopsy, additional specimens were taken from the breast cancers. Changes in gene expression in the surgical specimen compared to baseline biopsy were determined in each individual woman in the walnut-consuming (n = 5) and control (n = 5) groups. RNA-Seq was performed. Resulting expression data were analyzed by Ingenuity Pathway Analysis.

Data source location
Huntington, West Virginia

Data accessibility
Files containing the raw reads, along with a shell script describing the complete data analysis pipeline, were deposited to the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) and can be obtained via accession number GSE111073. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111073

Related research article
Hardman et al., 2019 [1]

Value of the data
These data can be of value to those who desire to: 1) further investigate synergism between diet and cancer therapy; 2) understand why some cancers responded to the dietary intervention and some may not; 3) discover beneficial combinations of dietary components and/or standard cancer therapies by understanding the genes that are influenced.

Data
Table 1 presents the known genes for which the log ratio of [(expression of the gene in the breast tumor at surgery) divided by (expression of that gene in the initial biopsy)] in the subjects who consumed walnut, divided by [(expression of the gene in the breast tumor at surgery) divided by (expression of that gene in the initial biopsy)] in the control subjects, was significant. Thus, these data show the genes in which mRNA expression was significantly changed by consumption of walnut compared to control and the direction of that change (increased or decreased).
Table 2 contains all the loci that did not map to a gene but were identified as significantly altered in the breast tumor by consumption of walnut, using the same calculations as for the data in Table 1. The meaning of changes in these loci has not been identified.
IPA uses the results shown in Table 1 to organize the genes into functional networks. This is important for identifying the net effect of multiple changes in gene expression. Table 3 lists all 25 significantly modified gene networks that were identified by IPA, the genes in those networks, and the top diseases and functions associated with the genes. The network 'score', the negative log of the overall statistical significance (p-value) of the network, is shown. A network score of 41 corresponds to a p-value of $10^{-41}$, meaning that one would expect to encounter this pattern of mRNA expression changes by chance only about once in $10^{41}$ experiments of a similar type. The data in this table indicate the effect of walnut consumption on gene networks in the existing breast cancer and could indicate other diseases or functions in which dietary walnut may have benefit. Focused research would be needed to ascertain this benefit.

Experimental design
Women were recruited for this clinical trial when they came to the clinic for their first diagnostic biopsy, before it was known whether the lump was cancer. Potential subjects must have had lumps large enough for the necessary diagnostic biopsies and 1 or 2 extra research biopsies. After signing informed consent, subjects were randomized into walnut-consuming or control groups. Subjects in the walnut group immediately began to consume 2 ounces of walnuts per day until surgery. If a subject was later found to not have cancer, or if the cancer was to be treated with chemotherapy or radiation prior to surgery, she was no longer included in the trial. Thirty-eight women were initially recruited. Twenty-four of the 38 subjects were disqualified because the lump was benign or because the subject was to receive chemo- or radiation therapy prior to surgery. An additional 4 subjects were disqualified because the extracted mRNA of at least one specimen did not pass quality control. Ten subjects remained: five in the walnut-consuming group and five in the control group. mRNA was extracted from each individual specimen, then genome-wide mRNA expression was determined in each specimen via next-generation sequencing. Gene expression ratios were calculated for each gene as: (walnut surgery/walnut biopsy)/(control surgery/control biopsy) for further analyses [1].
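The ratio-of-ratios calculation described above can be illustrated with a minimal sketch. The expression values and the function name below are hypothetical, and the sketch uses a plain geometric mean of per-patient ratios; it is not the moderated estimate produced by DESeq2 in the actual analysis.

```python
from math import log2
from statistics import mean


def ratio_of_ratios(walnut_pairs, control_pairs):
    """Return (ratio, log2 ratio) of surgery/biopsy changes, walnut vs control.

    Each argument is a list of (biopsy, surgery) expression values for one
    gene, one tuple per patient, so each patient serves as her own control.
    """
    # Mean per-patient log2(surgery / biopsy) within each group
    walnut_change = mean(log2(surgery / biopsy) for biopsy, surgery in walnut_pairs)
    control_change = mean(log2(surgery / biopsy) for biopsy, surgery in control_pairs)
    log2_ratio = walnut_change - control_change
    return 2 ** log2_ratio, log2_ratio


# Hypothetical normalized expression values for a single gene
walnut = [(100.0, 200.0), (50.0, 100.0)]   # expression doubles in both patients
control = [(80.0, 80.0), (40.0, 40.0)]     # expression unchanged in both patients
ratio, log2_ratio = ratio_of_ratios(walnut, control)
print(ratio, log2_ratio)
```

A ratio above 1 (positive log2 ratio) indicates that expression rose more between biopsy and surgery in the walnut group than in the control group; a ratio below 1 indicates the opposite.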

IRB approval
The Marshall University Office of Research Integrity has an Institutional Review Board (IRB), which reviews and monitors all human subject research conducted at Marshall University, St. Mary's Medical Center, Cabell Huntington Hospital and the Edwards Cancer Center. The research protocol and participant informed consent were approved by the IRB (protocol number 339384-3). This study was not listed at ClinicalTrials.gov. Potential study participants were identified from records review by the Research Study Nurse prior to their appointment for a diagnostic biopsy. At the appointment time, the potential participant was interviewed by the study nurse, the study was explained and informed consent was obtained. The physician obtained one or two additional biopsies for research use when the biopsy was obtained for pathology studies.
Inclusion criteria: All subjects: 1) were female, with a breast mass that, according to standard of care, was to be biopsied for diagnosis and was large enough to obtain the needed biopsies for pathology and research; 2) understood and were willing to sign the informed consent form; 3) had an ECOG (Eastern Cooperative Oncology Group) performance status of 0 or 1 (0: fully active, able to carry on all pre-disease performance without restriction; 1: restricted in physically strenuous activity but ambulatory and able to carry out work of a light or sedentary nature, e.g., light housework, office work); 4) were between 18 and 90 years of age; 5) were recruited as available without regard to race or ethnicity.
Exclusion criteria: Excluded persons were: 1) those who did not like or who were allergic to walnuts or other tree nuts; 2) those with any metabolic disease that could be affected by walnut consumption; 3) those with a life expectancy of less than 6 months; 4) those who were pregnant (to prevent confounding due to pregnancy hormonal factors).

Clinical protocol
Subjects were consented at their first visit and were randomized into treated (consume walnut) or control (no added walnuts) groups. Routine clinical data were recorded (age, weight, height, family history, etc.). A 5 mL blood specimen in EDTA was collected for the research laboratory. After the initial biopsy, the subject was asked to continue to consume her usual diet and to not change consumption of any medication or supplements. If she was randomized to the walnut group, the subject was given 30 one-ounce packages of walnuts and was asked to consume two packages (two ounces) of walnuts daily and to return remaining packages for counting. If needed, due to extended time for the clinical workup, the subject was given additional packages of walnuts to allow for continued consumption of two ounces of walnuts per day until surgery (about two to three weeks). Control group subjects were asked to not intentionally consume walnuts. At the conclusion of the study, each subject was asked to identify whether any changes were made to the usual diet, especially in the areas of fruits, vegetables, nuts or supplement consumption and fats used in cooking, and whether walnuts were consumed (walnut group) or not (control group). At the time that ultrasound-guided core needle biopsies of the breast mass were obtained for diagnosis, one or two extra cores were taken for research use. In the procedure room, immediately upon removal, the biopsies for research were placed in Qiagen All-Protect tissue reagent (Qiagen.com) to preserve RNA, DNA and protein for up to 7 days at room temperature. Biopsies were delivered to the research laboratory for initial processing within 2 hours.
If the pathology report indicated that the lump was not cancer, no further tissue was collected and the subject was no longer part of the study. If the biopsied tissue was breast cancer and surgery was scheduled without intervening radiation or chemotherapy, another specimen of tumor tissue and blood was collected at surgery. A small section of macroscopically viable tumor, away from the clean margin, was excised then immediately placed in Qiagen All-Protect tissue reagent, as before. Any patient who, according to the clinical care plan, was to receive either chemotherapy or radiation prior to surgery was no longer part of the study so as to not confound analyses.

Laboratory protocols
Total RNA was extracted using the RNeasy Lipid Tissue kit (Qiagen.com). This micro-kit is suitable for less than 5 mg of tissue and for extracting up to 45 µg of total RNA. RNA was checked for quantity and sent to the Marshall University Genomics Core Facility for further processing. The Genomics Core is a full-service facility and provided RNA quality assessment, RNA-Seq analysis on each specimen, and DESeq2 expression profiling analyses of the data.

RNA sequencing: next-generation sequencing
RNA sample quality was assessed on RNA Pico chips using an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA). RNA samples had RNA Integrity Numbers (RIN) ranging from 2.6 to 9.4. One microgram of total RNA was used to construct RNA-Seq libraries using a TruSeq stranded total RNA library prep kit with Ribo-Zero (H/M/R) ribosomal RNA reduction (Illumina Inc., San Diego, CA) according to the kit's instructions. RNA fragmentation times were modified based on each RNA sample's RIN value to generate inserts of equal size across all libraries.
Twenty RNA-Seq libraries were clustered on an Illumina cBot and sequenced on a HiSeq 1500 platform, in a 2 x 50 base paired-end design yielding a minimum of 50 million reads per sample. Five matched pairs of samples (initial biopsy and subsequent surgery) were collected from each of the walnut-consuming and control groups.
Reads were trimmed using Trimmomatic v0.36 [2] to remove low-quality base calls and adapter sequences, and then aligned to the human reference genome GRCh38 using HISAT v2.1.0 [3]. Resulting BAM files were sorted and indexed with SamTools v1.3.1 [3,4], and PCR and optical duplicate reads were marked using Picard tools v2.6.0. The numbers of reads mapping to each gene for each sample were counted using the R/Bioconductor package GenomicAlignments, v1.12.2 [5] and the Ensembl gene database for GRCh38, build 84 [6]. Differential gene expression was computed using DESeq2 version 1.10.0 [7], with a statistical model comparing the ratio of expression between surgery and biopsy specimens for the walnut-consuming group to the ratio of expression between surgery and biopsy specimens for the control group, as described in "Statistical analyses" below.
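The order of the read-processing steps above can be sketched as a list of command lines. This is only an illustration: the file names, index name, adapter file and trimming thresholds are hypothetical placeholders, not the authors' settings, and the sketch assumes the aligner is invoked as `hisat2`. The authoritative commands are in the shell script deposited with GSE111073. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def pipeline_commands(sample, index="grch38_index", adapters="TruSeq3-PE.fa"):
    """Build trimming, alignment, sorting, indexing and duplicate-marking commands.

    All paths and trimming parameters are hypothetical placeholders.
    """
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    p1, u1 = f"{sample}_R1.paired.fq.gz", f"{sample}_R1.unpaired.fq.gz"
    p2, u2 = f"{sample}_R2.paired.fq.gz", f"{sample}_R2.unpaired.fq.gz"
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # 1. Trim adapters and low-quality bases (paired-end mode)
        ["java", "-jar", "trimmomatic-0.36.jar", "PE", r1, r2, p1, u1, p2, u2,
         f"ILLUMINACLIP:{adapters}:2:30:10", "SLIDINGWINDOW:4:15", "MINLEN:25"],
        # 2. Align the surviving read pairs to the reference index
        ["hisat2", "-x", index, "-1", p1, "-2", p2, "-S", sam],
        # 3. Sort and index the alignments
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # 4. Mark PCR and optical duplicates
        ["java", "-jar", "picard.jar", "MarkDuplicates",
         f"I={bam}", f"O={sample}.marked.bam", f"M={sample}.dup_metrics.txt"],
    ]


commands = pipeline_commands("patient01_biopsy")
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Read counting with GenomicAlignments and the DESeq2 model are R steps and are not covered by this sketch.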

Statistical analyses
Differences between groups (walnut or control) in fractions of individual fatty acids as determined by gas chromatography, or in clinical parameters, were assessed by t-test at a significance level of 0.05.
It was expected that there would be large interpatient heterogeneity, and thus that the baseline mRNA expression of individual genes would be highly variable between patients. The analyses of biopsy and surgical specimens allowed each patient to serve as her own control. Gene expressions for the sample collected at initial biopsy and the sample collected at surgery were determined, and the ratios of these expressions were calculated for each patient. Then the means of the ratios of expression were compared between the walnut-consuming group and the control group. The comparison was performed using DESeq2, which models the read count per gene using a negative binomial distribution and moderates the estimated expression changes to account for the dependence on overall read count [7]. Each patient was assigned a unique id within her treatment (walnut or control) group, and the statistical model in Equation (1) was passed to DESeq2:

extraction + treatment + patient's group + extraction:treatment    (1)

with extraction taking values "biopsy" or "surgery", and treatment taking values "walnut" or "control". Genes that were significant for the extraction:treatment interaction parameter at a Benjamini-Hochberg (B-H) controlled false discovery rate of 10% were considered to be differentially expressed. The corresponding moderated fold change computed by DESeq2 is an estimate of this parameter, and can thus be considered to be an estimate of the quantity:

$$\frac{g_{w,s} / g_{w,b}}{g_{c,s} / g_{c,b}} \tag{2}$$

In Equation (2), 'g' represents the expression level of gene 'g', the subscripts 'w' and 'c' represent samples in the walnut and control groups, respectively, and the subscripts 's' and 'b' represent samples from surgery and biopsy, respectively. Thus, the ratio of expression of gene 'g' from surgical specimens versus biopsy specimens of walnut patients was divided by the ratio of expression of gene 'g' from surgical specimens versus biopsy specimens of control patients.
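To see why the extraction:treatment interaction parameter estimates a ratio of ratios, consider a simplified log-linear version of the model that omits the patient terms. This simplification is an assumption made only for illustration; DESeq2 additionally estimates dispersions and moderates the fold changes. Let $x_E = 1$ for surgery (0 for biopsy), $x_T = 1$ for walnut (0 for control), and let $\mu$ denote the expected normalized expression of a gene, with the same subscripts as above:

```latex
\begin{aligned}
\log_2 \mu &= \beta_0 + \beta_E x_E + \beta_T x_T + \beta_{ET}\, x_E x_T \\[4pt]
\log_2 \mu_{c,b} &= \beta_0, &
\log_2 \mu_{c,s} &= \beta_0 + \beta_E, \\
\log_2 \mu_{w,b} &= \beta_0 + \beta_T, &
\log_2 \mu_{w,s} &= \beta_0 + \beta_E + \beta_T + \beta_{ET} \\[4pt]
\beta_{ET} &= \left(\log_2 \mu_{w,s} - \log_2 \mu_{w,b}\right)
            - \left(\log_2 \mu_{c,s} - \log_2 \mu_{c,b}\right)
  = \log_2 \frac{\mu_{w,s} / \mu_{w,b}}{\mu_{c,s} / \mu_{c,b}}
\end{aligned}
```

Under this simplified model, a gene with no walnut-specific change has $\beta_{ET} = 0$, which corresponds to a ratio of ratios equal to 1.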
These analyses determined whether, across the group, there were significant and consistent changes in the mRNA expression of specific genes due to walnut consumption and provide the input for subsequent Ingenuity Pathway analyses.
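The Benjamini-Hochberg step-up rule behind the 10% false discovery rate threshold can be sketched as follows. The p-values and the function name are hypothetical, and this is a plain illustration of the procedure rather than the DESeq2 implementation, which reports adjusted p-values and may apply additional filtering.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking which tests are significant.

    Step-up rule: find the largest rank k such that p_(k) <= (k / m) * fdr,
    then call the k smallest p-values significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical interaction-term p-values for five genes
pvals = [0.001, 0.04, 0.03, 0.5, 0.2]
flags = benjamini_hochberg(pvals, fdr=0.10)
print(flags)
```

Note that a p-value of 0.04 can pass here even though its threshold depends on rank: the rule compares each ordered p-value to a threshold that grows with its position in the sorted list.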

Ingenuity Pathway analyses (IPA)
The complex data resulting from RNA-Seq expression profiling require complex analyses. Data were analyzed by use of IPA [8]. Final downstream phenotypic effects are due to the balance of positive and negative influences on expression of genes in a pathway. The goal of the IPA Downstream Effects Analysis is to identify genes and the resulting functions that are expected to increase or decrease, given the observed gene expression changes in the experimental dataset [8]. Downstream Effects Analysis is based on expected causal effects between genes and functions; the expected causal effects are derived from the literature compiled in the Ingenuity® Knowledge Base [8]. The analysis examines genes in the dataset that are known to affect functions, compares the genes' direction of change to expectations derived from the literature, then issues a prediction for each function based on the direction of changes in the dataset [8]. To make predictions, IPA uses a z-score algorithm that is designed to reduce the chance that random data will generate significant predictions [8]. A publication further describing Downstream Effects Analyses can be found at [9].
The p-values for networks were calculated using a Fisher exact test with B-H multiple testing corrections. The networks were generated through the use of IPA [8]. The network score is based on the hypergeometric distribution and is calculated with the right-tailed Fisher's Exact Test with B-H multiple testing corrections. For example, for a network with a p-value of $1 \times 10^{-30}$, the network's score = $-\log_{10}$(Fisher's Exact Test result) = 30. Thus, a score of 30 can be interpreted as meaning that if there were no associations between walnut consumption and the gene expression changes seen in the network, an overlap between the network and the differentially expressed gene set would only occur 1 in $10^{30}$ times in similar experiments.
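The relationship between a right-tailed Fisher's exact test and the network score can be sketched as below. The gene counts and function names are hypothetical, multiple testing correction is omitted, and IPA's internal calculation may differ in detail; the sketch only illustrates the hypergeometric tail probability and the negative base-10 logarithm described above.

```python
from math import comb, log10


def right_tailed_fisher(overlap, network_size, de_genes, background):
    """P(X >= overlap) when network_size genes are drawn from a background
    containing de_genes differentially expressed genes (hypergeometric tail)."""
    total = comb(background, network_size)
    upper = min(network_size, de_genes)
    tail = sum(
        comb(de_genes, k) * comb(background - de_genes, network_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def network_score(p_value):
    """Score defined as the negative base-10 logarithm of the p-value."""
    return -log10(p_value)


# Hypothetical example: 20 of the 35 genes in a network are differentially
# expressed, out of 400 differentially expressed genes in a 20,000-gene background.
p = right_tailed_fisher(overlap=20, network_size=35, de_genes=400, background=20000)
print(p, network_score(p))
```

The larger the overlap relative to what chance would produce, the smaller the p-value and the higher the score.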
Files containing the raw reads, along with a shell script describing the complete data analysis pipeline, were deposited to the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) and can be obtained via accession number GSE111073. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111073.