A Single Gene Expression Set Derived from Artiﬁcial Intelligence Predicted the Prognosis of Several Lymphoma Subtypes; and High Immunohistochemical Expression of TNFAIP8 Associated with Poor Prognosis in Di ﬀ use Large B-Cell Lymphoma

: Objective: We have recently identiﬁed using multilayer perceptron analysis (artiﬁcial intelligence) a set of 25 genes with prognostic relevance in di ﬀ use large B-cell lymphoma (DLBCL), but the importance of this set in other hematological neoplasia remains unknown. Methods and Results: We tested this set of genes (i.e., ALDOB , ARHGAP19 , ARMH3 , ATF6B , CACNA1B, DIP2A, EMC9, ENO3, GGA3, KIF23, LPXN, MESD, METTL21A, POLR3H, RAB7A, RPS23, SERPINB8, SFTPC, SNN, SPACA9, SWSAP1, SZRD1, TNFAIP8, WDCP and ZSCAN12 ) in a large series of gene expression comprised of 2029 cases, selected from available databases, that included chronic lymphocytic leukemia (CLL, n = 308), mantle cell lymphoma (MCL, n = 92), follicular lymphoma (FL, n = 180), DLBCL ( n = 741), multiple myeloma (MM, n = 559) and acute myeloid leukemia (AML, n = 149). Using a risk-score formula we could predict the overall survival of the patients: the hazard-ratio of high-risk versus low-risk groups for all the cases was 3.2 and per disease subtype of several hematological neoplasia using a single gene-set derived from neural network analysis. High expression of TNFAIP8 is associated with poor prognosis of the patients in DLBCL. of the IPI. In DLBCL (Series 5) of the cell-of-origin molecular subtype and Eastern Cooperative Oncology Group (ECOG) performance status; and DLBCL of Series 6 and 9 from the molecular subtype as well.

that high protein expression of TNFAIP8 also quantified using artificial intelligence associated with a poor prognosis of the patients.

Gene Expression Analysis
Gene expression analysis was performed as we previously described [7]. For survival analysis the gene expression data was transformed to a prognostic index (also known as risk-score) to generate the risk-groups. Calculation was performed by multiplying the gene expression values with the estimated beta coefficients from the fitted Cox proportional hazards model. Of note, variables with positive coefficients (the B values) are associated with increased hazard and decreased survival times, i.e., as the predictor increases the hazard of the event increases and the predicted survival duration decreases. Negative coefficients indicate decreased hazard and increased survival times. Exp(B) is the ratio of hazard rates that are one unit apart on the predictor. After ranking the samples by their prognostic index, the samples were split into low-risk vs. high-risk groups. The risk-group splitting was also optimized using an algorithm that uses the inner-group p-value in order to identify the best cutoff for survival (i.e., lower p value). Then, conventional survival analysis was performed [7,[18][19][20]. Within the 25 genes set, the most relevant genes are those that are consistently differently expressed between high-risk and low-risk groups although their weight and direction of the association is also taken into account.

Statistical Analysis
The analysis was performed using R version 3.6.3 (2020-02-29) (http://cran.r-project.org) as well as SPSS software (IBM SPSS Statistics 25, Armonk, NY, USA). The conventional criteria for overall survival was used. Survival analysis was performed with Kaplan-Meier and log rank tests, and Cox regression, method (enter), contrast (indicator). Hazard-ratios/risks (HR) were determined using Cox regression analysis (exp(B) values).

Validation of TNFAIP8 in an Independent Series of DLBCL
We validated one of the most relevant markers with immunohistochemistry using diagnostic biopsies in an independent series of 97 patients with Diffuse Large B-cell Lymphoma who were diagnosed at Tokai University Hospital from the years 2004 to 2011. This study was approved by the institutional review board (IRB 14R-080) and conducted in accordance with the Helsinki Declaration of 1975 as revised in 2008.

Conventional and Artificial Intelligence-Based Digital Image Analysis
Slides were visualized in an optical microscope (Olympus BX63, Olympus K.K., Tokyo, Japan) and later digitalized using a digital slide scanner (NanoZoomer S360, Hamamatsu Photonics, Hamamatsu City, Japan). For conventional analysis the image was evaluated as an ordinal variable: 0 (no staining), 1+ (low), 2+ (intermediate) and 3+ (high positive). Artificial intelligence-based digital image analysis was performed using Fiji software [(Fiji Is Just) ImageJ 2.0.0-rc-69/1.52p/Java 1.8.0_172 (64-bit)]. The artificial intelligence-based image analysis method quantified the marker based on the Waikato Environment for Knowledge Analysis [Weka, version 3.9.3, The University of Waikako Hamilton, New Zealand; with Java (TM) SE Runtime Environment, version 1.8.0_172-b11, Oracle Corporation, Redwood City, California, United States]. The raw immunohistochemical image, which corresponded with same area previously evaluated in the conventional analysis, was loaded into the analysis software and directly analyzed without type change. For the training input three types of pixels were selected: Class 1 (positive staining, DAB), Class 2 (negative areas) and Class 3 (absence of cellularity). The segmentation settings included as training features the Gaussian blur, hessian, membrane projections, Sobel filter and difference of Gaussians. The membrane thickness was set at value 1, membrane patch size at 19, minimum sigma at 1.0 and maximum sigma at 16.0. The training of the classifier included fast random forest. The classifying of the whole image used all available CPU threads. Finally, the segmentation of the whole image was performed, and each class area was inked in a different color and quantified using thresholds with Fiji software.

Survival Analysis According to the Risk-Score Based on the Set of 25 Genes
The set of 25 genes, previously identified in the MLP, were analyzed for prognosis in the series of 2029 lymphoma cases. The analysis consisted on multivariate Cox regression analysis and the calculation of the risk-score. The risk-score values in each series were used for finding the most significant cut-off for overall survival and log-rank test (i.e., "maximized risk-groups"). The hazard-risks were calculated and the different gene expression between the two risk-groups were also tested. The analysis was made in each database independently as the gene expression was performed in different experimental conditions (i.e., different array platforms). In each lymphoma subtype, a final survival plot was created by merging the different risk-score groups (Figure 1).
In the Table 2 it is detailed the overall survivals data of the different subtypes according to the risk-groups obtained from the gene expression data. It includes the number of cases per group, the 5-year OS and the p value of the log-rank test, the hazard-risk/ratio and its p value. The risk-score based on the 25 genes managed find two risk groups in all the hematological subtypes with p values long below to 0.001 (the p values ranged from 6.58 × 10 −8 to 3.92 × 10 −42 ). The hazard-risks in average were 3.2, with a range from 3.0 to 5.2, being the differences more important in MCL and MM subtypes.
Correlation with several clinicopathological characteristics was available in some of the series. In MCL (Database Series 3), the prognostic relevance of the risk-groups was kept independently of the cyclin dependent kinase inhibitor 2A (CDKN2A, also known as INK/ARF), ATM serine/threonine kinase (ATM) and tumor protein p53 (TP53) deletion status. In case of FL (Series 4), the risk-groups were independent of the IPI. In DLBCL (Series 5) of the cell-of-origin molecular subtype and Eastern Cooperative Oncology Group (ECOG) performance status; and DLBCL of Series 6 and 9 from the molecular subtype as well.

Gene Contribution to the Prognostic Model
The risk-groups are set up based on the risk-scores that are calculated with the beta values of the multivariate COX regression and the gene expression values (Figure 2). All the markers contribute to the final value of the risk-score as well as to the risk-group. Therefore, both beta and gene expression values are informative of which genes are the most relevant for the model and the direction of the association. In general, in our data, the most significant genes usually associated in the direction of high expression with poor prognosis but in some genes the association was inverted.
In the MCL (Series 3), the array had only 5 of the genes of the 25 gene series. But with only 5 genes, it was possible to define the two risk-groups: high gene expression of ARHGAP19, SZRD1 and KIF23 associated to the high-risk group while high ALDOB to the low-risk group. In case of FL (Series 4), high expression of SPACA9 and CACNA1B associated to the high-risk group while high ARHGAP19, SNN, TNFAIP8 and EMC9 was associated to the low-risk group. In DLBCL (Series 5), high expression of SWSAP1, WDCP, CACNA1B, TNFAIP8, POLR3H, ENO3, SERPINB8, SZRD1, KIF23, GGA3, METIL21A associated to the high-risk group and high ZSCAN12 and LPXN to the low-risk group. In case of MM (Series 10) and AML (Series 11) most of the genes were represented in the high or low-risk profile. Please refer to the tables (e.g., Table 3) and figures for detailed information of all the neoplastic subtypes and database series. Of note, high ENO3 associates to the high-risk group in several subtypes including CLL, DLBCL, MM and AML.  Table 3. Correlation between high gene expression of a marker with the risk-groups in the different subtypes: 1, high-risk group (red color); −1, low-risk group  Table 3. Correlation between high gene expression of a marker with the risk-groups in the different subtypes: 1, high-risk group (red color); −1, low-risk group (green); 0, no risk-group association (white); genes that are not present in the array (grey). (C) Radial plots of the most relevant subtypes. Black color lines correspond to the p values (1−p calculation) of each gene in the multivariate Cox regression analysis. Grey colors correspond to the p values (1−p) of the differential gene expression (DGE) between high-risk vs. low-risk groups (grey color). Values >0.95 correspond to p < 0.05. Note: risk-scores are calculated by multiplying the beta values with the gene expression. The p values are useful to identify the most relevant markers, but all markers contribute to the risk-score calculation. (D) Differential gene expression between high-risk and low-risk group in three representative subtypes (Subtype 1 of CLL, FL 4 and MM 10). HR, high-risk group; LR, low-risk group; NC, no correlation; "-", the gene is not available in the gene expression array of the series. The asterisk "*" indicates that the p value is in the limit of significance (0.1 < p value >0.05). The underlined genes are the most relevant. The percentage (%) indicates the percentage of the series with significant association with the high or low-risk groups. Of note, all the genes contributed to the risk score and the risk score groups generation (i.e., to the overall survival of the patients), but their weight and direction of the association was variable.

Immunohistochemical Expression of TNFAIP8 in DLBCL
The characteristics of the subjects of the validation set of 97 cases from Tokai University Hospital are present in the Table 4. The age ranged from 14 to 97 years old with a median of 68 years, and 54 were men (55.7%). According to the cell of origin classification by the Hans' classifier 33% were GCB and 67% non-GCB. Ninety-four percent of the cases had received RCHOP or RCHOP-like therapy. Patients that had an unfavorable prognosis associated with age >60 years, high LDH, high sIL2RA, ECOG performance status ≥2, clinical stage III-IV, extranodal sites >1, higher IPI score, a non-GCB molecular subtype, high RGS1 expression, positive BCL2 expression and positivity for Epstein-Barr virus (EBER+). The alive/dead ratio was 0.73. The expression of TNFAIP8 ranged from 3.2% to 87.9%, with a median of 38.1% and a mean of 41.2% ± 25.3 STD. The TNFAIP8 staining was also evaluated by the pathologist (J.C.) under the microscope using a conventional ordinal scale approach (0, 1+, 2+ and 3+). Both quantifications had a good correlation (Spearman's rho correlation 0.887, p = 1.0088 × 10 −33 ). The TNFAIP8 values from the A.I.-based quantification were ranked and the most adequate cutoff point for overall survival was calculated (≥19%). Patients with high TNFAIP8 expression had a 250% more risk of dying than the patients with low expression (hazard risk = 3.5, 95% CI 1.244-9.849). The 5-year overall survival of the patients, high vs. low TNFAIP8, was 53% (95% CI 63.8-41.2%) vs. 85% (95% CI 100.7-70.1%), respectively [log rank (Mantel-Cox) p = 0.011]. The 10-years overall survival of the patients, high vs. low TNFAIP8, was 42% (95%CI 55.5-28.1%) vs. 73% (95%CI 98.9-47.5%), respectively [Log Rank (Mantel-Cox) p = 0.011]. TNFAIP8 expression was also correlated with several clinicopathological characteristics. High TNFAIP8 correlated with age >60 years, high serum IL2RA, non-GCB phenotype and high infiltration of CD163+ M2-like tumor associated macrophages (CD163+TAMs). TNFAIP8 also moderately correlated with MYC (Spearman's correlation coefficient 0.389, p = 0.009) and Ki67 (proliferation index; Spearman's correlation coefficient 0.48, p = 0.001). High TNFAIP8 also associated (trend) a worse progression free survival (p = 0.052). Finally, a multivariate COX analysis between TNFAIP8 (high vs. low) and IPI (low+low/intermediate vs. high/intermediate+high) show that only TNFAIP8 kept the prognostic value (HR = 3.5, p = 0.040). All the information of this paragraph is present in the Tables 4 and 5, and the Figure 3.   (C) The expression of TNFAIP3 correlated with several clinicopathological characteristics. A moderate but significant correlation was also found with MYC and the proliferation index assessed with the Ki67 marker.

Discussion
Neural networks are a computer architecture, implementable in either hardware or software, modeled after biological neural networks. Like the biological system, the computerized neural networks that are known as perceptrons consist of neuron-like units. In our previous project, we focused on DLBCL and we used a multilayer perceptron approach with a hidden layer with 12 units that allowed us to identify the top 25 genes associated to the prognosis of the DLBCL patients [7,21]. Next, in this research, we aimed to know if that set of 25 genes also had predictive value in other hematological neoplasia. Therefore, we searched for available gene expression databases that we could easily use for the analysis. We selected 11 gene expression series comprising CLL, MCL, FL, DLBCL, MM and AML that in total included 2029 cases from Western countries. The follow-up of this series of 2029 cases ranged from 0 to 26 years and had a 1, 3 and 5-year OS of 84%, 72% and 62%, respectively. This survival is the standard of a general lymphoma series. Each hematological neoplasia had a different survival that was also in concordance with other series and as described in the WHO Classification of Tumors of Hematopoietic and Lymphoid Tissues. [1] Therefore, we expected that the results of this analysis would be applicable to patients of hematological neoplasia (mainly lymphoma subtypes) from Western countries.
A detailed description of the biological function of each of the 25 genes is present in our previous publication, [7] so refer to that publication for the thorough information. To the best of our knowledge, the prognostic relevance in lymphoma of these 25 markers have not been reported yet. In general, these markers are related to the following terms: ribosome, protein binding, RNA polymerase activity, transferase activity and enzyme binding. According to the KEGG pathways, the most relevant were ribosome, RNA polymerase, EBV infection, glycolysis and pyrimidine metabolism. [7] A functional network association analysis did not find significant straightforward association between these markers. [7] Therefore, these markers seem to be independent between them and do not belong to a common pathway.
All markers contribute to the risk-score (i.e., prognostic score) but they differ in their weight and in the direction of the association, which will be different in each series. Nevertheless, some markers had a conspicuous role in the prognosis of the patients in across several subtypes: ENO3, KIF23, EMC9, ARHGAP19 and TNFAIP8 were overexpressed in the high-risk or low-risk groups in several subtypes. ENO3 protein is the beta-enolase, with main function in striated muscle development and regeneration. This protein is involved in Step 4 of the sub pathway that synthesizes pyruvate from D-glyceraldehyde 3-phosphate. [22] Using bioinformatics analysis, ENO3 had been highlighted in the expression profile of hepatocellular carcinomas. [23] In our series, ENO3 had a significant role in CLL, DLBCL, MM and AML. KIF23 is a component of the centralspindlin complex that serves as a microtubule-dependent and Rho-mediated signaling required for the myosin contractile ring formation during the cell cycle cytokinesis. KIF23 is essential for cytokinesis in Rho-mediated signaling. [22] Therefore, KIF23 has a role in mitotic cytokinesis. In addition, according to Reactome, KIF23 is associated to the antigen processing and presentation of exogenous peptide antigen via MHC class II, [24] which is a function of B lymphocytes. In gastric cancer, KIF23 promotes cancer by stimulating cell proliferation. [25] KIF23 had a role in MCL, DLBCL, MM and AML. EMC9 is also known as ER Membrane Protein Complex Subunit 9 but not additional information about this protein is known. The relationship of EMC9 with lymphoid neoplasia is not reported as well. In our research, EMC9 had a role in FL, DLBCL and MM. ARHGAP19 is a signal transductor located in the nucleus that has GTPase activator activity. ARHGAP19 is predominantly expressed in hematopoietic cells and has an essential role in the division of T lymphocytes. Overexpression of ARHGAP19 in lymphocytes delays cell elongation and cytokinesis. [26] ARHGAP19 had a role in MCL, FL, DLBCL and MM. TNFAIP8 acts as a negative mediation of apoptosis and may play a role in tumor progression. Polymorphisms are related to the risk of non-Hodgkin's lymphoma [27] and it has been previously identified in DLBCL [28]. TNFAIP8 had a role in FL, DLBCL and MM. ATF6B is a transcriptional factor that acts in the unfolded protein response (UPR) pathway by activating UPR target genes induced during ER stress. [22] To date, no manuscript has related ATF6B with lymphoma.
TNFAIP8 was one of the most relevant markers that we had identified by A.I. By gene expression, TNFAIP8 associated to a poor prognosis of the patients. Among the different lymphoid neoplasia subtypes, TNFAIP8 was especially relevant in DLBCL. Therefore, we decided to validate the prognostic value of TNFAIP8 in an independent DLBCL series from Tokai University Hospital. This series of Tokai is a conventional series of DLBCL as shown in Table 4. Therefore, our results should be reproducible by other researchers. Ninety-seven patients with de novo DLBCL were selected and the diagnostic biopsies were analyzed for protein expression of TNFAIP8 using immunohistochemistry. The expression of TNFAIP8 was evaluated using a conventional approach as an ordinal variable (0, 1+, 2+ and 3+) and with a novel digital image quantification method based on A.I. segmentation. Of note, good correlation between both methods was found. High protein expression of TNFAIP8 correlated with a poor prognosis of the patients. Therefore, we have confirmed the prognostic value of TNFAIP8 in DLBCL and we have shown how A.I. can be applied in the histopathological evaluations. TNFAIP8 acts as a negative mediator of apoptosis and may play a role in tumor progression. TNFAIP8 suppresses the TNF-mediated apoptosis by inhibiting caspase-8 activity but not the processing of procaspase-8, subsequently resulting in inhibition of BID cleavage and caspase-3 activation [29]. Interestingly, we also found correlation of TNFAIP3 with other clinicopathological characteristics that usually associate to poor prognosis of DLBCL such as age > 60 years, high sILRA, non-GCB subtype and high CD163+macrophages. The reason of these associations will need further investigation.
In conclusion, we have confirmed that it is possible to predict the prognosis of the patients with hematological neoplasia (mainly B-cell non-Hodgkin lymphomas) with a risk-score formula based on the gene expression of 25 genes, which were previously identified using artificial intelligence approach in a series of DLBCL. In addition, we have found that high protein expression of TNFAIP8 is associated to poor prognosis in DLBCL patients.