Age Dependent Analysis of Colon Cancer Tumours Using Mathematical and Statistical Modelling

Colon cancer is the third most commonly diagnosed cancer and the second leading cause of cancer death in men and women combined in the United States. In this work, we performed mathematical and statistical modelling of Tumour sizes as a function of age for four different races. Mathematically, based on the behaviour of the data for each race, we partitioned ages of subjects into several intervals. The mathematical function that characterizes the size of the Tumour as a function of age was determined for each age interval. Statistically, using quantile regression, we designed models that are more robust at specific quantiles using Tumour size and age as dependent and predictor variables.


Introduction
Cancer is a disease characterized by the unchecked division and survival of abnormal cells. When this type of abnormal growth occurs in the colon or rectum, it is called colorectal cancer (CRC). The colon and rectum together make up the large intestine, or large bowel. The colon has four sections. The ascending colon starts with the cecum (a pouch that receives undigested food from the small intestine) and extends upward on the right side of the abdomen. The transverse colon is so named because it crosses the body from the right to the left side. The ascending and transverse colon are collectively referred to as the proximal colon. The descending colon descends on the left side. The sigmoid colon, named for its "S" shape, is the final part of the colon and joins the rectum. The descending and sigmoid colon are collectively known as the distal colon (Jacobs et al., 2018).
Colon cancer is the third most commonly diagnosed cancer and the second leading cause of cancer death in men and women combined in the United States (Bonsu, 2013). The risk of CRC increases with age. The median age at diagnosis for colon cancer is 68 in men and 72 in women; for rectal cancer, it is 63 years in both men and women (Favoriti et al., 2016). As a result of rising CRC incidence in younger age groups coinciding with declining rates in older age groups, the percentage of cases diagnosed in people younger than 50 increased from 6% in 1990 to 11% in 2013 (Few and Edge, 2008). Most of these cases (72%) occur in individuals in their 40s. CRC incidence and mortality rates are highest in non-Hispanic African Americans (NHBs) and lowest in Asians/Pacific Islanders (APIs). During 2009-2013, CRC incidence rates in African Americans were about 20% higher than those in non-Hispanic Caucasians (NHWs) and 50% higher than those in APIs. The disparity in mortality is twice that in incidence; CRC death rates in African Americans are 40% higher than in NHWs and double those in APIs (Augustus and Ellis, 2018). The reasons for racial/ethnic disparities in CRC are difficult to study, but there are noticeable differences in the treatment offered to patients based on their socioeconomic status. According to the US Census Bureau, 24% of African Americans lived in poverty in 2015, compared with 11% of Asians and 9% of NHWs (Proctor et al., 2016). In this paper, we modelled Tumour sizes as a function of age for four races by applying both mathematical and statistical techniques. Mathematical and statistical models have proven helpful in establishing clear relationships between systems and stages in cancer (Anderson and Quaranta, 2008; Byrne, 2010).
The strength of statistical and mathematical models lies in their capacity to reveal previously obscure or counterintuitive principles that a purely qualitative approach to science can overlook (Altrock et al., 2015). Paterson et al. (2020) developed a stochastic model to estimate the order of acquisition of the driver mutations that lead to colorectal Tumours and the probability of colorectal malignant Tumours. A univariate analysis by Pages et al. (2005) of 959 specimens of resected colorectal cancer showed that the presence or absence of histologic signs of early metastatic invasion was associated with significant differences in disease-free and overall survival rates. DePillis et al. (2013) developed a model of the relationship between colon cancer growth and treatment using a system of nonlinear ordinary differential equations (ODEs).
Bergin et al. (2018) applied a quantile regression approach to rural and urban colorectal and breast cancer data and found that subjects from rural areas with colorectal cancer had longer intervals at the 50th, 75th and 90th percentiles. Researchers, policymakers, and clinicians all use quantile regression. In studies of healthcare expenditures, this statistical method is used to interpret observations from different parts of the outcome's distribution, across different quantiles (Olsen et al., 2012).
Many studies applying mathematical and statistical modelling to various cancers have been published, but very few address age dependent analysis of colon cancer using mathematical modelling and quantile regression. In the first part of this paper, we employ a line graph approach and fit a regression line by ordinary least squares to obtain a model for predicting average Tumour growth as a function of age. This is a preliminary analysis that can inform future research. In the second part, we employ a quantile regression approach to model Tumour sizes as a function of age. This approach supports a more granular analysis of the data, mainly at the extreme quantiles. Estimates from these regression models can be used to produce forecasts.

Data
The data for this research were obtained from the SEER (Surveillance, Epidemiology, and End Results Program) database registry (Ratnapradipa et al., 2017), which is a unique, reliable and essential resource for investigating various cancers.
Colon cancer data from the SEER database from 2004 to 2015 are used in this work (Howlader et al., 2016). We pre-processed the records of malignant colon Tumours to remove redundancies and missing data. The resulting data set had 30,251 records. We analyse the rate of change of Tumour growth separately for the four races: Caucasians (88.9%), African Americans (10.4%), American Indians (0.3%) and others (0.4%). Male subjects account for 49.7% of the data and female subjects for 50.3%. The mean (standard error) of the age of the subjects and of the Tumour size for the entire data set are 67.74 (0.08) years and 48.14 (0.2) mm, respectively. A detailed description of the data can be found elsewhere (Bhargavi et al., 2020).

Line Plots
The fundamental idea of visual data exploration is to present the data in a visual form that allows the researcher to gain insight into the data and then draw conclusions. Descriptive data mining techniques have proven to be of high value in exploratory data analysis, and they also have great potential for exploring large databases. Visual representation is most beneficial when very little is known about the data and when no clearly suitable technique is available to explore them. It normally allows faster exploration and often provides better results, especially in cases where automatic algorithms fail (Keim, 2002). A line graph is an exploratory data analysis technique that presents a visual display of the relation between two continuous variables. Line graphs are most effective when working with ordered data such as time series or distance data, and they are useful for identifying patterns such as trends and seasonal effects. Connecting values with a line along time gives true insight into the data if (1) the intervals are the same size, (2) the intervals are in the right order, and (3) values are recorded at all periods (Koenker and Bassett, 1978). Because missing values represent a discontinuity in the information, we processed our initial data to remove any missing values. In this work, the x-axis of a line graph represents the age of the subjects and the y-axis represents the average Tumour size. For each pair of age and Tumour size values, a point is marked in the coordinate system. The line graph is then obtained by fitting a line to these points, usually by the method of ordinary least squares (OLS) regression, which may be useful for prediction purposes. The line graph provides a clear display of trends, such as the growth of Tumour sizes along the age axis.
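As a minimal sketch of the line-plot construction described above, the following Python snippet averages Tumour size at each age and fits an OLS line through the averaged points. The records and the helper names `average_by_age` and `ols_line` are hypothetical and serve only as illustration; they are neither the SEER data nor the code used in this study.

```python
from collections import defaultdict


def average_by_age(records):
    """Return sorted (age, mean tumour size) pairs from (age, size) records."""
    sizes_by_age = defaultdict(list)
    for age, size in records:
        sizes_by_age[age].append(size)
    return sorted((age, sum(v) / len(v)) for age, v in sizes_by_age.items())


def ols_line(points):
    """Closed-form OLS fit y = intercept + slope * x for a single predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (age in years, tumour size in mm) records, not SEER data.
records = [(40, 50.0), (40, 54.0), (50, 49.0), (50, 51.0), (60, 47.0), (60, 49.0)]
points = average_by_age(records)       # one averaged point per age
intercept, slope = ols_line(points)    # line through the averaged points
print(points, intercept, slope)
```

Averaging first means that every age contributes one point to the fit regardless of how many subjects share that age, which is one possible reading of "average Tumour size as a function of age".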

Quantile Regression
Quantile regression offers greater flexibility when analysing data. This method is helpful when the relationship between the predictors and a dependent variable differs across the distribution of the dependent variable, that is, when distinct relationships hold at different parts of that distribution (Koenker and Kevin, 2001; Le Cook and Manning, 2013). For instance, quantile regression can be used to determine how many days a treatment should be administered for 90% of subjects to recover (out of 100 subjects with similar conditions), a question in which the mean is of no interest.
Quantile regression helps in understanding relationships between variables beyond the mean of the data; it is useful for outcomes that are not normally distributed and that have nonlinear relationships with the independent variables (Le Cook et al., 2013; Hong et al., 2019). In quantile regression at the median, we estimate the coefficients by minimizing the sum of absolute deviations. Compared with the mean, the median is a resistant measure of central tendency that is not greatly affected by outliers. There are two main groups of quantile regression approaches: (a) minimization of weighted absolute deviations, an inferential method in which the conditional median and a range of other quantile functions are estimated by minimizing asymmetrically weighted absolute residuals, and (b) maximization of a Laplace likelihood.
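Let $y_r$ and $x_r$ ($r = 1, 2, \ldots, n$) denote the outcome of interest and the predictor variables, respectively. The quantile regression model at the $\tau$th quantile for the response $y_r$ given $x_r$ is given by
$$Q_\tau(y_r \mid x_r) = x_r^{\top}\beta_\tau, \qquad (1)$$
and the coefficient vector $\beta_\tau$ is estimated by
$$\hat{\beta}_\tau = \arg\min_{\beta} \sum_{r=1}^{n} \rho_\tau\left(y_r - x_r^{\top}\beta\right), \qquad (2)$$
where $\rho_\tau(\cdot)$ is the check function defined by $\rho_\tau(u) = (\tau - I(u<0))\,u$ and $I(\cdot)$ denotes the indicator function (Huang et al., 2017).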
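To make the check function concrete, the sketch below evaluates (τ − I(u<0))u and shows that, with no predictors, minimizing the total check loss over a constant recovers a sample quantile. The function names and the Tumour sizes are hypothetical illustrations, not part of the study.

```python
def check_loss(u, tau):
    """Check function: (tau - I(u < 0)) * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u


def total_check_loss(values, candidate, tau):
    """Sum of check losses of the residuals value - candidate."""
    return sum(check_loss(v - candidate, tau) for v in values)


# Hypothetical tumour sizes in mm, not SEER data.
sizes = [12.0, 20.0, 25.0, 31.0, 40.0, 48.0, 55.0, 63.0, 90.0]

# Without predictors, a minimizer can always be found among the observed
# values, so it is enough to try each observation as the candidate.
best = {
    tau: min(sizes, key=lambda c: total_check_loss(sizes, c, tau))
    for tau in (0.1, 0.5, 0.9)
}
print(best)
```

Positive residuals are weighted by τ and negative residuals by 1 − τ, so a high τ penalizes under-prediction more heavily and pushes the fitted value towards the upper tail; at τ = 0.5 the two weights are equal and the minimizer is the median.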

Methodology
In this paper, we investigate the socio-demographic effects on the growth of cancer Tumours. Our response variable is Tumour size. The covariates include the age, gender, and race of the cancer patients. The gender variable was not significant: a two-sample t-test showed no statistically significant difference between the Tumour sizes of males and females in the data. Further, we divided the data into four races: Caucasians (R1), African Americans (R2), Asian Indians (R3) and other races (R4). We present relevant statistical models for each of these races.
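The two-sample comparison mentioned above can be sketched as follows. The text does not state which variant of the t-test was used, so this example assumes Welch's unequal-variance statistic and, as a further simplification, a normal approximation for the p-value, which is reasonable only for large samples. The function name and both samples are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic with a large-sample two-sided p-value."""
    se = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t = (mean(sample_a) - mean(sample_b)) / se
    # Normal approximation to the t distribution (assumption: large samples).
    p_value = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p_value


# Hypothetical tumour sizes in mm for two groups, not SEER data.
male_sizes = [40.0, 45.0, 50.0, 55.0, 60.0]
female_sizes = [41.0, 46.0, 51.0, 56.0, 61.0]
t_stat, p_value = welch_t(male_sizes, female_sizes)
print(t_stat, p_value)
```

A large p-value, as in this toy example, gives no evidence of a difference in mean Tumour size between the two groups.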
Line plots for each race were obtained using ordinary least squares (OLS) regression, with age as the independent variable, x, and Tumour size as the dependent variable, y. Using a point process approach and visual inspection of the scatter plots, we further divided each sample of the data into age intervals.
The average rate of change of the Tumour size, A(.), over the interval [a, b] is calculated as A(.) = ∆y/∆x = [y(b)-y(a)]/(b-a). The instantaneous rate of change, y'(.), is calculated by evaluating the first-order derivative of the best-fit model at the median of the age interval.
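The two rates can be illustrated with a short sketch for a polynomial best-fit model. The coefficients below are hypothetical and are not taken from Tables 1-4, and the midpoint of the interval is used in place of the median age, which assumes that ages are evenly spread over the interval.

```python
def poly_value(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the first derivative of the polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def average_rate(coeffs, a, b):
    """Average rate of change [y(b) - y(a)] / (b - a) over [a, b]."""
    return (poly_value(coeffs, b) - poly_value(coeffs, a)) / (b - a)


def instantaneous_rate(coeffs, x):
    """Instantaneous rate of change y'(x)."""
    return poly_value(poly_derivative(coeffs), x)


# Hypothetical quadratic fit y = 30 + 2x - 0.03x^2 on ages 17-35.
coeffs = [30.0, 2.0, -0.03]
a, b = 17, 35
median_age = (a + b) / 2  # assumption: midpoint stands in for the median age
avg = average_rate(coeffs, a, b)
inst = instantaneous_rate(coeffs, median_age)
print(avg, inst)
```

For a quadratic model the two rates coincide at the midpoint of the interval; for cubic or exponential fits they generally differ, which is why both quantities are reported.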
Common regression models such as ordinary least squares and logistic regression often assume that the regression coefficients are constant across the population. Methods such as linear regression estimate the mean of a dependent variable conditional on the values of the independent variables, whereas quantile regression estimates the conditional median and other quantiles of the dependent variable. The most common quantiles are the quartiles Q1 (25th percentile), the median (50th percentile), and Q3 (75th percentile): 25% of observations fall below Q1, 50% fall below the median, and 75% fall below Q3. Quantile methods allow researchers to rely less on common regression slope assumptions, which in turn enables them to describe variations in the distribution of a dependent variable (Le Cook, 2013). Unlike ordinary least squares regression, quantile regression does not make any assumptions about the distribution of the dependent variable. To obtain percentile-based rates of growth of the Tumours, we fitted quantile regression equations for each race. Figure 1 depicts the flowchart of this research work.

Line Plots for Each Race in the Data
For each race under consideration, using age as the independent variable and average Tumour size as the dependent variable, several mathematical forms, including logarithmic, linear, quadratic, cubic and exponential, were fitted separately. The model with the highest coefficient of determination, R², among these is chosen as the best-fit model for finding the rate of change of Tumour size with age. Initially, we drew a scatter plot of average Tumour size as a function of age for each race. The trend for every race is not stationary along the age axis, so analysing data of this form with a single model is not adequate; a model that can handle data of this kind is needed. Thus, we partitioned the ages into groups using a temporal point process (Daley and Jones, 2003), while observing changes in the shape of the data and accounting for the quality of the model. Using this process, we arrived at age partitions for all the races.
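The selection of a best-fit form by R² can be sketched as below, assuming NumPy is available. The data are synthetic values generated from a cubic trend, the function names are hypothetical, and fitting the exponential form on the log scale is one common choice rather than a procedure confirmed by the text.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_candidates(age, size):
    """Fit each candidate form by least squares and return {name: R^2}."""
    scores = {}
    for name, degree in (("linear", 1), ("quadratic", 2), ("cubic", 3)):
        coeffs = np.polyfit(age, size, degree)
        scores[name] = r_squared(size, np.polyval(coeffs, age))
    # Logarithmic form: size = a + b * ln(age)
    b, a = np.polyfit(np.log(age), size, 1)
    scores["logarithmic"] = r_squared(size, a + b * np.log(age))
    # Exponential form: size = a * exp(b * age), fitted on the log scale,
    # which requires size > 0.
    b, log_a = np.polyfit(age, np.log(size), 1)
    scores["exponential"] = r_squared(size, np.exp(log_a + b * age))
    return scores


# Synthetic average tumour sizes (mm) following a cubic trend; not SEER data.
age = np.arange(17.0, 36.0)
t = age - 26.0
size = 45.0 + 1.2 * t - 0.05 * t**2 - 0.01 * t**3
scores = fit_candidates(age, size)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the linear, quadratic and cubic forms are nested, the cubic fit can never have a lower R² than the other two, so R² alone tends to favour the most flexible polynomial. This is one reason to complement it with a check of the residuals.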

Line Plot for Average Tumour Size as a Function of Age for Caucasians
To fit the best line plot, we generated a scatter plot of average Tumour size as a function of age for all Caucasians (Figure 2). When fitting the line plot to these data, we noticed that the trend is not stationary. To address this, we partitioned the ages into four groups: 17-35 years, 36-60 years, 61-90 years, and 91-106 years. Scatter plots for these age intervals are shown in Figure 3. For each interval, linear, exponential, quadratic, and cubic models, among others, were fitted to the observed data, and a residual analysis was performed. For the age interval 17-35, a cubic model appears appropriate, with an average residual of -0.0032. Repeating the same process for the remaining age intervals and for all the races, we identified a good-fit model for each interval, with a relatively small residual mean. Table 1 lists the best-fit regression equations for the age intervals of Caucasians, along with the average rate of change and the instantaneous rate of change of Tumour size. The average rate of change of Tumour size is calculated as A(.) = [y(b)-y(a)]/(b-a) for each age interval [a, b], and the instantaneous rate of change, y'(median), is calculated at the median of the age group. A(.) is positive for the age group 17-35 and negative for the remaining groups. Tumour size increases at an average of 1.3 mm per year in the younger subject group between 17 and 35 years.

Line Plot for Average Tumour Size as a Function of Age for African Americans
From Figure 4, the age intervals we obtained for African Americans are 18-37 years, 38-82 years, and 83-101 years. Using the scatter plots for these age intervals, shown in Figure 5, line plots were fitted to understand the trend in each age group. Table 2 lists the best-fit regression equations along with the average rate of change and instantaneous rate of change of Tumour size for African Americans. A(.) is positive for the age groups 18-37 and 83-101 and negative for the age group 38-82. Tumour size in the younger group between 18 and 37 years and in the older group aged 83 and above increases at an average of 1.02 mm and 1.62 mm per year, respectively.

Line Plot for Average Tumour Size as a Function of Age for Asian Indians
For the Asian Indians data, using the scatter plot shown in Figure 6, the ages were partitioned into 27-41 years, 42-62 years, and 63-89 years. After identifying the best-fit line plots for these age intervals, we observe that A(.) is positive for the age groups 27-41 and 63-89 and negative for the age group 42-62. Tumour size in the subject group between 27 and 41 years increases at an average of 2.14 mm per year, which is a relatively high rate of change among the four races considered in this study. Table 3 lists the best-fit regression equations along with the average rate of change and instantaneous rate of change of Tumour size for Asian Indians.

Line Plot for Average Tumour Size as a Function of Age for Other Races
Using the scatter plot of other races given in Figure 8, the ages were partitioned into 15-42 years, 43-74 years, and 75-87 years for fitting best-fit line plots. The scatter plots used to identify the best-fit line plots for these age groups are shown in Figure 9. Table 4 lists the best-fit regression equations along with the average rate of change and instantaneous rate of change of Tumour size for other races. A(.) is negative for all age groups, indicating that average Tumour size decreases with age in every age interval for the other races data.

Quantile Regression Approach
Homoscedasticity is one of the important assumptions of ordinary least squares (OLS) regression: the variance of the residuals is expected to be constant throughout. If the residual variance is not constant, for example if the residuals fan out along the fitted values, we have a case of heteroscedasticity. In our data, we find that the distribution of the dependent variable is not normal. Figure 10 shows that, moving from left to right, the spread of the residuals appears to shrink at the very low and very high predicted values of Tumour size, with more than a few outliers. We therefore conclude that there is a pattern in the variance of the residuals, which violates the homoscedasticity assumption of an OLS model. We further performed the Breusch-Pagan test for heteroscedasticity, which gave a test statistic of 162.208 with a p-value reported as zero, indicating that homoscedasticity does not hold. With this support, we proceeded to perform quantile regression with Tumour size as the dependent variable and age and race as independent variables.
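The idea behind the Breusch-Pagan test can be sketched as follows. This example assumes the studentized form of the statistic, LM = n·R² from an auxiliary regression of the squared OLS residuals on the predictor, and it uses a single predictor (age) so that the reference distribution is chi-square with one degree of freedom. The study itself used age and race, and the exact variant applied there is not stated. The data are simulated and the function names are hypothetical.

```python
import math
import random


def simple_ols(x, y):
    """Return (intercept, slope, R^2) of the OLS fit of y on one predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, (sxy * sxy) / (sxx * syy)


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan statistic and p-value for one predictor."""
    intercept, slope, _ = simple_ols(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, aux_r2 = simple_ols(x, squared_resid)  # auxiliary regression
    lm = len(x) * aux_r2
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value


# Simulated data whose noise spread grows with age; not SEER data.
rng = random.Random(0)
age = [rng.uniform(20, 90) for _ in range(500)]
size = [40 + 0.1 * a + rng.gauss(0, 0.2 * a) for a in age]
lm, p_value = breusch_pagan(age, size)
print(lm, p_value)
```

A small p-value means that the squared residuals vary systematically with the predictor, which is the pattern that motivates moving from OLS to quantile regression.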
Because linear regression does not explain the effect of a predictor variable such as age on the lower and higher Tumour sizes, we use quantile regression to obtain better estimates of the effect of age even at the lower and higher quantiles. We also notice that the data in all age intervals for all the races are either negatively or positively skewed. Quantile regression can handle such behaviour of the data. Quantile regression is semiparametric and is more robust to outliers and extreme observations than ordinary least squares regression. Using quantile regression, we obtained models for the conditional quantiles of Tumour size given age, $Q_\tau(\text{Tumour} \mid \text{age})$, $\tau \in (0, 1)$. Quantile regression models were developed for quantiles ranging from 0.05 to 0.97. The age parameter did not show any significant effect for the quantiles above 0.3 and below 0.9. This means that the conditional quantile at any level in that range cannot be estimated precisely for predicting Tumour size as a function of age. Therefore, we focused on the lower quantile levels of 0.3 and below and the higher quantile levels above 0.9. Note that the 0.3 quantile and below represents the age for early cancerous subjects, while the 0.9 quantile and above corresponds to the older age cancerous subjects. Figure 11 documents the behaviour of the parameter estimates from quantile regression. For the independent variables, note that the extreme quantiles are the ones where the nonlinear effect is prominent. The left column of Figure 11 shows the intercept term of the model, that is, the predicted Tumour size at each quantile with no contribution from the age of the subject. The curve is monotone upwards.
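A conditional quantile line of Tumour size on age can be fitted exactly for small samples with the brute-force sketch below. It relies on the known property that, with an intercept and one predictor, a minimizer of the total check loss exists among the lines passing through two data points. This is an illustration only, not the estimation routine used in this study. It uses age alone, without race, and the data are synthetic, with a spread that widens with age.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function: (tau - I(u < 0)) * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u


def quantile_line(x, y, tau):
    """Exact single-predictor quantile regression for small samples.

    Tries every line through two data points and keeps the one with the
    smallest total check loss. Returns (intercept, slope).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(
            check_loss(yk - intercept - slope * xk, tau) for xk, yk in zip(x, y)
        )
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Synthetic data: sizes fan out around 20 + 0.1 * age as age grows.
ages = list(range(30, 80))
pattern = [-2, -1, 0, 1, 2]
sizes = [20 + 0.1 * a * (1 + pattern[k % 5]) for k, a in enumerate(ages)]
fits = {tau: quantile_line(ages, sizes, tau) for tau in (0.1, 0.5, 0.9)}
print(fits)
```

In this synthetic example the slope of the fitted line changes from negative at τ = 0.1 to positive at τ = 0.9, which is the kind of quantile-dependent age effect that a single OLS line cannot show. The search costs on the order of n³ operations, so for data sets of realistic size a linear-programming solver or a statistics package would be the practical choice.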

However, our main interest in these graphs is the middle column, age. The age of the subjects clearly makes much more difference at the lower and upper quantiles. At the 0.05 quantile, a one-year increase in age is associated with an increase of about 0.1 mm in Tumour size. Similarly, at the upper quantiles, for example at the 0.92 quantile, Tumour size increases by 0.05 mm per year of age. It is interesting to note that the age parameter is significant at the 0.92 quantile, insignificant at 0.95, and significant again at 0.97.
The rightmost column of Figure 11 shows the coefficients of the race term. We notice that the race estimates do not behave differently from the OLS prediction.
Across all quantiles, it is interesting to observe that age and race are both significant only at the 0.2 and 0.3 quantiles. This indicates that both age and race played an important role at these two quantiles. In addition, the effect of age is highly significant there: an increase of one year in age is associated with a 0.07 mm increase in Tumour size. At the higher quantiles (0.92, 0.95, 0.97), even though the race coefficients were not statistically significant, their opposite signs indicate that the effects of the race variable are heterogeneous. The effect of age was not significant at the 0.95 quantile, indicating that age may have heterogeneous effects on Tumour size. Table 5 reports the quantile regression results for the lower and higher quantiles. We omit the quantile regression output for the intermediate quantiles, where age showed no significant effect.

Conclusions
In this paper, we performed an age dependent analysis of colon cancer Tumours using mathematical and statistical modelling. We modelled Tumour size as the dependent variable, with age and race as independent variables, using line plots fitted by ordinary least squares regression and quantile regression equations. Linear regression equations were developed for each of the four races: Caucasians, African Americans, Asian Indians and other races. We obtained different age intervals for each race and fitted ordinary least squares regression equations, and we calculated the average rate of change and the instantaneous rate of change of Tumour size for each age interval. We then fitted quantile regression equations with age and race as independent variables and Tumour size as the dependent variable. Quantile regression is a semiparametric tool capable of modelling heterogeneity in the relationship between a dependent variable and one or more independent variables without making parametric assumptions about the conditional distributions.
The quantile parameter estimates are nonlinear at the extreme quantiles. We fitted quantile regression equations at the extreme quantiles 0.05, 0.1, 0.15, 0.2, 0.3, 0.92, 0.95 and 0.97. The race variable was significant only at the lower quantiles 0.2 and 0.3. The age variable is significant at both extremes. Estimates of the regression coefficients, with standard errors and 95% confidence interval bounds, were obtained. The prediction lines for quantile regression are presented in Figure 12. Even though this is only a preliminary study, we find clear indications that age plays a very important role in the growth of identified Tumours, mainly when subjects are diagnosed with colon cancer at a young age.

Conflict of Interest
The authors confirm that there is no potential conflict of interest to publish the paper in the journal.