Relationship between Clinicopathologic Variables in Breast Cancer Overall Survival Using Biogeography-Based Optimization Algorithm

Breast cancer is the most common cancer among women and is considered a major public health concern worldwide. Biogeography-based optimization (BBO) is a novel metaheuristic algorithm. This study analyzed the relationship between the clinicopathologic variables of breast cancer using Cox proportional hazard (PH) regression on the basis of the BBO algorithm. The dataset is prospectively maintained by the Division of Breast Surgery at Kaohsiung Medical University Hospital. A total of 1896 patients with breast cancer were included and tracked from 2005 to 2017. Fifteen general breast cancer clinicopathologic variables were collected. We used the BBO algorithm to select the clinicopathologic variables that could potentially contribute to predicting breast cancer prognosis. Subsequently, Cox PH regression analysis was used to demonstrate the association between overall survival and the selected clinicopathologic variables. C-statistics were used to test predictive accuracy and the concordance of various survival models. The BBO-selected clinicopathologic variables model obtained the highest C-statistic value (80%) for predicting the overall survival of patients with breast cancer. The selected clinicopathologic variables included tumor size (hazard ratio [HR] 2.372, p = 0.006), lymph node metastasis (HR 1.301, p = 0.038), lymphovascular invasion (HR 1.606, p = 0.096), perineural invasion (HR 1.546, p = 0.168), dermal invasion (HR 1.548, p = 0.028), total mastectomy (HR 1.633, p = 0.092), without hormone therapy (HR 2.178, p = 0.003), and without chemotherapy (HR 1.234, p = 0.491). This number was the minimum number of discriminators required for optimal discrimination in the breast cancer overall survival model with acceptable prediction ability. Therefore, on the basis of the clinicopathologic variables, the survival prediction model in this study could contribute to breast cancer follow-up and management.


Introduction
Breast cancer is the most common cancer among women worldwide, with an estimated 1.67 million newly diagnosed cases each year, ranking second for cancer incidence rate and fifth for cause of death from cancer [1]. Worldwide increase in breast cancer incidence represents a sizeable burden on public health services [2]. Consequently, the diagnosis, treatment, and prognosis of breast cancer has become a vital research concern [3]. Previous studies have reported various prognosis factors for susceptibility [4][5][6][7][8][9] and the overall survival of patients in breast cancer [3,[10][11][12].
The clinicopathologic factors for breast cancer prognosis and overall survival include both tumor burden and tumor molecular biological factors. Tumor burden factors are usually defined as tumor size, lymph node invasion, 2 BioMed Research International and lymph vascular and dermis invasion, and tumor molecular biological factors usually include tests for hormonal status-including estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2)-combined with fluorescence in situ hybridization and immunochemistry test results. The Nottingham prognostic index (NPI) was developed to predict prognosis outcomes for various tumor burden situations [13,14], and the breast cancer severity score was developed for the same purpose but depends mainly on both tumor burden and tumor molecular biological factors [15]. Age at diagnosis is also a factor that is associated with breast cancer prognosis outcome and overall survival [16]. The effect of age on breast cancer progression and mortality might be affected by other clinicopathologic factors [17,18]. Although various treatments for breast cancer can result in favorable prognosis and survival rates for breast cancer patients [19][20][21][22][23], the radiotherapy, chemotherapy, hormonal therapy, and targeted therapy are highly reliant on clinicopathologic factors [10].
Formerly, analysis of survival benefit or long-term followup of breast cancer was based mainly on the common statistical analysis strategy. Along with the improvements in data volume and complex disease causality, machine learning and optimization algorithms have emerged as novel strategies for analyzing breast cancer survival [24][25][26][27]. Both statistical and machine learning and optimization algorithm approaches provide theoretical and acceptable explanations for the association between clinicopathologic factors and breast cancer survival benefit. The combined and hybrid use of both statistical and machine learning approaches is a trend in modern biological research.
Biogeography-based optimization (BBO) is a metaheuristic algorithm that was proposed by Simon in 2008 to solve global optimization problems [28]. BBO is an evolutionary algorithm that was inspired by the migration of species between habitats. Specifically, it was inspired by biogeography, which describes (i) the speciation and migration of species between isolated habitats and (ii) the extinction of species [29]. In recent years, the BBO algorithm has been widely used in a myriad of fields, such as to solve the engineering optimization problem. Various BBOs have been proposed to enhance the BBO search ability in specific problems, including blended BBO [30], localized BBO [31], and ecogeography-based optimization [32]. The BBO algorithm is a powerful search technique because it contains both exploration and exploitation strategies based on migration [33].
To explore the clinicopathologic variables of the overall survival of patients with breast cancer, this study analyzed the relationship between the clinicopathologic variables of breast cancer by using a BBO algorithm. We sought to (i) assess the relationship between the clinicopathologic characteristics of overall survival among patients with breast cancer and (ii) to demonstrate the optimization of the overall survival prediction model on the basis of the clinicopathologic variables of breast cancer.

Data Source and Patients.
All data were collected from the single-center Taiwan Breast Cancer Consortium (TBCC) database, which is prospectively maintained by the Division of Breast Surgery at Kaohsiung Medical University Hospital, Taiwan. Patients who were diagnosed with ductal carcinoma in situ were excluded. In total, 1,896 patients with breast cancer were included and tracked from 2005 to 2017. The prognosis variables in this dataset included age at diagnosis, grade, tumor size (American Joint Committee on Cancer [AJCC] stage), estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), lymph node (AJCC stage), lymph vascular invasion (LVI) status, dermal invasion, perineural invasion, surgical method, radiotherapy, chemotherapy, and hormone therapy. The overall survival term of all participants with breast cancer was tracked from the date of first diagnosis until participant death or study conclusion. The proposed analysis procedure for TBCC dataset is summarized in Figure 1.

Biogeography-Based Optimization.
The BBO algorithm is a population-based optimization algorithm inspired by the natural biogeographical distribution of species [28], which simulates biogeographical species distribution in accordance with the insular migration of species. In biogeography, an area's quality is evaluated by considering suitability index variables (SIVs), including climate, temperature, and humidity. The habitat suitability index (HSI) represents insular quality. A high HSI value indicates that an area is a superior habitat, whereas a low HSI value indicates that an area is an unsuitable habitat. Higher HSI value areas are usually saturated, which means that species encounter difficultly in migrating to these areas and species currently living in these areas are likely to migrate to other areas. By contrast, low HSI value areas are likely to acquire many migrant species come to here. In the BBO algorithm, an unsatisfactory solution results in an area with a low HSI value, whereas a satisfactory solution results in an area with a high HSI value. According to the BBO mechanism, satisfactory solutions are likely to share their SIVs with other solutions and are unlikely to accept SIVs from other solutions. The BBO migration model depicts the migration of species in a habitat. I is the maximum immigration rate ( ), E is the maximum emigration rate ( ), max is the maximum number of species that an island can host, and S 0 is the number of species that causes the immigration rate to equal the emigration rate. The immigration rate increases as the solution quality decreases, whereas the emigration rate decreases with species quantity. This linear model describes the way for simulating species migration. The BBO algorithm comprises two primary stages: migration and mutation.

Migration.
In migration, the probabilistic model can be used to represent the concepts of emigration and immigration. Consider the probability that a habitat contains S species; changes from time t to time t + Δ , which can be formulated as   where and are, respectively, the immigration and emigration rates when S species are present in the habitat. Equation (1) holds because, to have S species at time (t + Δt), one of the following conditions must be true [28]: (1) S species were present at time t, and no immigration or emigration occurred between t and t + Δt. Suppose that time Δt is sufficiently small; the probability of more than one immigration or emigration occurring can be ignored. Subsequently, taking the limit of (1) as Δt → 0 provides the following equation: From the migration operation curves, where k number of species are present, the emigration and immigration rates can be formulated as (3) and (4), respectively: where I and E are the maximal immigration and emigration rates, respectively. The pseudo code of the migration operator is depicted in Algorithm 1.
Consider the special case E = I; (3) and (4) can be combined as

Mutation.
In the BBO algorithm, some events, such as the wind carrying seeds or flotsam, provide more favorable features that allow an island to generate superior solutions with statistically significant enhancements. Through species count probabilities P s , the mutation rate can be determined as where max is a user-defined maximal mutation rate that m can reach, and max = max( ). The mutation scheme tends to increase diversity among the population. Highly probable solutions tend to be more dominant in the population. Thus, the high HSI solutions likely mutate, which increases the probability that they will improve even more than they already have. The pseudo code of the mutation operator is depicted in Algorithm 2.
The BBO algorithm can be described using the following steps: (1) Initialize the BBO parameters, which include a problem-dependent method for mapping problem solutions to SIVs and habitats, the modification probability mod , the maximal species count max , the maximal migration rates E and I, the maximal mutation rate max , and elite number p. (2) Initialize the habitats, which depend on the population size and problem to determine what habitat is a potential solution.
(3) Calculate the HSI of each habitat, emigration rate , immigration rate , and species S.
(4) Identify the elite habitats depending on the HSI value.
(5) Migration operation: Modify each nonelite habitat by immigration and emigration rates. The probability that a habitat is modified is proportional to its immigration rate , and the probability that the source of the modification comes from a habitat is proportional to the emigration rate . After modification of each nonelite habitat using the migration operation, each HSI is recomputed.
(6) Update the species count probability within the habitat according to (2). The mutation operation is performed on each nonelite habitat, and compute each HSI value.
(7) Go to step (3) for the next iteration. The BBO operation is terminated if the criteria are satisfied.
The BBO parameters, habitats, and HSI value evaluations for identifying the relationship between the clinicopathologic variables of breast cancer are explained in Section 2.2.4.

Initializing the BBO Parameters and Habitats.
In this study, the BBO algorithm parameters were set as follows: habitat size = 50, number of generation = 100, maximum immigration and emigration rates for each island = 1, elite number = 2, and mutation probability = 0.04 [30].
The solution H comprises ℎ (i = 1 to population number) as in (7), each habitat h consists of (n = 1 to number of problem dimensions) as in (8), and each consists of a randomly generated 0 or 1 as in (9).

Evaluation of HSI Values.
We use the C-statistic (or "concordance" statistic) as the HSI values in the BBO algorithm; the C-statistic value is calculated according to linear prediction estimated through Cox proportional hazard (PH) BioMed Research International 5 regression. The C-statistic's goodness of fit can be measured for binary outcomes in a regression model. The value of the C-statistic indicates the probability that a patient who had experienced an event had a higher mortality risk than one who had not experienced the event. C-statistics were used to test the predictive accuracy of survival models. The value of the C-statistic equals the area under the receiver operating characteristic curve. A C-statistic value of 0.5 indicates that a model predicts the outcome by chance, 0.7-0.8 indicates acceptable discrimination, 0.8-0.9 indicates excellent discrimination, 0.9-0.99 indicates outstanding discrimination, and 1.0 is perfect prediction [34].
Harrell [35] defined the C-statistic as the ratio of all sample pairs for which the predictions and real outcomes are concordant. Assume a sample data of M items is given. Let 1 , 2 , 3 , . . . , indicate the survival times, and 1 , 2 , 3 , . . . , indicate the predicted probabilities of survival. Given that ̸ = , each concordant pair is assigned 1 point, and each discordant pair is assigned 0 point [36].
Subsequently, the C-statistic is calculated using the following equation: where is the number of all usable pairs, and U is a set of all usable pairs of participants (i, j).

Univariate Clinicopathologic
Variables. The 15 clinicopathologic variables included age, grade, tumor size, ER, PR, HER2, lymph node, LVI, dermal invasion, perineural invasion, surgery method, radiotherapy, chemotherapy, hormone therapy, and target therapy. Each clinicopathologic variable is dichotomized into low-and high-risk characteristics in Table 2, which lists the clinical characteristics for low-and high-risk breast cancer progression; these characteristics were identical to those used in previous studies. The univariate Cox regression analysis of the clinicopathologic variables in overall survival of patients with breast cancer is presented in In the stepwise selection Cox PH model with inclusion at p = 0.05, tumor sizes greater than 2 cm (adjusted HR = 3.017, p < 0.005), LVI (adjusted HR = 1.878, p = 0.016), dermal invasion (adjusted HR = 2.152, p = 0.048), and without hormone therapy (adjusted HR = 2.231, p = 0.002) were significantly correlated with a high risk of breast cancer progression. In the stepwise selection Cox PH model with inclusion at p = 0.1, older age (adjusted HR = 0.589, p = 0.036) was correlated with a high risk of mortality for patients with breast cancer, whereas grade III (adjusted HR = 1.830, p =   Stepwise selection met the p = 0.05 level for entry into the model; 2 Stepwise selection met the p = 0.1 level for entry into the model; 3 Stepwise selection met the p = 0.2 level for entry into the model. * p-value is estimated using Cox PH regression.
The C-statistic values were 76%, 78%, 78%,77%, and 80% in the fully adjusted Cox PH model, three stepwise selection Cox PH (with inclusion at p = 0.05, 0.1, and 0.2) models, and BBO-Cox PH model, respectively. The BBO-Cox PH model obtained the highest C-statistic value of all the models, which indicated higher concordance for predicting the overall survival of patients with breast cancer.
We present the survival curves of various clinicopathologic characteristic categories using Kaplan-Meier method. The difference between each clinicopathologic variable in the survival curves was estimated using the log-rank test. The BBO-Cox PH model included tumor size, lymph node, dermal invasion, surgery method, chemotherapy, and hormone therapy as breast cancer progression predictors, and the results revealed that all BBO-selected clinicopathologic variables were statistically significant in different categories within each individual clinicopathologic variable. The Kaplan-Meier curves are presented in Figure 2.

Discussion
Overfitting is a limitation that must be faced for conventional statistical approaches, such as regression models. The principal reason is that regression-based selection approaches provide analysis results that depend mainly on the distribution of a population in associated variables. BBO is similar to evolutionary algorithms, such as the genetic algorithm (GA) [37] and particle swarm optimization (PSO) [38]. GA, PSO, and BBO are all inspired by nature and perform information sharing between solutions (i.e., genes for GA, particles for PSO, and habitats for BBO) [39]. A GA solution will "die" at the end of each generation, and PSO and BBO solutions will continue to exist as the optimization process progresses. PSO solutions tend to be clustered together in similar groups, whereas GA and BBO solutions do not necessarily have any clustering tendencies. BBO has unique features, and a BBO solution shares its content (SIV) directly with other solutions. This feature leads to BBO having better performance than GA and PSO in optimization problems [39]. BBO has been successfully applied to several engineering problems, including global benchmark functions and economic load scheduling [33]. In this study, BBO was successfully applied to select clinicopathologic variables that resulted in superior performance compared with other Cox regression models (i.e., the fully adjusted Cox regression model and three stepwiseselected Cox regression models). BBO is a mechanism of the Cox regression model that improves the ability to explore C-statistics and variables. By using BBO, the migration operator shared more accurate information than with other solutions, and including randomization for the mutation operator at the mutation stage partly resolved the overfitting problem.
The BBO-selected model included tumor size, lymph node metastasis, lymphovascular invasion, perineural invasion, dermal invasion, surgery method, hormone therapy, and chemotherapy as predictors for the overall survival of patients with breast cancer. This optimization model obtained superior prediction results and concordance compared with the fully adjusted Cox PH model in the assessment of Cstatistics. The fact that the BBO algorithm considers the internal relationship between clinicopathologic variables may explain why it increases the discriminant ability of the final BBO model.
In addition, we reevaluated the individual HR for each selected variable in the BBO model through multivariate Cox PH regression analysis. The results revealed that not all of the selected variables provided a significant individual HR in the Cox PH model. Only tumor size, lymph node metastasis, dermal invasion, and hormone therapy had a significant correlation with the overall survival of patients with breast cancer. These factors had been previously reported as associated with the overall survival of patients with breast cancer.
Tumor size and lymph node are frequently used to predict the overall survival and prognosis of patients with breast cancer [40]. The NPI index is a benchmark for breast cancer prognoses that includes these two main factors [13,41]. The extended use of the NPI index for various breast cancer prognostic purposes remained contributed and widespread in recent studies such as including the hormone status (ER, PR, and Her2) [42][43][44]. Dermal invasion, also known as dermal lymphatic invasion, is generally associated with inflammatory breast cancer [45]. Breast cancer characterized by dermal invasion is generally correlated with inferior overall survival. Hormone therapy is an effective therapy for hormone-positive patients with breast cancer [46]. Therefore, the decision to commence hormone treatment mainly relies on the ER and PR status of patients with breast cancer. Although ER and PR were not included in the BBO-selected model, hormone therapy (an effective treatment options for patients with ER positive or PR positive breast cancer) could partially explain the beneficial effect of ER and PR positive case with hormone therapy in breast cancer overall survival. However, the retrospective nature of our study might have limited the analysis of various covariates that have been previously reported as correlated with overall breast cancer survival rates.

Conclusion
This study used the BBO algorithm to search for combinations of clinicopathologic variables that can facilitate prediction of the overall survival of patients with breast cancer. Compared with the fully adjusted multivariate Cox regression model and stepwise selection Cox regression model, the BBO-selected model had a higher C-statistic value for predicting overall survival. This study determined that the BBO algorithm could select only 8 variables from 15 clinicopathologic variables, which is the minimum number of discriminators required for optimal discriminant effectiveness when predicting the overall survival of patients with breast cancer accurately and with concordance. This model may have vital implications for the selection of clinicopathologic variables for breast cancer.

Data Availability
The data used to support the findings of this study have not been made available because of the institution regulation.

Conflicts of Interest
The authors declare that they have no competing interests or other financial motivations.