Actuarial Analysis of Survival among Breast Cancer Patients in Lithuania

Breast cancer is the most common cause of cancer mortality among women both in Lithuania and worldwide. Chances of survival after diagnosis differ significantly depending on the stage of disease at the time of diagnosis. Long observation periods are required to estimate survival over horizons of, e.g., 15–20 years. Moreover, since mortality of the general population changes with time, estimates of survival of cancer patients derived after a long period of observation can become outdated and may no longer be suitable for estimating survival of patients who were diagnosed later. Therefore, it can be useful to construct analytic functions that describe survival probabilities; shorter periods of observation can be sufficient for such a construction. We used the data collected by the Lithuanian Cancer Registry for our analysis. We estimated the chances of survival for up to 5 years after patients were diagnosed with breast cancer in Lithuania. Then we found analytic survival functions which best fit the observed data. At the end of this paper, we provide some examples of applications and directions for further research. We mainly used the Kaplan–Meier method for our study.


Introduction
Breast cancer has been the most common cause of cancer mortality among women both in Lithuania and worldwide in recent years. Cancer is the second leading cause of death among women, the first being cardiovascular disease. Statistical data show that in Lithuania during 2018 cancer accounted for 18% of all deaths among women of all ages, second only to cardiovascular disease (63%). Though there were some deaths due to cancer or cardiovascular disease among girls and young women (aged 10-34), mortality due to both causes increased quite significantly with age. The two causes, however, result in slightly different mortality patterns. Approximately one out of three deaths among women aged 35-74 is due to some type of cancer. Cardiovascular disease is more threatening for women aged 60 and above, for whom the share of deaths due to this cause increases from 30% in the age group 60-64 to almost 80% in the oldest age group (85+). Reduction of mortality due to cancer can therefore result in a lower number of fatalities among women of active age, when they are near or at the peak of their careers and/or when their children are young. Breast cancer is usually the most common cause of death among all deaths due to cancer among women. For example, in 2012, deaths due to breast cancer in Lithuania amounted to 24% in the youngest age group, 18% in the age group 55-74, and 11% among the oldest (aged 75+) [1].
A diagnosis of cancer is still understood by laypeople as a terminal, life-threatening illness. However, due to advances in medicine, survival after diagnosis can be quite high, especially if the disease is found early. Therefore, it is important to estimate survival rates among cancer patients, and to determine whether there are significant differences among patients diagnosed at different stages. The results obtained can then be used for many purposes. For example, Pyenson et al. [2] estimated how screening for lung cancer offered as part of health insurance benefits could help to save lives at a relatively low cost. Estimation of survival can help insurance companies design new health-related products, as in some cases the benefit paid depends on the survival of the patient until some time in the future after diagnosis. Results of research can also be useful for public policy makers, since the financial burden of cancer can be significant for households. Dieguez et al. [3] analyzed additional costs incurred by breast, lung, and colorectal cancer patients. Once survival rates are known, it is easier to assess the financial impact of cancer awareness campaigns, and to estimate the funding that public programs need to financially support patients and their families.
There is a large body of research on the estimation of survival after breast cancer diagnosis. The Kaplan-Meier estimate is one of the most popular tools for such analysis, although other statistical methods are also used. Fisher et al. [4] analyzed survival among breast cancer patients based on treatment received. Giordano et al. [5] analyzed possible improvements in survival after diagnosis of breast cancer. Narod et al. [6] examined whether the mortality rate is influenced by age at diagnosis, ethnicity and initial treatment received. Narod et al. [7] used the Kaplan-Meier technique and time-to-death histograms to estimate mortality of women who died of breast cancer during the 20-year period after diagnosis. Chen et al. [8] and Giannakeas and Narod [9] examined selected groups of patients diagnosed with breast cancer. Seung et al. [10] and Tyurimina et al. [11] drew attention to different scenarios of breast cancer development depending on the subtype of the tumor. Smith [12] discusses breast cancer surveillance guidelines. We observe that the Kaplan-Meier method is universal and can be applied not only to the study of breast cancer treatment but also to other cancers to compare the survival of different groups of patients after diagnosis; see, for instance, [13][14][15][16][17][18][19][20][21][22][23][24][25].
Long-term observation of patients is usually required to construct survival functions for even 10-15 years after onset, so construction of reliable survival models takes a significant amount of time. Moreover, mortality of patients observed for such long periods may be influenced by a number of other factors, such as general changes in mortality in the country. Hence, even after estimating survival for longer periods, it remains unclear whether the results may be used to predict mortality of patients diagnosed in recent years. Therefore, specific statistical and/or actuarial tools are needed that estimate survival for longer periods but are based on data collected over, for example, 5-10 years. In [26], linear regression is fitted to model breast cancer mortality rates in various regions throughout the world. We went further in this direction. Firstly, we analyzed mortality of breast cancer patients in Lithuania using two different approaches: the ratio of deaths to exposure and the Kaplan-Meier estimator. Secondly, we fitted analytic functions to the obtained estimates. The derived analytic functions depend on the stage at onset and can be used to project survival over longer periods.
The rest of our paper is organized as follows. In Section 2, we present some mathematical preliminaries and notations which we use later. In Section 3, we describe data and methods used for our analysis. The main results of our analysis are presented in Section 4. The possible applications of our research are discussed in the concluding Section 5.

Some Notations and Mathematical Preliminaries
Let us consider a person who was just diagnosed with cancer. Her future lifetime is a nonnegative random variable which we denote by T. We assume that there exists a differentiable survival function S(t) = P(T > t).
We denote the probability of surviving until time t + h for an individual alive at time t by the following standard equality:
$$ {}_h p_t = P(T > t + h \mid T > t) = \frac{S(t+h)}{S(t)}. $$
Alternatively, the probability of dying before time t + h, given being alive at time t, is
$$ {}_h q_t = 1 - {}_h p_t = \frac{S(t) - S(t+h)}{S(t)}. $$
The instantaneous rate of mortality, or the force of mortality at time t, is defined by the equality
$$ \mu_t = \lim_{h \to 0+} \frac{{}_h q_t}{h} = -\frac{S'(t)}{S(t)}. $$
This function $\mu_t$ is also called the hazard function for the survival function S(t).
Sometimes it is useful to average the behavior of the force of mortality over the interval (t, t + 1]. In such a case, the central death rate or central mortality rate $m_t$ is used, which is defined as a weighted average of the force of mortality:
$$ m_t = \frac{\int_0^1 S(t+s)\,\mu_{t+s}\,ds}{\int_0^1 S(t+s)\,ds}. $$
Finally, we define the measure of risk exposure, or central exposure to risk $E_t$, as the total number of years lived by persons under investigation in the time interval (t, t + 1]. If $d_t$ is the number of deaths during the period (t, t + 1], then the following equality holds for the central death rate:
$$ m_t = \frac{d_t}{E_t}. $$
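As a minimal numerical sketch of these definitions, the Python snippet below evaluates the conditional survival and death probabilities, the force of mortality and the central death rate for a hypothetical exponential survival function. The function `survival` and its rate of 0.1 are illustrative assumptions and are not estimated from the registry data. For a constant force of mortality, both the force of mortality and the central death rate should coincide with that rate, which gives a simple sanity check.

```python
import math


def survival(t, rate=0.1):
    """Hypothetical survival function S(t) = exp(-rate * t) (an assumption)."""
    return math.exp(-rate * t)


def cond_survival(S, t, h):
    """Probability of surviving to t + h given alive at t: S(t + h) / S(t)."""
    return S(t + h) / S(t)


def cond_death(S, t, h):
    """Probability of dying before t + h given alive at t."""
    return 1.0 - cond_survival(S, t, h)


def force_of_mortality(S, t, eps=1e-6):
    """mu_t = -S'(t) / S(t), with S'(t) approximated by a central difference."""
    derivative = (S(t + eps) - S(t - eps)) / (2.0 * eps)
    return -derivative / S(t)


def central_death_rate(S, t, steps=1000):
    """m_t = (S(t) - S(t + 1)) / integral of S over (t, t + 1].

    The numerator equals the integral of S(t + s) * mu_{t + s} over s in (0, 1];
    the denominator is approximated with the midpoint rule.
    """
    width = 1.0 / steps
    integral = sum(S(t + (k + 0.5) * width) for k in range(steps)) * width
    return (S(t) - S(t + 1.0)) / integral


t = 2.0
p = cond_survival(survival, t, 1.0)
q = cond_death(survival, t, 1.0)
mu = force_of_mortality(survival, t)
m = central_death_rate(survival, t)
print(p, q, mu, m)
```

Any differentiable survival function can be passed in place of `survival`; only the exponential case has the property that the two rates agree.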

Data and Methodology
The data collected by the Lithuanian Cancer Registry [27] were used for our analysis. The Lithuanian Cancer Registry is a nationwide, population-based cancer registry which covers the whole territory of Lithuania. We analyzed only cases in which the patient was diagnosed with cancer for the very first time, i.e., there was no evidence that the patient had been diagnosed with any type of cancer before. We analyzed cases diagnosed during the period 1995-2012, and followed the individuals from the onset of disease until death or 31 August 2018 (the end of the study), whichever came first. Cases lost to follow-up were treated as right censored, i.e., the survival time of such persons was considered to be at least as long as the time until their last observation. We adopted the same approach for all patients who survived until the end of the study period, namely, 31 August 2018: their survival time was known to be at least as long as the time until the end of the study period. It is important to note that even patients diagnosed at the end of the diagnostic period (the end of 2012) could be followed for at least 5 years. We disregarded the cause of death after diagnosis. We assumed that all deaths after diagnosis were due to cancer, or at least that the diagnosis had an influence on the probability of death. One might observe, however, that our results can be easily adapted to the situation in which death after a specific period of time since diagnosis (1 year+) is no longer regarded as death due to cancer. On the other hand, some types of insurance (additional critical illness insurance) pay at least part of the benefit conditional on the survival of the insured person for some time after diagnosis. Part of the benefit is not paid if the insured person dies within a specific period, regardless of the cause of death. More about critical illness insurance and types of benefits can be found in Chapter 3 of [28], for instance.
Initially we had a set of 22,437 cases. After initial inspection, we decided to remove 435 cases where the day of death coincided with the day of diagnosis, mainly because the majority of such cases were ones in which the death certificate indicated breast cancer as the cause of death (death certificate only cases). Such patients had been diagnosed before death, but it was impossible to track the survival time from the moment of diagnosis until death. So, the resulting set of data consisted of 22,002 records (N = 22,002). Since we believe that the stage of cancer determined at the time of diagnosis can significantly influence the survival time of the patient, we divided our data into groups according to the stage of cancer (see Table 1). Finally, we excluded 738 more cases from further analysis, since the stage of disease was unclear. Consequently, for the final analysis we selected 21,264 cases. The main goal of our research was to construct survival functions for patients diagnosed with breast cancer based on the stage of disease at onset. As noted in the Introduction, long observation of patients is required to construct survival functions for even 10-15 years after onset, and mortality of patients observed for such long periods may be influenced by a number of other factors, such as general changes in mortality in the country. Hence, even after estimating survival for longer periods, it remains unclear whether the results may be used to predict mortality of patients diagnosed in recent years. Our goal was therefore to construct an analytic survival function based on a shorter observation period. We limited estimation of survival functions from the raw data to five years after onset, so that patients diagnosed at the end of the diagnosis period (the end of 2012) could be observed for the full five years before the end of the study period. Otherwise, the percentage of censored data would increase and could distort survival probabilities simply because lives were censored when the study period finished. Based on the observed data, we attempted to construct analytic survival functions. We used two different approaches for the parameter estimation.

Exposure to Risk and Central Mortality Rate
At first we estimated the central mortality rate $m_t$ at time t, measured in months since onset, using the traditional actuarial technique. For this purpose we calculated exposure to risk based on lives at each period since the onset of disease, and for each stage at inception separately. We used the so-called exact exposure to risk. Since the time of onset was used as the beginning of the observation period for each patient, we had no entries at times other than the initial time t = 0. We assumed that in any time period (t, t + 1] the contribution to exposure is 1 for all enders, while those who died or withdrew contributed less than 1. We calculated the exact contribution of each life from the beginning of the period until the time of death or withdrawal. In that case, the central mortality rate at time t has the standard expression
$$ m_t = \frac{d_t}{E_t}, $$
where $m_t$ is the central mortality rate in the interval (t, t + 1], $d_t$ is the number of deaths during the time period (t, t + 1] and $E_t$ is the central exposure to risk. We only note that deaths occurring at the exact time t are attributed to the period (t − 1, t]. This is in line with the traditional actuarial approach.
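The exact exposure calculation described above can be sketched in Python as follows. The record layout (duration in months since diagnosis plus a death indicator) and the four sample records are hypothetical and serve only to illustrate the stated conventions: enders contribute a full unit of exposure, deaths and withdrawals contribute the fraction of the period actually lived, and a death at exact time t is attributed to the period (t − 1, t].

```python
import math


def central_rates(records, horizon):
    """Compute deaths, exact exposure and central rates per period.

    records: list of (duration, died) pairs, duration > 0 measured from diagnosis.
    horizon: number of unit periods (k, k + 1], k = 0, ..., horizon - 1.
    """
    deaths = [0] * horizon
    exposure = [0.0] * horizon
    for duration, died in records:
        for k in range(horizon):
            if duration > k:
                # Full unit for enders, exact fraction for deaths/withdrawals.
                exposure[k] += min(duration, k + 1) - k
        if died:
            # A death at exact time t falls into (t - 1, t].
            k = math.ceil(duration) - 1
            if k < horizon:
                deaths[k] += 1
    rates = [d / e if e > 0 else float("nan") for d, e in zip(deaths, exposure)]
    return deaths, exposure, rates


# Hypothetical records: (months since diagnosis, died?)
records = [(0.5, True), (2.0, True), (1.5, False), (3.0, False)]
deaths, exposure, rates = central_rates(records, horizon=3)
print(deaths, exposure, rates)
```

In this toy data set the first period has 3.5 units of exposure and one death, so the crude central rate is 1/3.5; with real data the calculation would be repeated for each stage separately.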
Since the estimated values of the central mortality rate are influenced by quite significant random fluctuations, we graduated the calculated rates using the R software package and the Whittaker-Henderson graduation method. The Whittaker-Henderson smoothing has a couple of advantages. Firstly, it is non-parametric. Hence, it makes no assumption about the functional form of the central mortality rate. Secondly, it allows a balance between fidelity to the observed data and smoothness of the fitted curve. Fidelity to the data is measured by the sum SS of squares of deviations between the observed values $m_t$ and the fitted values, which we denote here by $\tilde m_t$:
$$ SS = \sum_{t} \left(m_t - \tilde m_t\right)^2. $$
The fitted curve is smoother for smaller values of the sum of squares of third differences, $M_3$. Fitted values are then calculated by minimizing the so-called balance function $SS + \lambda M_3$, where λ is the so-called smoothing parameter, and
$$ M_3 = \sum_{t} \left(\Delta^3 \tilde m_t\right)^2, \qquad \Delta^3 \tilde m_t = \tilde m_{t+3} - 3\tilde m_{t+2} + 3\tilde m_{t+1} - \tilde m_t. $$
One can find more about the Whittaker-Henderson graduation in Chapter 11 of [29], for instance.
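To illustrate the graduation step, here is a minimal pure-Python sketch that minimizes SS + λM_3 by solving the linear system (I + λ KᵀK) v = m, where K is the third-difference matrix and m is the vector of crude rates. This is not the R implementation used in the study: it assumes unit weights, the helper names are hypothetical, and the crude rates below are made up for illustration.

```python
def third_difference_matrix(n):
    """Rows apply the third difference (-1, 3, -3, 1) to consecutive values."""
    rows = []
    for i in range(n - 3):
        row = [0.0] * n
        row[i:i + 4] = [-1.0, 3.0, -3.0, 1.0]
        rows.append(row)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def whittaker_henderson(crude, lam):
    """Return graduated values minimizing SS + lam * M_3 (unit weights)."""
    n = len(crude)
    k = third_difference_matrix(n)
    # Normal equations: (I + lam * K^T K) v = crude.
    a = [[(1.0 if i == j else 0.0) + lam * sum(row[i] * row[j] for row in k)
          for j in range(n)] for i in range(n)]
    return solve(a, crude)


# Made-up crude monthly rates with visible random fluctuation.
crude = [0.012, 0.020, 0.011, 0.018, 0.009, 0.015, 0.008, 0.012, 0.007, 0.010]
smoothed = whittaker_henderson(crude, lam=50.0)
print([round(v, 5) for v in smoothed])
```

Larger values of λ push the fitted values towards a quadratic curve, whose third differences vanish, while λ = 0 returns the crude rates unchanged.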

Kaplan-Meier Method
Alternatively, the Kaplan-Meier estimate can be used to estimate survival probabilities. The main advantages of this method are the following:

• It is suitable for data sets with a limited number of cases. Otherwise, using this method may become time-consuming. For this reason, we used an Excel spreadsheet and the built-in VBA programming language for our data.

• It is very suitable for medical trials where time since onset of disease is more important than just the age of the patient, so that it is difficult to apply standard actuarial procedures for the construction of life tables.

• It is non-parametric, so no advance assumptions about the analytic form of the survival function are required, nor do parameters need to be estimated. Despite being non-parametric, it is still a statistical estimator, so standard errors and confidence intervals may be calculated.

• An estimate of the death probability ${}_h q_t$ is obtained. The interval h may be as short as one day, i.e., h = 1/365. Hence, the Kaplan-Meier method is suitable for estimation of death probabilities during quite a short period without making any assumptions about the distribution of deaths within one year. Moreover, the interval h may differ for different subintervals and is not determined a priori, but is based on the data under investigation.
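More information on the Kaplan-Meier method can be found in Chapter 8 of [29], for instance. According to the above procedure, we can obtain the estimator of the survival function in the following form:
$$ \hat S(t) = \prod_{t_i \le t} \left(1 - \frac{d_{t_i}}{l_{t_i^-}}\right), $$
where $d_{t_i}$ is the number of deaths that occurred at time $t_i$, and $l_{t_i^-}$ is the number of patients under observation living immediately before time $t_i$.
The approximate value of the standard error of the estimator $\hat S(t)$ can be calculated using the Greenwood formula:
$$ \widehat{\mathrm{se}}\big[\hat S(t)\big] \approx \hat S(t) \sqrt{\sum_{t_i \le t} \frac{d_{t_i}}{l_{t_i^-}\left(l_{t_i^-} - d_{t_i}\right)}}. $$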
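The product-limit estimator and the Greenwood standard error can be sketched in a few lines of Python. The study itself used Excel with VBA; the function below is only an illustrative re-statement under the usual convention that a life censored at time t is still counted as under observation at t, and the five sample records are hypothetical.

```python
import math


def kaplan_meier(records):
    """Kaplan-Meier estimate with Greenwood standard errors.

    records: list of (duration, died) pairs; died=False means right censored.
    Returns a list of (death time, survival estimate, standard error).
    """
    death_times = sorted({t for t, died in records if died})
    surv = 1.0
    greenwood = 0.0
    out = []
    for t in death_times:
        # Lives under observation immediately before t.
        at_risk = sum(1 for d, _ in records if d >= t)
        deaths = sum(1 for d, died in records if died and d == t)
        surv *= 1.0 - deaths / at_risk
        if at_risk > deaths:
            greenwood += deaths / (at_risk * (at_risk - deaths))
        # If everyone at risk dies, surv is 0 and so is the standard error.
        std_err = surv * math.sqrt(greenwood)
        out.append((t, surv, std_err))
    return out


# Hypothetical records: (months since diagnosis, died?)
records = [(1, True), (2, False), (3, True), (4, True), (5, False)]
estimates = kaplan_meier(records)
for t, s, se in estimates:
    print(t, round(s, 4), round(se, 4))
```

For stratified analysis by stage, the same function would simply be applied to each subgroup of records in turn.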

Results

Exposure to Risk and Central Mortality Rate
First we calculated crude estimates of the central death rate. Since the crude estimates show very erratic behavior, we chose a smoothing parameter λ = 50. The results obtained are summarized in Figures 1 and 2. For comparison and illustration purposes, we included the central mortality rates for Lithuanian females aged 57-62 in 2017 from the Human Mortality Database (www.mortality.org, accessed on 20 April 2020). The initial age of 57 was chosen because this is the average age of patients diagnosed at Stage 1. Since we measured time in months from onset, and the Human Mortality Database provides data for annual periods, the population mortality rate appears as an almost flat line. As can be expected, the highest mortality rate was observed for Stage 4. Note, however, that the mortality rate for Stage 4 was significantly higher at the beginning, then decreased, and almost converged to the population average. This shows that chances of survival increase with time since diagnosis. A similar pattern (decreasing with time), though not so pronounced, is also present in the curve for Stage 3 (see Figure 2). It is interesting to note that the mortality curves for Stages 1 and 2 behave quite unusually. Firstly, they are both below the population average. Secondly, they do not seem to converge to the population average. This can happen for a couple of reasons, such as random fluctuations and the fact that the population average is represented by the mortality rate for females aged 57 at diagnosis. Since the number of observed cases was not so large, it was probably not a good idea to use the average (or even the median) age for comparative purposes.
Our estimated mortality rates behaved differently than expected. Usually we expect mortality rates to increase (or at least stay constant) with age, except perhaps for newborns. Our estimated rates are mostly not monotonic, and do not seem to converge to the population average. This means that no single widely used analytic mortality law can be used for projections, and we should use another method for mortality estimation.

Kaplan-Meier Method
Since traditional actuarial techniques do not seem to be the best tool for survival estimation, we chose to redo the calculations with the Kaplan-Meier estimate. We observed each life from onset of disease until death or withdrawal, and constructed an estimator of the survival function. Since the Kaplan-Meier technique is non-parametric, we estimated the mortality of four subsets (based on stage of disease at onset) by stratifying the data into four subgroups (see Table 1), and then applied the Kaplan-Meier procedure to each subgroup.
Results of our analysis are summarized in Figure 3 and Table 2. For simplicity, we assume that every month has 30 days. The analysis showed that there is still a chance to survive at least five years from the onset of disease even if diagnosed at Stage 4. However, as may be reasonably expected, chances of survival decrease quite significantly with the stage found at diagnosis. Those diagnosed at Stage 1 have a slightly more than 90% chance of surviving for at least 5 years, while the chances for those diagnosed at Stage 4 are only slightly less than 14%. The reader should also observe that even after a diagnosis at Stage 1, the chances of survival are lower than the female population average (see Table 3), where population mortality tables were obtained from the Human Mortality Database (www.mortality.org). The analysis shows that the probability of 5-year survival for patients diagnosed at Stage 1 is 6.5 times higher than for patients diagnosed at Stage 4.

Construction of Analytic Survival Functions
We believe that the estimates obtained for the survival functions using the Kaplan-Meier method are in line with expectations and are consistent. Therefore, we used the survival functions shown in Figure 3 to project survival into further periods. We used Excel and its built-in capabilities to fit analytic functions which best represent survival depending on the stage at onset. Our analysis shows that survival for Stages 1 and 2 is best fitted by a linear function, while a 3rd degree polynomial was used for Stage 3 and a logarithmic function for Stage 4. The exact parametric functions with the corresponding mean squared errors are presented in Table 4, while graphical representations can be seen in Figures 4-7.
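The linear and logarithmic fits can be reproduced outside a spreadsheet with ordinary least squares, which is an assumption about how the fitting was done rather than a description of the exact Excel procedure. In the Python sketch below, the survival values are hypothetical placeholders and not the estimates from Table 2; a logarithmic model S(t) = a + b ln t is fitted as a simple regression on ln t, and a cubic fit would additionally require a general least squares solver, which is omitted here.

```python
import math


def simple_regression(xs, ys):
    """Least squares fit y = intercept + slope * x; returns (intercept, slope, mse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    mse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / n
    return intercept, slope, mse


def fit_linear(times, surv):
    """Fit S(t) = a + b * t."""
    return simple_regression(times, surv)


def fit_logarithmic(times, surv):
    """Fit S(t) = a + b * ln(t); requires t > 0."""
    return simple_regression([math.log(t) for t in times], surv)


# Hypothetical survival estimates at 12, 24, ..., 60 months (not from Table 2).
times = [12.0, 24.0, 36.0, 48.0, 60.0]
early_stage = [0.98, 0.96, 0.94, 0.92, 0.90]
late_stage = [0.45, 0.30, 0.22, 0.17, 0.14]

lin_a, lin_b, lin_mse = fit_linear(times, early_stage)
log_a, log_b, log_mse = fit_logarithmic(times, late_stage)

# Project the fitted curves to 120 months; projections far beyond the
# observed range should be treated with caution.
print(lin_a + lin_b * 120.0, log_a + log_b * math.log(120.0))
```

A fitted curve should also be checked to stay within [0, 1] and to be non-increasing over the projection horizon before it is interpreted as a survival function.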

Applications, Discussion and Conclusions
After the construction of estimates of survival functions and their projections, it became clear that steps taken to diagnose breast cancer during Stage 1 of the disease are not only psychologically, but also financially sensible. The main measures employed are usually regular medical check-ups, ad hoc check-ups (e.g., for medical underwriting when taking out a life insurance policy) and cancer awareness campaigns. In Lithuania, there are currently two main breast cancer awareness programs: the privately financed and managed project Nedelsk (Do not delay) and the public program for women aged 50-69. The latter program is financed by the public health care system and covers mammography screening every two years. Both programs started within a two-year period: Nedelsk in 2002, and the public program in 2004. It was interesting to see whether the measures taken have had an effect. For this reason, we analyzed the numbers of diagnosed cases by year and the distribution of disease by stage at onset. Results are summarized in Figures 8 and 9. Though there was no significant increase in the number of diagnosed cases, the percentage of the most threatening Stage 4 decreased from 17% in 1995 to 8% in 2012. At the same time, the percentage of Stage 1 increased from 8% in 1995 to 31% in 2012. These figures suggest that it is probably worth investing in cancer awareness campaigns. The effectiveness of public cancer awareness campaigns was analyzed in more depth in [30], so the interested reader is referred to this source for more information.
Using our projected survival functions and potential changes in stage at onset with and without a cancer awareness campaign, it is possible to evaluate life years saved or lost due to early or late cancer diagnosis. This may be further used for the financial justification of government spending on public health programs. This issue, however, is beyond the scope of our paper, but may be evaluated in the future.
We believe that further research in this area is worth the investment and should be carried out. One potential direction is to explore whether and, if so, how quickly the mortality of patients converges to the population average. This may help insurance companies to construct select mortality tables. We also noticed that the average age of patients increased with the stage at onset. It would be interesting to know whether this is simply a coincidence or whether some other reason lies behind this fact, e.g., maybe the onset of disease occurred when patients were on average 57 years old, but the disease simply was not diagnosed for some reason. This may help the government to decide at what age it is more effective to start diagnostic screening programs. Certainly, the analysis of mortality among breast cancer patients should be repeated after some years to see whether newly obtained results coincide with ours. Analysis done on a regular basis may help to detect whether changes in population mortality (longevity, or a pandemic such as COVID-19) have an effect on mortality among breast cancer patients.
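One possible way to set up such an evaluation, sketched here in Python purely as an illustration, is to compute the expected number of years lived within a fixed horizon as the integral of the survival function, and to weight these values by the stage distribution at diagnosis. All inputs below are hypothetical: the two linear survival curves, the two-stage split and the stage shares are placeholders and are neither the fitted functions of Table 4 nor the observed distributions of Figures 8 and 9.

```python
def restricted_mean_survival(surv, horizon, steps=1000):
    """Trapezoidal integral of surv over [0, horizon].

    The result is the expected number of years lived within the horizon.
    """
    width = horizon / steps
    total = 0.5 * (surv(0.0) + surv(horizon))
    total += sum(surv(k * width) for k in range(1, steps))
    return total * width


def expected_life_years(stage_shares, stage_curves, horizon):
    """Average years lived within the horizon over the stage distribution."""
    return sum(share * restricted_mean_survival(stage_curves[stage], horizon)
               for stage, share in stage_shares.items())


# Hypothetical survival curves, time measured in years since diagnosis.
curves = {
    "early": lambda t: 1.0 - 0.02 * t,
    "late": lambda t: 1.0 - 0.17 * t,
}
# Hypothetical stage distributions without and with earlier diagnosis.
shares_before = {"early": 0.5, "late": 0.5}
shares_after = {"early": 0.7, "late": 0.3}

before = expected_life_years(shares_before, curves, horizon=5.0)
after = expected_life_years(shares_after, curves, horizon=5.0)
gain = after - before
print(before, after, gain)
```

With these placeholder inputs the shift towards earlier diagnosis adds 0.375 expected life years per patient over a five-year horizon; a real evaluation would need stage-specific fitted curves for all four stages and a justified projection horizon.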
Author Contributions: Conceptualization, A.S. and J.Š.; methodology, A.S.; data collection, validation and formal analysis, A.P. and R.P.; writing-draft preparation, A.S.; visualization, A.S. and R.P.; writing-review and editing, A.P. and J.Š.; project administration, J.Š.; validation, A.S., R.P. and J.Š.; funding acquisition, J.Š. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The data were provided by the Lithuanian Cancer Registry (https://www.nvi.lt). The data are not publicly available.