The use of a multiple imputation method to investigate the trends in Histologic types of lung cancer in Songkhla province, Thailand, 1989–2013

The incidence of lung cancer in many parts of the country as shown in cancer registry statistics is not decreasing. The incidence of adenocarcinoma (ADCA) in Songkhla is now higher than that of squamous cell carcinoma (SCC) in both sexes. The percentage of the unknown histologic type of lung cancer in Songkhla is around 30 %. The objective of this study is to estimate trends in incidence of the two major histologic types of lung cancer: SCC and ADCA, in Songkhla province of Thailand from 1989 to 2013. Age-standardized incidence rates (ASR) were used to compare and described the trends in both major types of cancers. Multinomial logistic regression models were used to impute unknown histological cancer types using a multiple imputation (MI) method to account for the high percentage of unknown histology. The multinomial predictive model for major types of lung cancer in Songkhla consisted of sex, age, year of diagnosis, and place of residence. After MI, the number of cases with both SCC and ADCA in both sexes increased by one-third of the number of cases with originally known histology. The increasing trends were observed in ADCA in both sexes while SCC in males was stable and in females was decreasing. A rapid increase in the incidence of ADCA was found while the incidence of SCC in males showed no significant change and it was declining in females. These results warrant an investigation into risk factors other than cigarette smoking. The number of cases has limited use when the age structure of the population under study is changing. Year of diagnosis was one of the predictors in the MI model.


Background
Primary lung cancer (ICD-10 C33-34) has been the second leading cancer in males and the fourth or fifth in females in Thailand for many decades [1]. In the period of 2007-2009, the estimated age-standardized incidence rate (ASR) was 26.2 and 11.5 per 100,000 Thai male and female populations, respectively [2].
Cancer registries must exclude metastatic cancers to the lungs from being classified as primary lung cancers. It is possible that cancer registries misclassify metastatic lung cancers as primary lung cancers, especially in cancer registries in which the percentage of morphologically verified cases (%MV) is low, and the percentage of cases reported by death certificate only (%DCO) is high [3].
In Thailand, the number of lung cancer cases reported from cancer registries published in the series of 'Cancer in Thailand' had low %MV, and many had high % DCO [2]. This meant that there was a high proportion of unknown histologies among lung cancer cases in some registries. Among cases with known histology, the proportion of adenocarcinoma (ADCA) varied from 40 to 70 % of all lung cancer cases, and that of squamous cell carcinoma (SCC) from 20 to 30 % in males, while ADCA varied from 60 to 80 % and SCC from 6 to 20 % in females. According to this figure, the average percentage of ADCA in Thailand is higher than in western countries while the proportion of SCC is much lower. However, a direct comparison is inappropriate when the %MV is low and/or the %DCO is high as occurs in some registries in Thailand. The ratio of the proportion of SCC/ADCA in both sexes was less than 1 in all SEA countries [4].
Songkhla is a province in the southern region of Thailand occupying an area of 7,392 square km on the eastern side of the Malay Peninsula. Lung cancer has been the leading cancer in males since the establishment of the Songkhla cancer registry in 1989 [1,5,6]. The incidence of lung cancer in men and women increased from 16.3 and 4.3 in 1989-1991 [7] to 24.4 and 8.1 in 2007-2009 [2], respectively. Increases have also been observed in other regions of Thailand.
Non-small cell carcinoma (NSCC) of the lung is a collective term for carcinoma, not otherwise specified, which includes SCC, ADCA, and other rare primary carcinomas, of the lung. SCC and ADCA are usually reported separately while classification of other types of NSCC tends to be reported inconsistently depending on the cancer registry [8,9]. When the term NSCC was first introduced, there was a tendency for pathologists to overuse this diagnosis when special molecular studies were not available or completed such as in cytology and tiny biopsy specimens. Recently, a new classification of lung cancer was proposed, and some new terms were introduced [10,11]. However, SCC and ADCA are still the main types.
A limitation of a cancer registry in estimating an incidence rate ratio is the high rate of missing histological diagnoses. The missing data on specific histological diagnosis usually happens in deep organs such as lung, liver, and brain. It occurs in situations that the patients present in late stage and the aim of treatment by physicians is just for palliation, thus, complete investigation for specific histologic type is not necessary. This situation is still true nowadays in low and middle income countries (LMIC) such as Thailand. So, the missing data on the histology is correlated with the stage and clinical performance of the patients but not directly with the histology of the disease itself. Data are said to be missing at random (MAR) if the probability that a value is missing may depend on observed values in the data but not additionally on the missing value itself [12,13].
Usually the histological type of cancers in an organ is related to different risk factors, disease course and treatment, thus, the need of the incidence and trends in the cancer of specific histologic type that are closer to the truth than that obtained by the ignorance of the unknown histology is necessary for disease control and medical care planning. MI is a means for representing uncertainty in missing data, by producing a distribution of plausible values for a missing variable in a record, given the values of that record's non-missing covariates [14]. The process of running a multiple imputation is explained in the Methods section below. The method has been used by cancer registries in Australia to estimate the incidence, mortality, and survival in Aboriginal people [14]. MI produces a distribution of plausible values for a missing variable in a record, given the values of that record's non-missing covariates. Thus, the MI technique can be used to estimate the plausible proportion of histological type distribution within a cancer registry so that the distribution and age-standardized incidence rates of the histologic type of lung cancer can be calculated with a high confidence of accuracy and are useful for health policy planning. However, to achieve a narrow probability interval of the estimate, the percentage of missing cases must not be too high.
The objective of this study is to estimate gender specific trends in incidence of the two major histologic types of lung cancer: SCC and ADCA, in the Songkhla province from 1989 to 2013 using the multiple imputation method.

Lung cancer cases
We recruited all primary lung cancer cases diagnosed between January 1, 1989 and December 31, 2013. Morphologic diagnoses were grouped into 6 International Code for Diseases -Oncology (ICD-O) categories according to the International Rules for Multiple Primary Cancer [15]: SCC, ADCA, large cell carcinoma (8012/3-8014/3), non-small cell carcinoma, not otherwise specified (NSCC-NOS, 8046/3), other known histology, and unknown histology (8000/3-8011/3). The retrieval of the data was approved by the Ethics Committee of the Faculty of Medicine, Prince of Songkla University.

Population denominators
Population denominators to calculate incidence rates were estimated from the three population censuses conducted by the National Statistical Office in 1990, 2000, and 2010 [16][17][18]. Intercensus populations for the years in between were estimated using a log-linear function between two consecutive censuses. The populations beyond 2010 to 2030 were estimated and reported by the Office of the National Economic and Social Development Board [19].

Statistical analysis
Descriptive statistics for variables in the cancer registry were presented as frequency counts and percentages. Geometric means and standard deviations (sd) were presented for variables having an asymmetric distribution. All analyses were conducted with R [20].

Multiple imputation method
Multivariate Imputation by Chained Equations (mice) package [21] in R was used to perform the imputation of histological types of lung cancers. We used a polytomous regression imputation method since the histological types are unordered categorical data. The function imputes categorical response variables by the Bayesian polytomous regression model. Since there were cases with true unknown histology, we had to distribute those cases into one of the four categories; SCC, ADCA, large cell carcinoma, other known histology and NSCC. The fact that NSCC is a vague terminology in lung cancer pathology that can fall into one of the 4 histologic types mentioned before, the second step of MI was needed to specify which histologic type in a patient was. So, there were two steps of imputation performed in this study. In the first step, cases with unknown histology were replaced with one of the known histological categories, including NSCC, according to the probability distribution of the groups among those who had known histology obtained by the chained equation method plus a degree of random error. A multinomial (polytomous) logistic regression model was used for generating the distribution according to the predictive ability of basic variables present in the registry database such as year and age at diagnosis, religion, and district of residence. A polytomous regression model is given by: where α j is a constant and β' j is a vector of regression coefficients of X explanatory variables, for j = 1, 2,…, J -1.
[22] The method described by White et al. [23] was applied to avoid bias due to perfect prediction. In the second step, cases diagnosed with NSCC in the original data, as well as those imputed from the first step, were replaced with SCC, ADCA, large cell carcinoma or other specified histology using a similar approach as the one used in the first step. After the second imputation step, large cell carcinoma and all other specified histological types, including small cell carcinoma and sarcoma, were grouped as 'other'. Thus, only three categories: SCC, ADCA, and other were obtained.
This two-step multiple imputation method was repeated 200 times to get estimates of the 95 % Bayesian probability interval (PI) or credibility interval obtained from the quantiles of the posterior distribution for the three histologic categories.

Computation of age-standardized incidence rates
Since comparison of the proportion of SCC and ADCA over a long period can be biased by the change in the age structure of the population, we used the agestandardized incidence rates (ASR) for the two groups to illustrate only the effect of time on the imputed number cases and the trends in incidence rate and to ignore the change in the age structure of the population. The rates were calculated for 24 calendar years from 1989 to 2013. ASRs standardized to the world population modified by Doll [24] in 1966 from that proposed by Segi [25] in 1960 were estimated for each particular year.

Age period cohort model
Age-period-cohort (APC) regression models were used to investigate the effect of age, calendar year and birthcohort on the incidence of cervical cancer. We used the classical method which fits a log-linear model with a Poisson distribution to the observed data to estimate age, period and cohort effects in a multiplicative APC model as follows: where the expected log-incidence rates R(a,p) is assumed to be equal to a linear combination of effects that adjust for age a, period p and birth-cohort c, where c = p-a. To address the non-identifiability problem of the APC models, two-effects models (age-period and age-cohort) were first chosen and the remaining effect (cohort or period) was then identified to the respective model's residuals using natural splines to reduce random variation [26]. These are referred to as the AP-C and AC-P models. The analysis of APC models was performed with the Epi package [27] for R statistical software version 3.2.2 [28].

Results
There were 2,734 male and 1,110 female cases with lung cancer diagnosed during 1989-2013. The mean (sd) age at diagnosis was 64.4 (12.4) and 63.8 (14.4) years in males and females, respectively. The distribution of major histological groups of lung cancer cases in both sexes is shown in Table 1. The percentage of SCC was higher in males than females (21.4 vs. 8.4 %) while the percentage of ADCA was higher in females than males (52.5 vs. 33.2 %). The difference in histologic type distribution, sex, district, and period at diagnosis were different between males and females. Table 2 shows the trends of SCC and ADCA in both sexes. The distribution of SCC and ADCA varied by year of diagnosis. The NSCC appeared by the end of the period during 1994-1998, this histologic diagnosis contributed approximately 5 % of the diagnosis in both sexes. The percentage of the unknown category was around 30 % in both sexes without significant change in the trend according to sex and period while there was a declining trend in other known histologic types throughout all periods. Table 3 shows the predictive multinomial logistic regression model of cases with known histology. Age, sex, and year of diagnosis were the three strongest predictors for ADCA. District of resident was a weak predictor for ADCA. Thus, age, sex, year of diagnosis, and district of residence were included in the multinomial logistic regression model for multiple imputation of lung cancer patients with unknown histology (true unknown histology plus NSCC). Thus, the predictory polytomous model for multiple imputation of unknown histology should include sex and others described above as all of the variables explain the distribution of the known histology cases and should, if the missing at random (MAR) assumption [12] was the case. The actual (non-imputed) numbers and percentages of cases with squamous cell carcinoma (SCC), adenocarcinoma (ADCA), and other specified cancer are shown in the first two columns of the Table 4. After running twostep 200 rounds of MI with the polytomous regression model as described in the Method section above for prediction of the histological group in Table 3, the distribution of the estimated numbers and percentages of the three major cancer groups stratified by sex among patients with unknown histology are shown in columns 3 and 4 of the Table 4. Now the distribution of the numbers and percentages in columns 3 and 4 had probability intervals obtained by the 200 rounds of imputation. The table shows the similarity of histologic type distribution of the number of cases with known histology and those with unknown histology, as the results of the MI, with Chi-square p-values of 0.235 and 0.968 for males and females, respectively. This results suggested a little change in the distribution of histologic type from the original distribution after MI. However, it did significantly affect the trends in the number of cases and the incidence rates in terms of absolute values as shown in Figs. 1 and 2. The predicting model in Table 3 showed that the probability of the missing data falling in to SCC and ADCA categories varied with time as time was used as one of the explanatory variables used to run the MI. Figure 1 also shows the increase in the number of known cases of SCC and ADCA over the period 1990 to 2013. To overcome the problem of the change in population age structure that affects the number of cases in different years, the ASR (world) by calendar years from 1990 to 2013 stratified by sex is shown in Fig. 2. With ageperiod-cohort model, the overall drift (slope) of ADCA in males was 5.9 % per year (95 %CI: 5.1, 6.7 %), that of SCC in males was 0.6 % per year (95 %CI: −0.3, 1.5 %), that of ADCA in females was 6.1 % per year 95 %CI: 5.1, 06, 7.2 %), and SCC in females was −2.1 % (95%CI: −4.4, 0.2 %). Before multiple imputation, the two types of lung cancer in males had almost the same incidence rates and an abrupt change was seen after 2003 where the rates of ADCA rapidly increased while that of SCC slightly declined. After multiple imputation, the incidence of ADCA rose more rapidly than the rate obtained from the cases with known histology. After multiple imputation, the overall trend of SCC was somewhat stable over the whole period. In females, the incidence rates of ADCA behaved in the same way as in males but to a lesser extent, while the rates of SCC declined in cases with known histology.
After imputation, the ratio of SCC/ADCA for the last 5-year period between 2009-2013 using the estimated number of cases was 0.41 and 0.08, for males and females, respectively, which is similar to that obtained by using the estimated ASRs. Figure 1 shows a rapid increase in the number of cases of ADCA after 2004 in both sexes while the rate of increase in SCC in males is much slower than that of ADCA and no change at all in females. When adjusted for age, the increase in ASR of ADCA in both sexes exists but the slope is less than that of the number of cases. There was no obvious change in incidence of SCC among males and a slight decrease of the disease among females.

Discussion
The term 'non-small cell carcinoma' (NSCC) is used when the biopsy tissue is small and pathologists cannot decide whether it is SCC, ADCA, large cell carcinoma, or other specific histologic diagnosis. The term causes difficulty for cancer registries when classifying lung cancer by histologic type. When the percentage of NSCC increases in a cancer registry, the incidence rate and ratio of SCC/ADCA are somewhat unpredictable. Table 2 shows no change in the percentage of cancers in the true 'unknown' category while 'other' seems to decrease when NSCC was reported in the last three 5year periods. Pathologists in Songkhla had been using the term NSCC by the end of the period during 1994-1998, and this histologic diagnosis contributed approximately 5 % of the diagnosis in both sexes. When fineneedle biopsy cytology was introduced in late 1980s, pathologists found it difficult to diagnosis lung cancer based on a smear of individual cells rather than a large piece of tumor tissue. Thus, misclassification is unavoidable when a larger biopsy specimen is not obtained, especially among late stage cancer cases. However, this study shows that it was the true 'unknown' category, and not the NSCC, that contributed to the uncertainty of the SCC/ADCA ratio, since the proportion of NSCC was much lower than that of 'unknown' category.
Although there was rapid increase in the number of cases of SCC and ADCA both in the actual cases and imputed cases in males and a rapid increase in ADCA in females (Fig. 1), adjustment for population age structure throughout the period, as shown in Fig. 2, suggests a different phenomenon. We clearly show a need to adjust for the population age structure when comparing the change in incidence of cancer over a long period. Age standardization is essential in this case since the predictor model for multiple imputation was proven to be dependent on time. A biased impression on the trends over time is unavoidable when only the number of cases is investigated.
The rather stable ASR of SCC in men and slightly decreasing rate in women and the increasing rates of ADCA in both men and women in Songkhla from 1989 to 2013 are different from trends observed in different subpopulations in the US. The ASR of SCC in the US has been consistently declining while that of ADCA has been stable over the last two decades and have been increasing in the last decade [29][30][31]. The overall increasing trends in incidence of ADCA and the decreasing trends in SCC have been observed in other western and Asian countries [4,[32][33][34][35]. In some Middle Eastern countries, the SCC/ADCA ratio has been decreasing but it still has been high in some areas [36]. As mentioned earlier, the change in diagnosis procedures and classification of lung cancer has contributed partly to the change in the SCC/ADCA ratio [10,11] and the unsynchronized adoption in various countries around the world may further complicate the story.
A common issue with multiple imputation methods is accuracy in the imputed results. It has been suggested that the data must be missing at random Columns are arranged for patients with known histology (real data), and the estimated numbers and percentages by multiple imputation for patients with unknown histology (imputed) a Unknown histology includes NSCC and true unknown histology. The distribution of the number of cases with known and unknown histology cases is not statistically significant with chi-square p-values of 0.235 and 0.968 for males and females, respectively. Those with known histology, the true number and the percentage with its 95 % CI (calculated from the standard error) are present. After 200 rounds of multiple imputation, the mean (estimated) number and the percentage of different cancer types among those with unknown histology, and 95 % PI are presented (MAR). One study demonstrated that the linear regression method works well when the percentage of missing values is between 10 and 60 % [37]. However, the study was not dealing with categorical data. A simulation study on multiple imputation using a multinomial logistic regression model showed unbiased results when the values were simulated to be in any of the three missing patterns; missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR) gave unbiased results up to the missing data was reaching 80 % [38]. MAR is difficult to be proved and this simulation paper showed that the NMAR data set is acceptable for the imputation of the categorical data type.
In this study, there are no obvious reasons for bias among cases with missing histology. The chance of death among all histologic types is theoretically nondifferential. Cases diagnosed by death certificate only (DCO) are not likely to incur bias as the SCC, ADCA, and other histologies have an equally poor prognosis. Clinicians may decide to perform cytology and/or biopsy based on patients' performance status and compliance, however, there is no direct association between performance status and histologic type of lung cancer. The preference of pathologists to diagnose a particular histologic type over another is an important source of bias. The prevailing use of immunohistological staining in pathological investigation may raise the possibility that ADCA might have been diagnosed more often in recent years than in the past. However, a study in Japan showed no differential misclassification in retrospective review of slides in the past against recently diagnosed cases [39].

Conclusions
With the use of multiple imputation methods, we observed a rapid increase in incidence of ADCA, which outnumbered that of SCC in both sexes over the period of 1989-2013 in Songkhla. The incidence of SCC in males showed no significant change throughout the period while it was declining in females. We also showed that it is necessary to use age standardized rates when studying trends over a long period, especially when the predictive model for MI is dependent on the period under study.