Multivariate analysis of cancer mortalities for selected sites in 24 countries.

In order to analyze the pattern in the geographical distribution of cancer deaths in 24 countries of the world, correlation coefficients were calculated between pairs of mortality rates of different cancer sites, using the data for 13 sites in males and 14 sites in females over the 18 years from 1950 to 1967. Factor analysis with varimax rotation was then performed on the 13 x 13 correlation matrix for males, the 14 x 14 correlation matrix for females, and the 27 x 27 correlation matrix for males and females combined. As a result, three factors were extracted that are commonly recognized in both males and females. The first factor has high positive loadings on pancreas, prostate (for males), skin, and intestine cancers, and negative loadings on stomach and liver cancers. The second factor has high positive loadings on rectum, intestine, and lung cancers, and the third factor on larynx, oral, and esophagus cancers. Factor analysis based on the 27 x 27 correlation matrix revealed that the third factors of the two sexes are heterogeneous with regard to the distribution of factor scores. To find clues for developing an etiological hypothesis for each cancer site, we obtained the correlation coefficients between the scores of the extracted factors and a set of food and environmental variables, and also performed stepwise regression. One of the most striking results was that excessive drinking of alcohol and insufficient intake of fruit are suspected as etiological promoters in the pathogenesis of oral, esophagus, and larynx cancers in males.


Introduction
The spatial distribution of cancer mortality and morbidity rates has been studied by many authors, and its characteristic pattern for the various sites of cancer has attracted much interest among biostatisticians in various countries. Burkitt (1,2) stated that the variation in the geographical pattern of a disease may be related to its cause, and in view of this assertion many authors have studied the geographic distribution of cancer.
Segi (3)(4)(5) has published a number of elaborate works on cancer mortality of various sites in some countries including Japan. Dunham
In order to examine Burkitt's assumption that a similar geographic distribution of two different cancers may suggest the existence of a common etiological cause, it is necessary to calculate correlation coefficients between pairs of cancer sites. Since it is generally recognized that specific factors contribute to the etiology of chronic diseases, including cancer, interdependently rather than independently, methods of multivariate analysis may be used to analyze the complicated structural interdependence among the many correlated variables. Among the available methods of multivariate analysis, factor analysis is quite useful as a first step toward uncovering the unknown etiology that may be hidden behind the data.
We have already completed work (9) which analyzed the interrelationship between the mortality rates of cancer sites in terms of the geographic and age distributions in Japan and extracted four factors that are commonly recognized in both sexes. In this paper, we attempt to analyze the characteristic pattern of the geographic distribution of mortality rates of various sites of cancer in the world, using the series of data collected by Segi and Kurihara (4).

Method
Data
The data used in the analysis were the age-adjusted death rates for malignant neoplasms of selected sites in 24 countries from 1950 to 1967 (4). The 24 countries are listed in Table 1. The number of cancer sites selected for analysis is 13 for males and 14 for females, as shown in Table 2. Note that the age-adjusted death rates for these cancer sites were calculated by using the standard population of 46 countries around 1950. In addition to the cancer mortality data, we used data from the FAO Production Year Book (10), which gives calories per capita per day for various kinds of food such as milk, meat, and oil and fats. We also used data on environmental variables such as population density and rainfall. These variables are listed in Table 3. Missing data were replaced by the mean mortality rates calculated from the available data.

Procedures of Analysis
Correlation coefficients on the geographical distribution. The age-adjusted mortality rates of the 24 countries for the various cancer sites are given chronologically as the mean values of two consecutive calendar years: (1) 1950-51, (2) 1952-53, (3) 1954-55, (4) 1956-57, (5) 1958-59, (6) 1960-61, (7) 1962-63, (8) 1964-65, (9) 1966-67. These nine periods were grouped into three categories, (1) 1950-55, (2) 1956-61, and (3) 1962-67, and the mean values for the three categories were computed. Using these values, we calculated correlation coefficients between each pair of the three periods for each site and sex. As shown in Tables 5 and 6, fairly high correlation coefficients were found for most of the cancer sites. We therefore calculated the means of the age-adjusted mortality rates over these three periods for each country. Hereafter we call them simply mortality rates instead of mean age-adjusted mortality rates over 18 years. Next, we calculated correlation coefficients between the male and female mortality rates for each cancer site, excluding prostate, breast, and uterus cancers. Finally, we calculated correlation coefficients between pairs of site-specific mortality rates of the 13 sites for males and the 14 sites for females. Hence, we have a 13 x 13 correlation matrix for males and a 14 x 14 correlation matrix for females.
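The averaging and correlation steps described above can be sketched as follows. This is a minimal illustration, assuming the rates are held in NumPy arrays; the function names and the random placeholder data are hypothetical and are not taken from the original study.

```python
import numpy as np

def period_means(rates_by_period):
    """Collapse nine two-year periods into three six-year categories.

    rates_by_period: array of shape (9, n_countries, n_sites).
    Returns an array of shape (3, n_countries, n_sites).
    """
    r = np.asarray(rates_by_period, dtype=float)
    # Group consecutive triples of periods and average within each group.
    return r.reshape(3, 3, *r.shape[1:]).mean(axis=1)

def site_correlations(rates):
    """Pearson correlations between cancer sites across countries.

    rates: array of shape (n_countries, n_sites).
    """
    rates = np.asarray(rates, dtype=float)
    # rowvar=False treats each column (site) as one variable.
    return np.corrcoef(rates, rowvar=False)

# Synthetic placeholder data, NOT the Segi and Kurihara figures.
rng = np.random.default_rng(0)
demo = rng.gamma(shape=2.0, scale=5.0, size=(9, 24, 13))

three = period_means(demo)        # shape (3, 24, 13)
overall = three.mean(axis=0)      # mean over the three categories
R = site_correlations(overall)    # 13 x 13 matrix for the male sites
```

The same two functions would apply to the 14 female sites, and stacking the male and female columns side by side before calling `site_correlations` would give a 27 x 27 matrix.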
Factor Analysis. Factor analysis (11) was performed on the 13 x 13 correlation matrix (for males), the 14 x 14 correlation matrix (for females), and the 27 x 27 correlation matrix (for males and females combined). The procedure adopted in this study was as follows. First, the communality of each variable was estimated by the squared multiple correlation (SMC) of Guttman (12); principal factor loadings were then obtained by Thomson's refactoring method, and the factor loading matrix was rotated by Kaiser's varimax method (13) in order to obtain a simple structure solution. Next, we estimated the factor score matrix F by $F = Z R^{-1} A$, where Z is the matrix of standardized mortality rates, R is the correlation matrix, and A is the factor loading matrix. Finally, we calculated correlation coefficients between each factor score and the 24 variables shown in Table 3.

Table 2. Sites of cancer selected for analysis.
Code      Site                          Abbreviation for analysis   Sex
140-148   Oral cavity and pharynx       Oral                        male, female
150       Esophagus                     Esophagus                   male, female
151       Stomach                       Stomach                     male, female
152       Intestine (small and large)   Intestine                   male, female
154       Rectum                        Rectum                      male, female
155       Liver, biliary passage        Liver                       male, female
157       Pancreas                      Pancreas                    male, female
161       Larynx                        Larynx                      male, female
162       Lung

Stepwise Regression Method. In order to examine the causal relationship between cancer mortality and food or environmental variables, forward stepwise regression (14) was performed, using each of the site-specific cancer mortality rates as the criterion variable; 16 of the 24 variables in Table 3 were selected as the set of explanatory variables.
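As an illustration of forward stepwise regression, the sketch below greedily adds the explanatory variable that most increases the squared multiple correlation. The entry criterion used in the study is not stated, so the `min_gain` threshold here is an assumption, as are the function names and the synthetic data. The `adjusted_r2` helper shows the usual degrees-of-freedom adjustment, which may or may not be the exact formula the authors applied.

```python
import numpy as np

def forward_stepwise(X, y, max_terms=None, min_gain=0.01):
    """Greedy forward selection for ordinary least squares.

    At each step the predictor giving the largest R^2 is added;
    selection stops when the gain in R^2 is below min_gain
    (a hypothetical stopping rule, not the paper's criterion).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    max_terms = p if max_terms is None else max_terms
    tss = np.sum((y - y.mean()) ** 2)

    def r_squared(cols):
        # Least-squares fit with an intercept on the chosen columns.
        design = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = np.sum((y - design @ beta) ** 2)
        return 1.0 - rss / tss

    selected, best_r2 = [], 0.0
    while len(selected) < max_terms:
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        scores = {j: r_squared(selected + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_r2 < min_gain:
            break
        selected.append(j_best)
        best_r2 = scores[j_best]
    return selected, best_r2

def adjusted_r2(r2, n, k):
    """R^2 adjusted for degrees of freedom (n cases, k predictors)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic example: 24 "countries", 16 candidate variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 16))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=24)
selected, r2 = forward_stepwise(X, y)
```

With only 24 cases and 16 candidate variables, an unadjusted R^2 can be inflated by chance, which is one reason to report the adjusted value alongside it.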

Site-Specific Mortality Rates in Each Period
The site-specific mortality rates of cancer in each of the three periods are shown in Table 4, whose last column gives the sex ratio of cancer mortality. In the first period, the mortality rate of stomach cancer ranks highest for both sexes, but it decreases in the later periods, while the mortality rate of lung cancer increases. Furthermore, the mortality rate of breast cancer increases and that of uterus cancer decreases over the periods. As for the sex ratio, larynx cancer ranks highest, followed by lung, oral, and esophagus cancers.

Correlation Coefficients with Regard to the Geographical Distributions of Cancer Mortality
We show the correlation coefficients of the mortality rates for each pair of the three periods, (1) 1950 to 1955, (2) 1956 to 1961, and (3) 1962 to 1967, in Table 5 for males and in Table 6 for females. It is seen from these tables that most of the sites have consistently high correlations, with few exceptions. Such consistently high correlations indicate that the geographical distributions of the various cancer sites were stable throughout the whole period from 1950 to 1967. Of all the geographical distributions of cancer mortality, that of liver cancer is the most stable throughout these periods in both sexes, followed by intestine cancer in both sexes and breast cancer in females, while those of pancreas cancer in both sexes and oral cancer in males are relatively unstable. Next, we show in Table 7 the correlation coefficients between the male and female mortality rates for each site.
We calculated the correlation coefficients between each pair of sites with regard to the geographical distribution. The 13 x 13 correlation matrix for males and the 14 x 14 correlation matrix for females are shown in Tables 8 and 9, respectively. Following the convention of inferential statistics, correlation coefficients of 0.394 or more were considered statistically significant at the 5% level, on the assumption that the mortality rates were random samples from a normally distributed population. The number of pairs with statistically significant correlation at the 5% level was 28 (32.1%) in males and 31 (34.1%) in females. The highest correlation coefficient was found between rectum and intestine cancers in males (r = 0.754) and between breast and intestine cancers in females (r = 0.817), while the largest negative correlation coefficient was observed between stomach and intestine cancers in both sexes (r = -0.729 for males and r = -0.733 for females).

Factor Analysis
First, we carried out the factor analysis procedure on the 13 x 13 and 14 x 14 correlation matrices shown in Tables 8 and 9, respectively. We extracted four factors by this procedure for each sex. The contributions of these factors totaled 65.5% and 65.3% of the total variance in males and females, respectively.
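The factor analysis procedure (SMC communalities, principal factoring, varimax rotation, and regression-type factor scores) and the contribution ratios can be sketched as below. This is only an illustration under stated assumptions: the iterated refactoring loop stands in for Thomson's refactoring method, the rotation is raw varimax without Kaiser normalization, and all function names and the random data are hypothetical.

```python
import numpy as np

def smc(R):
    """Squared multiple correlation of each variable with the others."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

def principal_factors(R, n_factors, n_iter=50):
    """Iterated principal-factor loadings, starting from SMC communalities."""
    h2 = smc(R)
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)            # communalities on the diagonal
        vals, vecs = np.linalg.eigh(reduced)
        top = np.argsort(vals)[::-1][:n_factors]
        A = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
        h2 = np.minimum(np.sum(A ** 2, axis=1), 1.0)
    return A

def varimax(A, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation (raw, without Kaiser normalization)."""
    p, k = A.shape
    T = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = A @ T
        grad = A.T @ (L ** 3 - L @ np.diag(np.sum(L ** 2, axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        T = u @ vt                               # nearest orthogonal matrix
        d_new = np.sum(s)
        if d_new < d * (1.0 + tol):
            break
        d = d_new
    return A @ T

def factor_scores(X, A):
    """Regression-type scores F = Z R^{-1} A."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    return Z @ np.linalg.solve(R, A)

# Synthetic placeholder data: 24 countries x 13 sites.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 13))
R = np.corrcoef(X, rowvar=False)

A = principal_factors(R, n_factors=4)
A_rot = varimax(A)
F = factor_scores(X, A_rot)
# Contribution ratio of each factor: sum of squared loadings / number of variables.
contribution = np.sum(A_rot ** 2, axis=0) / A_rot.shape[0]
```

Because the rotation is orthogonal, it redistributes variance among the factors but leaves each variable's communality, and hence the total contribution, unchanged.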
The factor loading matrices for males and females are shown in Tables 10 and 12, respectively. The scores of the corresponding four factors for the 24 countries are shown in Table 11 for males and Table 13 for females. Note that we exchanged the order of the first and second factors in females so that they correspond to those in males in terms of the order in which the factors are written.
Table 8. Correlation coefficients between the mortalities of any pair of 13 sites of male cancer with regard to geographical distribution.
Table 9. Correlation coefficients between the mortalities of any pair of 14 sites of female cancer with regard to geographical distribution.
First Factor. In males, high factor loadings were found for pancreas, prostate, and intestine cancers and leukemia (positive in sign) and for liver cancer (negative in sign), producing a high contribution ratio (27.9%). In females, on the other hand, high loadings were found only for skin, breast, stomach (negative), and uterus (negative) cancers and leukemia, resulting in a smaller contribution ratio than in males. Hence, the first factor of males does not correspond to that of females in the strict sense, but it very probably reflects some common factors. With regard to the scores of the first factor in males, plotted along the vertical axis in Figure 1, U.S. white is highest, followed by U.S. nonwhite and Australia, and Japan is lowest; among females, Norway is highest, and Japan and Chile are lowest.
Second Factor. In both sexes, rectum, intestine, and lung cancers have high loadings on this factor, and breast cancer also has a high loading in females. It is presumed, therefore, that the second factor functions similarly in males and females, in that considerably high positive loadings are found for intestine, rectum, and lung cancers. It should be noted, however, that some discrepancies are found in the locations of the factor scores plotted along the vertical axes of Figures 5 and 7. In males, the factor score of Denmark is highest, followed by England and Wales, and Scotland, while in females Scotland is highest, followed by Canada and Denmark. Note that Japan is lowest on this factor.
Third Factor. Oral, esophagus, and larynx cancers have high loadings on this factor in both sexes. However, the distributions of factor scores in the two sexes are dissimilar, since France and Switzerland, which have high positive factor scores in males, have negative factor scores in females.
Fourth Factor. High loadings were found for esophagus and stomach cancers in males, while relatively high loadings were found for thyroid and lung cancers in females, showing quite dissimilar patterns between the sexes. In view of its relatively small contribution ratios (7.9% in males and 8.3% in females), the fourth factor does not seem to carry any substantial meaning.
In order to substantiate the meaning of the four factors, we calculated the correlation coefficients between these factor scores and the scores on the 24 variables shown in Table 3. The results for males are shown in Table 14 and those for females in Table 15.
The first factor correlates highly with intakes of sugar, meat, spices, and animal products in the positive direction and with population density in the negative direction in males, while in females it correlates highly and positively with milk, sugar, meat, and energy intake and negatively with population density. In both sexes, the first factor correlates positively with animal products and negatively with vegetable products. This tendency is more obvious in males than in females. The second factor has, on the whole, a pattern of correlations similar to that of the first factor, but oil and fats and animal oil have higher correlations with this factor than with the first. It is interesting to note that the second factor has a strong negative correlation with population density in both sexes, suggesting that cancers of the intestine, rectum, and lung tend to occur more frequently in areas where the population is less dense. For the third factor, large discrepancies are found between males and females, as is also seen in the distributions of factor scores. In males, the third factor has strong correlations with intake of alcohol, vegetables, and vegetable products, while in females it has no significant correlations with these variables. For the fourth factor, we note only that rainfall and vegetables have high negative correlations with this factor, and only in females.
Next, we present in Table 16 the results of factor analysis based on the mortality rates of the 27 sites of males and females combined. From Table 16 we see that the first and second factors correspond to those of Tables 10 and 12 in terms of the sites with high factor loadings. It is interesting to note that the third factor in Table 16 corresponds to the third factor of males shown in Table 10, and the fourth factor corresponds to the third factor of females shown in Table 12.
This result implies that the geographical distributions of oral, esophagus, and larynx cancers differ between the sexes, thus forming different clusters. Scrutinizing Table 16, we note that the correspondence mentioned above is not so clear in the strict sense, since pancreas cancer in both sexes has high positive factor loadings, and larynx cancer in females has a relatively lower factor loading than in Table 10.
... yields a multiple correlation coefficient of 0.765. The squared multiple correlation coefficient (SMC) and the squared multiple correlation coefficient adjusted for degrees of freedom (SMCR) are also shown in Tables 17 and 18. The number of SMCs statistically significant at the 1% level (SMC ≥ 0.522) is 11 in males and 9 in females. The largest multiple correlation coefficients were observed for prostate cancer in males (r = 0.897) and breast cancer in females (r = 0.843). Cancers of the intestine, rectum, liver, pancreas, and thyroid also have larger multiple correlations than the other sites of cancer in both sexes.
Next, we describe some interesting points for each site of cancer concerning the results of the stepwise regression procedure shown in Tables 17 and 18. First, oral, esophagus, and larynx cancers in males, which were found to form a cluster with regard to their geographical distribution, have high positive regression coefficients on alcohol and negative coefficients on fruit, but in females only fruit has a negative regression coefficient for each of the three cancers. Secondly, for stomach cancer, meat has large negative regression coefficients and rainfall has positive ones in both sexes, but the signs are opposite for intestine cancer, although the magnitude of the correlation coefficient between rainfall and intestine cancer is negligible.
Thirdly, it is noteworthy that the signs of the regression coefficients of sugar and population density are opposite for liver and pancreas cancers in both sexes. Finally, it is interesting to note that fruit has large regression coefficients for thyroid cancer in both sexes and that spices have a large coefficient for prostate cancer.
The trend of site-specific mortality rates of cancer is shown in Table 4. For most cancer sites, the magnitude of mortality is relatively stable over the three periods. However, as we have already mentioned, the mortality rate of lung cancer increased and that of stomach cancer decreased in both sexes over the period from 1950 through 1967. From Table 4, we can easily calculate the sum of the mortality rates of the 13 male cancers in the three periods, which amounts to 114.73, 131.83, and 126.6 per 100,000 population, and that of the 14 female cancers, which amounts to 92.87, 89.50, and 86.37. Thus a consistent decreasing trend is observed in female cancers from the first to the third period, while the male cancers peak in the second period. As to the sex ratios shown in the third column of Table 4, we note that high values are found for oral, esophagus, larynx, and lung cancers, for which cigarette smoking is suspected as one of the etiological factors. Next, we consider the geographical distributions in the different periods (Tables 5 and 6). Consistently high correlation coefficients were obtained for most of the cancer sites. This result tends to support Burkitt's assumption that environmental factors implicated through the geographical distribution might function as the predominant factors in cancer etiology, together with the hypothesis that racial differences might be related to cancer etiology in some way.
As for the interpretation of the magnitude of correlation coefficients, we should keep in mind that correlation coefficients are easily biased by the way the sample is drawn from the population. It is well known that correlations between two cancer sites obtained from countries over the world sometimes differ drastically from those obtained from samples of Japanese prefectures. For example, we note from Tables 8 and 9 that a high negative correlation is observed between the international distributions of stomach cancer and intestine cancer (r = -0.729 for males, r = -0.733 for females). However, according to the result based on the Japanese data (9), they are positively correlated in both sexes (r = 0.436 for males, r = 0.425 for females). In view of this disagreement, we should take care about the nature of the sample used for calculating the correlation coefficients when we interpret their magnitude and the factor loadings for each variable.

Factor Analysis
Now, we give some comments on the results of factor analysis, which are shown in Tables 10 through 16 and Figs. 1 through 8. Since the results of factor analysis indicate that the various cancer sites formed clusters with respect to their geographical distributions over the periods examined, we may postulate the existence of some common causes of cancer in the respective sites. It should be noted, however, that similarities in the geographical distributions of two diseases may sometimes reflect only statistical bias in sampling. Hence, it is better to develop an etiological hypothesis about other diseases within a cluster only when we have some established hypothesis about the etiology of a certain disease in the cluster. In view of this, we consider what clues each factor provides for developing an etiological hypothesis, using the results of factor analysis as well as those of stepwise regression.
It is fairly obvious that the correlations between the mortality rates of some cancers (prostate, pancreas, skin, and intestine) and the first factor are very strong in U.S. males and very low in Japan. One reason may be the difference in the intake of sugar, meat, milk, and animal products, which have high correlations with this factor. This conforms to the finding reported by Lea (15), who also analyzed interrelationships between the mortality rates of some cancer sites and food intake. Hence, in view of these findings, we might say that the first factor may be related to Western-style food habits. Unfortunately, we cannot give any satisfactory explanation for the close relationship between food habits and the high mortality rates of skin and prostate cancers and leukemia, which have high factor loadings on the first factor of males. Concerning the connection between prostate cancer and leukemia, we only point out that Berg et al. (16) investigated second primary carcinomas in index patients with leukemia and noted an increased frequency of carcinomas of the prostate.
We have already stated that in both sexes the second factor has high loadings on intestine and rectum cancers as well as lung cancer, but it should be noted here that stomach cancer of males has a moderately large negative loading, and that the factor scores of U.S. white and nonwhite are negative, despite the fact that this factor has positive correlations with intakes of meat, animal products, and oil and fats. Scrutinizing the geographic distribution of the scores of this factor, we note that high scores were found in relatively northern parts of the world such as Denmark, Northern Ireland, and England, thus yielding a high negative correlation with the mean atmospheric temperature. Considering that excessive intakes of oil and fats, especially animal oil, as well as total energy measured as total caloric intake, have high correlations with the second factor, an association between rectum or intestine cancer and excessive intake of energy or of oil and fats may well be suggested. This hypothesis conforms to the fact that in Denmark, Scotland, and Belgium, which have high factor scores, the intake of oil and fats is greater than in the U.S., which has high scores on the first factor. Hence, rectum and intestine cancers are suspected to have a strong association with a high intake of fats.
From the result of factor analysis based on the 27 cancer sites of males and females, it is seen in Table 16 that the third factors in males (Table 10) and females (Table 12) are completely heterogeneous with respect to geographical distribution, so that esophagus, oral, and larynx cancers are clustered separately for each sex. In males, the correlations shown in Table 14 conform to the well-known hypothesis that excessive intake of alcohol is a main etiological cause of cancer at these sites. We also note that alcohol consumption is high in France, and that oral and esophagus cancers occur more frequently there.
Next, we give some comments on the results of the stepwise regression procedure shown in Tables 17 and 18. The result that alcohol was selected first for oral, esophagus, and larynx cancers may well support the hypothesis stated above. Besides alcohol, it is interesting to note that intake of fruit is negatively correlated with the mortalities due to oral, esophagus, and larynx cancers in both sexes, since it is generally believed that excessive drinking and insufficient fruit intake may sometimes cause vitamin deficiency, which is suspected to be a possible etiologic factor for these cancer sites.
Meat and rainfall have regression coefficients of opposite sign for stomach and intestine cancers; that is, meat has a positive regression coefficient on intestine cancer but a negative regression coefficient on stomach cancer, and the reverse holds for rainfall. It is well known that Japan and Chile, where a high incidence of stomach cancer is observed, have much rainfall throughout the year, and people in these countries do not eat a great deal of meat. Hence, from our result, it seems valid to postulate that excessive intake of meat is an etiological factor of intestine cancer, as has been suggested by many researchers. It should be noted, however, that it is inappropriate to conclude that excessive intake of meat would function as an inhibitive factor for stomach cancer, since a statistically significant correlation does not always imply a causal relationship between the variables, and a high correlation may sometimes be caused by the existence of a third variable. Such a correlation is termed a spurious correlation, and its existence may sometimes lead to incorrect interpretation of the result.
For example, let us consider the case of sugar and pancreas cancer in males shown in the bottom row of Table 17. It might be better to explain the high correlation between sugar and pancreas cancer as a spurious correlation arising from a hypothetical variable, say, the degree of industrialization, since sugar is abundant in the industrialized countries and the mortality of pancreas cancer is also high in such countries. But we cannot entirely deny the possibility of a causal relationship between sugar and pancreas cancer, since from the viewpoint of clinical epidemiology an adequate supply of sugar may sometimes work as a remedy for patients suffering from diseases of the pancreas. Finally, we should mention that etiological factors such as excessive alcohol drinking and meat intake, which this study revealed, are considered not as initiators but as promoters in the pathogenesis of cancer.
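The notion of a spurious correlation induced by a third variable can be illustrated with a first-order partial correlation. The sketch below uses purely synthetic data in which a hypothetical "industrialization" variable drives both of the other two variables; the variable names, sample size, and coefficients are assumptions made only for illustration and say nothing about the actual sugar and pancreas cancer data.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data: 200 hypothetical observations, not real countries.
rng = np.random.default_rng(2)
industrialization = rng.normal(size=200)            # hypothetical third variable
sugar = industrialization + 0.5 * rng.normal(size=200)
pancreas = industrialization + 0.5 * rng.normal(size=200)

raw = np.corrcoef(sugar, pancreas)[0, 1]             # inflated by the third variable
adjusted = partial_corr(sugar, pancreas, industrialization)
```

In this construction the raw correlation is high although the two variables have no direct link, and the partial correlation given the third variable falls toward zero. With real data a small partial correlation would be consistent with, but would not prove, a spurious association.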

Methods of Analysis Used in the Study
Finally, we discuss some points on the methodology of the analysis used in this study. First of all, we should point out that the data used in the analysis are not entirely appropriate as a sample, because most of the 24 countries available for our analysis are European, and cancer sites such as bladder, kidney, and ovary are not included. In view of this, we cannot assume that the 24 countries used for the analysis are a random sample from the population of the world; hence we can hardly say that our results reflect the global tendency in the mutual relationships among cancers at large. Of course, we cannot deny the possibility that additional data from other countries may change the implications of the factors obtained in this study, but it is not likely that the results would differ greatly from the present ones even if we added data from many other countries.
From the methodological point of view, it is very difficult to estimate the change in the factor structure which would result from addition of new variables as well as new data. We should admit that development of such powerful methodology would make it possible to elucidate the meaning of the results obtained in our study. In addition, we expect that statistical data on cancer deaths in all the countries in the world, including many developing countries for which data are hard to obtain, would become available for the biostatisticians who are concerned with interrelationship between cancer mortality and various environmental factors.
As for the statistical data, we should mention the reliability of the data obtained from death certificates, which may sometimes affect the accuracy of the results. It is well known that the reliability of death certificate data may vary between geographic areas. Studies carried out in the U.S. (17,18) on the accuracy of death certificates revealed that the most misleading errors in the certification of the cause of cancer death lie in the geographical differences in the assignment of cause of death to each of the malignant tumors. Such errors are liable to happen especially for sites such as the prostate or bile duct, which physicians find difficult to diagnose. With regard to prostate cancer, Maruchi (19), comparing the number of cases reported in the Japanese Archives of Autopsy Findings with that from death certificates, estimated that the number of cases of prostate cancer reported on death certificates in Japan is less than two thirds of the real number. In view of this evidence, some correction is needed in order to estimate the true number of deaths from cancers for which the accuracy of diagnosis is not satisfactory.
Finally, we should discuss the usefulness of factor analysis. Some biostatisticians or epidemiologists may raise questions whether factor analysis by itself provides an etiological factor for some diseases. Factor analysis or principal component analysis are said to be methods which reduce the information contained in each variable without losing useful information and which provide clues useful for finding etiological factors. When there are many correlated variables which are supposed to be related to the etiology of a certain disease, univariate statistical analysis such as testing of the differences of mean values carried out separately for each variable may lead to misleading results, since it ignores the correlations among the variables.
Hence, we might say that the usefulness of applying factor analysis to epidemiological data lies in finding hints that may well lead to detecting etiological factors of some diseases hidden behind the data, which could not be detected without factor analysis. As an alternative method of finding clusters of various sites of cancer, Burbank (20) recommends cluster analysis, since its application would produce a successive hierarchical structure of the various cancer sites. But such a hierarchical structure does not necessarily correspond to the pathogenesis of cancer in the human body. Furthermore, we must mention that clusters resulting from cluster analysis are usually interdependent, and the results of cluster analysis never produce information on the mutual relationships among the extracted clusters. Hence, we would like to emphasize that factor analysis has an advantage over cluster analysis in the usefulness of the clusters.
This does not imply that we totally deny the usefulness of cluster analysis. If we apply both factor analysis and cluster analysis to the same data, it would be a powerful method to develop an etiological hypothesis, provided that we could obtain all the data needed for the analysis of our target. With these points in mind, we would like to develop our study further in order to pursue the unknown etiology of cancer, using the various methods of biostatistics.