Geographical distribution of liver cancer incidence based on big data analysis

Liver cancer is the sixth most common cancer and the second leading cause of cancer death worldwide. Its incidence varies from region to region and therefore has a distinct geographical distribution. Using data from Cancer Incidence in Five Continents, this study analyzed the global spatial distribution of liver cancer and explored the effect of two meteorological factors, sea surface temperature and wind field, on its incidence. The aim was to characterize the geographical distribution of liver cancer and the influence of geographical environmental factors on its incidence, which is relevant to both the epidemiological study and the prevention of the disease. The results suggest that a hot and humid oceanic climate may be conducive to the development of liver cancer, and that environmental factors appear to play an important role in its occurrence.


Introduction
Cancer is a leading cause of death worldwide and a major public health problem; global cancer deaths are projected to exceed 11 million per year by 2030 [1]. Liver cancer is one of the most common cancers [2,3,4] and is reported to be the sixth most common cancer and the second leading cause of cancer death worldwide [5]. It is mainly divided into two types, hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC), with HCC comprising about 85% of cases. Many studies have examined the main risk factors for HCC, such as infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), aflatoxin-contaminated foodstuffs, heavy alcohol intake, obesity, smoking, and type 2 diabetes [6], but few have examined the factors affecting liver cancer from the perspective of geographical characteristics.
HCC has many risk factors, and the dominant factors vary from region to region, so the incidence of liver cancer can be expected to show a geographical pattern. According to the Global Cancer Statistics (GLOBOCAN) estimates, there were about 0.84 million new liver cancer cases [7]. These estimates indicate that the incidence of liver cancer has a clear geographical distribution. This paper analyzes the relationship between geographic position and the incidence of liver cancer, and explores the influence of the geographical environment on the prevalence of liver cancer. We aim to provide a scientific basis for the epidemiological research and prevention of liver cancer.

Methods
In this study, regression analysis is used to examine the linear relationship between liver cancer incidence and geographical position (latitude and longitude), and trend surface analysis is applied to show the geographical distribution of liver cancer incidence. Both analyses use the geographical coordinates of the registries and the corresponding liver cancer incidence rates.

Data
To analyze the relationship between geographic position and the incidence of liver cancer, we collected the age-standardized incidence rates (ASRs) of liver cancer from Cancer Incidence in Five Continents (CI5) [8]. CI5 is a long-standing collaboration between the International Agency for Research on Cancer and the International Association of Cancer Registries that provides a unique source of cancer incidence data from high-quality population-based cancer registries (regional and national) around the world. The liver cancer ASRs used in this study come from 378 registries in 68 countries for the registration period 2003-2007 in Volume X of CI5, and are given separately for males and females. The latitude and longitude of each registry were obtained from Google Earth. Figure 1 and Figure 2 show the distribution of liver cancer incidence rates for males and females, respectively, at all registries worldwide.
The sea surface temperature (SST) data were collected from the Met Office Hadley Centre, the UK's foremost climate change research centre. The spatial resolution of the SST data is 1°×1° and the spatial range is 89.5°S~89.5°N, 0.5°E~359.5°E.
The sea surface wind field data come from the Physical Oceanography Distributed Active Archive Center (PO.DAAC). The spatial resolution of the data is 0.25°×0.25°, the temporal resolution is 6 h, the spatial range is 78.375°S~78.375°N, 179.875°W~179.875°E, and the time range is from July 1987 to the present [9]. The data can be downloaded from http://rda.ucar.edu/datasets/ds744.4/data/.
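Because the registry coordinates use signed longitudes while the SST grid described above runs from 0.5°E to 359.5°E, relating a registry to an SST value requires a coordinate conversion. The following is a minimal sketch of that lookup, not the procedure used in this study. The function names and the assumption that grid rows are ordered from south to north are hypothetical; the actual file layout should be checked before use.

```python
import math

def sst_grid_index(lat, lon):
    """Return (row, col) of the 1-degree SST cell containing a point.

    Assumes cell centres at -89.5..89.5 (rows, south to north) and
    0.5..359.5 (columns, degrees east). The row ordering is an assumption.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must lie in [-90, 90]")
    lon360 = lon % 360.0                      # map -180..180 onto 0..360
    row = min(int(math.floor(lat + 90.0)), 179)
    col = min(int(math.floor(lon360)), 359)
    return row, col

def cell_centre(row, col):
    """Return the (latitude, longitude east) of a cell centre."""
    return (-89.5 + row, 0.5 + col)

# Hypothetical registry location at 31.2 N, 121.5 E
row, col = sst_grid_index(31.2, 121.5)
print(row, col, cell_centre(row, col))
```

The same idea applies to the 0.25° wind grid, with the cell size and origin changed accordingly.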

Correlation analysis
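The Pearson correlation coefficient is used to measure the linear association between liver cancer incidence and latitude or longitude. The formula is as follows:

$$r = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}}$$

where r is the Pearson sample correlation coefficient between X (latitude or longitude) and Y (liver cancer incidence); n is the number of observations; X_i is the value of X for the ith observation; and Y_i is the value of Y for the ith observation. The value of the correlation coefficient varies between +1 and -1.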
The P-value is calculated to assess whether the sample correlation coefficient reflects a genuine correlation in the population; the significance level α is set to 0.01.
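As an illustration of the calculation described above, the sketch below computes the Pearson coefficient from its computational form, together with the usual t statistic with n - 2 degrees of freedom from which a P-value is obtained. It is not the SPSS procedure used in this study. The function names and the sample values are hypothetical, with xs standing for a coordinate such as longitude and ys for incidence.

```python
import math

def pearson_r(xs, ys):
    """Pearson sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    if den == 0:
        raise ValueError("correlation is undefined for constant input")
    return (n * sxy - sx * sy) / den

def t_statistic(r, n):
    """t statistic for testing r = 0, compared to Student's t with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Hypothetical values, for illustration only
longitudes = [-75.0, 2.3, 31.2, 100.5, 121.5, 139.7]
incidence = [4.1, 6.0, 5.2, 18.4, 22.9, 15.3]
r = pearson_r(longitudes, incidence)
print(r, t_statistic(r, len(longitudes)))
```

The P-value follows from comparing the t statistic with the two-sided Student t distribution; a statistics package is the practical choice for that step.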

Trend surface analysis
Trend surface analysis (TSA) is a form of polynomial regression used in geology, ecology and geography to model the overall distribution of a property across space. Here it fits the global distribution trend of liver cancer incidence as a surface, expressed by a polynomial function of longitude and latitude whose coefficients are found by the method of least squares.
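The least-squares fit behind a trend surface can be sketched as follows. This is a minimal illustration in plain Python, assuming a polynomial whose terms are x^i y^j with i + j no greater than the chosen degree; it is not the software used in this study, and all function names are hypothetical. It solves the normal equations directly, which becomes numerically fragile for high degrees with raw longitude and latitude, so in practice the coordinates would be scaled and a library solver such as numpy.linalg.lstsq used instead.

```python
def poly_terms(degree):
    """Exponent pairs (i, j) with i + j <= degree, for terms x**i * y**j."""
    return [(i, j) for i in range(degree + 1) for j in range(degree + 1 - i)]

def solve_linear(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (m[r][n] - s) / m[r][r]
    return coeffs

def fit_trend_surface(xs, ys, zs, degree):
    """Fit z ~ sum of c_ij * x**i * y**j by least squares (normal equations)."""
    terms = poly_terms(degree)
    rows = [[x ** i * y ** j for i, j in terms] for x, y in zip(xs, ys)]
    p = len(terms)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(p)]
    return terms, solve_linear(ata, atz)

# Synthetic check: points lying exactly on the plane z = 1 + 2x + 3y
xs = [0.0, 1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0, 1.0]
zs = [1 + 2 * x + 3 * y for x, y in zip(xs, ys)]
terms, coeffs = fit_trend_surface(xs, ys, zs, degree=1)
print(list(zip(terms, coeffs)))
print(len(poly_terms(5)))  # number of terms in a fifth-degree surface
```

A fifth-degree surface in two variables has 21 terms under this convention, that is, a constant plus 20 regression coefficients.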
The F-value and R-squared are used to test the significance and goodness of fit of the surface. R-squared is equal to the sum of squares due to regression divided by the total sum of squares. The F-value is calculated as follows:

$$F = \frac{\mathrm{SSR} / p}{\mathrm{SSE} / (n - p - 1)}$$

where SSR is the sum of squares for regression, SSE is the sum of squares for error, n is the number of observations, and p is the number of regression parameters.
The significance level is set to 0.01. If the F-value is greater than the critical value of the F-distribution, the distribution pattern of the property can be summarised by this surface function.
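Since R-squared is the regression sum of squares divided by the total sum of squares, the F-value can also be written in terms of R-squared. The sketch below assumes the usual form F = (SSR / p) / (SSE / (n - p - 1)), with p counting the polynomial terms other than the constant; both the function name and that reading of p are assumptions rather than details stated in this paper.

```python
def f_from_r_squared(r2, n, p):
    """F statistic of a regression from R-squared.

    Uses F = (R2 / p) / ((1 - R2) / (n - p - 1)), which follows from
    R2 = SSR / (SSR + SSE). n is the number of observations and p the
    number of regression parameters excluding the constant (assumed).
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R-squared must lie in [0, 1)")
    if n - p - 1 <= 0:
        raise ValueError("too few observations for this many parameters")
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Example: 378 observations, 20 non-constant terms of a fifth-degree surface
print(f_from_r_squared(0.569, 378, 20))
```

The surface is judged significant when this value exceeds the critical value of the F-distribution with p and n - p - 1 degrees of freedom at the chosen significance level.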

The distribution of liver cancer registries
Based on the geographic coordinates of the liver cancer registries, we drew a spatial distribution map of the registries in Python, which also shows the incidence level at each registry. Figures 1 and 2 show that registries are densely established in North America and Europe but much sparser in many other regions of the world, especially Africa and South America.

Correlation analysis
Linear regression analysis of liver cancer incidence against latitude and longitude was performed with IBM SPSS statistical software; the correlation results are shown in Table 1. For latitude, the Pearson correlation coefficients for males and females are 0.11 and -0.68, respectively, with P-values of 0.825 and 0.191, both greater than 0.01. This indicates that there is no significant linear correlation between incidence and latitude for either sex. By contrast, the linear correlation between incidence and longitude is statistically significant: the correlation coefficients for males and females are 0.436 and 0.434, respectively, and both P-values are reported as 0.000 (P < 0.001).

Trend surface analysis
Using the longitude, latitude and liver cancer incidence of each registry, a fifth-degree polynomial in two variables (longitude and latitude) is fitted for the trend surface analysis. The regression coefficients of the trend surface polynomials for males and females are listed in Table 2. For the goodness-of-fit test, the R-squared values for males and females are 0.569 and 0.541, respectively, so the fit is acceptable in both cases. For the significance test, the F-values for males (23.439) and females (20.89) are greater than the critical value of 4.662 at the 0.01 significance level, indicating that both trend surface equations are significant. Since both polynomials passed the tests, the trend surfaces can be used to analyze the spatial distribution and change trend of liver cancer incidence. Figure 3 and Figure 4 show the 3D trend surfaces of male and female liver cancer incidence, respectively.

Figure 4. 3D trend surface of spatial distribution and change trend of female liver cancer incidence.

Figure 3 shows that male liver cancer incidence is distributed mainly across North America, Asia and Europe, with a trend extending toward South America and southern Africa. Among these, Asia is a region with a high incidence of liver cancer in men, especially in its eastern and southeastern coastal areas. Figure 4 shows that the global incidence of liver cancer in women is lower than that in men. As with the male distribution, female incidence is concentrated mainly in the northern hemisphere, and the high-incidence area is again in Asia. Both the male and female trend surfaces also show a high incidence in the northwest region, but this area lies in the Pacific Ocean and has no human settlements, so it is treated as an outlier and excluded from the analysis.

Discussion
Although the linear analysis shows no obvious linear relationship between latitude and the incidence of liver cancer, the trend surface does show significant regional differences. This may be because the registries are still too few to represent the actual distribution of liver cancer, whereas the trend surface estimates incidence over the whole globe and can therefore better reflect how incidence changes across space. Figures 3 and 4 show that the areas with a high incidence of liver cancer lie roughly in the range 100°E~140°E, 10°S~40°N. This range mainly covers East and Southeast Asia, yet the incidence in these areas accounts for almost 80% of the global total.
This observation raises the question of whether the prevalence of liver cancer is related to the characteristics of the geographical environment. As shown in Figures 3 and 4, liver cancer incidence in both males and females gradually increases from inland to coastal areas, and is mainly concentrated in coastal areas south of 40°N in the northern hemisphere. Because the environment in this area is strongly affected by the ocean, for example through sea surface temperature and wind field, we plotted a sea surface temperature map (Figure 5) and a wind field map (Figure 6) of the global waters to analyze the relationship between liver cancer incidence and meteorological factors in this area.
Comparing the spatial distribution of liver cancer incidence (Figures 3 and 4) with sea surface temperature (Figure 5) shows that, overall, the sea surface temperature north of 40°N is significantly lower than that at low and middle latitudes; that is, temperature decreases with increasing latitude. Since the high-incidence areas lie in the warmer low and middle latitudes, we tentatively infer that a hot and humid oceanic climate is conducive to the occurrence and development of liver cancer.
For the wind field (Figure 6), no obvious relationship with liver cancer incidence was found. Because the data on liver cancer incidence are limited, the correlation between the spatial distribution of liver cancer and the temperature distribution obtained in this study remains conjectural. In the future, it will be necessary to collect liver cancer cases for each season around the world and to analyze systematically the relationship between the spatial distribution of liver cancer and the spatial distribution of temperature and humidity in each season, in order to provide a scientific basis for the prevention and treatment of liver cancer.

Conclusions
In this paper, we studied the relationship between geographic position and the incidence of liver cancer, and explored the influence of the geographical environment on the prevalence of liver cancer. Our results suggest that a hot and humid oceanic climate may be conducive to the development of liver cancer, which may provide a scientific basis for the epidemiological research and prevention of liver cancer. However, the global liver cancer data collected for this study are relatively limited. In the future, it will be necessary to collect more global liver cancer data for each season, and to analyze systematically the correlation between the spatial distribution of liver cancer and the spatial distribution of temperature and humidity in each season.