Predictive modeling of indicators of the tourism sector of the world economy

The article studies statistical indicators of the tourism sector of the world economy. The study uses predictive modeling, which includes methods of correlation and regression analysis. The R programming language, which offers extensive functionality for statistical analysis, is used as the modeling tool. The authors analyze the main indicators of tourism together with predictors of scientific, technical and innovative development. The analysis demonstrates a close relationship between the indicators and identifies linear dependencies between them.


Introduction
Tourism is one of the main sectors of the world economy. It contributes to the economic growth of countries, creates additional jobs and helps to form a positive image of a country. The industry also accounts for a significant share of countries' GDP.
To understand how effectively a particular industry functions, one needs statistical information about its state and modern methods of data analysis that help to identify trends in the industry's development, determine dependencies between indicators and forecast their values. Predictive modeling is one such method.
To ensure the comparability of international indicators of tourism development, UNWTO has developed the International Recommendations for Tourism Statistics. For inbound tourism, these recommendations distinguish two main groups of statistical data: data on inbound flows and data on expenditure.

Materials and methods
To analyze the data, we use the following indicators of international tourism for 2018. For convenience of calculation, each predictor is assigned a variable [1, 2]:
Number of arrivals (A1): the number of tourists who visited the country within a period not exceeding 12 months (thousand people).
Number of departures (A2): the number of people who left their country of permanent residence for any other country for any reason (thousand people).
International tourism expenses (A3): visitor spending in other countries, including payments to foreign carriers for international transport (million US dollars).
Travel expenses for passenger transport (A4): expenses of tourists traveling abroad for services provided by non-resident carriers in the course of international transportation (million US dollars).
Travel and shopping expenses (A5): goods and services purchased by or on behalf of a tourist for personal use or as gifts (million US dollars).
International tourism revenues (A6): the expenses of foreign inbound visitors, including payments to national carriers for international transport. This income includes all prepayments for goods and services purchased in the destination country. It can also include income from tourists visiting the country for one day, unless these warrant a separate classification (million US dollars).
Total contribution to GDP (A7): the share of travel or employment spending across the economy in published national income accounts or in labor market statistics (million US dollars).
Direct contribution to employment (A8): employment in the hospitality industry, travel agencies and transportation, as well as in catering and leisure when services are provided directly to tourists (million US dollars).
Domestic international tourist spending (A9): spending by international tourists within the country (million US dollars).
Residents' expenses within the country (A10) (million US dollars).
Government spending (A11): the share of travel or employment spending across the economy in published national income accounts or in labor market statistics (million US dollars).
Business travel expenses (A12) (million US dollars).
For modeling, we took the 20 countries that lead in the "Number of arrivals" indicator, presented in Figure 1. In addition, to determine whether indicators of tourism are correlated with indicators of scientific and technological development, we add the following predictors, designating them with the corresponding variables [4, 5].

Correlation modeling
Correlation analysis examines the relationship between quantitative variables. A correlation coefficient lies in the range from -1 to 1: the closer its absolute value is to 1, the stronger the relationship between the predictors, and its sign shows whether the relationship is direct or inverse. The analysis was carried out in the R programming language. R offers a large number of correlation tests, which make it possible to calculate partial, polychoric and polyserial correlations.
For the analysis, the function corr.test() was used. It can calculate three correlation coefficients: Spearman's rank correlation coefficient, which shows the relationship between ranked variables; Kendall's correlation coefficient, a nonparametric measure of rank correlation; and Pearson's correlation coefficient, which shows the degree of linear relationship between quantitative variables. We use the last of these to determine whether the studied predictors of the world economy tourism sector are related.
Function syntax: corr.test(x, use =, method =), where "x" is the data table; "use" specifies how missing values in the data are handled (they are either ignored or cause an error); "method" specifies the type of coefficient (Pearson, Spearman or Kendall).
Figure 2 shows the result of applying corr.test() to variables A1-A12 and B1, together with the result of checking the statistical significance of the calculated correlation coefficients. Among these results, we single out the most highly correlated pairs of predictors. The relationship between all the variables is direct (positive).
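As a minimal illustration of the coefficients described above, the sketch below computes Pearson's and Spearman's coefficients and the t statistic commonly used to test a Pearson coefficient for significance. It is written in plain Python rather than R and does not reproduce corr.test(); the function names (pearson, ranks, spearman, t_statistic) and the toy values are assumptions made for this example, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's coefficient: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def t_statistic(r, n):
    """t statistic for H0 'no correlation', with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Toy values for illustration only; they are NOT the data used in the study.
arrivals = [10.0, 20.0, 30.0, 40.0, 50.0]
revenues = [12.0, 25.0, 29.0, 48.0, 51.0]

r = pearson(arrivals, revenues)
rho = spearman(arrivals, revenues)
t = t_statistic(r, len(arrivals))
print(round(r, 3), round(rho, 3), round(t, 2))
```

The larger the absolute value of the t statistic for a given number of observations, the less plausible it is that the observed correlation arose by chance; a complete test would compare it with the t distribution with n - 2 degrees of freedom.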
A very close relationship (a statistically significant correlation coefficient of 0.9 or higher) is observed between the following pairs of tourism indicators: "International tourism expenses" and "Travel and shopping expenses" (0.99); "International tourism expenses" and "Total contribution to GDP" (0.93); "International tourism expenses" and "Residents' expenses within the country" (0.94); "Travel and shopping expenses" and "Total contribution to GDP" (0.9); "Travel and shopping expenses" and "Residents' expenses within the country" (0.9); "Travel and shopping expenses" and "Direct contribution to employment" (0.92); "Business travel expenses" and "Total contribution to GDP" (0.95).
In addition, the following pairs of predictors have a correlation coefficient higher than 0.8: "International tourism expenses" and "Travel expenses for passenger transport" (0.85); "Number of departures" and "Travel and shopping expenses" (0.87); "International tourism expenses" and "Direct contribution to employment" (0.88); "International tourism revenues" and "Domestic international tourist spending" (0.88); "Government spending" and "Total contribution to GDP" (0.84); "Business travel expenses" and "Domestic international tourist spending" (0.89).
A separate group is formed by the direct and very close relationships between the tourism predictors and the indicators of scientific and technological progress; the corresponding correlation coefficients are presented in Table 1. The indicators of scientific and technological progress and innovation have very high and statistically significant correlation coefficients with almost all the main indicators of the tourism sector of the world economy.
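The grouping used in the two paragraphs above (coefficients of 0.9 or higher, and coefficients between 0.8 and 0.9) can be expressed as a simple rule. The sketch below is in Python and is only an illustration: the function name group_by_strength and the group labels are assumptions, while the four coefficients are taken from the values quoted in the text.

```python
def group_by_strength(pairs, very_close=0.9, close=0.8):
    """Split indicator pairs into groups by the absolute value of r."""
    groups = {"very close": [], "close": [], "other": []}
    for names, r in pairs.items():
        if abs(r) >= very_close:
            groups["very close"].append(names)
        elif abs(r) >= close:
            groups["close"].append(names)
        else:
            groups["other"].append(names)
    return groups


# A few of the coefficients quoted in the text above.
reported = {
    ("International tourism expenses", "Travel and shopping expenses"): 0.99,
    ("Business travel expenses", "Total contribution to GDP"): 0.95,
    ("Number of departures", "Travel and shopping expenses"): 0.87,
    ("Government spending", "Total contribution to GDP"): 0.84,
}

groups = group_by_strength(reported)
for label, members in groups.items():
    print(label, len(members))
```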

Regression modeling
Regression analysis is a statistical method for identifying the mathematical relationship between indicators. In addition to the standard methods of regression analysis, R has many functions for selecting regression models. The types of regression available in R include simple linear, polynomial, multiple, multivariate, logistic, Poisson, Cox proportional hazards, time series, nonlinear, nonparametric and robust regression. R also provides functions for detecting multicollinearity, heteroscedasticity and other problems that arise in regression modeling.
The lm() function was used to estimate the regression parameters. Its syntax is lm(formula, data), where "formula" describes the mathematical form of the model to be fitted and "data" is the table containing the values used to build the model. During the analysis, simple linear and multiple regression models were built. The results of regression modeling (only those regression equations in which the coefficients were statistically significant) are presented in Table 2, which shows the main relationships between the predictors. The indicator "Total contribution to GDP" was treated as the dependent (endogenous) variable.
The following conclusions can be drawn. Among the constructed models, the highest coefficient of determination (Multiple R-squared of 0.99) belongs to the linear regression equation relating "Total contribution to GDP" to "Internal expenditures on research and development". According to this model, an increase in research and development costs of $1 million is associated with an average increase in the total contribution to GDP of $13.195 million. Similarly, an increase in international tourism spending of $1 million corresponds to an average increase in the total contribution to GDP of $6.188 million. The results of regression modeling show that the indicators of the tourism sector of the world economy directly affect the growth of GDP in the studied countries.
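To make the interpretation of the slope and of the coefficient of determination concrete, the sketch below fits a simple linear regression by ordinary least squares. It is written in plain Python and is not the lm() function itself; the names fit_line and r_squared and the toy values are assumptions for this example and do not come from Table 2.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Toy values for illustration only; they are NOT the data used in the study.
rd_spending = [1.0, 2.0, 3.0, 4.0, 5.0]             # predictor
gdp_contribution = [15.0, 27.0, 41.0, 52.0, 67.0]   # dependent variable

intercept, slope = fit_line(rd_spending, gdp_contribution)
r2 = r_squared(rd_spending, gdp_contribution, intercept, slope)
print(round(intercept, 3), round(slope, 3), round(r2, 4))
```

The slope is read in the same way as the coefficients quoted above: it is the average change in the dependent variable for a one-unit increase in the predictor, and a coefficient of determination close to 1 means the fitted line explains almost all of the variation in the dependent variable.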

Cluster analysis
Cluster analysis is one of the most popular modern methods of data analysis and is actively used in the tourism sector. Its task is to divide a set of objects (in our case, the 20 leading countries by the number of tourist arrivals) into groups whose members are similar in some respect. Clustering algorithms fall into two groups: hierarchical and non-hierarchical.
Hierarchical methods of cluster analysis are not well suited to a large number of observations; in such cases, non-hierarchical methods are used. This work analyzed data for 20 countries and 11 indicators.
To analyze the data, we use the non-hierarchical k-means algorithm, which minimizes the squared error within clusters. The algorithm is the simplest to implement, but it requires the number of clusters to be specified in advance. This limitation can be overcome by first performing a hierarchical analysis on a random sample of observations. Applying this approach in the R programming language, we obtain the dendrogram shown in Figure 3, which indicates the presence of 5 clusters.
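The k-means procedure described above alternates two steps: each observation is assigned to its nearest cluster center, and each center is then moved to the mean of its members. The sketch below shows these steps in plain Python. It is not the code used in the study: the function name kmeans, the choice of the first k points as initial centers and the toy two-indicator observations are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic k-means; the first k points are used as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Toy observations with two indicators each; NOT the data used in the study.
observations = [
    (1.0, 1.0), (5.0, 5.2), (1.2, 0.8),
    (5.1, 4.9), (0.8, 1.1), (4.8, 5.0),
]

labels, centers = kmeans(observations, k=2)
print(labels)
print([[round(v, 3) for v in center] for center in centers])
```

In practice the result depends on the initial centers and on the scale of the indicators, so the algorithm is usually run from several starting points and indicators measured in different units are typically standardized first.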

Conclusion
In conclusion, given the very large volume of data and the dynamic development of world tourism, it will also be relevant to consider more complex patterns of influence among the indicators, identifying not only linear but also nonlinear dependencies between them with modern data processing tools and techniques. It will likewise be relevant to identify and analyze the main indicators of scientific and technological progress that most strongly affect the indicators of tourism development.