Applying Big Data Analytics to Monitor Tourist Flow for the Scenic Area Operation Management

1School of Economics and Management, Beijing Jiaotong University, 3 ShangyuanCun, Haidian District, Beijing 100044, China 2School of Traffic and Transportation, Beijing Jiaotong University, 3 ShangyuanCun, Haidian District, Beijing 100044, China 3School of Economics and Management, Beijing Information Science and Technology University, 12 Qinghe Xiaoying East Road, Haidian District, Beijing 100192, China


Introduction
With the rapid development of China's economy, large and medium-sized cities have entered the Leisure Era. The quality of leisure has become an important criterion for evaluating the living quality of urban residents [1] and an essential part of public life. As tourist flow in scenic spots continues to grow, it is vital to obtain accurate, real-time information on the travel behaviour of different types of tourists [2][3][4][5][6][7][8][9][10]. In recent years, statistical monitoring of tourist flow has relied mainly on video surveillance, entrance gates, and similar means [11]. The development of network and Internet technology has also enabled monitoring based on the WiFi provided by scenic areas and on mobile phone applications. These methods require cameras, gates, WiFi base stations, and other supporting equipment, which is difficult to install in all scenic areas. Moreover, most mobile applications (such as WeChat) are oriented towards young and middle-aged users and cannot monitor tourists of all types.
The conventional estimation of macroscopic travel demand relies mostly on the empirical judgment of transportation practitioners based on historical data. It is costly, and its resolution is limited. Therefore, it is practical to develop new and better tools that automate the identification of tourist flows from large-scale mobility data sets. Some research has been conducted with smart card fare data [12,13] and mobile phone data [14].
The number of mobile phone users in China is constantly increasing. According to data released by China's Ministry of Industry and Information Technology (MIIT), as of January 2017, the number of mobile phone users in China reached 1.32 billion, and the penetration rate reached 96.2%. The number of mobile phone users in Beijing reached 3.869 million, and the penetration rate was as high as 178.3%. Nearly everyone who travels carries a cell phone. With the continuous development of positioning technology in mobile communication, tourist flow analysis with Call Detail Record (CDR) data can provide valid real-time data for scenic flow control, tourist diversion, traffic dispersion, safety management, and so on, and can thereby help improve the operation management of the scenic area.
Through comprehensive analysis of scenic spots and tourists' travel behaviour, researchers have explored tourist flow and tourist behaviour in the corresponding scenic spots. Ferrari [15], by processing and analysing the Call Detail Record (CDR) data of mobile phones, analysed the temporal and spatial regularities of personal behaviour and the nature and patterns of events occurring in the city to provide a theoretical basis for event management. Ahas and Silm studied the temporal and spatial distribution of Estonian tourists, offering big data support for tourism planning [16][17][18][19]. Based on the CDR data of mobile phones, Dong [20,21] analysed the spatial and temporal patterns of population movement within the Sixth Ring Road area in Beijing at both the regional and road network levels. Etison [22], using the CDR data, established a resident traffic flow monitoring and management system for real-time monitoring and statistics of mobile phone users in specific areas.
The existing research does not target a specific population of mobile phone users and does not cover users' individual behaviour. Therefore, this paper analyses the main tourist attractions in Beijing using handset CDR data, focusing on the Forbidden City as the main research object. We aim to make full use of the advantages of big data to analyse the residence time, the spatial and temporal distribution, and the behaviour of tourists. The experimental results show that this big data analysis technology can provide technical support for the management of urban tourist areas and improve the accuracy of current tourist flow monitoring.

Location Mode of the Cellular Network.
Each zone covered by the mobile communication network is generally modelled as a regular hexagon, where each mobile user acts as a mobile station (MS), as shown in Figure 1. Regardless of whether the user is moving or stationary, whenever the user makes a call or accesses the Internet, the mobile phone exchanges data with the nearest base station (BS), and a record containing the base station ID is generated. The current location of the user can then be estimated from the location of the base station. The specific positioning is shown in Figure 1.
When the user moves from Cell-1 to Cell-7, a total of three location areas (LACs) and seven cells are passed through in sequence: Cell-1, Cell-2, Cell-3, Cell-4, Cell-5, Cell-6, and Cell-7. The user's location information is updated when the user switches the phone on or off, interacts with the network, or when data regarding the user's movement are received; it is also updated when the user crosses between LAC 1, LAC 2, and LAC 3, since these belong to different receiving areas. Take Cell-2 and Cell-3 of LAC 1 and Cell-4 of LAC 2 as examples. If the user does not interact with the network while moving from Cell-2 to Cell-3 (no calling, texting, or surfing the Internet), the user's location is not updated, because Cell-2 and Cell-3 belong to the same location area (LAC 1) and therefore share the same location area code. Although the cell changes, the location area remains the same, so the change in location is not recorded. By contrast, Cell-3 and Cell-4 belong to different location areas. When the user moves from Cell-3 to Cell-4, which is equivalent to moving from LAC 1 to LAC 2, the network records the user's location change regardless of whether the user exchanges data with the network during the movement.
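The recording rule described above can be illustrated with a minimal sketch. It assumes that a record is written either when the user interacts with the network or when the location area changes between two consecutive cells. The `Step` structure, the function name, and the assignment of Cell-1, Cell-5, Cell-6, and Cell-7 to location areas are hypothetical; only the assignment of Cell-2, Cell-3, and Cell-4 comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    cell: str     # cell the phone is currently attached to
    lac: str      # location area that the cell belongs to
    active: bool  # True if the user calls, texts, or uses data in this cell


def recorded_cells(steps: List[Step]) -> List[str]:
    """Return the cells at which a location record would be generated."""
    events = []
    previous_lac: Optional[str] = None
    for step in steps:
        # A change of location area is recorded even without user activity.
        crossed_lac = previous_lac is not None and step.lac != previous_lac
        if step.active or crossed_lac:
            events.append(step.cell)
        previous_lac = step.lac
    return events


# Route from Cell-1 to Cell-7; the LAC of Cell-1, Cell-5, Cell-6, and Cell-7
# is an assumption made for illustration only.
route = [
    Step("Cell-1", "LAC 1", active=True),   # e.g. the phone is switched on
    Step("Cell-2", "LAC 1", active=False),
    Step("Cell-3", "LAC 1", active=False),  # same LAC as Cell-2: no record
    Step("Cell-4", "LAC 2", active=False),  # LAC 1 -> LAC 2: recorded
    Step("Cell-5", "LAC 2", active=False),
    Step("Cell-6", "LAC 3", active=False),  # LAC 2 -> LAC 3: recorded
    Step("Cell-7", "LAC 3", active=False),
]

print(recorded_cells(route))
```

In this sketch the silent movement from Cell-2 to Cell-3 leaves no trace, whereas the crossings into LAC 2 and LAC 3 are logged, which is why Cell-ID positioning is better suited to macroscopic flow analysis than to precise tracking.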
This positioning method is called the Cell-ID positioning method. It is a network-based cell phone positioning technology that requires no additional equipment, incurs no extra maintenance costs, and needs no upgrade of the existing mobile network. Although this method cannot locate a user precisely, its accuracy is sufficient for the macroscopic tourist flow analysis in this paper.

Base Station Data of the Cellular Network.
In general, each base station (BS) has its own fixed location area (LAC). When a cell phone is registered in the communication network, the network pages the LAC location of the cell phone and obtains the corresponding Cell-ID. In this paper, the LAC is extracted from the location area identification code (LAI) to identify the location area in the GSM network. The MSC area (hereafter called the MSC) is composed of all the location areas under its control. In addition to administering its subordinate location areas, the MSC is also responsible for the reception of facsimile data. The geographical attributes of a base station mainly consist of its administrative region, latitude, and longitude. The network attributes of a base station mainly cover the BSC number used to control the communication, the running status, the antenna height, and the base station type.
The Cell-ID and LAC of the selected base station are taken as the basic characteristic attributes of the base station. The recorded information includes the user's mobile phone ID, the user's phone status, the location area (LAC) of the user, the Cell-ID, the type of event triggered, and the recording position. The record table structure and data samples are shown in Table 1. Mobile communication networks cover most of the area in which the population is active, especially in urban areas. As shown in Figure 2, more than 70% of the base stations in Beijing are located in the urban area within the Sixth Ring Road, with a total of 52,759 base stations. Meanwhile, according to statistics from the Ministry of Industry and Information Technology, the number of phone users in Beijing reached 20.62 million in 2015. With the mobile phone base station functioning as a fixed traffic detector and the user's mobile phone as a mobile detector, an urban travel holographic information acquisition system is practicable. Therefore, the CDR data can be used in the analysis of extremely complicated travel behaviour.

User Data of the Cellular Network.
As with the base station attributes and types, the user location data acquired by the GSM network when a user moves must also be defined by user attributes. Each location update generates one user location record. The GSM network encodes user information not only to monitor changes in location but also to ensure that users are properly connected to each other when using the network; correct encoding is therefore necessary. Each user has a cell phone number while using the network. To protect user privacy, the system desensitizes the user code during network storage, hiding the user information. In this paper, based on the existing data and subject to ensuring data quality, IMSI, Cell-ID, LAC, CTIME, and TIMECHAR are selected as the location-updating attributes of the users. The Cell-ID and LAC have the same meaning as in the base station attributes. The format of the selected user location updates is shown in Table 2.
The mobile communication user data adopted in this paper were recorded over one year. The data are collected every two seconds, with a data size of 52,759 pieces. Since there are 86,400 seconds in a day, this collection amounts to 450 million pieces per day, with an average daily data size of 40 GB. To process such large amounts of data quickly, it is necessary to preprocess the data, clean the noise, and convert the format.

Processing Platform.
The CDR data are a typical kind of big data. Taking Beijing as an example, the CDR data for one day exceed 400 million pieces, which requires a large amount of computation to process. The processing capacity and I/O performance of a single machine cannot support computation on such a large data set. Traditional relational databases, such as Oracle, can be built into clusters, but when the amount of data reaches a certain limit, query processing becomes very slow, and the performance requirements on the machines become very high. Thus, the Spark big data platform is used to handle the CDR data in this paper.
The Spark framework adopts the concept of the Resilient Distributed Dataset (RDD). MapReduce cannot share data efficiently across the stages of a parallel computation, and the RDD in the Spark framework makes up for this defect. With this efficient data sharing and MapReduce-like operating interfaces, various specialized types of computation can be expressed effectively in the Spark framework, and comparable performance can be achieved.
According to a popular classification by application area, big data processing can be divided into complex batch data processing, interactive queries on historical data, and stream processing of real-time data. Because of the rich expressive capability of the RDD, a unified big data processing platform capable of handling all three situations is built on the Spark core. The goal of the Spark ecosystem is to integrate batch processing, interactive processing, and stream processing into the same software stack. In this paper, the Spark SQL interface is used, which provides a distributed SQL engine with query speeds 10 to 100 times higher than Hive.

Tourist Flow Statistics and Characteristic Analysis Method
Figure 3 shows the overall flow chart of the scenic spot flow calculation and the analysis of tourist travel characteristics. First, the mobile phone base station data and the CDR data are preprocessed to obtain the list of base stations of each surveyed scenic spot and the list of active phone users. Next, by matching the CDR data against the base stations in the scenic spots, the flow of each scenic spot is obtained, and the tourist flow characteristics are analysed. Based on the origin and the destination of tourists in the scenic spots, the spatial OD distribution and the travel characteristics of tourists in the corresponding scenic spots can be obtained.
This paper analyses the main scenic spots in Beijing, including the Forbidden City, the Summer Palace, and the Olympic Forest Park. Working with data from more than 40,000 base stations in Beijing, Python scripts are written on the ArcGIS platform to process 20 scenic spots in a batch. A buffer zone of 100 metres is created around each scenic area, and the base stations that fall within the buffered scenic area are selected by matching their locations against it, as shown in Figure 4. The Cell-ID and LAC attributes of the selected base stations are stored in the list of base stations of each scenic spot.
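The buffer-based screening of base stations can be sketched without ArcGIS as follows. This is an illustrative stand-in rather than the script used in the study: it assumes that the scenic area is given as a simple polygon of (longitude, latitude) vertices, projects coordinates to local metres with an equirectangular approximation, and keeps every station that lies inside the polygon or within 100 m of its boundary. All coordinates, identifiers, and function names below are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees
EARTH_RADIUS_M = 6_371_000.0


def to_local_metres(p: Point, origin: Point) -> Tuple[float, float]:
    """Project a point to planar metres around `origin` (small-area approximation)."""
    lon, lat = p
    lon0, lat0 = origin
    x = math.radians(lon - lon0) * EARTH_RADIUS_M * math.cos(math.radians(lat0))
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def point_in_polygon(pt, poly) -> bool:
    """Ray-casting test: count boundary crossings to the right of the point."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(pt, a, b) -> float:
    """Shortest planar distance from a point to the segment a-b."""
    px, py = pt
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    t = 0.0
    if length_sq > 0:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def stations_in_buffer(stations, boundary: List[Point], buffer_m: float = 100.0):
    """Return (Cell-ID, LAC) of stations inside the area or within the buffer."""
    origin = boundary[0]
    poly = [to_local_metres(v, origin) for v in boundary]
    selected = []
    for cell_id, lac, lon, lat in stations:
        pt = to_local_metres((lon, lat), origin)
        nearest = min(
            distance_to_segment(pt, poly[i], poly[(i + 1) % len(poly)])
            for i in range(len(poly))
        )
        if point_in_polygon(pt, poly) or nearest <= buffer_m:
            selected.append((cell_id, lac))
    return selected


# Made-up rectangular scenic area and stations, for illustration only.
boundary = [(116.390, 39.910), (116.400, 39.910), (116.400, 39.920), (116.390, 39.920)]
stations = [
    ("10001", "4125", 116.395, 39.9150),  # inside the area
    ("10002", "4125", 116.395, 39.9095),  # about 56 m south of the boundary
    ("10003", "4130", 116.395, 39.9050),  # about 556 m south of the boundary
    ("10004", "4130", 116.402, 39.9150),  # about 170 m east of the boundary
]
scenic_stations = stations_in_buffer(stations, boundary)
print(scenic_stations)
```

The planar approximation is adequate at the scale of a single scenic area; for production use a GIS library with proper projections would be preferable.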

Processing of the CDR Data.
The IMSI number, the only field in the CDR data that identifies the phone user, is of the STRING type, which is inconvenient for subsequent operations. Therefore, the IMSI number is transformed into the LONG type through a hash code, and the uniqueness and nonnegativity of the result are verified to ensure that one IMSI number still identifies exactly one phone user. In addition, because of the ping-pong switching phenomenon in the CDR data, "silent users" and ping-pong switching data are filtered out by setting a frequency threshold (PCF), leaving 67% of the data as the source for traffic information collection [9]. After these two steps, 40 GB of data per day is reduced to approximately 16 GB, which significantly reduces the running time of subsequent processing. The new CDR data table is shown in Table 4.
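The conversion of the IMSI string into a nonnegative LONG value, together with the uniqueness check, might look like the sketch below. The specific hash is an assumption: the paper does not name the hash function, so SHA-256 truncated to 63 bits is used here purely for illustration, and the sample IMSI strings are made up.

```python
import hashlib
from typing import Dict, Iterable


def imsi_to_long(imsi: str) -> int:
    """Map an IMSI string to a nonnegative integer that fits a signed 64-bit LONG."""
    digest = hashlib.sha256(imsi.encode("utf-8")).digest()
    # Keep the first 8 bytes and clear the sign bit so the value is never negative.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF


def build_id_map(imsis: Iterable[str]) -> Dict[str, int]:
    """Hash every IMSI and verify that no two distinct IMSIs share an ID."""
    mapping: Dict[str, int] = {}
    owner: Dict[int, str] = {}
    for imsi in imsis:
        if imsi in mapping:
            continue  # repeated records of the same user are expected
        user_id = imsi_to_long(imsi)
        if user_id in owner:
            raise ValueError(f"hash collision between {imsi} and {owner[user_id]}")
        owner[user_id] = imsi
        mapping[imsi] = user_id
    return mapping


# Placeholder IMSI values, not real subscriber identities.
sample = ["460000000000001", "460000000000002", "460000000000001"]
id_map = build_id_map(sample)
print(len(id_map))
```

Raising an error on a collision makes the uniqueness check explicit; with a 63-bit hash space, collisions among tens of millions of users are very unlikely but not impossible, which is why the verification step matters.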

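Tourist Flow Algorithm Based on the CDR Data.
In this paper, the tourist flow statistics of a scenic spot are divided into the following parts: the total flow, stagnant flow, influx, outflow, net increment, and variation of the scenic spot during a certain period of time. The calculation method is as follows.
The specified study period is [a, b], and the interval from time a to time b is t hours; a preceding time period [c, a], whose length is also t hours, is set as well. Let B be the collection of phone users observed in the scenic spot during [a, b], with the total number of users recorded as $N_B$, and let C be the collection of phone users observed during [c, a], with the total number of users recorded as $N_C$.
(1) The total flow of the scenic spot during a period of time is the total number of tourists in the scenic spot in the time period [a, b], expressed as $T_t = N_B$.
(2) The stagnant flow of the scenic spot during a certain period of time is the number of tourists in the scenic spot both within the time period [c, a] and within the next time period [a, b], expressed as $U_t = |B \cap C|$.
(3) The influx of the scenic spot during a certain period of time is the number of tourists who were not in the scenic spot within the time period [c, a] but who appeared within the time period [a, b], expressed as $I_t = T_t - U_t = N_B - |B \cap C|$.
(4) The outflow of the scenic spot during a certain period of time is the number of tourists who appeared within the time period [c, a] but did not appear within [a, b], expressed as $O_t = N_C - U_t = N_C - |B \cap C|$.
(5) The net increment of the scenic spot during a certain period of time is the net increase in tourist number at the scenic spot during the time period [a, b], calculated as the total influx minus the total outflow over this time period; it can also be understood as the increase in tourist number (which can be negative) in the time period [a, b] compared with the time period [c, a]. It is expressed as $R_t = I_t - O_t = N_B - N_C$.
(6) The variation of the scenic spot during a certain period of time is the net increase in tourist number during the time period [0, b], that is, the net increase in tourist number within the time period [a, b] plus the total outflow within [0, a].
The calculation process is shown in Figure 5.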
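The first five flow quantities reduce to set operations on user IDs. The following sketch assumes that B is the set of users seen in the scenic spot during the study period [a, b] and C is the set seen during the preceding period [c, a]; the function name and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable


def flow_statistics(previous_users: Iterable[int], current_users: Iterable[int]) -> Dict[str, int]:
    """Compute flow quantities from the users seen in [c, a] and in [a, b]."""
    c_users = set(previous_users)  # collection C: users seen during [c, a]
    b_users = set(current_users)   # collection B: users seen during [a, b]

    total = len(b_users)                 # total flow in [a, b]
    stagnant = len(b_users & c_users)    # present in both periods
    influx = total - stagnant            # new arrivals in [a, b]
    outflow = len(c_users) - stagnant    # seen in [c, a] but gone in [a, b]
    net_increment = influx - outflow     # equals len(B) - len(C)

    return {
        "total": total,
        "stagnant": stagnant,
        "influx": influx,
        "outflow": outflow,
        "net_increment": net_increment,
    }


# Illustrative user IDs only.
stats = flow_statistics(previous_users=[1, 2, 3, 4], current_users=[3, 4, 5, 6, 7])
print(stats)
```

Because the stagnant flow cancels out, the net increment always equals the difference between the sizes of the two collections, which provides a quick consistency check on the computed influx and outflow.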

Tourist OD Analysis.
The origin and destination analysis of tourists reflects important characteristics of tourist travel. Therefore, this paper uses the O (origin) and D (destination) of each trip to analyse where tourists come from, where they go, and how they travel. The existing traffic zone division is used to conduct the tourist OD analysis.

Traffic Zone.
This paper adopts the traffic zone division in the literature [20]. Based on the processed mobile phone base station data and the CDR data (see Table 4 for the data format), each mobile phone base station is first classified in terms of traffic semantics as residential area, work area, or road traffic. The base stations are then matched with the geographic information system and grouped into "traffic zones" according to the mobility characteristics of the population that are directly related to traffic. Figure 6 depicts the resulting traffic zones in the urban area within the Sixth Ring Road in Beijing. Each traffic zone contains a large number of base stations. First, the base station trajectory of each user is extracted by a specific algorithm, which yields the continuous sequence of base stations that the user switches between. Then, taking the traffic zone as the unit, the user's base station locations are converted into traffic zone locations to obtain the user's zone switching trajectory. The data from this step form the basis for acquiring the dynamic urban traffic OD.
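The conversion from a base station trajectory to a zone switching trajectory can be sketched as a lookup followed by the removal of consecutive duplicates. The record layout (timestamp, Cell-ID, LAC), the zone labels, and the rule of skipping stations without a zone are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str, str]  # (timestamp, Cell-ID, LAC)


def zone_trajectory(
    records: List[Record],
    station_zone: Dict[Tuple[str, str], str],
) -> List[Tuple[int, str]]:
    """Convert one user's base station records into a zone switching trajectory."""
    trajectory: List[Tuple[int, str]] = []
    for timestamp, cell_id, lac in sorted(records):
        zone = station_zone.get((lac, cell_id))
        if zone is None:
            continue  # station not assigned to any traffic zone
        # Keep a point only when the user enters a different zone.
        if not trajectory or trajectory[-1][1] != zone:
            trajectory.append((timestamp, zone))
    return trajectory


# Hypothetical station-to-zone table and CDR records.
station_zone = {
    ("4125", "10001"): "Z1",
    ("4125", "10002"): "Z1",
    ("4130", "20001"): "Z2",
}
records = [
    (3, "20001", "4130"),
    (1, "10001", "4125"),
    (2, "10002", "4125"),
    (4, "99999", "4999"),
    (5, "10001", "4125"),
]
print(zone_trajectory(records, station_zone))
```

Switches between base stations inside the same zone are collapsed, so the output contains only genuine zone changes.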

Tourist OD Analysis Algorithm Based on the CDR Data.
This paper primarily examines the OD flow of tourists in key scenic spots. Since February is a low season for tourism, the operating hours of most scenic spots are from 8:30 am to 4:30 pm. This paper takes the Summer Palace as an example to study the method for the tourist OD analysis.
The algorithm procedure is as follows: (1) Select Tourists Visiting the Scenic Spot. The scenic spot opens at 8:30 am every day, and it takes approximately 2 to 3 hours to visit the entire area. The data are therefore first screened against the following two conditions: (a) select users who remain in the scenic spot from 8:30 am to 11 am without going to other areas; (b) exclude those who remain in the scenic spot from 8:30 am to 4:30 pm (workers). The result is the set of user IDs of the tourists in the scenic spot.
(2) Obtain Tourists' OD Information. For the tourists whose IDs have been selected, the traffic zone they originate from and the zone where they finally arrive must be determined. This information is obtained by reverse querying, that is, using a user's ID number to identify where she or he passed from 5 am to 8 am and from 11 am to 2 pm. The user's movement route from 5 am to 2 pm is thereby determined.
(3) Convert the Trajectory Data into an OD Matrix. The trajectory obtained for each user may contain a significant amount of location data because the user is constantly moving. However, only the origin and destination of each user are needed. To improve the fault tolerance, the resulting position is calculated as the median of the position data in the period multiplied by 0.7 plus the average of the position data in the period multiplied by 0.3. The traffic zone of each position is determined from the location range of the zones. The number of people in each traffic zone is then counted, and the OD matrix is finally produced.
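The three steps above can be sketched as follows. The data layout is hypothetical: each user has a list of sightings (hour of day, inside the scenic spot or not) for the screening step and lists of (longitude, latitude) positions for the periods before and after the visit. Treating a user as a worker when all sightings up to 4:30 pm are inside the scenic spot, and counting (origin zone, destination zone) pairs as the OD matrix, are interpretations made for this sketch; `zone_of` stands in for the real traffic zone lookup.

```python
from collections import Counter
from statistics import mean, median
from typing import Callable, Dict, List, Tuple

Sighting = Tuple[float, bool]   # (hour of day, inside the scenic spot?)
Position = Tuple[float, float]  # (longitude, latitude)


def select_tourists(sightings: Dict[str, List[Sighting]]) -> List[str]:
    """Step 1: keep users who stay from 8:30 to 11:00 but not until 16:30."""
    tourists = []
    for user, rows in sightings.items():
        morning = [inside for hour, inside in rows if 8.5 <= hour <= 11.0]
        afternoon = [inside for hour, inside in rows if 11.0 < hour <= 16.5]
        stayed_morning = bool(morning) and all(morning)
        stayed_all_day = stayed_morning and bool(afternoon) and all(afternoon)
        if stayed_morning and not stayed_all_day:
            tourists.append(user)
    return tourists


def representative_position(points: List[Position]) -> Position:
    """Step 3: blend median (weight 0.7) and mean (weight 0.3) per coordinate."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (
        0.7 * median(lons) + 0.3 * mean(lons),
        0.7 * median(lats) + 0.3 * mean(lats),
    )


def od_matrix(
    users: List[str],
    before: Dict[str, List[Position]],
    after: Dict[str, List[Position]],
    zone_of: Callable[[Position], str],
) -> Counter:
    """Steps 2 and 3: count tourists per (origin zone, destination zone) pair."""
    matrix: Counter = Counter()
    for user in users:
        if before.get(user) and after.get(user):
            origin = zone_of(representative_position(before[user]))
            destination = zone_of(representative_position(after[user]))
            matrix[(origin, destination)] += 1
    return matrix


# Illustrative data only.
sightings = {
    "u1": [(9.0, True), (10.5, True), (13.0, False)],  # tourist
    "u2": [(9.0, True), (12.0, True), (16.0, True)],   # stays all day: worker
    "u3": [(9.0, True), (10.0, False)],                # leaves during the morning
}
before = {"u1": [(116.30, 39.90), (116.31, 39.91), (116.35, 39.95)]}
after = {"u1": [(116.40, 39.98), (116.40, 39.98)]}


def zone_of(pos: Position) -> str:
    """Hypothetical two-zone split used instead of the real zone polygons."""
    return "west" if pos[0] < 116.35 else "east"


tourists = select_tourists(sightings)
matrix = od_matrix(tourists, before, after, zone_of)
print(tourists, dict(matrix))
```

Weighting the median more heavily than the mean limits the influence of a few outlying positions while still letting the overall spread of the trajectory shift the result slightly.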

Experiment and Result Analysis
Tourist Flow Statistics.
According to the above algorithm, we collect the statistics of tourist flow in 20 scenic spots and select two typical scenic spots, the Olympic Forest Park and the Badaling Great Wall, obtaining the flow charts of the scenic spots shown in Figure 7.
From the flow charts in Figure 7, we observe that within the same scenic spot the daily pattern of change over time is similar from day to day. In general, most tourists begin their visit between 9:00 and 10:00 and leave between 17:00 and 18:00. However, the pattern differs between traditional scenic areas and general parks. Traditional scenic spots such as the Great Wall close after 17:30. As a result, in the afternoon the number of tourists remaining in the scenic spot far exceeds the number entering, and there are no visitors after 18:00. For a general leisure park such as the Olympic Forest Park, another entry peak occurs after work from 17:00 to 18:00, and a departure peak appears from 21:00 to 22:00, consistent with daily behaviour.
In addition, the number of tourists on weekends and holidays is much higher than on working days, as shown in Figure 8. Examining the number of tourists in the scenic spots along the time dimension, we find that the number of tourists on weekends far exceeds the number on working days in the traditional scenic spots. The number of tourists in the summer holidays is clearly higher than in other seasons, and the tourist flow peaks during holidays such as the Spring Festival, the Qingming Festival, the Dragon Boat Festival, and the National Day. The reason is that the resident population of Beijing has more time to go out on a tour on weekends and holidays.
Comparing the total number of tourists in different scenic spots along the spatial dimension, we obtain the ranking of tourist flow in Beijing shown in Figure 9. Some scenic spots are clear hot spots for tourists.

OD Analysis of the Scenic Spot.
To illustrate the origin and destination of tourists, taking the Summer Palace as an example, we analyse the tourists (1,733 in total) who visited the palace from 8 am to 11 am on February 8, 2015. Using the tourist OD analysis algorithm above together with the GIS platform, the origin and destination information of tourists in the Summer Palace scenic area is obtained. Figure 10(a) shows the number of tourists visiting the Summer Palace, whereas Figure 10(b) depicts the number of tourists who move from the Summer Palace to other places. To better distinguish the different tourist flows, we vary the thickness and colour of the arrows and lines: as the flow increases, the arrows and lines become wider, and the colour changes from green to red. The spatial distribution map shows that both the origins of tourists (attraction to the Summer Palace area) and the destinations of tourists (generation from the Summer Palace area) spread in "waves", layer by layer, in line with regular traffic patterns. Although tourists generally follow the law that "as the distance increases, the amount decreases" after one tour, due to the particularity of the scenic spots, tourists choose to visit not only places near the Summer Palace, such as Tsinghua University and Peking University, but also more distant scenic spots such as the Forbidden City and the Beijing Zoo.
To fully and comprehensively analyse the origin and destination of tourists in the Summer Palace, the data not only should be described qualitatively and visually from the space level but also should be described accurately and quantitatively from the statistics level to clarify the relationship between travel and distance.
First, the latitude and longitude of the centre point of the Summer Palace are extracted directly from the GIS platform and used to calculate the distance between the origin point (the ascent point) of each tourist and the centre of the scenic spot, that is, the geographical (spherical) distance between the two points on the map. Second, we count how many tourists share each distance value. Third, the arrival and departure probabilities of the tourists in the Summer Palace are calculated. Finally, we fit the distance distribution of the tourists, taking the travel distance as the abscissa and the probability of attraction (or occurrence) of the scenic spot as the ordinate. A composite (two-term) exponential function is found to give the best fit. The fitted distance distributions of tourist origin and destination are as follows:
$$P_O(d) = 0.5065\,e^{-0.4146 d} + 0.1078\,e^{-0.05757 d}$$
$$P_D(d) = 0.7927\,e^{-0.9213 d} + 0.06561\,e^{-0.1281 d}$$
In the formulas, d is the distance between the scenic spot and the visitor's location; $P_O(d)$ is the attraction probability of the scenic spot at distance d; and $P_D(d)$ is the occurrence probability of the scenic spot at distance d.
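The distance calculation and the fitted distributions can be evaluated with the short sketch below. It uses the haversine formula for the spherical distance and assumes that the fitted functions are sums of two exponentials with base e and that d is measured in kilometres; neither the unit nor the exact distance formula is stated in the text, and the coordinates are illustrative rather than taken from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def p_origin(d: float) -> float:
    """Fitted attraction probability at distance d (assumed kilometres)."""
    return 0.5065 * math.exp(-0.4146 * d) + 0.1078 * math.exp(-0.05757 * d)


def p_destination(d: float) -> float:
    """Fitted occurrence probability at distance d (assumed kilometres)."""
    return 0.7927 * math.exp(-0.9213 * d) + 0.06561 * math.exp(-0.1281 * d)


# Illustrative coordinates: an approximate scenic spot centre and one origin point.
centre = (116.27, 40.00)
origin = (116.33, 39.99)

d = haversine_km(origin[0], origin[1], centre[0], centre[1])
print(round(d, 2), p_origin(d), p_destination(d))
```

Both fitted functions decrease monotonically with distance, and the larger decay constants in the destination curve indicate that it falls off more quickly at short range under these assumptions.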
From the formula and Figure 11, it is observed that the origin and destination of tourists are mainly distributed in the close range.That is to say, the majority of tourists still prefer to visit the closer scenic spots.

Conclusions
This paper presents a method based on the CDR data to analyse the tourist flow of scenic spots, including the collection and processing of the CDR data and the statistical analysis of tourist flow, travel OD, and other indicators, all of which provide helpful information for the operation management of the scenic spots. The conclusions are as follows: (1) The analysis of the CDR data for scenic spots in Beijing shows that the method can effectively analyse tourist flows and other behaviour information, which can provide big data support to alleviate the traffic pressure on tourism lines, to support the future traffic construction of tourism lines, and to promote the operation management of the scenic area.
(2) Travel OD analysis of the scenic spot gives the spatial distribution of the origins and destinations of tourists, which can help the managers of the scenic spot attract tourists from different districts of the city.
(3) The analysis shows that the big data analysis method based on the CDR data of mobile phones can provide real-time information about tourist behaviour in a timely and effective manner. This information can be applied in scenic areas and can provide real-time big data support for "smart tourism".
In the future, researchers can further improve the positioning accuracy and the calculation accuracy for the scenic spots, increase the ability to analyse the individual behaviour of tourists, enhance the application of the statistical analysis of tourist flow in scenic spots to support the operation management, and conduct in-depth research such as traffic monitoring, the tourist source analysis, and the comfort degree evaluation of tourists.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Figure 1: Location mode of mobile communication network.

Figure 2: Mobile phone base station distribution in Beijing.

Preprocessing of Mobile Phone Base Station Data.
Because the base station data are continuously updated during the operation of the communication company, many of the basic records have no value. Spark SQL is used to process the 52,759 raw base station records into 43,022 valid records and to extract Cell-ID, LAC, longitude and latitude, base station type, coverage area type, and base station location as new attribute values. The specific details are shown in Table

Figure 4: Mobile phone base stations in the scenic area.(a) Spatial distribution of typical scenic spots in Beijing.(b) Generating a list of mobile phone base stations in the scenic area.
Tourist flow quantities, where $N_B$ and $N_C$ denote the numbers of users in collection B (time period [a, b]) and collection C (time period [c, a]): (1) The total flow of the scenic spot during a period of time is the total tourist number of the scenic spot in the time period [a, b], expressed as $T_t = N_B$. (2) The stagnant flow of the scenic spot during a certain period of time is the number of tourists in the scenic spot both within the time period [c, a] and within the next time period [a, b], expressed as $U_t = |B \cap C|$. (3) The influx of the scenic spot during a certain period of time is the increased tourist number at the scenic spot during a certain period of time, that is, the number of tourists who were not in the scenic spot within the time period [c, a] but who appeared within the time period [a, b], expressed as $I_t = T_t - U_t = N_B - |B \cap C|$. (4) The outflow of the scenic spot during a certain period of time is the number of tourists who appeared within the time period [c, a] but did not appear within [a, b], expressed as $O_t = N_C - U_t = N_C - |B \cap C|$.
(5) The net increment of the scenic spot is the increase in tourist number in the time period [a, b] compared with the time period [c, a], expressed as $R_t = I_t - O_t = N_B - N_C$. (6) The variation of the scenic spot during a certain period of time is the net increase in tourist number during the time period [0, b], that is, the user number calculated by the net increase in tourist number within the time period [a, b] plus the total outflow within [0, a].

Figure 5: Tourist flow algorithm based on the CDR data.

Figure 6: Traffic zone division of the area within the sixth ring in Beijing.

Figure 7: Weekly tourist flow chart of the scenic spots.(a) Olympic Forest Park.(b) Badaling Great Wall.
Figure 10(a) shows the number of tourists visiting the Summer Palace, whereas Figure 10(b) depicts the number of tourists who move from the Summer Palace to other places.

Figure 9: Analysis of scenic spot flow.(a) A traffic flow chart of the scenic spots in one week.(b) Scenic spot flow chart of daily average tourists.

Figure 11: Distance distribution of visitors in the Summer Palace.(a) Origin distance distribution of tourists.(b) Destination distance distribution of tourists.

Table 1: Samples of the mobile phone base station.

Table 2: Example of CDR.

Table 4: Example of processed CDR data.