Hotel booking demand datasets

This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other (H2) is a city hotel. Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and the 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprise bookings due to arrive between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is real hotel data, all data elements pertaining to hotel or customer identification were deleted. Due to the scarcity of real business data for scientific and educational purposes, these datasets can play an important role in research and education in revenue management, machine learning, or data mining, as well as in other fields.


Data format: Mixed (raw and preprocessed)
Experimental factors: Some of the variables were engineered from other variables from different database tables. The data point time for each observation was defined as the day prior to each booking's arrival.
Experimental features: Data was extracted via TSQL queries executed directly in the hotels' PMS databases, and R was employed to perform data analysis.
Data source location: Both hotels are located in Portugal: H1 in the resort region of Algarve and H2 in the city of Lisbon.
Data accessibility: Data is supplied with the paper.

Value of the data
Descriptive analytics can be employed to further understand patterns, trends, and anomalies in the data;
The datasets can be used to perform research on different problems, such as booking cancellation prediction, customer segmentation, customer satiation, and seasonality, among others;
Researchers can use the datasets to benchmark booking cancellation prediction models against results already known (e.g. [1]);
Machine learning researchers can use the datasets for benchmarking the performance of different algorithms for solving the same type of problem (classification, segmentation, or other);
Educators can use the datasets for machine learning classification or segmentation problems;
Educators can use the datasets for training in either statistics or data mining.

Data
In tourism and travel related industries, most of the research on Revenue Management demand forecasting and prediction problems employs data from the aviation industry, in the format known as the Passenger Name Record (PNR). This is a format developed by the aviation industry [2]. However, the remaining tourism and travel industries, like hospitality, cruising, theme parks, etc., have different requirements and particularities that cannot be fully explored without industry-specific data. Hence, two hotel datasets with demand data are shared to help in overcoming this limitation.
The datasets now made available were collected with the aim of developing prediction models to classify a hotel booking's likelihood of being canceled. Nevertheless, due to the characteristics of the variables included in these datasets, their use goes beyond this cancellation prediction problem.
One of the most important properties in data for prediction models is not to promote leakage of future information [3]. In order to prevent this from happening, the timestamp of the target variable must occur after the input variables' timestamp. Thus, instead of directly extracting variables from the bookings database table, when available, the variables' values were extracted from the bookings change log, with a timestamp relative to the day prior to arrival date (for all the bookings created before their arrival date).
Not all variables in these datasets come from the bookings or change log database tables. Some come from other tables, and some are engineered from different variables from different tables. A diagram presenting the PMS database tables from where variables were extracted is presented in Fig. 1. A detailed description of each variable is offered in the following section.
Deposit categories (Table 1): "No Deposit": no deposit was made; "Non Refund": a deposit was made in the value of the total stay cost; "Refundable": a deposit was made with a value under the total cost of stay. In case no payments were found, the value is "No Deposit". If the payment was equal to or exceeded the total cost of stay, the value is set as "Non Refund". Otherwise the value is set as "Refundable".
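The deposit rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' TSQL: the function name `deposit_type` and its parameters are hypothetical, and summing several payments into one amount is an assumption, since the text speaks only of "the payment".

```python
from decimal import Decimal


def deposit_type(payments, total_cost):
    """Classify a booking's deposit from the payments found for it.

    `payments` is an iterable of amounts and `total_cost` is the total cost
    of stay. Both names are hypothetical stand-ins for the PMS fields.
    """
    payments = list(payments)
    if not payments:
        # No payments were found for the booking.
        return "No Deposit"
    # Assumption: multiple payments are added together before comparing.
    paid = sum(payments, Decimal("0"))
    if paid >= total_cost:
        # The payment equalled or exceeded the total cost of stay.
        return "Non Refund"
    return "Refundable"
```

For example, a booking with a total cost of 300 and payments of 100 and 50 would be labelled "Refundable" under these assumptions.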

Experimental design, materials and methods
Data was obtained directly from the hotels' PMS databases' servers by executing a TSQL query on SQL Server Management Studio, the integrated environment tool for managing Microsoft SQL databases [4]. This query first collected the value or ID (in the case of foreign keys) of each variable in the BO table. The BL table was then checked for any alteration with respect to the day prior to the arrival. If an alteration was found, the value used was the one present in the BL table. For all the variables holding values in related tables (like meals, distribution channels, nationalities or market segments), their related values were retrieved. A detailed description of the extracted variables, their origin, and the engineering procedures employed in their creation is shown in Table 1.
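The idea of taking each variable as it stood on the day prior to arrival can be illustrated with a small Python sketch. The layout of the BL (change log) table is not described here, so the `Change` record with its `changed_at`, `old_value` and `new_value` fields is purely an assumption, as are the function name and the example room type codes. The actual extraction was done in TSQL.

```python
from datetime import date, datetime, time, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical change-log entry for one field of one booking."""
    changed_at: datetime  # when the alteration was logged
    old_value: object     # value before the alteration
    new_value: object     # value after the alteration


def value_at_cutoff(current_value, changes, arrival_date):
    """Return the field value as it stood at the end of the day before arrival."""
    cutoff = datetime.combine(arrival_date - timedelta(days=1), time.max)
    ordered = sorted(changes, key=lambda c: c.changed_at)
    before = [c for c in ordered if c.changed_at <= cutoff]
    if before:
        # The latest alteration up to the cutoff gives the value at that moment.
        return before[-1].new_value
    if ordered:
        # Every alteration happened after the cutoff: undo the earliest one.
        return ordered[0].old_value
    # No alterations logged: the bookings table value is already correct.
    return current_value


arrival = date(2016, 7, 10)
log = [
    Change(datetime(2016, 7, 1, 9, 0), "A", "D"),    # changed well before arrival
    Change(datetime(2016, 7, 10, 15, 0), "D", "E"),  # changed at check-in
]
room_type = value_at_cutoff("E", log, arrival)  # "D": the check-in change is ignored
```

Ignoring alterations logged on or after the arrival date is what keeps information from check-in or the stay itself out of the input variables.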
The PMS ensured that no missing data exists in its database tables. However, in some categorical variables like Agent or Company, "NULL" is presented as one of the categories. This should not be considered a missing value, but rather as "not applicable". For example, if a booking's "Agent" is defined as "NULL", it means that the booking did not come from a travel agent.
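When loading the data, it can help to make the "not applicable" meaning explicit so that "NULL" is not later mistaken for a missing value. The following Python sketch uses only the standard library. The sample rows, the Meal codes and the replacement label "Not applicable" are illustrative assumptions, not part of the datasets' documentation.

```python
import csv
import io

# Columns in which "NULL" means "not applicable" rather than "missing".
NOT_APPLICABLE_COLUMNS = ("Agent", "Company")


def read_bookings(handle):
    """Yield rows as dicts, relabelling "NULL" in the columns listed above."""
    for row in csv.DictReader(handle):
        for column in NOT_APPLICABLE_COLUMNS:
            # Strip padding before comparing, in case values are space-padded.
            if (row.get(column) or "").strip() == "NULL":
                row[column] = "Not applicable"
        yield row


# Illustrative sample, not real records from H1 or H2.
sample = io.StringIO(
    "Agent,Company,Meal\n"
    "NULL,NULL,BB\n"
    "240,NULL,HB\n"
)
rows = list(read_bookings(sample))
```

Some CSV readers convert the string "NULL" to a missing value by default. pandas `read_csv` does so unless `keep_default_na=False` is passed, so the relabelling has to happen before, or instead of, that conversion.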
Summary statistics for both hotels' datasets are presented in Tables 2-7. These statistics were obtained using the 'skimr' R package [7].
A word of caution is due for those not so familiar with hotel operations. In the hotel industry it is quite common for customers to change their booking's attributes, like the number of persons, staying duration, or room type preferences, either at the time of their check-in or during their stay. It is also common for hotels not to know the correct nationality of the customer until the moment of check-in. Therefore, even though the capture of data considered a timespan prior to arrival date, it is understandable that the distribution of some variables differs between non-canceled and canceled bookings. Consequently, the use of these datasets may require this difference in distribution to be taken into account. This difference can be seen in the table plots of Fig. 2 and Fig. 3. Table plots are a powerful visualization method that allows for the exploration and analysis of large multivariate datasets; they were produced with the tabplot R package [8]. In table plots each column represents a variable and each row a bin with a pre-defined number of observations. In these two figures, each bin contains 100 observations. The bars in each variable show the mean value for numeric variables or the frequency of each level for categorical variables. Analyzing these figures, it is possible to verify that, for both hotels, the distribution of variables like Adults, Children, StaysInWeekendNights, StaysInWeekNights, Meal, Country and AssignedRoomType is clearly different between non-canceled and canceled bookings.
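The binning behind a table plot can be sketched in plain Python to make the computation concrete. This is not the tabplot implementation. The function name, the descending sort order, the column name `IsCanceled` and the synthetic rows are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def table_plot_bins(rows, sort_key, bin_size=100):
    """Sort rows by `sort_key` and summarise consecutive bins of `bin_size` rows.

    Numeric columns are summarised by their mean and all other columns by the
    relative frequency of each level, as in the description of the figures.
    """
    ordered = sorted(rows, key=lambda r: r[sort_key], reverse=True)
    summaries = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        summary = {}
        for column in chunk[0]:
            values = [r[column] for r in chunk]
            if all(isinstance(v, (int, float)) for v in values):
                summary[column] = mean(values)
            else:
                counts = Counter(values)
                summary[column] = {lvl: n / len(values) for lvl, n in counts.items()}
        summaries.append(summary)
    return summaries


# Synthetic rows: canceled bookings (IsCanceled = 1) differ in Adults and Meal.
rows = [
    {"IsCanceled": i % 2,
     "Adults": 2 if i % 2 == 0 else 1,
     "Meal": "BB" if i % 2 == 0 else "HB"}
    for i in range(400)
]
bins = table_plot_bins(rows, sort_key="IsCanceled")
```

Sorting on the cancellation indicator places canceled and non-canceled bookings in separate bins, so a shift in a column's mean or level frequencies between the first and last bins is the kind of difference the figures are meant to reveal.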