An Application of Reinforced Learning-Based Dynamic Pricing for Improvement of Ridesharing Platform Service in Seoul

Abstract: As ridesharing services (including taxis) are often run by private companies, profitability is the top priority in operation. This leads to an increase in drivers' refusal to take passengers to low-demand areas where they will have difficulty finding subsequent passengers, causing problems such as extended waiting times for passengers bound for these regions. This study used Seoul's taxi data to find appropriate surge rates for ridesharing services between 10:00 p.m. and 4:00 a.m. by region, using a reinforcement learning algorithm to resolve this problem during the worst time period. In the reinforcement learning, the outcome of a centrality analysis was applied as a weight affecting drivers' destination choice probability. Furthermore, the reward function used in the learning was adjusted according to whether a passenger waiting time value was applied, with profit used as the reward value. By using a negative reward for passenger waiting time, the study was able to identify a more appropriate surge level. Across the region, the surge averaged 1.6; areas on the outskirts of the city and in residential districts showed a higher surge, while central areas had a lower surge. With this differentiated surge, drivers' refusal to take passengers can be lessened and passenger waiting times shortened. The supply of ridesharing services in low-demand regions can be increased by as much as 7.5%, reducing regional equity problems related to ridesharing services in Seoul, including the indirect refusal of drivers to take passengers to unpreferred areas. With additional real-time ridesharing user data, the Deep Q-Network (DQN) technique could be adopted to conduct a smaller-scale spatial analysis of ridesharing services. Combined with more knowledge on fare sensitivity by user group, the dynamic pricing approach proposed in this study can significantly contribute to resolving the spatial equity problem in mobility services in the future.


Introduction
Since the 1960s, with the beginning of industrialization, Korea has experienced rapid urbanization [1][2][3]. As such, transportation infrastructure has also developed steadily, greatly improving the mobility of citizens. However, in metropolitan areas, such as the capital city and its surrounding regions, the expansion of urban areas and overpopulation increased traffic and consequently caused other social problems, including traffic congestion and air pollution. To minimize such social costs, various policies to facilitate public transportation and taxis have been actively promoted [4]. Despite these efforts, current services fall short of providing an equal level of mobility to all residents and require enhancement. In particular, in Seoul Metropolitan City, despite its excellent public transportation infrastructure, users face constant inconvenience from congestion during rush hours and a shortage of ridesharing (including taxi) supply late at night on the outskirts of the city and in hilly areas [5].
Against this backdrop and with the recent introduction of the mobility as a service (MaaS) concept, efforts have been made to improve individuals' mobility and accessibility through the integrated

Studies Analyzing Spatially Marginalized Areas in Terms of Transportation Services
In their study, Lee et al. [11] identified the mobility of different marginalized groups using smart card data and evaluated the mobility of groups highly reliant on public transportation. Building upon this, the study also categorized regions into different types according to their need for mobility improvement and concluded that mainly the outskirt areas of a city require improvements. Ha et al. [12] located areas marginalized from public transportation services in Seoul, using real travel characteristics gathered through Google and T map navigation application programming interfaces (APIs), and analyzed areas with high priority for service enhancement. Data such as travel characteristics and socioeconomic indices were analyzed, identifying Gangbuk-gu, Seongbuk-gu, Seodaemun-gu, Jungnang-gu, and the southeast zone as areas requiring improvements in public transportation. Han [13] deduced spatially marginalized areas by assessing user mobility and the service level of providers. Moreover, the study reviewed the potential equity issues among different groups and suggested ways to address them. According to the study, public transportation overall showed a satisfactory level from the users' perspective; however, there was a large deviation in the supply level by region. Notably, on the outskirts of the city, including Nowon-gu and Gwanak-gu, the gap between supply and demand appeared to be wide and, thus, most urgently required an improvement in supply.
From an accessibility perspective, Lee et al. [14] developed various indices for the connectivity, directness, and diversity of public transportation using transportation card data and assessed each transportation zone. This study confirmed that zones with higher public transportation connectivity had more routes and better directness. On the contrary, in zones located on the outskirts of the city, marginalization from public transportation was vivid. Kim et al. [15] and Yoon et al. [16] studied regional marginalization and inequity by considering not just spatial accessibility, but also social class. Kim et al. overlaid and compared socioeconomic characteristics and city zones using location data of public transportation in Daegu. Their analysis confirmed the low accessibility and environmental inequity of socially disadvantaged people (the aged, recipients of national basic livelihood benefits, etc.) in suburban areas. Yoon et al. calculated an inequity index for socially marginalized people on the basis of a Gini-style index and the accessibility measures developed by Currie. As for public transportation accessibility, the regional gap was bigger for the subway than for the bus. When this was overlaid on the data for the socially disadvantaged group, inequity was confirmed to be greater for the subway than for the bus.
There were also studies assessing equity depending on regional differences in infrastructure. Kim et al. [17] analyzed disadvantaged regions by overlaying service areas of public transportation and confirmed that suburban areas were mainly in a disadvantageous position. Furthermore, when considering socioeconomic characteristics, the study concluded that there was a gap in public transportation infrastructure among regions. Lee et al. [18] used Seoul Metropolitan Household Travel Survey data to measure regional equity among different income brackets depending on the level of transportation infrastructure. The study showed that lower spatial equity led to longer total traveling time. Bin et al. [19] carried out a spatial cluster analysis at the administrative unit (Eup, Myeon, and Dong) level with transportation infrastructure indicators and travel behavior to assess equity in Gyeonggi-do province (excluding Seoul Metropolitan City and Incheon City). The results clearly showed gaps between areas closer to Seoul and those on the outskirts. In particular, equity at the infrastructure level in northern Gyeonggi-do province was low.

Dynamic Pricing Studies on Ridesharing Service
Before reviewing previous literature, dynamic pricing can be defined as a strategy in which prices for the same product or service change flexibly depending on the market situation [20][21][22]. This strategy is mainly employed in e-commerce, flight ticketing, hotel booking, and demand management, optimizing sales in environments where prices can be easily adjusted. With regard to dynamic pricing for ridesharing services, there were studies on profitability improvement and the determination of an appropriate price through a pricing strategy [23][24][25][26][27][28][29][30], studies analyzing the elements of a pricing strategy affecting customers or drivers [31][32][33], and a study based on reinforcement learning [34].
First, Banerjee et al. [23] validated the performance of dynamic pricing by proposing a ridesharing model considering two aspects: the stochastic dynamics of the market and the strategic decisions of the drivers, passengers, and platform. According to the analysis, flexible pricing increased total utility depending on supply and demand conditions. Zeng et al. [24] researched a dynamic pricing strategy for potential users considering the destinations of taxis. A Markov decision process (MDP) was formulated considering the cost of pick-ups at the destination, and the total utility was enhanced compared to the fixed-price case. Hall et al. [25] studied the economic utility of surge pricing by analyzing Uber data.
When prices rose in line with the surge pricing algorithm, the increase improved driver profitability, supply, and efficiency. Moreover, when surge pricing was not employed during peak hours, passenger waiting time was extended as drivers did not pick up passengers, and passenger utility dropped.
In an analysis of the effects of dynamic pricing on drivers using Uber cases, Chen et al. identified that the surge price has a negative impact on passengers and a positive impact on drivers. Furthermore, the study found unfairness along the regional borders of surge pricing. Chen et al. [32] studied changes in the number of Uber drivers depending on changes in the surge price. The analysis confirmed a higher rate of vehicle operation, as well as changes in operating hours, during time periods in which higher profit was expected from the drivers' perspective. Kooti et al. [33] analyzed the impact of dynamic pricing and income on driver behavior, finding that drivers operating vehicles during peak hours earned a higher income than off-peak drivers.
Wu et al. [34] simulated the application of dynamic pricing to ridesharing services. They compared four pricing methodologies, namely, (1) static pricing, (2) proportional pricing, (3) batch updates, and (4) reinforcement learning, with the goal of profit maximization in a single origin-destination (OD) scenario. The simulation found that pricing based on reinforcement learning increased the total profit and the individual profit of drivers the most.

Summary of the Review
Previous studies on mobility-disadvantaged regions mainly focused on evaluating areas marginalized from public transportation from a user's perspective utilizing big data, such as smart card data, household travel surveys, and Geographic Information System Data Base (GIS DB: location data for bus and subway, thematic transportation map, etc.). These studies identified that public transportation services in these regions were not adequately provided. On the other hand, they paid little attention to the late-night mobility inconvenience in regions after public transportation services had ended.
As for dynamic pricing in ridesharing services, studies were carried out mainly using Uber cases. As Uber discloses operation data, several studies related to dynamic pricing were conducted. In particular, the surge-based pricing analysis of Uber showed that it helped increase driver supply and operational profit. On the contrary, if a single-fare system was applied without flexibility, overall utility decreased due to a lower rate of matching and a longer waiting time when suppliers decided not to take passengers. However, there were no studies measuring regional differences in dynamic pricing; thus, it is necessary to determine ways of addressing drivers' refusal to take passengers when taking regional differences into account.
Accordingly, a study evaluating the dynamic pricing solution is required to improve the quality of mobility services in disadvantaged regions in Seoul.

Analysis Methodology
To simulate dynamic pricing using reinforcement learning, this study first conducted a centrality (in-bound and out-bound) analysis with a district(gu)-level OD matrix to develop a regional indicator of degree centrality to be used in reinforcement learning analysis. Then, the indicators were applied to a reinforcement learning simulation.

Degree Centrality Analysis Methodology
Degree centrality is an indicator showing how many nodes are connected to a certain node. In social network theory, degree centrality is defined as the number of nodes linked directly to any given node. This study measured the out-degree and in-degree of each region (node) from a centrality perspective depending on the level of node connectivity, using them as indicators to be applied in reinforcement learning. Centrality can be calculated as follows:

C_D(N_i) = \frac{\sum_{j=1}^{g} x_{ij}}{g - 1}   (1)

where C_D(N_i) is the standardized degree centrality of node i, \sum_{j=1}^{g} x_{ij} is the degree centrality of node i (the number of its direct links), and g is the number of nodes. If there is no direction in the network, the above equation simply gives the degree of each node. If directions are present, the equation distinguishes out-degree centrality (C_outD) and in-degree centrality (C_inD). Here, out-degree centrality is defined as the level of connections going out from a certain node to other nodes, and in-degree centrality as the level of connections coming in from other nodes to a certain node. In this study, in-degree centrality referred to the number of vehicles coming into a certain zone (gu), indicating a region frequently selected as a destination by passengers (usually residential areas during late-night periods). On the contrary, out-degree centrality referred to the number of vehicles traveling out of a certain zone, indicating a region preferred as a destination by drivers (central and subcentral regions). Therefore, this study identified indicators by region considering both out-degree and in-degree centrality.
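As an illustration of the standardized degree centrality described above, the following sketch computes out-degree and in-degree centrality from a toy binary OD connectivity matrix; the 4-zone matrix and its values are hypothetical, standing in for the study's 25-district matrix.

```python
# Sketch of standardized degree centrality: each node's degree divided by
# (g - 1), where g is the number of nodes. x[i][j] = 1 if trips flow from
# zone i to zone j, else 0 (toy data, not the study's OD matrix).

def degree_centrality(x):
    """Return (out_degree, in_degree) centrality lists, each standardized
    by dividing the row/column degree by (g - 1)."""
    g = len(x)
    out_c = [sum(row) / (g - 1) for row in x]
    in_c = [sum(x[i][j] for i in range(g)) / (g - 1) for j in range(g)]
    return out_c, in_c

# Toy connectivity: zone 0 acts as a "central" zone linked to all others.
x = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
out_c, in_c = degree_centrality(x)
print(out_c)  # zone 0 has centrality 1.0, the peripheral zones 1/3
```

Because the toy matrix is symmetric, in-degree and out-degree coincide here; with real directed late-night flows the two diverge, which is exactly the asymmetry the study exploits.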

Reinforcement Learning Methodology

Reinforcement learning is a learning method through which an agent chooses an action to take in an environment so as to maximize reward. An action affects not only the immediate reward, but also long-term rewards. The main characteristics of reinforcement learning are trial-and-error search and delayed reward (Figure 1). This study used the Q-learning algorithm for analysis.
In the Q-learning algorithm [35,36], Q initially has an arbitrary fixed value. When the agent selects an action (a) at a learning step (t), an immediate reward (r) is observed before entering a new state (s), updating the Q value. The key characteristic of this algorithm is the value-iteration update, which uses a weighted average of the old value and the new information. The equation of Q-learning can be expressed as follows:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]   (2)

where \alpha is the learning rate, which determines how strongly the current Q value is updated with the immediate reward and/or the expected future value, usually between 0 and 1. When the value is 0, the current Q value is kept without any update during learning; on the contrary, if the value is 1, the previous Q value is ignored and fully overwritten. \gamma is the discount factor, a variable expressing the weight of future rewards relative to immediate ones, also between 0 and 1. When the value is 0, the agent takes a myopic action by updating with the immediate reward only. When the value is 1, the agent leans toward future rewards, underestimating the current reward. On the basis of existing studies, we experimentally set the learning rate to 0.1 and the discount factor to 0.9.
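The value-iteration update above can be sketched in a few lines with the study's hyperparameters (learning rate 0.1, discount factor 0.9); the state/action encoding below is illustrative, not the paper's exact Q-table layout.

```python
# Minimal sketch of the Q-learning update: move Q[s][a] toward the immediate
# reward plus the discounted best future value, weighted by the learning rate.

ALPHA, GAMMA = 0.1, 0.9  # values chosen experimentally in the study

def q_update(Q, s, a, r, s_next):
    """Apply one Q-learning update and return the new Q value."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
    return Q[s][a]

# Illustrative states keyed by index, actions keyed by surge label.
Q = {0: {"surge_1.0": 0.0, "surge_1.2": 0.0}, 1: {"surge_1.0": 0.0}}
q_update(Q, s=0, a="surge_1.2", r=5.0, s_next=1)
print(Q[0]["surge_1.2"])  # 0.1 * (5.0 + 0.9 * 0.0 - 0.0) = 0.5
```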

Environment Setting for Reinforcement Learning Simulation

Q-Table Configuration
To carry out the reinforcement learning simulation using the Q-learning algorithm, a Q-table needed to be defined through states and actions. The Q-table is a matrix that compiles rewards for all states and actions, and it is updated after each learning step. Here, 600 OD cells were used in an OD matrix of 25 zones in Seoul, excluding intrazonal travel. The state refers to the price that can potentially be generated in each OD cell depending on the time slot, which can be expressed as 13 (price range) × 6 h (time round, T_r) for each OD cell. In total, there were 13 actions, including a standard price and surge prices from 0.6-fold to 3.0-fold the standard price with an interval of 0.2. The composition of the district OD matrix is shown in Figure 2, with a separate layer constructed for each cell's travel time and passage cost (P_base) to be used for analysis.
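The dimensions implied above can be made concrete in a short sketch; the dictionary layout is an assumption for illustration, while the counts (600 OD cells, 13 surge levels, 6 time rounds) come from the text.

```python
# Q-table dimensions implied by the text: 600 OD cells (25 zones x 24
# destinations, intrazonal travel excluded), states of 13 price levels x 6
# time rounds, and 13 surge actions (0.6x to 3.0x in steps of 0.2).

N_ZONES = 25
N_OD = N_ZONES * (N_ZONES - 1)                          # 600 OD cells
SURGES = [round(0.6 + 0.2 * k, 1) for k in range(13)]   # 0.6, 0.8, ..., 3.0
N_TIME_ROUNDS = 6                                       # 10 p.m. - 4 a.m.

# One sub-table per OD cell: Q[od][(price_idx, time_round)] -> action values.
q_table = {
    od: {(p, tr): [0.0] * len(SURGES)
         for p in range(len(SURGES)) for tr in range(N_TIME_ROUNDS)}
    for od in range(N_OD)
}
print(N_OD, len(SURGES), SURGES[0], SURGES[-1])  # 600 13 0.6 3.0
```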


Action Model
The simulation included passengers and drivers for each cell in the OD matrix and determined whether or not a match was feasible on the basis of the choice probabilities of passengers and drivers. Applying the method used by Wu et al. [34], passengers accepted a price according to the surge, following a normal distribution. On the other hand, drivers' choice probability followed a normal distribution weighted by the centrality analysis outcome for their destination. Centrality analysis was applied to taxi operation data on the basis of 150 M links provided for Seoul Metropolitan City. The analysis was conducted using data from April to September of 2017, which were processed to generate OD data for each district (gu).
For the choice probability model of drivers and passengers, a surge of 1 (base fare of taxi) was applied by consulting the results of a National Bureau of Economic Research (NBER, 2016) study [37]. In addition, the probability followed a cumulative normal distribution:

F(x) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}} \, dt   (3)

The choice probability models of passengers (Equation (4)) and drivers (Equation (5)) according to the surge are expressed below.
where W_1 is the weight modifying the choice probability of passengers when surge = 1, W_2 is the weight modifying the choice probability of drivers when surge = 1, C_outD(N_j) is the out-degree centrality value for a specific time round (T_r), C_inD(N_j) is the in-degree centrality value for a specific time round (T_r), and W_3 is the weight of the equity index allowing differentiation by region.
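The exact functional forms of Equations (4) and (5) are not reproduced in this excerpt, so the sketch below assumes one plausible reading consistent with the text: acceptance probabilities follow a cumulative normal in the surge level, with passengers less willing, and drivers (weighted by destination centrality) more willing, as the surge rises. MU, SIGMA, W1, and W2 are illustrative values, not the study's.

```python
# Hedged sketch of cumulative-normal choice probabilities. The exact forms of
# Equations (4) and (5) are assumed, not taken from the paper.
import math

MU, SIGMA = 1.0, 0.5   # assumed: distribution centered at surge = 1 (base fare)
W1, W2 = 1.0, 1.0      # assumed weights at surge = 1

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """Cumulative normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def p_passenger(surge):
    """Passenger acceptance falls as the surge rises above the base fare."""
    return W1 * (1.0 - normal_cdf(surge))

def p_driver(surge, centrality_weight):
    """Driver acceptance rises with the surge, scaled by the destination's
    centrality-based weight (higher for driver-preferred destinations)."""
    return min(1.0, W2 * centrality_weight * normal_cdf(surge))

print(round(p_passenger(0.6), 3), round(p_driver(2.0, 1.0), 3))
```

A match in the simulation would then require both sides to accept, e.g. drawing two uniform random numbers against these probabilities for each OD cell and time round.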

Reward Function
The criteria for reward differed depending on matching. Once a driver was matched with a passenger, the reward was the base fare (P_fare) for the travel between origin and destination multiplied by the surge coefficient (S_a). If a driver was not matched with a passenger, a negative reward was applied in line with the waiting time value. The waiting time (W_t) was set to 8 min, considering that the average waiting time during late-night periods when a surcharge is applied is 8.1 min according to a study in Seoul Metropolitan City [38]. The time value (V_t) was calculated on the basis of the taxi fare for 1 min in the OD matrix. While the function increased profitability through the surcharge, it was also designed to give a negative reward when unmatched by multiplying the 8 min waiting-time value by the surge. The reward can be expressed as shown in Equation (6):

R = P_fare \times S_a (if matched), \quad R = -W_t \times V_t \times S_a (if unmatched)   (6)

Learning was conducted separately for the cases in which the negative waiting-time value was applied and not applied. Through a comparison, the study identified the appropriate reward function.
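The reward rule described above can be sketched as follows; the fare and per-minute time value are placeholder numbers, not the study's OD-specific figures.

```python
# Sketch of the reward rule: a matched trip earns the base fare times the
# surge coefficient; an unmatched round incurs a negative reward equal to the
# 8 min waiting-time value times the surge. Fare inputs are placeholders.

WAIT_MIN = 8.0  # average late-night waiting time from the cited Seoul study

def reward(matched, p_fare, surge, v_t):
    """p_fare: base fare for the OD pair (KRW); v_t: fare value of 1 minute."""
    if matched:
        return p_fare * surge
    return -WAIT_MIN * v_t * surge

print(reward(True, p_fare=10000, surge=1.6, v_t=200))   # 16000.0
print(reward(False, p_fare=10000, surge=1.6, v_t=200))  # -2560.0
```

Note that the penalty scales with the surge itself, so an overly aggressive surge that goes unmatched is punished proportionally, which is what pushes the learner toward moderate region-specific surge levels.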

Q-Learning Algorithm
A total of 20,000 Q-learning iterations were conducted, taking the index value for each zone into account in the late-night period (Figure 3). In one episode, the goal was to identify the optimal surge for each time slot during 6 h of operation for each OD matrix. The Q-table included states (current state and price) for each time slot and action value by applying the concept of time. The travel time and price required to calculate the reward value (operating profit) for each state referred to a separate OD table. The weight values applied to the preference of drivers were the centrality indices for each time slot and zone. The reward function was created separately depending on the application of the time value.
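A skeleton of this learning loop is sketched below, with the environment's matching outcome stubbed by a toy reward that peaks at a surge of 1.6 (an assumption for illustration; the study's reward uses actual matching, fares, and waiting-time values, and runs 20,000 iterations per OD cell).

```python
# Skeleton of the episode loop: repeated episodes over the six late-night
# time rounds, with an epsilon-greedy choice over the 13 surge actions.
import random

random.seed(0)
SURGES = [round(0.6 + 0.2 * k, 1) for k in range(13)]   # 0.6 ... 3.0
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
N_EPISODES, N_TIME_ROUNDS = 2000, 6                     # study: 20,000 runs

Q = [[0.0] * len(SURGES) for _ in range(N_TIME_ROUNDS)]

def step(tr, surge):
    """Stubbed environment: reward peaks near a surge of 1.6 (illustrative)."""
    return -abs(surge - 1.6)

for _ in range(N_EPISODES):
    for tr in range(N_TIME_ROUNDS):
        if random.random() < EPSILON:                   # explore
            a = random.randrange(len(SURGES))
        else:                                           # exploit
            a = max(range(len(SURGES)), key=lambda i: Q[tr][i])
        r = step(tr, SURGES[a])
        future = max(Q[tr + 1]) if tr + 1 < N_TIME_ROUNDS else 0.0
        Q[tr][a] += ALPHA * (r + GAMMA * future - Q[tr][a])

best = SURGES[max(range(len(SURGES)), key=lambda i: Q[0][i])]
print(best)  # the stub's optimum surge, 1.6, is recovered
```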

The Analysis Result of Areas Expected to Be Marginalized in Ridesharing Services
In order to identify the indices to be applied to reinforcement learning, the taxi operation data in Seoul were categorized into different zones according to 25 districts (gu), and the OD matrix was formed on the basis of the volume of pick-ups and drop-offs using district codes. Consequently, the study analyzed centrality and tried to identify a standardized index for each zone to be applied to reinforcement learning. The origin (the number of pick-ups) was compiled by matching administrative districts through the GIS spatial join function, while the destination (the number of drop-offs) was compiled by using the district(gu) code information for each vehicle departure. Any data including areas surrounding Seoul were excluded.

Using the OD matrix, degree centrality analysis for each zone was conducted, and the average in-degree and out-degree centrality was calculated for weekdays and weekends from 10:00 p.m. to 4:00 a.m. (Figures 4 and 5). According to the analysis, in-degree centrality displayed similar patterns on weekends and weekdays, with smaller regional deviation than out-degree centrality. Out-degree centrality was high in the Central Business Districts (CBDs) on weekdays and similarly high on weekends.
Areas located on the outskirts of the city had a lower degree centrality. In particular, outskirt areas showed lower out-degree centrality on weekdays. Residential areas on the outskirts of Seoul showed higher in-degree centrality. On the other hand, central and subcentral regions of the city showed higher out-degree centrality ( Figure 6). The patterns were similar to those found in previous studies showing the concentration of taxi pick-ups during late-night hours near CBDs [39]. There were large differences in out-degree centrality by zone before and after midnight when most public transportation services were suspended, with the difference growing as the night went on.

Results of Reinforcement Learning Simulation
The algorithm presented in Figure 3 was used in the learning of 600 OD cells. When the reward function was defined by a simple matching rate, there was a limitation in comparison among regions as the fare and travel time for each OD were different. Therefore, the simulation implemented a reward value considering travel time and cost. Moreover, analysis was conducted depending on the application of a negative reward value in the unmatched condition. The matching rate for each learning step when a waiting time with a negative reward was applied (alt1) or not (alt2) is depicted in Figure 7.

When the negative reward was considered, the matching rate increased before converging as learning was reiterated. When it was not considered, the matching rate decreased before converging. According to this result, it can be concluded that the application of a negative reward was necessary when the reward function used the travel cost but not the matching rate. However, the current negative reward for the waiting time used a fixed value from a previous study; thus, for real-world applications, the expected waiting time according to a driver's choice probability should be calculated for each OD.
Using the deduced average surge, when a fare was calculated for each OD, the distribution could be expressed as shown in Figure 8, where the x-axis is the travel distance between origin and destination, and the y-axis is the price. Then, price levels were compared in terms of the base fare, base fare with late-night surcharge (additional 20% to base fare), late-night surcharge + ride-hail fee (KRW 3000, the highest fee charged by existing platform companies), and surge price (the outcome of learning for each OD). For travel distances shorter than 10 km, the differences among fares were not severe. However, the gap grew wider as the distance increased (see Figure 8a). This is because the late-night surcharge + ride-hail fee (yellow dotted line) was a flat fee regardless of traveling distance, while the surge fare (blue dotted line) was determined as a ratio, so its absolute price increased with traveling distance. Table 1 displays the three average prices for different distances, showing the price gap widening as the distance increased.
Electronics 2020, 9, x FOR PEER REVIEW 10 of 14
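The flat-fee versus ratio effect can be reproduced with a small calculation. The distance-fare function below is an approximation of 2020 Seoul taxi rates (KRW 3800 for the first 2 km, KRW 100 per additional 132 m, time-based charges omitted) and is an assumption for illustration; the 20% surcharge, KRW 3000 fee, and 1.6 average surge come from the text.

```python
def base_fare(distance_km):
    """Approximate distance fare: KRW 3800 for the first 2 km,
    then KRW 100 per additional 132 m (time charges omitted)."""
    extra_m = max(0.0, distance_km - 2.0) * 1000
    return 3800 + 100 * int(extra_m // 132)

def late_night(distance_km):           # base fare + 20% late-night surcharge
    return round(base_fare(distance_km) * 1.2)

def late_night_plus_fee(distance_km):  # surcharge + flat KRW 3000 ride-hail fee
    return late_night(distance_km) + 3000

def surge_fare(distance_km, surge=1.6):  # learned average multiplier on base fare
    return round(base_fare(distance_km) * surge)

for d in (3, 10, 20):
    print(d, "km:", late_night_plus_fee(d), "vs surge", surge_fare(d))
```

Because the KRW 3000 fee is flat while the surge scales with the fare, the surge price starts below the fee-based price on short trips and overtakes it as distance grows, which is the widening gap seen in Figure 8a and Table 1.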

Changes in Centrality after Surge-Driven Supply Increase
Centrality analysis was carried out after recalculating the traffic volume in accordance with the surge previously identified for each OD matrix and time slot. The traffic volume was recalculated as follows:

Nvol_{i,j} = (S_{i,j} × Ovol_{i,j}) / Σ_{i,j} Ovol_{i,j}

where i and j are the origin and destination, Nvol_{i,j} is the recalculated traffic volume, Ovol_{i,j} is the previous traffic volume, S_{i,j} is the optimal surge for travel (the optimal surge for each OD as identified from reinforcement learning), and Σ_{i,j} Ovol_{i,j} is the sum of traffic volume. In centrality analysis, the indicator evaluating the volume of vehicles traveling to a certain zone is the in-degree centrality; when in-degree centrality increases, supply toward that region is increasing. Reflecting the improvement in spatial equity by region, in-degree centrality decreased in the central and subcentral regions of the city, such as Gangnam-gu and Jongno-gu, while it increased in residential areas on the outskirts of the city, such as Songpa-gu, Gangseo-gu, Dongjak-gu, and Nowon-gu (Figure 9).
When this was overlaid on top of the previous hotspot analysis, it was found that in-degree centrality improved drastically in districts (gu) located on the outskirts of Seoul Metropolitan City where the number of drop-offs or vacant vehicles was greater than the number of pick-ups (Figure 10b). As a result, vehicle supply can be expected to increase by as much as 7.5% when a higher surge is applied to regions predicted to have a lower choice probability from the drivers' perspective. At the same time, equity in service supply would be improved by reducing the waiting time of passengers in marginalized regions with low taxi demand.
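The recalculation and the in-degree reading can be sketched as follows. This is a hedged toy example under one plausible interpretation of the definitions above: each OD volume is weighted by its optimal surge and normalised by the total original volume, and weighted in-degree centrality for a zone is the column sum of the resulting OD matrix. The 3-zone matrices are invented for illustration.

```python
def recalc_volume(ovol, surge):
    """Nvol[i][j] = S[i][j] * Ovol[i][j] / sum of all Ovol (one reading)."""
    total = sum(sum(row) for row in ovol)
    n = len(ovol)
    return [[surge[i][j] * ovol[i][j] / total for j in range(n)]
            for i in range(n)]

def in_degree(vol):
    """Weighted in-degree centrality: inbound volume per destination zone."""
    n = len(vol)
    return [sum(vol[i][j] for i in range(n)) for j in range(n)]

# Toy 3-zone OD matrix; zone 2 plays the "outskirts" role and gets a
# higher learned surge on inbound trips.
ovol = [[0, 40, 10],
        [50, 0, 10],
        [20, 20, 0]]
surge = [[1.0, 1.1, 1.8],
         [1.1, 1.0, 1.8],
         [1.2, 1.2, 1.0]]
nvol = recalc_volume(ovol, surge)
print(in_degree(nvol))
```

With the higher inbound surge, the outskirts zone's weighted in-degree rises relative to the unweighted case, which is the directional shift the paper reports for districts such as Songpa-gu and Nowon-gu.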

Conclusions
As ridesharing (including taxi) services are often run by private companies, profitability is the top priority in operation. This leads to an increase in drivers' refusal to take passengers to areas with low demand, where they have difficulties finding subsequent passengers, causing problems such as extended waiting times for passengers hailing a vehicle bound for these regions. This problem differs by time and region: in late-night hours and in suburban regions, the imbalance between supply and demand widens especially, worsening the problem. To address this problem, this study proposed a dynamic pricing strategy based on a reinforcement learning algorithm.
This study used Seoul city's taxi data to find appropriate fare surge rates for ridesharing services between 10:00 p.m. and 4:00 a.m. In reinforcement learning, the outcome of centrality analysis was applied as the weight affecting drivers' destination choice probability. Moreover, the reward function used during learning was adjusted according to whether or not a passenger waiting time value was applied. Profit was used as the reward value. By applying a negative reward for the passenger's waiting time, a more appropriate surge fare level could be identified. Across the region, the average surge level amounted to 1.6. Areas located on the outskirts of the city in predominantly residential districts, such as Gangdong-gu, Dongjak-gu, Eunpyeong-gu, and Gangseo-gu, showed a higher surge. On the contrary, central areas, such as Gangnam-gu, Jongno-gu, and Jung-gu, had a lower surge. The findings showed that the supply of ridesharing services in low-demand regions could be increased by as much as 7.5% using surge fares, thereby reducing regional equity problems related to ridesharing services in Seoul to a great extent.
This study conducted a reinforcement learning-based dynamic pricing simulation to respond to the regional equity problem of ridesharing (including taxi) services in Seoul. A novel approach was presented using dynamic pricing as a way to mitigate the spatial equity problem by affecting ridesharing supply, unlike most previous dynamic pricing studies, which simply targeted higher profitability. Notably, it was shown that a surge-based change in fares could reduce the indirect refusal of drivers to take passengers to unpreferred areas. With additional real-time ridesharing user data, the Deep Q-Network (DQN) technique could be adopted to conduct a smaller-scale spatial analysis of ridesharing services. Furthermore, with more knowledge of fare sensitivity by user group, the dynamic pricing approach proposed in this study can contribute significantly to resolving the spatial equity problem in mobility services in the future.