Mining missing train logs from Smart Card data
Introduction
Buses and trains are governed by operation plans, including schedule timetables and headways, on which each passenger relies to plan his or her route. Yet road congestion, increased dwell times, accidents, and other factors can all distort a planned operation and degrade the reliability of public transportation services. As a result, we often find ourselves stranded at a station or inside a packed car.
According to KFH Group (2013), transit service reliability is defined as “how often service is provided when promised” and is measured by on-time performance, headway adherence, missed trips, and distance traveled between mechanical breakdowns. Hence the availability of a log, i.e., the set of records indicating the actual arrival and departure times of each bus or train at each station, is critical for reliability analysis.
The introduction of automatic passenger counters (APC) and automatic vehicle locators (AVL) for buses, and of train control systems including automatic train operation (ATO), has made precise logs of public transport available. Hammerle et al. (2005) and El-Geneidy et al. (2011), for instance, employed logs obtained from APC and AVL in evaluating on-time performance and headway adherence of buses in Chicago and Minneapolis, respectively. Logs, however, are not always available or, when available, complete. New York City Transit lacked any system to acquire logs of its vehicles in 2000 (Doyle, 2000). Zurich, though it had installed AVLs on all of its buses, had equipped only 10% of them with APCs by 2009 (Orth et al., 2012). Some of Seoul’s early metro stations kept no logs; others archived them incompletely or on unstable media, such as magnetic tapes, that later deteriorated beyond repair.
Fig. 1 shows the metro network of the Seoul metropolitan area as of November 2011. The black region includes 245 stations that maintain complete logs. The logs of the remaining 248 stations are missing partially or entirely. Only 6 of the 15 lines maintain complete logs for each train they operate.
This paper aims to recover missing logs from the data of the Smart Card Automated Fare Collection System, or Smart Card. In Seoul, after 9 years of practice, Smart Card has become the sole method of payment for the city’s metro as of 2009. Because the data are massive and cover comprehensive categories, many studies have sought to realize their full potential (Pelletier et al., 2011): O–D matrix estimation (Cui, 2006; Trépanier et al., 2007; Munizaga and Palma, 2012), route choice estimation (Kusakabe et al., 2010; Hong et al., 2015), transit behavior analysis (Lathia and Capra, 2011; Ma et al., 2013; Kim et al., 2015), and demand-driven timetabling (Niu and Zhou, 2013; Sun et al., 2014) are a few of many examples.
Smart Card data include, for each trip, the quadruple (Departure station O, Entry time at gate, Arrival station D, Exit time at gate), in addition to vehicle ID, fare, distance traveled, and seniority/disability status of the card holder. In this paper, we use only the set of quadruples, which is the minimum data needed to reliably reconstruct logs as well as the maximum we can expect to obtain from any metro network employing a Smart Card system.
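As an illustration of this input, the quadruple could be held in a small record type and the exit times grouped by arrival station. This is only a minimal sketch: the class name `Trip`, its field names, the helper `exit_times_by_station`, and the use of seconds since midnight are assumptions made for the example, not part of the Smart Card data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trip:
    """One Smart Card quadruple (hypothetical field names)."""
    origin: str       # departure station O
    entry_time: int   # entry time at gate, seconds since midnight (assumed unit)
    destination: str  # arrival station D
    exit_time: int    # exit time at gate, seconds since midnight (assumed unit)


def exit_times_by_station(trips):
    """Group exit times by arrival station, each list sorted ascending."""
    grouped = defaultdict(list)
    for trip in trips:
        grouped[trip.destination].append(trip.exit_time)
    return {station: sorted(times) for station, times in grouped.items()}


# Made-up sample data, for illustration only.
trips = [
    Trip("A", 28800, "C", 29700),
    Trip("B", 29000, "C", 29690),
    Trip("A", 28900, "B", 29300),
]
grouped = exit_times_by_station(trips)
```

Here `grouped["C"]` holds the sorted exit times of the two passengers who alighted at station C.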
Our procedure reconstructs each train as a sequence of the earliest exit times, called S-epochs, among its alighting passengers at each station. Thus we need the exit times of alighting passengers sorted by train at every station. However, the exit times in Smart Card data are tag times recorded in the order in which passengers arrive at a gate, and a gate is often shared by trains from different directions (inbound and outbound) and, at a transfer station, even from different lines. Hence, in the first step, we extract from Smart Card data a considerable set of passengers whose routing choices are easily traced from their origin and destination pairs. Such passengers, known as reference passengers, give us, with certainty, the directions of their trains and the lines to which they belong, if not the exact trains themselves.
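One way to picture the extraction of reference passengers is the sketch below. The criterion it uses (origin and destination are both served by the same single line, so neither is a transfer station) is an assumption chosen for illustration; this section does not state the exact rule used by the procedure. The function name `reference_direction`, the toy network, and the convention that increasing station index means "outbound" are likewise hypothetical.

```python
def reference_direction(origin, destination, lines):
    """Return (line, direction) if the O-D pair has one obvious route, else None.

    `lines` maps a line name to its ordered list of stations.
    """
    def lines_of(station):
        return [name for name, stations in lines.items() if station in stations]

    o_lines, d_lines = lines_of(origin), lines_of(destination)
    # Assumed criterion: both stations are served by the same single line.
    if origin == destination or len(o_lines) != 1 or o_lines != d_lines:
        return None
    line = o_lines[0]
    stations = lines[line]
    # Assumed convention: travelling toward higher indices is "outbound".
    if stations.index(origin) < stations.index(destination):
        return line, "outbound"
    return line, "inbound"


# Toy network: station X is a transfer station shared by two lines.
lines = {
    "L7": ["A", "B", "C", "X"],
    "L8": ["X", "Y", "Z"],
}
```

Under these assumptions a passenger travelling from A to C is a reference passenger on line L7, whereas one travelling from A to Y, or alighting at the transfer station X, is not.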
Based on the exit times of reference passengers, the procedure then computes a set of tentative S-epochs using a detection measure whose validity relies on an extreme-value characteristic of the platform-to-gate movement of alighting passengers: the moments of exit at gates are closely grouped for passengers from the same train. Each tentative S-epoch is then either finalized as a true S-epoch or rejected, based on its consistency with bounds and/or interpolation from prescribed S-epochs of adjacent trains and stations. The arrival times of trains can then be recovered by shifting the finalized S-epochs backward in time by the platform-to-gate time of each station.
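The idea that exits from the same train are closely grouped can be sketched with a simple gap threshold. Note that this is a stand-in technique, not the detection measure of the procedure, which is not specified in this section; the function names, the `min_gap` parameter, and the sample values are all hypothetical. The input is assumed to be the exit times of reference passengers for one direction at one station.

```python
def tentative_s_epochs(exit_times, min_gap=60):
    """Return the first exit time of each burst of closely grouped exits.

    A new burst starts whenever the gap to the previous exit exceeds
    `min_gap` seconds (a hypothetical threshold, not taken from the paper).
    """
    epochs = []
    previous = None
    for t in sorted(exit_times):
        if previous is None or t - previous > min_gap:
            epochs.append(t)
        previous = t
    return epochs


def arrival_times(s_epochs, platform_to_gate):
    """Shift S-epochs backward by the station's platform-to-gate time."""
    return [t - platform_to_gate for t in s_epochs]


# Made-up exit times in seconds: two bursts, i.e. two trains.
exits = [1000, 1004, 1010, 1015, 1300, 1303, 1312]
epochs = tentative_s_epochs(exits, min_gap=60)
arrivals = arrival_times(epochs, platform_to_gate=45)
```

In this toy input the earliest exit of each burst becomes a tentative S-epoch, and subtracting an assumed platform-to-gate time of 45 s gives the estimated arrival times. The consistency checks against adjacent trains and stations are not modeled here.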
Tested on instances from three entire metro lines of the Seoul metropolitan area with varying degrees of missing logs, the mining procedure computed the arrival times of 95% of trains to within 24 s for local-only lines and to within 45 s for lines operating local and express trains, even when the logs were missing for an entire line. In the reliability analysis currently practiced by the city of Seoul, a train is defined to be delayed if it is late by more than 10 min. Therefore, our level of error, less than 8% of the threshold even in the worst case, seems acceptable for the recovered log to replace the real one.
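For reference, the ratio of the error bounds to the delay threshold can be worked out directly, taking the 10 min threshold as 600 s:

```latex
\[
\frac{45\ \text{s}}{600\ \text{s}} = 7.5\% < 8\%,
\qquad
\frac{24\ \text{s}}{600\ \text{s}} = 4\%.
\]
```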
This paper is organized as follows. Section 2 describes the mining procedure: S-epochs, reference passengers, the extreme-value nature of the platform-to-gate times of metro passengers that enables the detection measure of S-epochs, and the consistency checks that finalize the tentative S-epochs, along with their application to a real instance. Section 3 describes how the procedure can be extended to mining trains under special operation strategies, including short-turning and skip-stop. Section 4 evaluates the performance, especially the accuracy, of the procedure. Finally, Section 5 provides some concluding remarks and possible further research.
Mining procedure
Our objective is to recover the missing arrival times of a train on a given metro line using the quadruples (Departure station O, Entry time at gate, Arrival station D, Exit time at gate) of metro passengers obtained from Smart Card data. The proposed mining procedure first recovers each missing train X as the sequence of S-epochs, the earliest exit times of its passengers at the gates of the stations N at which X stops, which are then shifted backward in time into arrival times. The main steps of the
When logs are missing for an entire line
In this case, we need substitutes for the initial prescribed S-epochs on which we applied the consistency checks in Section 2.4. We choose the station with the largest set of tentative S-epochs computed in Section 2.3. A tie is broken by selecting the station with the most alighting reference passengers. Then, regarding the tentative S-epochs of the chosen station as true S-epochs, we apply the mining procedure in Fig. 2 from Section 2.4.
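The selection rule described above can be sketched as a single `max` over stations with a two-part key. The function name `choose_seed_station`, the dictionary layout, and the sample values are hypothetical and serve only to illustrate the tie-breaking order.

```python
def choose_seed_station(tentative_epochs, reference_alightings):
    """Pick the station with the most tentative S-epochs.

    Ties are broken by the larger number of alighting reference passengers.
    `tentative_epochs` maps station -> list of tentative S-epochs;
    `reference_alightings` maps station -> count of reference passengers.
    """
    return max(
        tentative_epochs,
        key=lambda s: (len(tentative_epochs[s]), reference_alightings.get(s, 0)),
    )


# Made-up values: A and B tie on the number of tentative S-epochs.
tentative_epochs = {
    "A": [100, 400, 700],
    "B": [130, 430, 730],
    "C": [160, 460],
}
reference_alightings = {"A": 120, "B": 340, "C": 500}
seed = choose_seed_station(tentative_epochs, reference_alightings)
```

Stations A and B both have three tentative S-epochs, so the tie goes to B, which has more alighting reference passengers; C has the most passengers overall but fewer tentative S-epochs and is not chosen.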
We examine each train from the returned log, recovered as
Test sets
The procedure was tested on three entire lines for a total of 12 days: Line 7 on November 21, 2011 with varying degrees of missing logs (Sets 1, 2, and 3), Line 8 on the 5 weekdays from November 21 to November 25, 2011 (Set 4) and on the weekend days of November 20 and November 26, 2011 (Set 5), and Line 9 on May 11, 2015 for its local (Set 6) and express (Set 7) trains.
We chose the three lines mainly for their complete logs which made available exact arrival times of trains, S-epochs, and the minimum
Concluding remarks and further research
We have proposed a mining procedure that recovers the arrival times of trains from the exit times of metro passengers typically available from Smart Card data. The procedure was tested on 12 daily sets of trains, with varying degrees of missing logs, from 3 entire lines of the Seoul metropolitan area. Even when the logs were missing for an entire line, the procedure recovered the arrival times of 95% of trains to within 24 s for lines operating a single type of train and 45 s for those
Acknowledgments
This research was supported in part by the Basic Science Research Program (2014R1A2A1A11049663) of the National Research Foundation of Korea (NRF). The authors thank Eric Hong for his meticulous, and somewhat commanding, revision.
References (18)
- et al., 2015. Does crowding affect the path choice of metro passengers? Transp. Res. Part A: Policy Pract.
- et al., 2013. Mining smart card data for transit riders travel patterns. Transp. Res. Part C: Emerg. Technol.
- et al., 2012. Estimation of a disaggregate multimodal public transport Origin–Destination matrix from passive smartcard data from Santiago, Chile. Transp. Res. Part C: Emerg. Technol.
- et al., 2013. Optimizing urban rail timetable under time-dependent demand and oversaturated conditions. Transp. Res. Part C: Emerg. Technol.
- et al., 2011. Smart card data use in public transit: a literature review. Transp. Res. Part C
- et al., 2014. Demand-driven timetable design for metro services. Transp. Res. Part C: Emerg. Technol.
- Cui, A., 2006. Bus Passenger Origin-destination Matrix Estimation using Automated Data Collection Systems. MS Thesis....
- Doyle, M., 2000. Timing is Everything – A Field Study of Subway Service Reliability. New York City Transit Riders...
- et al., 2008. Records in athletics through extreme-value theory. J. Am. Stat. Assoc.
1 Current address: Software Solution Laboratory, Samsung Advanced Institute of Technology, Republic of Korea.
2 Current address: Policy-Technology Convergence Research Division, Korea Railroad Research Institute, Republic of Korea.