A time-use activity-pattern recognition model for activity-based travel demand modeling

This study develops a new comprehensive pattern recognition modeling framework that leverages activity data to derive clusters of homogeneous daily activity patterns, for use in activity-based travel demand modeling. The pattern recognition model is applied to time use data from the large Halifax STAR household travel diary survey. Several machine learning techniques not previously employed in travel behavior analysis are used within the pattern recognition modeling framework. Pattern complexity of activity sequences in the dataset was recognized using the FCM algorithm, and resulted in identification of twelve unique clusters of homogeneous daily activity patterns. We then analysed inter-dependencies in each identified cluster and characterized the cluster memberships through their socio-demographic attributes using the CART classifier. Based on the socio-demographic characteristics of individuals we were able to correctly identify which cluster individuals belonged to, and also predict various information related to their activities, such as start time, duration, travel distance, and travel mode, for use in activity-based travel demand modeling. To execute the pattern recognition model, the 24-h activity patterns are split into 288 three dimensional 5 min intervals. Each interval includes information on activity types, duration, start time, location, and travel mode if applicable. Results from aggregated statistical evaluation and Kolmogorov–Smirnov tests indicate that there is heterogeneous diversity among identified clusters in terms of temporal distribution, and substantial differences in a variety of socio-demographic variables. The homogeneous clusters identified in this study may be used to more accurately predict the scheduling behavior of specific population groups in activity-based modeling, and hence to improve prediction of the times and locations of their travel demands. 
Finally, the results of this study are expected to be implemented within the activity-based travel demand model, Scheduler for Activities, Locations, and Travel (SALT).


Introduction
In recent years, disaggregate travel demand models have begun to be employed for travel demand forecasting purposes. These models improve upon the traditional four-stage modeling method, since they are able to more accurately capture the effects of elements that influence travel behavior and time allocation, such as socio-demographic attributes. More recently, the activity-based modeling approach, along with other disaggregate travel demand modeling methods such as trip-based modeling, has become more popular and commonly used in both the academic and practitioner sectors. To date, numerous activity-based models have been developed, such as Recker et al. (1986a, b), Garling et al. (1989), Arentze and Timmermans (2000), Bowman and Ben-Akiva (2001), Vovsha et al. (2002), Miller and Roorda (2003), and Bhat et al. (2004). Activity-based models emphasize that travel is a derived demand, originating from the need of an individual to participate in activities. Activity-based models work on constructing the 24-h activity schedule and the associated travel linked with activities performed by individuals. Most activity-based models comprise the following universal modules: activity generator and scheduler, tour and trip time of day, tour and trip mode choice, tour and trip destination choice, and network assignment.
For many years, researchers and practitioners have employed different approaches for development of activity-based models. Rasouli and Timmermans (2014) reviewed recent work on activity-based models of travel demand, and argued that these models can be classified into three main concept categories: (1) constraint-based models, (2) discrete choice models, and (3) computational process models. The constraint-based models consider possible travel patterns with respect to a set of space-time constraints. Lenntorp (1977), Kwan (1997), and Button (1985) are some examples of activity-based models developed through the constraint-based modeling approach. The second category, discrete choice models (also known as econometric models), treats activity patterns as the consequence of utility-maximizing decisions. Bowman and Ben-Akiva (2001), Vovsha et al. (2002), and CEMDAP (2004) are some examples of activity-based models developed through the econometric approach. Finally, the computational process models (also known as rule-based models) simulate and model activity patterns through computational processes. Garling et al. (1989), Arentze and Timmermans (2000), and Miller and Roorda (2003) are some examples of activity-based models developed through the rule-based modeling approach.
Some researchers have incorporated both econometric and rule-based modeling approaches in the activity-based modeling framework. The reason was to increase the computational efficiency of the model and the accuracy of its outputs. In particular, Auld and Mohammadian (2009) employed a rule-based technique to resolve conflicts in the activity scheduling phase of the ADAPTS econometric activity-based model. Another example is the TASHA model, where Miller and Roorda (2003) utilized an econometric modeling approach for tour formation in their rule-based model. In very recent research, borrowing from the computer science field, researchers have used machine learning techniques to develop different components of activity-based models. However, there have been very limited applications of such techniques in activity-based modeling. For instance, the K-means clustering technique has been used in a pattern-recognition modeling framework (Jiang et al. 2012; Allahviranloo et al. 2016) and the support vector machine (SVM) has been used in a daily activity sequence recognition process (Allahviranloo and Recker 2013).
In this study we develop a new solution method for the activity generation module in activity-based travel demand models. We build on progress in activity generation modules by developing a new comprehensive pattern recognition modeling framework which leverages activity data to derive clusters of homogeneous daily activity patterns. Each cluster produces vital information such as activity type, start time, end time, duration probability distribution, and sequential arrangement of activities. Our prime contention is that generating more accurate activity patterns is a significant step in decreasing uncertainty in forecasting the individual's activity engagement decisions and moving current activity-based models closer to replication of reality. Three-dimensional 5-min intervals are used as the basic analysis unit in this study. Several machine learning techniques not previously employed in travel behavior analysis (the fuzzy c-means (FCM) clustering algorithm and the CART classifier) are employed in the pattern recognition framework. This study contributes by providing additional insights into the linkage between activity generation and activity scheduling modules in the overall activity-based travel demand modeling framework. Furthermore, the proposed modeling framework may be applied to any application that contains a group of linked sequences, such as day-to-day variations in transit ridership or station demand at the individual level. Finally, the results of this study are expected to be incorporated within the activity-based travel demand model, Scheduler for Activities, Locations, and Travel (SALT), for Halifax Regional Municipality (HRM), Nova Scotia, which is currently under development.
The remainder of the paper is structured as follows. First, we provide a review of relevant past research concerning activity generation modules in the activity-based modeling framework. Second, we discuss the data used and the data transformation necessary for pattern recognition, followed by presentation of the pattern recognition methods and discussion of model results. The paper concludes by providing a summary of contributions and future research directions.

Literature review
Activity generation modules can play an important role in every activity-based modeling framework. Prediction accuracy of individual travel behavior depends on actual information drawn from activity generation modules. Thus, producing more accurate and homogeneous information from this module will result in increasing prediction accuracy in activity-based travel demand modeling. Since 1970, when Hagerstrand developed the preliminary activity-based model (Hagerstrand 1970), researchers have used several different approaches for the activity generation modules of activity-based models, such as empirical data analysis, decision trees, and hazard functions (Arentze and Timmermans 2000; Auld and Mohammadian 2009; Recker et al. 1986a, b; Miller and Roorda 2003).
The activity generation module in the ALBATROSS (A Learning-BAsed TRansportation Oriented Simulation System) model consists of an integrated decision-making heuristic with learning mechanisms (Arentze and Timmermans 2000). Initially, activities are classed into two sets: fixed and flexible activities. The model then produces the fixed activities and passes them, along with their temporal and spatial information, to the scheduler module of the algorithm. The flexible activities are added to schedules based on their order of priority and choice of time-of-day. A hazard-based formulation is employed in the activity generation module in the agent-based dynamic activity planning and travel scheduling (ADAPTS) model (Auld and Mohammadian 2009). Activities with similar features are recognized and essential information such as activity start times and durations is generated. The activity generation module in the simulation of travel/activity responses to complex household interactive logistic decisions (STARCHILD) model comprises three successive phases (Recker et al. 1986a, b). At the beginning, the algorithm generates all the feasible travel activity patterns. Next, all possible activity-travel patterns are found and grouped. Lastly, representative patterns in each group are identified using the logit choice model. A series of empirical data analyses are employed in the activity generation module of the travel and activity scheduler for household agents (TASHA) model (Miller and Roorda 2003). Populations are grouped by similar distributions of start time, activity type, and frequency, based on different sets of explanatory variables such as age, gender, income, and occupation.
One limitation of using explanatory variables for grouping the population is that different sets of explanatory variables result in different groupings and generate different probability distributions. Therefore, it is challenging and time consuming to find which set of explanatory variables has the best fit for grouping the population in the dataset. As activity generation modules have a direct effect on prediction accuracy, it is important that populations are grouped with the most similar characteristics. In recent years, machine learning techniques have brought new insight into the modeling process of different components of activity-based modeling. For instance, Jiang et al. (2012) and Allahviranloo et al. (2016) employed the K-means clustering technique for activity pattern recognition. Another example is the application of the SVM technique in the activity scheduling process (Allahviranloo and Recker 2013). Machine learning is a computationally fast, straightforward, and reproducible technique that, without being explicitly programmed, is able to learn to recognize complex patterns and make intelligent decisions based on the training data (Bishop 2007; Kubat 2015). While machine learning is well known in the computer science field, there have been only limited applications of the technique in activity-based travel demand modeling, and mainly in the activity generation process. For instance, machine learning techniques can be used to solve a sequence alignment problem (Joh et al. 2002). Some of these research efforts are reviewed below.
Through aggregation of statistical learning methods and data mining, Jiang et al. (2012) proposed a new modeling framework for clustering daily patterns of individual activities. Numerous machine learning techniques such as the K-means clustering algorithm and principal component analysis (PCA) were employed to explore the inherent daily activity structure, and populations were clustered based on the similarity of their activities. The modeling framework was implemented for both weekday and weekend data. Their research findings enhance the traditional population divisions into workers, students, and nonworkers. In their proposed modeling framework, individuals are clustered based on their activity similarity rather than by explanatory variables such as age or occupation. Liu et al. (2015) employed profile hidden Markov models (HMM) to augment the sequence alignment method (SAM). Their argument is that the SAM alone cannot capture infrequent activities and their related travel episodes. Consequently, they added a supplementary phase to the SAM by converting multiple alignments into a position-specific counting system to capture the probability of all infrequent and frequent activities in the data. A two-stage clustering technique to infer activity time windows was developed by Allahviranloo et al. (2016). Activity pattern recognition is accomplished using aggregation of K-means clustering and affinity propagation methods, in order to capture both frequent and infrequent activities. They extended their work to discover differences between activity patterns by employing the SAM and agenda dissimilarity distance measurement methods. They found that the scheduler performed better when it used the clustered data compared with un-clustered data. Their proposed method clustered populations into eight clusters. In another study, Li and Lee (2017) utilized probabilistic context-free grammars in the modeling and learning of daily activity patterns. Saneinejad and Roorda (2009) measured similarities between routine weekly activity sequences by utilizing multiple sequence alignment methods.
This study addresses the above-mentioned limitations of activity generation modules, namely the reliance on explanatory variables for grouping the population and the difficulty of recognizing infrequent activities in the overall activity-based modeling framework. We tackle the problem from a new standpoint, through development of a new comprehensive pattern-recognition modeling framework that leverages activity data to derive clusters of homogeneous daily activity patterns. Each particular cluster produces essential information such as activity type, start time, end time, duration probability distribution, and sequential arrangement of activities. Application of this new framework to activity-based modeling not only reveals the strength of machine learning in identifying homogeneous clusters, but also yields additional insights into the linkage between two critical activity-based model modules, namely activity generation and activity scheduling.

Data
In this section, an overall description of the space-time activity research (STAR) survey data and data processing steps is presented. This study uses time-diary and GPS geo-coordinate data from the STAR survey conducted in Halifax, Canada. The STAR survey was a combined household activity survey and travel survey, and the world's first large-scale employment of global positioning system (GPS) technology for tracking and verification of out-of-home activities. A brief description follows, and full descriptions of the survey design and the socio-demographic characteristics of respondents can be found in TURP (2008) and Millward and Spinney (2011).
The Halifax STAR project collected survey data from 1971 randomly selected households in Halifax Regional Municipality (HRM) between April 2007 and May 2008. The survey collected fully geo-referenced 2-day (i.e. 48-h) time diary data from a randomly selected primary respondent aged 15 years or older within each household. Primary respondents carried a GPS data logger (Hewlett Packard iPAQ hw6955) for a 48-h reporting period, maintained a daily "activity log" during that period, and completed a computer-assisted telephone interview (CATI) time-diary survey the day after the 2-day reporting period had ended. The respondents' descriptions of their out-of-home activities were prompted and validated by the GPS data.
The original STAR dataset included 188 activity sub-categories defined under ten main activity categories. The activity codes utilized were those employed in Statistics Canada's time-use surveys, and relate to the prime purpose of the activity. In addition, entertainment activities were defined by passive attendance, whereas both sports and hobbies were identified by active participation in the activity.

Data processing
The data processing consisted of three steps. The first step was to identify and eliminate data for non-working days from the STAR survey data. The second step was to clean the database of any missing values to ensure validity, uniformity, and consistency. The resulting data set comprised 2778 working person-days (1389 individuals, 2 days each). Note that the current research did not consider the possible temporal correlation of activity sequences between two consecutive days, and treated all person-days as independent samples. As mentioned earlier, a 5-min interval was used as the basic time unit in this study. Therefore, we rounded all time values up or down so that they were evenly divisible by five. The final step was to re-categorize the original 188 activity categories. To align with the transportation planning literature and urban studies (Ben-Akiva and Bowman 1998; Bhat et al. 2004), and based on similarities between some of the primary activities, the original activities were aggregated into nine activity categories, as shown in Table 1. Travel episodes are categorized as a separate time-use category in this study. This feature allows the model to be updated with new data on congested travel times. The new travel times may be measured by operating the activity-based travel demand model in conjunction with a congestion index. In addition, for the purpose of this study, we categorized in-home activities into three major categories.

Data transformation
Prior to implementing pattern recognition techniques, it was essential to transform the activity survey data. The data transformation process is illustrated in Fig. 1. The 24 h were split into 288 5-min intervals, and each interval has three dimensions. The first dimension comprises temporal information on activities: each of the 288 cells was coded with one of the nine major categories as defined in Table 1. The second dimension comprises socio-demographic characteristics related to activities, and the third dimension contains spatial information associated with activities. In the current study, we have utilized only the temporal dimension in the modeling framework. However, the other dimensions can also be used for cluster analysis. For example, the spatial information dimension can be used to identify whether there are differences between clusters in terms of daily spatial dynamics (e.g. home-work distance). Consequently, the preliminary transformed matrix dimensions would be 288 × 2778 × 2778. The ultimate step in the data transformation process was to transform the activity survey data to binary format. Each of the nine major activity categories was transformed to a "1" or "0" binary code, such that if the individual participated in the activity in the particular time interval a code of "1" was recorded, and otherwise "0".
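The binary coding of the temporal dimension can be illustrated with a minimal sketch. It assumes a hypothetical episode format of (start minute, end minute, category index) with clock times already rounded to multiples of five, integer category indices 0 to 8 standing in for the Table 1 categories, and a diary day that starts at 4:00 a.m.; none of these names come from the STAR data files.

```python
N_INTERVALS = 288        # 24 h divided into 5-min intervals
N_CATEGORIES = 9         # nine aggregated activity categories (Table 1)
DAY_START_MIN = 4 * 60   # assumption: the diary day starts at 4:00 a.m.


def interval_index(clock_min):
    """Map minutes after midnight to a 5-min interval index counted from 4:00 a.m."""
    return ((clock_min - DAY_START_MIN) % 1440) // 5


def encode_person_day(episodes):
    """Turn (start_min, end_min, category) episodes into a 288 x 9 grid of 0/1 codes.

    An episode with end_min == start_min is treated as empty in this sketch.
    """
    grid = [[0] * N_CATEGORIES for _ in range(N_INTERVALS)]
    for start, end, category in episodes:
        first = interval_index(start)
        n_cells = ((end - start) % 1440) // 5
        for k in range(n_cells):
            grid[(first + k) % N_INTERVALS][category] = 1
    return grid


def flatten(grid):
    """Flatten the 288 x 9 grid into one 2592-element binary vector."""
    return [value for row in grid for value in row]


# Hypothetical person-day: category 0 until 8:00, category 3 until 17:00, then category 0.
example = encode_person_day([(240, 480, 0), (480, 1020, 3), (1020, 240, 0)])
vector = flatten(example)
```

Each flattened vector has 288 × 9 = 2592 elements, which is the form of person-day data point that a clustering step can consume.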

Individual and aggregated daily activity dissimilarities
Figure 2 shows the aggregated daily temporal pattern of 2778 person-days (1389 individuals, 2 days each), with their activities categorized in nine groups. The 288 5-min intervals started at 4:00 a.m. and finished at 3:55 a.m. on the next day. The temporal distribution of in-home and out-of-home activities in Fig. 2 is informative about household daily activity patterns. In particular, we see that household chores have morning and early evening peaks (breakfast and supper), and household leisure is high in the later evening. Workplace and household chores dominate during the daytime, and are inversely related.
Figure 3 visualizes sparsity patterns of person-day activities. The spots in Fig. 3 show the transformed data using 5-min intervals. For each activity type, the darker area indicates that the individual participated in the activity (code "1"), and conversely the brighter area indicates that the individual did not participate in the activity (code "0"). The horizontal axis denotes time of day, starting from 4:00 a.m. and ending at 3:59 a.m.

Methods
The pattern recognition modeling framework in this study comprises four modules, as follows. First, we employed a subtractive clustering algorithm for initializing the total cluster number and cluster centroids. Previous studies (Pena et al. 1999; Erisoglu et al. 2011) suggested initializing these two values before implementing any clustering algorithm such as K-means or C-means, in order to increase the performance of the main clustering algorithm. We used Dunn's index to measure cluster validity. Next, individuals with similar activity patterns were identified and clustered using the FCM clustering algorithm. Using the multiple sequence alignment method (M-SAM), the sets of representative patterns were obtained. We incorporated the progressive method to calculate the number of steps needed to align multiple sequences. Finally, the CART classifier algorithm was used to explore inter-dependencies among the attributes in each identified cluster, and to relate the membership of cluster individuals to their socio-demographic characteristics. Broadly, we group the activity sequences into different clusters using FCM and analyse the relationship between demographic features of persons and identified clusters using CART, with an assumption that activity sequence clusters are dependent on demographic features. Although the SAM is commonly used in activity sequence analysis, to the best of our knowledge FCM clustering has not been explored in activity pattern or travel behavior studies (Hafezi et al. 2017). Further study will include extending the current modeling framework to produce detailed information on activities, such as start time, duration, activity type, travel distance, and location, that are crucial for the scheduling stage of activity-based travel demand modeling.
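As an illustration of the cluster validity measure, the following minimal sketch computes one common variant of Dunn's index: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The paper does not state which variant was used, so the choice of Euclidean distance and of these particular separation and diameter definitions is an assumption.

```python
import math
from itertools import combinations


def euclid(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dunn_index(clusters):
    """Dunn's index for a list of clusters, each given as a list of points.

    Assumes at least two clusters and at least one cluster with two distinct points.
    """
    # Largest within-cluster pairwise distance (cluster diameter).
    max_diameter = max(
        (euclid(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    # Smallest distance between points that belong to different clusters.
    min_separation = min(
        euclid(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


toy_clusters = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
toy_dunn = dunn_index(toy_clusters)
```

Higher values indicate compact, well-separated clusters, so candidate cluster counts can be compared by the index each partition attains.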

Initialization of cluster number and cluster centroids
The first step in the proposed pattern recognition modeling framework is to initialize both cluster number and cluster centroids. For this purpose, a dynamic subtractive clustering algorithm is implemented. The algorithm searches for cluster centers based on the density of neighboring data points. Overall, the subtractive clustering algorithm consists of five phases. In this study, we present only a concise overview of the algorithm, and interested readers are referred to Ngo and Pham (2012), Shieh (2014), and John Lu (2010) for more explanation. The transformed temporal information on activities obtained in the "Data transformation" section is used as input to the subtractive clustering algorithm. The parameter $z$ is defined as the sample size of person-days in the dataset, 2778. For each individual in the population, $i \in \{1, 2, 3, \ldots, z\}$, there are 2592 values (nine activity categories × 288 time intervals), giving the set of data points $P = \{p_1, p_2, p_3, \ldots, p_z\}$ in which each $p_i$ is a binary vector with 2592 dimensions. Each data point represents a transformed person-day activity pattern. The subtractive clustering algorithm begins by initializing the accept ratio $(\bar{o})$, reject ratio $(o)$, cluster radius $(u_r)$, and squash factor $(\vartheta)$ parameters. These parameters have important effects on finding cluster centers and the total cluster number in the database. The cluster radius $u_r$ is defined as a positive value representing a neighborhood radius. A larger value of $u_r$ results in fewer clusters, whereas a smaller value of $u_r$ can result in model overfitting. The suggested ranges are $1.0 \le \vartheta \le 1.25$ and $0.15 \le u_r \le 0.30$ (Ngo and Pham 2012). The next step in the subtractive clustering algorithm is to calculate the density for all data points.
The distance between two data points is computed using the Euclidean distance method. In other words, the distance indicates the extent of differences between two person-day activity sequences. The algorithm continues by searching among the computed densities for all data points, and the data point $p^*$ with the highest density $T^*$ is designated as the initial cluster center. Next, the algorithm recalculates the density of all data points using the difference between the highest selected density in the last step and the new computed density.
If $T^* > \bar{o}\,T_{ref}$, then $p^*$ is nominated as a new cluster center. Otherwise, $E_{min}$ is computed as the shortest distance between $p^*$ and all previously found cluster centers. The process of finding a new cluster center is continued if the acceptance condition on $E_{min}$ is met. If not, then $T(p^*) = 0$ and the data point with the next-highest density is designated as the new candidate. The algorithm is terminated when $T^* < o\,T_{ref}$. Given the set of cluster centers, the membership degree of each data point in each cluster is then computed. The cluster number and cluster centroids identified through the subtractive clustering algorithm are used as inputs for the fuzzy C-means (FCM) algorithm in the next step of the pattern recognition modeling framework. The FCM algorithm determines the final memberships in each cluster through a fuzzy process.
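The accept and reject logic described above can be sketched as follows. This is a minimal illustration based on the widely used textbook form of subtractive clustering, not the authors' implementation: the Gaussian density kernel, the acceptance condition `e_min / radius + peak / ref >= 1`, and the default accept and reject ratios of 0.5 and 0.15 are all assumptions, since the paper's own equations and ratio values are not available here. The function and parameter names are hypothetical.

```python
import math


def subtractive_clustering(points, radius=0.25, squash=1.25, accept=0.5, reject=0.15):
    """Return cluster centers chosen from `points` by density peaks.

    radius ~ cluster radius u_r, squash ~ squash factor; accept and reject are
    assumed default ratios, not values reported in the paper.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    alpha = 4.0 / radius ** 2
    beta = 4.0 / (squash * radius) ** 2

    # Density of each point: sum of Gaussian contributions from all points.
    density = [sum(math.exp(-alpha * sqdist(p, q)) for q in points) for p in points]
    ref = max(density)  # reference density T_ref (first, highest peak)
    centers = []

    while True:
        i = max(range(len(points)), key=density.__getitem__)
        peak = density[i]
        if peak <= 0.0:
            break
        if not centers or peak > accept * ref:
            accepted = True                      # clearly dense enough
        elif peak < reject * ref:
            break                                # remaining peaks too weak: stop
        else:
            # Grey zone: accept only if far enough from the existing centers.
            e_min = min(math.sqrt(sqdist(points[i], c)) for c in centers)
            accepted = e_min / radius + peak / ref >= 1.0
        if accepted:
            centers.append(points[i])
            # Subtract the influence of the new center from every density.
            density = [d - peak * math.exp(-beta * sqdist(p, points[i]))
                       for p, d in zip(points, density)]
        else:
            density[i] = 0.0                     # reject candidate, try next peak
    return centers


toy_points = [(0, 0), (0.01, 0), (0, 0.01), (1, 1), (0.99, 1), (1, 0.99)]
toy_centers = subtractive_clustering(toy_points)
```

The number of returned centers serves as the initial cluster number, and the centers themselves as initial centroids for a subsequent clustering step.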

Identification of individuals with homogeneous activity patterns
The second step in the proposed pattern recognition modeling framework is to identify individuals with homogeneous activity patterns and group them into clusters. For this purpose, the fuzzy C-means (FCM) unsupervised machine learning algorithm is employed. In the FCM, each data point that represents a person-day activity pattern may belong to several clusters, each with some degree of membership. This aspect of the algorithm boosts the cluster quality by selecting the best-fitted data points. The FCM algorithm uses an iterative process in which the degree of membership for each data point in the cluster is computed at each iteration, and subsequently this information is utilized in updating the cluster membership and cluster centroids in the following iterations. The FCM algorithm is terminated when a termination condition is met. The FCM algorithm employs the following steps. First, initialize the membership matrix $R^{(0)} = [k_{ih}]$ for data point $p_i$ (person-day activity) of cluster $g^*$ by randomly choosing membership of all clusters. Second, at the $t$-th iteration, calculate the fuzzy centroids $F^{(t)} = [f_l]$ for $l = 1, \ldots, g$, where $g$ is the number of clusters obtained from the previous step, $h_{il}$ is the degree of membership of data point $p_i$ in cluster $l$, $s$ is the fuzzy parameter, and $z$ is the number of data points (2778 person-days). Third, update the fuzzy membership $h_i$ so as to minimize the FCM objective function. The updating algorithm is terminated when the change between successive iterations falls below the parameter $u$, which is specified as the minimum threshold in the algorithm. The final membership of cluster $f_i$ is obtained as follows: each data point $p_i$ is assigned to cluster $f_i$ if its fuzzy membership $h_i$ is greater than the threshold value $b$. Activity sequences belonging to members in each identified cluster are used as input for the sequence alignment method (SAM) in the next step of the pattern recognition modeling framework. The SAM algorithm measures distance between activity sequences based on the number of stages needed to align two sequences of activities.
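The iteration described above can be sketched with the standard textbook FCM updates, in which centroids are membership-weighted means, $f_l = \sum_i h_{il}^{s} p_i / \sum_i h_{il}^{s}$, and memberships are inversely related to distance, $h_{il} = 1 / \sum_k (\lVert p_i - f_l \rVert / \lVert p_i - f_k \rVert)^{2/(s-1)}$. These forms are assumed here and may differ in notation from the authors' equations. The sketch starts from given centroids (for example, those from subtractive clustering) rather than random memberships, and the default values `s=2.0`, `tol=1e-5`, and `b=0.5` are illustrative assumptions.

```python
def fcm(points, centers, s=2.0, tol=1e-5, max_iter=100):
    """Fuzzy C-means from initial centers; returns (centers, membership rows)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def memberships(current):
        rows = []
        for p in points:
            d = [sqdist(p, c) for c in current]
            if 0.0 in d:
                # Point coincides with a centroid: full membership there.
                hits = [1.0 if v == 0.0 else 0.0 for v in d]
                rows.append([h / sum(hits) for h in hits])
            else:
                # Squared distances, so the exponent is -1 / (s - 1).
                inv = [v ** (-1.0 / (s - 1.0)) for v in d]
                total = sum(inv)
                rows.append([v / total for v in inv])
        return rows

    centers = [list(c) for c in centers]
    h = memberships(centers)
    for _ in range(max_iter):
        new_centers = []
        for l in range(len(centers)):
            w = [row[l] ** s for row in h]
            total = sum(w)
            if total == 0.0:
                new_centers.append(centers[l])  # empty cluster: keep old centroid
                continue
            new_centers.append([
                sum(wi * p[k] for wi, p in zip(w, points)) / total
                for k in range(len(points[0]))
            ])
        new_h = memberships(new_centers)
        change = max(abs(a - b) for ra, rb in zip(h, new_h) for a, b in zip(ra, rb))
        centers, h = new_centers, new_h
        if change < tol:  # termination: memberships have stabilized
            break
    return centers, h


def assign(h, b=0.5):
    """Hard assignment: index of the strongest cluster if its membership exceeds b."""
    return [max(range(len(row)), key=row.__getitem__) if max(row) > b else None
            for row in h]


toy = [(0, 0), (0.1, 0), (0, 0.1), (1, 1), (0.9, 1), (1, 0.9)]
toy_centers, toy_h = fcm(toy, [(0, 0), (1, 1)])
toy_labels = assign(toy_h)
```

Points whose strongest membership does not exceed the threshold are left unassigned (`None`) in this sketch; how such points are treated in the study is not stated.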

Identification of sets of representative activity patterns
The third step in the proposed pattern recognition modeling framework is to identify the sets of representative activity patterns. For this purpose, the multiple sequence alignment (MSA) method is employed. The sequence alignment method is commonly used in the biological sciences to compare strings of chromosomes. One of the main challenges of the sequence alignment problem is to compute the required number of stages in order to align two strings (Chenna et al. 2003). This problem can be solved using various methods such as heuristic methods, approximation algorithms, probabilistic methods, and global optimization. In this study, a new heuristic method is used for solving the sequence alignment problem. This method, named the progressive alignment technique, is composed of three phases. At the beginning, pairwise distance scores are computed for all existing pairs of sequences in each cluster. Next, a guide tree is produced from the calculated similarity scores, with similar sequences placed close together in the tree. Finally, the sequences are aligned in the order given by the guide tree. Supposing that cluster $g$ has $l$ members, $\{g_1, g_2, g_3, \ldots, g_l\}$, the objective is to calculate the optimal alignment for every member of cluster $g$. The optimal alignment is accomplished through the distance score matrix. The distance score between two members of cluster $g$ is computed from $r$, the number of corresponding strings $g_i$ and $g_j$ in a bit string, and $s$, the number of corresponding lengths of $d_{g_i}$ and $d_{g_j}$ in a bit string.
The score for two matched strings is $+1$, the penalty for two mismatched strings is $-q$, and the gap penalty is $-r$. The guide tree is constructed according to the distance score matrix. Finally, the representative pattern is obtained through execution of several alignment operations, including insertion, deletion, and substitution, over the entire cluster membership. The edit distance between two strings $g_i$ and $g_j$ with lengths $d_{g_i}$ and $d_{g_j}$ is computed from $T_{i,j}$, the alignment with maximum score. In order to understand the relationship between demographic features of persons and identified clusters, we then analysed each cluster using a CART classifier.
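The pairwise scoring phase can be illustrated with a minimal global alignment sketch using standard dynamic programming (Needleman-Wunsch style), with a match score of +1, a mismatch penalty of −q, and a gap penalty of −r as stated above. The guide tree and progressive merging phases are not covered. The single-letter activity codes and the default penalty values `q=1` and `r=1` are hypothetical, since the paper's values are not given here.

```python
def alignment_score(a, b, q=1, r=1):
    """Maximum global alignment score of sequences a and b.

    Match scores +1, mismatch scores -q, each gap (insertion or deletion) scores -r.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [-r * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-r * i]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (1 if a[i - 1] == b[j - 1] else -q)
            cur.append(max(diagonal,          # match or substitution
                           prev[j] - r,       # gap in b (deletion from a)
                           cur[j - 1] - r))   # gap in a (insertion into a)
        prev = cur
    return prev[-1]


# Hypothetical activity codes: H = home, W = work, T = travel.
score_same = alignment_score("HWWH", "HWWH")
score_gap = alignment_score("HWH", "HH")
```

Computing this score for every pair of sequences in a cluster yields the score matrix from which a guide tree could be built.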

Investigation of inter-dependencies among the attributes
The last step in the proposed pattern recognition modeling framework is to discover the inter-dependencies among the socio-demographic attributes of persons in each identified cluster, with an assumption that cluster membership is dependent on demographic features. In doing so, our approach is comparable to recent work by Jiang et al. (2012). For this purpose, the CART classifier is applied to construct the best-fitting decision tree that contains the highest amount of information. Consistent with other decision tree algorithms such as C4.5, CHAID, and ID3, the CART algorithm uses an impurity measure to decide how nodes are split (Tan et al. 2006). The CART algorithm utilizes the Gini index to measure the impurity. The Gini index is calculated for every predictor variable at each node, and the variable that has the minimum value is chosen. In addition, the CART algorithm employs cross-validation as a complementary measurement to choose the optimal decision tree. The Gini index is calculated as $Gini(g) = 1 - \sum_{j=1}^{n} f_j^2$, where $n$ is the number of activity categories (nine activity categories as defined in Table 1) and $f_j$ is the relative frequency of activity $j$ in the cluster $g$.
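The Gini-based split selection can be illustrated with a short sketch. It uses the standard Gini impurity, one minus the sum of squared relative frequencies, and the size-weighted impurity of a two-way split, which is the quantity a CART-style algorithm minimizes when comparing candidate splits. The label values below are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared relative frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate two-way split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical cluster labels at a node, before and after a candidate split.
node = [1, 1, 2, 2]
before = gini_impurity(node)
after = split_impurity([1, 1], [2, 2])
```

In this toy case the split separates the two labels perfectly, so the weighted impurity drops from 0.5 to 0.0; among candidate predictor variables, the one giving the lowest weighted impurity would be chosen.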

Discussion of results
We applied the proposed pattern recognition modeling framework to data associated with 2778 person-days (1389 individuals, 2 days each) drawn from the 2008 space-time activity research (STAR) travel survey (TURP 2008) in Halifax, Nova Scotia, Canada. The FCM clustering method bundled individual activity patterns into twelve discrete clusters. Dunn's index showed 12 to be the best number of clusters. The temporal pattern of individual activities for the twelve identified clusters is shown in Fig. 4, and Table 2 presents an analysis of clustered data. In the following section, a discussion of each of the twelve clusters and their socio-demographic attributes is presented.
Cluster #1: extended work-day workers, comprised a group of workers who engaged in work activity for a longer duration, from 8:00 a.m. to around 8:00 p.m. A large portion of workers in this cluster were middle-aged females aged between 36 and 55 years old (67.0%), while 76.0% of them had education levels higher than high school. Furthermore, 73.0% of people in this cluster were full-time workers, and they commonly had middle income (60.0%). The majority of the workers in this cluster (55.0%) indicated they had no flexibility in their work schedule.
Cluster #2: non-worker, midday activities, consisted of a group of people who participated in organizational/hobbies or entertainment activities mostly in the midday, from 10:00 a.m. to around 5:00 p.m. A large proportion of people in this cluster were female (53.0%) and aged older than 55 years (66.0%). The majority of people in this cluster were educated and belonged to the middle or low income level. A minor proportion of the people in this cluster worked at home (10.0%), while 52.0% of them indicated that they had some flexibility in their work schedule.
Cluster #3: 8-4 workers, was a group of workers who engaged in work activity in a consistent manner, from 8:00 a.m. to around 4:00 p.m. The major proportion of workers in this cluster consisted of middle-aged males with education level higher than high school. A major proportion of workers in this cluster were full-time workers (93.0%), and they typically had middle income level. Additionally, the workers in this cluster typically engaged in entertainment activities in the evening, starting around 6:00 p.m. for a duration of around 2 h.
Cluster #4: non-worker, evening activity, involved a group of people who participated in organizational/hobbies or entertainment activities mostly in the evening, from 6:00 p.m. to around 10:00 p.m. Similar to cluster 2, most people in this cluster were older females with education level higher than high school. A large proportion of people in this cluster (48.0%) belonged to the low income partition. Furthermore, a minor group of them worked at home (15.0%), while 54.0% of them indicated that they had some flexibility in their work schedule.
Cluster #5: stay-at-homes, comprised a group of people who mostly spent their time at home. Most people in this cluster belonged to the low-income partition. Similar to clusters 2, 4, 8 and 9, a large proportion of people in this cluster were older females. Furthermore, a minor proportion of people in this cluster (4.64%) left home during the day for recreational activities. Compared to other activities, sports and shopping activities were the most typical after in-home activity. In addition, cluster 5 had the largest membership (15.08%) of all clusters identified in this study.
Cluster #6: shorter work-day workers, involved a group of workers whose work duration was typically less than 5 h in a day, and who finished their work in the early afternoon before 2:00 p.m. A large proportion of workers in this cluster were middle-aged females between 36 and 55 years old (71.0%). Furthermore, a large proportion of workers in this cluster (85.0%) had education level higher than high school. In total, 56.0% of them indicated that they had some flexibility in their work schedule. The workers in this cluster participated in more recreation activities than those in other identified worker clusters.
Cluster #7: 7-3 workers, comprised a group of workers who started work in the early morning around 7:00 a.m. and finished their work in the early afternoon around 3:00 p.m. A large proportion of workers in this cluster were middle-aged males between 36 and 55 years old (47.0%), and the majority had middle income (64.0%). Furthermore, the majority of workers in this cluster were full-time workers (93.0%), while 63.0% of them indicated that they had no flexibility in their work schedule. A minor portion of workers in this cluster (7.0%) had more than one job. It is interesting to note that workers in this cluster typically start and end their time at the workplace ahead of peak traffic periods both in the morning and afternoon.
Cluster #8: non-worker, morning shopping, involved a group of people who did shopping activities mostly in the morning, from 9:30 a.m. to around 12:00 p.m. Similar to clusters 2 and 4, a major proportion of people in this cluster were female and aged older than 55 years. Moreover, similar to cluster 4, a large proportion of people in this cluster were educated and belonged to the low income level (53.0%). Furthermore, a minor group of them worked at home (11.0%). In total, 52.0% of them specified that they had some flexibility in their work schedule.
Cluster #9: non-worker, afternoon shopping, consisted of people who did shopping activities mostly in the afternoon. Consistent with other non-worker clusters, a large proportion of people in this cluster were female (59.0%) and aged older than 55 years. Furthermore, a large proportion of people in this cluster (48.0%) belonged to the low income partition, with education level higher than high school. Compared to other identified non-worker clusters, only a small portion of them worked at home (9.0%), while 51.0% of them indicated that they had no flexibility in their work schedule.
Cluster #10: evening workers, was a group of workers who mostly started to work in the evening around 4:00 p.m. and finished their work around midnight. In contrast to cluster 7, a large proportion of workers in this cluster were females (51.0%). Similar to clusters 3, 6 and 7, a large proportion of people in this cluster were middle-aged with education level higher than high school. Moreover, 47.0% of people in this cluster had an irregular working schedule, and they typically had middle or low income level.
Cluster #11: 9-5 workers, involved a group of workers who, unlike workers in cluster 7, typically tend to travel to and from work during peak traffic periods both in the morning and afternoon. The majority of workers in this cluster were middle-aged females between 36 and 55 years old (53.0%) with middle income level (59.0%). Similar to cluster 6, a large proportion of workers in this cluster had education level higher than high school. Interestingly, around 60.0% of them indicated that they had some flexibility in their work schedule.
Cluster #12: students, involved a group of students who engaged in school activity in a consistent manner. A large proportion of students in this cluster were young adults aged between 15 and 35 years old (60.0%). The majority of students in this cluster (78.0%) belonged to the low income partition. Furthermore, students typically engaged in recreation activities after school time, from 4:00 p.m. to around 11:30 p.m.
We examined the start time and duration probability distributions of different activities. The purpose of this analysis is to explore the cluster aspects from a temporal point of view. The probability distributions of start time and duration for work activity are shown in Figs. 5 and 6, respectively. It should be noted that, for the sake of brevity, we do not show the probability distributions of all activities. Results of the cluster distribution analysis reveal that start time and duration distributions of work activity in most of the clusters are clearly dissimilar. Also, these results reveal that the number of clusters significantly influences cluster features.
We employed the Kolmogorov-Smirnov (KS) test on the activity start time distributions between pairs of clusters, to test for significant differences. Results of the KS tests for the twelve clusters and all activity categories are shown in Table 3. The KS test is built as a statistical hypothesis test. The null hypothesis ($H_0$) is that the two samples were drawn from the same population. Values of 1 in Table 3 indicate rejection of $H_0$ at the $p = 0.05$ level. As can be seen, in most of the tests the null hypothesis is rejected and start time distributions may be regarded as significantly different between the two clusters.
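The pairwise test can be sketched as follows. The code computes the two-sample KS statistic (the largest gap between the two empirical cumulative distribution functions) and compares it with the usual large-sample critical value at the 0.05 level, $1.358\sqrt{(n+m)/(nm)}$. This asymptotic decision rule, the function names, and the start times in the example are assumptions for illustration; the exact test implementation used to produce Table 3 is not specified here.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(x, y):
    """Largest absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    return max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )


def rejects_h0(x, y, c_alpha=1.358):
    """1 if H0 (same population) is rejected, else 0.

    c_alpha = 1.358 is the asymptotic coefficient for the 0.05 level;
    it is an approximation that works best for larger samples.
    """
    n, m = len(x), len(y)
    critical = c_alpha * sqrt((n + m) / (n * m))
    return int(ks_statistic(x, y) > critical)


# Hypothetical work start times in hours after midnight.
early_starts = [6.5 + 0.05 * k for k in range(20)]   # about 6:30 to 7:30
late_starts = [15.5 + 0.05 * k for k in range(20)]   # about 15:30 to 16:30
flag = rejects_h0(early_starts, late_starts)
```

The returned flag mirrors the coding described for Table 3, where a value of 1 marks a pair of clusters whose start time distributions differ significantly.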
Figure 7 depicts a set of representative activity patterns that correspond to the centroids of each cluster. It should be noted that members within specific clusters can have activities that, due to their lower share or short duration compared to other activities, are absent from the representative patterns. Our results show that, by using the FCM clustering algorithm, each activity type is embodied in at least one of the representative patterns, which makes it comparable with other clustering algorithms such as k-means (Jiang et al. 2012; Allahviranloo et al. 2016).
Figure 8 depicts the inter-dependencies among the attributes in each of the twelve identified clusters. Circles in Fig. 8 designate leaves and triangles designate branches. The CART algorithm utilized the Gini index for leaf splitting and fitted a tree with 16 leaves and 15 branches in total. Individuals are classified at the root of the fitted tree based on whether or not they attend school. From branch 3, branches grow based on non-worker or works in-home versus out-of-home worker. Individuals are then classified as members of clusters 2, 5, 6, 8 or 9 based on income, age, and education level. Note that these clusters have a major non-mandatory activity (i.e., entertainment, sport, shopping) in their daily activities.
In contrast, clusters 1, 3, 7, 10 and 11 have a main work activity in the pattern, and are based on income, age, education, and gender criteria. The CART algorithm found specific clusters for particular leaf nodes based on the high probability that an individual belongs to them. However, it should be noted that, in each particular leaf node, there is a probability that an individual might belong to any of the other clusters. For instance, cluster 4 does not appear in any of the leaf nodes, as can be seen in Fig. 8. Accordingly, we calculated the probability distributions of each cluster at each leaf node, and these are shown in Table 4. This allows us to infer cluster membership of individuals based on random number generation and cumulative probability functions. This CART classifier feature is important for use in future forecasting phases in activity-based travel demand modeling. As an example, assume a non-worker or works in-home individual in the test set with the following socio-demographic characteristics: 32 years old and average income of 65 K. According to Fig. 8, the individual is allocated to leaf 11 in the decision tree. Subsequently, with respect to the probability distribution calculated in Table 4, the individual has a 52% likelihood of belonging to cluster 9. However, as mentioned before, there is a 48% chance that the individual might belong to the other clusters.
Suppose that the randomly generated number is 0.8. According to Table 4 and the cumulative probability functions, this value falls within the range of [0.72, 1.00] and indicates that the individual will be assigned to cluster 11.
The proposed pattern recognition modeling framework in this study comprises four modules. The first module, the subtractive clustering algorithm, initialized the cluster centroids and the total number of clusters. The initial number of clusters found in this phase was validated using Dunn's index. The subtractive clustering algorithm is an alternative method for dealing with problems that involve high resolution. The algorithm considers data points as potential sources, which resulted in finding cluster centroids with higher accuracy. In the next module, person-days in the dataset were bundled into dissimilar clusters based on comparable routine activity sequences using the novel and efficient fuzzy C-means (FCM) clustering algorithm. When compared to other potential clustering algorithms such as K-means, FCM yields better convergence to the local minima of the squared error criterion. This is directly associated with the choice of cluster centroids and with cluster membership. In the third module, twelve representative activity patterns were recognized, corresponding to cluster centroids. The progressive alignment method yields more accurate results by improving SAM through iterative profile-alignment of tree portions to maximize the sum-of-pairs score. Finally, in the last module, inter-dependencies among the attributes of each identified cluster were investigated using the CART algorithm. Compared to other potential classifiers such as C4.5, the CART classifier improves decision tree performance by adding a cross-validation step. The heterogeneous diversity among clusters in terms of their distributions of activity type, start time, activity duration, and end time was confirmed by use of the non-parametric KS test.
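The leaf-based assignment example (a random draw located within the cumulative probability ranges of a leaf) can be sketched in a few lines. In the hypothetical leaf below, only the 52% share of cluster 9 and the [0.72, 1.00] range of cluster 11 come from the text; the 20% share given to cluster 5 is invented to fill the remaining range, and the function name `assign_cluster` is likewise an assumption.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def assign_cluster(leaf_probs, u):
    """Map a uniform draw u in [0, 1) to a cluster via cumulative probabilities.

    leaf_probs maps cluster id -> probability at one leaf node.
    """
    clusters = list(leaf_probs)
    cumulative = list(accumulate(leaf_probs[c] for c in clusters))
    # Index of the first cumulative bound that is >= u.
    idx = bisect_left(cumulative, u)
    # Guard against floating-point round-off in the final bound.
    return clusters[min(idx, len(clusters) - 1)]


# Hypothetical distribution for leaf 11 (cluster 5 share is illustrative).
leaf_11 = {9: 0.52, 5: 0.20, 11: 0.28}

cluster_for_draw = assign_cluster(leaf_11, 0.8)   # draw used in the text

# In a forecasting run the draw would come from a random number generator.
rng = random.Random(2008)
simulated = [assign_cluster(leaf_11, rng.random()) for _ in range(5)]
```

Because the draw of 0.8 exceeds the cumulative bound of 0.72, the individual is placed in the last range, which matches the assignment to cluster 11 described in the example.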
In this study, we demonstrated how cluster analysis can be used to detect differences in the socio-demographic characteristics of population groups with different daily activity patterns. Our results show that individuals belonging to the non-worker or student clusters have different income, age, gender, and education level. Individuals with a stay-at-home pattern seem to be identified primarily by age, gender, and income level, while workers are found to have statistically dissimilar education, income, and flexible schedules. Lastly, and not surprisingly, students are found to have remarkably dissimilar marital status, age, and education level.
A wealth of detailed information on activities, such as start time, activity duration, activity type, location, and travel distance, can be extracted from each identified cluster. Such precise information is crucial for the scheduling step of activity-based travel demand modeling. The proposed method enriches traditional methods, such as using socio-demographic variables for classifying the population, and provides clusters based on a powerful computerized pattern recognition technique. For instance, discovering activity patterns over longer time periods, such as weekly, monthly, and seasonally, can be accomplished in a short period of time using the algorithm proposed in this study. Another advantage of the new method is that, unlike previous approaches, the algorithm has the ability to recognize people who typically tend to avoid travel in peak traffic periods. Our model particularly recognized cluster #7 and cluster #11 of workers who commonly travel to and from work before and after peak traffic periods both in morning and afternoon peaks, respectively.
Compared to previous studies that used complex methods to capture frequent and infrequent activities in a dataset (Liu et al. 2015), the FCM clustering algorithm proposed in this study is more straightforward and easy to implement in practical activity-based travel demand models. Furthermore, the cluster membership selection in the FCM is comparable to the k-means clustering algorithm proposed by Jiang et al. (2012) and Allahviranloo et al. (2016). In the FCM, each data point has a likelihood of belonging to several clusters, and this results in more homogeneous activity patterns in each cluster. For instance, we identified two different groups of workers in cluster #7 and cluster #11 that, regardless of their similarity in activity sequences, are distinguished by start time and end time at the workplace. The application of this study is not restricted to the transportation area: the new modeling framework can be adapted for any application that contains a set of connected sequences, such as recognition of functionally significant regions, or day-to-day variations in transit ridership and station demand at the individual level.
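The idea that each data point belongs to several clusters with different likelihoods can be illustrated with the standard FCM membership update, $u_i = 1 / \sum_k (d_i / d_k)^{2/(m-1)}$, where $d_i$ is the distance from the point to centroid $i$. The sketch assumes Euclidean distance on numeric vectors and a fuzzifier of $m = 2$, which is a common default and not a value reported in this study; the function name and toy coordinates are hypothetical.

```python
from math import dist


def fcm_memberships(point, centroids, m=2.0):
    """Fuzzy membership of one point in each cluster (standard FCM update).

    m is the fuzzifier (m > 1); memberships are non-negative and sum to 1.
    """
    distances = [dist(point, c) for c in centroids]
    # A point that coincides with one or more centroids belongs fully to them.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** power for d_k in distances)
        for d_i in distances
    ]


# Hypothetical one-dimensional example: the point lies closer to centroid 0.
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

Unlike the hard assignment of k-means, the point keeps a small membership in the farther cluster, and these graded memberships feed into the centroid update at the next iteration.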
To build on this study and further demonstrate the potential of our proposed method, we propose several avenues of research. Firstly, it is possible to explore seasonal activity patterns by taking advantage of the wealth of data in the large-scale Halifax household travel diary survey (STAR). Secondly, and in line with growing worldwide interest in employing GPS locational data, we aim to investigate additional linkages between the STAR GPS data and travel diary data, and to incorporate them in the modeling framework proposed in this study. Thirdly, we intend to extend our work to study the interaction between individual and household activity patterns, using data from the STAR survey. The latest generation of activity-based travel demand models includes interactions between household members, as these have a significant impact on others' travel. In addition, a conceivable further step of this work is to establish a hybrid framework employing discrete choice models in combination with the output from pattern recognition to recognize likelihoods of activity participation and predict activity patterns of individuals with greater accuracy.
In summary, the modeling framework presented in this study provides a straightforward and easy-to-implement tool for urban and transport modelers to understand time-use activity patterns for different kinds of individuals. The results of this study are expected to be implemented within the activity-based travel demand model, Scheduler for Activities, Locations, and Travel (SALT) for Halifax, Nova Scotia.
Fig. 1 Database schema transformation

Fig. 4 Temporal pattern of person-day activities for twelve identified clusters. Out-of-home activities in each cluster are designated as bold
Transportation (2019) 46:1369-1394

Fig. 6 Probability distribution of workplace activity duration in clusters

Table 1 Proposed cluster-based codification for activity episodes

Table 2 Analysis of clustered data: share of different socio-demographic variables, membership analysis and representative patterns

Table 3 Kolmogorov-Smirnov test on activity start time distribution*

Conclusions
Due to lack of full data on all individuals of the population, transport modelers are not able to predict or model the travel behavior of all individuals in the territory. Consequently, the best policy is to predict or model travel behavior for a representative set of model individuals, who represent homogeneous cohorts. Accordingly, aggregation is both inescapable and essential in travel demand modeling. The significant original contribution of this study is to develop a new inclusive pattern recognition modeling framework that leverages activity data to derive clusters of homogeneous daily activity patterns for use in activity-