Probabilistic Modelling for Unsupervised Analysis of Human Behaviour in Smart Cities

The growth of urban areas in recent years has motivated a large number of new sensor applications in smart cities. At the centre of many of these applications stands the goal of gaining insights into human activity. Scalable monitoring of urban environments can facilitate better informed city planning, efficient security, and regular transport and commerce. A large part of this monitoring capability has already been deployed; however, most of it relies on expensive motion imagery and privacy-invading video cameras. Low-cost sensor alternatives, such as Global Positioning System (GPS) data, can enable a deep understanding of population behaviour. However, the automated analysis of such low dimensional sensor data requires new flexible and structured techniques that can describe the generative distribution and time dynamics of the observations, while accounting for external contextual influences such as time of day or the difference between weekend and weekday trends. In this paper, we propose a novel time series analysis technique that allows for multiple different transition matrices depending on the data's contextual realisations, all following shared adaptive observational models that govern the global distribution of the data given a latent sequence. The proposed approach, which we name the Adaptive Input Hidden Markov Model (AI-HMM), is tested on two datasets from different sensor types: GPS trajectories of taxis and derived vehicle counts in populated areas. We demonstrate that our model can group different categories of behavioural trends and identify time specific anomalies.


Introduction
The rising population of modern cities introduces many challenges in urban planning, including problems associated with improving the inhabitants' quality of life and security. The utilisation of sensors and communication technology gives rise to the concept of Smart Cities and enables the adequate study of specific focus problems in urban development. Among the many examples are applications in structural health monitoring [1], where material conditions of civil infrastructure are monitored and automatically analysed to ensure population safety; smart waste management [2], which improves service provisioning through smart garbage bins and Internet of Things (IoT) technology; smart lighting solutions [3], which reduce costs and greenhouse emissions through sensors and weather analytics on city streets; and, in the manufacturing industry [4], the combination of IoT with machine learning. With the growing interest in and importance of sensor applications in smart cities, there has also been a rise in interest in human movement behavioural characteristics derived from sensors, to identify movement trends and improve the understanding of traffic. Examples from recent literature on applications of human behavioural understanding include De Marsico et al. [5], who used gait identification for automatic activation of controlled entry spaces, and Paul et al. [6] and Torre-Bastida et al. [7], who utilised big data and the social IoT for real-time definition of human behaviours. For traffic analysis, Bhatti et al. [8] introduced an IoT accident detection and notification system that relies on multiple, commonly available smart-phone sensors.
The analysis of car traffic can aid the understanding of population dynamics in smart cities. This, in turn, leads to better capabilities for congestion control and accident detection [9]. For example, Kanungo et al. [10] studied traffic light management using video surveillance analysis, whereas Ozkurt and Camci [11] used surveillance footage to estimate density patterns. This focus on video surveillance data has prompted research on camera placement for greater area coverage [12] to reduce the high accumulated cost of such sensors. However, the high costs and privacy concerns related to CCTV data are a key driver of the search for alternative sensor resources, such as density counts, noise levels or automated geolocation data (such as GPS).
Despite prior work on behaviour understanding from GPS [13][14][15], much GPS data analysis remains challenging due to noise, irregular sampling, heterogeneity and frequent interruption of data collection, as environmental factors cause missing data [16]. To address this, Ellam et al. [17] recently proposed a parameter estimation framework for spatial interaction models, which was demonstrated to accurately simulate the flow of customers. In scenarios where large amounts of individual-level data are available, we can also approach the problem in reverse and first search for recurring patterns in large databases of movement in cities. Frequently sampled monitoring data such as GPS data can be highly multimodal. To account for this, Witayangkurn et al. [18] proposed the use of a Hidden Markov Model (HMM) to detect anomalies in large scale GPS data. Nonetheless, this approach does not consider the time self-dependence of the observation sequence and requires that the model complexity be predefined, which may bias the unsupervised approach to the analysis. The geospatial relationship between the journeys of individuals is often dependent on various contextual factors such as differences between weekend/weekday behaviour, peak/off-peak behaviour and individual idiosyncrasies.
To address some of these challenges, we design an exploratory nonparametric probabilistic modelling approach suitable for this type of data. We deploy a flexible HMM with an augmented latent space to account for various types of contextual information, including differences between peak and off-peak driving, weekday/weekend variation and person specific manoeuvrability decisions. To achieve this, we propose an intuitive Adaptive Input Hidden Markov Model (AI-HMM). Building on the theory of the Vector Autoregressive Hidden Markov Model (VAR-HMM) [19] to allow for autoregressive self-dependence between observations, we introduce a new discrete and independent semi-Markov variable which acts as a parent to the state indicator variable in the graphical model and represents environmental factors that influence movement behaviour. For example, in our case study on GPS data in Section 5, we assume that the growing complexity of human trips, as more trajectories are observed, can be well captured with a Bayesian nonparametric structure, and that environmental factors such as time of day (e.g., peak or off-peak) may also influence the modality of the behaviour and should therefore be considered in the design of the generative model structure. We show that we are able to learn common patterns occurring across different contextual realisations, in addition to their specific generative distributions. To summarise, the contributions of this paper are as follows.

•
We develop an extension of the ubiquitous Hidden Markov model, which can robustly segment different behaviours in monitoring applications while accounting for present contextual factors.

•
We have derived a principled framework for inference in the developed AI-HMM approach.

•
We have shown the particular suitability of the nonparametric switching autoregressive model for the analysis of GPS and traffic count data.
This paper will continue as follows. Section 2 discusses related studies, Section 3 summarises the background of the used techniques, Section 4 explains the details of the proposed framework, Sections 5 and 6 summarise the algorithm experiments and its results and Section 7 concludes the report.

Related Work
Human behaviour profiling and automated analysis have great potential for both security and retail commercial applications in smart city designs. However, labelled sensor data of this nature is usually scarce, and unsupervised clustering or direct time series methods must be used [20][21][22]. One commonly used pipeline is the online segmentation of crowd data from video feeds. For example, Mehran et al. [14] proposed a novel Social Force model for identifying abnormal behaviours in crowds. The approach assumes that located moving particles on sequences of frames are individuals and uses them to estimate their interaction forces throughout time and estimate the "normal" behaviour of the crowd. Then, a bag-of-words classifier is trained to separate normal and abnormal crowd behaviours. Mahadevan et al. [23] also proposed a supervised anomaly detection method based on crowd videos. However, [23] evaluated their methods on a significantly smaller database of videos. Rodriguez et al. [24] developed a partially Bayesian approach to the same problem which allows for robust analysis of crowd videos completely unseen during training. While civic value can be demonstrated through the analysis of videos at crowded locations, such data provides only localised snapshots of the movement within the city. It is available only for a small set of chosen locations and requires a large amount of labelled, location-specific training data. Video feeds have also been used extensively for activity recognition applications [20,25,26]; however, such techniques are likely far from being applicable on a larger spatial scale across a city.
Continuous location data can be used to study a more diverse geographic perspective on personal mobility and daily activity patterns. Jiang et al. [21] investigated how to cluster daily patterns of activity in the city using both GPS data from individuals and additional survey data concerning their true activities. The study used the Chicago Regional Household Travel Inventory (CRHTI) dataset [27], which is one of the few data sources with fine labelled information in addition to GPS data. GPS trajectory data was not explicitly modelled but only used to identify different location areas for the participants. Jiang et al. [21] first constructed a calendar-type array indicating which of nine activities (work, home, school, etc.) occurred. Principal Component Analysis (PCA) was then applied, followed by K-means clustering, to infer patterns of daily activity across different regions of the city. The inferred patterns were used to reveal socio-demographic information about the region. The results of the study are informative about human behaviour; however, if GPS information were included in the analysis, the complexity of the data may affect the effectiveness of the proposed framework.
Shih et al. [22] modelled different mobility patterns based on GPS data from the Geolife dataset [28] to classify trips of people diagnosed with Alzheimer's disease. The basis for this is the common assumption that a symptom of Alzheimer's patients is spatio-temporal confusion. The abnormal trajectories were predicted by thresholding the similarity to the training set. Although the results showed up to a 97% accuracy rate in detecting abnormal mobility patterns, there were no ground truth data and, explicitly, no real information on the Alzheimer state of each subject. Instead, external sequences were added to the data of some individuals to represent Alzheimer's patients. This can add clear biases to the analysis and may indicate that the accuracy of the technique on more representative data will not mimic that of the presented study. More recently, Yao et al. [15] also proposed a novel behaviour recognition framework for GPS data, demonstrated on trajectories of ships. They used a Long Short-Term Memory (LSTM) autoencoder to infer the latent structure in features which were extracted from sliding windows of the trajectory time series. By passing the dataset through a trained network, the inferred embeddings were extracted and stored for each input time series. These embeddings were then clustered with K-means to identify different behavioural patterns in ship data. However, trajectories of ship movement contain significantly smaller variations across patterns and a constrained space of behaviours which can be explored. The LSTM autoencoder is likely to require multiple different behaviour sequences in order to group the trajectories in the latent space. Also, when focused on segmenting anomalous events, K-means is known to lead to misleading clustering [29].

Vector Autoregressive Model (VAR)
The Vector Autoregressive (VAR) model [30] is a widely used linear time series model for dynamic multidimensional data. An order p VAR assumes each observation in a sequence can be modelled as a linear combination of the p previous points in the sequence and a stochastic term. The VAR model is also a parametric model of the spectral density. Unlike Fourier-based methods, which require a windowing strategy, VARs are defined in the time domain and we can use p to control the number of spikes we assume in the power spectrum of the input. Let Y = (y_1, y_2, . . . , y_T) be a stationary time series with y_t = (y_t^(1), . . . , y_t^(D))^T ∈ R^D; a VAR(p) model then has the form

y_t = μ + Σ_{i=1}^{p} Θ_i y_{t−i} + ε_t,

where Θ_i is a (D × D) matrix of weights, μ is the mean term of the sequence Y and ε_t is a (D × 1) vector of white noise with zero mean and covariance matrix Σ. Using the matrix notation Y_{p+1:T} = (y_{p+1}, y_{p+2}, . . . , y_T), we can then write the VAR in the form

Y_{p+1:T} = Θ Ȳ + μ 1^T + E,

where Θ = (Θ_1, . . . , Θ_p), Ȳ is the matrix whose t-th column stacks the lagged observations ȳ_t = (y_{t−1}^T, . . . , y_{t−p}^T)^T, 1 is a vector of ones with the same length as Y_{p+1:T} and E = (ε_{p+1}, . . . , ε_T). Using this formulation, the parameters of a VAR model may be estimated in a few main ways depending on the adopted formalism: via maximum likelihood estimation, using the Yule-Walker equations or via Bayesian inference. Inference in Bayesian VAR models is practically identical to statistical inference in Bayesian linear regression. Assuming, for practical convenience, a conjugate prior over the VAR parameters, the joint prior over the parameters Θ and the covariance matrix Σ is a coupled Matrix-Normal Inverse-Wishart (MNIW) distribution, implied by the Gaussian likelihood of the outputs.
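As an illustration of the VAR(p) form above, the coefficients can be estimated by ordinary least squares on a stacked lag matrix. The sketch below is not the paper's implementation; the function name `fit_var_ls` and the choice of centring the series with its sample mean as a stand-in for μ are our own illustrative assumptions.

```python
import numpy as np

def fit_var_ls(Y, p):
    """Least-squares fit of a VAR(p) model.

    Y : (T, D) array of observations.
    Returns (Theta, mu, Sigma) where Theta is (D, D*p)."""
    T, D = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu                                    # centre the series
    # Lag matrix: row t stacks y_{t-1}, ..., y_{t-p}
    X = np.hstack([Yc[p - i:T - i] for i in range(1, p + 1)])  # (T-p, D*p)
    Z = Yc[p:]                                     # regression targets
    coeffs, *_ = np.linalg.lstsq(X, Z, rcond=None)  # (D*p, D)
    resid = Z - X @ coeffs
    Sigma = resid.T @ resid / (T - p)              # white-noise covariance
    return coeffs.T, mu, Sigma
```

On simulated data from a known VAR(1), the estimated coefficient matrix converges to the true weights as T grows.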
p(Θ, Σ) = MN(Θ; M, Σ, S_0) IW(Σ; n_0, S),

where n_0 is the number of degrees of freedom; S is the scale matrix; and M, Σ and S_0 are the mean, the covariance of the rows and the covariance of the columns of the parameter Θ, respectively. Collecting the sufficient statistics

S_ȳȳ = Ȳ Ȳ^T + S_0^{−1},  S_yȳ = Y Ȳ^T + M S_0^{−1},  S_yy = Y Y^T + M S_0^{−1} M^T,

the posterior of the VAR coefficients is then

Θ | Σ, Y ∼ MN(S_yȳ S_ȳȳ^{−1}, Σ, S_ȳȳ^{−1}).

To compute the posterior of the VAR noise covariance, it is more efficient to use the marginal likelihood instead, giving

Σ | Y ∼ IW(n_0 + N, S + S_yy − S_yȳ S_ȳȳ^{−1} S_yȳ^T),

where N is the number of observations used in the regression.
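As a minimal sketch of the conjugate coefficient update, the posterior mean of Θ under a Matrix-Normal prior can be computed from the same sufficient statistics. The function name `mniw_posterior_mean` and its argument layout are our own illustrative choices, not the authors' code.

```python
import numpy as np

def mniw_posterior_mean(Y, Ybar, M0, S0):
    """Posterior mean of VAR coefficients under a Matrix-Normal prior
    MN(M0, Sigma, S0) on Theta, with Gaussian outputs.

    Y    : (D, N)  stacked target vectors y_t
    Ybar : (Dp, N) stacked lag vectors ybar_t
    Returns the (D, Dp) posterior mean of Theta."""
    S0inv = np.linalg.inv(S0)
    K = Ybar @ Ybar.T + S0inv                  # posterior column precision
    return (Y @ Ybar.T + M0 @ S0inv) @ np.linalg.inv(K)
```

With a vague prior (large S0), the posterior mean collapses to the least-squares estimate, as expected from the Bayesian linear regression analogy.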

Autoregressive Infinite Hidden Markov Model (AR-iHMM)
The infinite Hidden Markov Model (iHMM), also known as the Hierarchical Dirichlet Process HMM (HDP-HMM), is a Bayesian nonparametric approach to the HMM first proposed by Beal et al. [31] and then formalised by Teh et al. [32]. The nonparametric formalisation tackles the traditional HMM's limitation of needing a pre-specified model structure, allowing the complexity to be inferred automatically and adaptively depending on the amount of data observed. Building on the iHMM, Fox [33] included linear AR dependence between the observations to form the AR-iHMM (also known as the HDP-AR-HMM). Figure 1 depicts the PGM of the model: parameters β are used to sample the transition distributions π_k, from which the state indicators z are sampled, and the observations y_t are generated from functions with parameters Θ_{z_t}, which are in turn sampled from base distribution H. The AR-iHMM differs from the iHMM in that there is autoregressive dependence between the observations. Let z = (z_1, . . . , z_T) denote the state indicator variables and π_k the state specific transition distribution for state k, where z_t ∼ Categorical(π_{z_{t−1}}). The observations y_t are conditionally independent given z_t and the lag vector Ȳ_t = (y_{t−1}^T, . . . , y_{t−p}^T)^T. The complete data likelihood can then be written as

p(Y, z) = Π_{t=1}^{T} π_{z_{t−1}}(z_t) N(y_t; Θ_{z_t} Ȳ_t, Σ_{z_t}),

using K+ to denote the inferred number of represented states. The iHMM defines a set of random probability measures G_k, one for each state, representing the dynamic observations Y, each sampled from a Dirichlet Process (DP) with a global probability measure G_0 and concentration parameter α, G_k | α, G_0 ∼ DP(α, G_0). The base probability measure is itself DP distributed with concentration parameter γ and a base probability measure H, G_0 | γ, H ∼ DP(γ, H), where the measures G_k are conditionally independent given G_0. Intuitively, this hierarchy means that the random distributions G_k vary around G_0 with variability governed by α, whereas G_0 varies around the base distribution H with variability governed by γ.
Including an explicit hyperparameter κ controlling the self-transition probability in the AR-iHMM [31,33], we can summarise the full AR-iHMM as follows:

β | γ ∼ GEM(γ),
π_k | α, κ, β ∼ DP(α + κ, (αβ + κδ_k)/(α + κ)), for k = 1, 2, . . .,
Θ_k | H ∼ H, for k = 1, 2, . . .,
z_t | z_{t−1}, (π_k)_{k=1}^∞ ∼ Categorical(π_{z_{t−1}}),
y_t | z_t, Ȳ_t ∼ N(Θ_{z_t} Ȳ_t, Σ_{z_t}),

where β = (β_k)_{k=1}^∞ is both a mixing prior for the top level DP and a base measure for the lower level DP, δ_k is a point mass at state k, GEM (which stands for Griffiths, Engen and McCloskey [34]) represents the stick breaking construction and (·)_{k=1}^∞ represents an infinite set.
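The GEM stick-breaking construction and the sticky transition rows can be sketched under a finite truncation. This is an illustrative sample-generation sketch, not the inference machinery of the paper; the function names are ours, and the truncation to L sticks is an assumption.

```python
import numpy as np

def gem_stick_breaking(gamma, L, rng):
    """Truncated GEM(gamma) stick-breaking weights beta of length L."""
    v = rng.beta(1.0, gamma, size=L)                  # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    beta = v * remaining                              # beta_k = v_k * prod(1 - v_i)
    beta[-1] = 1.0 - beta[:-1].sum()                  # fold leftover mass into last stick
    return beta

def sample_transitions(beta, alpha, kappa, rng):
    """Draw each truncated row pi_k from a Dirichlet with concentration
    alpha*beta + kappa*delta_k (the sticky self-transition bias)."""
    L = len(beta)
    Pi = np.empty((L, L))
    for k in range(L):
        Pi[k] = rng.dirichlet(alpha * beta + kappa * np.eye(L)[k])
    return Pi
```

Larger κ inflates the diagonal of the sampled transition matrix, encoding the prior belief that states persist over consecutive time steps.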

Model Specification
AR-iHMMs are powerful models capable of modelling complex dynamic data effectively. However, as a generative model, they assume that the state indicator z_t depends only on the state indicator at the previous time step, z_{t−1}. The model also assumes that the full time series shares the same upper layer mixing prior β and therefore the same set of transition distributions Π = (π_1^T, π_2^T, . . . , π_K^T)^T. In some real-world problems, there are independent and discrete features τ = (τ_1, τ_2, . . . , τ_T) that affect the model's state indicators and transition distributions. The semi-Markov variable τ can represent certain events or periods in the dataset, where data generated during any fixed realisation τ_t = v share the same mixing prior β and the same set of transition distributions Π, but have different distributions for different realisations v. For example, in the case of human movement analysis, different periods of the day τ_t ∈ {1, 2}, such as off-peak and peak time, respectively, can have different behavioural traits with separate parameters {Π_1, Π_2} and {β_1, β_2}. Although the desired separation could be achieved by modelling the data generated for each realisation separately with different models, this would likely result in assigning exclusive states and state emission parameters Θ_k to each model. This raises the problem of identifying a suitable method to measure state overlap across different models, specifically in cases like human movement behaviour where any small change in the state parameters may be important.
To avoid this problem, we propose a novel Adaptive Input infinite Hidden Markov Model (AI-iHMM), where an additional DP layer is added to the model's hierarchy, allowing for the generation of a different base measure G_0^(v) for each unique value of τ while still sharing the same upper level base distribution H for the sampling of the model parameters Θ_k. Figure 2 depicts the PGM of the proposed model with the respective joint probability distribution and formalisation.
Q | ψ, H ∼ DP(ψ, H),
G_0^(v) | γ, Q ∼ DP(γ, Q), for each realisation v,
G_k^(v) | α, G_0^(v) ∼ DP(α, G_0^(v)), for k = 1, 2, . . .,

where H is the base probability measure for the upper layer DP with concentration parameter ψ; Q is the master probability measure for the middle layer DP associated with the possible values of τ, with concentration parameter γ; and G_0^(v) is the context specific base measure from which the state specific measures G_k^(v) are drawn. Figure 2. PGM of the AI-HMM structure. Observation y_t depends on the indicator variable z_t and the p previous observations, whereas z_t depends on z_{t−1} in addition to the semi-Markov feature τ_t. Parameters β are used to sample the transition distributions π_k, from which the state indicators z are sampled. The observations y_t are generated from functions with parameters Θ_{z_t}, which are in turn sampled from base distribution H.
The possible values of τ can be inferred adaptively, as will be described in Section 4.2, by placing an appropriate prior on them (for example, the DP would be a conjugate choice) and estimating them using modified MCMC inference. This is useful when it is known that there is an underlying variable affecting z but its value at each time instance is unknown. For more intuitive analysis and a better understanding of the effect the contextual category has on the transition dynamics of the time series data, τ can also be set as a fixed categorical input to the model. This allows for more precise inference of the transition dynamics associated with each context v and for their custom selection based on the problem of interest. Depending on the complexity of the data, the estimation of the input τ in addition to the remaining model parameters may result in numerous local minima if no prior knowledge about the model is available. Therefore, it is advantageous to train with partial knowledge of τ, or to pre-train with a fixed input before inferring it adaptively. In the general case of a completely adaptive model, using the stick breaking construction, the model may be written as

λ | ψ ∼ GEM(ψ),
β^(v) | γ, λ ∼ DP(γ, λ), for each realisation v,
π_k^(v) | α, β^(v) ∼ DP(α, β^(v)), for k = 1, 2, . . .,

where λ is the upper level mixing parameter and the middle level base measure, η is the prior parameter of the distribution of τ and v_{j,i} is the posterior probability of τ_t = v when z_t = i and z_{t−1} = j.
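The generative mechanism described above, where the context τ_t selects which transition matrix drives z_t while every context shares one set of emission parameters, can be sketched as a forward sampler. This is a simplified order-1 illustration under a finite truncation, with our own function and variable names, not the paper's implementation.

```python
import numpy as np

def sample_ai_hmm(Pi_by_context, Theta, Sigma, tau, y0, rng):
    """Forward-sample an order-1 AI-HMM: tau_t selects the active
    transition matrix, while all contexts share the per-state VAR
    emission parameters Theta[k], Sigma[k].

    Pi_by_context : dict v -> (L, L) transition matrix for context v
    Theta         : (L, D, D) per-state VAR(1) coefficients
    Sigma         : (L, D, D) per-state noise covariances
    tau           : length-T array of context labels
    y0            : (D,) initial observation"""
    T = len(tau)
    L, D, _ = Theta.shape
    z = np.empty(T, dtype=int)
    y = np.empty((T, D))
    z[0], y[0] = rng.integers(L), y0
    for t in range(1, T):
        Pi = Pi_by_context[tau[t]]                 # context picks the dynamics
        z[t] = rng.choice(L, p=Pi[z[t - 1]])
        y[t] = Theta[z[t]] @ y[t - 1] + \
            rng.multivariate_normal(np.zeros(D), Sigma[z[t]])
    return z, y
```

Because Θ and Σ are indexed only by the state, a behaviour discovered in one context is directly comparable to the same behaviour in another, which is the state-overlap property motivating the model.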

Inference
Similarly to the AR-iHMM, the parameters of the AI-HMM can be inferred by blocked Gibbs sampling. A truncation level L is set to denote the maximum number of states expected for the time series modelling, so that k ∈ {1, 2, . . . , L}. Given previously set transition distributions {Π^(v)} and assuming a normal distribution likelihood on the autoregressive observation sequence, the first step is to block sample the input contextual vector τ. Given its Markov blanket, the posterior of τ_t is

p(τ_t = v | z_t, z_{t−1}) ∝ p(z_t | τ_t = v, z_{t−1}) p(τ_t = v) p(z_{t−1}).

Here, p(z_t | τ_t = v, z_{t−1}) is simply the transition matrix element π_{z_{t−1}}^(v)(z_t), p(τ_t = v) follows from τ_t being assumed categorically distributed with a Dirichlet prior and p(z_{t−1}) is the global prior of z_{t−1}. By design, changes in the value of τ_t should be sparse and infrequent, since they represent broad categories of transition dynamics and cover long contextual periods relative to the time series.
For the case of z_t = z_{t−1} = k, the likelihood terms may be similar for multiple values of v due to the closeness in value of the self-transition probabilities {π_{k,k}^(v)}. A change in the value of τ_t directly implies that the state z_t has changed as well, due to the hierarchy of the model. This means that, effectively, τ_t is only resampled when z_t ≠ z_{t−1}, which naturally enforces state persistence. The next set of parameters to sample is z. This is done using a variation of the forward-backward algorithm, where the (1 × L) backward message m_{t,t−1} is calculated recursively by

m_{t,t−1}(k) ∝ Σ_{j=1}^{L} π_k^(τ_t)(j) N(y_t; Θ_j Ȳ_t, Σ_j) m_{t+1,t}(j),

with m_{T+1,T} set to a vector of ones. The indicator variable z_t can then be sampled forwards in time from

p(z_t = k | z_{t−1}, τ_t, Y) ∝ π_{z_{t−1}}^(τ_t)(k) N(y_t; Θ_k Ȳ_t, Σ_k) m_{t+1,t}(k).

To sample the mixing parameters β^(v), the following auxiliary variables must be sampled, all of shape (L × L): M_v, representing the counts of transitions occurring due to sampling directly from the base probability; M̄_v, obtained from M_v after removing the override counts ω_j^(v), which record the occasions when the base probability distribution was used to draw self-transitions for state j due to the sticky parameter κ; and N^(v), representing the counts of transitions from state j to state k when τ_t = v. Given these counts, β^(v) may be sampled from a Dirichlet distribution with parameters (γ/L + m̄_{·,1}^(v), . . . , γ/L + m̄_{·,L}^(v)), where m̄_{·,j}^(v) denotes the sum of the j-th column of M̄_v. The mixing parameter may then be used to sample each row π_k^(v) from a Dirichlet distribution with parameters αβ^(v) + N_k^(v), where the hyperparameter κ is added to the parameter for state k to enforce self-transitions. The final step in the iterative optimisation procedure is to sample the state specific vector autoregressive parameters Θ_k and Σ_k.
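The backward-message recursion and forward sampling of z can be sketched compactly once the per-state likelihoods and the context-selected transition matrices are precomputed. This is our own illustrative sketch of the standard blocked HMM sampler, not the authors' code; the argument layout is an assumption.

```python
import numpy as np

def sample_states(Y_lik, Pi_seq, rng):
    """Blocked sampling of z_{1:T} for a truncated HMM.

    Y_lik  : (T, L) per-state likelihoods N(y_t; Theta_k ybar_t, Sigma_k)
    Pi_seq : (T, L, L) transition matrix at each step, already selected
             by the context tau_t (row = z_{t-1}, column = z_t)."""
    T, L = Y_lik.shape
    # Backward messages: m[t](k) proportional to p(y_{t:T} | z_{t-1} = k)
    m = np.ones((T + 1, L))
    for t in range(T - 1, -1, -1):
        m[t] = Pi_seq[t] @ (Y_lik[t] * m[t + 1])
        m[t] /= m[t].sum()                     # rescale for numerical stability
    # Forward sampling of z_t given z_{t-1} and the messages
    z = np.empty(T, dtype=int)
    w = Y_lik[0] * m[1]                        # uniform initial-state prior assumed
    z[0] = rng.choice(L, p=w / w.sum())
    for t in range(1, T):
        w = Pi_seq[t][z[t - 1]] * Y_lik[t] * m[t + 1]
        z[t] = rng.choice(L, p=w / w.sum())
    return z
```

Sampling the whole sequence in one block, rather than one z_t at a time, mixes far better when the states are persistent, which is exactly the regime the sticky parameter κ encourages.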
When using an MNIW prior, as explained in Section 3.1, the parameters may be updated given the observations Y and the state indicator variables z. For more details on the derivation of the update, please see Fox [33].
Here, Y^(k) and Ȳ^(k) are the observations and the matrix of lagged observations, respectively, that are assigned to state k, and c_k^(v) is the count of the number of observations in state k when τ_t = v. The AI-HMM is a generative model that can represent the observation data statistically with the optimised model parameters. Therefore, given a pre-trained model and initial starting points, it is possible to sample new observations that resemble the dynamics of the training data. The parameters can even be altered manually to represent different types of behaviours and trends. This allows for the generation of new custom datasets that can be used for research across different fields.

Modelling With Fixed Inputs
To demonstrate the effectiveness of the AI-HMM, we tested the model on two different datasets that represent human and traffic behaviour in smart city environments. The first was the Dodgers loop dataset [35], containing loop sensor data counting the number of vehicles that pass through the Glendale on-ramp for the 101 North Freeway in Los Angeles at 5 min intervals (288 readings per day) over a period of 25 weeks. In addition to the presence of weekday morning and afternoon traffic-peak trends in the dataset, a baseball stadium in the vicinity of the ramp allows for the observation of post-game traffic rises. This allows us to study the performance of the model at identifying commonly repeating daily trends that may change depending on external contextual information (weekend/weekday trends), as well as its outlier detection capabilities on the spikes in density counts due to sparsely occurring games at the nearby stadium. The dataset is simple and uniformly sampled, with easily identifiable trends that can be detected by simple data visualisation. Results can be seen in Section 5.1. The second dataset was the T-drive dataset [13,36], which contains GPS trajectory data from more than 10,000 taxis collected over a period of 1 week within the city of Beijing, China. Analysing this dataset is more challenging, as GPS data can be very noisy, sparse, irregularly sampled and heterogeneous, and its collection can be frequently interrupted by environmental factors causing missing data (rarely missing at random). This type of data is complex and allows us to explore the effect of multiple contextual realisations on the observation data, such as time of day and user identity. It also relates well to traffic analysis, an important topic in smart city research. The AI-HMM identifies interesting patterns and results, as seen in Section 5.2.

Dodgers Loop Sensor Dataset
The AI-HMM was used to model the complete Dodgers loop sensor dataset. Figure 3 shows the count scatter for an average game day on a weekday and on a weekend. As can be seen from the sharp peaks on the plots, games on weekdays often occur in the evening, whereas games on weekends can occur at a similar time in the evening or, as depicted in the figure, in the afternoon. For this experiment, the observation instances were y_t = (c, h), where c is the vehicle count for the 5 min interval at time t and h is the hour of day when the observation took place. As weekday and weekend traffic dynamics commonly differ in density flow, the input feature τ_t ∈ {1, 2}, where τ_t = 1 denotes weekdays and τ_t = 2 is assigned to weekends. This choice is supported by the assumption that traffic dynamics generally repeat on a daily basis for weekends and weekdays separately, enforcing the belief of having different transition behaviours for each in the generative model. The VAR order was set to p = 288, the number of observations in a day. Figure 4 shows an example of the results for five days in one week. Looking at the weekday plot, it can be seen that the morning and afternoon peak periods were clustered into states 1 (blue) and 4 (green), respectively. The sharp peaks in the evening (~23:00) on some weekdays, corresponding to traffic rises due to the baseball games, were clustered into state 6 (brown). On the weekend, the evening game on Saturday was also clustered into state 6; however, the afternoon (~16:30) game on Sunday was clustered into state 7 (pink), a state which is not encountered on weekdays when τ_t = 1. This demonstrates the dynamics of the AI-HMM: states are shared across the structures relating to the different values of τ_t; however, differences may arise when a set of observations following a certain behavioural trend occurs exclusively for one realisation.
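The observation and context construction used in this experiment can be sketched directly from timestamps and counts. This is a hypothetical helper of our own (`build_features` is not from the paper), assuming the 5 min readings carry timestamps.

```python
import numpy as np
import pandas as pd

def build_features(counts, timestamps):
    """Build AI-HMM inputs from 5-minute loop counts:
    y_t = (count, hour-of-day), tau_t = 1 on weekdays, 2 on weekends."""
    ts = pd.to_datetime(timestamps)
    hour = ts.hour + ts.minute / 60.0              # fractional hour of day
    Y = np.column_stack([np.asarray(counts, dtype=float), hour])
    tau = np.where(ts.dayofweek < 5, 1, 2)         # Mon-Fri -> 1, Sat/Sun -> 2
    return Y, tau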
To compare the results, we use Principal Component Analysis (PCA) on the daily density counts of vehicles. Figure 5 shows the first four PCs plotted against time. It can be seen that PC 2 appears to capture the dynamics of the morning peak hours, whereas PC 4 captures the dynamics of the lighter weekend midday traffic. PC 3 captures the trend of the evening game, which commonly occurs on weekdays and Saturdays. These results inform us in which orthogonal directions of the multidimensional data distribution we can expect to see these respective dynamics. However, this is specific to this dataset only; any additional day vectors added may alter the expected results. Moreover, using this technique alone, it is not possible to identify which days contain the identified trends and which do not. Within the first four PCs, there was no PC that appeared to identify the evening peaks or the afternoon baseball games. All of these shortcomings are addressed by the AI-HMM approach, demonstrating the effectiveness of the technique.
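The PCA baseline above amounts to an SVD of the day-by-interval count matrix. The sketch below is our own reconstruction of that baseline (the paper does not give its PCA code); the function name and the 288-reading-per-day reshape are assumptions from the dataset description.

```python
import numpy as np

def daily_pcs(counts, n_components=4):
    """PCA of the day-by-interval count matrix (288 readings per day).

    counts : flat array of 5-minute counts, length a multiple of 288.
    Returns (per-day PC scores, PC loadings over the 288 intervals)."""
    X = np.asarray(counts, dtype=float).reshape(-1, 288)   # one row per day
    X = X - X.mean(axis=0)                                 # centre each interval
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]        # per-day scores
    return scores, Vt[:n_components]                       # scores and loadings
```

Each loading row is a 288-point daily profile; plotting the corresponding score against the day index reproduces the PC-versus-time view of Figure 5.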

Data Preprocessing
GPS data often come as a collection of geolocation data points X = (x_1, x_2, . . . , x_T), where x_t = (time, λ, l) ∈ R^3, λ is the longitude coordinate and l is the latitude coordinate. The altitude is also included in some data sources; however, it is neglected for the purpose of this paper since it is not always available. If the altitude is recorded in the dataset, the feature can be added to the analysis in the proposed technique with no changes to the structure. Extreme outliers relating to errors in the sensor measurements, as well as very short trajectories (for example, a trajectory consisting of only five data points), may be discarded since they do not generally hold much information about the journey.
Using only the trajectory time series X, the feature sequence Y is calculated to describe the trajectory path, where Y = (y_1, y_2, . . . , y_T), y_t = (λ, l, h, ∆v) ∈ R^4, h is the hour of day and ∆v is the average velocity given the distance ∆d travelled between two points. The distance is calculated with the Haversine formula [37],

hav(φ) = sin^2(φ/2),
∆d = 2r arcsin( sqrt( hav(l_2 − l_1) + cos(l_1) cos(l_2) hav(λ_2 − λ_1) ) ),

where r = 6371 km is the radius of the Earth, assumed to be constant at all locations on its surface. ∆v is then calculated as ∆d_t/∆time_t. From the T-drive dataset, the recorded journeys of 10 taxis were selected randomly and modelled with the proposed AI-HMM.
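The Haversine distance and per-step velocity features can be computed as follows; a minimal sketch with our own function names, assuming coordinates in degrees and timestamps in seconds.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (degrees in, km out)."""
    phi1, phi2, lam1, lam2 = map(np.radians, (lat1, lat2, lon1, lon2))
    h = np.sin((phi2 - phi1) / 2) ** 2 + \
        np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

def average_velocity(lat, lon, t_seconds):
    """Per-step average speed in km/h along a trajectory."""
    d = haversine_km(lat[:-1], lon[:-1], lat[1:], lon[1:])
    dt = np.diff(t_seconds) / 3600.0               # seconds -> hours
    return d / dt
```

One degree of longitude at the equator is roughly 111.2 km, which gives a quick sanity check on the implementation.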

AI-HMM Modelling with Predefined Input τ
Beijing is a busy city that is known for congested streets throughout most hours of the day [38]. However, the trajectories of trips at peak and off-peak times are likely to have very different latent dynamics. For this part of the case study, we assume that τ_t ∈ {1, 2} can represent peak and off-peak traffic times, respectively, preset prior to modelling based on the time of day at which the trajectory point was recorded. Due to the smoothness of the sampled GPS trajectories, we opted for a low VAR order of p = 2. The hyperparameters γ, α and κ can be set experimentally by Bayesian model selection or by placing an updatable prior over them, as seen in Fox [33]. After running the experiment until convergence, the model identified six states corresponding to different movement behaviours, as presented in Figure 6.
Figure 6. Results obtained from analysing 10 taxi trajectories using the novel AI-HMM framework (left) and the GRU technique (right). Each data point corresponds to a GPS observation, coloured by the state to which it was assigned. The plots depict longitude λ against latitude l, reconstructing the movements of the taxis on a map.
For comparison, a t-distributed Stochastic Neighbour Embedding (tSNE)-based technique [39] was applied following the methodology of Singleton [40]. However, due to the complexity of the data, the technique failed to identify meaningful clusters, experimentally converging to the sample data average. We also compare our framework with a deep learning Gated Recurrent Unit (GRU) autoencoder embedding model with K-means clustering, similar to that proposed by Yao et al. [15]. To enable direct comparison, the number of clusters for the K-means algorithm was set to 6, matching the number of states that were automatically identified with the nonparametric AI-HMM framework. Figure 6 shows the results obtained from the AI-HMM and the GRU frameworks, while Table 1 shows the median velocity v_median, the interquartile range (IQR) of velocity v_IQR, and the most prominent time ranges h_prominent for each of the states under both techniques. The plots depict a reconstructed scatter map of the data, where each point represents an observation with its corresponding value and time-step, and the colours indicate which state the points belong to according to the legend. States 2 (orange) and 5 (purple) are mostly concentrated in the middle of the city, with median velocities of 18 km/h and 19 km/h, respectively. Their most prominent hours of day indicate that they represent the morning and afternoon peak periods, when the roads are highly congested. As night approaches and traffic subsides, they are replaced by state 4 (red), which is also most prominent in the centre of the city, with a median velocity of 23 km/h. State 1 (blue) is focused on the highway network, where velocities are expectedly higher, with a median velocity of 26 km/h. The observations belonging to state 3 (green), with a median velocity of 0 km/h, are mostly distributed near Beijing Capital International Airport, where taxis often spend prolonged periods stationary awaiting customers.
As previously mentioned, GPS data can be very noisy, with sporadic measurement errors causing possible location jumps. This noise was not filtered prior to analysis; however, the model was capable of identifying the noisy observations and separating them from the rest of the data into state 6 (purple). The GRU analysis was also successful at identifying the noisy observations and filtering them into states 3 (green) and 6 (brown); however, the remaining states have very similar median velocities. Inspection of the state velocity histogram distributions indicates that the clustering values have been affected by long periods of stationarity (influenced by taxis at the airport). The distribution of the data points in Figure 6 also shows that there is minimal location-based clustering, since most of the states overlap between all regions of the city (e.g., city centre and highways). Therefore, the GRU was effective at identifying the most separated clusters; however, it identified more specific and less divergent behaviour patterns. The AI-HMM framework was capable of identifying meaningful states that represent behavioural trends combining location, velocity and hour-of-day patterns, yielding more informative and representative results.

Adaptive Input
In cases where the contextual input τ is known or can be assigned to the data, the AI-HMM demonstrates highly informative unsupervised analysis results, outperforming other commonly used time series sensor analysis techniques in the literature. However, it is often the case that τ is partially missing and must therefore be inferred from the data adaptively. To demonstrate the performance of adaptively inferring the input vector τ given a pre-trained model, another experiment was conducted on the T-drive dataset, where τ was used to indicate the taxi's identity. The full data of two different taxis were extracted and separated into a 75% training set and a 25% test set. For the training data points, the input was set to τ_t ∈ {1, 2}, representing taxi 1 and taxi 2 respectively, and the data were fit with the AI-HMM model to learn all remaining parameters and variables Θ_k, μ_k, Σ_k, Π_v, β_v and z_train. The aim for the test data was to infer the values of z_test as well as τ_test using Equation (13), assuming they are unknown.
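A much-simplified sketch of this adaptive inference is given below. The paper infers τ_test via Equation (13) jointly with z_test; here, as an illustrative assumption, we take a decoded state sequence as given and treat τ as constant over the sequence, selecting the context whose transition matrix maximises the likelihood of the observed transitions.

```python
import numpy as np

def infer_tau(z, Pi):
    """Pick the contextual input tau that best explains an observed state path.

    z  : decoded latent state sequence (e.g. from a pre-trained AI-HMM)
    Pi : dict mapping each context tau -> row-stochastic transition matrix
    Returns the tau maximising the log-likelihood of the transitions in z.
    This stands in for the paper's Equation (13), which infers tau_t jointly
    with z_t; here tau is assumed constant over the whole sequence.
    """
    def loglik(P):
        # Sum log transition probabilities along the path z
        return sum(np.log(P[z[t - 1], z[t]]) for t in range(1, len(z)))
    return max(Pi, key=lambda tau: loglik(Pi[tau]))
```

With a "sticky" matrix for one context and an "alternating" matrix for the other, a self-persistent state path is attributed to the sticky context, mirroring how a taxi's characteristic transition dynamics reveal its identity.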
As would be expected, the transition distributions of the two different taxis given the training data showed some overlap in behaviour time dynamics alongside some differences. This supports the hypothesis that sensor data belonging to different contextual factors (i.e., person identity in this example) follow different time dynamics in an HMM structure. Correctly identifying these transition distributions allows for more accurate model fitting; during optimisation, the framework will refer to a more specific transition matrix representing the state transitions under the specified contextual realisation, generating more informative clustering results. For example, transitions into state 4, representing late-night off-peak travel, may be absent from taxi 2's behaviour dynamics, indicating that this taxi avoids late-night fares and prefers to work at different times of day, whereas taxi 1 may prefer such less congested times.
The presence of differences in the transition dynamics allows for accurate estimation of τ on unseen data given the trained parameters, as reflected by the results obtained from the test data run. Figure 7 shows the AI-HMM's capability of estimating the correct identity of the taxi using adaptive τ inference, with an accuracy of 93%. The state transition behaviour of an entity can thus help discover its identity.

Conclusions
This paper demonstrates a novel Bayesian nonparametric model for the identification of human behavioural trends from smart city sensors. We address the unsupervised problem of analysing complex data sequences representing human movement by designing a generative PGM that permits the adaptive contextual switching of transition dynamics. Results obtained from applying the model to complex real-world data demonstrate its effectiveness in identifying different dynamic behavioural trends, as well as its noise filtering capabilities. We have motivated the value of the presented approach via practical monitoring problems and compared the results to alternative tools such as RNNs. The presented framework allows for easy augmentation of existing systems with an adaptive factor that can estimate the effect of discrete contextual information. By adaptively estimating the input value given a pre-trained model, it is possible to accurately identify the contextual realisation to which the data belongs, such as predicting the identity of the individual that generated the data.
At present, the model is limited to low-dimensional time-series data. However, future work will include broadening the model to scenarios where the observation model can be high dimensional and where multiple dependent contextual variables can be included.