Clustering and automatic labelling within time series of categorical observations—with an application to marine log messages

System logs or log files containing textual messages with associated time stamps are generated by many technologies and systems. The clustering technique proposed in this paper provides a tool to discover and identify patterns or macrolevel events in this data. The motivating application is logs generated by frequency converters in the propulsion system on a ship, while the general setting is fault identification and classification in complex industrial systems. The paper introduces an offline approach for dividing a time series of log messages into a series of discrete segments of random lengths. These segments are clustered into a limited set of states. A state is assumed to correspond to a specific operation or condition of the system, and can be a fault mode or a normal operation. Each of the states can be associated with a specific, limited set of messages, where messages appear in a random or semi‐structured order within the segments. These structures are in general not defined a priori. We propose a Bayesian hierarchical model where the states are characterised both by the temporal frequency and the type of messages within each segment. An algorithm for inference based on reversible jump MCMC is proposed. The performance of the method is assessed by both simulations and operational data.


| INTRODUCTION
Log files contain messages that are automatically generated by a machine or computer system. The messages are in many cases compact and coded in order to save storage space, such that expert knowledge or a suitable dictionary is needed to understand the contents. The log files are often used for troubleshooting and fault identification by technical experts. Given the trend towards autonomous systems and remote operations in the marine industry, there is a need to automate system monitoring and fault identification based on this kind of textual data. The literature survey given by Li et al. (2017) describes similar data types and problem statements within the field of computer system management. One common target shared between these applications is to detect and identify structures in the data that are not recognised or marked by the machine or system generating the logs. Identification and classification of faults are of particular interest.

| Marine variable speed drives
A variable speed drive (VSD) is used to control the torque and speed of the motors on ships with electric propulsion motors, and is an integral part of the propulsion system. In the considered systems, the operation of the VSD is controlled and monitored by a drive control unit (DCU), which outputs the log files.
The messages generated by the DCU reflect changes in the operational mode of the variable speed drive in addition to triggered safety functions like alarms and emergency shutdowns. In this context, a change of the operational mode may for instance be that the drive is powered on or off. The variable speed drive is a subsystem onboard the ship and interacts not only with several other systems, but also with crew members who control the vessel and occasionally perform maintenance and repairs. The state of the drive system is hence influenced by a range of systems and functions, but the output in the log files reflects the state of the drive only.
The intention of the presented work is to be able to automatically infer the operations of the variable speed drive and to cluster the stream of messages into segments that represent modes of operation or states, which can be identified and appropriately labelled by technical subject experts. Labelling of different classes of faults is a main focus, and will be used to automate fault diagnostics in future industrial systems. The derived series of consecutive states or modes of operation can be considered as new time series.

| Methodological approach
In this paper, we introduce an approach for learning macrostructures in event history data. This is performed by dividing the time interval into segments through a clustering process. These segments are characterised by two distinct properties that both are probabilistically modelled. The first is the temporal frequency of the messages occurring within a cluster of a given state, defined by the number of messages per unit of time. The main rationale is that a high number of messages are generated within a relatively short time when a fault occurs. The second property is related to the types of messages that occur within the segment, which also will be described through a probabilistic model.

KEYWORDS: clustering, categorical time series, random observation time, multiple change points, reversible jump MCMC

Each observation from the log file consists of a timestamp and a message (from a finite number of possible categories); the log file can hence be considered as a marked point process in time, with a data structure similar to those used in event history analysis (Andersen et al., 2012). See Figure 1 for an example.
Our modelling approach is to introduce a latent stochastic process describing the macro-events. This latent process is modelled as an additional point process which will be assumed locally independent of the observation process (Didelez, 2008). The combined model of the latent and observed processes can be seen as a continuous time state space model (with the observation process also corresponding to a continuous-time process). The latent process is assumed to be piecewise deterministic with discontinuities or jumps (Jacobsen, 2006). Between jumps, segments are characterised by marks which describe the segment behaviour. Our model is similar to the joint temporal dynamics model for discrete events of Bhattacharjya et al. (2020) but differs in that we only observe one of the processes.
Our approach to learning structures from the observed messages is to divide the time series into segments of similar structure, where similarity here is both with respect to the frequency and to the type of messages. Division into segments includes change-point detection (Chen & Gupta, 2011) but extends such frameworks to allow segments to be allocated to a smaller set of clusters. The approach can also be seen as a problem of clustering time series (Liao, 2005) of categorical observations (Pamminger & Frühwirth-Schnatter, 2010) but differs from most common settings in that clustering is performed within a time series. Since the data we consider are sparse, a model-based approach to clustering will be considered (Fraley & Raftery, 2002; Frühwirth-Schnatter, 2011). Arnesen et al. (2016) considered a Bayesian approach to segmentation of sequences of categorical observations, which has several similarities to our approach. However, only discrete time observations were considered by Arnesen et al. (2016), and a fixed set of segment types was assumed to be visited in a prespecified order. In our approach, observations occur as a point process in time with no restrictions on the order in which segment types are visited.
Although there are similarities both in data type and motivation with the computer system management application described by Li et al. (2017), there are some important differences that have motivated our approach to the marine setting. The number of messages is in general smaller within the considered data, making harder modelling assumptions necessary. Furthermore, in the considered case, the information in each message is limited. Several messages need to be grouped together in order to infer the macro-event, or state, that occurs at a given time.

[Figure 1: A timeline plot of messages. Each point represents a message; many points overlap due to high-frequency periods.]
The rest of the paper is organised as follows: In Section 2, we describe the data at hand. In Section 3, the model is presented, while Section 4 describes our Bayesian inference approach. In Section 5, the method is evaluated on simulated data, while Section 6 considers a data set from marine vessels. The paper is concluded in Section 7. Details about the algorithms and the experiments are given in the appendices.

| MESSAGES AND THEIR STRUCTURE
The output messages in the considered application are static, in the sense that they are generated from a finite set of messages and contain no numeric or dynamic information. The messages are hence treated as categorical variables, and the observations are a sequence of two-dimensional objects {(T i , E i ), i = 1, 2, …}. Here T i ∈ ℝ + is a timestamp denoting the time each message was generated, while E i ∈ {1, …, m e } is a categorical variable specifying the recorded message. We assume observations are available in a time window [t 0 , t max ] with n denoting the number of messages within this interval. The set of possible messages is finite, and the size of the set is in the order of 100-200. The messages in general reflect changes in the internal state of the equipment rather than periods of stationary operation. For instance, a message will be generated when the drive is powered on, while nothing is output if the drive is operated for days or weeks without interruption, until the drive is powered off. An example of a sequence of messages is given in Table 1. A timeline plot of longer series of messages is given in Figure 1. We see both from the table and the figure that the frequency of messages can vary considerably in different time periods and also that there is a wide range of messages being generated within this time frame. Furthermore, in the timeline plot we also see that while some messages appear frequently in most periods, others appear within some specific periods only.
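The marked point process structure of the data can be sketched in code. The following is a minimal illustration of the $(T_i, E_i)$ representation; the `timestamp;message` line format and the integer coding of the message dictionary are assumptions for illustration, not the format of the actual DCU logs:

```python
from dataclasses import dataclass

# Illustrative subset of the message dictionary (the real set has
# on the order of 100-200 entries, mapped to codes 0..m_e - 1).
MESSAGE_CODES = {"inu_runcmd": 0, "inu_stopcmd": 1, "exu_runcmd": 2}

@dataclass(frozen=True)
class LogEvent:
    t: float  # timestamp T_i (e.g. hours since t0)
    e: int    # categorical message code E_i in {0, ..., m_e - 1}

def parse_line(line: str) -> LogEvent:
    """Parse one hypothetical 'timestamp;message' log line."""
    t_str, msg = line.strip().split(";")
    return LogEvent(t=float(t_str), e=MESSAGE_CODES[msg])

events = [parse_line(s) for s in ["0.10;inu_runcmd", "0.45;inu_stopcmd"]]
```

The sequence `events` then plays the role of the observations $\{(T_i, E_i)\}$ within a window $[t_0, t_{\max}]$.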

| Notation
We use upper case for stochastic variables and lower case for fixed quantities. We will however use lower case Greek letters for parameters even though these will be treated as stochastic variables in a Bayesian framework. We will use p(·) to denote a generic continuous density with the arguments defining the variable(s) in question. We similarly use Pr(·) for discrete variables. In cases of combinations of continuous and discrete variables, we use p(·). We denote densities/distribution functions by, for example Dirichlet(λ;α) where the term before the semicolon denotes the stochastic variable and the term(s) after denote parameter(s). We use standard parameterisation except for the Gamma distribution where Gamma(·;κ,ϕ) denotes the Gamma density with shape parameter κ and mean parameter ϕ.
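The shape/mean Gamma parameterisation used here differs from the shape/scale convention of most libraries; since mean = shape x scale, a Gamma(kappa, phi) draw in the paper's notation corresponds to scale phi / kappa. A small sketch (parameter values are illustrative):

```python
import random

def sample_gamma_shape_mean(kappa: float, phi: float) -> float:
    """Draw from Gamma(kappa, phi) in the shape/mean parameterisation:
    a mean of phi with shape kappa implies scale phi / kappa."""
    return random.gammavariate(kappa, phi / kappa)

# Empirical check that the mean parameterisation behaves as intended.
random.seed(1)
draws = [sample_gamma_shape_mean(kappa=2.0, phi=5.0) for _ in range(20000)]
mean_est = sum(draws) / len(draws)  # should be close to phi = 5.0
```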
We assume that time is divided into different segments, see Figure 2, and use index j to denote segment number. Each segment is characterised by the duration D j , the number of messages within the segment, N j , as well as a two-dimensional state vector (V j , Z j ) where V j ∈ {0, …, n v } describes the activity state while Z j ∈ {0, …, n z } describes the frequency state. The activity state here models macro-events or operational modes, such that one activity state can be related to a particular class of failure or a given operation of the equipment. The role of the frequency state is regarded as instrumental; our inferential objective focuses only on the structure of the activity state in terms of messages and its value at specific time points.
As there are long periods in the data set where no messages are output from the system, we define one auxiliary zero state (V = 0) to correspond to an inactive state, in which case also Z = 0. For all other states, V_j ≥ 1 and Z_j ≥ 1. The inactivity we refer to does not exclusively imply that the system itself is inactive or shut down, but rather the general lack of new incoming messages. We also define $B_j = t_0 + \sum_{j'=1}^{j} D_{j'}$ to be the breakpoint between segments j and j + 1. Note that there is a one-to-one correspondence between {B_j} and {D_j} and we will use either representation when convenient.
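The one-to-one correspondence between breakpoints and durations is just a cumulative sum and its inverse; a minimal sketch:

```python
from itertools import accumulate

def durations_to_breakpoints(t0: float, durations: list) -> list:
    """B_j = t0 + sum of D_1..D_j: cumulative sums of the durations."""
    return [t0 + b for b in accumulate(durations)]

def breakpoints_to_durations(t0: float, breakpoints: list) -> list:
    """Inverse map: successive differences recover the durations."""
    prev = [t0] + breakpoints[:-1]
    return [b - p for b, p in zip(breakpoints, prev)]

d = [1.5, 0.5, 3.0]
b = durations_to_breakpoints(10.0, d)           # [11.5, 12.0, 15.0]
assert breakpoints_to_durations(10.0, b) == d   # round trip
```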

| Model
A model-based approach is considered, as the number of messages within the observation interval is limited. We assume a continuous time state space formulation where the segments follow a Markovian structure. Here S, the number of segments within the observed time window, is defined by the number of breakpoints occurring within the given time window. Note that the last segment will be censored in that its end point will be some time after the last observation. The time dependence between segments is described through the (latent) segment states {V_j}. The extra discrete state variable Z_j is introduced in order to capture high variability both in the length of the segments and in the number of log messages within the segments. A graphical presentation of the hierarchical structure of the model is given in Figure 3. Each of the conditional distributions, involving several parameters, will be described below.

| State dynamics
We assume a simple dynamic structure where the main dependence structure is that two consecutive segments cannot both be in the zero (inactive) state. This will be modelled through dependence on V_{j−1} (Equation 1).

[Figure 3: Graphical representation of the model within a segment. Each segment is characterised by two properties, the activity state (type of messages) and the frequency of messages. The first is captured by the model through the specification of the state V_j, while the second is specified through the state Z_j. The upper row lists the global parameters involved in the model, which will be given prior distributions in our Bayesian framework. Note that V_j also depends on V_{j−1}, which is not included in the graph.]
More sophisticated Markov structures could easily be included in the model, but would require longer time series in order to identify the larger set of parameters. Note that two or more segments can belong to the same (active) state V, see for example the right column of Table 1. This behaviour fits the natural dynamics of the system, where the same operation can be executed repeatedly while each execution remains a distinct segment.
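The zero-state constraint can be sketched as a one-step transition rule. In this illustration, an active segment is followed by the inactive state with probability p0, while an inactive segment is always followed by an active state drawn from a probability vector; the vector `pi_v` and the rule for leaving the inactive state are illustrative assumptions:

```python
import random

def sample_next_state(v_prev: int, p0: float, pi_v: list) -> int:
    """Sample V_j given V_{j-1}. An active segment is followed by the
    inactive state V = 0 with probability p0; an inactive segment is
    always followed by an active state drawn from pi_v, so two
    inactive segments can never follow each other."""
    if v_prev != 0 and random.random() < p0:
        return 0
    return random.choices(range(1, len(pi_v) + 1), weights=pi_v)[0]

random.seed(0)
path = [1]
for _ in range(1000):
    path.append(sample_next_state(path[-1], p0=0.93, pi_v=[0.5, 0.3, 0.2]))
```

By construction, `path` never contains two consecutive zeros, mirroring the constraint above.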

| Segment features
We assume a Gamma density, in the shape/mean parameterisation introduced above, for the duration of an active segment, with parameters depending on the frequency state Z_j. The duration of zero-state segments is assumed to follow a two-component mixture distribution, as there is more variability in the duration of these segments.

| Observations
The number of messages within a segment (only relevant if V_j > 0) is assumed to follow a shifted negative binomial (NB) distribution for active states, with segment-dependent parameters. The shift x imposes the requirement of at least x + 1 messages within an active segment, while the negative binomial distribution accounts for overdispersion relative to a Poisson distribution.
Conditioned on the duration of the segment and the number of messages, we assume the timestamps to be an ordered set of independent identically distributed (iid) variables from the uniform distribution over [B_{j−1}, B_j). The messages E_{j,1}, …, E_{j,N_j} observed within segment j are assumed to be iid random variables with Pr(E_{j,i} = e | V_j = v) = λ_{v,e}, where λ_v is a state-dependent probability vector. Also in this case a very simple model is assumed, ignoring the order of occurrence of messages within segments; Li et al. (2017) define this as a parallel episode. Segments will mostly contain a limited number of messages, making more parameter-rich models (e.g. serial episodes) difficult to identify.
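Under these assumptions, generating the observations of one active segment amounts to drawing sorted uniform timestamps and iid categorical messages. A minimal sketch (the function name and the example interval, count and probability vector are illustrative):

```python
import random

def simulate_active_segment(b_prev, b_next, n_msgs, lam, rng):
    """Given the segment interval [b_prev, b_next), a message count
    N_j = n_msgs and a state-dependent message probability vector lam,
    draw timestamps as sorted iid Uniform[b_prev, b_next) variables
    and messages as iid categorical draws."""
    times = sorted(rng.uniform(b_prev, b_next) for _ in range(n_msgs))
    msgs = rng.choices(range(len(lam)), weights=lam, k=n_msgs)
    return list(zip(times, msgs))

rng = random.Random(7)
seg = simulate_active_segment(0.0, 2.0, n_msgs=5, lam=[0.7, 0.2, 0.1], rng=rng)
times = [t for t, _ in seg]
```

Sorting iid uniforms is equivalent to drawing the order statistics directly, which is what the model assumes for the timestamps.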

| Prior distributions
A fully Bayesian approach is employed for inference. Whenever suitable, we specify conjugate prior distributions on the parameters for computational simplicity. We define θ to be the full set of parameters. The probability vectors for the state components, as well as the state-dependent probabilities λ_v for the messages, are assumed to follow Dirichlet distributions with parameter vector consisting of a's of appropriate length. Note that for p_0 this reduces to the Beta(a, a) distribution. This choice implies no prior knowledge: each possible combination (Z_j, V_j) is a priori equally likely and, similarly, each state a priori assigns equal weight to all possible messages. In our experiments we have used a = 1.
Concerning the parameters of the negative binomial distribution on the number of messages within segments, and those of the Gamma distribution on the lengths of the segments, we utilise a reparameterisation into shape and mean parameters, (κ_z, φ_z) say, for each frequency state z. Conditional on the shape, a conjugate prior for the mean parameter is an inverse Gamma distribution. In particular, we assume the shape and mean parameters of each distribution to be a priori independent, with inverse Gamma priors on the means.
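The two conjugacies used here admit closed-form posterior updates. The following is a minimal illustration: a Dirichlet(a, …, a) prior updated by categorical message counts, and an inverse Gamma prior on the mean of a Gamma likelihood with known shape (function names, prior hyperparameters and data values are illustrative):

```python
from collections import Counter

def dirichlet_posterior(a, m_e, messages):
    """Dirichlet(a, ..., a) prior over m_e message probabilities;
    with iid categorical observations, the posterior is Dirichlet
    with parameters a + counts (a = 1 in the paper's experiments)."""
    counts = Counter(messages)
    return [a + counts.get(e, 0) for e in range(m_e)]

def invgamma_posterior(alpha, beta, kappa, xs):
    """With fixed shape kappa and an InvGamma(alpha, beta) prior on
    the Gamma mean phi, the posterior is
    InvGamma(alpha + n * kappa, beta + kappa * sum(xs))."""
    return alpha + len(xs) * kappa, beta + kappa * sum(xs)

post = dirichlet_posterior(a=1.0, m_e=4, messages=[0, 0, 2, 0, 3])
ig_post = invgamma_posterior(alpha=2.0, beta=3.0, kappa=2.0, xs=[1.0, 2.0])
```

The inverse Gamma update follows from writing the Gamma(κ, φ) likelihood as proportional to φ^(-nκ) exp(-κ Σx_i / φ), which matches the inverse Gamma kernel in φ.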

| Identifiability and relabelling
A well-known obstacle in Bayesian cluster models is the label switching problem (Jasra et al., 2005). If both the likelihood function and the prior are invariant under permutations of the labels (in our case the states), the posterior distribution will also be invariant to permutations. The marginal distribution for the parameters of each cluster component will then be identical. Several approaches have been proposed to solve the problem, see for example Jasra et al. (2005) and Rodriguez and Walker (2014). In all experiments, we compare four of the most used relabelling algorithms: Stephens's method (Stephens, 2000), the pivotal reordering algorithm (Marin et al., 2005), ECR iterative 1 and ECR iterative 2 (Papastamoulis & Iliopoulos, 2010; Rodriguez & Walker, 2014). All methods are implemented in the label.switching package in R (Papastamoulis, 2016). Our inferential purpose focuses on the state composition in terms of messages and the value of the V state at specific time points τ = (τ_1, τ_2, …) called hotspots (see Section 4.3 for details); that is, we want to infer p(λ_1 | Data), …, p(λ_{n_v} | Data) and Pr(V(τ) | Data). The role of the Z state is considered instrumental. On this premise we carry out the relabelling solely on the above-mentioned variables.
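The paper relies on the packaged algorithms above; as an illustration of the underlying idea, the following brute-force sketch relabels one posterior draw of the message-probability vectors by choosing the state permutation closest to a pivot (feasible only for a handful of states; names and example values are illustrative):

```python
from itertools import permutations

def relabel_to_pivot(lam_sample, lam_pivot):
    """Pivot-style relabelling sketch: pick the permutation of state
    labels minimising the total squared distance between a sampled
    set of message-probability vectors and a pivot set."""
    states = range(len(lam_pivot))
    def cost(perm):
        return sum(
            (a - b) ** 2
            for s, ps in zip(states, perm)
            for a, b in zip(lam_sample[ps], lam_pivot[s])
        )
    best = min(permutations(states), key=cost)
    return [lam_sample[s] for s in best]

pivot  = [[0.9, 0.1], [0.1, 0.9]]
sample = [[0.15, 0.85], [0.88, 0.12]]   # labels swapped by the sampler
relabelled = relabel_to_pivot(sample, pivot)
```

Here the swapped draw is mapped back so that each row again matches the pivot state it resembles most.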
Full information about the processes is given through the posterior distribution based on the observed timestamps and messages {(t_i, e_i), i = 1, …, n}. Here S is the number of segments within the observed period, D_S is the set of durations, and similarly Z_S and V_S are the sets of frequency and activity states, respectively. Furthermore, 𝒟 = {t_1, …, t_n, e_1, …, e_n} is the set of observations. We will use the notation ψ_S = (D_S, V_S, Z_S) in the following, both to simplify notation and to provide a generic presentation of the algorithm.
The posterior distribution is analytically intractable and sampling methods such as Markov chain Monte Carlo (MCMC, Gilks et al., 1995) have to be applied. A further complication in this case is that the dimension of the unknowns depends on the number of segments, which is a random variable. We apply a reversible jump MCMC (Green, 1995) for dealing with this transdimensional problem. Moves between different dimensions are obtained by moves, additions and removals of breakpoints.

| Standard reversible jump MCMC
For notational simplicity, we will suppress the conditioning on θ in the following. We write the segment-specific variables as ψ_S = (ψ_{1,S}, ψ_{2,S}) and consider a setting where simulation from p(ψ_{2,S} | S, ψ_{1,S}, 𝒟) is possible. We assume there exists an auxiliary variable u_{S→S⋆} such that (S⋆, ψ_{1,S⋆}) = G(S, ψ_{1,S}, u_{S→S⋆}) for a one-to-one mapping G. One step of a reversible jump MCMC algorithm is described in Algorithm 1. The acceptance ratio is given in Equations (9) and (10), where J_{S⋆→S}(ψ_{S⋆}, u_{S⋆→S}) is the Jacobian accounting for the transdimensional move. Equation (9) is the most convenient form for calculation of the acceptance ratio, while Equation (10) is easier to interpret. Note in particular that if ψ_{1,S} is empty, the algorithm reduces to the idealised one (an MCMC algorithm that only considers the marginal posterior distribution of S). In general, we would like ψ_{1,S} to be low-dimensional compared to ψ_{2,S} in order to obtain an algorithm close to the idealised one. For the specific application, we define ψ_{1,S} = D_S and ψ_{2,S} = (Z_S, V_S). Note that due to the way ψ_{2,S⋆} is generated, the Jacobian only involves the transition ψ_{2,S} → ψ_{2,S⋆}.
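The acceptance decision of such a transdimensional move can be sketched generically. In this illustration, all model-specific quantities (log posterior values, log proposal densities and the log Jacobian) are assumed to be supplied by the caller; the function is a sketch of the Metropolis-Hastings-Green accept step, not the paper's Algorithm 1:

```python
import math
import random

def rj_accept(log_post_new, log_post_old,
              log_q_old_given_new, log_q_new_given_old,
              log_jacobian, rng):
    """Metropolis-Hastings-Green acceptance for a transdimensional
    move: accept with probability min(1, posterior ratio x proposal
    ratio x |Jacobian|), computed on the log scale for stability."""
    log_alpha = (log_post_new - log_post_old
                 + log_q_old_given_new - log_q_new_given_old
                 + log_jacobian)
    return rng.random() < math.exp(min(0.0, log_alpha))

rng = random.Random(0)
accepted_up = rj_accept(0.0, -50.0, 0.0, 0.0, 0.0, rng)    # much better state
accepted_down = rj_accept(-50.0, 0.0, 0.0, 0.0, 0.0, rng)  # much worse state
```

A move to a far more probable state is accepted with probability one, while a move to a far less probable state is almost surely rejected.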
As stated by, e.g., Karagiannis and Andrieu (2013), practical efficient implementation of transdimensional sampling algorithms is challenging. Preliminary experiments with the ordinary RJMCMC algorithm demonstrated, however, that changes in (V_S, λ) were problematic, while moves in (S, D_S, Z_S), as well as in the remaining hyperparameters, worked reasonably well. We have therefore added annealing/tempering steps that allow large changes in the (V_S, λ) space while keeping the remaining parameters/variables constant; see supporting materials A.3 for details.

| Storage and extraction of results
The reversible jump MCMC algorithm gives simulations {S_q, D_q, Z_q, V_q, θ_q} for q = 1, …, Q, where Q is the number of iterations of the algorithm. Summarising global variables not subjected to the change of dimension, such as S or the elements of θ, is straightforward based on the given simulated values. Summarising the segment-dependent variables is more difficult since these are continuous-time processes.
In principle, we would like to present estimates of quantities such as Pr(V(τ) = v|Data) for all time points τ, but for the sake of simplicity these quantities will only be given for a selected number of time points, which we call hotspots. Simple discretisation of the observation interval is problematic since messages are recorded at very variable frequencies. Our strategy is therefore to define some data-driven hotspot locations corresponding to time points that with high probability belong to intervals containing no breakpoints (so that these time points mostly have a common segment membership across the simulations). State probabilities are reported at these hotspot locations. The details are given in the supporting materials B.
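Estimating Pr(V(τ) = v | Data) at a hotspot from the stored draws reduces to locating τ's segment in each draw and counting state frequencies. A minimal sketch (the storage format, a pair of sorted breakpoints and per-segment states per draw, is an illustrative assumption):

```python
from bisect import bisect_right
from collections import Counter

def state_at(tau, breakpoints, states):
    """Segment membership of time tau: breakpoints are the sorted
    interior segment end points B_1 < B_2 < ..., and states holds the
    per-segment V values (one more entry than breakpoints)."""
    return states[bisect_right(breakpoints, tau)]

def hotspot_probabilities(tau, samples):
    """Estimate Pr(V(tau) = v | Data) by the relative frequency of v
    across the stored (breakpoints, states) MCMC draws."""
    counts = Counter(state_at(tau, b, v) for b, v in samples)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

draws = [([1.0, 3.0], [1, 2, 1]),
         ([1.2, 2.8], [1, 2, 1]),
         ([0.5, 3.5], [1, 1, 1])]
probs = hotspot_probabilities(2.0, draws)
```

With these toy draws, the hotspot τ = 2.0 falls in a state-2 segment in two of three draws.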

| EXPERIMENTS ON SIMULATED DATA
In this section, we demonstrate the performance of the methodology through simulation experiments. In order to make the simulation scenarios realistic, data were simulated from a model with parameters estimated from the data set considered in Section 6. For these experiments, evaluation focused on reconstruction of both segment placement and cluster allocation. Convergence diagnostics will be discussed within the operational data analysis in the next section.
Two different scenarios were simulated, in which different amounts of data were available to the estimation process. In the first case, we considered a time window of approximately 18 weeks (3000 h). We generated 10 data sets with sizes ranging from 3350 to 7970 observations, with numbers of segments varying between 83 and 240. For the second scenario, the time window was enlarged to 36 weeks (≈6000 h), resulting in 10 larger data sets (5800-11,870 observations) and consequently more segments (173-378). Both scenarios shared the same set of parameters (see Table 2); the frequency state Z is assumed to have two modalities (in addition to the inactive one), while for V the number of states is 8, with each state's message distribution defined over the 110 possible message types.
The probability of observing an inactive state after an active state was assumed to be p_0 = 0.93, while the probability vector for the activity states was obtained by smoothing the corresponding estimated vector from the analysis of the recorded operational data. The λ_v parameters are shown in Table 6.
On each data set, the performance of the method was evaluated in terms of (i) the positions of the estimated breakpoints, (ii) the estimated composition of the activity states and (iii) the estimated state allocations along the timeline. The processes involved are in continuous time, so in order to obtain workable evaluation measures, some time discretisation is needed. To evaluate the prediction of breakpoints, we consider all intervals between uniquely observed timestamps, [t_{h−1}, t_h), h = 1, …, H (some messages are recorded with equal timestamps). Within each such interval, the maximum number of breakpoints that can occur is two (since we do not allow two consecutive segments to both be inactive). Defining I_h^true ∈ {0, 1, 2} to be the true number of breakpoints in interval [t_{h−1}, t_h) and I_h its corresponding random variable within the model, we define the loss

$$F_B = \frac{1}{H} \sum_{h=1}^{H} \Pr\left(I_h \neq I_h^{\mathrm{true}} \mid \text{Data}\right).$$

This loss function represents the average probability of misplacement. To evaluate how well the model predicts the activity states along time we define

$$F_V = \frac{1}{n} \sum_{i=1}^{n} \Pr\left(V(t_i) \neq V^{\mathrm{true}}(t_i) \mid \text{Data}\right),$$

where, as before, t_1, …, t_n are the observed time stamps. Since {V(t)} is obtained through a clustering procedure, a relabelling using the true state descriptions as pivots (see Section 3.4) is performed before F_V is calculated.
Finally, for assessing the model performance in terms of the composition of the activity states, we used a symmetric divergence measure based on the Shannon entropy (Lin, 1991) between two discrete distributions {p_x}, {q_x} defined on a common finite space 𝒳:

$$\mathrm{ShE}(p, q) = H\!\left(\tfrac{1}{2}(p + q)\right) - \tfrac{1}{2}\left(H(p) + H(q)\right),$$

where H denotes the Shannon entropy. The smoothed activity state probability vector used in the simulations was π = (0.202, 0.130, 0.101, 0.129, 0.079, 0.145, 0.073, 0.138).
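Assuming the symmetric Shannon-entropy measure of Lin (1991) takes the Jensen-Shannon form above, it can be computed in a few lines (base-2 logarithms, so the value is bounded by 1):

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def she(p, q):
    """Symmetric Shannon-entropy divergence:
    ShE(p, q) = H((p + q) / 2) - (H(p) + H(q)) / 2,
    which is zero iff p == q and at most one (in bits)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2

identical = she([0.5, 0.5], [0.5, 0.5])   # 0.0
disjoint = she([1.0, 0.0], [0.0, 1.0])    # maximal, 1.0
```

Values close to 0, as reported in Table 5, hence indicate near-identical true and estimated state compositions.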

| Results
The values of F_B and F_V are shown in Table 3. In all cases, the loss F_B did not exceed 0.006, with little difference between the two scenarios. This indicates that both the number of breakpoints and their positions are inferred with high accuracy. Concerning F_V, Table 3 shows that for both scenarios the model is able to infer the correct state to a large extent. For scenario 1, the average misplacement error across the 10 simulated data sets is 10%, while for the data-richer scenario the error decreases to 2.5%. In order to investigate the misplacement, confusion matrices for the prediction of I_h^true are provided in Table 4 for the cases with the lowest (data set 4) and highest (data set 5) values of F_B within scenario 2. The adjusted Rand indexes (Hubert & Arabie, 1985) are also included. These measures are bounded above by one, with one corresponding to full agreement between clusterings. The estimated number of breakpoints in interval [t_{h−1}, t_h), Î_h, has been chosen by selecting the value with the highest probability. For data set 4, the model has correctly estimated 278 out of 310 breakpoints (intervals where I_h^true or Î_h equals two are counted twice). In data set 5, the model performed well in finding the intervals hosting two breakpoints; however, the number of intervals with only one breakpoint was overestimated.
In Table 5, we report how well the model estimates the composition of the states by comparing the true parameter vectors λ_1^true, …, λ_8^true with the obtained estimates; the weighted average takes into account the true number of segments within each state. In all data sets, the ShE measures are close to 0, indicating that the model has been able to faithfully infer the states. Also in this case, we see an improvement in scenario 2 compared to scenario 1, indicating that a reasonable amount of data is needed in order to estimate these distributions properly.

| ANALYSIS OF DATA FROM AN OPERATING SHIP
We considered a data set consisting of 7569 messages recorded from a ship over a period of four and a half years (the actual dates have been removed to anonymise the data). The number of unique message types observed is m_e = 110; after some preliminary experiments, we assumed a model with n_z = 2 active frequency states and n_v = 8 activity states. The main results for the analysis can be split into two parts: the structural results, describing the global parameters, and the behavioural results, displaying the evolution of the system over time. Before analysing the results, we investigated the convergence of the employed MCMC.
[Table 5: ShE(λ_v^true, λ̂_v) for the simulation experiments. Here λ̂_v is the posterior mean; Sc = scenario, D = data set, A_D = average over the 10 data sets, A_V = average over states, WA_V = weighted average over states with weights proportional to the (true) number of segments within states.]

| MCMC diagnostic
In order to assess the convergence of the MCMC algorithm, 4 parallel simulations were performed, each of length 100,000, starting from different initial points derived by allocating a different number of breakpoints in different locations along the time axis. Here, one iteration contains one tempering step, five updates of hyperparameters and 100 updates of segment-specific features. The first 70,000 iterations were considered as burn-in, and hence discarded. Moreover, a thinning schedule was applied by storing only every fourth iteration. The final tempering schedule was assessed after several trials, the goal being to achieve label switching multiple times at the highest temperature (Celeux et al., 2000). A total number of r = 30 increasing temperatures were chosen, with the flattest target distribution raised to the power of 1/30. Relying on the four parallel simulations, the Gelman-Rubin statistic (Gelman & Rubin, 1992) was calculated for S, the number of segments. The choice of this variable for the analysis is motivated by the fact that while it provides information about the overall status of the MCMC, it is not affected by label switching. The estimated Gelman-Rubin statistic ended up at about 1.1, see also Figure 4. For other variables/parameters, similar statistics are more difficult to estimate due to the relabelling aspects. However, we have inspected the contents of the states, which were very stable across runs, see supporting materials C. Figure 5 shows the score of the first principal component for each of the original posterior means of the λ_v vectors (raw MCMC output) and after each relabelling method has been employed (ECR iterative 1, ECR iterative 2, Stephens and PRA) for one of the four simulations. The original MCMC output presents several switching points, indicating that the designed algorithm has good mixing properties. ECR1, ECR2 and Stephens are successful in finding the right permutation pattern, while the PRA method seems to fail.
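For reference, the classic (non-split) Gelman-Rubin statistic used here for a scalar quantity such as S can be sketched as follows; the toy chains of iid draws are illustrative, standing in for the four parallel runs:

```python
import random

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for a scalar quantity
    from parallel chains of equal length: compares between-chain and
    within-chain variance; values near 1 suggest convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5

rng = random.Random(3)
chains = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
rhat = gelman_rubin(chains)   # close to 1 for well-mixed chains
```

As in the text, the statistic is only meaningful for quantities unaffected by label switching, such as the number of segments S.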
The results reported below are based on the output provided by one single chain; the chosen relabelling algorithm is the Stephens method.

| Structural results
The posterior estimates of the probability vectors λ_1, …, λ_{n_v} describe the composition of the activity states. Table 6 reports the 10 most probable messages and their probabilities of occurrence for each state. Some states are composed of a few dominant messages while others reflect more complex structures. The three most common states, V = 1, 2 and 3, identify routine states, and are indeed characterised by standard messages. State V = 1, completely defined by only 4 messages, inu_runcmd, inu_stopcmd, exu_runcmd and exu_stopcmd, reflects the necessary steps performed related to start/stop operations of the engines. States V = 2 and V = 3 correspond to the states where the system is initialised and terminated, respectively. Note that these two states are mainly defined by five messages, describing opposite actions. More interesting insights are gathered from states 4 to 8. They tend to be observed in situations where the system is moving away from the normal settings. Based on expert domain input, state 4 is identified as the set of operations recorded during maintenance and inspection phases. States 5 and 6 indicate a malfunction of the cooling system of the ship. State 7 concerns a fault situation related to the power system. Finally, state 8 represents an abrupt shut down of the system due to a specific failure. In general, each state concentrates its mass around a few messages, demonstrating the ability of the model to recognise the different recurrent operative phases experienced by the system.

| Behavioural results
The composition of the states provides fundamental information about the structure of the system. However, considering the inferred states as a time series of known states and observing the dynamics over time provides a framework for monitoring the operations of the system, including faults and unwanted incidents. By applying the strategy described in Section 3, 570 active segments were identified in the data set; the whole sequence of segments is shown in the top plot of Figure 6. We notice a well-defined pattern involving states 1, 2 and 3; this pattern identifies normal operative conditions. In the centre plot, we observe a snippet of about 8 weeks including 55 segments containing a total of 1143 observations of 87 different types. Among the normal 1-2-3 patterns we can observe unwanted incidents identified by the light green (V = 7) circles. Note that some of these apparent faults are generated by maintenance activities. By zooming further, the bottom plot shows an unexpected shutdown situation: after turning off the system (state 3) and on again (state 2), a sequence of faults is observed (4-5-6-7), after which the system is shut down. The clustering process therefore provides a simplified framework for performing failure diagnostics; it also makes it easier to investigate the implications of a failure on the system by monitoring the states observed immediately after the failure.
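Once the segments have been mapped to states, monitoring reduces to scanning the inferred state sequence for motifs of interest, such as the fault cascade preceding the shutdown above. A minimal sketch, with a hypothetical state sequence standing in for the real inferred one:

```python
def find_motif(states, motif):
    """Indices where `motif` occurs as a contiguous run within `states`."""
    k = len(motif)
    return [i for i in range(len(states) - k + 1) if states[i:i + k] == motif]

# Hypothetical inferred sequence: normal 1-2-3 cycles, then the shutdown
# scenario described in the text (restart followed by faults 4-5-6-7).
seq = [1, 2, 3, 1, 2, 3, 3, 2, 4, 5, 6, 7, 3]
hits = find_motif(seq, [2, 4, 5, 6, 7])
print(hits)  # → [7]: segment position where the fault cascade begins
```

The same scan run on the states immediately following a flagged failure gives a cheap first view of its downstream implications.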

| SUMMARY AND DISCUSSION
In this paper, we have proposed a method for clustering observations occurring as a marked point process in time. The clustering process creates segments along the timeline that are assumed to belong uniquely to one of a set of n v states with differing structural contents. In addition, an inactive state (periods with no observations) is included. Although the number of states is fixed a priori, the composition of the states is unknown and estimated from the data. The objective of the method is to cluster the observations into subsequences while simultaneously learning the structure of the states. We have taken a Bayesian approach and developed an algorithm based on reversible jump MCMC for performing inference in an offline mode. Tempering ideas have been incorporated into the algorithm to speed up convergence.
In the current model, a simple Markovian structure has been considered, partly since the amount of data has been limited and partly because the clustering approach is mainly motivated by an explorative/learning setting. A natural extension of this work will be to transfer the algorithm to a dynamic online setting in which a more detailed description of the dynamics is built into the model. A dynamic setting will also provide the possibility for prediction of states, with the added possibility for prediction of faults/failures. A more sophisticated model for the content of states can also be considered. In this work, we have only considered the occurrences of messages, not their order within each segment. The ordering certainly provides information, but utilising this information will require more/longer data sets than are currently available. Combining data from similar but not identical types of equipment is certainly of interest, but is complicated since different units can provide messages with different content. Combining such data in an automatic way will require the use of some semantic similarity measures.
A simulation study has been conducted for evaluating the performance of the method, showing that the model is able to accurately segment the observations and predict the different states with high precision. In some cases, however, the exact positions of the breakpoints between segments are difficult to infer.
Data from an operational ship has also been analysed. This analysis demonstrates the potential of the method for enabling better understanding of the series of messages, including automated recognition of different complex fault scenarios.