Feature spaces and a learning system for structural-temporal data, and their application on a use case of real-time communication network validation data

The service quality and system dependability of real-time communication networks strongly depend on the analysis of monitored data to identify concrete problems and their causes. Many of these can be described by either their structural or temporal properties, or a combination of both. As current research lacks approaches that sufficiently address both properties simultaneously, we propose a new feature space specifically suited for this task, which we analyze for its theoretical properties and its practical relevance. We evaluate its classification performance on real-world data sets of structural-temporal mobile communication data, and compare it to the performance achieved by feature representations used in related work. For this purpose we propose a system which allows the automatic detection and prediction of classes of pre-defined sequence behavior, greatly reducing the costs caused by the otherwise required manual analysis. With our proposed feature spaces this system achieves a precision of more than 93% at recall values of 100%, with an up to 6.7% higher effective recall than otherwise similarly performing alternatives, notably outperforming alternative deep learning, kernel learning and ensemble learning approaches from related work. Furthermore, the supported system calibration allows separating reliable from unreliable predictions more effectively, which is highly relevant for any practical application.


Introduction
Sequences of structural and temporal data combine properties of complex symbolic sequences and multi-variate time series, in that a single sequence s of length r has the format s = [(t(e 1 ), e 1 ), . . ., (t(e r ), e r )]. Due to this combination of temporal properties (i.e. the timestamps t(e i ) of individual events) and structural properties (i.e. a semantically relevant order and context of events and their represented behavior), research is often faced with different problems, like varying sequence lengths and the lack of feature spaces that allow representing both temporal and structural properties sufficiently well. With suitable feature spaces, processes that rely on such data can be better analyzed and represented, consequently allowing the application of a wide range of learning methods on those features. This is true for all processes that create time-dependent structured data, like multi-layer protocol-based network communication, whose data can be recorded in real-time by logging systems. This allows the analysis of its structural-temporal data, utilizing its structural properties (e.g. protocol state-machine behavior) and its temporal properties (e.g. response timings) to continuously improve the quality of the respective communication service. The problem becomes even more complicated when analyzing the data of multi-directional real-time communication setups, as in video conferencing systems, in cloud infrastructures [1] or even in industrial infrastructure [2] networks. In those cases the temporal and structural properties are contained in multiple interacting event sequences, and the temporal properties are of vital relevance for the service quality of the system.
To be able to apply machine learning methods to solve the different types of problems occurring on such data (e.g. classification or prediction), specific feature spaces are required that properly represent those structural and temporal data properties and allow projecting sequences of arbitrary length onto feature vectors of homogeneous length. To achieve this, we analyze the structural and temporal properties of such highly time-dependent multi-client sequence data, and propose a new combined feature space which integrates structural and temporal properties in a novel way and allows for an equivalent length of the projected sequences, simplifying the subsequent application of various learning methods. We also show with an extensive competitive evaluation and statistical analysis how this feature space succeeds in this task.
As a practical use case for structural-temporal data we focus on bi-directional multi-client real-time mobile communication data, used to solve the specific problem of detecting and predicting known sequence classes on data of both known and unknown sequence classes. On this data all of the previously mentioned properties are relevant, i.e. the order and context of the events are class-specific and relevant, as are both the individual and the interacting sequences of both clients, as well as the temporal properties represented in the contained timestamps and their sub-sequences. For this use case we introduce and evaluate a system for the automatic classification of failures of communication sequences between mobile clients. This system allows supporting or replacing the expensive manual classification usually handled by domain experts. We conduct a detailed analysis of the practical implications and requirements, especially on how to calibrate the system for a high precision while achieving a reasonable effective recall. This property is one of the main motivations of this manuscript and highly relevant in practice, as only highly reliable predictions allow rigorous consecutive decisions. On this system we comparatively evaluate the classification performance when using the proposed feature spaces against baseline, as well as competitive deep learning, kernel learning and ensemble methods from the fields of process mining and sequence classification, allowing us to draw conclusions on their general suitability and their individual advantages and disadvantages.
While we focus our analyses on the specific area of mobile communication, the proposed system for the detection and prediction of pre-defined classes of sequential data, as well as the proposed feature spaces, can be applied in all areas working with structural-temporal data, specifically for problems that require the incorporation of real-time or multi-client system properties. This enables more precise classifiers and reduces the amount of required manual analysis, whose high cost would otherwise prevent scaling such a classification system to large data sets.
Summarizing the primary objectives, we aim to go beyond existing approaches of process mining and sequence learning by proposing a combination of structural and temporal features and their integration into a system for multi-class detection and prediction. As such the contributions of this paper are the following:
1. We propose a feature space for structural sequential data, which combines the use of both structural and temporal properties and integrates contextual positional variance in a novel way.
2. We show the advantageous classification performance of the proposed feature space against feature spaces and learning methods commonly used in comparable process mining and sequence classification publications, including a baseline combined feature space.
3. We further strengthen these analysis results with a significance analysis of the classification performances.
4. We propose a combined detection and prediction system for multi-class failure classification of structural-temporal data of mobile communication, with an additional focus on the calibration of the results' reliability, which is highly relevant in practical applications.

This paper is structured as follows: This Introduction is followed by a discussion of the Related Work. Then the use case and the utilized datasets are discussed in the Section Use Case Description, followed by the introduction of the relevant Structural and Temporal Feature Spaces. Afterwards the System Layout of the proposed sequence class detection and prediction is described, including the utilized learning methods. This is followed by the Evaluation and a statistical analysis of the classification system and the introduced feature spaces, before finally reaching the Conclusion.

Related work
We are interested in analyzing and evaluating feature spaces representing the unique properties of sequences of structural, temporal data, whose inter-event structural properties have a semantically relevant relation to each other. We do this with a focus on mobile communication data, as an example of real-time protocol-based network communication data, where both the raw data logs and logs of additional dynamic analysis allow representing the contained structural dependencies. Further details on mobile communication protocol behavior can be found in [3]. While similar analyses have been done for network communication, e.g. in [4][5][6][7][8][9][10][11], most research focuses on the structural data properties, and less on the temporal properties relevant for the real-time execution of such processes.
We use different types of features to represent the structural as well as the temporal data properties within the feature spaces. Specifically, we rely on token n-grams, i.e. sequences of n arbitrary tokens. This feature representation is similar to spectrum kernels [12] and originates in the field of natural language processing [13][14][15][16], but has also been extended to network communication [17][18][19][20][21]. However, we extend this structural feature type by including additional temporal information, and by also integrating a wider context for each token n-gram, an idea similar to the integration of additional context information as introduced in [22] for the CBOW (continuous bag of words) and Skip-gram models. This makes our research unique but also highly relevant for commercial processes, where such approaches are still highly sought after, e.g. in the form of process mining [23]. In this field different objectives are solved on business analytics data, which often comes in a format similar to ours, i.e. consisting of events and timestamps. However, research in this field has so far not addressed multi-class classification problems, focusing primarily on very narrow objectives, which do not require a combination of those different properties that we are interested in. For example, [24] uses Markov Classifiers to predict the time remaining to completion of a business case. This objective is also addressed by the system proposed in [25], which utilizes Naive Bayes Classification and Support Vector Regression approaches, and in [26], which uses Long Short-Term Memory neural networks (LSTMs). The verification of linear temporal logic (LTL) compliance is approached in [27] using Decision Trees, which is similar to the research of [28], which uses Multi Layer Perceptrons (MLPs) for detecting service level agreement violations.
Another large research topic is the prediction of the next events during runtime, for which graph theory [25], LSTMs [29], Decision Trees [30][31][32], Markov Classifiers [31][32][33] or Multi Layer Perceptrons [34] are used.
With the works of [35, 36] there has recently been research in the process learning domain which allows learning on complex symbolic sequences by combining several of their contained properties. As their proposed feature representations and general systems differ from our envisioned approach in crucial ways (e.g. by not numerically encoding the temporal information, or by not covering multi-class classification problems), a direct application of and comparison with these systems is out of the scope of this publication. In [36] the authors use LSTMs to simultaneously allow the runtime prediction of the next events and the prediction of the remaining time to case completion. For their feature representation they combine different temporal properties of their data (relative timestamp, time within the day and within the week). While the concrete time features would be irrelevant or even misleading in our use case, the relative timestamps are very similar to our t e.rel . In [35] the authors use Hidden Markov Models and Decision Trees for predicting the achievement of a performance objective, as well as the fulfillment of a compliance rule. While being more complex, the features used in this work are only structural-sequential, i.e. the temporal component is not included in a quantitative form, thus missing an important requirement for our use case. Besides those differences, neither approach allows for a multi-class prediction based on manually assigned class labels, as required in our use case. The highly dynamic, often non-deterministic behavior of the mobile network communication log sequences analyzed here, as well as their large number of states and transitions, further hinders a direct application of these approaches. Those reasons also prevent the direct application of methods proposed in recent research on deep neural networks and their application to graph data [37][38][39], which opened novel ways of learning on complex data, e.g. 
for semi-supervised learning approaches or implicit feature spaces. Additionally, practical necessities like system interpretability and reliability are harder to fulfill with these more complex learning methods-as are the higher requirements on the data set sizes, necessary to achieve converging models of the trained neural networks.
Since we are operating on sequential data, one could also use methods originally from the field of sequence labeling, where the task is to predict a label for each event within the sequence, a task solved using methods like Hidden Markov Models [40], Conditional Random Fields [41], MLPs [42], and more recently Recurrent Neural Networks, specifically LSTM neural networks [43][44][45]. Since each event in our objective data is already labeled, though, we are not interested in predicting such individual labels, but instead require predictions based on the behavior represented by the complete sequences. Using predictions of such individual labels to describe a complete sequence label is also not an option, as a defined sequence label can depend on structural-temporal properties not represented in event-wise predictions. Since the event order and the inter-event durations are semantically relevant, using methods of sequence alignment [46] is also not directly possible. Methods like dynamic time warping [47] are likewise not directly applicable, as they are not designed to process temporal data with additional structural semantics.

Data set properties
We use mobile communication data of a specific format as a concrete example to discuss and show the properties of sequential structural and temporal data. It was recorded to provide a wide-ranging quality analysis of the underlying network infrastructure. The data is collected by a fleet of specialized cars equipped with roof-mounted antennas and multiple Android smartphones. The mobile phones run test sequences, which consist of automatically calling a phone within one of the other cars, establishing a connection, playing a voice chat of 1 minute and finally closing the connection. This whole process is monitored, recorded and subsequently analyzed by a system which focuses on various key performance indicators (KPIs) and statistics, radio frequency (RF) values and the successful completion of the key processing steps of the respective protocol state machines of UMTS, LTE and GSM. Since the cars are moving while all of this takes place, the recorded data also contains switches between different transmission technologies (e.g. a sequence may start in GSM, then switch to LTE, and finish in UMTS). The resulting data already allows for drawing simple conclusions, e.g. on whether communication sequences have been successful at all (i.e. the relevant KPI and RF values did not show negative deviations, and all relevant protocol states have been completed successfully), whether they dropped in the middle (i.e. important protocol states have not been completed), or whether they have failed for other reasons.
Those failed sequences are then manually analyzed to determine the reason for their failure and potentially even its cause. This process is called failure classification, as it assigns a specific failure class to a failed sequence. Doing this manually by looking at many hundred log file entries is a tedious, time-consuming and expensive task. By using statistics over the KPIs, RF values and the protocol states for rule-based approaches, this can be partially automated, specifically for failures with simpler behavioral patterns. A more versatile machine learning approach could, however, help solve this problem better, potentially covering less clearly defined failure classes, while also adding some flexibility when trying to find new failure classes caused by changes in the communication technology backbone architecture, e.g. with the upcoming 5G [61] technology.
For our analyses we collected two different data sets: the MFC data set, containing manually labeled and unlabeled failure class samples, and the AFC data set, containing failure class samples which are automatically labeled by a rule-based approach, as well as unlabeled samples. Table 1 shows some of their most important characteristics. Each contained sample represents a call sequence, which utilizes at least the GSM, UMTS or LTE protocol, following its respective specification and call phases, which is reflected in the logged events and the respective timestamps. Instead of using all the recorded events, though, we are mainly interested in those that are relevant for the potential failure classes. Hence we use filters based on rules that have been defined by experts with deep domain knowledge, removing all events that contain redundant or irrelevant information. As a result we obtain a final set of base events, which together with their different states (e.g. different reasons for a location update reject), and in combination with the respectively utilized protocol, leads to 335 different event identifiers overall.
Both data sets will serve different purposes in this paper, because the details of the different labeling approaches used for both data sets are expected to impact the AFC evaluation performance negatively. The filtered events available in both data sets are selected to cover all important call phases and event sequences potentially relevant for a proper class prediction. As such they are theoretically sufficient to properly reproduce the MFC labels. However, they are insufficient to achieve a similar performance on the AFC data, where data of additional events, as well as relevant KPI and RF data, has also been used for defining the classes, none of which we have access to in our structural-temporal data. As a result of these restrictions, Table 1 shows that the average number of events per sequence is much smaller for AFC data, and its variance is much larger, reflecting a larger class variance and making a proper discrimination harder. Additionally, the AFC data contains 8 very similar variations of the Core Network Failure class, which are harder to discriminate as well. Due to these shortcomings of the AFC data set, the smaller MFC data set is more relevant for our purpose, as its labels provide a better ground truth. We therefore restrict our analyses of the AFC data to those expected to provide valuable insights on this data set only, namely how well the proposed feature spaces and learning methods can still reproduce the AFC labels under these conditions, and, given its larger number of samples, how much of a performance improvement we can expect when increasing the training data size.
We will now discuss the format of the contained sequences. Each sample in our data sets consists of the bi-directional communication sequence between a caller and a callee. We denote the caller as the MOC client (mobile originated call) and the callee as the MTC client (mobile terminated call). The samples contain highly structured sequential data (order of events) with highly relevant temporal components (inter-event durations), all of which are semantically relevant for discriminating the failure classes, e.g. reflecting call phases being incomplete or too long, or reflecting an anomalous order of events. Table 2 shows an exemplary event sequence of a successful call, starting in LTE and proceeding and ending in UMTS. This example also introduces the new timestamp format t s.rel , denoting sequence-relative timestamps, defined for a sequence s as t s.rel (e i ) = t(e i ) − t(e 0 ) (or t s.rel (s) i = t(s) i − t(s) 0 ), i.e. the timestamp of the first sequence event is set to 0.0, and all other timestamps are reset relative to this value. In the example the MOC client is set up until t s.rel = 12.232, then the MTC client is set up until t s.rel = 14.231. Once the clients are connected at t s.rel = 30.103 the call takes place, before they are finally disconnected again. The abbreviations represent the following event types: extended service (ES) request, security mode (SM) command and complete, connection management service (CMS) request, radio bearer (RB) setup and extended service (ES) request.
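The rebasing of absolute timestamps to sequence-relative timestamps t s.rel can be sketched as follows (a minimal illustration; the absolute timestamp values are hypothetical, chosen so the result matches the t s.rel values of the example above):

```python
def to_sequence_relative(timestamps):
    """Rebase absolute timestamps so the first event starts at 0.0,
    i.e. t_s.rel(e_i) = t(e_i) - t(e_0)."""
    if not timestamps:
        return []
    t0 = timestamps[0]
    return [round(t - t0, 3) for t in timestamps]

# Hypothetical absolute timestamps of the first three events of a call:
absolute = [1001.500, 1006.692, 1013.732]
print(to_sequence_relative(absolute))  # [0.0, 5.192, 12.232]
```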

Failure classes
Before analyzing how the concrete sequence properties are utilized in the feature spaces, we first need to discuss the existing failure classes and their structural and temporal properties in more detail, to provide a better practical background of the problem domain. Table 1 contains an overview and provides relevant class statistics for the selected, sufficiently sized failure subclasses of the listed main failure classes. It also contains details about the samples of insufficiently sized failure classes, grouped together in the set of Other Failures, as well as the additional number of unlabeled samples per data set.
CSFB problems. A circuit switch fallback is conducted when the current network (e.g. LTE) does not sufficiently fulfill the current connection requirements (e.g. signal strength, cell coverage, sufficient response times) or suffers from other problems, while at the same time an older network (e.g. GSM) is available. It can also occur when one of the communicating clients is not LTE capable. Handing over the correct connection state to another protocol can be problematic, though. Our data set contains cases of failures that occurred when the call setup was not properly continued after the location area update (LAU) and the routing area update (RAU), when the current network did not allow a proper release for redirection, or when the redirection to the older network simply took too long.
Congestion problems. These problems can occur when the network is overloaded, such that problems with the connection or response timings occur. In our data sets we have sample sequences where the connection downlink disconnected too early, leading to an interrupted connection, and other cases where no circuit channel was available, completely preventing the establishment of a connection.
Core network problems. This failure class represents more general problems, whose causes might be similar to those of the previous failure classes, e.g. an unexpected downlink disconnect or problems with LAU and RAU. The previous classes, however, contain additional semantic properties that lead to the classification as CSFB (fallback to older technology) or congestion problem (high latency, low bandwidth), which are missing here. The failure sub classes dominant in our data sets contain cases of unexpected downlink disconnects, unreachable MTC clients, or no or too slow replies to LAU or RAU. The high similarity to other classes, as well as the fact that only minor differences exist between its individual sub classes, makes discriminating them harder, which is specifically relevant for the AFC data, with its higher number of samples of these sub classes.
E2E problems. End-to-end problems occur beyond the scope of the core network. In our data set two of its sub classes are prominently represented. Unexpected downlink (DL) radio resource control (RRC) connection releases are symptoms of problems during the downlink authentication phase of the connection, causing it to fail. This also holds for missing downlink setup failures, which already fail at an even earlier state.
UE problems. Besides those network and protocol related failures, problems can also occur on the devices themselves. The MFC data set contains samples of potential firmware issues, leading to such problems.
Other failures. Sequences of failure classes that contain only very few samples are not usable for a proper evaluation. Those samples are all re-labeled as Other Failures, allowing their use in the model class detection step of our system.
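This re-labeling of insufficiently sized classes can be sketched as follows (a minimal sketch; the minimum sample count and the class names are hypothetical choices for illustration, not values from our system):

```python
from collections import Counter

def relabel_rare_classes(labels, min_samples=10):
    """Re-label all classes with fewer than min_samples samples
    as 'Other Failures' (threshold is a hypothetical choice)."""
    counts = Counter(labels)
    return [l if counts[l] >= min_samples else "Other Failures"
            for l in labels]

labels = ["CSFB"] * 12 + ["Congestion"] * 11 + ["UE firmware"] * 2
print(dict(Counter(relabel_rare_classes(labels))))
# {'CSFB': 12, 'Congestion': 11, 'Other Failures': 2}
```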

Structural and temporal feature spaces
One of the objectives of this paper is to analyze feature spaces capable of representing sequential data with structural and temporal properties, like the data detailed in the previous section, and to propose a feature space suited to better represent those properties. To achieve this, we discuss five different feature spaces Γ qT , Γ T , Γ S , Γ S+T and Γ ST . Γ qT is based on a sequential representation of the data, as commonly used in process mining [23]. Since we are specifically interested in additionally integrating temporal information in our feature spaces, Γ qT optionally allows the inclusion of quantized temporal information. Γ T focuses on the temporal information in a re-ordered sequential representation, while Γ S focuses on the structural information in a non-sequential representation, essentially using a token n-gram approach, as described in Section Related Work. Finally we propose Γ S+T and Γ ST to show the advantages of integrating structural and temporal information complementarily into a single feature space, which is expected to allow a better data representation compared to using only structural or temporal features alone. We discuss those feature spaces in the abstract, but also discuss unique properties that are specifically relevant for our use case. As such some of those discussions are exemplary adaptations of the abstract feature space properties to our concrete use case. However, this should not be seen as a restriction on the general applicability of these feature spaces, as they can be adapted to other use cases as well.

Base processing
All of the following feature spaces require an identical base processing, for which we need a second timestamp format, enabling the consideration of relative time-dependencies (i.e. local delays): the event-relative timestamps t e.rel , defined as t e.rel (e i ) = t s.rel (e i ) − t s.rel (e i−1 ) (or t e.rel (s) i = t s.rel (s) i − t s.rel (s) i−1 ) for a given sequence s.
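The event-relative timestamps can be derived from the sequence-relative ones as sketched below (assuming, as a convention for illustration, a delay of 0.0 for the first event, which has no predecessor):

```python
def to_event_relative(t_s_rel):
    """Inter-event durations: t_e.rel(e_i) = t_s.rel(e_i) - t_s.rel(e_{i-1}).
    The first event has no predecessor; we assume a delay of 0.0 for it."""
    return [0.0] + [round(b - a, 3) for a, b in zip(t_s_rel, t_s_rel[1:])]

print(to_event_relative([0.0, 5.192, 12.232]))  # [0.0, 5.192, 7.04]
```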
One objective for our proposed feature spaces is that they represent multi-client behavior within the sequential data. This is relevant for our use case, because some failures can be caused by erroneous behavior on the MOC side of the call, while others are caused by problems on the MTC side, or even by problems on both sides, individually or combined. To reflect those structural properties, we create the sub-sequences s MOC and s MTC , containing exclusively the events of MOC and MTC respectively, with |s 2C | = |s MOC | + |s MTC |. These representations allow the feature spaces to omit events of the respective opposite client. Table 2 shows this behavior exemplarily for the MOC events at t s.rel = 5.192 and t s.rel = 12.232. They are interrupted by three MTC events in the s 2C sequence, but are represented consecutively in the s MOC sequence. As a result, a sequence s can be described by a triple of sequences s 2C , s MOC and s MTC . If not stated otherwise, we use s synonymously with s 2C to denote the complete 2-client communication. As such the example in Table 2 effectively illustrates an s 2C sequence. This definition is extended to the whole set of sequences. Where previously we denoted all sequences s in a set S as s ∈ S, we can now also denote the sets of sequences of each different representation, i.e. s 2C ∈ S 2C , s MOC ∈ S MOC and s MTC ∈ S MTC . This also allows extending the definition of t s.rel to those representations, in that t s.rel (s 2C ) denotes the vector of the sequence-relative timestamps of s 2C , and t s.rel (s MOC ) and t s.rel (s MTC ) denote those of the client-wise sequence representations. The same holds for t e.rel . Note that t s.rel is set to 0.0 for the first event of each sequence representation respectively (i.e. t s.rel (s) i = t(s) i − t(s) 0 for s ∈ {s MOC , s MTC }), to allow for a better comparability of sequences of the same client type.
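The construction of the client-wise sub-sequences, including the per-client rebasing of t s.rel described above, can be sketched as follows (event names and timestamp values are hypothetical):

```python
def split_by_client(s_2c):
    """Split an interleaved two-client sequence into s_MOC and s_MTC.
    Each event is a (client, event_id, t_s_rel) triple; timestamps are
    rebased per client so each sub-sequence starts at 0.0."""
    def extract(client):
        events = [(e, t) for c, e, t in s_2c if c == client]
        if not events:
            return []
        t0 = events[0][1]
        return [(e, round(t - t0, 3)) for e, t in events]
    return extract("MOC"), extract("MTC")

# Hypothetical interleaved sequence of (client, event, t_s.rel) triples:
s_2c = [("MOC", "ES_REQ", 0.0), ("MTC", "PAGE", 1.1),
        ("MOC", "SM_CMD", 5.192), ("MTC", "RB_SETUP", 14.231)]
s_moc, s_mtc = split_by_client(s_2c)
print(s_moc)  # [('ES_REQ', 0.0), ('SM_CMD', 5.192)]
print(s_mtc)  # [('PAGE', 0.0), ('RB_SETUP', 13.131)]
```

Note that |s 2C | = |s MOC | + |s MTC | holds by construction, since every event belongs to exactly one client.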
Using an exemplary set of event identifiers E = {'A', 'B', 'C', 'D'} allows creating two artificial example sequences x 1 and x 2 and their timestamps in the two formats, as shown in Table 3. These will be used in the next sections to illustrate various aspects of the different feature spaces.

Γ qT features
Γ qT features are based on the sequences s ∈ S 2C , essentially representing the most common type of data representation used in the related work of process mining. We additionally extend this representation by quantized temporal features. The idea is to get a sequential representation of the event identifiers, which is specifically suited for classifiers used in process mining, but to additionally include temporal information. To achieve this, we start with the event indices in s. Consecutive events with a temporal distance closer than a predefined minimum interval θ mi maintain their current position in the event index sequence. If the distance between consecutive events exceeds θ mi , an additional empty event e ∅ is inserted, reflecting the larger interval between those consecutive events. This is repeated until the next event is reached. As a result, smaller values of θ mi introduce more events e ∅ and lead to larger, sparser feature vectors, while larger values of θ mi introduce fewer empty events, thereby decreasing the feature vector length while increasing the density, until a completely dense feature vector is achieved, containing no empty events e ∅ at all. Through manual analysis of the temporal properties of the analyzed data, a value of θ mi = 5.0s is selected as a compromise capable of filling large temporal gaps occurring in the data (which often represent overly long durations between two protocol states) while not increasing the overall feature vector length too much in less relevant regions of the sequence. For θ mi = 5.0s, the resulting Γ qT feature vector for sequence x 1 in Table 3 is thus . Choosing θ mi > 60.0s eliminates all occurrences of e ∅ , yielding a representation identical to the original sequence of events, without the additional temporal information provided by the inserted e ∅ .
When using Γ qT in this way, we denote it as Γ qT ′ , which allows highlighting the performance differences when using both types of feature representations with competing classifiers of the process mining domain.
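The insertion of empty events controlled by θ mi can be sketched as follows (a simplified interpretation, inserting one empty event per elapsed θ mi interval; the padding token name 'e0' is arbitrary):

```python
def quantized_temporal_sequence(events, t_e_rel, theta_mi=5.0):
    """Sketch of the Gamma_qT idea: insert empty events whenever the
    gap to the previous event exceeds theta_mi, one per elapsed
    theta_mi interval."""
    out = []
    for event, gap in zip(events, t_e_rel):
        out.extend(["e0"] * int(gap // theta_mi))  # empty events for the gap
        out.append(event)
    return out

events = ["A", "B", "C"]
gaps = [0.0, 12.3, 2.0]  # event-relative timestamps t_e.rel
print(quantized_temporal_sequence(events, gaps))
# ['A', 'e0', 'e0', 'B', 'C']
```

With a sufficiently large θ mi (exceeding every gap in the data), no empty events are inserted and the original event sequence is recovered, mirroring the θ mi > 60.0s case discussed above.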

Γ T features
To create the set of temporal features Γ T , we use the set S 2C . The idea of Γ T is to create a feature space which projects properties of the ith occurrence of each event type in a sequence s onto the same dimension. The projected property is the timestamp t s.rel of the respective event, allowing to compare it with the timestamps of the ith occurrence of the same event type in other sequences. We start by calculating the occurrence frequency f(e, s) for each event type e ∈ E in each sequence s ∈ S 2C . Then we define its maximum value as m e = max(f(e, s)), ∀e ∈ E, ∀s ∈ S 2C , i.e. the most frequent occurrence of event type e in any sequence. Furthermore a function κ(e, s, i) is required, returning the ith occurrence of e in s. For example, the simple sequence x 1 in Table 3 has two occurrences of the event type 'A'. Applying these definitions to the two simple sequences {x 1 , x 2 } of Table 3 yields the final feature vectors for s ∈ {x 1 , x 2 }. To obtain an optional binary representation of Γ T , values of τ(s, e, i) larger than a defined threshold can be set to 1, and to 0 otherwise.
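A minimal sketch of the Γ T construction follows (the choice of 0.0 as a default value for missing ith occurrences, and the toy sequences, are assumptions made for illustration):

```python
from collections import Counter, defaultdict

def gamma_t(sequences):
    """Sketch of the Gamma_T projection: dimension (e, i) holds the
    t_s.rel timestamp of the ith occurrence of event type e; missing
    occurrences default to 0.0 (an assumption for illustration)."""
    # m_e: maximum occurrence count of event type e over all sequences
    m = Counter()
    for seq in sequences:
        for e, c in Counter(e for e, _ in seq).items():
            m[e] = max(m[e], c)
    dims = [(e, i) for e in sorted(m) for i in range(m[e])]
    vectors = []
    for seq in sequences:
        occ = defaultdict(list)          # occ[e][i] ~ kappa(e, s, i)
        for e, t in seq:
            occ[e].append(t)
        vectors.append([occ[e][i] if i < len(occ[e]) else 0.0
                        for e, i in dims])
    return dims, vectors

x1 = [("A", 0.0), ("B", 1.5), ("A", 3.0)]   # (event, t_s.rel) pairs
x2 = [("B", 0.0), ("C", 2.0)]
dims, vecs = gamma_t([x1, x2])
print(dims)  # [('A', 0), ('A', 1), ('B', 0), ('C', 0)]
print(vecs)  # [[0.0, 3.0, 1.5, 0.0], [0.0, 0.0, 0.0, 2.0]]
```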

Γ S features
The structural Γ S features are event n-gram features, similar to the previously used token n-gram features, and as such represent a feature representation commonly used in the related work on sequence classification. We therefore extract all event n-grams for all s 2C , s MOC and s MTC and index their sorted list, spanning the final feature space Γ S , including the client-specific n-grams of s MOC and s MTC . We denote an event n-gram of a sequence feature vector s via its vector indices in interval notation (similar to the one used in Matlab or numpy), s.t. s [i,i+n) denotes the event n-gram from position i (inclusive) to position i + n (exclusive). The Γ S feature vector of a sequence sample s is then defined via the binary occurrence of the respective event n-grams within s, as illustrated with the examples x 1 and x 2 from Table 3.
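The binary event n-gram projection can be sketched as follows (illustrative names; the feature space is spanned by the sorted, indexed list of all n-grams observed in the data):

```python
def event_ngrams(seq, n):
    # s[i, i+n): event n-gram from position i (inclusive) to i+n (exclusive)
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def gamma_s(sequences, n=2):
    """Binary occurrence vectors over the sorted list of all event n-grams."""
    vocab = sorted(set().union(*(event_ngrams(s, n) for s in sequences)))
    return vocab, [[1 if g in event_ngrams(s, n) else 0 for g in vocab]
                   for s in sequences]
```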

Γ S+T features
One base hypothesis of this paper is that the classification performance can be increased by using a complementary structural and temporal feature space. For the structural-temporal Γ S+T feature space we treat the feature vectors of Γ S and Γ T as equivalent. Because of its binary format, Γ S already produces qualitative feature vectors, whereas Γ T produces quantitative ones. If we binarize the Γ T values, we obtain a qualitative representation, which we can simply concatenate with the Γ S feature vector. This serves as a baseline complementary feature space, before we define the more complex complementary feature space Γ ST .
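The construction reduces to a binarization followed by a concatenation (a minimal sketch; the threshold parameter is an illustrative assumption):

```python
def gamma_s_plus_t(gs_vec, gt_vec, threshold=0.0):
    """Concatenate the binary Gamma_S vector with a binarized Gamma_T
    vector: Gamma_T values above the threshold become 1, others 0."""
    return gs_vec + [1 if v > threshold else 0 for v in gt_vec]
```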

Γ ST features
For the structural-temporal Γ ST feature space we will first need an analysis of the representative capabilities we specifically want to achieve with this feature space. As such, we will start this section with an analysis of some feature requirements, before explaining how these requirements are met by creating the data representation via structural-temporal δ − n matching and the use of model sequences.
Context and position. Metrics and features for structured, sequential data should reflect its specific properties. A sample of such data could be described by the occurrence of single n-grams (as done in Γ S ). But this description can be improved when these n-grams are also analyzed in terms of their broader context and position. Two similarly positioned n-grams might be identical, but their respective neighboring events (their context) might differ, which should prevent or penalize a match between them. This is highly relevant in data which is created by protocol-driven processes, like mobile communication data, which follows specific protocol states (e.g. for the radio bearer setup or the security parameter negotiations), all requiring specific events in their context. Thus it is important to compare contextualized n-grams with each other, i.e. events at the call setup should not be compared with those in the final call phases.
Model sequences. For the definition of Γ ST the concept of model sequences needs to be introduced. Projecting each of the s ∈ S onto a feature space spanned by these model sequences yields projected samples of the same length (independent of the length of the projected sequence s), while at the same time incorporating both temporal and structural properties. The use of model sequences is based on the idea of defining the features of a sequence s based on its similarity to each model sequence s M in the set of model sequences S M , which thus defines a feature space model. To this purpose we define the set of model sequence representations as a triple S M = (S M 2C , S M MOC , S M MTC ), just as we did for our actual sequences. By consecutively indexing the sequences and the events within S M we effectively span a feature space of size Σ s M ∈ S M |s M |. Note that the model sequences do not have to be labeled, and also do not have to be mutually exclusive to the set of training or test sequences S = {S 2C , S MOC , S MTC }, as we are not using the labels of the model sequences in any way. We rely instead on the relevance of their contained structural and temporal properties, offering insight into relevant types of behavior required for the class discrimination. However, as the feature space is spanned by the model sequences, their labels could potentially be used to increase the contained number of different features, or to balance the representation of features of more complex failure classes against those of simpler ones.
Defining the structural-temporal features. The Γ ST feature space is based on the idea of representing structural and temporal properties of the respective sequences. In this paragraph we discuss how to achieve this by using n-grams and model sequences in a structural-temporal matching procedure, with a focus on the feature properties of context and position. We define the context of each event by the size of the n-grams and the parameter of positional variance δ, and the positional properties by the actual matching procedure. The idea of this procedure is to match each n-gram of each s ∈ S with the n-grams of each s M , allowing a positional offset of at most δ. Using the exemplary sequences x 1 and x 2 of Table 3, with x 1 as sample and x 2 as model sequence, the structural δ-n matching with δ = 1 and n = 1 yields matching results for the respective values of i and j. Since we are not only interested in the structural properties of our data, we extend this matching by integrating the event-relative timestamps t e.rel as temporal properties, obtaining the final structural-temporal projection function F. The idea is to calculate the absolute differences of the t e.rel of the structurally matching events of s and s M . This is done by modifying the previously used incrementation function int(), giving rise to the final definition of F. Explaining the semantics. The objective of our projection F is to achieve features highlighting differences in structurally similar, but temporally different sequences, i.e. we aim for a way to define similar features for sequences with similar structural and temporal behavior, while achieving different feature vectors for those which are structurally different, or which are structurally similar but temporally different.
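The following sketch illustrates the matching idea only; it does not reproduce the paper's exact incrementation and projection functions. Each n-gram position i of the model sequence is compared against the sample's n-grams within a positional variance of ±δ; a structural match contributes a value derived from the absolute timestamp difference, while no match contributes 0. The small constant eps is our assumption, keeping a structurally and temporally identical match distinguishable from no match:

```python
def delta_n_match(s, t, s_m, t_m, delta=1, n=1, eps=1e-3):
    """s, s_m: event lists (sample, model); t, t_m: relative timestamps.
    Returns one value per n-gram position i of the model sequence:
    0.0 if no structural match within +-delta, otherwise eps plus the
    absolute timestamp difference of the closest matching n-gram."""
    feats = []
    for i in range(len(s_m) - n + 1):
        best = 0.0
        for j in range(-delta, delta + 1):
            k = i + j
            if 0 <= k <= len(s) - n and s[k:k + n] == s_m[i:i + n]:
                val = eps + abs(t[k] - t_m[i])
                best = val if best == 0.0 else min(best, val)
        feats.append(best)
    return feats
```

With this shape, a dimension is 0 without a structural match, small for temporally similar matches, and large for temporally deviating ones, mirroring the semantics described below for F.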
As one can see, the value of a single dimension is 0 if there is no structural match; it is very small if t e.rel (s [i,i+n) ) and t e.rel (s M [i+j,i+n+j) ) are similar, and it is large if they strongly deviate. Once the feature vector of s is calculated, it is utilized in the classifier, where this feature projection does indeed allow focusing on the desired differences. If the projections of both samples have small values for a dimension, those small values contribute to a small distance between the two samples, yielding a high similarity. If the projections of both samples have similarly large values for a dimension, these can also contribute to a small distance between the samples, but only if they are similarly large. This is only the case if both samples have a similarly large deviation from the timestamps of the model sequence, i.e. if they show a similar temporal behavior. If the projections of both samples have differing values for a dimension, these values increase the distance between both samples, emphasizing their inter-sample difference for this dimension.
Further considerations. To improve the feature balancing in the final Γ ST , the sequences for S M should be carefully selected. An S M containing an unbalanced number of samples per sequence class will lead to many, potentially redundant features for over-represented data aspects (e.g. class specifics) or irrelevant properties (i.e. noise), while other data aspects could remain nearly uncovered due to an insufficient number of s M covering them. This makes multi-class learning harder, because the over-represented features might outweigh less represented ones and bias the results toward the corresponding class. While we made sure not to use duplicate sequences in any of our data sets, we did not include such an additional sequence selection optimization. To reduce the dimensionality of Γ ST , one should also select features that are most relevant for the classification task, e.g. by removing redundant features (e.g. those dimensions that are redundant between the single-client vectors s MOC and s MTC , and the multi-client vector s 2C ), or by applying efficient feature selection methods like RDE [62].
Γ ST also allows inspecting which dimensions are of highest importance for the classification, allowing a knowledge transfer into potentially faster rule-based algorithms. Since the Γ ST components also encode the temporal positions of the relevant n-grams, they can be mapped to additionally recorded radio frequency (RF) time-series data (e.g. reception level (RXLEV), reception quality (RXQUAL), received signal code power (RSCP)) of the logged communication sequences, enabling the detection of further RF-based failure classes like coverage or interference failures.

System layout
This section will introduce the actual system for the detection and prediction of classes of sequence behavior. The system description will be kept as abstract as possible, to allow an application to other relevant use cases. Fig 1 shows the training phase of the system, in which the sets of sequences S are used to create the feature vectors, which are then used to train the required classifiers, responsible for the detection and prediction of properly represented model classes.

Model class detection and prediction
In our use case, the classes of sequence behavior are defined by the different ways communication sequences between both clients can fail. When samples of failed sequences of a new data campaign need to be classified, we could assume a hypothetical scenario in which no new failure classes are found in the new campaign, i.e. all potential failure classes have already been seen before. However, this is not true in practice, where new types of previously unseen failures do occur. This is also true in our data sets, with the consequence that only a limited number of failure classes are sufficiently large to properly evaluate supervised classification models on them. We denote such classes as model classes, or MC. Sequences of Other Failures, i.e. of insufficiently large failure classes, are denoted as non-model classes, or ¬MC.
This motivates the design of our system, consisting of two major components, which allow detecting whether a new sample is potentially an MC sample (and not a sample of ¬MC), and if that is the case, predicting the respective model class. Accordingly, those steps are called the model class detection (MCD) and the model class prediction (MCP). For the MCP a classifier is trained in a multi-class approach, learning to discriminate only samples of the MC, but not the ¬MC. For the MCD multiple classifiers are trained, one for each MC. Each of those classifiers is trained in a two-class approach, learning to discriminate the respective MC against samples of ¬MC. After training the MCP and MCD classifiers, we can predict the failure class of a new sample by predicting an MC with the MCP classifier, and then using the MCD classifier trained for this MC to confirm or reject this prediction.
These predictions are obtained by applying the respective prediction functions F MCP and F MCD to the feature vector Γ(s) of a test sample s, using one of the previously defined feature spaces.

Combined classification system
We combine the prediction functions F MCP and F MCD by using two confidence ratings, as a way to ensure a higher confidence in the predictions of the combined (MCP and MCD) classifier. This helps improving the classification precision of our approach, which is crucial in the practical application. These confidence ratings produce the binary results High and Low. They can be logically combined and interpreted as either supporting the prediction of the MCP (High confidence) or objecting to it (Low confidence).
Since the output of F MCD is limited to a single MC i and ¬MC, it is used as the first confidence rating for y MCP , answering the question of whether sample s really belongs to the already predicted y MCP (i.e. a specific MC i ) or whether it belongs to ¬MC. For this purpose, the confidence rating function Θ MCD (Γ(s)) ∈ {High, Low} is defined accordingly. To obtain further confidence on the classification result of F MCP , we define an additional confidence rating Θ db . It uses the decision boundary of the MCP classifier of the predicted class. The idea behind Θ db is to calibrate the decision boundary of each MCP classifier towards more conservative values, requiring a sample with y MCP = MC i to cross a stricter decision boundary for this MC i to obtain a High confidence for Θ db , which reduces the false positive rate and increases the precision. As it is defined over the decision scores, it can be applied to any classifier which provides access to its decision scores or probabilities. For this purpose we access the prediction score of the MCP via the function D MCP (Γ(s)). We also require the existing bias b of the MCP classifier of y MCP , and a parameter θ db ≥ 0 to define the new bias b db = b + (D ⌀ − b) · θ db , with D ⌀ representing the mean of those decision scores of the training samples that have been correctly classified by the MCP. The parameter θ db thus gives rise to the confidence rating Θ db , which yields High if D MCP (Γ(s)) crosses the calibrated boundary b db , and Low otherwise. θ db can be changed dynamically during model selection, which allows for a calibration of the model precision, similar to other methods of false positive calibration as used in e.g. [59]. The confidence ratings are then processed together with the prediction y MCP to produce the final predicted label F combined (Γ(s)) ∈ {MC 1 , . . ., MC k , ¬MC}.
The effect of the confidence ratings and the consequently created High-confidence prediction subset on the applied evaluation metrics will be elaborated in the next section. Now that the MCDs and MCPs are trained and calibrated, we can apply the system to classify unlabeled sequences, as illustrated in Fig 2. As just described, the final prediction F combined (Γ(s)) of this combined classifier for a given sequence s only provides the label of the predicted MC i if Θ MCD (Γ(s)) = High and Θ db (Γ(s), θ db ) = High, and ¬MC otherwise. This, together with the still accessible Θ-confidence ratings, allows for an effective way of increasing the system classification precision, thereby discriminating reliable from unreliable predictions, which is highly relevant in the practical application.
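The combined decision rule can be sketched as follows, assuming a linear decision score (function and variable names are illustrative, not the paper's):

```python
def combined_predict(score_mcp, label_mcp, mcd_high, bias, d_mean, theta_db):
    """label_mcp: MCP prediction for sample s; mcd_high: whether the MCD
    classifier of that class rates the prediction High; bias: existing
    bias b of the MCP classifier; d_mean: mean decision score D of
    correctly classified training samples; theta_db >= 0."""
    b_db = bias + (d_mean - bias) * theta_db   # calibrated, stricter bias
    theta_db_high = score_mcp >= b_db          # boundary confidence rating
    # the MC label survives only if BOTH confidence ratings are High
    return label_mcp if (mcd_high and theta_db_high) else "notMC"
```

Increasing theta_db moves the required score toward the mean score of correctly classified training samples, so fewer but more reliable predictions retain their MC label.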

Evaluation
The evaluation section seeks to answer the following research questions:
• Which of the feature spaces achieve the best classification performance, when evaluated with suitable learning methods widely used in related work? Which learning method achieves the best and most robust results?
• Does the proposed Γ ST feature space, which combines structural and temporal data properties in a novel way, achieve a better classification performance than other feature spaces, which do not use its additional properties of temporal features and positional variance?
• Under which circumstances does Γ ST allow for a better classification performance than the baseline feature space Γ S+T ? What are the implications for a practical application of the combined classification system?
To enable the analysis of these questions, subsection Learning Methods starts with describing the different learning methods used for the evaluation. To achieve a better reproducibility of the experiments, individual parameters and settings are described. Subsection Evaluation Metrics proceeds with explanations on the definition of the confusion matrix and the additional metrics required in our use case to ensure the required practical applicability. The most important evaluations and analyses are then conducted in subsection Experiments on MFC data. This subsection starts with a description of the evaluation settings and the calibration of the δ-parameter, followed by the evaluation of the classification performances using various performance metrics for the individual MCP and MCD predictors, as well as their combined application to simulate the complete system workflow. The subsection ends with a statistical significance analysis, which further supports the achieved results. Finally, subsection Experiments on AFC data describes evaluation results on the AFC data, thereby providing a different perspective on the proposed features and systems.
To enable reproducibility of the experiments, all data utilized in this manuscript is available in anonymized form as supporting material on the PLOS ONE publication page.

Learning methods
While unsupervised or semi-supervised methods have been shown to produce good results on textual and structural data and could be relevant to our problem of model class detection, supervised methods regularly outperform them and are the preferred solution if labeled training data is available. As discussed in Section Related Work, Decision Trees, Markov Classifiers and LSTM are learning methods widely used in the domain of process mining, while MLP and SVM are more widely used on non-sequential data representations of sequential data. For these reasons we conduct our evaluations on those methods, to provide a broad picture of the classification performances achievable on the discussed sequential (Γ qT ), non-sequential (Γ T , Γ S , Γ S+T ) and semi-sequential (Γ ST ) feature representations. We also include the classification performance of k-nearest neighbors to provide an additional baseline. As some of the feature spaces are designed with specific learning methods in mind, they are only evaluated on those learning methods. In our experiments we actually tested all types of features with all learning methods, but achieved suboptimal results on non-suitable combinations; the respective results are therefore omitted. All of the models below have been chosen using standard cross-validation based model selection.
Decision tree. Decision trees model the training data based on their sequence: the shared prefix paths build the root of the tree, which branches along the sequences down to the leaves, annotating the transitions with their respective probabilities. They are widely used, specifically in process mining [27, 30-32] and as random forests of decision trees [35]. We use them as classifiers, based on the sequential Γ qT features, using additional suffix padding to achieve sequences of equivalent length.
Markov classifier. Markov Models are commonly used in process mining [24, 31-33, 63], where they are primarily used for predicting objectives like the remaining time or the next event, and not for sequence classification. They can be used for classification though, as Markov Models represent the data of each class in the training data as a Markov process. This allows calculating the class-wise path probabilities for a new sequence, and predicting the most probable class by the highest overall path probability. We apply such a classifier on the sequential data representations of the Γ qT feature space.
LSTM RNN. Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) units are a type of classifier recently used e.g. in process mining [29, 36]. Deep learning approaches work best with large training data sets; in our use case, obtaining a large amount of labeled samples is not easy, so a deep learning approach might not be the best way to address this problem. However, LSTM RNNs have shown great performance on sequence prediction problems, s.t. evaluating their performance on this problem is still highly interesting. For our experiments we are using the tensorflow [64] implementation of RNN with LSTM. The Γ qT features are specifically designed with an LSTM RNN in mind. To achieve samples of homogeneous length per batch, we added empty events to the end of each sequence. Since the history of each event is of specific relevance in the event sequence handling of LSTMs, this suffix padding is a good solution, as it ensures that the starting events are not empty. In our experiments we achieve the best results when using one-hot label encoding, a single hidden layer of 20 nodes, 300 epochs and a batch size of 10.
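The suffix padding can be sketched as follows (an illustrative helper with hypothetical names, not the tensorflow API):

```python
def suffix_pad(batch, empty="e0"):
    """Pad each sequence at the END with empty events so that all
    sequences in the batch share the length of the longest one; the
    starting events therefore remain non-empty."""
    max_len = max(len(s) for s in batch)
    return [s + [empty] * (max_len - len(s)) for s in batch]
```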
KNN. K-nearest neighbor classifiers are classical distance-based baseline classifiers from the field of natural language processing. We achieve the best results with a value of k = 5, using the Euclidean distance while additionally weighting points by the inverse of their distance, s.t. closer points have a larger impact.
MLP. Multi Layer Perceptrons are a widely used type of neural network sequence classifier, e.g. in [28,34]. We achieve the best results by using a single hidden layer with 80 nodes and the identity function as activation function.
SVM. Support Vector Machines are supervised learning methods, training a maximal-margin separating hyperplane between linearly separable class data. While this can also be extended to non-linearly separable class data, we are using a linear kernel, which has shown very good results given sufficiently high-dimensional data, and specifically for protocol-based communication data [19-21, 60]. For the MCP evaluation we are using a one-vs-rest (OVR) approach, as it calculates a separating hyperplane for each model class MC, which allows a confidence calibration to optimize the system precision, as explained in the next section.

Evaluation metrics
The general system application of reliably detecting and predicting known amidst unknown sequence classes, and its concrete practical application in failure classification, enforces a focus on two primary objectives: (1) to obtain reliable predictions (2) for as many MC samples as possible. Precision, recall and the F1 score capture these aspects. For the evaluation of the MCP and the MCD classifiers they are calculated over multiple cross-validation repetitions. We chose to calculate their unweighted class mean (denoted with the keyword macro), because we already configured the sampling procedure to produce similarly sized model classes. Whenever the ¬MC class participated in the evaluation (in MCD and combined classifiers) we excluded it from the calculation of the class-wise mean values, because our focus is on the correct detection of MC samples, not on the correct detection of ¬MC samples. As the set of ¬MC samples is much larger than each individual MC, including it would lead to overly optimistic evaluation results. Exemplarily, the macro precision is the unweighted mean of the per-class precisions, macro precision = (1/k) · Σ i precision(MC i ), where the ¬MC samples are implicitly contained in the calculation of the precision of each MC, but are not used as a primary class. By combining MCP and MCD classifiers and applying confidence ratings, we effectively create a filter, allowing us to focus solely on the created High-confidence subset of the predictions, which is designed to contain only those samples and labels which truly are of a model class MC and are correctly classified as such, thereby fulfilling objectives (1) and (2). Due to this mixture of multi-class (MCP) and two-class (MCD) predictions, we also have to adapt the metrics in the utilized confusion matrix, as described in Table 4.
Based on these values precision and recall are defined as usual, with precision = TP/P and recall = TP/(TP + FN). However, to correctly address objective (2) we have to calculate an additional effective recall by considering all existing MC samples, not only those in the High-confidence subset. Therefore we define the effective recall as the number of correctly predicted MC samples over the sum of samples in all MC, i.e. effective recall = TP / Σ i s #MC i , ∀MC i . Together with the precision over the High-confidence subset, this metric allows for a conclusive analysis of the overall system classification performance, as provided by the combined classification processing.
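The distinction between recall and effective recall amounts to the choice of denominator (a minimal sketch with illustrative names):

```python
def effective_recall(tp_high_conf, class_sizes):
    """Correct High-confidence MC predictions (TP) over ALL MC samples in
    the test set, not only those retained in the High-confidence subset.
    class_sizes: list with the number of samples of each MC_i."""
    return tp_high_conf / sum(class_sizes)
```

For example, 30 correct High-confidence predictions over four model classes of 25 samples each give an effective recall of 0.3, even if the recall within the High-confidence subset is 1.0.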

Experiments on MFC data
To be able to apply the selected learning methods, we have to ensure sufficiently sized failure classes. To obtain any model classes at all, we sample only from classes with a minimum size of 15 sequences. To improve the interpretability of our classification results we opted for similarly sized failure classes, which we achieved by limiting the size of each failure class to a maximum of 25 sequences. As such, the number of samples per MC used for the evaluation, s #MC , is 15 < s #MC ≤ 25. To simulate the complete system, we also need members of the Other Failures class, of which we used all 86 available sequences, i.e. s #¬MC = 86. In the MCD evaluation this allows highlighting the detection purpose of the method, as the ratio of samples of MC i to ¬MC is approximately 1:4. In the combined evaluation the ratio of all MC samples to ¬MC samples is approximately 1:1, which helps in the interpretation of the classification results. For defining the Γ ST features we used all 6,077 labeled and unlabeled MFC samples as model sequences S M . As previously described, this use of the unlabeled sequences helps to extract additional information about the behavior of the projected sequence. To further reduce the high dimensionality of the resulting Γ ST feature space, we additionally applied a dimensionality reduction, which utilizes the redundancy between the S 2C and S MOC , and the S 2C and S MTC feature vectors, once they are projected with the structural δ-n matching. This resulted in a feature space of 294,435 dimensions. We also conducted an additional analysis on the number of dimensions actually relevant for a sufficient data representation, which could motivate further dimensionality reduction steps; this analysis is summarized in the Appendix. For all evaluations we used 20 times random sampling, each with a 5-fold cross validation.
For a proper evaluation we made sure all conducted comparisons between classifiers and feature spaces were done on the same respective samplings. For the event n-gram sizes of the Γ S , Γ S+T and Γ ST feature spaces we used a fixed n-gram size of n = 2, which yielded the overall best results. We also tested using multiple values of n simultaneously, as e.g. described in [65]. However, the performance increase was only minimal. As a convention, all results are listed as the mean and standard deviation in percent.
Evaluation of the individual MCP and MCD classifiers. The purpose of the first set of experiments is to find the best performing learning method for all feature spaces, s.t. we can restrict the further experiments to this learning method, allowing us to focus on the feature space analyses. Since Γ ST is further parametrized by δ, we start with analyzing the impact of different values of δ on the MCP classification performance of Γ ST using the SVM classifier. To obtain more reliable parameter values, the calibration is conducted using solely the MCP, and not the combined MCP-MCD system. The mean sample length in the MFC data set is 44.11 events, with a standard deviation of 13.51 events. Since δ encodes the positional variance of the matched n-grams within the projected sequence, it does not make sense to increase its size beyond a value of δ = 60, at which the complete average sequence length is covered. Since we expect a high importance of a similar positioning of the matched n-grams, we expect better results for smaller values of δ. The left plot in Fig 3 shows the results for δ ∈ [0, 60]. As expected, we achieve the best results with a value of δ ≤ 10. For that reason we focused more closely on the range of δ ∈ [0, 10], which is illustrated in the right plot in Fig 3, allowing us to narrow the selection of an optimal value down to δ = 5. As this value allows a positional variance of ±5 events on S M , it can additionally be explained by the circumstance that event sub-sequences which are crucial to the protocol, like the call setup (also illustrated in Table 2), take around 10 events in the s 2C representation, requiring any sequence to match the contained events. With a proper setting of δ established, we can conduct a comparative evaluation of all MCP classifiers on the MFC data set, using all feature spaces. The results are shown in Table 5.
Of those learning methods which are widely used in the field of process mining, and which have been applied on Γ qT and Γ qT* (the original sequence representation), the decision tree showed the overall best performance, specifically on Γ qT*. However, both the Markov and the LSTM classifier achieve an improved performance on the temporally enriched Γ qT feature space. When compared to the performance on the other feature spaces, though, all of those approaches commonly used in the field of process mining are clearly outperformed by the other feature spaces and learning methods. While this was to be expected, given that process mining approaches, and specifically deep learning approaches like LSTM, usually require much larger training data sets, the lack of concrete temporal information is also highly relevant, since they are even outperformed by Γ T , which contains only strongly reduced structural information about the data. Overall, the SVM classifier performs best on all feature spaces, outperforming the otherwise widely used MLP as well as the KNN approach. For those reasons we use it for the remaining experiments. The results of the SVM classifier also show that in the optimal scenario in which all MC are known, good results can already be achieved without the proposed θ db system calibration.
For the MCD evaluation only the SVM classifier is evaluated, as it shows the most robust behavior in the MCP evaluation. The results are shown at the bottom of Table 5. Obviously, discriminating the MC and ¬MC is harder than separating the MC in the MCP setting, which is to be expected, as the ¬MC samples are very heterogeneous. However, using semi-supervised learning via a One-Class SVM to model each MC i against the ¬MC performed even worse. Since all other learning methods also performed much worse, their results have been omitted to maintain the readability of this section. Of all feature spaces Γ S+T performs best, while the specifically crafted Γ ST feature space is slightly outperformed by all other feature spaces. While Γ ST might not look that promising yet, the combined classifier evaluation will show that it performs better than its competitors in the final system layout, once the effective recall becomes relevant.
Evaluation of the combined classifiers. Now that we have established some understanding of the performance of the individual MCD and MCP classifiers, we evaluate for each qualified feature space its combined classification performance, also integrating the previously described confidence ratings. We do this to find the feature space which has the highest precision at the highest possible effective recall, which is highly relevant for an effective system in the practical application. Achieving high-precision predictions means that we can trust the results to be correctly classified and to not contain any samples of ¬MC falsely classified as an MC sample. A high effective recall means that we obtain this predictive behavior for a larger portion of the MC samples actually contained in the test set.
This aspect is illustrated in Fig 4, which shows the percentage of high-confidence samples of MC relative to all samples of MC in the test set, under a shifting parameter θ db ∈ [0, 1.0]. The more θ db is increased, the fewer samples of MC are actually contained in the high-confidence subset, reducing the potential effective recall. For comparing the combined classification performance of the different feature spaces with the SVM classifier, we need to select values of θ db representing practically relevant values of precision and recall, which are similar for the respective feature spaces. Fig 5 contains the precision, recall and the effective recall of the classification results for θ db ∈ [0, 1.0]. The precision starts to reach 100% for most classifiers at θ db = 0.8. At this value the recall also reaches its maximum of 100%. When looking at the concrete values of mean and standard deviation, shown in Table 6, we see that the recall cannot be increased further, and that the corresponding precision can be selected such that it is around 93% for Γ ST, Γ S+T and Γ S. Since further increasing the precision would not further increase the recall, and 93% is already a reasonable system precision, we will use this value and the respective settings of θ db for the further analyses. Thus we use θ db = 0.8 for Γ S and Γ S+T, and θ db = 0.9 for Γ ST. The performance of Γ T was not sufficiently high to achieve similar values of precision and recall, which is why we used θ db = 1.1 there, achieving relatively close values for the further analyses. Now we need to evaluate which feature space offers the best effective recall, i.e. the fraction of MC samples that can be recovered from the set of test samples at those high values of precision and recall previously described.
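The interplay of θ db and the three reported metrics can be sketched as follows (data structures and the helper name are hypothetical; in the paper the confidence ratings come from the calibrated SVM models). Precision and recall are computed on the high-confidence subset only, while the effective recall relates correct high-confidence MC predictions to all MC samples in the test set:

```python
def sweep_metrics(samples, theta_db, neg_label="notMC"):
    """samples: list of (true_label, predicted_label, confidence) tuples.
    Returns (precision, recall, effective_recall) for one theta_db value."""
    high = [s for s in samples if s[2] >= theta_db]
    tp = sum(1 for t, p, _ in high if p == t and t != neg_label)
    fp = sum(1 for t, p, _ in high if p != neg_label and p != t)
    mc_high = sum(1 for t, _, _ in high if t != neg_label)
    mc_total = sum(1 for t, _, _ in samples if t != neg_label)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / mc_high if mc_high else 0.0            # within the subset
    effective = tp / mc_total if mc_total else 0.0       # w.r.t. all MC samples
    return precision, recall, effective
```

Sweeping theta_db over [0, 1.0] with this helper reproduces the qualitative trade-off shown in Figs 4 and 5: raising the threshold filters out unreliable predictions (raising precision and recall) while shrinking the high-confidence subset (lowering the effective recall).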
At their respective values of θ db, Γ S has the lowest effective recall of 25.05%, followed by Γ S+T with 26.27%, and Γ ST with an effective recall of 31.79%. Thus Γ ST produces a precision performance similar to both Γ S+T and Γ S, while achieving a 5.5% higher effective recall than Γ S+T, and a 6.7% higher effective recall than Γ S. This means that when using Γ ST we obtain a correct prediction for 31.79% of the MC samples in the test set, at 93.11% precision and 100% recall, compared to 26.27% effective recall, 93.51% precision and 100% recall when using Γ S+T, and even worse results when using Γ S. These results are highly relevant in practice, as they effectively allow filtering near-certain from uncertain predictions with a very high precision. They also highlight the effectiveness of both combined feature spaces, with a significant advantage for the Γ ST feature space. In that regard both structural-temporal feature spaces Γ S+T and Γ ST outperform Γ S and Γ T: whereas Γ T has a relatively good effective recall but a relatively low precision, Γ S has an acceptable precision but a low effective recall. This renders both feature spaces less practically relevant than their combined counterparts, highlighting the relevance of combined structural-temporal feature spaces.
The dimensions most relevant for the respective classification results in this use case were security handshake events, followed by the existence of events representing a successful response to the most relevant key protocol states, like a successful radio bearer setup. As we saw in the MCP evaluation, the temporal features are also relevant and utilized in both combined feature spaces. The Γ ST feature space has an additional advantage here, as its S M -based feature space allows locating the concrete positions and structural-temporal properties of the relevant events within the sequence, which in the evaluation are often identified as responses occurring too late in the sequence, or as security mode negotiations at anomalous sequence positions. All of this can then be used to obtain deeper insights into the data, which can help manual analysts to narrow down the causes for this specific failure class.
Due to their potential in combining classifiers of different feature spaces, we also evaluate the classification performance of an ensemble method [66], namely the ensemble classifier E, which could potentially further optimize precision and effective recall. It predicts the combined classification results by using a majority vote over the predicted labels of Γ S, Γ S+T and Γ ST. Table 6 shows its results when using their default trained models at θ db = 0.0, and for their optimized decision boundary models, using θ db = 0.8 for Γ S and Γ S+T, and θ db = 0.9 for Γ ST respectively. Γ T has not been used due to its lower performance. When using the default models at θ db = 0.0, the ensemble classifier in fact achieves the best precision at the cost of the effective recall, an effect similar to the trade-off of θ db. For the optimized models with θ db ∈ {0.8, 0.9}, however, the ensemble classifier achieves worse results. Significance analysis. To further substantiate the results of the previous section, we conduct significance tests on both of our theoretical hypotheses, namely that the combined feature spaces Γ S+T and Γ ST outperform the base feature spaces Γ T and Γ S in terms of effective recall at similar precision (Hypothesis H A), and that the more complex combined feature space Γ ST outperforms the simpler combined feature space Γ S+T under the same premises (Hypothesis H B). For the formulation of the hypotheses we denote θ er as the minimal effective recall.
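The majority voting used by E can be sketched as follows (the helper name and the tie-breaking convention are hypothetical, as the paper does not specify how a three-way tie between the per-feature-space predictions is resolved):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the per-feature-space labels (e.g. the predictions of the
    models trained on Gamma_S, Gamma_S+T and Gamma_ST) by majority vote.
    A three-way tie falls back to the first-listed classifier."""
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) // 2:
        return label
    return predictions[0]  # tie-break: trust the first-listed classifier
```

With three voters, any label appearing twice wins; only a three-way disagreement triggers the fallback.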
For hypothesis H A the null hypothesis H A 0 is defined as follows: When using Γ S or Γ T for achieving a test set precision mean of 93%, a fraction of p 0 samplings have an effective recall ≥ θ er. The alternative hypothesis H A 1 is then defined as follows: When using Γ ST or Γ S+T for achieving a test set precision mean of 93%, a fraction of p samplings have an effective recall ≥ θ er. Now we can formulate the question for hypothesis H A: Is there sufficient evidence at the α = 0.05 level to conclude that the effective recall for the high-precision classification performance is increased when using one of the combined feature spaces Γ ST or Γ S+T instead of one of the individual feature spaces Γ T or Γ S? And at which minimal effective recall θ er does this hold? The results for the minimal θ er at which we can reject H A 0 in favor of H A 1 (i.e. above which p ≤ α always holds for the resulting p-values) are shown in Table 7, for each pair of base and combined feature space, as calculated on the same samplings that have also been used for the previous combined MFC evaluation. We can see that H A 0 can be rejected for Γ T for values of θ er ≥ 5%, i.e. for nearly all values of θ er, excluding those which do not occur in the combined feature spaces due to their generally higher effective recall. For Γ S, H A 0 can be rejected for θ er ≥ 9% for Γ ST, and for θ er ≥ 14% for Γ S+T. This means Γ ST is better for a larger number of samplings, while Γ S+T starts outperforming Γ S later; both observations are also relevant for hypothesis H B.
For hypothesis H B the null hypothesis H B 0 is defined as follows: When using Γ S+T for achieving a test set precision mean of 93%, a fraction of p 0 samplings have an effective recall ≥ θ er. The alternative hypothesis H B 1 is then defined analogously: When using Γ ST for achieving a test set precision mean of 93%, a fraction of p samplings have an effective recall ≥ θ er. The question for hypothesis H B is then: Is there sufficient evidence at the α = 0.05 level to conclude that the effective recall for the high-precision classification performance is increased when using the complex combined feature space Γ ST instead of the simpler combined feature space Γ S+T? And at which minimal effective recall θ er does this hold? As shown in Table 7, H B 0 can be rejected for all values of θ er ≥ 30%, showing that Γ ST indeed outperforms Γ S+T, a finding that is further strengthened by the performance advantage of Γ ST over Γ S+T already observed for hypothesis H A.
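The paper does not spell out which statistical test underlies Table 7. A generic way to test whether the fraction of samplings reaching an effective recall ≥ θ er exceeds a baseline proportion p 0 is a one-sided binomial test, sketched here with stdlib tools only (the function name and all example values are hypothetical, not the paper's actual counts):

```python
import math

def binom_pvalue_upper(k, n, p0):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0): the chance of
    seeing at least k 'successful' samplings if the true fraction were p0."""
    return sum(math.comb(n, i) * p0**i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical example: 28 of 30 samplings reach the effective-recall
# threshold, tested against a null proportion of p0 = 0.5.
p_value = binom_pvalue_upper(28, 30, 0.5)
reject_h0 = p_value <= 0.05  # alpha = 0.05, as in the paper
```

Scanning θ er upward and recording the smallest value at which the null hypothesis is rejected mirrors the "minimal θ er" reported in Table 7.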
We will now further examine hypothesis H B by analyzing the distribution of precision and effective recall when using either feature space on the same test sets. The results of this performance variance analysis are shown in Fig 6. As previously stated, and shown in Table 6, we conducted the significance tests on SVM classification models calibrated for an average precision ≥ 93%. As shown in the left plot of Fig 6, the models of both Γ S+T and Γ ST show similar performance distributions, with a slight advantage for Γ S+T due to its slightly higher average precision of 93.51%, compared to the 93.11% of Γ ST. However, as the right plot shows, the effective recall on the same test sets is clearly shifted in favor of Γ ST, strongly supporting hypothesis H B. As a result, these analyses support our conclusion that Γ ST is the most advantageous feature space in the discussed use case.

Experiments on AFC data
Due to the already discussed shortcomings of the AFC data set properties, we are only interested in whether the MCP are capable of discriminating the model classes of the AFC data set at all, and how much of a performance improvement we can expect with a larger training data set, which is possible only on the AFC data set. Similar to the class size restrictions described for the MFC evaluation, we have to ensure sufficiently large as well as similarly sized failure classes. To reflect smaller and larger training data sets, we evaluate two different setups. The first setup is defined with comparability to the MFC evaluations in mind. Hence we use s #MC = 25, resulting in the 13 sufficiently sized failure classes listed in Table 1. In the second setup, selecting s #MC = 100 allows for a larger training data set, resulting in 4 sufficiently sized failure classes. For defining the Γ ST features we used all 3,264 sequences as model sequences S M, resulting in a feature space of 144,938 dimensions after redundancy-based dimensionality reduction. Table 8 shows the results of the MCP evaluations on the AFC data set. Due to the differences in the MFC and the AFC data, we expected a worse classification performance than on the MFC data, which indeed occurs. However, for the larger sets of training data with s #MC = 100 the results are largely improved, which shows that the AFC data set still contains a sufficient number of discriminative features to enable classification. This also documents the potential for an equally increased classification performance on the MFC data, once more class-wise training data becomes available as well, which also applies to the general use case of similar classification problems.

Conclusion
This paper addresses theoretical and practical issues relevant when analyzing real-time log data of structural-temporal processes, using specific structural-temporal feature spaces for solving classification problems on mobile communication failure data. On the theoretical side we present an analysis of structural and temporal data properties, specifically for the discussed format of mobile communication data. We introduce and discuss novel individual and combined feature spaces utilizing those properties to obtain a good data representation. We conduct a comparative performance evaluation of these feature spaces against alternative feature spaces and across a range of classification methods, all of which are commonly used in related work. We also show, in various evaluations and via hypothesis testing, that both of our combined structural-temporal feature spaces Γ S+T and Γ ST outperform their counterparts from the research fields of sequence learning and process mining, and that the novel Γ ST feature space excels in classification performance when compared to all other approaches, including an ensemble method. On the practical side we propose a system for the detection and prediction of classes of pre-defined sequence behavior, applied to the use case of the automatic classification of mobile communication failures using the proposed feature spaces and supervised learning, for which we also show how to maximize classification precision and effective recall via a calibration procedure. We highlight the importance of properly labeled training data, for which we show that our proposed Γ ST feature space is able to achieve a highly trustworthy precision of more than 93% while offering an up to 6.7% higher effective recall than the other feature spaces. These results are highly relevant in practice, as they effectively allow separating reliable from unreliable predictions.
With the higher effective recall, more reliable predictions can be obtained, further reducing the costs of otherwise infeasible manual analysis processes. We also discussed approaches to further improve those predictions.
As an outlook, it could be interesting to evaluate the potential of word vector representations like those of [22] for corpora of structural-temporal data. This would not necessarily reflect the temporal data aspects, and would also require data sets much larger than currently available. However, the sequential and contextual aspects of the event relations could potentially be covered, which could help improve the interpretability of the internal process relations, as well as the overall classification performance. Additionally, and despite the differences in the described data properties and the problems to be solved, a prospective comparison of the proposed features with those of the field of process mining and business intelligence analysis will be highly relevant for future research, e.g. by reformulating our failure sequence classification problem as one of predicting the next events and the remaining time. To achieve a more focused scope, this manuscript deliberately limits the comparative evaluations to specific feature representations and learning methods. While those have been chosen based on their use in related work, the primary selection criterion was to allow comparing and incorporating the proposed structural-temporal features. Hence the evaluation of otherwise closely related feature representations (e.g. the complex sequence encoding of [35]) on network communication data sets needs to be part of our future work as well, potentially enabling an extension of current process mining approaches to also cover complex temporally aware multi-class problems on event-based communication data like the one discussed here.

Relevant dimension estimation
Due to the projection on the set of model sequences S M, the resulting Γ ST feature space can become very large. To estimate the potential for the application of a dimensionality reduction approach, we conducted additional experiments using Γ ST and the model class predictors (MCP) on the MFC data, to obtain an empirical estimate of the number of relevant dimensions, similar to the analyses conducted in [62]. However, the sparsity of the Γ ST feature space, together with the limited size of the analyzed data sets, makes it hard to analyze the full set of dimensions with sufficient statistical robustness. To achieve statistically more robust results, the number of base dimensions of Γ ST was reduced to those 5,000 dimensions with the largest variances, selected from a filtered set of those dimensions that had a density > 70%. For those dimensions a PCA was conducted repeatedly, to extract the k largest PCA components in a range of 1 to 1,000. We then projected the train and test samples onto the dimensions defined by those k largest PCA components to obtain a feature representation with a reduced dimensionality. This allowed us to train and test a linear classifier on each of those projected sets, obtaining the results for the largest 100 and 1,000 PCA dimensions respectively, as shown in Fig 7. As can be seen, the performance slowly increases as additional dimensions are added to the utilized feature space. Although the variance is still relatively high, we already achieve convergence at about 100 dimensions. The variance starts to decrease from 400 dimensions on. This means that, for this set of filtered dimensions, a reduction to the 400 largest PCA dimensions can be achieved without losing much of the precision of the base feature set. Due to the restrictions mentioned above, the results achieved on this limited set of dimensions do not represent the full set of features.
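The density and variance filtering step described above can be sketched as follows (function and parameter names are hypothetical; the subsequent PCA projection and classifier training are omitted for brevity):

```python
def select_base_dimensions(X, min_density=0.7, k=5000):
    """From the row-major feature matrix X, keep only dimensions whose
    non-zero density exceeds min_density, then return the indices of the
    k such dimensions with the largest variance, variance-descending."""
    n = len(X)
    candidates = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        density = sum(1 for v in col if v != 0) / n
        if density > min_density:
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n
            candidates.append((var, j))
    candidates.sort(reverse=True)  # largest variance first
    return [j for _, j in candidates[:k]]
```

On the selected dimensions, PCA would then be run repeatedly to extract and evaluate the k largest components, as described above.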
This also explains why the results are lower than those presented in Table 5. Despite these restrictions, the results allow establishing the hypothesis that a much lower number of dimensions may be sufficient when using the Γ ST feature space. This hypothesis needs to be tested on larger data sets in the future.
Supporting information
S1 Data. The MFC and AFC data sets. Each communication sequence is provided in an individual file, containing all relevant data and using anonymized failure classes, protocol identifiers and event identifiers. (ZIP)