PatCluster: A Top-Down Log Parsing Method Based on Frequent Words

Logs are a combination of static message-type fields and dynamic variable fields, and the accuracy of log parsing affects the results of subsequent log analysis tasks. To this end, we introduce PatCluster, an offline log parsing method based on frequent words. The method first generates root nodes through preprocessing; it then counts word frequencies and uses the most frequent word as the segmentation condition to refine the template generated at the root node. Proceeding recursively, pattern nodes are formed for all elements of the nodes and the corresponding templates are generated, finally achieving the goal of log pattern mining. The mining process runs from coarse to fine and relies on few assumptions, and the pattern fitting depth can be controlled by adjusting the termination condition. In the optimized algorithm model, we also consider the maximum degree to which the tokens of the log template match the log message. The experimental results show that this method effectively improves log parsing quality, achieves higher parsing accuracy than other methods, and is better suited to handling logs with complex structures.


I. INTRODUCTION
Logs contain rich information such as the date, user name, and task execution status, which developers can collect for anomaly detection [1], [2], fault diagnosis [3], [4], user behavior analysis [5], and so on. Most of these techniques assume structured logs as input. However, logs in large-scale deployed systems are usually unstructured. Therefore, templates need to be extracted from log messages to convert raw unstructured log messages into structured ones.
As shown in Table 1, the log records a series of information: the time the event occurred, the information level, the process, and the original message. We observe that the log consists of constants and variables, namely template words and parameter words. The constant part remains unchanged when the same type of event occurs, and the variable part changes as the event changes. The goal of log parsing is to distinguish constants and variables in the original log message content. From Table 1 we can extract the log template ''Recovered completed container container_< * >'', where the parameter is ''1445144423 _ 0020_01_000012'' and '' * '' is a wildcard character that matches the parameter. Traditional log parsing methods extract log templates and key parameters through regular expressions constructed from domain expert knowledge [6]. But with the continuous increase in the scale and complexity of modern computing systems, the number of logs also increases exponentially, and the traditional manual extraction of log templates is no longer feasible. At present, there are two ways to extract logs automatically: offline parsing and online parsing. Offline parsing methods need to obtain all logs before processing log messages in bulk; examples include SLCT [7], LFA [8], LogCluster [9], LKE [10], and MoLFI [11]. However, the parsing accuracy of these methods is lower than 0.8, which has a negative impact on subsequent anomaly detection studies. In contrast, online parsing methods deal with logs in a streaming manner and gradually improve parsing accuracy during the parsing process; examples include Spell [12], Drain [13], and Paddy [14]. (The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson. VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License; for more information, see https://creativecommons.org/licenses/by/4.0/.)
The accuracy of Drain and Paddy is better than that of Spell, but the first step of both algorithms requires regular expressions and domain knowledge to filter parameters for different logs. It is noteworthy that Drain is based on the strong assumption that logs of the same message type also have the same length, and these requirements make Drain less transferable.
In this paper, we introduce a new offline log pattern clustering method based on frequent words: PatCluster. We first preprocess log messages to generate root nodes and form rough log templates on them. Next, word frequencies are counted and the most frequent word is taken as the segmentation condition to divide root nodes into left and right child nodes, further refining the rough log templates. Finally, all elements of the nodes form pattern nodes and generate the corresponding templates, ultimately achieving pattern extraction. The method adopts a top-down approach, which avoids the problem of overfitting to a certain extent, effectively improves log parsing accuracy, and is better suited to processing logs with complex structures.
The main contributions of this paper include:
• A top-down log parsing method based on frequent words is proposed, which extracts log templates from coarse to fine, and we also propose optimization points for this algorithm.
• The parsing accuracy of the algorithm is evaluated on 16 datasets, and the experimental results show that this method effectively improves log parsing quality and achieves higher parsing accuracy than other methods.
• This method does not require prior knowledge and is better suited to handling logs with complex structures.
The rest of the paper is organized as follows. Section II introduces related work on log pattern mining. Section III describes the log template extraction model. The new PatCluster algorithm is proposed in Section IV, and its performance is evaluated in Section V. Finally, Section VI presents the conclusion and outlook of this paper.

II. RELATED WORK
A typical log analysis process first extracts templates from unstructured logs and then analyzes the generated structured data. At present, there are four main types of log template extraction technology:

A. FREQUENT PATTERN MINING
Frequent pattern mining automatically generates event templates based on frequent item sets; examples include SLCT [7], LFA [8], and LogCluster [9]. SLCT is the earliest frequent pattern mining algorithm for log parsing [7]: it first builds frequent item sets by traversing the text data, then groups the text into multiple clusters, and finally extracts an event template from each cluster. LFA clusters similar frequent words in each log and abstracts them into event types [8], which is analogous to the idea of SLCT. The difference between LFA and SLCT is that SLCT looks for clusters across log lines, whereas LFA looks for clusters within log lines; this is why LFA can parse rare log message types. LogCluster is an extension of SLCT that focuses on each word in the log without considering position information, and can thus mine both frequent patterns and discrete events in the log [9]. When templates are extracted with a small support threshold, both LogCluster and SLCT tend to overfit and produce overly detailed templates.

B. CLUSTERING
The idea of clustering-based log pattern mining is as follows: first, feature values are extracted from log messages; then the similarity of the log messages is measured based on the distance between the feature values. Log messages whose distances are less than a threshold are grouped into one class, and a representative of this class is selected as the template. For example, LogMine uses a hierarchical clustering method to group log messages into clusters from the bottom up and represents each cluster with the most specific log pattern [15]. LogSig extracts feature values for each log based on word pairs and clusters log messages after multiple iterations, but this method requires predefined clusters to abstract them into event types [16]. Unlike offline methods such as LogMine and LogSig, SHISO is an online clustering method, which parses logs in a streaming fashion and calculates their similarity to existing log clusters [17]. If the similarity is greater than the threshold, the log message is added to the current category; otherwise a new log cluster is created. SHISO requires manual parameter adjustment to update the threshold.

C. HEURISTICS
Unlike other methods that treat logs as general text data, heuristic methods derive appropriate heuristics from additional information such as log content and format, taking full advantage of the characteristics of logs to extract templates. For example, Drain applies a fixed-depth tree structure to represent log messages, in which logs are grouped by length and common templates are extracted at the leaf nodes [13]. Moreover, Drain updates the similarity threshold through an automated parameter adjustment mechanism. Spell parses system event logs based on the longest common subsequence (LCS); it measures similarity by LCS and reduces processing time by using Jaccard similarity and prefix trees [12].

D. DEEP LEARNING
Another approach to mining log patterns is the deep learning method [18], which, unlike exact string matching, matches the pattern to the target string approximately. The feature values of logs are extracted and classified by deep learning algorithms, and frequent items are mined to extract the log templates. However, this approach may depend subjectively on the choice of algorithm, places high demands on the samples, and is prone to problems such as the curse of dimensionality and underfitting. Meanwhile, it is impossible to know whether the decision basis of a deep model is reliable, so its interpretability is relatively poor.
The PatCluster offline parsing algorithm introduced in this paper is a pattern mining method based on frequent words, and in the optimized algorithm model, we consider the maximum degree to which the tokens of the log template match the log message. In the case of complex log information, the log templates extracted by PatCluster still obtain high parsing accuracy and yield better template mining results compared with other methods.

III. PROBLEM DESCRIPTION
Logs can be parsed into log templates by the parser. The definitions of these concepts are given below.
Definition 1: A log dataset consists of a sequence of logs of length m: L = (e_1, e_2, ..., e_m), which is usually sorted by the timestamps of the logs.
Definition 2: Given a set of splitters D, a log can be split into a sequence of words t = (t_1, t_2, ..., t_n), where n is the length of the sequence. The log template is obtained by replacing the variables in the word sequence with wildcards.
Log messages consist of specific event templates and variables, so a log message can be parsed as [E_t, parm_1, parm_2, ..., parm_n], where E_t represents the event template and parm_i represents a variable in the log.
Fig. 1 reflects the results of log parsing: there is a mapping between the original log dataset and the log templates. Each log message can be parsed into a specific event template (E_1 or E_2) and variables. All constant message templates are aggregated into a list of log events to form a structured log.
Observing Fig. 1, the extracted log templates have the following characteristics:
• The extracted templates are mutually distinct: in the set of log templates {E_1, E_2, ..., E_t, ..., E_j}, no two templates are identical.
• The extracted templates cover all original logs: for every log, there exists a unique template corresponding to it.
• The number of log templates is smaller than the number of original logs.
In addition, current offline log parsing methods have low parsing accuracy, which affects subsequent automated anomaly detection and analysis, while online log parsing methods require prior knowledge and rest on strong assumptions, so their portability is poor. Facing these problems, this paper introduces the newly proposed PatCluster algorithm and the related optimization scheme in detail.

IV. THE PATCLUSTER ALGORITHM
This section introduces PatCluster, a pattern mining method based on frequent words. In this method, the most frequent word is used as the segmentation condition to divide the root node into a left child node and a right child node. Proceeding recursively, a pattern mining tree is formed and the corresponding templates are generated, finally achieving the purpose of log pattern mining.

A. PRINCIPLE OF PATCLUSTER ALGORITHM
According to the characteristics of logs, we find two fundamental assumptions for successfully extracting templates: ① constants can be identified relative to variables; ② there are certain key constants from which the template corresponding to a log can be inferred. Based on these two premises, the more frequent a word is, the more likely it is a constant. The algorithm therefore parses logs using a frequent-word-based method.
Fig. 2 reflects the distribution of constants and variables in each dataset, where the horizontal axis lists the frequent words and the vertical axis gives the number of times each frequent word appears in the logs. Blue indicates that a word is a constant, and red indicates that it is a variable. The distribution of word frequencies on the four datasets further verifies the rationality of extracting log templates based on word frequency.
This method treats each log as a basic unit, and the result of splitting each log is an element placed on the root node for processing. First, a rough pattern is generated on the root node. Second, word frequencies are counted, and the most frequent word is taken as the segmentation condition to divide the root node into left and right child nodes. Finally, the pattern generated on the root node is refined on this basis; proceeding recursively, the pattern is continuously updated, pattern nodes are formed for all elements of the child nodes, and the corresponding patterns are generated. The algorithm adopts a top-down method, which effectively avoids the overfitting problem present in other methods.
The pattern mining tree of PatCluster is shown in Fig.3. The root node is located at the top level of the pattern mining tree, and the root node is partitioned into left and right child nodes based on the word frequency, and patterns are further extracted for all elements of the child nodes. The method can control the degree of pattern fitting by adjusting the termination conditions (e.g., adjusting the depth of the tree and the minimum number of elements). In general, the deeper the tree is, the more effective it is in resolving rare pattern types.

B. LOG PREPROCESSING
The first step of the PatCluster algorithm is to process the raw log messages before extracting log templates. The logs are read by line and the log contents are split into words according to certain rules, for example, English sentences can be split by using spaces as separators, and Chinese sentences can be split by a word splitting tool. As shown in Fig.1, 'mod_jk child workerEnv in error state 7' can be divided into a series of words: ['mod_jk', 'child', 'workerEnv', 'in', 'error', 'state', '7'].
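As a concrete illustration, the splitting step can be sketched in Python. The separator set beyond the space character is an illustrative assumption (Section V notes that separators such as ':' and ';' are added in practice):

```python
import re

def tokenize(message):
    # Split a raw log line into words; besides spaces, we also treat
    # ':' and ';' as separators (an illustrative, configurable choice).
    return [w for w in re.split(r"[ :;]+", message) if w]

print(tokenize("mod_jk child workerEnv in error state 7"))
# ['mod_jk', 'child', 'workerEnv', 'in', 'error', 'state', '7']
```
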
The word segmentation result of each line is used as the root node, and a rough template is generated on the root node, that is, the log template generated on the root node can be expressed as < * >. For example, the elements ['mod_jk', 'child', 'workerEnv', 'in', 'error', 'state', '7'] as a root node can be directly represented as < * >. In the pattern expression of this method, the character ' * ' represents any number of words.

C. PATTERN MINING
After preprocessing the log messages, the nodes are further segmented by the most frequent word to generate pattern mining trees. We adopt a top-down method: after a rough pattern is generated on the root node, the root node is divided using the most frequent word as the segmentation condition. The template generated at the root node is refined on this basis, and the log template is continuously updated in a recursive loop. When the termination condition is met, the final log template is generated. The specific steps are as follows:
Step 1: Determine whether the algorithm can be terminated. First determine whether the input node still needs to be processed, e.g., whether the depth of the node is less than the maximum depth of the pattern mining tree and whether the number of elements of the node is greater than the minimum number of elements. If no node needs processing, the algorithm terminates.
Step 2: Word deduplication. Duplicate words are removed from the word segmentation result of each element.
Step 3: Count word frequency. All the deduplicated words on the node are pooled, the frequency of each word is counted, and the most frequent word, Token_max, is selected (if several words tie for the largest frequency, the first one is chosen). According to Fig. 1, the most frequent words are 'child' and 'in', and 'child' is taken as Token_max.
Step 4: Node segmentation. If an element on the node contains Token_max, it is divided into the left child node; otherwise, it is divided into the right child node. The right child node is then processed as in Step 1, and so on recursively until the termination condition is met. The elements of the logs in Fig. 1 are all divided into the left child node.
Step 5: Pattern generation and aggregation. Token_max, the most frequent word, is obtained according to Step 3. Patterns are generated for all elements of the node, and elements with the same pattern are merged into child nodes. The newly generated pattern nodes are again processed as in Step 1, and so on in a recursive loop until the termination conditions are met.
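The five steps above can be sketched as a recursive procedure. The sketch below is a simplified rendering: the termination parameters and the pattern-generation rule (keeping words shared by every element and replacing the rest with the wildcard) are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

MAX_DEPTH = 6      # illustrative termination parameters; the paper
MIN_ELEMENTS = 2   # leaves the concrete values configurable

def most_frequent_token(elements):
    # Steps 2-3: deduplicate words within each element, then count
    # frequencies over all elements and pick Token_max (ties: first seen).
    counts = Counter(w for words in elements for w in set(words))
    return counts.most_common(1)[0][0]

def make_pattern(elements):
    # Simplified stand-in for Step 5: keep words shared by every element
    # and replace the rest with the wildcard.
    shared = set.intersection(*(set(e) for e in elements))
    return [w if w in shared else "<*>" for w in elements[0]]

def mine(elements, depth=0):
    # Step 1: termination check on tree depth and element count.
    if depth >= MAX_DEPTH or len(elements) < MIN_ELEMENTS:
        return [make_pattern(elements)] if elements else []
    token = most_frequent_token(elements)
    # Step 4: elements containing Token_max go to the left child,
    # the rest go to the right child.
    left = [e for e in elements if token in e]
    right = [e for e in elements if token not in e]
    if not right:
        # Every element shares Token_max; emit the refined pattern.
        return [make_pattern(left)]
    return mine(left, depth + 1) + mine(right, depth + 1)

logs = [
    ["mod_jk", "child", "workerEnv", "in", "error", "state", "7"],
    ["mod_jk", "child", "workerEnv", "in", "error", "state", "6"],
]
print(mine(logs))
# [['mod_jk', 'child', 'workerEnv', 'in', 'error', 'state', '<*>']]
```
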

D. MODEL OPTIMIZATION
Templates can be generated by following the above steps; however, the method tends to fall into local optima in some cases. For example, the log 'mod_jk in error state 7 and jk2_init () Found child in scoreboard slot 8' matches both the pattern 'mod_jk in error state < * >' and the pattern 'jk2_init () Found child in scoreboard slot < * >'. When a log matches two patterns at the same time, the basic method assigns the log to the pattern that appeared first. In fact, if the two patterns appear the same number of times, it is preferable to classify this log under the pattern 'jk2_init () Found child in scoreboard slot < * >' (the more specific the pattern, the better). The reason for this local optimum problem in the basic method is that such elements are only divided into the left child nodes in Step 4. To solve this, we add two optimization points to improve the accuracy of the algorithm.
Optimization 1: Modification of Step 4. After node segmentation is completed, the similarity between each element of the left child node and all elements of the right child node is calculated, e.g., sim(element1, element2). An element is copied to the right child node as long as its similarity to any element of the right child node is greater than the threshold. The similarity calculation method can be, but is not limited to, Simhash or TF-IDF.
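Optimization 1 can be sketched as follows. Since the paper leaves the similarity measure open ("not limited to Simhash, TF-IDF, etc."), this sketch uses Jaccard similarity over word sets for simplicity, and the threshold value is an assumption:

```python
def jaccard(a, b):
    # One admissible similarity measure; the paper also allows
    # Simhash, TF-IDF, and others.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def copy_similar_elements(left, right, threshold=0.5):
    # Optimization 1 (sketch): after splitting, copy a left-child
    # element into the right child when it is similar enough to any
    # right-child element. The threshold value is illustrative.
    extended = list(right)
    for element in left:
        if any(jaccard(element, r) > threshold for r in right):
            extended.append(element)
    return extended
```
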
Optimization 2: Add a step to calculate the pattern matching degree. After all nodes stop splitting, the pattern matching degree is calculated for the elements of all nodes, and the pattern with the largest matching degree is selected as the final pattern of the element. The matching degree formula is

MatchDegree = PN / TW,

where PN is the number of pattern words and TW is the total number of words. According to the above algorithm description, the generated template can not only match as many log messages as possible but also account for the maximum degree to which the tokens of the log template match the log message: the optimal pattern is obtained by calculating the pattern matching degree. In addition, unlike other bottom-up pattern mining methods, this algorithm adopts a top-down approach, i.e., log patterns are mined from coarse to fine. These steps form the pattern mining flowchart shown in Fig. 4.
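Applied to the earlier two-pattern example, the matching degree can be computed as a simple ratio. The sketch below (function and variable names are illustrative) counts a pattern's non-wildcard words that appear in the log as PN and the log's word count as TW:

```python
def matching_degree(pattern, tokens):
    # MatchDegree = PN / TW: PN counts the pattern's non-wildcard
    # words present in the log; TW is the log's total word count.
    pn = sum(1 for w in pattern if w != "<*>" and w in tokens)
    return pn / len(tokens)

log = "mod_jk in error state 7 and jk2_init () Found child in scoreboard slot 8".split()
p1 = "mod_jk in error state <*>".split()
p2 = "jk2_init () Found child in scoreboard slot <*>".split()

# p1 matches 4 of 14 words, p2 matches 7 of 14, so the more
# specific pattern p2 is selected.
best = max([p1, p2], key=lambda p: matching_degree(p, log))
print(" ".join(best))  # jk2_init () Found child in scoreboard slot <*>
```
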

V. EXPERIMENTS

A. DATASETS
Most companies are reluctant to disclose their production logs for security and confidentiality reasons, so logs available for research are scarce. In this paper, we evaluate the performance metric of the PatCluster algorithm, Parsing Accuracy, on the Loghub dataset [19]. The dataset is a large collection of logs from 16 different systems totaling about 77 GB, as shown in Table 2. Depending on the characteristics of the logs to be processed, using only spaces as separators makes the word segmentation granularity too coarse, which decreases parsing accuracy. Therefore, this paper adds separators such as '':'' and '';''.

B. EVALUATION INDICATOR
We use Parsing Accuracy (PA) to assess the effectiveness of this method. Parsing accuracy indicates how many of the original log messages are parsed correctly with respect to the ground truth templates in the log sample, and measures the ability of the log parser to distinguish between constants and variables. Specifically, each event template corresponds to a set of log messages after parsing, and a log message is correctly parsed only if its event template is consistent with the ground truth template and the corresponding set of log messages is the same. Following [20], the log parsing accuracy is defined as

PA = CN / TN,

where CN is the number of logs parsed correctly and TN is the total number of logs parsed. For example, if a set of log messages corresponds to templates [E1, E2, E2] but they are parsed as [E1, E3, E4] by the log parser, the parsing accuracy is 1/3. The accuracy of log pattern mining affects the performance of subsequent log mining tasks, and log parsers with low accuracy may significantly limit subsequent research.
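The PA definition can be made concrete with a small grouping-based check (a sketch; names are illustrative) that reproduces the 1/3 example above: a log counts as correct only when the set of logs sharing its ground-truth template equals the set sharing its parsed template.

```python
from collections import defaultdict

def parsing_accuracy(truth, parsed):
    # PA = CN / TN: log i is correct iff the group of logs sharing its
    # ground-truth template equals the group sharing its parsed template.
    def groups(labels):
        g = defaultdict(set)
        for i, lab in enumerate(labels):
            g[lab].add(i)
        return g
    gt, gp = groups(truth), groups(parsed)
    correct = sum(1 for i in range(len(truth))
                  if gt[truth[i]] == gp[parsed[i]])
    return correct / len(truth)

print(parsing_accuracy(["E1", "E2", "E2"], ["E1", "E3", "E4"]))
# 0.3333333333333333
```
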

C. EXPERIMENTAL RESULTS
To evaluate the parsing accuracy of PatCluster, 2000 log messages were randomly sampled from each dataset in LogHub and the corresponding event templates were manually labeled [19]. In Table 3, the number of events (2K) represents the number of event templates in each log sample. These event templates were used as the ground truth to compare the PA of PatCluster with eight other representative log parsing methods on 16 datasets. All experiments were run on a Windows system with Python 3.6 as the programming language. Table 3 presents the parsing accuracy of the log pattern mining methods on the 16 log datasets. Each column gives the parsing accuracy of one method on the different datasets, and the mean value measures the overall strength of the method; each row gives the parsing accuracy of the various methods on the same dataset, which helps compare methods on that dataset. Values in bold indicate parsing accuracy above 0.9, i.e., a good parsing effect; * indicates that the method obtains the best parsing accuracy on that dataset compared with the other log parsing methods.
The experimental results show that, compared with other offline methods, PatCluster achieves the best average parsing accuracy, slightly higher than that of the online parsing method Drain. The parsing accuracy of Drain is also high, second only to PatCluster. The reason is that in the log datasets used in this paper, logs generated by the same system have the same structure and format, which satisfies the condition that logs corresponding to the same template have the same length. However, as the complexity of logs gradually increases, log formats become more diverse, and logs corresponding to the same template do not necessarily have the same length. For example, 'mod_jk child init 1' and 'mod_jk child init 1 -2' share the log template 'mod_jk child init < * >', but because these log messages have different lengths, Drain assigns them to different log templates, which clearly leads to overfitting. Therefore, Drain is better suited to log pattern mining on logs with an inherent structure and features, but it may not make full use of its advantages on log datasets with complex structures. In summary, the use of Drain is limited by its assumptions, so its applicability is not strong. In contrast, PatCluster applies a top-down approach based only on the fundamental assumptions and can therefore be applied more broadly. On the OpenStack and HealthApp datasets, PatCluster exceeds the accuracy of Drain by more than 0.2. Furthermore, PatCluster achieves the best parsing accuracy on 5 of the 16 datasets and exceeds 90% on 10 datasets, a clear advantage over the offline methods.

VI. CONCLUSION
In this paper, we implement a log pattern mining method based on frequent words. The algorithm adopts a top-down approach and achieves pattern mining by gradually refining the rough template on the root node. The method does not require prior knowledge and relies only on the fundamental assumptions; the pattern extraction process runs from coarse to fine, which effectively avoids the overfitting problem present in other methods, and the pattern fitting depth can be controlled by adjusting the termination conditions. Meanwhile, in the optimized algorithm model, we consider the maximum degree to which the tokens of the log template match the log message. The experimental results show that this method effectively improves log parsing quality and achieves higher parsing accuracy than other offline methods. In addition, compared with online methods such as Drain, PatCluster is more transferable and better suited to extracting templates from log datasets with complex structures. The method is based on exact token comparison, and its limitation is that templates cannot be extracted based on semantics. We hope that future work can extract templates based on the semantics of tokens, which would allow the method to be further applied in the field of Natural Language Processing (NLP).