IT Infrastructure Anomaly Detection and Failure Handling: A Systematic Literature Review Focusing on Datasets, Log Preprocessing, Machine & Deep Learning Approaches and Automated Tools

Nowadays, reliability assurance is crucial for the components of IT infrastructures. Unavailability of any element or connection results in downtime and triggers monetary and performance losses. Thus, reliability engineering has recently become a topic of active investigation. System logs have become indispensable in IT infrastructure monitoring for failure detection, root cause analysis, and troubleshooting. This Systematic Literature Review (SLR) presents a detailed analysis based on the various qualitative and performance merits of the datasets used, the technical approaches utilized, and the automated tools developed. The full-text review was directed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology. 102 articles were extracted from Scopus, IEEE Xplore, WoS, and ACM for a thorough examination, and a few supplementary articles were studied by applying the snowballing technique. The study emphasizes the use of system logs for anomaly or failure detection and prediction, and encapsulates the automated tools under various quality merit criteria. This SLR ascertained that machine learning and deep learning-based classification approaches applied to selected features achieve better performance than traditional rule-based approaches. Additionally, the paper discusses research gaps in the existing literature and provides future research directions. The primary intent of this SLR is to identify and inspect the various tools and techniques proposed in the existing literature to mitigate IT infrastructure downtime. This survey should help prospective researchers understand the pros and cons of current methods and pick a suitable approach to solve their identified problems in the field of IT infrastructure.


I. INTRODUCTION
In recent years, modern software has been rapidly integrated into organizations and into our daily lives, and it has become highly influential. Most applications are intended to be continuously accessible and stable. Any trivial or non-trivial downtime can ignite financial [1] and performance losses; for example, a four-hour downtime of Amazon Web Services resulted in a $150 million loss [2]. Thus, it is paramount to maintain the IT infrastructure's health to improve its availability and reliability. (The associate editor coordinating the review of this manuscript and approving it for publication was Porfirio Tramontana.)
In IT infrastructures, several components and assets are connected and continuously interact with each other. For this reason, it is always difficult to determine the cause of a failure. System logs are considered the primary source of data because they record the software's runtime information. Logs are generated when the logging statements that programmers write in the source code are executed. However, making use of this enriched log data is challenging for the following reasons. First, log volume is increasing rapidly (for example, an extensively scaled service system can record 50 GB/hour of logs [3]). Second, open-source platforms (for example, GitHub) allow a system to be designed by many developers [4], and the resulting multiple development styles produce complex logging. Third, the nature of logging statements changes with new software versions (hundreds of new logging statements per month). Fig. 1 illustrates the steps required in the IT infrastructure failure detection and failure handling process. This process is divided into two major parts: first, the collection of necessary data such as logs, resource-usage data, or IT service tickets, followed by pre-processing to reduce the volume; the second part focuses on training and executing models for the detection and prediction of failures. Bhanage [5] discusses this process in detail. Researchers and experts have accomplished ample research on IT infrastructure monitoring in the recent past. As mentioned in the existing literature, failure handling is possible with the help of reactive and proactive approaches [6]. Researchers have explored various types of logs, including RAS logs [7], health logs [8], event logs [9], activity logs [10], transactional and operational logs [11], etc. Also, log parsing has been performed using frequency pattern mining [12], clustering [13], and natural language processing (NLP) techniques [14].
In addition, researchers have explored machine learning models (SVM, Naïve Bayes, and Random Forest) and deep learning models (RNN, CNN, LSTM, and Bi-LSTM) to detect and predict anomalies or failures in various IT infrastructures.
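As a minimal illustration of how such classifiers operate on log data, the following sketch trains a multinomial Naïve Bayes model on event-count representations of log sessions. This is our own toy code, not a method from any surveyed paper; the event IDs, sessions, and labels are invented examples.

```python
import math
from collections import Counter

def train_nb(sessions, labels):
    """sessions: list of event-ID lists; labels: parallel list of class names."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for events, y in zip(sessions, labels):
        counts[y].update(events)
        vocab.update(events)
    return prior, counts, vocab

def predict_nb(model, events):
    prior, counts, vocab = model
    best, best_lp = None, float("-inf")
    for c in prior:
        total = sum(counts[c].values())
        lp = math.log(prior[c])
        for e in events:
            # Laplace smoothing avoids zero probability for unseen events
            lp += math.log((counts[c][e] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Invented training sessions: each is a sequence of log event IDs
train = [["E1", "E2", "E1"], ["E1", "E2"], ["E5", "E5", "E9"], ["E5", "E9"]]
y = ["normal", "normal", "anomaly", "anomaly"]
model = train_nb(train, y)
print(predict_nb(model, ["E5", "E9", "E9"]))  # -> anomaly
```

In practice, the surveyed studies use library implementations (and deep models for larger data), but the pipeline shape is the same: sessions are turned into feature vectors, and a classifier separates normal from anomalous behavior.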

A. SIGNIFICANCE
In IT infrastructure, many assets and components are connected. They continuously communicate with each other, which generates a massive amount of data. Unavailability of any component or connection in the IT infrastructure leads to catastrophic failures and crucial losses [15]. Therefore, it is essential to prevent such failure conditions. According to Du et al. [16], the primary purpose of the log is to record all executed activities and monitor the status of the IT infrastructure. The system log is also used as an elementary source for problem identification and troubleshooting [17]. Traditionally, developers or administrators analyzed logs manually to understand the behavior of the system. However, due to increased complexity and massive data, IT infrastructure monitoring demands automation [18]. System logs are enormous and available in an unstructured format; thus, logs need to be pre-processed to better understand them and retrieve meaningful information from complex log data [14]. Ren et al. [19] reveal that log analysis is a comprehensive approach for failure detection, handling, and prediction. Thus, it is imperative to continue this research and utilize system logs to carry out reactive or proactive strategies in order to avoid failures and prevent monetary and productivity losses.

B. MOTIVATION
Hereafter, IT infrastructures will be available everywhere and expected to be in a continuously working state [20].
Thus, it is imperative to conduct unbiased research to build reliable IT infrastructures and monitor their health [21]. Currently, many popular commercial tools for IT infrastructure monitoring are available on the market, and many IT companies perform IT infrastructure monitoring using log analysis. In the recent past, researchers have suggested various new approaches and tools to take care of the continuous availability of IT infrastructure. Ample research has been done on IT infrastructure monitoring, but a comprehensive analysis has not been presented until now.
The existing literature on IT infrastructure failure detection and handling techniques focuses on specifications such as log data, pre-processing of logs, and machine learning and deep learning approaches for detection and prediction. After a thorough analysis of the existing literature, a comparative study of present tools and techniques is vital. To the best of our knowledge, only a limited number of systematic literature reviews have been published on this topic. This analysis concentrates on the following key points: the availability of datasets, the different technical approaches used to pre-process logs, anomaly or failure detection, and failure prevention in the available literature.

C. EVOLUTION OF THE FAILURE DETECTION AND HANDLING TECHNIQUES
System logs are rich in information and provide all the details about the activities executed on IT infrastructure components. Developers and system administrators have long used system logs to identify IT infrastructure problems and troubleshoot them. Also, system experts scrutinize log data manually by considering the different levels of the recorded log data.
Due to the increasing size of log data, automation in log analysis was initiated in 2003 [12] and has accelerated since 2007. The evolution of failure detection and handling techniques in the studied literature is presented in Fig. 2. In the early stages of this research, the clustering approach was popular for log data pre-processing, and log analysis was achieved using frequency pattern analysis techniques. Researchers then started to find correlations and associations between the various types of logs and other metrics to gather further details about a failure, such as the path of failure, its causes, component details, etc. Along with correlation, rule-based approaches were popular for anomaly and failure detection.
Machine learning techniques have been used comprehensively since 2016 because of their classification and prediction proficiency. Random Forest, Gaussian NB, Naïve Bayes, and Support Vector Machine (SVM) are widely utilized machine learning algorithms for anomaly and failure detection. In 2019, researchers began to use NLP techniques such as Word2Vec, TF-IDF, and GloVe for feature extraction, treating logs as standard text data. Due to the increase in log data size, deep learning techniques such as RNN, CNN, LSTM, and Bi-LSTM have been applied to train detection and prediction models. Many researchers have employed the Auto-Regressive Integrated Moving Average (ARIMA) technique to forecast time-series log data.
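To make the feature-extraction step concrete, the following is a minimal sketch of TF-IDF weighting applied to log messages, treating each message as plain text. This is our own illustration, not code from any surveyed study; the log messages are invented, and real work uses library implementations over far larger corpora.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (tf * log(N/df))."""
    n = len(docs)
    # document frequency: in how many messages does each term appear?
    df = Counter(t for doc in docs for t in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

logs = ["connection timeout on node7",
        "connection established on node3",
        "disk failure on node7"]
vecs = tfidf(logs)
# "on" appears in every message, so its IDF (and weight) is zero,
# while a rare term such as "failure" receives a higher weight.
print(vecs[2]["failure"] > vecs[2]["on"])  # -> True
```

The resulting weight vectors can then be fed to the classifiers discussed above, which is essentially how the TF-IDF-based studies in the literature turn raw log text into model inputs.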

D. PRIOR RESEARCH ON SYSTEMATIC LITERATURE REVIEW
In the existing literature, reviews have been carried out on log abstraction, log clustering, anomaly detection using deep learning techniques, log data quality analysis, log data for troubleshooting, etc. However, these surveys do not cover all the elements, such as datasets, techniques, approaches, and research gaps; each concentrates on a certain part of IT infrastructure monitoring research.
El-Masri et al. [22] published an SLR of automated log-abstraction techniques (ALATs) in 2020. In this review, the authors evaluated 17 automated log abstraction techniques on seven aspects: mode, coverage, delimiter independence, efficiency, scalability, system knowledge independence, and parameter tuning effort.
Cyber-attacks can be one of the reasons for IT infrastructure failure; for this reason, we also considered a survey of log clustering approaches in cybersecurity applications. Landauer et al. [23] in 2020 illustrated clustering techniques, anomaly detection, and evaluation aspects in cybersecurity applications by assessing 50 approaches and two non-academic solutions. The authors also presented a clustering approach selection tool based on the analysis done in the survey. This tool ranks the approaches by their ability to fulfill objectives and visualizes the results on a PCA plot.
Yadav et al. [24] in 2020 published a survey on anomaly detection using deep learning techniques. The survey focused on NLP-based approaches for feature extraction and on machine learning and deep learning methods for anomaly detection using log data. Das et al. [25] presented a systematic mapping analysis in 2020 discussing the general approaches for failure prediction using logs. In their survey [26], Shilin et al. in 2020 address questions such as ''How to write logging statements automatically,'' ''How to compress and parse logs,'' and ''How to use logs to detect, predict, and facilitate diagnosis of failures.'' This survey presents various challenges in the studied literature but fails to provide a comparative analysis.
Bhanage [5] in 2021 categorized the literature into three major groups: log pre-processing, anomaly & failure detection, and failure prevention. In this study, the authors furnished a meta-analysis contingent on the infrastructure used, the dataset utilized for analysis, the category of work, and the methodology used. The authors also enumerated the automated tools for log parsing, log analysis, anomaly or failure detection, prediction, and recovery of IT infrastructure.
Our SLR examines the existing approaches, methodologies, and tools relying on various merit criteria (mode, availability, industrial utilization, and accuracy). The SLR endeavors to open a window of opportunities for forthcoming researchers in the area of IT infrastructure monitoring. In this comprehensive study, we emphasize the following aspects: the availability of datasets for various types of infrastructure, the methodologies utilized for detection and prediction, publicly available automated tools for log pre-processing, technical approaches for detection together with their evaluation metrics, and failure prevention techniques.

E. RESEARCH QUESTIONS
This paper attempts to conduct an exhaustive review of the existing literature on IT infrastructure monitoring techniques. The following research questions are addressed in the study. By answering these questions, we facilitate a better understanding of the current literature. Furthermore, the answers to these research questions demonstrate the effectiveness of the methods used to lead the systematic literature review. Table 1 specifies the list of research questions and the objectives of each defined research question.

F. OUR CONTRIBUTION
This systematic literature review emphasizes the current work carried out in IT infrastructure monitoring to maintain the health of IT infrastructure components. In this rigorous analysis, we explored the various tools and techniques used to handle failure conditions. It also focuses on the miscellaneous frameworks, methodologies, and approaches developed and pursued by several researchers. A comparative analysis and concluding remarks on the various components of the studied literature are provided as an outcome of the answers to the research questions. The systematic literature study scrutinized different tools and techniques based on their results, which pinpoints the vital research gaps.
The rest of the paper is organized as follows. Section 2 discusses the methodology and framework exerted to extract and scrutinize scholarly articles from various databases. Section 3 illustrates the impact of scholarly publications in the existing literature on IT infrastructure monitoring. Section 4 describes the architecture of the proposed system. Section 5 provides a comprehensive discussion of the experimentation process and the derived results. Section 6 discusses the research questions, which further lead the systematic literature review. Section 7 states the paper's limitations and discusses future directions. Section 8 presents the concluding remarks.

II. METHODOLOGY FRAMEWORK FOR SYSTEMATIC LITERATURE REVIEW
A systematic literature review of the available literature was undertaken to find answers to the proposed research questions and objectives. This process sheds light on the potential research gaps and the challenges encountered in the studied research areas and discusses viable solutions. Comprehensive guidelines suggested by Kitchenham et al. [27] were adapted to accomplish a thorough systematic literature review. Table 2 presents the five elements of PICOC (Population, Intervention, Comparison, Outcome, and Context) for framing the searchable questions, as suggested by Kitchenham. Fig. 3 illustrates the process and methodology used for the systematic literature review. First, the research domain is identified, followed by defining the research questions and objectives. The significant material is collected based on the datasets, approaches, techniques, and operations needed to study and answer the research questions. The intended search query was executed on various repositories, such as Scopus, ACM, IEEE, and Web of Science, to collect relevant scholarly publications. Then, inclusion and exclusion criteria were applied to select the most appropriate publications for thorough analysis. The analysis of the studied literature is epitomized through a discussion of the answers to the research questions, future directions for forthcoming researchers, and a conclusion stating the concluding remarks.

A. RESEARCH PUBLICATIONS COLLECTION CRITERIA
Scholarly articles were collected from various databases: Scopus, ACM, IEEE, and Web of Science. We delineated a search query using pertinent keywords such as ''system log or event log,'' ''log analysis,'' ''failure detection or failure prediction,'' and ''machine learning or deep learning'' to retrieve the related articles. The listed prominent keywords were utilized by Bhanage and Pawar [28] to collect the information for a bibliometric analysis. The same search query was executed on the multiple databases to retrieve appropriate publications. The counts of the extracted publications are presented in Table 3.

B. INCLUSIVE AND EXCLUSIVE CRITERIA
As stated in Table 4, inclusion and exclusion criteria were applied to acquire the most relevant scholarly articles for the systematic literature review. The IT infrastructure utilized, the approaches or methodologies used for detection or prediction, and the pre-processing techniques applied were the parameters used to select articles for further analysis.

C. PUBLICATION COLLECTION RESULT
In the publication selection results, 177 publications were found in the Scopus database, ACM and IEEE identified 50 articles, whereas Web of Science yielded only 3. All research articles were investigated thoroughly and categorized into three groups based on the work's intent; the log pre-processing, anomaly or failure detection, and failure prevention categories are discussed in detail in the forthcoming sections. Fig. 4 exhibits the process followed while selecting publications for detailed study by effectuating the inclusion and exclusion criteria. A total of 280 scholarly publications were extracted from the various repositories, as indicated in Table 3. The list was reduced to 270 entries by removing duplicate and irrelevant articles. The 270 publications were probed through their titles, keywords, and abstracts, and 150 scholarly articles were selected for analysis. In addition to the databases, we applied the backward snowballing technique to identify more articles [29]. In backward snowballing, authors track the reference lists of the primarily selected papers; the most relevant articles from the references are shortlisted based on the inclusion and exclusion criteria stated in Table 4. Forty-three supplementary articles were added for study using the snowballing technique. Finally, 122 articles were selected after excluding 13 articles through quality assessment. Besides scholarly articles, we also referred to a few web links to gather information related to ITSM concepts and commercial tools. Fig. 5 presents the contribution of study material by publication type in the systematic literature review.

E. QUALITY ASSESSMENT CRITERIA
Quality assessment criteria were applied to select the scholarly publications used to effectuate the systematic literature survey on IT infrastructure monitoring research. The desired quality articles must significantly contribute to answering the research questions.
The following quality measures were referred to when shortlisting scholarly articles:
• IT infrastructure: The article must focus on IT infrastructure monitoring by employing log data and other resource-related metrics.
• Datasets: The articles emphasized the various components of the datasets, such as the type of dataset, the infrastructure utilized, the time frame, and the data size.
• State-of-the-art tools: The articles discussed existing automated tools for log parsing and analysis and presented details such as the technique used, mode, availability, industrial utility, and accuracy.
• Classification-based approaches: The articles studied machine learning or deep learning-based classification approaches for anomaly or failure detection and prediction. Moreover, the articles provided information on the dataset, the technical approach, the pre-processing or feature extraction techniques, and the metrics used for evaluation.
• Data validation: The articles particularly commented on the findings and results considering the stated objectives and expected outcomes.
Fig. 6 shows the classification of the studied scholarly publications. The publications were classified by perusing the title, abstract, keywords, and full text of each selected publication. According to the work's purpose, the articles related to IT infrastructure monitoring were studied carefully and classified into three categories: pre-processing, detection, and prevention. Pre-processing is further partitioned into log parsing and log analysis. Detection covers the anomaly and failure conditions targeted in the studied literature. Prevention techniques are categorized into reactive and proactive approaches: a reactive process takes place after the occurrence of a failure condition, whereas a proactive approach predicts error conditions before they take place. For RQ1, the importance of log data for troubleshooting is studied based on the nature of logs. The various types of datasets, their availability, and their sizes are discussed to answer RQ2. Anomaly and failure detection approaches, techniques, and their performance are evaluated to answer RQ3. Reactive and proactive strategies are studied and summarized to extract the answer to RQ4. For RQ5, various state-of-the-art tools are analyzed based on the techniques used, mode, availability, industrial utility, and accuracy. All the selected articles are studied cautiously to discover distinguished limitations for RQ6.
All the categories and targeted evaluation points are discussed in detail in the upcoming sections of the paper.

B. DATASETS
1) TYPES OF DATASETS
Studies in the literature were performed on various infrastructures such as distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. Fig. 7 demonstrates the different types of infrastructures and systems explored in the systematic literature review.

2) AVAILABILITY OF DATASETS
Making a log dataset available for study is a challenging task. Log data provides all the details about the execution of the infrastructure components, and misuse of this data may cause serious problems. Thus, log data is not readily available for use or experimentation due to strict business policies and confidentiality issues. A few sample logs have been collected from the existing literature and released for research studies in academia. Table 6 furnishes the list of the infrastructure type, system, dataset type, time frame, and size of the collected data. Zhu et al. [30] in 2019 released a log dataset repository covering 16 different types of systems on Loghub [31]. A few of these datasets (for example, HDFS, Hadoop, and BGL) were utilized and released by previous researchers, whereas the other datasets were collected from the authors' lab environment. Component failure log data from various extensive production systems is accessible in the Computer Failure Data Repository (CFDR) [32]. Different types of logs from Los Alamos National Lab (LANL), an HPC cluster, an Internet services cluster, Cray systems, and a Blue Gene/P system are available to accelerate research on system reliability. Another research study [33] presented the Apache log files that record and store internet search traffic for EDGAR filings through SEC.gov from 14 February 2003 to 30 June 2017. Cotroneo et al. in 2019 [34] executed an empirical analysis of software failures in the OpenStack cloud system; the failure dataset with injected faults, the workload, the failure effects on the user and system side, and the error logs were used for the study and released for further research [35]. Apart from this, there are log datasets collected for cybersecurity research: SecRepo [36] holds a list of security data such as threat feeds, malware, system, and network data.
3) CHALLENGES WITH LOG DATASETS
• Unavailability of data due to sensitivity: Log data carries all the details about the system, such as the resources involved, event records, the sequence of performed activities, and other information. That is why these rich logs are considered sensitive data; misuse of such sensitive information may result in security and other types of issues. This is the reason why logs and event records are not easily available publicly. The unavailability of logs for experimentation is the biggest challenge faced by researchers.
• Huge data size: As IT infrastructure's complexity and execution increase rapidly, massive log data is generated every second. According to the literature, a continuously functioning infrastructure can record approximately 50 GB/hour of logs [3], giving rise to an ever-increasing log volume. Using such an enormous volume of data for experimentation is challenging, considering the problems of managing the data, finding and fixing quality issues, data integration, controlling the big data environment, etc.
• Different data formats: Logs are recorded by the execution of logging statements written by developers during software development. There is no fixed format or template for logging statements; each developer may follow their own logging style, bearing in mind the required contents. Many development styles find their way into software, as the use of open-source platforms is increasing expeditiously and much source code is available on GitHub for reuse [4]. These multiple contributions with various development styles give rise to log data in different formats, which causes trouble in the development of standard data analysis processes.
• Imbalanced data: Historical log records are collected for experimentation from various IT infrastructures. Generally, IT infrastructures operate in a normal state; thus, it is difficult to collect anomalous records. According to Yan et al. [47], imbalanced data needs to be handled to improve fault detection and diagnosis results. The authors applied a Generative Adversarial Network (GAN) to convert imbalanced training data into balanced training data.
• Inconsistent log generation: Inconsistent logs are generated due to changes in the nature of logging statements. Various developers develop different software versions and follow their own writing styles. This results in inconsistent logging statements and, in turn, unstable logs.
• System-dependent data: In any IT infrastructure, the nature of the components and their communication varies based on their utility. As of now, no standard rules or conventions exist for writing logging statements; each system has its own way of writing logging statements and records different types of details, and therefore generates different types of logs. Because of this diversity in log formats, multiple researchers have studied different kinds of systems independently in the literature. Distributed systems, supercomputers, operating systems, mobile systems, server applications, standalone software, and other systems carry different types of logs, as described in Table 6.
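The imbalanced-data challenge above is cited as being addressed with a GAN; as a much simpler baseline, the following sketch (our own illustration, not the cited method; the samples and labels are invented) randomly oversamples the minority class until both classes are equally represented.

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # pad each class up to the size of the largest class
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x += picks
        out_y += [y] * target
    return out_x, out_y

x = [[0.1], [0.2], [0.3], [0.4], [0.9]]      # 4 normal, 1 anomalous sample
y = ["normal"] * 4 + ["anomaly"]
bx, by = oversample(x, y)
print(by.count("normal"), by.count("anomaly"))  # -> 4 4
```

Random oversampling merely repeats existing anomalies, whereas the GAN-based approach in [47] synthesizes new ones; both aim to keep the classifier from collapsing onto the majority (normal) class.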

C. LOG PRE-PROCESSING APPROACHES AND TOOLS
Collected logs are always in an unstructured, duplicated, and ambiguous format, so log pre-processing is crucial before transmitting them for analysis. Pre-processing includes three steps: 1) log filtering, which removes duplicate and noisy data; 2) log parsing, which converts the unstructured log into a structured format; and 3) log analysis, which visualizes the log in a more readable and understandable format.
Hassani et al. [48] claimed that log messages are sometimes unreliable. They may hold errors such as improper log messages, missing logging statements, unsatisfactory log levels, log archive structure problems, runtime problems, overpowering logs, and log library alterations. Thus, log data quality needs to be validated before pre-processing. Fig. 9 lists the techniques and approaches used for log data pre-processing, identified through critical analysis of the scholarly publications. As stated in Fig. 9, semantic value similarity, duplicate removal, and adaptive similarity are applied to remove duplicate and redundant data from the log; this log filtering step is considered a data cleaning process. Frequency pattern mining, clustering, heuristics, and longest common subsequence are the commonly used approaches for log parsing. For log analysis, clustering (DBSCAN, same level of log), semantic techniques (for example, the appearance of words and text mining), and semantic value similarity (friend of a friend) are the techniques employed by researchers in the literature.

1) LOG FILTERING
Irrelevant and redundant data generally introduces considerable noise into feature extraction and affects the accuracy of the analysis. Log filtering removes duplicate or unwanted data and reduces the size of the logs. Log filtering is possible with the help of the following techniques: semantic value similarity, duplicate removal, and adaptive similarity. In the literature, Di et al. [49] conducted duplication filtering prior to log analysis, as RAS logs contain numerous identical messages, whereas Liu et al. [50] proposed a filtering threshold to categorize clusters into normal and anomaly candidates so that regular events can be discarded and the analysis can concentrate on the others. In addition, Oliner and Stearley [46] claimed that filtering changes the alert distribution drastically by removing duplicate alerts occurring within the last 'T' seconds. Ren et al. [20] removed stop words and punctuation to remove redundant event data in distributed cluster systems.
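The time-window deduplication idea described above can be sketched in a few lines. This is our own minimal illustration of the general technique, not the implementation from any cited work; the timestamps and alert messages are invented.

```python
def filter_duplicates(alerts, t_window):
    """alerts: list of (timestamp_seconds, message), sorted by time.
    Drop an alert if an identical message was seen within the last t_window seconds."""
    last_seen = {}
    kept = []
    for ts, msg in alerts:
        if msg not in last_seen or ts - last_seen[msg] > t_window:
            kept.append((ts, msg))
        # update even on drops, so bursts keep sliding the window forward
        last_seen[msg] = ts
    return kept

alerts = [(0, "fan failure"), (2, "fan failure"), (3, "link down"),
          (9, "fan failure"), (40, "fan failure")]
print(filter_duplicates(alerts, t_window=10))
# keeps (0, 'fan failure'), (3, 'link down'), (40, 'fan failure')
```

Note the design choice of updating `last_seen` even for dropped alerts: a sustained burst of identical messages is then collapsed to a single alert, which is the drastic change in alert distribution that filtering is reported to produce.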

2) LOG PARSING
After log collection, logs must be parsed before being sent for further analysis. In a log, some parts are constant (written by the developer) and some are dynamic (updated at runtime). The primary objective of a parser is to distinguish the constant data from the variable data; the constant part of the log represents the event template. Thus, the output of a log parser gives log data with the following contents: timestamp, level, component, event template, and key parameters. In the systematic literature review, we explored 16 automated log parsing tools. To evaluate these parsers, we focus on the techniques used and four merit criteria: mode, availability, industrial utilization, and accuracy, as shown in Table 7. SLCT [12] stands for Simple Logfile Clustering Tool; it is based on a novel log clustering approach to identify patterns in log files. Similarly, LFA [51] works on the same clustering technique to abstract log lines and derive event types. LogCluster [52] is similar to SLCT but performs better on log messages with flexible lengths. LKE [53] proposed a novel algorithm to extract critical log messages from the unstructured log data of the Hadoop and SILK systems; the algorithm was further used for anomaly detection, such as workflow errors and low performance of the selected system. SHISO [17] can continuously mine and refine log templates from real-time system logs of OpenSSH without any prior knowledge by applying a structured-tree concept. AEL [54] is a tool designed to monitor the execution of applications using execution logs whose log lines are expressly not for monitoring purposes. Extensive enterprise applications were considered to check the tool's performance, and the derived results give 90% precision and 98.4% recall. LenMa [55] is an online template generator tool with a one-pass template mining technique.
It classifies log messages based on the length in words of each message and forms clusters of same-length messages to identify unique system log message patterns.
OILog [14] extracts keywords from unstructured logs and designs log templates by applying a multilayer dynamic PSO algorithm (MDPSO). The tool can pull keywords out of real-time and new logs with higher efficiency than four existing tools. The LogParse [56] framework works on a word classification problem instead of template generation to discover the features of template and variable words; it also works efficiently on newly generated log types. Drain [57], an online parsing tool, is based on a directed acyclic graph and maintains log groups through the tree's leaf nodes. This tool achieves 99.9% accuracy on the BGL, HDFS, and Zookeeper datasets, outperforming the LKE, IPLoM, SHISO, and Spell parsers. POP [39] operates with parallel processing, using distributed computing to speed up the parsing of large-scale logs; POP reduces the parsing time compared to other parsers (200 million log messages in only 7 minutes).
Spell [58] supports parallel implementation, which helps accelerate the parsing process. Spell utilizes specialized data structures such as inverted trees and prefix trees. LogMine [13] works efficiently on heterogeneous log messages generated by various systems. It was implemented in the map-reduce framework to extract high-quality patterns by processing millions of log messages in a second. Craftsman [59] is an online parsing tool that applies prefix-tree and frequent-pattern techniques for template matching, but it fails to merge similar templates effectively. Splunk [60] and Loggly [61] are commercial log analysis tools that include automated log parsers. These tools are mainly used in enterprise on-premises deployments or as software as a service (SaaS).
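To make the frequent-token clustering idea behind SLCT-style parsers concrete, the following is a minimal sketch; the support threshold, function names, and sample log lines are illustrative assumptions, not the actual SLCT implementation:

```python
from collections import Counter

def mine_templates(log_lines, support=2):
    """SLCT-style sketch: position-sensitive tokens occurring at least
    `support` times are treated as constants; the rest become <*> wildcards."""
    token_counts = Counter()
    for line in log_lines:
        for pos, tok in enumerate(line.split()):
            token_counts[(pos, tok)] += 1
    templates = set()
    for line in log_lines:
        parts = [tok if token_counts[(pos, tok)] >= support else "<*>"
                 for pos, tok in enumerate(line.split())]
        templates.add(" ".join(parts))
    return templates

logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.3 closed",
]
print(mine_templates(logs))  # {'Connection from <*> closed'}
```

Real parsers add refinements (variable-length messages in LogCluster, online tree maintenance in Drain), but they share this constant-versus-variable separation at their core.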

3) LOG ANALYSIS
Log analysis makes logs more readable and understandable. A clear and simplified outlook on the logs assists in problem detection and troubleshooting. With log analysis, one can extract patterns and knowledge that guide and facilitate IT infrastructure monitoring, problem diagnosis, root cause analysis, and troubleshooting. After carefully studying the automated log analysis tools, we compared them based on the techniques used and merit criteria such as product type, mode, availability, and industrial utility, as shown in Table 8. The LogAider tool [63] works on spatial and temporal correlation mining between events to extract fatal events effectively. The tool shows 95% similarity in its analysis compared to the report generated by the admin. LogLens [64] works on the concept of finding the relationship between the typical workflow execution log sequence and streaming logs to find anomalies. This technique speeds up problem detection and saves up to 12096x person-hours. Priolog [65] utilizes temporal analysis and prioritization techniques to split enormous log data into small groups. This grouping helps with fast problem identification and root cause analysis. According to the systematic literature review, various researchers have designed and implemented many log analysis tools that provide adequate results.
Along with the tools present in the literature, a few enterprise log analysis tools are also available. Graylog [66], Elastic Stack [67], and Fluentd [68] are a few of the open-source tools used by many companies to analyze logs and monitor infrastructure. Sumo Logic [69] and Logz.io [70] are cloud-based data analytics tools that help analyze log data quickly and support system monitoring and troubleshooting in real time.

D. ANOMALY AND FAILURE DETECTION APPROACHES
An abnormal pattern that does not correspond to expected behavior is recorded as an anomaly in the system. This strange behavior propagates and may be responsible for failures.
This section elaborates on a systematic literature review of anomaly and failure detection approaches in the existing literature. Fig. 10 was outlined after a critical analysis of scholarly publications. Fig. 10 summarizes the detection approaches, pre-processing or feature extraction techniques, datasets utilized in the studies, and evaluation metrics applied to check the performance of models. In the existing literature, substantial work has been done on various infrastructures by utilizing different types of logs (syslog [71], event log [72], switch log [73], exception log [74], RAS log [49], etc.), IT service ticket data [75], and resource usage data [76] as datasets for analysis. Traditional machine learning (ML) algorithms generally execute on extracted features; thus, pre-processing or feature extraction of logs is obligatory. Template mining [77], semantic vectorization [78], [79], and NLP techniques [80] are the popular approaches applied for pre-processing or feature extraction of the selected dataset. After a rigorous analysis of scholarly articles, we can say that detection approaches predominantly fall into four categories: rule-based approaches [11], methods based on association analysis [81], clustering [82], and classification-based methods [83], [84]. Different evaluation metrics are used to measure the performance of these algorithms. The most frequently used evaluation metrics are precision [85], recall [86], accuracy [87], F1 score [88], and correlation coefficient [89].
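To make these metrics concrete, the following sketch derives precision, recall, accuracy, and F1 score from confusion-matrix counts; the counts in the example are illustrative:

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts
    (tp = true positives, fp = false positives, etc.)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# e.g. a detector that found 90 of 100 true anomalies with 10 false alarms
# over 1000 log windows
p, r, a, f1 = detection_metrics(tp=90, fp=10, fn=10, tn=890)
print(p, r, a, f1)
```

Precision penalizes false alarms, recall penalizes missed anomalies, and F1 balances the two, which is why F1 is often preferred over raw accuracy on the highly imbalanced datasets common in this domain.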
The studied research components are listed collectively in Table 9 with specific fascinating properties. Expressly, properties like particular infrastructure, dataset, technical approaches for detection, pre-processing or feature extraction techniques, stated performance, and relevant insight indicated to particularize the strengths and weaknesses of the study in Table 9. The approaches are divided into subcategories based on detection strategy, namely anomaly detection, failure detection, fault detection, impactful service detection, and run-time problem detection.
A significant amount of work has been done in this domain. However, current systems demand correlation among alerts and events to reduce false warnings [90]. According to Studiawan and Sohel [80], class imbalance in log data can be the reason for the low performance of anomaly detectors.
This paper predominantly emphasizes general types of anomalies and failure conditions that occur due to spontaneous flaws in software systems and result in downtime. External causes of failure, such as cyber-attacks and malicious activities, are out of scope for this systematic literature review, as they fall under system security.

1) THE RULE-BASED APPROACH
The rule-based approach compares logs against a set of expert-defined rules to identify abnormal behavior of the software system. This approach primarily employs graph models for the early detection of anomalies or failures.
Jia et al. [11] introduced time-weighted control flow graphs (TCFGs) to capture the normal execution of a cloud system. An anomaly alarm is raised when abnormal behavior is observed in the transactional and operational logs of the Hadoop system. Nandi et al. [91] employed a control flow graph (CFG) technique to overcome the need for instrumentation or application specification assumptions. The model claims approximately 90% recall for sequential and distributed anomaly detection in OpenStack. Jia et al. [77] claim an average of 90% precision and 80% recall with a hybrid graph model. This model runs in two layers. The first layer calculates the service topology based on log frequency. Then graph-based mining takes place to design time-weighted control flow graphs (TCFGs) for anomaly detection.
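A minimal sketch of the control-flow-graph idea behind these works follows, using a simplified unweighted graph (the time weights of TCFGs [11] are omitted) and illustrative event names:

```python
def build_transition_graph(normal_sequences):
    """Learn the set of event transitions observed during normal runs
    (a simplified, unweighted stand-in for a TCFG)."""
    edges = set()
    for seq in normal_sequences:
        edges.update(zip(seq, seq[1:]))
    return edges

def find_anomalous_transitions(edges, sequence):
    """Flag transitions never seen during normal execution."""
    return [pair for pair in zip(sequence, sequence[1:]) if pair not in edges]

normal = [["open", "read", "close"], ["open", "write", "close"]]
graph = build_transition_graph(normal)
print(find_anomalous_transitions(graph, ["open", "close", "read"]))
# [('open', 'close'), ('close', 'read')]
```

Time-weighted variants additionally record expected delays on each edge, so a transition that is structurally valid but arrives too late can also raise an alarm.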

2) CORRELATION AND ASSOCIATION-BASED APPROACH
Farshchi et al. [72] proposed a regression-based approach to detect anomalies in the execution of Amazon DevOps operations. A correlation between the operation log and resource usage data was established to check the effect of operational activities on cloud resources. B et al. [89] proposed LADT (lightweight anomaly detection tool) to detect anomalies in virtual machines on the cloud. An anomaly alarm is raised when the correlation coefficient value drops below a threshold level. The correlation coefficient is calculated from node-level data and VM-level metrics. Di et al. utilized various types of data to find correlations and detect anomalous behavior. The data sources include the reliability, availability, and serviceability (RAS) log; the job scheduling log; the log of each job's physical execution tasks; and the I/O behavior log used for joint analysis [49]. Di et al. recorded a mean time to interruption (MTTI) of 3.5 days for the whole Mira system during the experiment. Nie et al. [86] identified pair-wise relationships in sequences to form clusters using a multivariate relationship graph. An anomaly is recorded in the physical plant sensor's dataset if one or more pairwise relationships are breached.
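The correlation-threshold mechanism of LADT [89] can be sketched as follows; the metric values and the 0.5 threshold are illustrative assumptions, not values from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def is_anomalous(node_metrics, vm_metrics, threshold=0.5):
    """LADT-style sketch: raise an alarm when node-level and VM-level
    metrics stop correlating (threshold is illustrative)."""
    return pearson(node_metrics, vm_metrics) < threshold

node = [10, 20, 30, 40, 50]
vm_ok = [11, 19, 31, 42, 48]    # tracks the node load -> healthy
vm_bad = [50, 10, 45, 12, 33]   # decoupled behaviour -> anomaly
print(is_anomalous(node, vm_ok), is_anomalous(node, vm_bad))  # False True
```

The intuition is that a healthy VM's resource usage should track its host node; a sustained drop in correlation signals that one of the two is misbehaving.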

3) CLUSTERING
In a clustering-based approach, clusters of logs are generated depending upon the similarity of features. The size of the log message, the recorded timestamp, and the log level are a few of the parameters applied for clustering. Log entries similar to each other are combined in the same cluster, and dissimilar ones in others. A cluster with very few log instances is likely to be anomalous. The LogCluster [52] algorithm designed by Lin et al. assists developers in problem detection by forming clusters of event log sequences. He et al. [92] proposed Log3C, a unique cascading-clustering-based framework to detect impactful system problems by assessing log event sequences and KPIs (key performance indicators). This framework forms clusters of massive data promptly and precisely by iteratively sampling, clustering, and matching log sequences. CRUDE (Combining Resource Usage Data and Error) [93] employed the console log and resource usage data of the Ranger supercomputer for accurate error detection. Jobs with abnormal resource usage were identified by forming clusters of nodes with similar behavior. Du and Cao [82] observed the relation between log sequences and corresponding behavior patterns to point out anomalies in Hadoop and LANL data. In the study by Chen et al. [94], a hierarchical clustering algorithm was used to form clusters and identify anomalies based on their scores, but the incompleteness of logs was neglected. Recently, Yang et al. proposed a novel re-clustering algorithm by improving K-means to detect faults in the BlueGene/L and Thunderbird systems. The Distributed Memory model of Paragraph Vectors (PV-DM) was utilized to obtain low-dimensional log vectors, and then an improved K-means algorithm was applied to form the clusters [95].
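The core intuition, that sparsely populated clusters are likely anomalous, can be sketched as follows, using exact pattern matching as a simplified stand-in for the similarity-based clustering of the surveyed tools:

```python
from collections import defaultdict

def cluster_by_pattern(sequences, min_size=2):
    """Group log sequences sharing the same event pattern and flag
    sparsely populated clusters as likely anomalies. Sketch only:
    real tools cluster on similarity measures, not exact equality."""
    clusters = defaultdict(list)
    for seq in sequences:
        clusters[tuple(seq)].append(seq)
    anomalies = [seq for members in clusters.values()
                 if len(members) < min_size for seq in members]
    return clusters, anomalies

runs = [["start", "load", "stop"]] * 5 + [["start", "crash"]]
_, anomalies = cluster_by_pattern(runs)
print(anomalies)  # [['start', 'crash']]
```

Production systems replace exact equality with edit distance or vector similarity (e.g. the PV-DM vectors of [95]) so that near-identical sequences still land in the same cluster.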

4) CLASSIFICATION BASED APPROACH
In the research on IT infrastructure monitoring, ML and DL techniques are utilized to classify log data and detect anomalies and failures in the system. Most of the literature focuses on the classification approach for detection using ML and DL techniques, as reflected in Table 9.
• Machine Learning-Based Techniques: In the research on IT infrastructure monitoring, ML techniques are utilized to classify log data and detect anomalies and failures in the system. As reflected in Table 9, most of the literature focuses on the classification approach for detection using ML techniques. Various algorithms such as decision trees, random forest, SVM, naïve Bayes, and Gaussian NB are applied to classify log data. A few researchers employed feature reduction, feature selection, and feature extraction processes to improve the performance of classifiers. Word2Vec, bag of words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and template2Vec are the literature's most prevalent feature extraction techniques. The classes are formed based on standard and anomalous behavior, where an outlier is detected as an anomaly or failure. Bertero et al. [71] extracted features by applying the Word2Vec approach on the system log, followed by a binary classifier, random forest, and Gaussian NB to detect stress behavior of the virtual network. An unsupervised learning approach was utilized to detect anomalies in online streaming data [96] and privacy-aware abnormalities in the HPC system [97]. Also, Bronevetsky et al. [102] introduced an unsupervised model to quantify individual node abnormality. To analyze the reasons for network anomalies, Wang et al. [84] exploited the unsupervised Isolation Forest, OneClassSVM, and LocalOutlierFactor algorithms. Yan et al. [104] proposed a novel EKF-CS-D-ELM hybrid classification method to resolve the air handling unit's (AHU) fault detection and diagnosis issue. The authors applied a cost-sensitive dissimilar extreme learning machine and claimed more accurate, fast, and robust fault diagnosis results over the support vector machine (SVM).
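As a concrete illustration of the pipeline these studies share (extract per-session features, then classify), the following stdlib sketch uses event-count vectors and a toy nearest-centroid classifier standing in for the decision trees, SVMs, and naïve Bayes models used in the literature; event names and data are illustrative:

```python
import math
from collections import Counter

EVENTS = ["E1", "E2", "E3"]

def count_vector(session):
    """Feature extraction: per-session event-count vector."""
    c = Counter(session)
    return [c[e] for e in EVENTS]

def nearest_centroid(train, labels, sample):
    """Toy classifier standing in for decision trees / SVM / naive Bayes:
    assign the label of the closest class centroid."""
    centroids = {}
    for lbl in set(labels):
        rows = [v for v, l in zip(train, labels) if l == lbl]
        centroids[lbl] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda l: math.dist(centroids[l], sample))

train = [count_vector(s) for s in
         [["E1", "E2"], ["E1", "E2", "E2"], ["E3", "E3", "E3"], ["E3", "E3"]]]
labels = ["normal", "normal", "anomaly", "anomaly"]
print(nearest_centroid(train, labels, count_vector(["E3", "E3", "E1"])))
```

In practice the count vectors would be replaced by the Word2Vec, TF-IDF, or template2Vec features named above, and the classifier by a properly validated model.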
The HitAnomaly [105] anomaly detection framework was developed to capture semantic data in template sequences and parameter values by applying an encoder. LogTransfer [106] proposed a method to transfer anomalous observations from source software systems to target software systems by considering global word co-occurrence and local context information and tackling logs in different formats. Studiawan and Sohel [80] suggested using a class balancing method to deal with the challenge of handling imbalanced data.
• Deep Learning-Based Techniques: DeepLog [16] transformed log records into natural language sequences by applying an LSTM neural network model and claimed 100% anomaly detection accuracy. Wang et al. [81] and Meng et al. [79] processed log data using NLP techniques, and the generated vectors were provided to an LSTM for anomaly detection to mitigate false alarms. At the same time, Zhang et al. [78] presented sequences of semantic vectors to a Bi-LSTM (Bidirectional Long Short-Term Memory). Borghesi et al. [88] implemented a semi-supervised autoencoder-based strategy to avoid trouble in data labelling. Xie et al. [83] applied a confidence-guided anomaly identification model by blending multiple algorithms to combat concept drift. Supervised models such as random forest, naïve Bayes, and neural networks perform anomaly detection well on vectorized data [99].
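DeepLog's workflow, predicting the next log event from a history window and flagging observations outside the top-k candidates, can be illustrated with a simple frequency table in place of the LSTM; this is a pedagogical stand-in, not the actual model:

```python
from collections import defaultdict, Counter

class NextEventModel:
    """Sketch of the DeepLog workflow with a frequency table replacing
    the LSTM: predict likely next events from a history window and
    flag observations outside the top-k candidates."""
    def __init__(self, window=2, k=1):
        self.window, self.k = window, k
        self.table = defaultdict(Counter)

    def fit(self, sequence):
        for i in range(len(sequence) - self.window):
            ctx = tuple(sequence[i:i + self.window])
            self.table[ctx][sequence[i + self.window]] += 1

    def is_anomalous(self, context, event):
        top = [e for e, _ in self.table[tuple(context)].most_common(self.k)]
        return event not in top

model = NextEventModel()
model.fit(["open", "read", "close", "open", "read", "close"])
print(model.is_anomalous(["open", "read"], "close"))   # False
print(model.is_anomalous(["open", "read"], "delete"))  # True
```

The LSTM's advantage over such a table is generalization: it can assign probability to contexts it has never seen verbatim, which matters for the long, noisy sequences of production logs.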

E. FAILURE PREVENTION APPROACHES
When a system deviates from its intended work and cannot accomplish required functions, the situation is called a failure condition. Even if we handle such failure conditions promptly, they introduce downtime. Such unavailability in continuously working large-scale systems is unexpected and dissatisfying for users. Problem discovery in the components and connections of IT infrastructure is possible by observing unusual system behavior. Even once a fault is determined, gathering the required information, such as location, path, involved components, and cause, is extremely difficult. Thus, we need to build systems that can predict failure conditions in advance. Another way to prevent failure conditions is to find the root cause of the problem and take corrective actions to avoid it in the future. In IT infrastructure, the conventional procedure is to leverage the precious system logs to predict failures preemptively.

1) PREDICTION
This section elaborates on a systematic literature review of anomaly and failure prediction approaches in the existing literature. Fig. 11 was outlined after a critical analysis of scholarly publications. Fig. 11 summarizes the prediction approaches, feature extraction techniques, datasets utilized in the studies, and evaluation metrics applied to check the performance of models. In the existing literature, significant work has been done on various infrastructures by utilizing benchmark datasets [104] released for experimentation purposes and real-time datasets [75] developed in specific environments. Traditional machine learning (ML) algorithms generally execute on extracted features; thus, feature extraction from logs is obligatory. Bag of words [107], Term Frequency-Inverse Document Frequency (TF-IDF) [75], Global Vectors for Word Representation (GloVe) [106], and a feature matrix algorithm [108] are the popular approaches applied for pre-processing or feature extraction of the selected dataset.
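Of these, TF-IDF is the simplest to illustrate; the following minimal sketch computes TF-IDF weights over tokenized log templates (the tokens are illustrative):

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights for tokenized log templates: term frequency
    within a document, scaled down by how many documents contain the term."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({tok: (tf[tok] / len(doc)) * math.log(n / df[tok])
                        for tok in tf})
    return weights

docs = [["disk", "error"], ["disk", "ok"], ["network", "ok"]]
w = tf_idf(docs)
print(round(w[0]["error"], 3))  # 0.549 -- rare tokens get larger weights
```

Rare, discriminative tokens (such as "error") receive higher weights than common ones, which is exactly the property that makes TF-IDF vectors useful as failure-prediction features.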
The studied research components are listed collectively in Table 10 with specific fascinating properties. Expressly, properties like particular infrastructure, dataset, technical approaches for detection, pre-processing or feature extraction techniques, stated performance, and relevant insights indicated to particularize the strengths and weaknesses of the study in Table 10.
• Failure Prediction: Zheng et al. [7] affirmed an improvement in fault tolerance (reducing service unit loss by up to 52.4%) by applying a genetic-algorithm-based method. Seer [123] can predict 54% of the system's hardware failures. Karakurt et al. [124] utilized machine learning approaches to predict failures in the Oracle database. In comparison, Rawat et al. [110] applied a time series stochastic model to predict VM failure in cloud infrastructure. Researchers augmented the concept of TF-IDF with LSTM [120] and deep CNN algorithms [19] to predict failures in HPC and Hadoop infrastructure, respectively. Doomsday [125] enforced time-based learning to detect rare compute node failures, using time-based phrases as the prediction mechanism. Li et al. [111] proposed a framework that can predict node failure in ultra-large cloud computing systems and helps DevOps (software development and IT operations) teams in establishing AIOps (Artificial Intelligence for IT Operations). Elsayed and Zulkernine proposed the PredictDeep [122] framework for cloud security anomaly detection and prediction by applying a combination of graph analytics and deep learning techniques. It also successfully reduced the false alarm rate of anomaly prediction.
• Event Prediction: Researchers have explored probability, correlation, machine learning, and deep learning techniques in the existing literature for event prediction. According to Gainaru et al. [126], event prediction in the HPC system is vital to take proactive actions for failure identification, tolerance, and recovery. Fu et al. [127] proposed a tool for system administrators for semi-automated detection of the root cause failure event by applying a three-step approach.
• Fault Prediction: Gainaru et al. [128] suggested a hybrid approach (signal analysis and data mining) for fault prediction in an HPC system and claimed that the hybrid approach outperformed either technique executed individually. Pal and Kumar [114] applied distributed log mining using ensemble learning (DLME) on network logs.
• Job Status Prediction: Saadatfar et al. [10] used a Bayesian network as a data mining technique to uncover the relationship between workload characteristics and job failures. The analyzed data assists in detecting failure patterns in the AuverGrid system. Yoo et al. [115] utilized machine learning classifiers for job status prediction by characterizing the patterns of task executions in a job with classes of successful and unsuccessful job statuses. The authors selected 13 resource-usage-related fields from the job logs and fed them as features to the machine learning mechanisms.
• Correct Maintenance Time Prediction: Predicting the correct maintenance time and scheduling maintenance actions can relieve failure situations in any hardware system. In the literature, ML techniques for maintenance time prediction were applied to ATMs [108] and vending machines [116].

• Remaining Useful Life Prediction
To predict the health of a hard disk, its Self-Monitoring, Analysis and Reporting Technology (SMART) attributes were provided to a Bayesian network [117] and a random forest [130].
• Incident Prediction: Roumani and Nwankpa [118] used a hybrid model that engages ML and time series (ARIMA) techniques to predict cloud incidents. Moreover, the eWarn [75] framework was proposed to predict general incidents in online service systems by utilizing historical log data.

2) ROOT CAUSE ANALYSIS
Table 11 presents the required data, techniques, metrics, and performance of the root cause analysis approaches studied in the systematic literature review. Root cause analysis is the approach to define, understand, and resolve faults in the system. It is necessary to find the underlying cause of a problem in order to identify appropriate solutions. Furthermore, the primary cause can also point to the precise place to employ corrective action and prevent failure [135]. Lu et al. [131] designed a model to identify the root cause of application delay in the Spark system by utilizing weighted factors to determine the probability of each root cause. CPU, memory, network, and disk are the four components included to find the root cause of abnormalities. Weng et al. [132] developed a solution to assist cloud administrators in localizing an anomaly's root cause. This solution works effectively at the VM and process levels and finds the root cause even if an anomaly happens due to multiple reasons. Weng et al. took advantage of both the application layer and the underlying infrastructure to discover the root cause. A graph-based framework was proposed by Brandón et al. [136] to perform root cause analysis for service-oriented and microservice architectures. The authors also claimed that the graph-based methods outperformed a machine learning approach by 19.41%. Yuan et al. [133] applied a learning-based approach in the OpenStack cloud service to track the root cause of anomalies. The stated process learns log patterns from past experience and uses them for knowledge building. According to Konno and Défago [134], root cause analysis is momentous to ensure the cloud system's quality of service (QoS). Experiments were performed on time-series monitoring data with injected faults in real time.

IV. THE ARCHITECTURE OF THE PROPOSED SYSTEM
The proposed architecture for failure prediction in IT infrastructure to avoid failure conditions is shown in Fig. 12. The proposed methodology pipeline is divided into four phases: 1) pre-process raw log data and extract valuable features for the deep learning models; 2) train the deep learning models on the provided features; 3) test the model to investigate the effectiveness of the trained deep learning model; and 4) deliver output in the form of predictions along with supporting actions.
The first block shows any raw log data as a dataset available for experimentation purposes. The second block represents a log parsing step, which derives the log template from the raw log by using the log parsing tool. It is the process of converting unstructured logs to structured logs. Log parsing reduces the log data size by removing the redundant logs generated through the same logging statement. The third block depicts the feature extraction process, which derives the semantic vector sequence from the log template records. This semantic analysis will be performed to identify relevant features from massive log data with the help of Natural Language Processing techniques. By considering only relevant features, we will be able to avoid the challenges in handling massive log data. These extracted features will be put forward to the fourth block to train the model. A deep learning model will be trained to detect the probable failures and identify the failure pattern by analyzing historical data.
The fifth block illustrates the process of model testing. In this phase, the testing dataset (a balanced dataset) will be supplied to the trained model. A time window is introduced to obtain sufficient lead time for a prediction. Late predictions, arriving shortly before the failure, would be of no use, as the system admin would not have time to take mitigation actions. To deal with this essential parameter, we use log data in a specific time window. Sufficient lead time in failure prediction is helpful to take corrective actions and avoid downtime.
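The time-window and lead-time idea can be sketched as follows, with illustrative window and lead-time values in seconds: windows that close too near the failure are marked unusable, since a prediction made from them would leave no time to react.

```python
def label_windows(events, failure_time, window=60, lead_time=120):
    """Group timestamped (t, event) pairs into fixed windows; a window is
    usable (True) only if it closes at least `lead_time` seconds before
    the failure. Window and lead-time values are illustrative."""
    if not events:
        return []
    start = min(t for t, _ in events)
    horizon = failure_time - lead_time
    windows = {}
    for t, ev in events:
        idx = (t - start) // window
        windows.setdefault(idx, []).append(ev)
    return [(evs, start + (idx + 1) * window <= horizon)
            for idx, evs in sorted(windows.items())]

events = [(0, "warn"), (30, "warn"), (70, "error"), (200, "error")]
print(label_windows(events, failure_time=260))
```

Here the failure occurs at t=260, so with a 120-second lead time only windows ending by t=140 give the admin enough time to act.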
The last part of the architecture shows the activities to be performed after getting the model results. An alert will be generated to notify the system admin of the prediction of any potential failure.

V. EXPERIMENTS
The authors performed experimentation to fulfil the proposed architecture's first phase (data collection and feature extraction). All the datasets, parsing tools, and feature extraction approaches utilized for the experiment were shortlisted through rigorous analysis of the existing literature. In the literature review, particular focus was given to studying the availability of log datasets and the tools and techniques applied for pre-processing, detection, and prediction operations.
In a way, we can say that selection of parsing tools and vectorization techniques for experimentation is the output of this systematic literature review. Similarly, other aspirants can benefit from this SLR to identify the appropriate tools, techniques, or approaches while working in the IT infrastructure monitoring domain.
In IT infrastructure failure detection and prediction, the first and foremost action is to collect log data from the selected infrastructure. The gathered logs are always in raw format. Such log data cannot be fed directly into the detection or prediction process; thus, it is obligatory to transform unstructured raw log data into structured logs. The processed structured logs are then used for subsequent analysis. From the proposed architecture, log parsing followed by semantic analysis for feature extraction is targeted for implementation. This section emphasizes the experimentation modules: dataset, log parsing, and semantic analysis.

A. DATASET
In accordance with the conducted literature review, various datasets have been utilized in the studies and released for further experimentation, as shown in Table 6. We picked one sample dataset from each category for the experimental activities. Logs of various infrastructures were collected for experimentation: HDFS from the distributed system category, BGL and HPC from the supercomputer category, Linux from the operating system category, Android from the mobile system category, Apache from the server application category, and Proxifier from the software category.

B. LOG PARSING
Every single log message is written by a logging statement that records the state of system execution. Log messages are recorded with a log header and message contents. The log header is an amalgamation of id, state, timestamp, level, etc. The message contents are a combination of constant and variable parts. The developer writes the constant string as a printing statement, while the variable component is updated on execution and carries the current state particulars. The constant string imparts the log template of the log message and stays intact for the entire event's presence. The primary aspiration of log parsing is to map every log message to a particular template. Fig. 13 exhibits the elements of a sample HPC system log. The HPC raw log message includes different log header parameters (LogId, Node, Component, State, Time, and Flag) and message contents (Content). Furthermore, the contents are used to generate a unique event template.
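As an illustration of this structure, the following sketch splits an HPC-style log line into header fields and content, then masks the variable parts to recover the event template; the sample line, field layout, and masking regexes are assumptions for illustration, not the actual format used by any specific parser:

```python
import re

HEADER_FIELDS = ["LogId", "Node", "Component", "State", "Time", "Flag"]

def parse_log_line(line):
    """Split an HPC-style log line into header fields and message content
    (space-separated layout assumed), then derive the event template by
    masking runtime-variable tokens (node ids, IP addresses)."""
    parts = line.split(maxsplit=len(HEADER_FIELDS))
    header = dict(zip(HEADER_FIELDS, parts))
    content = parts[len(HEADER_FIELDS)]
    template = re.sub(r"\d+(?:\.\d+){3}", "<*>", content)     # IP addresses
    template = re.sub(r"(?<=node-)\d+", "<*>", template)      # node numbers
    return header, template

line = ("102 node-70 node net.niff.down 1077804742 1 "
        "NIFF: node node-70 detected a failed network connection "
        "on network 5.5.226.0 via interface alt0")
header, template = parse_log_line(line)
print(header["Node"], "|", template)
```

The derived template matches the "NIFF: node node-<*> detected a failed network connection on network <*> via interface alt0" form used as the running example in the next subsection; real parsers learn such masks from data rather than from hand-written regexes.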
Many automated parsers are open-source and provide adequate accuracy, per the investigation rendered in Table 7. The ''Drain'' parser transforms logs into the most anticipated format. Also, the environment setup for the tool's execution is not complicated, and it is easy to configure even with a limited system configuration. All selected log entries were parsed by executing the ''Drain'' automated parser. Table 12 illustrates a summary of the results obtained on executing the Drain parser on different types of datasets. From the derived results stated in Table 12, we can observe that ''Drain'' provides acceptable accuracy for all types of infrastructures.

C. SEMANTIC ANALYSIS
Most ML and DL models for detection or prediction are not prepared to work directly on plain text data. As a result, feature extraction, i.e., a numerical representation of the event template, is obligatory. We performed semantic analysis by extracting the event template's semantic knowledge and transforming each event template into vectors. This vectorization helps prevent the influence of changes in the syntax of logs. Our semantic analysis experimentation was carried out with the aid of the BERT (Bidirectional Encoder Representations from Transformers) model. Fig. 14 exemplifies the process of semantic analysis. For semantic analysis, the event template (for example, NIFF: node node-< * > detected a failed network connection on network < * > via interface alt0) undergoes the following steps: pre-processing, tokenization, vectorization, and clustering.
• We begin by removing all non-character symbols from the event template, such as special symbols, punctuation marks, numbers, operators, etc. For example: ''NIFF node node detected a failed network via interface alt0''
• Tokenization is the technique of partitioning a string into a list of tokens. We performed tokenization by applying the ''BertTokenizer'' of the ''BERT pre-trained model''. For example: ''[ni, ##ff, node, node, detected, a, failed, network, via, interface, alt, 0]''
• Then the data in the pertinent format is forwarded to the pre-trained model for word embedding. Finally, vectors are acquired for each token of the event template.

This section conveys a panorama of noteworthy points from the systematic literature review on IT infrastructure monitoring. The analysis is targeted to furnish the answers to the research questions and satisfy the objectives as stated in Table 1.
• RQ1: How are log entries valuable for troubleshooting the failure?
Various IT infrastructure components generate different types of log data on the execution of events. The system logs are rich in information and provide all the details about the activities executed on the IT infrastructure components. System logs are considered the primary source of data as they record the software's runtime information. Thus, the logs recorded in IT infrastructure are a valuable resource for tracking issues in the system and handling them correctly. By processing the logs, one can obtain details about the timestamp, log level, log message, resources involved, etc. These data help identify and analyze the problem. Information that arises from processing massive log data supports monitoring the system's behavior and examining the root causes of issues. Also, historical logs are helpful to understand the behavior of the system and identify failure patterns. The analysis of logs (sequences of records) is advantageous for gathering details about the execution of activities and resources utilized. These data are required to troubleshoot the identified problems in the system. Considering the properties of system logs and the information generated by processing them, they add great worth in maintaining the health of IT infrastructure by troubleshooting failures.
• RQ3: What is the performance of different approaches for anomaly and failure detection in IT Infrastructure? In the systematic literature review, we have studied different approaches used for pre-processing, anomaly and failure detection, and prevention, as discussed in section III and represented in Figures 9, 10, and 11. After rigorous analysis of all these approaches, techniques, and results, we have listed a few popular and efficient methods for different operations. 1) Pre-processing: natural language processing (NLP) is preferred for pre-processing logs, as logs combine text and numbers and the log message plays a vital role in analyzing problems.
Thus, rather than statistical analysis, semantic analysis provides better results. The systematic literature review reveals that semantic scrutiny is preferable over statistical analysis to infer the relevant meaning from log data. Thus, many researchers have applied NLP techniques for pre-processing log data. Also, efficient feature extraction supports improving detection and prediction accuracy. 2) Anomaly or failure detection: classification using machine learning or deep learning techniques provides better accuracy than rule-based or method-based approaches. Also, the presentation of logs in the form of time series data is one of the ways researchers explore to claim better results. In addition, a handful of researchers have explored semi-supervised autoencoder-based learning approaches. The exploitation of an autoencoder is advantageous when large amounts of unlabeled log data are available.
• RQ4: What are the different existing techniques available to prevent and predict failure in IT Infrastructure monitoring? Failure prevention is possible in heterogeneous ways, such as maintaining the health of components, finding the root cause of failures, avoiding known causes, calculating the remaining useful time, monitoring behavior, predicting failure conditions, etc. Different predictions have been made in the existing literature, such as the failure propagation path, failure, fault, or event prediction, and the accurate time for maintenance. Additionally, systems are employed to forecast the maintenance period, the remaining useful life of hard disks, and stress in the network to maintain system health. Thus, primarily, failure prevention is possible by predicting the failure situation with sufficient lead time. For prediction using massive log data, sophisticated deep learning approaches impart improved performance. Many researchers have successfully used Recurrent Neural Networks, Convolutional Neural Networks, LSTM, Bi-LSTM, etc. Considering the massive amount of logs, researchers have recently preferred deep learning approaches to train models. With the help of advanced deep learning techniques, it is possible to design a system that can update dynamically and improve both the accuracy of failure prediction and the prediction lead time.
• RQ5: What are the different state-of-the-art tools and techniques used for log monitoring and analysis? In the SLR, we evaluated various automated tools for log preprocessing (Table 7) and log analysis (Table 8) based on the technique applied and four merit criteria: mode, availability, industry utility, and accuracy. Many parsing tools are available with adequate accuracy; thus, upcoming researchers can use any tool in accordance with their requirements instead of developing new ones. Many commercial tools are available in both open-source and paid modes to visualize the analysis of logs. A simplified and clear view of log analysis certainly helps in troubleshooting problems. However, current prediction tools and frameworks have many limitations, such as a lack of accuracy, reliance on certain assumptions, and insufficient lead time. Thus, there is a demand for robust prediction tools that can report failure states with adequate lead time.
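The output of the log-parsing tools surveyed in Table 7 can be pictured with a toy sketch: the variable parts of a raw message (IP addresses, paths, numbers) are masked so that many concrete messages collapse into a small set of event templates. Real parsers such as Drain cluster messages rather than relying only on regexes; the patterns and log lines below are illustrative assumptions.

```python
import re

# Order matters: mask IPs before bare numbers so octets are not
# swallowed by the generic <NUM> rule.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"/[^\s]+"), "<PATH>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message):
    """Replace variable fields with placeholder tokens, yielding a template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

logs = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 192.168.1.9 closed after 7 ms",
    "Read error on /var/log/app.log at offset 4096",
]
templates = sorted({to_template(line) for line in logs})
for t in templates:
    print(t)
# Three raw lines collapse into two templates:
#   Connection from <IP> closed after <NUM> ms
#   Read error on <PATH> at offset <NUM>
```

Downstream analysis then operates on template IDs and their counts per time window instead of raw strings, which is what makes the detection and prediction approaches in RQ3 and RQ4 tractable.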
• RQ6: What are the distinguished limitations of the existing literature? We conducted an extensive review of existing research and highlighted potential research gaps. Significant research has been done on different IT infrastructures using various types of log data, but the proposed solutions are system-specific. The limitations of the existing literature are discussed below: i. Existing models in the literature are system-specific: No available solution is applicable to all types of infrastructure. Different infrastructures provide various features, and depending on the components used, logs are generated in different formats; this is the main reason for the system-dependent solutions.
ii. The logs are considered under an assumption: A system's log is the primary source of information that delivers details about the execution of events and component utilization. Sometimes, logging instructions are not appropriately written, so the logs do not contain the required information. Research has nonetheless been conducted on such log data under the assumption that the generated log is complete and accurate.
iii. Preprocessing may result in loss of essential data: The preprocessing carried out on log data through abstraction, filtering, encoding, removal of unimportant data, etc., may lead to the loss of critical information. This loss may decrease the accuracy of anomaly or failure detection and prediction. Moreover, removing some data turns the logs into incomplete records.
iv. Only significant anomalies/failures can be detected: More focus is given to the detection of substantial anomalies or failures, i.e., those that occur frequently and cause significant losses. Therefore, existing models cannot detect every anomaly or failure in the selected infrastructure.
v. Current systems do not provide information for taking necessary actions: Available models can detect or identify a failure but do not provide information such as its cause, location or path, or the components involved, which would help in adopting necessary measures. Such additional details about a failure would support quick action, avoid failure propagation, and reduce system downtime.
vi. Insufficient prediction lead time: The forecasting lead time in existing anomaly or failure prediction systems is inadequate for taking remedial actions. Researchers are designing predictive systems that can give advance notice of failures, but prediction correctness declines as lead time increases.
vii. Systems do not update dynamically: Existing systems do not learn dynamically and therefore cannot detect or predict anomalies or failures that have never appeared in history or are unreported. However, it is unlikely that new irregularities or failure conditions will never occur in the future. viii. Concurrent anomalies/failures cannot be detected: Furthermore, existing models cannot detect or predict anomalies or failures that co-occur; hence, there is a need for a system that can handle such cases.
ix. Human intervention is required: Human intervention is essential for previously unobserved log sequences; consequently, no fully automatic system exists. System administrators still need to handle failure situations and take corrective actions. Human intervention introduces human error arising from limitations of knowledge, availability, and individual capability.
x. Root cause analysis exists only for past failures: Root cause analysis is available only for failures observed in the past. As a result, new failure conditions cannot be handled quickly, leading to increased downtime and the associated losses.
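Limitation (iii), information loss during preprocessing, can be made concrete with a toy example: dropping "unimportant" log levels may discard the one line that explains a later failure. The log lines below are invented for illustration.

```python
# A toy log: the DEBUG line holds the root-cause hint for the later ERROR.
log = [
    ("DEBUG", "retry queue for node-7 growing: 10k entries"),
    ("INFO",  "heartbeat ok"),
    ("ERROR", "node-7 unreachable"),
]
# A common preprocessing step: filter out DEBUG noise before analysis.
filtered = [(lvl, msg) for lvl, msg in log if lvl != "DEBUG"]

# The ERROR remains detectable, but the line pointing to the root cause
# (the growing retry queue) is gone from the filtered view.
print(any("retry queue" in msg for _, msg in log))       # → True
print(any("retry queue" in msg for _, msg in filtered))  # → False
```

This is why aggressive filtering can lower root-cause-analysis quality even when detection accuracy appears unaffected.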

VII. FUTURE DIRECTIONS
This coherent literature review was conducted on scholarly publications extracted from the Scopus, IEEE, ACM, and Web of Science databases until early 2021. The relevant publications are limited by the keywords applied while searching the databases. Manual screening was conducted on the available full-text articles to finalize the list of publications for detailed analysis; thus, it cannot be guaranteed that every article in the literature was studied thoroughly. In this systematic literature review, more focus was given to the tools and techniques used to handle failure conditions in IT infrastructures using log data. For these reasons, the evaluation may suffer from a threat of bias. The paper presents a proposed system architecture for IT infrastructure failure detection and prediction as a further solution beyond the existing literature. This systematic literature review focused on the following points: (1) IT infrastructures used for study in the literature, (2) log data used to detect anomaly and failure conditions, (3) activities needed to handle failure conditions, and (4) various publicly available automated tools for log parsing or log analysis.
The proposed methodology discussed in Section IV is under research and evaluation. From the proposed architecture, log parsing followed by semantic analysis for feature extraction is targeted for implementation, and preliminary results are discussed in the paper.
• Design generalized solutions to detect or predict failure in any IT infrastructure. The existing IT infrastructure failure detection and prediction solutions are system-dependent owing to differences in the nature of components, connections, utility, and log formats. Although significant research has been done in this area, the IT industry demands a generalized solution that applies to any IT infrastructure. Thus, there is a need to design a generalized system that can monitor any IT infrastructure.
• Generate or collect required logs with enhanced quality. All components in an IT infrastructure generate different types of logs, which are helpful for monitoring and maintaining the health of the system. However, these logs are not available in a standard format and also suffer from quality issues. To resolve this, a tool can be identified and configured to gather the required logs from all infrastructure components. Retaining the features common to the different log types further improves data quality for subsequent processing.
• Validate and improve failure prediction further with the help of already predicted events. Data on previously predicted incidents can be fed back to the system to validate and improve the prediction model. Thus, a confidence-based system can be established to validate and improve predictions. In addition, the solution can be further strengthened to handle failure conditions proactively.
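The feedback loop proposed above can be sketched as a running precision estimate: each prediction is later compared with what actually happened, and the resulting score acts as the system's confidence, gating whether proactive action is taken. The class name, counters, and threshold are illustrative assumptions, not an existing API.

```python
class ConfidenceTracker:
    """Track how often failure predictions were correct (running precision)."""

    def __init__(self, act_threshold=0.7):
        self.hits = 0
        self.total = 0
        self.act_threshold = act_threshold

    def feed_back(self, predicted_failure, failure_occurred):
        """Feed an observed outcome back for a past prediction."""
        if predicted_failure:
            self.total += 1
            self.hits += int(failure_occurred)

    def confidence(self):
        return self.hits / self.total if self.total else 0.0

    def should_act(self):
        """Only trigger proactive remediation once confidence is high enough."""
        return self.confidence() >= self.act_threshold

tracker = ConfidenceTracker()
# (predicted_failure, failure_actually_occurred) pairs from past incidents.
for predicted, occurred in [(True, True), (True, True), (True, False), (True, True)]:
    tracker.feed_back(predicted, occurred)
print(round(tracker.confidence(), 2), tracker.should_act())  # → 0.75 True
```

A production system would track confidence per failure type and decay old evidence, but the validate-then-act principle is the same.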
• Identify failure patterns based on historical data with a minimal set of log data. Different failure patterns can be identified to improve prediction with minimal data sets as the system predicts more and more failure conditions. In some instances, the researcher may not have debug-level data or logs at all levels; in such situations, a model trained on a minimal data set will be effective. This improvement can help predict and prevent failures using minimal logs.
• Create a monitoring console to show potential failures, performed actions, and different metrics.
User Interface (UI)-based monitoring consoles can be built to better monitor the components of an IT infrastructure. Such a console can visualize the system appropriately and provide a detailed view of the overall system.
It may show which actions and suggestions were provided and how many measures were taken manually or automatically, along with the system's confidence, predictions, and success rate. This console will also help reduce human intervention in IT infrastructure monitoring.
• The researcher can further design a remediation system with the help of Orchestrator Application Programming Interfaces (APIs).
An automated system can suggest corrective actions for the identified anomalies and failure conditions. One can explore how these suggestions can be executed to avoid failures. The stated recommendations may be orchestrated via APIs to run automatically; such workflows help prevent failure conditions proactively. This solution also allows the system to update dynamically according to runtime requirements.

VIII. CONCLUSION
The recent past has witnessed a flourishing in the utilization of IT infrastructures. Extensive importance has been given to system logs for establishing stable and reliable infrastructure. Many researchers have devoted immense effort to efficient and compelling log analysis to detect and control failure conditions and evade downtime. This systematic literature review mainly probes the five main stages in the IT infrastructure monitoring framework: availability of log data, log parsing, log analysis, anomaly or failure detection, and prevention techniques. Furthermore, we elaborated on the open-source as well as commercial automated tool kits used in IT infrastructure monitoring. Through rigorous analysis of the studied literature, we derived ten prominent research gaps. In accordance with the exploration of these recent advances, we suggested novel insights and listed various future directions. As a result of the systematic literature review, experimentation was performed with shortlisted parsing tools and feature extraction approaches. For the experiments, the authors utilized the datasets from various infrastructures as suggested in Table 6. Also, the ''Drain'' open-source parser was applied to convert unstructured logs to structured logs, giving acceptable accuracy for all infrastructures. A pre-trained BERT model was selected for semantic analysis based on the comparative study of feature extraction techniques in the available literature.
This systematic literature review and the performed experimentation enable forthcoming researchers to step into this encouraging and pragmatic field and empower them to fill their understanding gaps.

KETAN KOTECHA has worked as an Administrator with Parul University and Nirma University and has several achievements in these roles to his credit. He has more than 25 years of expertise and experience in cutting-edge research and projects in AI and deep learning. He has pioneered education technology and is a team member of the nationwide AI and deep learning skilling and research initiative Leadingindia.ai, sponsored by the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He currently heads the Symbiosis Centre for Applied Artificial Intelligence (SCAAI). He is considered a foremost expert in AI and aligned technologies and also brings vast and varied experience in administrative roles. He has published widely in several excellent peer-reviewed journals on topics ranging from education policies and teaching-learning practices to AI for all.
VOLUME 9, 2021