Towards data-driven reliability modeling for cyber-physical production systems

Reliability is one of the most important performance indicators in contemporary production facilities. Increasing the reliability of manufacturing systems prolongs their lifetimes and reduces maintenance and repair costs. Reliability modeling is a common technique for deriving reliability measures and illustrating relevant fault dependencies. There is a significant body of research focusing on hardware and software reliability models, such as Fault Trees, Petri Nets and Markov Chains. Up until now, development of reliability models has been a labor-intensive and expert-knowledge-driven process. To remedy that, and given the prevalence of data stemming from new and technologically advanced manufacturing systems, we propose that data generated in modern manufacturing lines be used to automate, or at least to support, the development of reliability models. In this paper, we elaborate on the details of our proposed framework for data-driven reliability assessment of cyber-physical production systems. Furthermore, we introduce a case study that will aid the development and testing of the proposed novel data-driven approach. © 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/) Peer-review under responsibility of the Conference Program Chairs.


Introduction
Reliability of a system is a property that indicates how correctly a function assigned to the system is carried out in a predefined time interval [1]. One of the main reasons engineers are becoming increasingly interested in reliability is the growing number and variety of failures [2], which stems from the ever-increasing complexity of systems. Since reliability depends on multiple stochastic events and processes in the systems of interest (such as faults and repairs), it cannot be assessed directly.
Estimation of reliability measures of a system usually takes place in two steps: First, reliability models, such as Fault Trees, Petri Nets, or Markov Chains, are created using domain knowledge of the system. Second, reliability metrics, such as mean time to failure (MTTF), mean time to repair (MTTR), and overall system availability, can be calculated. For the first step, domain experts are needed, which poses a bottleneck, particularly in increasingly complex production systems [3,4].
In [1], the authors identified the need to develop data-driven reliability approaches. These approaches could either complement existing reliability modeling techniques or they could be a combination of new and conventional reliability modeling techniques. Lazarova-Molnar and Mohamed [4] formalized this need and presented a general feedback loop on how data could be integrated in the reliability assessment processes.
The purpose of this paper is to build on the work of Lazarova-Molnar and Mohamed [4] and to present a framework that will aid the process towards data-driven reliability modeling of cyber-physical production systems (CPPS). Furthermore, we present a case study with the aim of providing an accurate example of a cyber-physical production system that can be used to develop and test novel reliability modeling techniques.
The paper is structured as follows: We begin by providing background knowledge on CPPS, traditional reliability modeling and reliability-related data collection in CPPS in Section 2. In Section 3, we present our framework of data-driven reliability modeling and explain possible steps to implement it. We discuss implications, challenges and opportunities of data-driven reliability modeling in Section 4, followed by a summary and outlook in Section 5.

Background
In this section, we present the necessary background for the remainder of the paper. For this, we introduce the topics of CPPS and reliability modeling in more detail. In addition, we briefly describe the data that is collected in modern manufacturing environments that can be used for reliability engineering purposes.

Cyber-Physical Production Systems
In cyber-physical systems (CPS), mechanical and software components are connected to one another via networks and information technology. CPS also enable interaction with humans that could be, for instance, maintenance staff or actual users of the system. The key feature of CPS is enabling the management and control of complex systems and infrastructures [5].
CPS applied in an industrial, manufacturing-centered context are commonly referred to as cyber-physical production systems (CPPS) [6]. CPPS consist of isolated but cooperative sub-systems and elements that are implemented at all levels of production [7]. They range from processes and machines up to complex production networks. Lee et al. [8] propose an architecture for CPPS that serves as a template for the deployment of such systems. The authors identify five levels of CPPS: Smart Connection, Data-to-Information Conversion, Cyber, Cognition and Configuration. Each of the levels comes with specific attributes that contribute to the abilities of the overall system. CPPS can enable decentralized, non-linear production of goods in large quantities. In contrast to conventional production paradigms, in which the subsequent production stage of a product is initialized by a factory worker, the product itself always knows which phase it is in, and it communicates autonomously with production resources. In this way, the product moves in flexible production cells and calls on other resources to support it. This enables improved personalization of the products and cost-effective production of different product types in a single factory.
Ding [9] recently concluded that even though the connection of cyber and physical assets makes it possible to achieve autonomous manufacturing, the increasing complexity and uncertainty of such systems call for methods like digital twins [10]. A digital twin is a simulation of a complex system that mirrors its life [11]. Simulation and virtualization, therefore, play major roles in transforming conventional production lines into CPPS [4]. Simulating production runs helps planners to estimate the effects of production changes in advance. In addition, virtualization of production lines supports continuous localization of parts and products on their way through production [10].

Reliability Modeling
Besides availability and maintainability, reliability is one of the most important characteristics of a system. Faults and failures are the main factors that negatively affect reliability, and they occur when the system or its components no longer perform their intended functions [12]. System reliability and other related measures, such as MTTF, mean time between failures (MTBF), and, for repairable systems, mean time to repair (MTTR), can be calculated using data on faults, failures, and their occurrence times.
Conceptualizing graphical reliability models before calculating the above-mentioned reliability measures is common practice in reliability engineering. This not only helps to better understand the underlying causes and effects of faults and failures, but also supports and improves maintenance decision-making. Reliability modeling, as part of reliability analysis, includes modeling of all critical elements that make up a system. With regard to CPPS, these elements can include hardware, software and human resources. Several modeling techniques have been developed for all three types of resources. There is a broad knowledge base when it comes to hardware and software reliability models, whereas reliability related to human interaction has not yet received sufficient attention, mostly due to the challenges of predicting human behavior [1].
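As a minimal sketch of how such measures could be derived from failure and repair data, the following Python example computes MTBF, MTTR and steady-state availability for a single repairable machine. The durations and the function names are hypothetical and serve only as an illustration; availability is approximated here by the common ratio MTBF / (MTBF + MTTR).

```python
from statistics import mean

# Hypothetical observations for one repairable machine, in hours.
uptimes = [120.0, 95.0, 143.0, 110.0]   # operating time before each failure
repair_times = [4.0, 6.0, 3.0, 5.0]     # time needed to restore operation


def mtbf(uptimes):
    """Mean operating time between failures."""
    return mean(uptimes)


def mttr(repair_times):
    """Mean time to repair."""
    return mean(repair_times)


def availability(mtbf_value, mttr_value):
    """Steady-state availability approximated as MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


if __name__ == "__main__":
    m_up = mtbf(uptimes)
    m_rep = mttr(repair_times)
    print(f"MTBF = {m_up:.1f} h, MTTR = {m_rep:.1f} h, "
          f"availability = {availability(m_up, m_rep):.4f}")
```

For a non-repairable component, the same averaging over observed times to failure would give an estimate of MTTF instead of MTBF.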
Several probability distribution functions are used to model reliability-relevant data. Some of the most important distribution functions are the exponential, Weibull, normal and lognormal distributions [12]. Systems with a constant failure rate can be represented using the exponential distribution. The Weibull distribution function can be used to represent systems or components that have a varying failure rate based on specific life cycles. Normal and lognormal distributions can be used for various purposes in reliability analysis, such as modeling service times and repair times.
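To illustrate the difference between a constant and a varying failure rate, the sketch below implements the standard reliability functions R(t) = exp(-λt) for the exponential distribution and R(t) = exp(-(t/η)^β) for the two-parameter Weibull distribution, together with the Weibull hazard rate. The parameter values used in the example are hypothetical.

```python
import math


def exponential_reliability(t, failure_rate):
    """R(t) for a constant failure rate (exponential distribution)."""
    return math.exp(-failure_rate * t)


def weibull_reliability(t, shape, scale):
    """R(t) for a two-parameter Weibull distribution (shape beta, scale eta)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Failure (hazard) rate of the Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    # Hypothetical parameters: scale of 500 h, three different shapes.
    for shape in (0.5, 1.0, 2.0):
        print(shape,
              round(weibull_reliability(100.0, shape, 500.0), 4),
              weibull_hazard(100.0, shape, 500.0))
```

With a shape parameter of 1 the Weibull distribution reduces to the exponential distribution with failure rate 1/η; a shape below 1 gives a decreasing failure rate (early failures) and a shape above 1 an increasing one (wear-out).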
Hardware reliability models can be differentiated as static and dynamic models. Static models only depend on combinations of individual events and do not take failure sequences into account, i.e. these models cannot support conclusions based on event sequences. Due to their intuitiveness, static models are especially used for the calculation of system failure probability of safety-critical systems [13]. Dynamic models, on the other hand, take the order in which failures happen into account leading to a more realistic representation of many systems, i.e. these models support conclusions based on event sequences. Popular static hardware reliability models are static fault trees (SFT), reliability block diagrams (RBD), and binary decision diagrams (BDD). Dynamic models comprise dynamic fault trees (DFT), Markov models (MM), and sequential binary decision diagrams (SBDD) [13,14]. Each of the mentioned hardware reliability modeling techniques focuses on different aspects of a system. Fault trees (FT) exhibit certain events that can ultimately contribute to the failure of a system [15]. RBD represent the components of a system and their specific reliability structure. MM are used to model the states and transition probabilities of a system.
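As an illustration of a static model, the following sketch evaluates the top-event probability of a small static fault tree. It assumes statistically independent basic events, and the tree structure and probabilities are hypothetical. Because only combinations of events are considered, the result does not depend on the order in which failures occur.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: the cell fails if the conveyor fails
# OR both redundant robots fail.
p_conveyor = 0.02
p_robot_a = 0.10
p_robot_b = 0.10

p_top = or_gate([p_conveyor, and_gate([p_robot_a, p_robot_b])])

if __name__ == "__main__":
    print(f"Top-event probability: {p_top:.4f}")
```

A dynamic fault tree would additionally need gates whose output depends on the sequence of input events, which this sketch does not cover.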
Software reliability models differ from their hardware counterparts as they mainly focus on errors during software development [1]. Sharma et al. [16] provide an extensive overview of common software reliability models. The authors classify the models according to the specific development phases, namely requirements, design, implementation, testing and validation.

Reliability-Related Data Collection in CPPS
Nowadays, production facilities generate several different types of data. Zschech [17] proposes a comprehensive taxonomy that focuses on recurring data analysis problems in maintenance analytics. The author distinguishes between Condition Monitoring Data, Event Data, Metadata and Business Data. Table 1 provides a brief overview of these four data types and exemplary attributes in relation to the three asset types of CPPS.
Condition Monitoring Data provide measurements about the specific assets of a cyber-physical production system. For physical and human assets, sensors are commonly used for this task. Software can be programmed to generate monitoring data. Event Data capture what happens on an asset and what actions are taken [17,18]. Lazarova-Molnar et al. [1] provide a comparison of common reliability-related events between software and hardware assets. They find that the interpretation of a failure as a complete standstill of a system is the same for both asset types. However, the meaning of errors and faults differs slightly between the two areas. Metadata refer to context details about the specific assets of CPPS. Business Data provide more general, contextual data about CPPS themselves. This may range from aggregated quality measures and performance indicators to information about business processes generated by, e.g., Enterprise Resource Planning (ERP) systems or Supply Chain Management (SCM) systems. In summary, all asset types of CPPS generate similar data types with distinct features. However, the features listed in Table 1 are only an excerpt of what occurs in practice.

A Framework for Data-driven Reliability Modeling for Cyber-Physical Production Systems
In this section, we present our vision of how data can be incorporated into the process of developing reliability models and how these models can then be used for plant simulation, analytics and ultimately decision making. Our focus is on hardware reliability models such as FT, RBD and MM due to their widespread use in manufacturing scenarios and easy-to-understand graphical representations. The general concepts that we present, however, can also be applied to software reliability and human factors-related reliability of CPPS. Figure 1 provides an initial framework for data-driven reliability modeling, inspired by the feedback loop proposed by Lazarova-Molnar and Mohamed [4].
First, the data that are relevant for design of reliability models must be defined, generated and collected. Data on relevant events (such as faults and failures) is the minimum requirement for the construction of most hardware reliability models. To support and refine the construction of reliability models, and in cases when event data is not available, condition monitoring data from physical assets needs to be collected and provided. In addition, metadata about the production assets and business data can be also included in the process of creating reliability models.
Data requirements vary with the types of reliability models that are being built. For example, RBD solely require information about up and down times of production machines, while FT need contextual information about specific events (e.g. event description and corresponding production machine). In either case, a structured event log is necessary for algorithms to derive reliability models. This event log is in a time series format and contains information on all events that occur during the operation of CPPS. If the infrastructure provides the necessary information backend, data on relevant events, including faults, failures, relevant operations, repairs and configurations can be directly integrated into the event log. In general, however, relevant events must first be detected and identified before they can be extracted from condition monitoring data. Machine learning provides powerful tools and methods for this event detection task [4].
In [19], for example, Akhavan-Rezai et al. propose a framework for fault detection using an Internet of Things (IoT) embedded cloud control architecture that processes different kinds of input data. In their work, a deep learning-based approach for fault detection and classification is applied that leads to promising results. Another relevant contribution by Yu et al. is presented in [20] describing a data-driven ecosystem that implements fault detection and diagnosis using machine learning algorithms.
As illustrated in Figure 1, it would be beneficial to include other data, such as metadata and business data, into the overall model extraction process. This data contains valuable information that would certainly enhance the accuracy and overall quality of generated reliability models (e.g. process models from ERP or SCM systems could help refine the reliability models). In the case that this additional data is available, methods will need to be developed to integrate metadata and business data into the event log or directly into the model generation algorithm. If the data can be represented in a time series format, it can be integrated into the event log, otherwise into the model generation algorithm.
Once the data has been structured and prepared according to the steps described above, the actual data-driven model generation can take place. Some studies already suggest methods for that: Linart et al. [21] propose an evolutionary algorithm that learns FT from observational data. Lazarova-Molnar et al. [15,22] extend this approach by considering multi-state systems and by learning FT from time series data. However, the time series used in this study were generated from fault trees rather than actual assets. Badreddine and Amor [23] propose a new method for constructing bow tie diagrams that take the dynamic behavior of real systems into account. The authors represent bow tie diagrams as Bayesian networks that can be learned from actual data. This method allows for dynamic implementation of protective and preventive maintenance barriers.
The automatically generated reliability models can be further verified and adjusted by domain experts. The resulting data-driven reliability models can then be used to run simulations, as well as to calculate the system's reliability measures. Finally, the generated reliability models and calculated measures will be presented in a dashboard to support the decision-making process for preventive maintenance planning and optimized production line configuration.

Case Study
In this section, we present a case study that will be used to validate the developed methods for data-driven reliability assessment. The University of Southern Denmark recently deployed an Industry 4.0 laboratory that houses a cyber-physical production system that can be used for scientific and commercial purposes [24,25]. In its current state, this reconfigurable cyber-physical production system produces a part of a quadcopter drone. In particular, rotors, motors and landing gears are assembled in an automated environment. Figure 2 shows the production process and the final product. Below, we briefly describe the production process of this part.
The raw material (i.e. rotors, motors and landing gear) is stored in a multi-functional order picking and automated storage system. As soon as a production run has started, a transport robot picks up the raw material and transfers it to the production line. The actual production line consists of two isolated production cells which are connected to one another via a magnetic conveyor system. The first cell consists of a robot and assembles the motors with the landing gear. After the motors are connected to the landing gear, the parts are transferred to the second cell. The second cell consists of two collaborative robots and attaches the rotors to the motors. After a production run, a transport robot picks up the assembled parts and transfers them to another storage system. In all presented steps, faults or failures can occur, which are going to be detected and labeled from the data that is generated by the system.
In the future, further work steps and production cells will be added to the production process. This includes cells that either combine the collaboration between humans and robots or are based solely on human interaction. An information backbone is currently being developed that will allow researchers to access data at all levels of production.
The proposed framework for data-driven reliability modeling (Section 3) is planned to be used for this case study in several ways: First, from the data collected by the various sensors, reliability models can be extracted for both the production line and its individual machines. Second, through simulation and analytical methods, reliability metrics can be calculated. Third, redundancy and buffers can be added as parameters to study their impact on the generated reliability metrics for the production system. Finally, these reliability metrics can be utilized for making purchasing decisions for parts and machines or for defining maintenance schedules and strategies.
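The effect of redundancy on a reliability metric can be sketched with a simple reliability block diagram calculation. The example below assumes independent components, treats the production line as a series structure of a storage system and two cells, and uses hypothetical availability values that are not measurements from the laboratory. Buffers are not modeled in this sketch.

```python
from math import prod


def series_availability(availabilities):
    """All blocks must be available (independent blocks in series)."""
    return prod(availabilities)


def parallel_availability(availability, copies):
    """At least one of several identical redundant blocks must be available."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical availabilities of the storage system and the two cells.
storage, cell_1, cell_2 = 0.99, 0.95, 0.90

baseline = series_availability([storage, cell_1, cell_2])
with_redundancy = series_availability(
    [storage, cell_1, parallel_availability(cell_2, copies=2)]
)

if __name__ == "__main__":
    print(f"baseline:        {baseline:.4f}")
    print(f"redundant cell:  {with_redundancy:.4f}")
```

Comparing the two values indicates how much a duplicated second cell would improve the availability of the line under these assumptions, which is the kind of comparison that could inform purchasing or configuration decisions.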

Implications
In this section, we outline the implications that arise from our vision presented in Section 3. We describe existing and expected challenges and opportunities in data-driven reliability modeling and assessment. The challenges include the lack of publicly available datasets, the risks of synthetic datasets produced by simulations, event detection based on condition monitoring data and the integration of various data sources. The opportunities include more realistic reliability models, better root cause analysis and improved maintenance decision making. In Table 2, we provide an overview of the specific implications for the different reliability modeling techniques. We focus on the challenges and opportunities that we expect when implementing data-driven solutions for RBD, FT, Petri nets, MM and discrete-event simulations.

Challenges
Many research areas aimed at data-driven methods have received extensive attention and accelerated as soon as benchmark-like datasets became publicly available. For example, the advent of the following datasets has fueled scientific development in their respective fields: NASA's Turbofan Engine Degradation Dataset for Prognostics and Health Management [26], ImageNet for Computer Vision [27] and KITTI for Autonomous Driving [28]. We are not aware of such benchmark datasets in the field of reliability engineering for smart manufacturing. This could be due to the fact that companies prefer not to share critical information about their production lines. Providing the scientific community with a benchmark dataset, covering data from a CPPS and the corresponding reliability models, could raise interest in data-driven reliability modeling and contribute to the development of innovative approaches.
Depending on the level of digitization and commitment to the concept of CPPS, the actual data collected by today's manufacturers also varies. The concepts behind Industry 4.0 and industrial IoT are yet to be adopted by the majority of companies. Due to the lack of public datasets capturing smart manufacturing processes, CPPS need to be simulated to generate synthetic data. However, simulating CPPS poses the risk that due to the abstraction of the system, important real factors that influence production could be omitted or neglected. Therefore, when data-driven reliability methodologies are developed based on synthetic data generated by simulations, they should be validated against real data in order to avoid misleading results.
When condition monitoring data, such as sensor data, is included in the data-driven reliability modeling process, events need to be detected. When using supervised machine learning methods for event detection, event data needs to be labeled. This is a tedious and time-consuming process. In addition, new, unseen events would not be recognized by the detector. Unsupervised machine learning methods could be used to detect anomalies instead of events. However, detected anomalies do not provide the semantic information that is required for many reliability modeling techniques. Therefore, although condition monitoring data provides useful insights, the actual use of this data for data-driven reliability modeling is difficult and requires further research.
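The limitation of unsupervised detection can be illustrated with a deliberately simple sketch. The following example flags sensor readings whose z-score exceeds a threshold; this is a basic statistical stand-in for an anomaly detector, not a method proposed in this paper, and the readings and threshold are hypothetical.

```python
from statistics import mean, stdev


def detect_anomalies(readings, threshold=3.0):
    """Return indices of readings whose z-score exceeds the threshold."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # constant signal, nothing deviates
    return [i for i, x in enumerate(readings)
            if abs(x - mu) / sigma > threshold]


# Hypothetical motor temperature readings with one outlier at index 10.
readings = [49.0, 51.0] * 5 + [90.0] + [49.0, 51.0] * 4 + [49.0]

if __name__ == "__main__":
    print(detect_anomalies(readings))
```

The detector only returns the positions of unusual readings. Whether an anomaly corresponds to a fault, a failure or a harmless operating change is not part of its output and would still have to be added before the result could enter an event log.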
Combining and integrating data from heterogeneous sources is another challenging task. In particular, metadata and business data are stored in many different, likely textual, formats. In contrast, condition monitoring data and event data are stored as time-series. Therefore, methods must be developed that integrate and merge data from all possible sources in order to use them for data-driven reliability modeling.

Opportunities
Data-driven reliability modeling offers several advantages over its knowledge-driven counterpart. Using a data-driven approach, CPPS could generate more realistic reliability models that better reflect the actual systems, which typically evolve and change their behaviors during their lifetimes. These data-driven reliability models can also be utilized to validate or calibrate knowledge-driven reliability models that are already in use.
Data-driven reliability modeling can provide manufacturers with a better insight and understanding of causes of errors and failures. In addition, data-driven approaches enable easier root cause analysis and finding of interdependent errors.
Not every existing reliability modeling technique requires the same level of detail in the data needed to generate it. The analysis of raw data and the selection of the data that are needed to generate certain models is a key task of data-driven reliability modeling. An indicator similar to the "Smart Readiness Indicator" for buildings [29] could be a useful addition to today's manufacturing systems: the better the indicator, the more useful data can be extracted from the system to support the data-driven reliability modeling process.
Finally, data-driven reliability modeling could support decision-makers in maintenance planning, machine purchasing or configuration of systems' layouts, and thus contribute to more efficient resource allocations [30]. Visualizing differences among possible configurations (e.g., in terms of redundancy or buffer sizes) through reliability models will lead to better strategies and decisions.

Summary and Outlook
We presented a novel approach and a framework for data-driven reliability modeling of cyber-physical production systems. Currently, reliability modeling is heavily guided by and dependent on experts. For many companies, this is a bottleneck to integrating Industry 4.0 concepts into their production lines. To address this bottleneck, we suggest using data to generate reliability models. As this is a fairly new area of research, several challenges need to be addressed, including the current lack of reliability-relevant manufacturing data, the combination of different data sources and formats, the development of event detection methods, and finally the development of efficient algorithms for retrieving reliability models from data. We presented our findings in the form of a framework to facilitate the data-driven reliability analysis of CPPS.