Introduction

The Internet of Things (IoT) is a megatrend that is shaping current social transformation. The number of networked devices and the resulting volume of data are constantly increasing worldwide. By 2025, there will be 75 billion networked devices [1] worldwide with a data volume of approximately 80 zettabytes [2]. The IoT has become a key technology for future-oriented scenarios. In line with Murphy’s law (“Anything that can go wrong will go wrong”), the reliability of computer systems is becoming more important. In particular, civil infrastructure systems have an extremely high societal relevance [3,4,5]. Services such as water or electricity supply are increasingly dependent on highly available and functional information technology. So-called smart meters can record real consumption data and forward them to higher-level instances, which use these data for the overall management of whole ecosystems. A fault, an impairment, or even a failure could have significant effects on public safety or other dramatic consequences [3,4,5]. The resulting dependence of modern society on complex information systems, especially for the above-mentioned infrastructures, is constantly growing [3,4,5].

This article is an extension of the paper "Reliability Estimation of a Smart Metering Architecture using a Monte Carlo Simulation" [6] published in IoTBDS 2022. In that paper, we approximated the reliability of the reference architecture for smart metering systems depicted in Fig. 1, and presented and interpreted the results. The present article focuses on improving the reliability of this reference architecture using specific optimization methods and validates the optimized smart meter architecture using the approximation method from the published paper. This approximation method aims to represent reality as accurately as possible; therefore, a Monte Carlo simulation based on RBD models is used. The outcome is a more reliable smart meter architecture and, as a result, a low-fault overall system.

Fig. 1: Smart meter architecture [7, 8]

Figure 1 shows a Europe-wide reference architecture for smart metering systems (gas, water, heat, or electricity) [7, 8] in a schematic structure. The central concept in these specifications is a separate unit, the smart meter gateway (SMGW), which acts as a central communication device. It provides the interfaces between the diverse domains and the smart metering system. According to the latest European Commission report [9] in 2020, the penetration rate of smart electricity meters is estimated at 43% (123 million) and that of smart gas meters at 27% (31 million). In 2030, there will be a penetration rate of 92% (226 million) for smart electricity meters. Furthermore, a penetration rate of 44% (51 million) is projected for smart gas meters in 2024. For comparison, about 53 million metering locations [10] will be equipped with smart metering systems in Germany. In this context, a metering location is a component that measures energy and includes all the technical equipment required to determine and transmit the metered values. This projected increase demonstrates that the Europe-wide and national rollout of smart meters will be driven by grid operators.

The general goals of this digital data collection are a more efficient and transparent energy distribution as well as the sustainable control of energy generation and the overall network utilization [11, 12]. To achieve the goals of this ecosystem, reliability is a fundamental objective of the design phase [13]. The present article focuses on exactly this subject: the optimization of the constructive reliability of smart metering architectures. In general, smart metering systems are more fault-prone than conventional metering devices because of the more complex interaction between hardware and software components [11].

Following the reliability-oriented V-model, which is derived from the ISO 26262 standard, this article uses a structured procedure for reliability optimization [14, 15]. The V-model serves as a structural approach and simplifies the explanation of the procedure. Figure 2 shows the V-model, whose left side defines the reliability-oriented design of a system. At this level, the requirements for the system are set, so that a low-fault smart metering architecture can be developed. There are various optimization methods for these requirements in the literature [16], and some elements of these are used in this article. The right side of the V-model describes the reliability analysis and verification. Analytical approaches are not generally possible for common reliability problems at the component or system level. Therefore, approximative reliability methods, such as Monte Carlo simulation techniques, have become very popular [17]. In comparison with other reliability methods, the Monte Carlo simulation has the advantage that it is both precise and easy to implement. Therefore, the Monte Carlo simulation is used for the presented reliability analysis of a smart metering architecture. Derived from the described challenges, we defined the following research question: “How can a low-fault smart metering architecture be achieved based on constructive methods?”

Fig. 2: Dependability-driven V-model [14, 15]

Following this introduction, section “Foundations” presents the theoretical foundations for reliability and its optimization. After that, we will demonstrate and explain the optimized smart meter architectures in section “Optimized Smart Meter Architecture”. The approach for reliability analysis is described in section “Approach for Reliability Analysis”. Based on this approach the reliability is approximated using Reliability Block Diagrams (RBD) and a Monte Carlo simulation in section “Evaluation of the Smart Meter Architectures”. Section “Conclusion and Future Work” concludes this article by summarizing the paper and outlining future work.

Foundations

This chapter presents the basics of the reliability domain in a logically structured order. First of all, the reliability theory and the systematic reliability optimization are described, since this is the basis to develop the reliability-optimized smart meter architectures. Subsequently, the reliability analysis approach is explained.

Foundations of Reliability

The research field of reliability was characterized by Jean-Claude Laprié. He established a standard framework and general terminology for reliable and fault-tolerant systems [18]. According to Bertsche [19] and Laprié [18], the reliability R(t) is defined as the probability that a system performs its functions satisfactorily and without any failures under given functional and environmental conditions over a specific time period. The literature classifies four methods for a reliable system design: fault prevention, fault tolerance, fault removal, and fault forecasting [18, 20]. This article focuses on fault prevention because, according to a previous literature review [16], the highest potential for reliability optimization lies in the design phase. Fault prevention refers to methods that are intended to prevent the occurrence of a fault condition or the introduction of faults into the entire system [18, 20]. To prove that the methods identified in the literature review [16] increase the reliability R(t) of a smart metering architecture, it is necessary to perform a validated reliability analysis. Reliability analysis is a methodical approach to determine the reliability of a system and the frequency of failures. This approach starts with the conception of the RBD model and ends with the statistical calculation of the overall reliability [21].

There are several techniques for quantitative and qualitative analysis of reliability in the literature [22]. The basis for our approach is a combination of quantitative methods. In this group, the most important techniques are the RBD [23], the network diagram method [24], Markov modeling [25] and the Monte Carlo simulation [26]. To calculate the reliability as precisely as possible, it is necessary to combine the above-mentioned techniques [22, 27]. The valuation approach we have chosen uses the RBD to model the entire system together with a Monte Carlo simulation to calculate the reliability per component, which is described in detail in section “Approach for Reliability Analysis”. An RBD is a schematic notation of the main components of the overall system, which represents their hierarchy and their interaction with each other for the function of the entire system [22, 28]. In the next step, the Monte Carlo simulation is used to simulate the reliability of each component. The Monte Carlo simulation is based on repeated random sampling and statistical analysis to estimate the reliability R(t) for complex system functions [29, 30]. This approximation technique helps to generate realistic values that we can use for the reliability analysis of the whole smart metering architecture.

Reliability Optimization

In the literature, the reliability R(t) is calculated with the following formula: \(R(t) = e^{-\lambda t}\)

Based on the conducted literature review [16], different methods for optimizing system reliability were identified. Here, it is evident that the failure rate λ(t) depends on the time t. In many cases, the level of the failure rate, and thus the reliability of a system component that is still intact, depends on the age that the component has already reached [31, 32]. The so-called bathtub curve in Fig. 3 describes the time history of the failure rate λ(t) for hardware components in three phases. The first phase of the bathtub curve is known as the period of early failure or "infant mortality" and is characterized by a decreasing failure rate λ(t). For example, a component may fail after a short period of operation because of construction or material defects [31,32,33]. If the component has passed a certain time without damage, the risk of failure per time unit decreases significantly. The middle area of the bathtub curve is almost flat, so that the failure rate λ(t) is constant. The risk of failure of a component stays unchanged over a longer time period. In this phase, the reasons for downtimes are primarily random faults. The third area of the bathtub curve is characterized by a significantly increasing failure rate λ(t). These late failures are usually the result of wear and fatigue processes [32, 33].

Fig. 3: Bathtub curve [31, 32]

As explained in section “Foundations of Reliability”, this article focuses on constructive reliability at the hardware level. Measures for optimizing the reliability of the system or for minimizing the risk of failures can be classified here into two fundamental categories—patterns for hardware reliability engineering and architecture patterns. Hardware reliability engineering is primarily characterized by quality control techniques that are used in the design and manufacturing of hardware [34]. In contrast, the architecture patterns define structured and strict design rules for the architectural structure or topology of the entire system. Each of these categories has a different impact on the three phases of the bathtub curve. Hardware reliability engineering patterns primarily have an impact on phase I and phase III, because early failure or wear and tear is due to hardware-specific conditions. However, the architecture patterns have a greater impact on the much longer service life (phase II), because faults in the architecture or system topology generally occur after the initial hardware or software faults at a higher level of maturity.

Design Science Research (DSR)

In the present article, we use the DSR approach, because a key feature of DSR is to solve societal and practical problems through the construction and evaluation of a scientific artifact [35]. Artifacts can be classified as concepts, models, methods, or realizations that contribute to a scientific result. According to Peffers [36], DSR consists of six major steps: problem identification and motivation, definition of the objectives for a solution, design and development, demonstration, evaluation, and communication (cf. Fig. 4). This article describes a practice-oriented problem that has to be solved by optimizing reliability. For this purpose, a low-fault smart meter architecture was defined in Fig. 5c and evaluated using reliability analysis in section “Evaluation of the Smart Meter Architectures”. This specific approach is detailed and implemented in the following chapter, where the result is also interpreted and communicated.

Fig. 4: Six major steps of DSR [36]

Fig. 5: Incremental optimization of a smart meter architecture

Optimized Smart Meter Architecture

In this chapter, popular reliability optimization methods [16] are used to design a low-fault smart meter architecture. The individual optimization stages for a smart meter architecture with a lower error rate will be shown and explained. A smart meter architecture basically consists of three layers, as shown in Fig. 5 [7, 8]. The data layer is equivalent to the Local Metrological Network (LMN) from Fig. 1, which includes all smart meters in a house or household. Located above is the gateway layer, in which the SMGW as a telecommunication device provides all information to the application layer. The Home Area Network (HAN) and the Wide Area Network (WAN) from Fig. 1 have been combined into the Application Layer, because in both domains, the meter information can be read and visualized or a remote configuration can be executed [37]. Due to these features, it can also be summarized as a meter data management system (MDMS) [38].

Starting with the reference architecture [7, 8] up to the reliability-optimized smart meter architecture, Fig. 5 shows the individual optimization stages. For this purpose, the smart meter architectures have already been transferred to a simplified model. In section “Evaluation of the Smart Meter Architectures”, an approximation for the reliability of the individual smart meter architectures from Fig. 5 is provided by a simulation and the step-by-step optimization of the overall system is verified.

Figure 5a shows the simplest approach and has no constructive reliability methods for fault avoidance. All components of the smart meter architecture are connected to each other via a single channel, so that a failure of the SMGW or an interruption of the communication channels between the smart meters and the SMGW or between the SMGW and the application affects the overall system immediately. To achieve the first optimization level of the architecture in Fig. 5b, physical hardware redundancy is used as the reliability method and applied to the gateway layer. Because semiconductor components are becoming smaller and cheaper, the concept of hardware redundancy has become popular in recent years [39, 40]. This hardware redundancy can compensate for the failure of an SMGW and also enables a multi-channel connection between the smart meters and the two SMGWs as well as between the SMGWs and the application, so that a disruption of one connection can be tolerated. The result is that interruptions in the communication channels do not affect the service provision of the system.

The last optimization stage of the smart meter architecture as shown in Fig. 5c consists of the previous hardware redundancy of the SMGW and additional reliability methods in the data layer for smart meters. In this case, the principle of clustering is applied and two of the smart meters are defined as root nodes that act as data concentrators and aggregate all information of the subordinate smart meters [41, 42]. To guarantee this, all smart meters are interconnected multiple times and form a kind of mesh network topology [43]. Each of the two root nodes has a redundant connection to the superordinate SMGWs which act as pure telecommunications devices and send the information to the application [7, 8, 40]. The application, at the top level, aggregates all the information. From here, the smart meters can be managed and the smart meter data can be visualized or analyzed and used for general purposes (cf. MDMS) [38].

Approach for Reliability Analysis

In this chapter, the methodology for reliability analysis is presented; it is applied in section “Evaluation of the Smart Meter Architectures”. The smart meter architectures in Fig. 5a–c are the basis for the reliability analysis. These smart meter architectures already represent a simplification of the entire system, so they can be used directly for the methodology. In our reliability analysis, we assume five smart meters, because in the future there will most likely be not only smart electricity meters in common use, but also smart water or gas meters. In the next step, it is necessary to transfer the simplified models from Fig. 5 into the logic of the RBD. An RBD configuration can consist of three basic component connections, which can be combined with each other: the series connection, the active redundancy, and the standby redundancy [44, 45]. Depending on the configuration, the failure of any component can cause the entire system to fail or restrict individual services of the entire system, so that the required system functions are not fulfilled [44, 45]. In Fig. 6, we have transferred the three smart meter architectures from Fig. 5 into the RBD logic. This formalization of the smart meter architectures allows the mutual dependencies of the hardware components to be evaluated with formulas.

Fig. 6: Simulation models based on reliability block diagram (RBD)

To calculate the quantitative reliability of the entire system, the failure probabilities of each component are required. To obtain a validated value for the failure probability of the smart meters and the SMGW, we scanned five publicly available databases and contacted ten organizations. The analysed databases contain aggregated raw data of smart meters over a specific period of time. We examined the following databases: opennetzero.org, osf.io, kaggle.com, data.gov.uk and ieee-dataport.org. The list of organizations comprised national institutions with regulatory supervision over the entire energy supply and large companies (> 500 employees) that offer hardware products, such as smart meters or SMGWs, as well as services for the digitalisation of the energy sector. This research revealed that there are currently no validated values for the failure probability, because long-term data from practice at the grid level are not yet available. Because validated values for the failure probabilities cannot be determined, we use a Monte Carlo simulation to calculate the values. The Monte Carlo simulation can be used to approximate the individual reliabilities of the hardware components, so that they correspond more closely to reality. The interaction of the hardware components, which is represented by the RBD in Fig. 6, determines the formula for calculating the reliability of the entire system.

Evaluation of the Smart Meter Architectures

This section presents the incremental approach for reliability analysis of the three smart meter architectures from Fig. 5. Reliability distributions of systems must be modeled with suitable mathematical functions, so that real-world behavior can be mapped. The bathtub curve from Fig. 3 can be approximately described as a combination of Weibull distributions [46]. Due to its versatility, the Weibull distribution has become one of the most commonly used distributions in reliability engineering. The Weibull distribution can be used to represent the decreasing, constant, and increasing failure rates λ(t) in technical systems. Therefore, it is able to represent different failure modes and all ranges of the bathtub curve [46]. Depending on the life phase (cf. Fig. 3) of a component, the Weibull distribution corresponds to an exponential distribution or a logarithmic normal distribution [47]. As described in section “Reliability Optimization”, the focus of this reliability analysis is on the phase of the useful life, which has a constant failure rate λ(t). In this case, the reliability distribution corresponds to an exponential distribution. The exponential distribution is often used in the development of electronic systems, because it is accurate for reliability analysis [46]. Therefore, the following formula for the reliability R(t) is obtained [48,49,50]:

$$Reliability\,\, R\left(t\right)= {e}^{-\mathrm{\lambda t}}.$$
(1)
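To illustrate how a single Weibull family can cover all three phases of the bathtub curve, the following minimal sketch evaluates the standard two-parameter Weibull failure rate λ(t) = (β/η)(t/η)^(β−1). The function name `weibull_failure_rate` and the shape and scale values are illustrative assumptions and are not taken from the analysis in this article. For a shape parameter of 1, the failure rate is constant and the reliability reduces to the exponential form of formula (1).

```python
def weibull_failure_rate(t: float, shape: float, scale: float) -> float:
    """Failure rate of a two-parameter Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Assumed scale parameter for illustration: 12 years expressed in hours.
SCALE_HOURS = 12 * 8760

for shape, phase in [(0.5, "early failures"), (1.0, "useful life"), (3.0, "wear-out")]:
    early = weibull_failure_rate(1_000.0, shape, SCALE_HOURS)
    late = weibull_failure_rate(100_000.0, shape, SCALE_HOURS)
    if late < early:
        trend = "decreasing"
    elif late == early:
        trend = "constant"
    else:
        trend = "increasing"
    print(f"shape={shape}: {phase}, failure rate is {trend}")
```

With a shape below 1 the rate falls over time, with a shape of exactly 1 it stays at 1/scale, and with a shape above 1 it rises, which mirrors the three phases of Fig. 3.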

For an overall reliability analysis, the system must be divided into individual components. These are shown in Figs. 5 and 6—the smart meter, the SMGW and the application. Because of the high technical similarities between the smart meter and the SMGW [11, 51], it is possible to use identical reliability analysis for these two components. For the application as a separate component, we assume that it is operated in a cloud environment. To obtain the reliability RApp of the application, the characteristic availability from the three major cloud providers (AWS, Azure, GCP) is used. The minimum availability is 99.90% [52, 53]. Therefore, for the reliability analysis of each smart meter architecture, there is a reliability RApp = 99.90%.

Reliability Simulation of Smart Meter and SMGW

In the following section, the reliability of the smart meter and SMGW is approximated. To calculate the reliability, the characteristic lifetime T and the failure probability G(t) of the components are required. Based on various European studies [51], a characteristic lifetime T of 12 years can be assumed. The failure probability G(t) can be assumed to be 2% on average [11, 54]. The following formulas show the calculation of the failure rate λ(t) and the lifetime t:

$$\begin{gathered} Failure\,\, Rate \,\,\lambda \left( {\text{t}} \right) = \frac{1}{T}, \hfill \\ Lifetime \,t = T \times G\left( t \right). \hfill \\ \end{gathered}$$
(2)

Using the e-function, which is an exponential function with Euler's number [55] as the base, the reliability R(t) for the two components can be calculated according to formula (1):

$$Hypothetical \,\,Reliability R\left(t\right)\approx 98.02\boldsymbol{\%}.$$
(3)
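The intermediate step behind formula (3) can be written out by inserting formula (2) into formula (1). With the stated values T = 12 years and G(t) = 2%, the characteristic lifetime cancels in the exponent:

$$\lambda t = \frac{1}{T} \times T \times G\left(t\right) = G\left(t\right) = 0.02, \qquad R\left(t\right) = {e}^{-0.02} \approx 0.9802 = 98.02\%.$$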

We use the principle of the Monte Carlo simulation to make this calculation more realistic by approximating the reliability of the smart meter and SMGW. The objective is to approximate a realistic value of the reliability R(t) based on the law of large numbers [56].

$$Lifetime\,\, t \left(x, \mu , \sigma \right)= \frac{1}{\sigma \sqrt{2\pi }}{e}^{-\frac{{\left(x-\mu \right)}^{2}}{2{\sigma }^{2}}},$$
(4)
$$x\in \left[0,1\right]; \mu =2{,}081.51 \,\,hours; \sigma =5{,}256 \,\,hours$$

This function [57] calculates the percentile for a specified mean and standard deviation. The parameters for the reliability calculation in formula (4) are described below:

  • For the parameter \(x\), which indicates the probability in the normal distribution, we create a random number between 0 and 1.

  • The parameter μ, which indicates the arithmetic mean of the distribution, is equivalent to the lifetime tµ derived from our previously calculated reliability R(t) from formula (3). It is calculated as follows:

    $$Lifetime\,\, {t}_{\upmu }=12\,\, years \times 1.98\mathrm{\%}\approx 2{,}081.51 \,\,hours.$$
    (5)

  • The parameter σ, which indicates the standard deviation of the distribution, is empirically assumed to be 5% [54] and inserted into formula (2) for the lifetime tσ:

    $$Lifetime\,\, {t}_{\sigma }=12 \,\,years \times 5\%=5{,}256 \,\,hours.$$
    (6)

Formula (4) is executed for 80,769 random samples to simulate the lifetime t(x, μ, σ). According to Liu, 80,769 random samples constitute an optimal number of trials for a Monte Carlo simulation [58]. Each simulated lifetime t(x, μ, σ) is inserted into formula (1), so that we can determine the reliability R(t) for 80,769 smart meters or smart meter gateways in a realistic manner. This statistical simulation of 80,769 samples was performed with an automated tool to reduce processing time and avoid human error. Finally, the average of the results is calculated to obtain an approximately realistic reliability of the two components:

$$Reliability\,\, R\left(t\right)\approx 96.93\%.$$
(7)

This reliability R(t) is the basis for the subsequent reliability analyses of the optimized smart meter architectures from Fig. 6. The simulation process behind the reliability R(t) from formula (7) is shown in Fig. 7. The diagram shows the smoothed reliability R(t) for the smart meter and the SMGW. Due to the large number of samples, only every 327th random sample, 247 measurements in total, was included on the x-axis of the graph. The lowest reliability value is just above 85%, so we have set the range of the y-axis between 0.85 and 1. The red trend line represents the moving average of the random samples. The diagram illustrates the strong variations of the calculated reliability R(t), which are caused by the simulated failures of the Monte Carlo simulation. We can see that the reliability R(t) of the smart meters and SMGW lies between approx. 85% and 100% because of the randomness built into the simulation (cf. formula (4)). In this case, a reliability R(t) of 100% means that the characteristic lifetime T of the component is reached or even exceeded.

Fig. 7: Smoothed calculation of reliability for the smart meter and SMGW
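The simulation procedure described in this section can be sketched in a few lines of code. The sketch below is not the automated tool used for this article; it is a minimal illustration under stated assumptions: the percentile function is taken to be the inverse cumulative normal distribution (here `NormalDist.inv_cdf` from the Python standard library), each simulated reliability is capped at 100%, and the function name `simulate_reliability` and the seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist

HOURS_PER_YEAR = 8760
T_HOURS = 12 * HOURS_PER_YEAR      # characteristic lifetime T in hours
MU_HOURS = 2081.51                 # mean lifetime from formula (5)
SIGMA_HOURS = 0.05 * T_HOURS       # standard deviation from formula (6)
N_SAMPLES = 80_769                 # number of trials stated in the text


def simulate_reliability(n_samples: int = N_SAMPLES, seed: int = 42) -> float:
    """Average reliability over n_samples simulated component lifetimes."""
    rng = random.Random(seed)      # assumed seed, only for reproducibility
    lifetime_dist = NormalDist(MU_HOURS, SIGMA_HOURS)
    total = 0.0
    for _ in range(n_samples):
        x = rng.random()
        while x == 0.0:            # inv_cdf is defined on the open interval (0, 1)
            x = rng.random()
        lifetime = lifetime_dist.inv_cdf(x)           # percentile for probability x
        reliability = math.exp(-lifetime / T_HOURS)   # formula (1) with failure rate 1/T
        total += min(reliability, 1.0)                # assumption: cap at 100 %
    return total / n_samples


print(f"Simulated reliability: {simulate_reliability():.2%}")
```

Under these assumptions the average lands close to, but not necessarily exactly on, the value in formula (7), since the result depends on the random number stream and on how samples above 100% are treated.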

Reliability Approximation of Smart Meter Architectures

In this section, the overall reliability of the three architectures from Fig. 5 is calculated using the three RBD models from Fig. 6. The objective is to demonstrate the increased reliability of the smart meter architectures by calculating the overall system reliability. The reliability R(t) of the smart meter and SMGW simulated in section “Reliability Simulation of Smart Meter and SMGW” is used for the following reliability analyses. For the RBD model of the reference architecture from Fig. 6a, the reliability Ra(t) is calculated. The reliability Rb(t) can be assigned to Fig. 6b and the reliability Rc(t) to Fig. 6c. These two approximations represent the optimization levels of the smart meter architectures.

Calculation of Reliability Ra(t)

In this section, the simulated reliability R(t) is merged with the defined RBD model from Fig. 6a to obtain the overall system reliability Ra(t). For the smart meters, we assumed a "k-out-of-n" dependency [44, 45]. The variable k corresponds to variable i in formula (8). The objective is that none of the five smart meters from the architecture in Fig. 5a fails. Therefore, the following formula is obtained for the reliability RSM1(t) of the smart meters based on the RBD model in Fig. 6a:

$$\begin{gathered} R_{SM1} \left( {i,n,R\left( t \right)} \right) = \mathop \sum \limits_{i}^{n} \left( {\begin{array}{*{20}c} n \\ i \\ \end{array} } \right)R\left( t \right)^{i} \left( {1 - R\left( t \right)} \right)^{n - i} ,\,\,i = 5;\,\,n = 5;\,\,R\left( t \right) = 96.93\% , \hfill \\ R_{SM1} \left( t \right) = (R \left( t \right)^{5} \times \left( {1 - R\left( t \right)^{0} } \right)) \times \left( {R \left( t \right)^{4} \times \left( {1 - R\left( t \right)^{1} } \right)} \right) \times \left( {R \left( t \right)^{3} x \left( {1 - R\left( t \right)^{2} } \right)} \right) \times \left( {R \left( t \right)^{2} x \left( {1 - R\left( t \right)^{3} } \right)} \right) \times \left( {R \left( t \right)^{1} \times \left( {1 - R\left( t \right)^{4} } \right)} \right) \times \left( {R \left( t \right)^{0} \times \left( {1 - R\left( t \right)^{5} } \right)} \right), \hfill \\ R_{SM1} \left( t \right) = 88.35\% . \hfill \\ \end{gathered}$$
(8)

The following applies to this:

  • Variable i is the minimum number of units required for successful service provision of the system,

  • Variable n is the total number of parallel connected units,

  • And R(t) is the simulated reliability of the smart meter from section “Reliability Simulation of Smart Meter and SMGW”.
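The general "k-out-of-n" expression in the first line of formula (8) can be evaluated with a short helper. The following sketch assumes independent and identical units; the function name `k_out_of_n_reliability` and the example values (at least 2 of 3 units, each with a reliability of 0.9) are hypothetical and chosen only for illustration.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent, identical units work."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Sum the binomial probabilities for j = k, ..., n working units.
    return sum(comb(n, j) * r**j * (1.0 - r) ** (n - j) for j in range(k, n + 1))


# Hypothetical example: at least 2 of 3 units, each with reliability 0.9.
print(k_out_of_n_reliability(2, 3, 0.9))
```

Raising k towards n makes the requirement stricter and lowers the result, whereas k = 1 corresponds to a purely parallel arrangement.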

The remaining components of the architecture in Fig. 6a are connected in series. Hence, it is a simple multiplication of the determined reliabilities to calculate the reliability Ra(t) of the entire system from the architecture in Fig. 6a.

$$\begin{gathered} R_{a} \left( t \right) = R_{SM1} \times R\left( t \right) \times R_{App} , \hfill \\ R_{a} \left( t \right) = 88.35\% \times 96.93\% \times 99.90\% , \hfill \\ Reliability \,\,R_{a} \left( t \right) \approx 85.55\% . \hfill \\ \end{gathered}$$
(9)
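The series multiplication in formula (9) can be reproduced with a minimal sketch. The helper name `series_reliability` and the constant names are illustrative assumptions; the numerical values are the ones stated in formulas (7) to (9).

```python
from math import prod


def series_reliability(*reliabilities: float) -> float:
    """Reliability of components connected in series (all must work)."""
    return prod(reliabilities)


R_SM1 = 0.8835    # smart meter block, result of formula (8)
R_UNIT = 0.9693   # simulated reliability of the SMGW, formula (7)
R_APP = 0.9990    # assumed cloud availability of the application

r_a = series_reliability(R_SM1, R_UNIT, R_APP)
print(f"R_a = {r_a:.2%}")
```

Because every factor is below 1, each additional component in a series chain can only lower the overall reliability.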

Calculation of Reliability Rb(t)

In this section, the reliability analysis of the overall system of the architecture in Fig. 6b is presented. This is a reliability-optimized smart meter architecture, as described in section “Optimized Smart Meter Architecture”. The redundancy of the SMGW is equivalent to a parallel RBD model. This means that there are two identical communication channels, which are independently connected to all five smart meters. The reliability for this type of system is calculated below [43, 45]:

$${R}_{Parallel}\left(t\right)=1-\prod_{i=1}^{N}\left(1-{R}_{i}\left(t\right)\right).$$
(10)

For the reliability analysis, the calculated reliability R(t) is integrated into formula (10). As for the architecture in Fig. 6a, we have a "k-out-of-n" dependency for the smart meters, where the variable k corresponds to variable i in formula (8). The objective is again that none of the five smart meters fails. Therefore, it is possible to take the result RSM1(t) from formula (8) and integrate it into formula (10). The result is the following formula for the reliability RParallel(t) of the architecture in Fig. 6b:

$$\begin{gathered} R_{Parallel} \left( t \right) = 1 - \left( {1 - R_{SM1} \left( t \right) \times R\left( t \right)} \right) \times \left( {1 - R_{SM1} \left( t \right) \times R\left( t \right)} \right), \hfill \\ R_{Parallel} \left( t \right) = 1 - \left( {1 - 88.35\% \times 96.93\% } \right) \times \left( {1 - 88.35\% \times 96.93\% } \right), \hfill \\ Reliability \,\,R_{Parallel} \left( t \right) = 97.94\% . \hfill \\ \end{gathered}$$
(11)

In the final step, the series-connected reliability RApp of the application is multiplied by the calculated reliability RParallel(t) to get the reliability Rb(t) of the entire system from Fig. 6b. The redundancy, at the cost of a higher system complexity, results in an optimized reliability Rb(t) of the entire system, which we calculate as follows:

$$\begin{gathered} R_{b} \left( t \right) = R_{Parallel} \left( t \right) \times R_{App} \left( t \right), \hfill \\ R_{b} \left( t \right) = 97.94\% \times 99.90\% , \hfill \\ Reliability \,\,R_{b} \left( t \right) = 97.84\% . \hfill \\ \end{gathered}$$
(12)
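The parallel structure of formula (10) and the stated results for RParallel(t) and Rb(t) can be checked with the following sketch. It assumes that each of the two redundant channels consists of the smart meter block in series with one SMGW, which is our reading of the RBD model in Fig. 6b; the helper name `parallel_reliability` is hypothetical.

```python
def parallel_reliability(*reliabilities: float) -> float:
    """Reliability of independent redundant branches (at least one must work)."""
    failure = 1.0
    for r in reliabilities:
        failure *= 1.0 - r       # all branches must fail for the block to fail
    return 1.0 - failure


R_SM1, R_UNIT, R_APP = 0.8835, 0.9693, 0.9990

# Assumption: one channel is the smart meter block in series with one SMGW.
branch = R_SM1 * R_UNIT
r_parallel = parallel_reliability(branch, branch)
r_b = r_parallel * R_APP         # application connected in series

print(f"R_parallel = {r_parallel:.2%}, R_b = {r_b:.2%}")
```

Under this assumption, the sketch reproduces the values of 97.94% and 97.84% given in the text.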

Calculation of Reliability Rc(t)

This section presents the reliability analysis for the architecture in Fig. 6c. This smart meter architecture is the last stage of our optimized architectures shown in Fig. 5. Based on the used reliability optimization methods, we expect the highest reliability Rc(t) for this smart meter architecture. To be able to calculate the reliability Rc(t) of the overall system, the reliability analysis for the "k-out-of-n" dependency of the smart meters is performed first. The variable k corresponds to variable i in present formula (13). For this case from Fig. 5c, the reliability RSM2(t) is calculated for just four smart meters, because the root node described above is connected separately before them. Thus, based on formula (8) [44, 45] and the corresponding RBD model from Fig. 6c, the following formula is obtained:

$$\begin{gathered} R_{SM2} \left( {i,n,R\left( t \right)} \right) = \mathop \sum \limits_{i}^{n} \left( {\begin{array}{*{20}c} n \\ i \\ \end{array} } \right)R\left( t \right)^{i} \left( {1 - R\left( t \right)} \right)^{n - i} ,\,\,i = 4;\,\,n = 4;\,\,R\left( t \right) = 96.93{\text{\% }}, \hfill \\ R_{SM2} \left( t \right) = (R \left( t \right)^{4} \times \left( {1 - R\left( t \right)^{0} } \right)) \times \left( {R \left( t \right)^{3} \times \left( {1 - R\left( t \right)^{1} } \right)} \right) \times \left( {R \left( t \right)^{2} \times \left( {1 - R\left( t \right)^{2} } \right)} \right) \times (R \left( t \right)^{1} \times \left( {1 - R\left( t \right)^{3} } \right)) \times (R \left( t \right)^{0} \times \left( {1 - R\left( t \right)^{4} } \right)) , \hfill \\ R_{SM2} \left( t \right) = 91.16{\text{\% }}. \hfill \\ \end{gathered}$$
(13)

At the next level, the root node is connected to the SMGW in a parallel-series arrangement. This creates two completely independent communication channels, which can still communicate with each other and exchange data if a fault occurs. The result is a significantly increased reliability Rc(t) of the overall system. The formula for a parallel-series RBD model is shown below:

$$R_{Parallel/Series}\left(t\right)=1-\prod_{i=1}^{M}\left(1-\prod_{j=1}^{N}R_{ij}\left(t\right)\right).$$
(14)

The next step is to combine the formulas to obtain the reliability Rz(t) as an intermediate result:

$$\begin{gathered} R_{z} \left( t \right) = 1 - \left( {\left( {1 - R\left( t \right)} \right) \times \left( {1 - R\left( t \right) \times R_{SM2} \left( t \right)} \right) \times \left( {1 - R\left( t \right)} \right) \times \left( {1 - R\left( t \right) \times R_{SM2} \left( t \right)} \right)} \right), \hfill \\ R_{z} \left( t \right) = 1 - \left( {\left( {1 - 96.93\% } \right) \times \left( {1 - 96.93\% \times 91.16\% } \right) \times \left( {1 - 96.93\% } \right) \times \left( {1 - 96.93\% \times 91.16\% } \right)} \right), \hfill \\ Reliability \,\,R_{z} \left( t \right) = 99.9987\% . \hfill \\ \end{gathered}$$
(15)
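The evaluation of formulas (14) and (15) can be checked numerically. The following is a minimal sketch, assuming independent blocks; the function name `parallel_series` and the branch layout are illustrative assumptions read off the terms of formula (15) (two branches with a single block R(t), two branches with R(t) in series with RSM2(t)), not taken from Fig. 6c itself.

```python
from math import prod


def parallel_series(branches):
    """Formula (14): 1 - prod_i (1 - prod_j R_ij) for independent blocks.

    `branches` is a list of parallel branches; each branch is a list of
    series-connected block reliabilities (fractions, not percent).
    """
    return 1.0 - prod(1.0 - prod(branch) for branch in branches)


R = 0.9693      # component reliability R(t) used in formula (15)
R_SM2 = 0.9116  # smart meter group reliability reported in formula (13)

# Assumed branch layout, mirroring the four factors of formula (15).
branches = [[R], [R, R_SM2], [R], [R, R_SM2]]

r_z = parallel_series(branches)
print(f"R_z(t) = {r_z:.4%}")  # prints "R_z(t) = 99.9987%"
```

Passing the branches as nested lists keeps the helper usable for any M and N in formula (14), including branches of different lengths.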

In the final step, the reliability RApp(t) of the series-connected application is multiplied by the calculated intermediate result Rz(t) to obtain the reliability Rc(t) of the entire system from Fig. 6c. The defined system topology results in an optimized reliability Rc(t) of the overall system, which we calculate as follows:

$$\begin{gathered} R_{c} \left( t \right) = R_{z} \left( t \right) \times R_{App} \left( t \right), \hfill \\ R_{c} \left( t \right) = 99.99\% \times 99.90\% , \hfill \\ Reliability\,R_{c} \,(t)\, = \,99.90\% . \hfill \\ \end{gathered}$$
(16)
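As a plausibility check of formula (16), the same structure can be sampled with a naive Monte Carlo experiment. This is a simplified sketch, not the simulation used in this article: it assumes independent blocks, treats the smart meter group as a single block with the reported reliability RSM2(t), and takes the four parallel branches from the terms of formula (15). The names `system_up` and `estimate_reliability`, the sample size, and the seed are hypothetical choices.

```python
import random

R = 0.9693      # component reliability R(t)
R_SM2 = 0.9116  # smart meter group, treated here as one block (assumption)
R_APP = 0.9990  # application reliability R_App(t)


def system_up(rng: random.Random) -> bool:
    """Sample one system state; True if the overall system works."""
    def up(p: float) -> bool:
        return rng.random() < p

    # Four independent parallel branches, as read from formula (15).
    branches = [
        up(R),
        up(R) and up(R_SM2),
        up(R),
        up(R) and up(R_SM2),
    ]
    # The application is connected in series with the parallel part.
    return any(branches) and up(R_APP)


def estimate_reliability(samples: int, seed: int = 0) -> float:
    """Fraction of sampled system states in which the system works."""
    rng = random.Random(seed)
    return sum(system_up(rng) for _ in range(samples)) / samples


estimate = estimate_reliability(200_000)
analytic = (1 - ((1 - R) * (1 - R * R_SM2)) ** 2) * R_APP
print(f"Monte Carlo: {estimate:.4%}, analytic: {analytic:.4%}")
```

With a few hundred thousand samples the estimate should agree with the analytic value to roughly three decimal places; a tighter agreement would need considerably more samples, because system failures are rare events here.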

Consolidation of the Results

Finally, the performed calculations are summarized and evaluated in Table 1. In the first evaluation, the hypothetical reliability R(t) was calculated. The reliability Ra(t) for the smart meter architecture in Fig. 5a is approximately 5% lower than the hypothetical reliability R(t). The result illustrates that about 15% of the smart meter architectures could fail within the characteristic lifetime T of 12 years. Based on extrapolations for Germany, almost 7.6 million of the 53 million [10] metering locations would be affected annually. To counteract this, the reliability of the entire system has to be increased. For this purpose, several reliability optimization methods [16] were applied to the smart meter architecture. For the smart meter architecture in Fig. 5c, the reliability Rc(t) was optimized to 99.90%. This corresponds to only 53,000 metering locations that would fail annually, if we assume the above-mentioned statistics for Germany. In the best case, the possible number of annual failures of smart meter architectures is thus reduced by over 99%. The results of the reliability calculations Ra(t), Rb(t) and Rc(t) are specified in the column "Result by RBD models".
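The extrapolation to metering locations is simple arithmetic and can be retraced as follows. The sketch only uses figures quoted in the paragraph above (53 million metering locations, Rc(t) = 99.90%, and the baseline of almost 7.6 million affected locations); the variable names are hypothetical.

```python
METERING_LOCATIONS = 53_000_000  # metering locations in Germany [10]
R_C = 0.9990                     # optimized reliability R_c(t), Fig. 5c
BASELINE_FAILURES = 7_600_000    # affected locations quoted for Fig. 5a

# Expected number of affected locations for the optimized architecture.
failures_c = METERING_LOCATIONS * (1 - R_C)

# Relative reduction compared with the baseline architecture.
reduction = 1 - failures_c / BASELINE_FAILURES

print(f"Affected locations for R_c: {failures_c:,.0f}")  # 53,000
print(f"Reduction vs. baseline: {reduction:.1%}")        # 99.3%
```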

Table 1 Summary of the results

Conclusion and Future Work

Among the three essential design dimensions for computing systems (cost, performance, reliability), reliability is the least understood [20] and offers the most scientific potential. Therefore, we focused on the reliability aspect in this article. Three architectural options for the reliability optimization of a smart meter architecture were presented, and a structured reliability analysis was performed to calculate the reliability of the entire system for each of the architectural approaches from Fig. 5. First, the theoretical basics and the optimized smart meter architectures were described. After this, the reliability of the entire system was calculated by a Monte Carlo simulation based on the defined RBD models. The results are realistic reliability values, which are summarized in Table 1. The performed approximation demonstrates the optimization potential in the design phase and the need for reliability optimization in the context of smart meter architectures. However, because improving reliability with a positive cost–benefit ratio is an important academic and industrial objective [59], complementing the findings of this article with a cost–benefit analysis appears to be a necessary next step for future work.

Besides the architecture, optimization methods at the software and hardware level also have an impact on the reliability of the overall system [18]. For example, more reliable hardware components could increase reliability through improved quality control during manufacturing or the use of higher-quality materials [34]. Moreover, it is also possible to implement reliability optimization at the network level [59]. For example, the cloud layer offers a high potential through the specific optimization of the cloud business performance in relation to the technical risks [60]. Furthermore, according to Laprie [18], other approaches, such as fault tolerance, can also be applied for reliability optimization. However, the optimization methods mentioned above were not considered in this article and offer a high potential for future scientific research. The objective is to obtain a cost-neutral and easy-to-implement concept that enables a low-fault system. All these methods can be grouped into the "Reliability by Design" approach. This approach offers the greatest potential, because early consideration of reliability through defined design criteria forms the basis for a robust and low-fault system.