Predicting IoT failures with Bayesian workflow

E-mail addresses: IoT networks are so voluminous that they cannot be treated as individual devices, but as populations. Main aim of the paper is to create a comprehensive method for predicting failures taking device variance into consideration. We propose using data fusion of happenstance observations (resets and failures) to better estimate device parameters. We propose using methods of population analysis in Bayesian statistics to estimate failure times investigating only a part of the population. For this purpose, we use multilevel hierarchical Bayesian model and provide it with post stratification. We propose model assumptions, construct the model and evaluate it, and perform computations using Hamiltonian Monte Carlo. This method is known as the Bayesian workflow. We have analyzed three different models showing that, in case of small device variance, it can be ignored, or at least compensated, while significant differences require hierarchical modeling. We also show that hierarchical model shows significant robustness to a small amount of data. We have shown attractiveness of Bayesian approach to modeling failures of IoT devices. Ability to diagnose and tune models, and assure their computational fidelity is a great advantage of Bayesian workflow. Highlights Abstract


Introduction
We can say, that Internet-of-Things (IoT) devices have become a ubiquitous aspect of current industry and life. The remote monitoring systems using Internet of Things include vehicle or assets monitoring, people/pets monitoring, fleet management, water and oil leakage, energy grid and power plant monitoring etc. Main applications of IoT devices include on-line monitoring and predictive maintenance of industrial equipment. Their use provides both process monitoring for constant quality assurance, and condition monitoring in order to prevent unplanned downtimes. The machines and devices diagnostics, their maintenance and methods of preventing failures with the use of IoT relate to various fields such as Industry 4.0, management of transport devices or medical devices.
Currently rising in relevance and especially interesting field are the IoT networks comprising multiples of same device. Example would be an industrial plant sensor network, which collects data regarding the operation of the entire installation at various points. Failures of sensor devices are not uncommon, especially if the devices are cheap and/or mass produced. This, of course, causes degradation of cov-erage, but also introduces a multitude of problems. Replacement of networked devices during plant operation might be difficult -for example, because of physical accessibility. Usual reasons for switching to wireless solutions are difficulties with providing necessary cable connections. IoT device failures introduce difficulties in energy management, while they are usually low power solutions, their number complicates matters. This is even more troublesome for battery operated devices.
As IoT networks are not failure free, we need to consider accidental failures, such as broken communication links or failures of cyber instances and cyber-attacks. Even sensors made in the same factory can have significant variance. Thus ensuring the robustness of IoT devices and predicting the failure of these devices is a very important issue. In this subject, we can divide research studies into the following categories: IoT devices anomaly detection, • attack detection in IoT, • self-healing IoT sensors, • cascade-of-failures across domains. • Because of advances in sensor monitoring technology, low-cost solutions, and high impact in various application domains, IoT systems are becoming increasingly popular, and anomaly detection has attracted considerable attention from the research community Fahim and Sillitti [9] present a literature review of anomaly detection in IoT systems. They identify several research gaps related to data collection, analysis of unbalanced large data sets, limitations of statistical methods in processing massive sensory data. They note that a few research articles deal with predicting anomalous behavior in realworld scenarios. In anomaly recognition, the focus of most papers is on automatic detection of anomalies in data. Key considerations are point, contextual, and collective anomalies. Rarely, they focus on relationships of data anomalies with sensor failures. Kaya et al. [15] present the prediction of sensor failure because of aging or environmental factors in the food industry. They base proper food quality assessment on the correct operation of the sensors used. The authors examine the loss of data, in order to tolerate for the failed sensor and keep the overall prediction accuracy acceptable. They propose a Single Plurality Voting System (SPVS) classification approach, applying K-Nearest Neighbor (kNN), Decision Tree, and Linear Discriminant Analysis (LDA) as the base classifiers.
Vangipuram et al. [23] present detection of missing values in IoT environment based on open sourced data. The authors presented a technique for imputation of missing data values, a classifier that is based on feature transformation to perform classification, and an imputation measure to compute the similarity between any two instances, also useful as a similarity measure. They tested the performance of the proposed classifier using imputed datasets got by applying Kmeans, F-Kmeans and the proposed imputation methods. Moghaddam and Muccini [20] present an overview of methods, techniques and architectures for Fault-tolerance in IoT. Failure can occur at all levels of the Internet of Things (IoT) application architecture, e.g., oversight of sensor nodes, failure of network links, and failure of components that process and store data. For this reason, fault-tolerance (FT) has become a key issue for IoT systems.
Anomaly detection and prediction is the first step to secure IoT systems. The goal of IoT security is to protect assets, ensure the privacy, confidentiality, availability, and integrity of communications in the IoT ecosystem. Therefore, IoT security has recently gained the attention of researchers in attack detection. Hasan et al. [14] analyzed performances of several machine learning models and compared them in attack and anomaly prediction on the IoT systems accurately. The authors consider several machine learning (ML) algorithms, such as Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Artificial Neural Network (ANN). Diro and Chilamkurti [8] present the application of deep learning to enable attack detection in the social web. The authors compare a deep and a shallow neural network using open source dataset and show that the deep model is more effective in detecting attacks than its shallow counterparts. Manimurugan et al. [18] also consider deep machine learning for anomaly and intrusion detection in IoT. They propose a model of the Deep Belief Network (DBN) algorithm.
Self-healing IoT sensors are the next research area. Lin et al. [16] propose a solution called SensorTalk for automatic detection of potential sensor failures and calibrate the aging sensors semi-automatically. They have proposed both analytical and simulation models for the detection delay selection so that when a potential failure occurs; it detects it early enough without causing too many false alarms. Cascading failures are another important topic. In the Internet of Things (IoT), various devices work together to collect data, communicate information to each other, and process it intelligently. Because of the interactions and dependencies between IoT devices, a malfunction of one of them can trigger a cascade of failures. Xing [27] systematically reviews cascading failure modeling and reliability analysis methodologies, as well as failure mitigation strategies. However, research on cascade failure prediction is lacking. Wang et al. [25] consider a sequence of failures due to randomly failed physical links (and simul-taneous failures of cyber nodes). Makhshari and Mesbah [17] present new research on the challenges of IoT. The authors provide the first systematic study of bugs and challenges that IoT developers face in practice through a large-scale empirical investigation. They collected 5,565 bug reports from 91 representative IoT project repositories and categorized a random sample of 323 based on the observed failures, root causes, and the locations of the faulty components.
Aleš et al. [1], who consider effectiveness of production and maintenance in Industry 4.0 installations with a focus on Industrial IoT, present an interesting approach. In their work, they propose original calculations of Nakajim's OEE (overall equipment effectiveness) indicator for the entire production and monitoring line. This approach, however, relies on relatively simple mathematical models.
As we can conclude, from the observed results, a lot of work was done in the anomaly detection, failure classification and security. The condition of entire networks is a different matter. Efficient diagnostics of large scale IoT device networks are very difficult. Because of device number, maintenance has to be worked in large scale and predictive solutions. Such networks, especially comprising similar devices, can be viewed as populations of sorts, and this is a potential for a new method. The hypothesis that we want to verify in this work is it is possible to use methods usually reserved for populations to analyze networks of IoT devices. We wanted to find out if such methods can provide useful results in the technical context, especially for reliability analysis.
Population behavior is normally a domain of social sciences and psychology. That is why statistical modeling thrives in these areas, with special mention to Bayesian statistics. Bayesian statistics is an approach allowing for inference in the presence of uncertainty (A. Gelman et al. [11]). And in case of analyzing mass-produced products, such as IoT devices, one has to consider production variability and systematic behavior, not unlike 'traditional' populations. This is a field where most successful are multilevel, hierarchical models (Bürkner [6]; Browne and Draper [5]; A. Gelman [12]). There are certain results of Bayesian statistics in the reliability field, as for example Andrzejczak and Bukowski [2] have considered Bayesian approach to model Weibull distribution of expected lifetime. Currently, there is a substantial field of methods allowing practical diagnostics and evaluation of such models (Vehtari, Gelman, and Gabry [24], Gabry et al. [10]; Mikkola et al. [19]). Justification for reliance on population analysis methods also comes from the fact that collection of data regarding IoT behavior is rarely a goal itself, but it's collected as happenstance. So instead of planned validation experiments, we can log information such as restarts, package losses, and times of failure.
The goal of this work is then to provide a conceptual design of a Bayesian method to estimate failure times (or equivalently remaining useful time). Main contributions are: creation of data fusion method, joining restart and failure time • data, to determine underlying state of system, analysis of methods for modeling pooled data to compensate for • small variability in the network and their shortcomings when variability increases, creation of a multilevel hierarchical model that captures group • variability and providing complete design analysis in the aspect of prior selection and computational faithfulness, post-stratification with a hierarchical model, using limited data • about a sample of population, to predict statistics of the entire group (with potentially different composition) and validating robustness with limited data.
We organized paper as follows. In the following section, we provide basics of Bayesian inference methods used for the rest of the paper. We describe the Bayesian basics, prior and posterior predictive checks, simulation based calibration and computational framework that was used. Then we follow with the description of considered case study. Next three sections cover pooling models for data with similar groups (no faults), multilevel model and post-stratification analysis. Paper ends with conclusions.

Materials and Methods
In this section, we will cover the basics of so called Bayesian workflow (Gelman et al. [13]), which is a collection of practices allowing principled and responsible modeling of phenomena. We cover basics and concepts we use in this paper. We also describe the computational setup and express the assumptions of data fusion of observational data for IoT maintenance.

Bayesian inference
We will now introduce the main ideas of Bayesian modeling, which encapsulate information and data as a highly structured collection of probability distributions. Obviously, such a defined model fully realizes the requirements of transparency and interpretability, as we can instantly recover information about not only feature influence but also their uncertainty.
Principled Bayesian modeling (otherwise known as Bayesian workflow) relies on creation of highly customized models that will capture phenomenon of interest. Usually actual process depends on some kind of latent system, and we can observe it through a measurement process. Main strength of Bayesian approach is the ability to jointly model this process and the latent system. This is a model of the actual data generating process. Our measurement process results in observations y  , that belong to an observation space, Y . On the other hand, models have their assumptions, or structures from the assumptions space  . Parameter identify structured models, representing model configurations. Our observational model on the Y ×  is then the likelihood: Likelihood together with a prior distribution π θ  ( ) gives a Bayesian joint distribution: This distribution is a model representing measurement process and the expertise about possible model configurations. Having actual observations we can incorporate them together into a posterior distribution: which is our main tool for inference.
Important aspect in practice becomes model utility verification. In particular, we need to address prior selection, analysis of prior predictive distribution and simulated based calibration.

Prior predictive checking
Main tool for prior distribution selection (in a way that allows encoding our expertise) is the prior predictive check. For that, we use the prior predictive distribution: This distribution allows as to average our data generating process over the prior. This is then a marginal distribution of possible measurements given our expertise. A reasonable prior distribution will pro-vide us the information about observational space expected by the model that it is consistent with observed data. In case of low dimensional problems observation of possible measurement distribution is enough. In multidimensional cases we need to verify distributions of some desired statistics of both simulated and real measurements.
Fortunately, we do not need to construct the distribution (4) explicitly integration might be difficult, impractical or simply not possible. We can, however, approximate it (or pushforward distributions of statistics) using joint distributions samples. Sampling: is equivalent to sampling from the joint distribution (2). In order to get samples from pushforwards we need to evaluate the statistics of interest on values of y  . We compute Actual prior predictive checks using either visualizations or interval coverages of appropriate marginals.

Posterior predictive checking
One of the main tools allowing to verify model validity is to verify how predictions generated by the model behave regarding data. Main tool for that is the posterior predictive distribution: Similarly to the prior predictive distribution, we do not integrate directly, but first sample the posterior distribution of parameters, and then use them to sample the predicted outputs. This is an extremely useful tool, as it allows to compare how the model fits the data and how well it captures the uncertainty.

Simulated-based calibration
Theoretical tools for model analysis are only worth as much as our ability to compute them. In case of Bayesian models we usually need to use Monte Carlo sampling in some variant (in this paper we use Hamiltonian Monte Carlo). HMC allows for analysis of certain types of problems, like non-identifiability, by analysis of divergences, however there are other issues that make exploration of the posterior (or joint distribution) problematic.
One theoretical tool that we can use for verification of HMC sampling validity is the self consistency of posterior. This property means, that posterior distributions fit using simulated data from prior predictive distribution recover the prior distribution when averaged. We can formulate it as [11] π θ π θ π θ θ This is of course needs to be considered in the context of HMC samples. Talts et al. [22] proposed to compare prior distribution and estimated average (7) using rank statistic. If those distributions are equivalent, the distribution of ranks should be uniform. This requires a following procedure. First, we sample parameters from priors. Next, we simulate outputs from corresponding prior predictive distribution, and use them to estimate parameters (in other words, to fit the posterior). We then use prior samples of parameters to compute ranks with respect to sampled parameters from posterior. Because of possibility of autocorrelation influencing rank statistic the samples we thin the samples (usually by a factor of 8). This is repeated many times (usually 1000) and histograms are computed and evaluated for uniformity. Deviation form uniformity indicates computational problems.

Multilevel regression with post-stratification (MRP)
This method (sometimes called "Mister P") is a statistical technique used for correcting model estimates for known differences between a sample population (the population of the data you have), and a target population (a population you would like to estimate for). There are multiple success stories of this method. For example, Wang et al. [26] using political surveys for Xbox gamers could construct very reliable predictions of U.S. presidential election results. This idea is popular in social sciences, but as we show here, we can also use it in technical applications.
The post-stratification is essentially using a fitted model for data of a population of different composition than the training dataset. Multilevel regression aspect is representing using Bayesian models for partial pooling of data from different groups within the population. This allows for capturing both similarities and differences between groups.

Computational setup
For Bayesian computation, we have used Hamiltonian Monte Carlo algorithm, for which the currently most advanced software is Stan (Carpenter et al. [7]). Hamiltonian Monte Carlo algorithm (see, for example, paper by Betancourt [3]) (also known as hybrid Monte Carlo), is a Markov chain Monte Carlo method used for sampling of probability distributions. The main idea behind HMC is to use Hamiltonian dynamics equations to simulate movement of points on the "surface" of probability distribution to construct proposal distribution for Metropolis-Hastings algorithm. Hamiltonian differential equations are automatically created by automatic differentiation of probability distribution and then solved with symplectic integrators (preserving volumes and differential forms, time reversible and limiting diffusion of solutions). The samples from HMC have smaller autocorrelation than, for example, random walk Metropolis-Hastings algorithm and HMC is more efficient in exploration of typical sets of probability distributions. Moreover, HMC has self diagnosis properties as divergences (numerical instabilities in solutions of differential equations) indicate either non-identifiability or problems with posterior geometry.
We performed independent computation on Apple Mac Mini with Apple M1 CPU and 16 GB RAM and on Apple MacBook Pro with Intel Core i9 CPU and 32 GB RAM. Computation results were identical on both platforms, with approximately 14% speed advantage for the former. All the codes are available in the repository listed at the end of the paper.

Case study -Diagnostic data fusion
In case of multi-element systems, starting with process industry, through complicated machinery to IoT networks, detailed diagnostics of each part is not possible. One cannot provide sensors for each part, that is why traditional methods of diagnostics are useless. Failure phenomena in all kinds of devices are hard to model, as they usually are tangent to fundamental behavior of systems. It can have multiple, subtle underlying causes. That is why we have two major problems to handle: lack of dedicated data, and no basis in first principle modeling.
That is why we need to consider observational data that can be available with little overhead, in case of IoT devices those are number of resets and total lifetime of device (referred in this paper as failure time). We will create a data fusion [21] of these two disparate sources, in order to provide information about underlying parameters of the model and to predict failure times of devices.
We give the following basic assumptions: There are underlying causes of increased number of restarts and • approaching failure. This relationship is approximately monotonous: high probability • of restart reduces expected failure time.
The approximate model of the underlying cause is a parameter • with an appropriate link function corresponding to the state of the system (its health). There is a possibility of a variance between batches of the same • (or functionally identical) devices, but with some general similarity. Distribution of restarts and failure time is Poisson and Gamma, • respectively. Certain number of restarts are normal.

•
Very short failure times are rather unlikely. • We have an observational data of a few devices, but dataset has • representation of this variability. We provide a case study covering an extended network of 1000 similar IoT devices. We have data from 200 devices that have failed; we have registered their failure time and number of resets that occurred. Devices came from four batches approximately evenly represented in the dataset. We artificially generated data, in a way fulfilling the assumptions. Data, together with generating code, are available in the repository. We present registered pairs of resets and failure times in the figure 1. Color of the points corresponds to batches from which they come. As we can observe, three batches express a similar behavior, while one (marked in the dataset as third) exhibits short time till failure and many resets.
In the following sections, we will cover three aspects of modeling: Modeling based only on the healthy data. This is a typical situ-1.
ation, where we have only accessed to normal operation of the system. We will try to cover such a situation with a simple model, pooling all the data together, and then we will analyze how model behaves with addition of a faulty batch.
Modeling entire dataset with a full hierarchical model. Because 2. of the increased model complication, we will cover the entire analysis aspect of prior verification and analysis of computational faithfulness. Post-stratification. We will then perform the post-stratification 3.
using the hierarchical model on the entire 1000 device network to predict the failure time of the given percentage of the entire network. We will also present the robustness analysis of the model. We will consider how the model behaves when, instead of an original sample of 200 devices, it would consider only its subset. We will verify how it would influence the post-stratification, especially its confidence interval.

Pooling of all data in the normal operation
The first case we want to consider is also a very common one. Often, system data collection is not a planned experiment, but a happenstance, or casual observation. And obviously the most operation occurs under normal conditions, so data about abnormal cases are scarce. That is why we focus on an example of such situation in the considered case. In the figure 2, we see that for three batches behavior of devices is relatively similar, with a bulk of points concentrated in one space.
Because of the observed similarity, we attempt to create a relatively simple model, still adhering to the assumptions. We, however, ignore the information about batches to create a model capturing entire population. This is not an uncommon approach, as not knowing about group variance it is natural to treat all object similar way.

Poisson model
We propose a three-parameter model with combined likelihood. This likelihood joins both information about resets and the lifetime.
We introduce two additional parameters (besides system health β): A reset baseline • λ R which is added to β and through exponential link function serves as a Poisson distribution's rate. A shape parameter • κ of Gamma distribution, for which (also via link function) β serves as a scale.
Link function is necessary to ensure the positivity of coefficients, while allowing efficient exploration of parameter space by Hamiltonian Monte Carlo.
We present graphical representation of this model in the figure 3, and the mathematical formulas for it are: Prior parameters were reasonable guesses to comply with assumptions and not over-saturate the exponential. We do not give here details of prior predictive checks because of space constraints. Code given in the repository allows such an analysis.
We present summaries of samples from the posterior distribution in the table 1. As we can observe, the parameters are relatively well concentrated with λ R having the smallest standard deviation of all (relative to the mean). Verifying posterior predictive samples in the figure 4 we see, that while for the simple dataset failure times are captured rather well, the predictions of resets have not represented the variance of the sample. This suggests that a more dispersed model can improve it.

Dispersed Poisson or Negative Binomial model
One of main limitations of Poisson models (e.g. Poisson regression) is that this distribution enforces mean and variance to be equal. An attractive solution for this requirement is the Negative Binomial distribution, in the so called location-overdispersion parametrization 1 : Seeing that Poisson µ ( ) has variance μ, negative binomial distribution has its variance increased by an additional µ φ 2 0 / > . This gives much more flexibility in modelling of distributions of integers. The only problem with this parametrization, is that only φ ∞ → returns to Poisson distribution. That is why it is more convenient to construct model using ψ φ = −1 , which actually covers the overdispersion, while ψ = 0 returns (computationally) the original Poisson. We will not provide details of the implementation, but the code is in the repository.
Using a negative binomial distribution, we formulate the model in the following form (graphical representation in the figure 5): Regarding prior on φ , we actually set uniform, bounded from below by 0 prior on ψ φ = −1 This essential gives a reciprocal of square for prior on φ , but it does not influence computation, and would only clutter the formulas. In the code, we represented it by setting a lower bound of 0 on the parameter.
We present summaries of samples from the posterior distribution in the table 2. All the parameters are much more uncertain than in the previous case, which actually is a desired quality. As we know, data came from different batches, and capturing variance between them requires more uncertain parameters. β parameter has even unknown sign (as 67% confidence interval has zero somewhere around the middle), but this is not as important because of link function. This uncertainty, however, impedes the predictive quality of failure time. As we can see in the figure 6, the ribbon plot of failure time predictions is much wider, and the actual median prediction is well below observed data. Ability to capture resets variance in the right image in figure 6 results in unreliable prediction for failure time in the left image. Most uncertainty is especially for short failure times.
As we can observe, both models pooling all the data together have their shortcomings. For completeness, we present how they predicted resets and failure times for individual batches. We can observe it in the figure 7. As we can see, predictions for the population are not representative of the individual groups. This puts into question how well both models would function in the full dataset.

Inference on full dataset
While simple models worked rather well evaluating pooled data for similar groups, introduction of the less similar group (batch, see Fig. 1) show their weakness more significantly. In order to save space, we will not provide here full results of sampling (they are available in the repository), but we just give some main takeaways: Sampling of both models was rather similar regarding parameter • concentration and HMC behavior. We observed that • λ R increased significantly for the Poisson model (mean of 5). While both models incorporated the same distribution for failure • time, its coefficients were completely different. For this case β   we can see that Poisson model is stuck between actual data and negative binomial is very vague. We are not presenting failure time predictions, nor individual • group predictions, as they are rather uninteresting. Complete set of figures with generating code is available in the repository.

. Posterior predictive distributions of Poisson (top two rows) and negative binomial (bottom two rows) show that neither model using pooled data can capture behavior within individual groups. Batch 3 is missing, as it contains the faulty data that we will introduce to the model in the next section
These issues show that construction of a model cannot ignore group behavior, especially if we expect groups to differ from one another. In case of IoT devices, changes in manufacturer, model or even production batch can introduce significant deviations. That is why we need to focus on more sophisticated models.

Focusing on individual batches -Multilevel hierarchical model
We propose to use multilevel hierarchical model. Batches will share the same distribution type, but with different parameters, coming from a common distribution with unknown hyper-parameters. We give proposed model structure in the figure 9 in both centered and non-centered parametrization. In our code, we only use the latter because of its computational efficiency. We avoided providing detailed equations, because they are more readable from the code or diagram than from equations.
Main justifications of the model are following: We assume general structure of Poisson-Gamma • model for each group.
We keep reset baseline • λ R and shape κ identical for each group.
Multilevel effect come from the individual 'health' • parameter β i for each group (batch) of devices.
Hierarchy is in the regularization enforced by the • assumption that all β i come from the same Normal distribution with hyper-parameters µ b and σ b . Hyper-parameters have respectively priors as standard normal, and half-normal with standard deviation of 2.
Non-centered parametrization is functionally • equivalent, but is computationally more efficient to consider group parameters from standard normal and scaling and shifting them by parameters.

Prior predictive checking
Because the model is much more complicated, we provided a thorough prior predictive checks illustrated in table 3 and with prior predictive distributions in the figure 10. The general idea is to sample parameter values from priors. So using priors we have simulated 1000 times parameters for 200 devices split into 4 batches (with the same composition of batches as in original data) and then using those parameters sampled from appropriate Poisson and Gamma distributions corresponding pairs of resets and failures. Having such samples, we have computed relevant statistics. In table 3, we can see that statistics of individual parameters are consistent with priors. We visualize simulated data in the figure 10. We have used the same bins as with histogram of original data and computed histograms for each sample (of 200 devices), then for each bin we have computed quantiles obtaining the visualized ribbon plots. Data simulated on priors is obviously zero inflated, but it does not contradict the original data, as it fits within the ribbon plots.

Simulation-based calibration
To ensure computational fidelity, we have performed full simulation-based calibration. We did this to avoid situation, when complicated model geometry provides difficulties in proper exploration of typical set by HMC in Stan, and also to avoid potential coding errors. Using a similar scheme as with prior predictive checks, we have simulated 1000 samples of 200 devices and their corresponding resets and failure times. For each of those samples, we have performed MCMC sampling of the multilevel model using them as data. Then, for each of sampled parameters, we have computed the rank statistic, of how many samples were smaller than the one of the model which generated the data. Distributions of those ranks should be uniform. We verified that by thinning samples 8 times and plotting histograms. We present histograms of ranks in the figure 11. All are reasonably uniform.

Posterior predictive (retrodictive) distributions
We complete our analysis with posterior predictive checks, which work as an actual validation of the model. As previously, we can verify that using ribbon plots and comparing them to the data. In this way we can consider them retrodictive, as we want to obtain the same data, that we used for the inference.
In the figure 12, we can see that, especially in the aspect of resets, we have obtained very good results. While median of predicted failure times is consistent with data, previous models also were not bad in this aspect. However, in case of resets we can see, that model well identified groups of devices. This is also visible in plots for the individual groups (figure 13) as the model coefficients well represented each batch.
Finally, we can observe the parameter fits in table 4. We can see the relatively good concentration of parameter values for each group i β . What is perhaps interesting, such concentration is visible in 'centered' parameters and not in the 'non-centered' that are the part of the actual model. Hyper-parameters have significant uncertainty, allowing all the individual batches to be captured. As we can also see, batch no. 3, the faulty one, has significantly different coefficient.

Post-stratification and robustness analysis
Having a well-fitted model it was possible to perform post-stratification. We have tested it on a sample of 1000 devices with only their batch number recorded. Our goal was to analyze following statistics: after what time 20% devices will fail, • after what time 50% devices will fail. • The idea of post-stratification computationally uses the samples from the MCMC inference. In our case, we had 4000 samples of parameters for each group and the global parameters. For each of those, we sampled 1000 failure times (with the desired composition of devices) and computed the relevant statistic. In this way, we obtained 4000 samples of each statistic, which we could use for our maintenance predictions. We visualize these results in the figure 14. As we can see, their distribution is approximately normal and value concentrations are relatively good. As we can see in the table 5, confidence intervals are tight with standard deviations on the level of 4% of mean.

Robustness to amount of data
As a final verification of the methodology, we have analyzed the model consistency if scarcer data are available. Considered 200 samples is a relatively big part of post-stratified 1000, and not I realistic.  Recent works (Broderick, Giordano, and Meager 2021) show that even 1% of data can significantly influence the predictions. While authors provide more detailed metrics for certain simpler problems, we have dropped randomly bigger chunks of data to see how it influences the post-stratification results. We have decided to randomly drop 10%, 75% and even 90% of data to see how post-stratification results will behave. As a metric, we have considered means and standard deviations of statistics of interest. We summarize results in the figure 15 and the table 6. As we can see, the method handles issue remarkably well. Means are very consistent and what really changes are the confidence intervals, which also are not blowing up, as they are at most doubled (better than the expected 10 ). What is also interest-ing (but because of constraints kept to the repository) is that even with 20 data points, the model still captures individual properties of each group rather well (and estimate 8 parameters).

Discussion and conclusions
In this paper, we have performed a conceptual analysis of using Bayesian models for modeling IoT device behavior using Bayesian analysis. Our intention was to show that such models allow capturing complicated group behavior (for different batches of devices). We have shown, that for simpler cases, with limited variation between units we can ignore it all together with obtaining estimates with some   uncertainty. When the device population has large variation (for example from bad quality control at the manufacturer or subcontractor) it justifies more complicated models. We have shown that using hierarchical multilevel structure, we can efficiently estimate population lifetime. Proposed model is robust with respect to data reduction, provided groups are representation. These results show great potential for using Bayesian method in technical applications, where they are underrepresented, especially for reliability analysis.
There are obvious limitation to our approach. Because we have used simulated data, it is obviously not applicable directly. However, using real data would obfuscate main ideas, as it would necessitate more complex model. And such complications are a potential for method extension.
Natural ideas of extension would model not only within device type but also between type variance of devices. As long as we can capture certain similarities, concepts still hold. This conceptually is just adding a level to the model. We have just used examples of post-stratification metrics. Those could be much more advanced utility functions. For example, one could consider expected coverage, energy losses, costs of maintenance, etc. Such complications of the model are possible and we will investigate them in the future.
The observation about robustness is an assuring one, as 20 datapoints allowed to estimate 8 parameters of the model with accuracy high enough to obtain useful results. Reason of such gain might be the data fusion effect. Using two measurements from each device adds additional regularization. This compensated for an increased number of parameters.
There is a large application potential for the depth of Bayesian methods in the technical sciences. This work is one example and we will continue it in further research. Monitoring and maintaining IoT networks is one area where their use can be very natural.