Reducing Large Adaptation Spaces in Self-Adaptive Systems Using Machine Learning

Modern software systems often have to cope with uncertain operating conditions, such as changing workloads or fluctuating interference in a wireless network. To ensure that these systems meet their goals, these uncertainties have to be mitigated. One approach to realize this is self-adaptation, which equips a system with a feedback loop. The feedback loop implements four core functions -- monitor, analyze, plan, and execute -- that share knowledge in the form of runtime models. For systems with a large number of adaptation options, i.e., large adaptation spaces, deciding which option to select for adaptation may be time-consuming or even infeasible within the available time window to make an adaptation decision. This is particularly the case when rigorous analysis techniques are used to select adaptation options, such as formal verification at runtime, which is widely adopted. One technique to deal with the analysis of a large number of adaptation options is reducing the adaptation space using machine learning. The state of the art has shown the effectiveness of this technique; yet, a systematic solution that is able to handle different types of goals is lacking. In this paper, we present ML2ASR+, short for Machine Learning to Adaptation Space Reduction Plus. Central to ML2ASR+ is a configurable machine learning pipeline that supports effective analysis of large adaptation spaces for threshold, optimization, and setpoint goals. We evaluate ML2ASR+ for two applications with different sizes of adaptation spaces: an Internet-of-Things application and a service-based system. The results demonstrate that ML2ASR+ can be applied to deal with different types of goals and is able to reduce the adaptation space, and hence the time to make adaptation decisions, by over 90%, with negligible effect on the realization of the adaptation goals.


Introduction
Engineering modern software systems is complex. One of the important factors underlying this complexity is the dynamic and complex environment in which systems need to operate, requiring the systems to deal with uncertain conditions that are often difficult to predict before the systems are in operation [32]. These uncertainties may jeopardize the system goals. For example, network interference can affect the availability of the system if not properly dealt with.
To mitigate such uncertainties, self-adaptation has become prevalent in modern software systems [14,56,63]. Self-adaptation enhances a software system with a feedback loop mechanism that monitors the system and its environment, resolves the uncertainties, and adapts the system to maintain its goals, or degrades gracefully if necessary. Hence, self-adaptive systems consider system goals as first-class runtime entities; we refer to these goals as adaptation goals. Adaptation goals commonly refer to quality properties of the system [67].
In this paper, we apply architecture-based adaptation [48,25,40,69,43], where the feedback loop implements four functions: Monitor-Analyze-Plan-Execute (MAPE in short) [39]. The MAPE functions are centered around Knowledge that typically include various forms of runtime models [4], such as architectural models of the managed system and environment, goal models, and parameterized quality models that allow predicting qualities of different system configurations. We focus on uncertainties that can be represented as parameters of runtime models, e.g., stochastic automata or Markov models. The values of the uncertainty parameters are updated by the monitor function that monitors the system and its environment. We consider three types of adaptation goals: threshold goals that require a system to keep a system property above/below a given threshold, optimization goals that require a system to minimize or maximize a system property, and setpoint goals that require a system to keep a system property at a given value or as close as possible to it. An example of a threshold goal for a client-server system is to keep the failure rate of service invocations below a given threshold, an example of an optimization goal is to minimize the cost of operation, and an example of a setpoint goal is to keep the response time of service invocations at a required level.
Our particular focus is on the analysis function of the MAPE loop that (1) determines whether the current configuration complies with the adaptation goals, and if this is not the case, (2) predicts the qualities of alternative configurations. An alternative configuration is a configuration that can be reached from the current configuration by applying one or more adaptation actions. We refer to the alternative configurations as adaptation options, and the set of all adaptation options as the adaptation space. A common technique used to analyze the adaptation space is formal modeling and verification. Formal models represent the system and its environment from the angle of one or more quality properties. These quality models are parameterized. One set of parameters allows the instantiation of the models for a particular configuration of the system. Another set of parameters represents uncertainties that are instantiated based on the actual conditions of the system. During analysis, the parameters of the quality models are instantiated. Commonly used analysis techniques are model checking [7,9] and runtime simulation [36,66,68]. Based on the analysis results, a decision can then be made to adapt the system compliant with the quality goals. Recently, there is an increasing use of machine learning techniques to support the adaptation functions [31].
For systems with a limited number of adaptation options, i.e., small adaptation spaces, the analysis can be done fairly quickly ensuring that adaptation decisions are made within the available time frame to handle the dynamics of the system properly. However, for larger and more complex self-adaptive systems, the time required for analysis may dramatically increase and formal assessment of the whole adaptation space may not be feasible in such situations.
Different techniques have been proposed to deal with the problem of analyzing large adaptation spaces. E.g., Cheng et al. [13] applied search-based software engineering techniques to generate and analyze models of dynamically adaptive systems in order to deal with uncertainties both at development time and runtime. Our particular focus in this paper is on a conceptually different technique that relies on machine learning to support the reduction of adaptation spaces, see e.g., [51,62,19]. While promising, current approaches do not provide a systematic solution with first-class support for reducing large adaptation spaces during operation that is able to handle different types of goals.
This paper contributes ML2ASR+, short for "Machine Learning to Adaptation Space Reduction Plus", a novel approach for reducing large adaptation spaces. ML2ASR+ relies on classic supervised machine learning techniques, particularly classification and regression. We evaluate ML2ASR+ on two self-adaptive systems in distinct domains with varying sizes of adaptation spaces. We compare the approach with a reference approach that exhaustively analyzes the whole adaptation space, and with DLASeR, a state-of-the-art learning-based approach that we developed in previous work [62], which exploits deep neural networks to achieve adaptation space reduction. In addition, we perform a sanity check where we compare ML2ASR+ with an approach that randomly selects a subset of adaptation options in each adaptation cycle.
The remainder of this paper is structured as follows. Section 2 presents the state of the art and pinpoints the problem we tackle in this paper. In Section 3, we explain the model we use for self-adaptation in this paper, we elaborate on the different types of adaptation goals, and we introduce a running example. Section 4 then describes the core contribution: ML2ASR+, with its runtime architecture and workflow. Section 5 explains the metrics that we use for evaluating ML2ASR+. In Section 6, we evaluate ML2ASR+ for two application domains. Section 7 elaborates on the results, presents insights, and discusses threats to validity. Finally, we wrap up and conclude in Section 8.

State of the Art and Problem Description
We have divided the state of the art into three main areas of research. For each area, we summarize a number of representative efforts and we conclude with the open problems in the area. From this analysis, we pinpoint the research problem we tackle in this paper.

Machine Learning to Support the Analysis of Large Adaptation Spaces
We start with approaches that apply machine learning to deal with the analysis of large adaptation spaces. The FUSION framework learns the impact of adaptation decisions on the system's goals [21]. The approach utilizes M5 decision trees to learn the utility functions that are associated with the qualities of the system. The results show a significant improvement in analysis. FUSION targets the feature selection space, focusing on proactive latency-aware adaptations relying on a separate model for each utility. Chen et al. [12] study feature selection and show that different learning algorithms perform significantly differently depending on the types of quality of service attributes considered and the way they fluctuate. The work is centered on an adaptive multi-learners technique that dynamically selects the best learning algorithms at runtime. The focus of this work is also on features instead of adaptation options. Metzger et al. [44] apply online learning to explore the adaptation space of self-adaptive systems using feature models with an emphasis on the adaptation and evolution of adaptation rules.
Jamshidi et al. [37] present an approach that learns a set of Pareto optimal configurations offline that are then used at runtime to generate adaptation plans. The approach reduces adaptation spaces, while the system can still apply model checking to quantitatively reason about adaptation decisions. Camara et al. [10] use reinforcement learning to select an adaptation pattern relying on two long short-term memory (LSTM) deep learning models. The focus is on the use of runtime quantitative verification, with support for threshold goals. Thallium exploits a combination of automated formal modeling techniques to significantly reduce the number of states that need to be considered with each adaptation decision [60]. Thallium addresses the adaptation state explosion by applying utility bounds analysis. Diallo et al. [19] present a framework consisting of a MAPE-K feedback loop with an explainable AI module to tackle the issue of reducing adaptation spaces. Their framework leverages convolutional neural networks to efficiently reduce adaptation spaces, alongside using explainable AI to build trust in the system.
In our initial work [51], we applied classification and regression to reduce large adaptation spaces. That work considered only threshold goals. In [62], we investigated the use of deep learning to reduce the adaptation space of self-adaptive systems. That work focused on handling threshold and optimization goals only.
Open problems. Several approaches that apply machine learning to enhance the runtime analysis of self-adaptive systems look at a coarse-grained level of system features rather than a fine-grained level of adaptation options. The approaches that look at the reduction of large adaptation spaces propose solutions that inherently mix the reduction of the adaptation space with the way analysis is performed, while other approaches (including our own earlier work) only consider specific types of adaptation goals. In conclusion: existing approaches in this area do not provide explicit support for adaptation space reduction, or they cover only specific types of adaptation goals.

Reinforcement Learning to Support Decision-making in Self-Adaptation
We now look at reinforcement learning techniques used to support decision-making in self-adaptation. Porter et al. [50] study the dynamic composition of software elements using a reinforcement learning algorithm, covering the analysis and planning stages in the self-adaptation process. The approach reduces the adaptation space to a single option, hence integrating adaptation space reduction and decision-making. Idziak et al. [33] study different machine learning algorithms to deal with the so-called virtual machine placement problem. These algorithms similarly take over the analysis and planning stages of the self-adaptation process. Liu et al. [42] use a reinforcement learning algorithm to improve resource efficiency in autonomous electrified vehicles. Similarly to the previous two works, the approach reduces the adaptation space to a single option that is used for decision-making. Bu et al. [6] and Metzger et al. [45] propose strategies to explore the adaptation options in reinforcement learning algorithms.
Open problems. While relying on different learning techniques compared to the approaches discussed above, the approaches proposed in this area also inherently integrate the reduction of adaptation spaces with the decision-making to select the best adaptation options for the goals at hand. An advantage of relying on reinforcement learning to realize this integration is that it does not require a (formal) model of the system, which may be a benefit if creating such a model is problematic. In conclusion: the proposed approaches do not support a separation of concerns between an explicit and tune-able reduction of adaptation spaces and the decision-making of selecting the best option.

Efficient Analysis in Self-Adaptive Systems
A number of approaches have been proposed to enhance the efficiency of analysis in self-adaptive systems. Filieri et al. [23] propose an approach to generate a static set of expressions from a reliability model with a set of requirements. These expressions enable more efficient analysis at runtime. That approach targets formal models based on PCTL (Probabilistic Computation Tree Logic). Calinescu et al. [8] combine compositional verification with model checking to effectively adapt large-scale systems. The authors employ assume-guarantee reasoning to reduce the cost of analyzing system properties, compared to infeasible exhaustive model checking approaches. Gerasimou et al. [27] explore caching, lookahead, and nearly-optimal reconfiguration techniques to optimize the response time and overhead of Runtime Quantitative Verification to enhance scalability.
Ghahremani et al. [30] look at ways of reducing the cost of realizing self-adaptation in self-healing systems by combining utility-driven approaches with rule-based adaptation. Moreno et al. [46] present an approach for proactive latency-aware adaptation that relies on stochastic dynamic programming to enable more efficient decision-making. Experimental results show that this approach is close to an order of magnitude faster than runtime probabilistic model checking to make adaptation decisions, while preserving the same effectiveness.
El-Kassabi et al. [20] use a deep neural network to support proactive system adaptation by providing predictions of cloud resource usage. The predictions enable the suggestion of adaptation decisions to anticipate future quality of service violations. Di Sanzo et al. [18] equip a client-server application with a framework that provides proactive management of the application. The framework exploits a multitude of machine learning methods such as linear regression and support vector machines to build and use failure prediction models at runtime. The predictions are then used to proactively adapt the system before failures take place. Ghahremani et al. [29] evaluate machine learning algorithms for the prediction of system utility in adaptive systems, without relying on detailed system information.
Open problems. The approaches proposed in this area can be structured into three groups. A first group focuses on improving the verification process. These approaches do not deal with the problem of adaptation space reduction but can be combined with an approach for adaptation space reduction. A second group focuses on alternative solutions to enhance the efficiency of decision-making in self-adaptive systems. Yet, as with other related approaches discussed above, these approaches inherently integrate an implicit reduction of adaptation spaces with the decision-making to select an adaptation option. A third group applies machine learning techniques to make predictions of qualities and other properties to support the decision-making process. These solutions are complementary to the problem of adaptation space reduction. In conclusion: two groups of related efforts do not solve the problem of adaptation space reduction, but can be combined with approaches to reduce the adaptation space in order to enhance the efficiency of analysis; another group of related efforts does not separate the reduction of adaptation spaces from decision-making.

Research Problem
The analysis of the related work highlights the need for systematic approaches that provide explicit first-class support for adaptation space reduction while covering different types of goals. To that end, we formulate the following research question that we tackle in this work: How can machine learning be used to reduce large adaptation spaces of self-adaptive systems with different types of adaptation goals to perform more efficient analysis without compromising the goals?
To answer the research question, we propose ML2ASR+, a novel approach for adaptation space reduction. Leveraging classification and regression, ML2ASR+ offers a modular approach for the efficient reduction of adaptation spaces for self-adaptive systems with threshold, optimization, and setpoint goals. We translate the research question into six requirements for ML2ASR+ that serve as drivers for devising the solution and evaluating it.
The first four requirements -- reusability, automatic operation at runtime, modularity of adaptation goals, and granularity of adaptation space reduction -- are of a qualitative nature. The last two requirements -- negligible utility penalty and efficiency -- are of a quantitative nature.
Reusability. As a first requirement, the solution should be reusable, i.e., the solution should offer distinct functionalities and modules that can be instantiated and applied across application domains. We evaluate this requirement by demonstrating that the proposed solution can be applied to applications in two different domains.
Automatic Operation at Runtime. As a second desirable requirement, we want the solution to operate at runtime without human involvement. We evaluate this requirement by demonstrating that the proposed solution fully automatically reduces adaptation spaces at runtime for different application domains.
Modularity of Adaptation Goals. As a third requirement, we want the solution to be able to handle different types of adaptation goals. The approach should be able to handle individual types of adaptation goals, as well as combinations of different types of goals in one system. We evaluate this requirement by demonstrating that the proposed solution can be applied to instances of the same applications with different types and combinations of adaptation goals.
Granularity of Adaptation Space Reduction. As a fourth requirement, we want our solution to have the option to specify the granularity of adaptation space reduction, i.e., the degree to which the solution reduces the adaptation space. Granularity applies to optimization and/or setpoint goals, enabling the solution to determine which adaptation options to include based on well-defined criteria. E.g., for a setting with a setpoint goal, we may require the solution to find all the adaptation options within a given window around the setpoint value. Differentiating the granularity offers flexibility when the available adaptation time differs under different conditions. We evaluate this requirement by demonstrating that the proposed solution can be applied for different levels of granularity of adaptation space reduction.
Negligible Utility Penalty. As a fifth requirement, we desire that the solution reduces the adaptation space with little or no penalty on the quality properties that are the subject of adaptation compared to an ideal solution where no adaptation space reduction is applied. Utility denotes here the effect on the quality properties due to the adaptation decisions made by using learning. We evaluate this requirement by comparing the differences in mean values of the relevant quality properties over time with and without learning. Depending on the type of goal (elaborated in Section 3) we either compare the satisfaction of the goal or compare the difference of the quality tied to that specific goal. We provide a concrete metric for the evaluation of utility penalty in Section 5.
Efficiency. As a sixth and final requirement, the solution should be efficient, i.e., the adaptation space should be reduced such that the analysis can be performed within the time window available to make adaptation decisions. We evaluate this requirement by demonstrating that the proposed solution effectively reduces the adaptation space in two different domains. We use three metrics to judge the efficiency of the adaptation space reduction: (1) the Average Adaptation Space Reduction (AASR in short) that compares the average number of adaptation options selected by learning over multiple adaptation cycles with the average of the total number of adaptation options over these adaptation cycles; (2) the total percentage of time saved as a result of the space reduction; and (3) the percentage of overhead in time of ML2ASR+ due to learning compared to the verification time required to verify the reduced adaptation space. We provide concrete metrics for the evaluation of efficiency in Section 5.
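To make the first efficiency metric concrete, the AASR computation can be sketched in a few lines of Python. The percentage formulation below is an illustrative assumption; the precise metric is defined in Section 5:

```python
def average_adaptation_space_reduction(selected_sizes, total_sizes):
    """AASR: compare the average number of adaptation options selected by
    learning over multiple adaptation cycles with the average of the total
    number of options over those cycles, expressed as a percentage reduction.
    `selected_sizes[i]` and `total_sizes[i]` refer to the same cycle i."""
    avg_selected = sum(selected_sizes) / len(selected_sizes)
    avg_total = sum(total_sizes) / len(total_sizes)
    return (1.0 - avg_selected / avg_total) * 100.0
```

For instance, selecting on average 15 out of 150 options per cycle corresponds to an AASR of 90%.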

Model of Self-Adaptive System with Adaptation Goals and Running Example
We briefly outline the model for self-adaptation that we use in this research. Then, we give a simple example of a self-adaptive system that we use as a running case in the paper. Finally, we explain different types of adaptation goals. Figure 1 shows a high-level model of a self-adaptive system that we follow in this paper, leveraging on [63]. A self-adaptive system consists of two parts: a managed system and a managing system. The managed system can be any regular software-intensive system or a part of it. Hence, the managed system may refer to an entire system, a subsystem, one or more components, just a particular feature of a larger system, infrastructure or resources used by a system, etc. Other terms used to refer to self-adaptive systems include auto-tuned system, elastic system, controlled system, context-controlled system, and autonomic system, among others.

Model of Self-Adaptive System
The managed system takes input from an environment and produces output to the environment. While the managed system can be controlled, the elements in the environment cannot. The environment may include other software systems, hardware, communication networks, users, the operating context, and so forth. The managing system acts upon the managed system with a particular purpose, for instance to improve its performance when operating conditions change or to deal with errors that may suddenly appear. The purpose is provided by stakeholders in the form of adaptation goals. The managing system monitors the managed system and/or its environment during operation, resolves uncertainties, and based on the adaptation goals adapts the managed system or parts of it when needed. A common approach to realize the managing system is by means of combining four basic functions: Monitor-Analyze-Plan-Execute that share a common Knowledge, which is often referred to as MAPE-K or MAPE in short [39]. The types of adaptations of the managed system may range from adjusting parameter settings up to architectural reconfigurations. Hence, the managed system needs to provide the necessary support to be monitorable and adaptable.
Operators or other stakeholders may support the managing system in its tasks, but this is optional.

Running Example
We introduce a small example of a self-adaptive system that we use as a running case to illustrate ML2ASR+. The managed system in this example is a simple service-based system that handles service requests of clients through the invocation of a series of services. These services are deployed on two machines named M1 and M2. The system has to deal with two uncertainties: fluctuations in the network bandwidth and in the workload of the two machines. These uncertainties affect three qualities that form the adaptation goals: the failure rate, response time, and the cost of service requests. To make sure that the qualities comply with the service level agreements of users, the system is equipped with a managing system. This managing system realizes a feedback loop that monitors the service system and has the ability to adapt the distribution of service requests between M1 and M2.
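The adaptation space of the running example can be made concrete with a short sketch. The representation of an option as a pair of request percentages and the 10% step size are assumptions for illustration only; the paper does not fix these details:

```python
def adaptation_space(step=10):
    """Enumerate the adaptation options of the running example: each option
    fixes the percentage of service requests routed to machine M1; the
    remainder goes to M2. The step size determines the size of the space."""
    return [(p, 100 - p) for p in range(0, 101, step)]
```

With a step of 10 this yields 11 options, from (0, 100) to (100, 0); a finer step enlarges the adaptation space accordingly.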

Adaptation Goals
One of the requirements and a distinct feature of ML2ASR+ is support for different types of adaptation goals. We start by describing the types of adaptation goals ML2ASR+ supports one by one, and then we explain how multiple types of goals can be combined. We illustrate the goals with the running example.

Threshold Goals
The first type of adaptation goal that we cover in this work is a threshold goal. A threshold goal imposes a restriction on one of the system's quality properties in the form of a threshold value that should not be exceeded. Exceeded in this context can refer to either an upper bound value that the quality property should not cross, or a lower bound value that acts as a minimum requirement for the quality property. We define the satisfaction of a threshold goal T ∈ T with a threshold value x for any value of the quality property q (or quality value in short) as follows: T<x(q) = True : q < x; False : otherwise (and analogously T>x(q) for lower bounds). A threshold goal allows a self-adaptive system to categorize adaptation options in two distinct classes: compliant with the threshold goal or in violation of the threshold goal. Hence, threshold goals form a perfect candidate for classification of adaptation options, a classic supervised machine learning technique.
Example: Applied to the running example, we can define a threshold goal for the system to keep the failure rate below a given time percentage, say 10%, as shown in Figure 2a. In this case, the set of quality values q that satisfy the threshold goal, i.e., T<10%(q) = True, correspond to classification class 1, while the set of quality values q that do not satisfy the threshold goal, i.e., T<10%(q) = False, correspond to classification class 0.
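The mapping from a threshold goal to the two classification classes can be sketched as follows. The function names are illustrative and not part of ML2ASR+'s API; in the actual approach a trained classifier predicts the class labels rather than computing them from known quality values:

```python
def threshold_goal(x, upper=True):
    """Build the predicate T for a threshold goal: True iff the quality
    value satisfies the bound (q < x for an upper bound, q > x for a
    lower bound)."""
    return (lambda q: q < x) if upper else (lambda q: q > x)

def label_options(quality_values, goal):
    """Map each adaptation option's quality value to a classification
    class: 1 = compliant with the goal, 0 = in violation of it."""
    return [1 if goal(q) else 0 for q in quality_values]
```

Applied to the failure-rate example: with the goal T<10%, predicted failure rates [0.05, 0.12, 0.08] yield the labels [1, 0, 1].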

Optimization Goals
The second type of adaptation goal that we cover is an optimization goal. As the name suggests, an optimization goal aims to optimize a quality property of the system, which means either maximizing or minimizing the value of the quality property. We define the satisfaction of an optimization goal O ∈ O for any quality value q as follows: Omin(q) = True : q = min({q1, q2, ..., qn}); False : otherwise, and Omax(q) = True : q = max({q1, q2, ..., qn}); False : otherwise, with {q1, q2, ..., qn} the set of quality values of all the adaptation options in the adaptation space. The natural approach to predict the values of the quality property and judge the adaptation options accordingly is regression. After the prediction, different strategies can be applied to perform the analysis. One strategy is selecting and analyzing a subset of adaptation options that were predicted to have quality values close to optimal. This way a small margin of error for the applied regression technique is taken into account. Another strategy is to restrict the analysis to only the adaptation option with the optimal predicted value of the quality property. This strategy can be applied if the time for computing the adaptation option is critical; yet, it may miss the best adaptation option since the predictions with regression are subject to errors. The chosen strategy reflects the requirement of granularity of adaptation space reduction, see Section 2.4.
Example: For the running example, we can define an optimization goal that minimizes the response time of service requests to the system, i.e., Omin(q). Here we reduce the adaptation space by looking at the top 10 adaptation options in terms of predicted response time. Alternatively, we could opt to reduce the adaptation space to just a single option, when choosing a stricter granularity. Figure 2b shows the optimization goal when we choose to reduce the adaptation space to just one option.
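The first strategy, keeping the k options with the best predicted quality values, can be sketched as follows. Here `predict` stands for a trained regression model's prediction function; this interface is an assumption for illustration:

```python
def reduce_by_optimization(options, predict, k=10, minimize=True):
    """Rank the adaptation options by their predicted quality value and
    keep the top-k. k expresses the granularity of the reduction: k=1
    reduces the space to the single predicted-optimal option."""
    ranked = sorted(options, key=predict, reverse=not minimize)
    return ranked[:k]
```

The kept options are then verified with the runtime quality models, so a small regression error does not immediately discard the truly optimal option.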

Setpoint Goals
The third and final type of adaptation goal covered in this paper is a setpoint goal. The aim of a setpoint goal is to keep the quality property of interest at (or close to) a given target value (i.e., the setpoint value or just the setpoint). We define the satisfaction of a setpoint goal S ∈ S with setpoint µ and error margin ε for any quality value q as follows: Sµ,ε(q) = True : |q − µ| ≤ ε; False : otherwise. For this type of goal, both classification and regression are candidates to predict quality values. Regression allows the identification of adaptation options with predicted quality values close to the setpoint value. Classification on the other hand enables the classification of adaptation options as either (1) being inside the specified epsilon window around the setpoint value or (2) outside the window.
Example: For the running example, we can specify a setpoint goal to keep the average cost of service invocations in the system at 8 cents with an error margin of 1 cent, i.e., S8c,1c(q), as shown in Figure 2c. Depending on the granularity set for adaptation space reduction, the adaptation space is reduced to adaptation options within a limited window around the setpoint value.
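The regression-based variant of this reduction can be sketched as a filter over the predicted quality values. As before, `predict` stands for an assumed trained regression model:

```python
def reduce_by_setpoint(options, predict, setpoint, epsilon):
    """Keep only the adaptation options whose predicted quality value
    lies inside the epsilon window around the setpoint:
    |q - setpoint| <= epsilon."""
    return [o for o in options if abs(predict(o) - setpoint) <= epsilon]
```

For the cost example (setpoint 8 cents, margin 1 cent), options with predicted costs 7.5 and 8.2 are kept, while an option with predicted cost 9.5 is filtered out.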

Combination of Multiple Goals
In practice, self-adaptive systems usually have to deal with multiple adaptation goals. ML2ASR+ supports adaptation space reduction for an arbitrary set of adaptation goals. However, in this paper we restrict ourselves to combinations of multiple threshold goals T, multiple setpoint goals S, and a single optimization goal O, representing a large class of practical systems, as illustrated with the running example and the cases used for the evaluation of ML2ASR+ in Section 6. The combined set of goals, denoted as G, is defined as: G = T ∪ S ∪ {O}. Hence, self-adaptive systems that rely on multi-objective optimization of adaptation goals to make adaptation decisions are not in scope of the work presented in this paper. The following sections explain in detail how ML2ASR+ reduces adaptation spaces when a combination of goals G needs to be satisfied.
Example: For the running example, we can combine different types of goals as specified above, for instance keeping the failure rate below a given threshold while minimizing the response time of service requests to the system.
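Combining goal types amounts to composing per-goal filters: threshold and setpoint goals prune the options, and the single optimization goal ranks the survivors. The sketch below assumes a simple data structure (a dictionary of predicted quality values per goal) that is not prescribed by ML2ASR+:

```python
def reduce_space(options, predictions, thresholds, setpoints, opt=None, k=10):
    """Reduce an adaptation space for a combined set of goals G.
    predictions: goal name -> {option -> predicted quality value};
    thresholds:  goal name -> predicate on the quality value;
    setpoints:   goal name -> (setpoint, epsilon);
    opt:         optional (goal name, minimize?) for the optimization goal."""
    # Keep options that satisfy every threshold goal and every setpoint goal.
    kept = [o for o in options
            if all(t(predictions[g][o]) for g, t in thresholds.items())
            and all(abs(predictions[g][o] - sp) <= eps
                    for g, (sp, eps) in setpoints.items())]
    # Rank the survivors on the optimization goal and keep the top-k.
    if opt is not None:
        goal, minimize = opt
        kept = sorted(kept, key=lambda o: predictions[goal][o],
                      reverse=not minimize)[:k]
    return kept
```

For the example above (failure rate below a threshold while minimizing response time), the threshold filter removes the violating options first, after which only the predicted-fastest survivors are passed on for verification.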

Machine Learning To Adaptation Space Reduction
We now present ML2ASR+, addressing the research question we presented in Section 2.4. ML2ASR+ is a modular approach for adaptation space reduction in self-adaptive systems, meaning it can be instantiated in multiple ways, depending on the needs of the domain at hand. We focus specifically on the use of two classic supervised machine learning methods: classification and regression, applied to systems with different types of adaptation goals.
We start with presenting the runtime architecture of ML2ASR+ that integrates a machine learning module in the architecture of a self-adaptive system. Then, we give a high-level overview of the workflow of ML2ASR+. Finally, we zoom in on the design time and runtime stages of the workflow.

Figure 3 shows the high-level runtime architecture of a MAPE-based self-adaptive system extended with a Machine Learning Module that realizes adaptation space reduction. The Monitor tracks the uncertainties and properties of the underlying managed system (1) and updates the information in the Knowledge repository. The Analyzer then evaluates the need for adaptation, based on the current conditions (2). When adaptation is needed, the analyzer composes a set of possible adaptation options, i.e., the configurations that can be reached from the current configuration by applying adaptation. This set is then passed to the Machine Learning Module (3), which makes predictions for the adaptation options using the machine learning models. Based on these predictions and the adaptation goals, the Machine Learning Module filters the options, reducing the set of adaptation options. These adaptation options are verified by the Verifier Module using a set of runtime models of the quality properties that correspond with the adaptation goals (4). The resulting estimates of the quality properties per adaptation option are then used by the Machine Learning Module to further train its internal learning models (5), resembling the online learning part of ML2ASR+. The Planner then evaluates the verified adaptation options, determines the best adaptation option available based on the adaptation goals, and composes a plan to adapt the managed system (6). Finally, the Planner triggers the Executor (7) that executes the steps of the plan, adapting the managed system (8).
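The numbered flow above can be condensed into one analysis cycle. The sketch below uses stub prediction and verification models (all names and numbers are ours) to show how the Machine Learning Module narrows the set of options passed to the Verifier Module:

```python
def analysis_cycle(options, predict, verify, goal_check):
    """One analysis step: predict qualities, keep the options expected to
    satisfy the goals, then formally verify only that reduced set."""
    reduced = [o for o in options if goal_check(predict(o))]   # steps 3: filter
    results = {o: verify(o) for o in reduced}                  # step 4: verify
    return reduced, results

# Stubs for illustration: an option is a request-distribution percentage.
predict = lambda o: {"resp_time": 0.2 * o}        # learned estimate
verify = lambda o: {"resp_time": 0.2 * o + 0.1}   # formal quality-model estimate
goal = lambda q: q["resp_time"] <= 10.0           # threshold goal

reduced, results = analysis_cycle(range(0, 101, 10), predict, verify, goal)
assert reduced == [0, 10, 20, 30, 40, 50]         # 6 of 11 options verified
```

The verifier is invoked only for the reduced set, which is where the time savings of adaptation space reduction come from.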

Runtime Architecture of ML2ASR+
In the remainder of this section, we explain how the Machine Learning Module is designed for a problem at hand (design stage of the ML2ASR+ workflow) and how the module reduces adaptation spaces at runtime (runtime stage).

High-level Overview of the ML2ASR+ Workflow
We start with a high-level overview of the workflow of ML2ASR+, shown in Figure 4. We explain the two stages of the workflow in general here and discuss them in detail in the next sections.
The design stage starts with the collection of data from the managed system and its environment. This data captures information relevant to the adaptation of the system over a period of time. This includes properties in the environment that affect the behavior of the system (e.g., actual workloads of the machines in the running example), system configurations (e.g., the distribution of service requests between the machines), and quality properties (e.g., the response time of service requests). Besides the system in operation, other suitable resources can be used to collect the data, such as a simulator or files with historical data. Next, features are extracted from the data. Features are measurable properties of the system and its environment that are relevant for self-adaptation. Uncertainties in the running example are the fluctuating workload of the machines and the bandwidth of the network. The extracted features are then used for the identification of the Machine Learning Module. To that end, different configurations of the Machine Learning Module (based on different types of learning models and other attributes, such as scalers that are used to normalize the collected data) are compared and the best configuration is selected. The output of the design stage is a Machine Learning Module that comprises machine learning models with a set of attributes (for instance scalers), and a predictor with a filter that allows predicting the qualities of adaptation options that can then be filtered to reduce the adaptation space. The Machine Learning Module is then ready for deployment and use at runtime.

The runtime stage works in cycles, each representing an opportunity for the system to perform adaptation. The workflow starts with gathering runtime data from the managed system and its environment that is relevant for adaptation. In the running example, this includes the actual values of the workload of the two machines used in the service-based system.
From this data, features are extracted, similarly to the design stage, yet now based on the data collected at runtime. Then two sub-stages are distinguished: training and testing. Immediately after deployment of the Machine Learning Module, the machine learning models need to be trained to make accurate predictions about quality properties in the system, filter the adaptation options, and reduce the adaptation space. The adaptation options in the running example are determined by the different settings that are available for distributing service requests between the two machines. In the training sub-stage, the system does not make any predictions yet. Consequently, as many adaptation options as possible are analyzed (i.e., the qualities are estimated using a verifier). Different heuristics can be used to select adaptation options from the total set, for instance options may be selected randomly, or the options may be divided into batches that are analyzed in subsequent slots. The number of cycles that are used for the training sub-stage is a parameter that is determined during the design stage.
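The batching heuristic mentioned above (analyzing a different batch of options in each subsequent slot) could be sketched as follows; this is one simple illustration, not a prescribed implementation:

```python
def training_batch(options, cycle, batch_size):
    """Round-robin selection: verify a different slice of the adaptation
    space in each consecutive training cycle (wrapping around)."""
    start = (cycle * batch_size) % len(options)
    batch = options[start:start + batch_size]
    if len(batch) < batch_size:                    # wrap around the space
        batch += options[:batch_size - len(batch)]
    return batch

opts = list(range(10))                             # 10 adaptation options
assert training_batch(opts, cycle=0, batch_size=4) == [0, 1, 2, 3]
assert training_batch(opts, cycle=1, batch_size=4) == [4, 5, 6, 7]
assert training_batch(opts, cycle=2, batch_size=4) == [8, 9, 0, 1]
```

Over consecutive warm-up cycles, every option is eventually verified at least once, giving the learning models balanced training coverage.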
The second sub-stage of the runtime workflow is called testing. During testing, the trained machine learning models are effectively used to reduce the adaptation space. In addition, the new verification results for adaptation options of the reduced adaptation space are used to continue the learning of the machine learning models. In the testing sub-stage, the Machine Learning Module predicts the quality properties of the adaptation options, and based on these results and the adaptation goals set for the system, a subset of adaptation options is selected for verification. The verification results, i.e., estimates of the quality properties of the adaptation options of the reduced adaptation space, are used for online learning. The updated machine learning models are then ready to perform adaptation space reduction in the next cycle, and the verification results can be used by the planner to make an adaptation decision.
In the following sections, we elaborate on the different steps of the two stages of the workflow. To precisely describe the different activities, we use a lightweight formalization. Figure 5 describes the workflow of the design stage activities in detail. The design stage comprises five distinct activities: Data Collection, Feature Selection, Feature Engineering, Model Evaluation, and Model Selection. The output of the design stage is a configuration for the Machine Learning Module that can then be deployed and used to support a self-adaptive system with reducing large adaptation spaces at runtime.

Design Stage of the ML2ASR+ Workflow in Detail
Before explaining the activities in detail, we highlight the software artifacts used for each activity and the responsibilities of the engineer; the various software artifacts are at the disposal of the engineer to perform the different activities. Table 1 gives an overview of the artifacts used for the activities together with the responsibilities of the engineer. Data collection is initiated by an engineer who selects a system resource and configures an artifact that is then used to collect data. Feature selection uses the collected data as input to a feature importance function that is used to filter out unimportant features. Feature engineering starts automatically after feature selection and adjusts individual feature values according to the feature scaling algorithm. Model evaluation starts automatically after feature engineering, taking the updated features and collected system qualities to run and evaluate different machine learning algorithms. Lastly, model selection is performed by an engineer who inspects the evaluation metrics from model evaluation to make a final decision about the configuration of the Machine Learning Module.

Data Collection
During Data collection, data concerning adaptation is gathered from the managed system and the environment in which the system operates. We categorize the data into two categories: potential features and system qualities. A potential feature f is any type of property of the system or the environment that could have an influence on at least one quality property of the system. A system quality q represents a non-functional property of the system. Data is collected for a period of time. At each time instance, the potential features and the associated qualities are collected. We introduce the following definitions:

Π = {π_1, ..., π_n}: A set of adaptation options in the system; we also use the set of all sets of adaptation options.
U = {u_1, ..., u_n}: A set of uncertainties that can be monitored; we also use the set of all possible sets of uncertainties that can be monitored.
λ = {f_1, ..., f_{n+m}}: A set of features comprising n features that represent a system configuration and m features that represent uncertainties. We call λ_i a feature vector.
s ∈ S: A system resource, with S the set of all resources of managed systems.
The standard resource used for data collection is the system deployed in its real-world setting. This resource ensures that the most accurate data is collected to design and configure the Machine Learning Module. However, collecting real-world data may be hard, for instance for large-scale distributed systems, or it may be an expensive and time-consuming process. Alternative approaches can then be applied, such as simulating the system or using historical data collected from the system. Such techniques may be more convenient to generate large amounts of data covering a wide range of different system states. We formally define the CollectData function as follows:

CollectData(s) = ({λ_1, ..., λ_n}, {φ_1, ..., φ_n})

Data collection from resource s results in a list of feature vectors {λ_1, ..., λ_n} and quality vectors {φ_1, ..., φ_n}. The potential features of λ_i correspond with quality values φ_i. The feature vectors and quality vectors provide the input to the next design stage activities.

Table 2: Excerpt of data collected for the running example.

Distribution  Workload M1  Workload M2  ABW M1  ABW M2  Response time
40            75           20           10      10      16ms
50            40           15           50      50      8ms
20            60           80           25      30      14ms
80            5            25           50      75      3ms
50            25           20           70      60      5ms
60            70           40           40      60      11ms
Example: Table 2 shows an excerpt of data collected for the running example application. Each row defines a feature vector with values for {Distribution, Workload M1, ..., ABW M2} and a quality vector with values for {Response time}. Note that only a subset of potential features and qualities are listed in the table for the sake of clarity. We also consider a small set of feature vectors to keep the example simple. The full data set includes all the features that may have an impact on the qualities of the system, as well as all the associated quality values of the system.

Feature Selection
During the next two activities relevant features are extracted from the collected data, defined as follows:

ExtractFeatures = EngineerFeatures ∘ SelectFeatures
During Feature selection, the potential features and their respective quality values are evaluated using a feature selection algorithm. This algorithm analyzes the impact of individual features on the quality values associated with them. Irrelevant features, i.e., features that do not (or only marginally) influence the qualities, can be filtered out. This will simplify the machine learning model and enhance the performance of the Machine Learning Module.
It is important to note that Feature selection carries an inherent risk. The algorithm determines the relevance of each feature based on its influence on the quality values, yet this evaluation is based on the data collected during Data collection. If this data does not cover the scenarios where the feature has an influence on the qualities of the system, this feature will not be selected. For this reason, we leave Feature selection as an optional activity in the design stage. It is the task of the engineer to carefully evaluate the data collected from the system to determine whether or not feature selection should be included. Formally, feature selection relies on the Relevant function, which uses a set of indices (denoted ind) to decide whether individual features of a feature vector should be included or filtered out. Hence, the features in the resulting feature vector are the subset of the features in the original feature vector that are relevant.
Example: The results of applying feature selection to the excerpt of the data collected from our example system (shown in Table 2) are shown in Table 3. In this case, feature selection determined that feature ABW M2 has no influence on the response time and consequently, this feature is excluded.

Table 3: Data after feature selection (feature ABW M2 excluded).

Distribution  Workload M1  Workload M2  ABW M1  Response time
40            75           20           10      16ms
50            40           15           50      8ms
20            60           80           25      14ms
80            5            25           50      3ms
50            25           20           70      5ms
60            70           40           40      11ms
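ML2ASR+ leaves the concrete feature selection algorithm open (Section 5 reports the use of extremely randomized trees to compute importances). A minimal sketch of the index-based filtering itself, with hypothetical importance scores, could look as follows:

```python
def select_features(vectors, importances, threshold=0.05):
    """Keep only features whose importance exceeds the threshold, mirroring
    the index set `ind` used by the Relevant function."""
    ind = [i for i, imp in enumerate(importances) if imp > threshold]
    return ind, [[v[i] for i in ind] for v in vectors]

# Hypothetical importances: the last feature (ABW M2) is deemed irrelevant.
vectors = [[40, 75, 20, 10, 10], [50, 40, 15, 50, 50]]
importances = [0.30, 0.25, 0.25, 0.18, 0.02]
ind, reduced = select_features(vectors, importances)
assert ind == [0, 1, 2, 3]
assert reduced == [[40, 75, 20, 10], [50, 40, 15, 50]]
```

In the actual workflow, the importances would come from a fitted model rather than being specified by hand.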

Feature Engineering
During Feature engineering, the concrete values of the features are inspected and adjusted if this benefits the quality of predictions. As such, feature engineering ties in closely with the next activity: Model selection. A well-known example of feature engineering is scaling, which is used for features with values of varying magnitude, range, and units. One scaling technique is normalization, where values of features are shifted and re-scaled to fit in a range between 0 and 1 (known as Min-Max scaling). Another scaling technique is standardization, where values of features are centered around the mean with a unit standard deviation. Formally, feature engineering is centered around a Transform function that transforms the values of the features according to the concrete engineering method that is used (e.g., scaling with normalization or with standardization). The result of feature engineering is a set of transformed features.
Example: Table 4 shows an example of feature engineering applied to the selected features of our running example shown in Table 3. In this particular case, the values of the distribution, workload and available bandwidth are normalized, i.e., the values are rescaled to a range between 0 and 1 instead of original values between 0 and 100.
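The two scaling techniques can be sketched in a few lines; this is a plain illustration of Min-Max scaling and standardization, not tied to any specific library implementation:

```python
def min_max_scale(values):
    """Normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center values around the mean with unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

workloads = [75, 40, 60, 5, 25, 70]      # illustrative Workload M1 values
scaled = min_max_scale(workloads)
assert min(scaled) == 0.0 and max(scaled) == 1.0
z = standardize(workloads)
assert abs(sum(z)) < 1e-9                # zero mean after standardization
```

Which scaler works best is precisely what the subsequent model evaluation activity compares.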

Evaluation of Models
During the last two activities of the design stage, we identify the machine learning models of the Machine Learning Module, which is defined as follows:

IdentifyModels = SelectModels ∘ EvaluateModels

We start with Evaluation of models, which uses the features extracted from the data of the system to determine a set of metrics that can be used to evaluate the performance of different machine learning models (in the next activity). Such metrics are determined to evaluate learning models per adaptation goal. To that end, a list of potential machine learning models is composed that combines different learning algorithms with variations on their internal loss and penalty functions. It is important to note that the selected algorithms need to support online learning, i.e., have the ability to continue training and thus updating the machine learning models after deployment.
Besides the list of machine learning models, two internal parameters of the Machine Learning Module are evaluated during model evaluation. The first parameter, exploration rate, represents the percentage of extra adaptation options that are selected for analysis (by the self-adaptive system) on top of the adaptation options that are predicted by the Machine Learning Module as being compliant with the adaptation goals. Exploring an additional percentage of adaptation options ensures that the Machine Learning Module also relearns a sample of options that may otherwise be ignored. The second parameter that we evaluate is called warm-up count. This parameter gives an indication for the number of training cycles that the Machine Learning Module should consider before it can be used to make meaningful predictions during operation (i.e., switch from training to testing).
We introduce the following definitions:

θ: A set of metrics for the evaluation of machine learning models.
Θ: The complete set of possible evaluation metrics.
E: The set of all sets of evaluation metrics for machine learning models.
For the evaluation of the machine learning models we use train-test split. Train-test split is an efficient procedure to estimate the performance of classification or regression models. The method can be used if a sufficiently large labeled dataset is available [2,22], which applies to our case, where such a dataset can be obtained from the system or a simulation as explained above. The evaluation of a machine learning model involves two steps: (1) training the model with a set of feature and quality vectors, and (2) testing the efficacy of the model by making predictions over a different set of feature vectors and examining the predictions by analyzing the corresponding machine learning metrics. This process can range from splitting the complete data set into two partitions (a train and a test dataset) to dividing the complete data set into multiple pairs of train and test datasets (cross validation). For the interested reader we refer to Appendix A.1, where we present a formal foundation for the former. Model evaluation results in a set of metrics sets, one set per machine learning model. Recall that metrics are determined per adaptation goal. Hence, we repeat model evaluation per goal, resulting in a set of evaluation metrics sets for each adaptation goal. After model evaluation, the metrics are used in the final design stage activity to select the learning models of the Machine Learning Module that will be used for adaptation space reduction at runtime.
For model evaluation of threshold goals, we apply classification using two evaluation metrics: F1-score and Matthews correlation coefficient. For model evaluation of setpoint and optimization goals, we apply regression using four evaluation metrics: R2-score, mean squared error, median absolute error, and maximum error. We elaborate on all these metrics in Section 5.
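A bare-bones illustration of the train-test split procedure, using a hypothetical one-dimensional dataset and a stand-in classifier (in the real workflow the model is trained on the train partition):

```python
def train_test_split(samples, labels, test_fraction=0.25):
    """Deterministic split for illustration; in practice the split is
    randomized (or repeated, as in cross validation)."""
    cut = int(len(samples) * (1 - test_fraction))
    return (samples[:cut], labels[:cut]), (samples[cut:], labels[cut:])

# Hypothetical data with threshold goal "quality <= 10" (class 1 = goal met).
samples = [3, 8, 14, 5, 16, 9, 12, 7]
labels = [1 if s <= 10 else 0 for s in samples]
(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels)
model = lambda x: 1 if x <= 10 else 0    # stand-in for a trained classifier
accuracy = sum(model(x) == y for x, y in zip(test_x, test_y)) / len(test_x)
assert len(train_x) == 6 and len(test_x) == 2
assert accuracy == 1.0
```

The metrics computed on the held-out test partition (here a simple accuracy) are the ones fed into model selection.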

Selection of Models
In the last activity of the design stage, we select a learning model from the evaluated learning models, relying on the metrics derived from the evaluation of these models. During model selection, the designer evaluates the metrics for the different machine learning models to make an informed decision about which model to use at runtime. This is repeated for each adaptation goal. Once the learning models are selected, the Machine Learning Module can be configured and deployed to be used at runtime (we explain the elements of a Machine Learning Module configuration below).
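When a single metric is used, the selection step amounts to picking the model with the best metric value; a trivial sketch with hypothetical accuracy values:

```python
def select_model(metric_per_model, maximize=True):
    """Pick the model with the best value of a single evaluation metric."""
    pick = max if maximize else min
    return pick(metric_per_model, key=metric_per_model.get)

# Hypothetical accuracies for three candidate classifiers.
accuracies = {"Model 1": 0.78, "Model 2": 0.81, "Model 3": 0.86}
assert select_model(accuracies) == "Model 3"
```

With multiple metrics, as noted, the engineer weighs the results rather than applying a single argmax.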
Example: Table 5 illustrates model evaluation and model selection for our running case. The data in the table builds on the previous examples and considers three machine learning models for classification, denoted Model 1, 2, and 3. To keep it simple, we restrict the evaluation to a single threshold goal: the response time of service requests should not exceed 10ms. We also consider the accuracy of the model as a single evaluation metric. The table at the top shows the predictions of the different learning models. E.g., the first line, for the features with response time 16ms, exceeds the threshold goal (of 10ms) and should be classified as 0. This is done correctly by Model 1 and Model 2, but not by Model 3. The table at the bottom shows the accuracy of each model. Based on these results, selecting a model is straightforward: the engineer selects Model 3 in this example, which has the highest accuracy. In case multiple metrics are used, the engineer needs to make an informed decision taking into account the different results.

The Feature constructor is responsible for assembling and extracting feature vectors. It takes as input a set of adaptation options and the values of uncertainties. The feature constructor combines this input using Feature composition (we explain this runtime activity below) and Feature engineering. The output is a set of feature vectors obtained from the runtime data. The feature constructor is configured using two specific parameters: the indices of relevant features {ind} (determined in Feature selection) and a Transform function (determined in Feature engineering).
The Machine Learning Models that are determined during the design stage are maintained in a data repository. Conceptually, the learning models are part of the Machine Learning Module. However, in practice, the models may be stored in the Knowledge repository of the MAPE-K feedback loop. The Predictor is responsible for making predictions of the adaptation options (i.e., the feature vectors produced by the feature constructor). In particular, the predictor makes predictions about the satisfaction of adaptation goals of the adaptation options (as specified by the P redict function), leveraging on the machine learning models. The output of the predictor is a set of predictions for the different adaptation options that need further filtering. The predictor is configured using the internal parameter warm-up count that determines the period that is used for training of the machine learning models. We explain the predictor below in the section about testing.

Finally, the Filter is responsible for filtering the adaptation options based on the predictions for the adaptation goals made by the predictor. Besides determining relevant adaptation options, the filter selects a subset of additional features to be explored based on the exploration rate parameter. The output of the filter is a reduced set of adaptation options that are used for verification. The verification results are then used for online learning of the machine learning models. We explain filtering further in the section about testing below.

Runtime Stage of the ML2ASR+ Workflow in Detail: Training
The runtime stage consists of two sub-stages: Training followed by Testing. In contrast to the design stage activities, the runtime stage activities work fully automatically and require no human input.
We start with Training. Training takes place immediately after deployment, when the Machine Learning Module has not yet learned nor gathered enough data of the system and its environment to make accurate predictions about the system qualities. Figure 7 gives a detailed overview of the workflow of the runtime stage activities during training. Training is applied for a number of cycles, based on the warm-up count that was determined during the design stage.
During training, the Machine Learning Module is not reducing the adaptation space yet. Instead, the available adaptation time of the system is used to formally verify as many adaptation options as possible, and the verification results are used to train the learning models of the Machine Learning Module. If not all adaptation options can be verified within a single time window that is available to make an adaptation decision, different strategies can be applied to select and verify adaptation options. A simple strategy selects adaptation options randomly. Another, more balanced, approach applies a round-robin strategy to select adaptation options one by one in consecutive time windows. Yet another strategy applies active learning to choose adaptation options for verification such that the Machine Learning Module can learn more efficiently by, e.g., selecting options with maximum entropy [58]. ML2ASR+ is flexible and does not prescribe any particular strategy.
To formally describe the activities of the runtime stage, we reuse the basic definitions introduced above. It is important to note that ML2ASR+ currently focuses on handling discrete adaptation options. System designers can, however, discretize a continuous adaptation space to apply ML2ASR+.
In the first activity of training, feature vectors are composed, meaning the set of possible adaptation options is combined with the set of uncertainties monitored by the system. Formally, the ComposeFeatures function generates a feature vector that combines the features representing a system configuration (adaptation option) with the features representing the monitored uncertainties. The composed features then undergo selection and engineering before they are used for online learning (see below), resulting in updated feature vectors.
Example: Table 6 illustrates the composition of features as well as feature extraction, i.e., feature selection and feature engineering. The monitored uncertainties include the workload and available bandwidth of the machines. The different settings of the distribution of service requests represent the adaptation options here, i.e., the system configurations. Based on feature extraction, the feature ABW M2 is not included in the updated feature vectors.
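Feature composition can be sketched as pairing each adaptation option with the current uncertainty readings; the values below are illustrative only:

```python
def compose_features(adaptation_options, uncertainties):
    """Pair every adaptation option (configuration features) with the
    currently monitored uncertainty values into one feature vector."""
    return [list(option) + list(uncertainties) for option in adaptation_options]

# Running example: the distribution setting is the configuration feature;
# workloads and bandwidth of the machines are the monitored uncertainties.
options = [[25], [50], [75]]        # distribution of requests to machine M1
monitored = [60, 30, 45]            # e.g., Workload M1, Workload M2, ABW M1
vectors = compose_features(options, monitored)
assert vectors == [[25, 60, 30, 45], [50, 60, 30, 45], [75, 60, 30, 45]]
```

Note that the uncertainty part is identical across the vectors of one cycle; only the configuration part varies per adaptation option.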
To enable online learning, which is effectively a continued training activity, adaptation options need to be verified, preferably as many as possible (as explained above). We define Verify based on the following definitions:

{qm_1, ..., qm_n}: A set of formal quality models used to estimate system qualities.
Q: The set of all sets of formal quality models.
Verification generates a set of quality values (one per goal) for each adaptation option. It is important to note that the quality values for the different adaptation options are estimates. The accuracy of these estimates is determined by the precision of the quality models used, the measurements of the uncertainties, and the verification method applied.
Example: Table 7 illustrates the verification results of a quality model for response time of a sample of adaptation options from our example service-based application.
Lastly, we use the updated feature vectors and the estimated quality vectors to train the machine learning models. More specifically, we employ online learning to continuously update and refine the machine learning models across the runtime cycles. Online learning, also referred to as incremental learning, allows a learner to incrementally learn from newly provided data samples. We refer the interested reader to the following articles [11,41]. LearnOnline is defined for a single learning model and hence needs to be repeated for all models. Initially, online learning starts from the model selected during the design stage (m_selected). Online learning then uses the features extracted from the composite feature vectors that are derived from the runtime data, the quality vectors associated with the adaptation options obtained from verification, and the model that is subject to training. The result is an updated learning model.
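The evaluation relies on scikit-learn models that support incremental updates. As a library-independent illustration of what a single online-learning step does, consider a bare perceptron update (all values hypothetical):

```python
def perceptron_update(weights, bias, x, label, lr=0.1):
    """One online-learning step: adjust the model only when it misclassifies
    the newly provided sample (the idea behind incremental fitting)."""
    pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
    if pred != label:
        sign = 1 if label == 1 else -1
        weights = [w + lr * sign * xi for w, xi in zip(weights, x)]
        bias += lr * sign
    return weights, bias

w, b = [0.0, 0.0], 0.0
# Verified options arrive cycle by cycle as (feature vector, class) samples.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 0.0], 1)]
for x, y in stream:
    w, b = perceptron_update(w, b, x, y)
assert sum(wi * xi for wi, xi in zip(w, [1.0, 0.0])) + b > 0   # now predicts 1
```

In the actual module this step corresponds to feeding each cycle's verification results into the model's incremental training routine.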

Runtime Stage of the ML2ASR+ Workflow in Detail: Testing
Once the machine learning models are trained (based on the warm-up count determined during the design stage), the Machine Learning Module switches to Testing. As opposed to training, during testing cycles the Machine Learning Module uses the machine learning models to make effective predictions about the qualities of adaptation options. These predictions can then be used to reduce the adaptation space, improving the efficiency of the analysis. Figure 8 shows the workflow with the activities of the testing cycles.

Example: Table 8 illustrates predictions for a sample of feature vectors made by a classifier that predicts the class for each feature vector. Classes 1 and 0 refer to predictions for the satisfaction and violation of a threshold goal respectively, as defined in the examples of the design stage.

The predictions made by the machine learning models are then used to filter the adaptation options. Only the adaptation options that are predicted to meet the adaptation goals are included for verification. For a full explanation and formal foundation of the filter operation we refer the interested reader to Appendix A.3. To summarize the filter operation: first, adaptation options that are predicted to violate one of the threshold or setpoint goals are filtered out. Second, the remaining adaptation space is further reduced according to the specified granularity value (denoted g from this point on) and the predictions associated with the quality of the optimization goal of the system (if present).
A consequence of the filtering approach is that adaptation options that do not meet all the adaptation goals are not selected for verification (since these options are filtered beforehand). However, these adaptation options may become relevant again when the operating conditions change; the exploration rate mitigates this risk by adding a sample of such options for verification, so the models keep learning about them.

Example: Table 9 illustrates the application of filtering based on the predictions made by the Machine Learning Module, as well as extending this set with a selection of additional adaptation options based on the exploration rate.
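Putting filtering, the granularity g, and the exploration rate together, a sketch of the reduction step could look as follows (function names, the goal, and all numbers are ours, for illustration):

```python
import random

def reduce_space(options, predict, goal_check, minimize_key, g, explore_rate, rng):
    """Filter out options predicted to violate threshold/setpoint goals, keep
    the g best options for the optimization goal, and add a random sample of
    the discarded options (exploration) so the models keep learning them."""
    preds = {o: predict(o) for o in options}
    candidates = [o for o in options if goal_check(preds[o])]
    kept = sorted(candidates, key=lambda o: preds[o][minimize_key])[:g]
    discarded = [o for o in options if o not in kept]
    extra = rng.sample(discarded, int(explore_rate * len(discarded)))
    return kept, extra

predict = lambda o: {"resp_time": o}                 # stub learned model
kept, extra = reduce_space(
    options=list(range(20)), predict=predict,
    goal_check=lambda q: q["resp_time"] <= 12,       # threshold goal
    minimize_key="resp_time", g=5, explore_rate=0.1,
    rng=random.Random(42),
)
assert kept == [0, 1, 2, 3, 4] and len(extra) == 1
assert extra[0] not in kept
```

The union of `kept` and `extra` is what gets passed to the verifier; the exploration sample is what keeps the discarded region from disappearing from the training data.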

Algorithms, Models, and Metrics for Evaluating ML2ASR+
Before we present the evaluation of ML2ASR+, we give an overview of the algorithms and models we used for the design of the learning modules, and we define the metrics that we use for the evaluation in Section 6.

Algorithms and Models for the Design of the Machine Learning Modules
For the design of the Machine Learning Modules of the evaluation cases, we evaluated different machine learning algorithms and models. The selection of the algorithms and models is based on their common use in the community; for some recent examples see [17,26,1]. In addition, the algorithms and models are supported by the widely used scikit-learn library [49] (see for instance [38,61,19]) that we also used for implementing the Machine Learning Module. We now summarize the algorithms and models that we used in the different design steps.
Feature Extraction Algorithms. Feature extraction involves two steps: Feature selection and Feature engineering. For Feature selection we have used Extremely Randomized Tree algorithms [28] to determine the importance of individual features based on their influence on the target values, i.e., the qualities of the system. The algorithms are based on random forest algorithms (composed of an ensemble of classical decision trees). On top of the random forest algorithms, the extremely randomized tree algorithms introduce extra randomness with the objective of further reducing the variance of the machine learning algorithm (reducing overfitting). We utilized two implementations of the extremely randomized tree algorithms: one for classification and one for regression. After applying the algorithms and detecting relevant features, we adjusted the collected data accordingly for the subsequent activities. For Feature engineering we considered four scaling algorithms: no scaling, min-max scaling, max-abs scaling, and standard scaling. We described the min-max and standard scaler briefly in Section 4.3.3. The max-abs scaler rescales the feature values in the range between 0 and 1 relative to their absolute value (e.g., the maximally encountered absolute feature value rescales to a value of 1).

Table 10: Metrics for evaluating classification models (objective for both: maximize).

F1-score: A combined metric of recall (percentage of samples that were retrieved using the classifier) and precision (percentage of samples that were correctly predicted), defined in the interval [0, 1].

MCC: The Matthews correlation coefficient: a metric representing how well the classifier performs compared to making random predictions, defined in the interval [-1, 1] (-1 representing completely incorrect predictions, 0 representing predictions on par with random predictions, and 1 representing perfect predictions).
Machine Learning Models. For our evaluation, we considered classification and regression machine learning models from the scikit-learn library [49] that are commonly used and support online learning. More specifically, we evaluated Stochastic Gradient Descent classifiers, Passive-Aggressive classifiers, Perceptron classifiers, Stochastic Gradient Descent regressors, and Passive-Aggressive regressors. The Perceptron classifier is a single-layer classifier that utilizes the well-known Perceptron algorithm introduced by Frank Rosenblatt [55]. The Stochastic Gradient Descent classifier and regressor both use an SGD learning routine to train their internal models. This learning routine approximates the true gradient of the regularized training error of the model by analyzing a single training data sample at a time (based on [54]). The Passive-Aggressive classifier and regressor [15] both utilize a more aggressive strategy compared to the previously mentioned Perceptron or SGD models, correcting their models when the internal loss exceeds a threshold, regardless of the step size required to amend the model. It is important to note that ML2ASR+ supports any other type of machine learning model (classification or regression) that supports online learning and fits within the architecture described in Figure 6.

Metrics for Evaluating the Learning Models of ML2ASR+
Learning models need to be evaluated both during the design stage and the runtime stage. During design, evaluation is used to select the best model. For hyper-parameter tuning, we varied a number of parameters, in particular the loss function, penalty function (if applicable), scaler, exploration rate, and warmup-count for the learners that we evaluated (for the list see Section 4.3.4). During runtime, we monitored the selected learning models to validate that they perform well after deployment. Table 10 and Table 11 show the metrics we used for the evaluation of the learning models for classification and regression respectively. F1-score combines recall (the percentage of samples that were retrieved by prediction from a specific class) and precision (the fraction of samples that have been predicted to be of a specific class that are actually part of that class). The F1-score is defined as a value in the interval [0, 1]. A higher F1-score means in general a better performing classifier. The F1-score is commonly used to judge the performance of classifiers, e.g., [2,22]. The Matthews correlation coefficient has a value in the range [-1, 1], where 1 represents perfect predictions, 0 represents predictions that are equal to random predictions, and -1 represents completely incorrect predictions. Hence, this metric enabled us to compare the predictions made by the machine learning model with an approach that predicts based on random selections.
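For illustration, both classification metrics are available in scikit-learn; the labels below are toy values:

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy labels: does an adaptation option satisfy a threshold goal (1) or not (0)?
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall, in [0, 1]
mcc = matthews_corrcoef(y_true, y_pred)  # in [-1, 1]; 0 means on par with random guessing

assert abs(f1 - 0.75) < 1e-9   # precision = recall = 3/4 for these labels
assert abs(mcc - 0.5) < 1e-9
```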
The R2-score represents how well the predictions of the model fit the actual quality values by looking at the variance of the predictions compared to the actual quality values. The R2-score is defined as a value in the interval [0, 1]. A higher R2-score indicates that the model is a better fit for the quality under consideration. The mean squared error refers, as the name suggests, to the mean of the squares of the errors in predictions made by the regressor. The median absolute error offers an alternative to the mean squared error, which is not as susceptible to outliers in the predictions. Lastly, the maximum error gives a good indication of the worst-case prediction made by the regressor. The R2-score serves as a good general metric for evaluating learning models (e.g., used in [57]). The other metrics can provide useful insights depending on the domain and context of the application at hand, see e.g., [24].

Table 12 summarizes the metrics for utility penalty and efficiency. We refer to these metrics as Quantitative Metrics from this point on, since they address the evaluation of the Negligible Utility Penalty and Efficiency requirements, which are both quantitative by nature (see Section 2.4). The utility penalty metric is used to address the Negligible Utility Penalty requirement of ML2ASR+. In the utility penalty formula, n equals the total number of adaptation cycles, q_o^i represents the quality value in cycle i which would have been chosen in an optimal situation, and q_c^i represents the quality value in cycle i chosen by our proposed solution. To address the Efficiency requirement of ML2ASR+, we define three metrics: average adaptation space reduction, learning time overhead, and overall time saved. In the formula of average adaptation space reduction (AASR), selected represents the average number of adaptation options selected by learning over multiple adaptation cycles, and total represents the average of the total number of adaptation options over these adaptation cycles. For the remaining formulas, the parameters T_x refer to one of the time units as defined in Figure 9.

Utility Penalty
The average difference in value of quality properties of the system obtained by applying the reference approach, DLASeR, and ML2ASR+.

Average Adaptation Space Reduction
The average proportion of adaptation options that were filtered by DLASeR and ML2ASR+. Formula: (1 − selected/total) × 100; objective: maximize.

Learning Time Overhead
The average proportion of additional time introduced by DLASeR and ML2ASR+ at runtime.

Overall Time Saved
The average proportion of total time saved of DLASeR and ML2ASR+ (taking into account overhead) compared to the reference approach. Formula: (T_o / (T_o + T_r)) × 100; objective: maximize.
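To make the quantitative metrics concrete, a minimal sketch follows; the per-cycle numbers are purely illustrative, and the utility penalty is computed here as a mean absolute difference, which is our reading of the description above:

```python
# Average adaptation space reduction (AASR): (1 - selected/total) x 100.
selected = [20, 25, 22, 18]    # options kept by learning in four cycles (toy values)
total = [216, 216, 216, 216]   # full adaptation space size in those cycles

avg_selected = sum(selected) / len(selected)
avg_total = sum(total) / len(total)
aasr = (1 - avg_selected / avg_total) * 100

# Utility penalty: average difference between the optimal quality value and the
# quality value chosen after reduction (interpreted as mean absolute difference).
q_optimal = [12.70, 12.71, 12.69]  # mC, toy values
q_chosen = [12.72, 12.74, 12.70]   # mC, toy values
utility_penalty = sum(abs(o - c) for o, c in zip(q_optimal, q_chosen)) / len(q_optimal)

assert aasr > 90             # most of the space is filtered before verification
assert utility_penalty < 0.05
```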

Evaluation of ML2ASR+
We evaluate and benchmark ML2ASR+ on two cases from different domains: DeltaIoT [35] and a Service-Based System that is based on TAS [65]. DeltaIoT is a small IoT system with only threshold and optimization goals and a rather small adaptation space of 216 adaptation options. The Service-Based System is a more challenging case with threshold, setpoint, and optimization goals and an adaptation space of 13500 adaptation options.
We start with the evaluation with DeltaIoT and then look at the Service-Based System. We present the results of the evaluation for different scenarios. For the runtime stage, we focus on the evaluation of the requirements with quantitative metrics: utility penalty, average adaptation space reduction, overall time saved, and learning time overhead. We elaborate on the other requirements in the discussion in Section 7.
Both applications are evaluated using a simulator. Simulations are run on a computer system with an AMD Ryzen 7 Pro 3700u CPU with 13.7GB of RAM. For the learning approaches, we have used the implementations of the Scikit-Learn algorithms (classifiers, regressors, scalers) [49]. The full replication package is available online. 10

Evaluation with DeltaIoT
We start by introducing DeltaIoT. Then we present two evaluation scenarios and explain the benchmarks we use. Next, we present the results of the design stage activities and finally the results of the runtime stage activities.

DeltaIoT Application
DeltaIoT is a small Internet-of-Things (IoT) application that offers a smart environment monitoring service. The application is developed by VersaSense. 11 The IoT network comprises 15 Long-Range (LoRa) motes that are deployed at the KU Leuven Computer Science Campus as shown in Figure 10. Each mote is equipped with a sensor (temperature, RFID and infrared) that periodically collects data and sends this data to a gateway. An end-user application processes the data allowing users to monitor the Campus area and take action when needed.
The network uses time-synchronized communication organized in cycles. Each cycle consists of a number of communication slots between a sender and a receiver mote. The slots are allocated from the leaf nodes of the network towards the gateway. Each mote has an internal buffer to store its own generated data and data received from other motes. When a mote is allocated a communication slot, it sends the data of the buffer to the receiving mote of the slot.
Uncertainties. We consider two important types of uncertainties: dynamics in the traffic load and interference of the wireless network. Dynamics in the traffic load result from variations in the frequency that sensors take samples and transmit data. For example: a temperature sensor collects and sends measurements periodically, while an RFID sensor only sends data when it is available, e.g., when a person scans an RFID badge. As a result, the load of packets that need to be sent to the gateway fluctuates. Interference of the wireless network arises from dynamic conditions in the environment, such as weather conditions or the presence of other wireless networks. Interference may result in the loss of packets communicated over the link. Figure 11 shows excerpts with data of both types of uncertainties over a period of time. This data is based on measurements of DeltaIoT in the field. 12

Quality Goals. Besides what the network should do, i.e., collecting data at the gateway, stakeholders of DeltaIoT also have demands on how this is done, i.e., the quality of the transmission. We consider three quality goals of DeltaIoT: packet loss, latency, and energy consumption. As explained above, packet loss depends on network interference. Latency depends on the traffic load in the network since only a limited number of packets can be transmitted during a time slot. The remaining packets remain in the buffers for communication in the next slot, causing delays in the transmission of data. Lastly, energy consumption depends on the number of packets that motes need to communicate and the power that is used to communicate packets. Evidently, stakeholders prefer to keep the packet loss, latency, and energy consumption low. However, these qualities are conflicting; for instance, using less energy (lower power) over a network link may result in higher packet loss as the signal may get lost in the noise along the link.
10 https://people.cs.kuleuven.be/danny.weyns/material/ML2ASR/
11 www.versasense.com
12 Network interference is represented as the Signal-to-Noise ratio (SNR). An SNR below 0 may lead to the loss of packets.

Scenario 1
Threshold: the average packet loss over 12 hours should not exceed 10% of the messages sent.
Threshold: the average latency over 12 hours should not exceed 5% of the cycle time.

Scenario 2
Threshold: the average packet loss over 12 hours should not exceed 10% of the messages sent.
Threshold: the average latency over 12 hours should not exceed 5% of the cycle time.
Optimization: the average energy consumption over 12 hours should be minimized.

Adaptation of the IoT Network. To ensure the quality goals during operation, DeltaIoT offers a management interface that is connected with the gateway. This interface can be used to observe the behavior of the network (e.g., the interference along links, the packets lost over a time period, etc.) and change the settings of the motes in the network. Here, we consider two types of settings. First, the power used to transmit packets over an outgoing link of a mote can be set in a range [0 . . . 15] (0 is minimum power and 15 maximum power). Sending packets with a higher power setting reduces the chance of packets being lost over a noisy link, but it consumes more power. Second, for motes with more than one outgoing link the distribution of the packets over these links can be set. This way, the transmission of packets along paths with high interference or high traffic can be reduced or avoided, yet the packets may follow a longer path requiring more energy. Since motes in DeltaIoT have at most two parent motes, we consider the following distribution settings for these motes: 0 − 100, 20 − 80, 40 − 60, 60 − 40, 80 − 20, 100 − 0. An example configuration is shown in Figure 10 (bottom right corner). Here a power setting of 5 is used for the upper link that transmits 20% of the packets, and a power setting of 9 is used for the bottom link that transmits 80% of the packets. Without self-adaptation, an operator is responsible for ensuring the quality goals by monitoring the network and adjusting the settings using the management interface. This is a tedious and costly task that is often not very efficient.
To that end, we add a managing system (MAPE feedback loop) that connects with the management interface to automate the adaptation of the settings. We use this setup for the evaluation of ML2ASR+.

Evaluation Setup
For the evaluation with DeltaIoT, we used a simulation of the network with 15 motes as shown in Figure 10. We applied 300 communication cycles of the network that correspond with a wall clock time of around three days. We used uncertainty profiles for traffic load of motes and network interference that are based on measurements of the physical network. For the traffic load, motes generate between 0 to 10 packets per cycle. The level of interference (SNR) fluctuates between -40dB and +15dB. Figure 11 shows two example profiles we used.
Adaptation Goals. We devised two evaluation scenarios with learning tasks for different adaptation goals summarized in Table 13. In scenario 1, learning needs to predict and filter adaptation options based on these two threshold goals. In scenario 2, learning needs to additionally predict and filter adaptation options for an optimization goal. Note that a threshold goal that should keep the average packet loss under 10% over a period of 12 hours, implies that on average 90% of the transmitted packets should be received by the gateway. On the other hand, a threshold goal that should keep the average latency under 5% over a period of 12 hours, implies that on average at least 95% of packets generated in a cycle should be received by the gateway within that cycle.
Adaptation Settings. Adaptation options are composed in each cycle following two steps. First, the power setting is determined for each link of each mote. These settings are determined such that the current Signal-to-Noise ratio (SNR) over each link is at least 0dB. The adaptation options are then determined based on the possible distribution settings for outgoing links of motes with two parents (0 − 100, 20 − 80, etc.). As such, the complete adaptation space for the DeltaIoT case consists of 6^3 = 216 adaptation options. 13 The MAPE feedback loop and the quality models have been designed as networks of timed automata models. These models are directly executed at runtime using the ActivFORMS execution engine [34]. The analysis of the adaptation options is performed on the runtime models by applying statistical model checking at runtime with Uppaal-SMC [16].
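The size of this adaptation space can be checked with a short enumeration; the three motes with two parents are an assumption derived from the 6^3 count above:

```python
from itertools import product

# Distribution settings for outgoing links of motes with two parent motes.
distributions = [(0, 100), (20, 80), (40, 60), (60, 40), (80, 20), (100, 0)]

# Three motes with two parents each yield 6^3 = 216 adaptation options.
adaptation_space = list(product(distributions, repeat=3))

assert len(adaptation_space) == 216
assert adaptation_space[0] == ((0, 100), (0, 100), (0, 100))
```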
Benchmarks. We benchmark ML2ASR+ using three approaches. First, we use a baseline approach that analyzes the whole adaptation space without using machine learning. Second, we use a competing approach, called DLASeR, that applies a deep neural network to reduce adaptation spaces [62]. 14 We have rerun the results presented in [62] to ensure that the same settings were used to compare ML2ASR+ and DLASeR. Third, as a sanity check, we used an approach that selects a subset of adaptation options randomly. We average the obtained results over 10 runs to reduce variability. We highlight the results of this random approach separately and focus on statistically relevant differences.
In the next sections, we start with the evaluation results of the design stage. Then we present the results of the runtime stage. To conclude, we summarize the machine learning activities in both stages.

Design Stage Evaluation with DeltaIoT
Data Collection. We collected data of 300 cycles of DeltaIoT to derive the machine learning modules for both scenarios, each cycle containing 216 data points. Experiments showed that 300 cycles for the design stage activities ensured that the learners performed well during runtime. As explained in Section 4, the collected data consists of a set of feature vectors that represent adaptation options with uncertainties, and quality vectors that represent the qualities of the corresponding adaptation options.
Feature Extraction. After collecting the data, we applied Feature extraction. The first activity, Feature selection, removes features from the collected feature vectors that do not have an influence on the resulting qualities in the system. Based on feature selection, 34 of the original 65 individual features were selected as relevant. For instance, all features related to SNR were selected. An example of a feature that was not selected is the load of motes that generate a constant number of packets, for instance motes that periodically track the temperature in the environment. Next, we used the pruned data to perform the second activity of feature extraction: Feature engineering. For both scenarios, we selected the Min-Max scaler for threshold goals. For the optimization goal, no scaler was selected as this provided the best results. As Feature engineering ties closely with Model selection, we explain the results below. Section 5.1 describes how Feature extraction was done. For detailed results, we refer to the website with the replication package.
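The paper does not prescribe a particular selection mechanism here; one simple way to drop features with no influence, such as constant traffic loads, is a variance filter, sketched below on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.random((300, 65))  # 300 cycles x 65 original features (synthetic data)
X[:, 10] = 4.0             # e.g., the load of a mote sending a constant number of packets

selector = VarianceThreshold()       # drops features with zero variance
X_pruned = selector.fit_transform(X)

assert X_pruned.shape == (300, 64)   # the constant feature is removed
```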
Machine Learning Model Identification. In the first activity, Model evaluation, we evaluated three different types of classifiers and two types of regressors. For the second activity, Model selection, we closely examined the evaluation metrics for each model to make a decision on the learning models to be used at runtime; Section 5.2 describes how this was done. Table 14 summarizes the chosen machine learning models and their corresponding metric values obtained during the evaluation process. Appendix B provides a detailed description of the chosen machine learning models.

H1: The utility penalties when applying ML2ASR+ are negligible compared to the reference approach.
H2: The utility penalties when applying ML2ASR+ are not significantly higher compared to DLASeR.
H3: ML2ASR+ significantly reduces the adaptation space and hence the time required for verification compared to the reference approach.
H4: The reduction of adaptation spaces with ML2ASR+ is not significantly lower compared to DLASeR, nor does ML2ASR+ require significantly more time for adaptation space reduction.
Granularities for Adaptation Space Reduction with an Optimization Goal. In scenario 2, ML2ASR+ predicts the energy consumption in the network (optimization goal), on top of predicting packet loss and latency (threshold goals).
After filtering out options that are predicted to satisfy the threshold goals (i.e., to be of class C3, see Section 5.2), ML2ASR+ reduces the adaptation space further based on the energy consumption predictions. We evaluate two cases: a reduction to at most 25 options and at most 10 options, corresponding to granularity values of 25 and 10, respectively.
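A minimal sketch of this two-step reduction follows; the function name and the toy predictions are ours, not taken from the ML2ASR+ implementation:

```python
def reduce_space(meets_thresholds, predicted_energy, granularity):
    """Keep options predicted to satisfy the threshold goals, then retain the
    `granularity` options with the lowest predicted energy consumption."""
    candidates = [i for i, ok in enumerate(meets_thresholds) if ok]
    candidates.sort(key=lambda i: predicted_energy[i])
    return candidates[:granularity]

# Toy predictions for a space of 8 options.
meets = [True, False, True, True, False, True, True, False]
energy = [12.9, 12.5, 12.6, 13.1, 12.4, 12.7, 12.8, 13.0]

subset = reduce_space(meets, energy, granularity=3)
assert subset == [2, 5, 6]  # the three cheapest options among those meeting the goals
```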
Quality of the Learning Models. Table 16 summarizes the results of the evaluation for the quantitative metrics. We discuss these results now in detail.
Utility Penalties. Figure 12 shows the results for utility penalties. Note that the reference approach that exhaustively verifies all adaptation options provides optimal adaptation. 15 First, we take a closer look at the threshold goals. Afterwards, we look at the optimization goal.

Threshold Goals. When inspecting the results in detail, we notice that the values for the threshold goals with ML2ASR+ are very close to those obtained with the reference approach. The average values of the penalties are respectively 0.045% and 0.025% for packet loss and latency in scenario 1, and 0.515% and 0.299% for the worst case of scenario 2 with a granularity value of 10. The marginal increases that result from adaptation space reduction with ML2ASR+ do not impede the satisfaction of both goals compared to the reference approach in scenario 1 and scenario 2 with a granularity of 25. On the other hand, in scenario 2 with a granularity of 10 we notice that the threshold goals were not satisfied in 7 additional adaptation cycles after adaptation space reduction took place (cycles for which the reference approach does not violate the requirements). These results show that a lower granularity that substantially reduces the adaptation space for analysis may result in penalties for the quality properties of interest. This trade-off has to be carefully considered when making decisions about the granularity of adaptation space reduction.
Comparing the results of ML2ASR+ with DLASeR, we observe that the satisfaction of the threshold goals is not impeded in both scenarios, as opposed to the few violations in scenario 2 when using ML2ASR+ with a granularity value of 10. For the optimization goal, with granularity values of 25 and 10 we also notice a similar trade-off between granularity value and utility penalty: a more fine-grained reduction carries the risk of adapting the system less optimally compared to a less constrained strategy. Note that the penalty compared to the reference approach is still acceptable (12.719mC mean energy consumption for the reference approach, 12.727mC for ML2ASR+ with a granularity value of 25, and 12.739mC for ML2ASR+ with a granularity value of 10) considering the significant time gain that both approaches offer (see below). For DLASeR we observe an average energy consumption of 12.769mC with a utility penalty for energy consumption of 0.038mC, meaning that ML2ASR+ performs quite well compared to the competing approach.
Sanity Check with Random Approach. We compared ML2ASR+ with a simple approach that randomly selects adaptation options (using an average of 10 random runs). For the threshold goals, packet loss and latency, both approaches satisfy the goals in both scenarios. However, the results show that the random approach violates the threshold goals for 28 adaptation cycles (of a total of 300 cycles), which is 16 cycles more compared to ML2ASR+. For the optimization goal of energy consumption in scenario 2, the Wilcoxon signed rank statistical test [70] showed a significant difference between the random approach and ML2ASR+ both for each random run and on the average of 10 random runs (for the latter we measured a p-value of 1.2e-15 with alpha level 0.05) 16 . Figure 13 shows the distribution for the energy consumption of the IoT networks in scenario 2 with both approaches. The average energy consumption with the random approach is 12.800mC compared to an energy consumption of 12.739mC for ML2ASR+. While the difference in absolute value of the energy consumption is relatively small, the difference is statistically relevant. The second case will show that for more complex application scenarios, the impact is practically relevant.
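Such a paired comparison can be run with SciPy's Wilcoxon signed rank test; the numbers below are synthetic, chosen only to mimic a small but systematic difference between two approaches:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)
# Paired per-cycle energy consumption (mC) for two approaches over 300 cycles.
ml2asr = rng.normal(12.74, 0.05, size=300)
random_approach = ml2asr + rng.normal(0.06, 0.03, size=300)  # slightly worse

stat, p = wilcoxon(ml2asr, random_approach)
assert p < 0.05  # the difference is significant at alpha level 0.05
```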
Hypotheses H1 (negligible utility penalties compared to reference approach) and H2 (utility penalties not significantly higher compared to DLASeR). The results show that the utility penalties when applying ML2ASR+ are negligible compared to the reference approach. ML2ASR+ with a low granularity value in one scenario did not satisfy the threshold goals in all cycles, emphasizing the importance of a good selection of granularity. Comparing ML2ASR+ with DLASeR, we notice that the utility penalties remain negligible for both the threshold goals and the optimization goal. In conclusion, we can accept hypotheses H1 and H2.

Average Adaptation Space Reduction. Figure 14 (left) shows the size of the adaptation spaces for the three evaluated approaches. During the first 45 training cycles, when the Machine Learning Module of ML2ASR+ is not exploited, all adaptation options are analyzed (multiple data points overlapping at 216 adaptation options for ML2ASR+). In the case of DLASeR, there is only a single entry at 216 adaptation options corresponding to the only training cycle. Applying ML2ASR+ results in an Average Adaptation Space Reduction (AASR) of 56.5% for scenario 1, 88.5% for scenario 2 with a granularity value of 25, and 95.4% with a granularity value of 10. For scenario 1 this means that, on average, more than half of the adaptation options available in the adaptation space are filtered out before verification is applied. For scenario 2, we obtained results that match in most cases the granularity value.
For DLASeR we obtain an Average Adaptation Space Reduction of 58.8% for scenario 1, a result similar to the one obtained with ML2ASR+. However, for scenario 2, DLASeR works differently: the approach relies on deep learning models starting with predicting the energy consumption of all adaptation options; then it iterates over the adaptation options (from low energy consumption predictions to high) until an adaptation option is found that meets both threshold goals. This way, DLASeR achieves an average adaptation space reduction of 94.19% in scenario 2.
Learning Time Overhead. Figure 14 (right) shows the overhead introduced by ML2ASR+ (red and yellow lines) and DLASeR (blue line). The learning overhead is on average less than 1% of the total time necessary to both reduce and verify the reduced adaptation space for ML2ASR+ in both scenarios. Concretely, the overhead of ML2ASR+ is at most 4.28ms, which is less than 10% of the time required to verify a single adaptation option. We conclude that this overhead is negligible compared to the time necessary to verify the selected subset of adaptation options.
DLASeR on the other hand introduces a slightly higher overhead of 8.34% in scenario 1. Even though this number is higher, it is important to bear in mind that the overhead is still a minor part of the overall time required for verification of the reduced adaptation space. For scenario 2 however, we notice a significantly higher overhead of 45.96% for DLASeR due to the strategy DLASeR employs for adaptation space reduction. Even though DLASeR reduces the adaptation space to a small subset, the overhead is significantly higher than ML2ASR+.
Overall Time Saved. Figure 14 (middle) shows the overall time used to analyze all the adaptation options in each cycle with the reference approach (green), and the time used to reduce and verify the adaptation space with ML2ASR+ (red and yellow) and DLASeR (blue). We observe that in scenario 1 ML2ASR+ saves more than half of the time (62.81%) for verifying the reduced adaptation space compared to the reference approach. This observation is in line with the average adaptation space reduction, resulting in a significant time gain compared to the reference approach. DLASeR shows results that are also in line with the average adaptation space reduction, albeit slightly worse due to the higher overhead introduced by the approach (62.59% of the time saved). Similarly for scenario 2 we observe results closely aligned with the adaptation space reduction metric since the learning time is negligible: 90.82% and 96.37% for granularity values 25 and 10 respectively. For DLASeR in scenario 2 we notice an average time saved of 89.69%.
Hypotheses H3 (significant reduction of adaptation spaces and time gain) and H4 (adaptation space reduction comparable to DLASeR). ML2ASR+ realizes a significant reduction of the adaptation space of 56.5% for scenario 1 and over 90% for scenario 2, resulting in an overall time saving for analysis of 62.81% compared to the reference approach in scenario 1 and again over 90% for scenario 2. ML2ASR+ and DLASeR realize a similar adaptation space reduction in scenario 1 and in scenario 2 with a granularity value of 10. Yet, the time required for adaptation space reduction with ML2ASR+ is negligible (< 1%), while it ranges from small to significantly larger for DLASeR (8.34% to 45.96%). In conclusion, we can accept hypotheses H3 and H4.

Table 17 summarizes the number of inputs, features, objective variables, and metrics of the learning activities for DeltaIoT in each of the design stage and runtime stage activities.

Table 17: The number of inputs, features, objective variables and metrics for the activities of the machine learning pipeline of scenario 2 of DeltaIoT in the design and runtime stage (separated by the double line). The prediction column is marked in red to indicate that it is not yet used in the training cycles. The number of inputs for online learning is determined by the number of options that could be verified by the Verifier. Abbreviations: "f vectors" → feature vectors, "q vectors" → quality vectors, "Pl" → packet loss, "Ec" → energy consumption, "La" → latency.

Summary of design stage and runtime stage machine learning activities
Service-Based System Application
The system consists of a set of services that perform tasks. The services are composed in a workflow with two main branches. The "sleep branch" analyzes the data of patients while they sleep; the results can be visualized for the patient afterwards. The "awake branch" analyzes data of different activities, processing the data using exercise and diet services and visualizing the results to the patient. Each branch fulfills its tasks using different services. For example, the Exercise service processes activity data regarding exercises and makes recommendations. Similarly, the Diet service processes activity data regarding dietary information and makes recommendations. The Exercise-Diet service combines both these responsibilities in a single service, providing an alternative path in the workflow.
The workflow defines service types that need to be instantiated. Three different service providers offer such service instances. These instances are marked in the workflow by small colored rectangles at the top of each service symbol. Service instances differ in the qualities they provide (e.g., response time) but also the cost for using them. During operation a concrete set of services instances is selected and used to handle incoming service requests.
Uncertainties. Each service provider is characterized by two parameters in our evaluation: its workload and the available bandwidth of its network. Both these parameters fluctuate at runtime, representing uncertainties. These fluctuations in turn affect the qualities of the service instances they provide, including the failure rate, response time, and cost (see below). Figure 16 shows the models we used for the fluctuations of the qualities of service instances per service provider. Failure rate and cost increase with higher load, while response time increases with lower bandwidth.
Besides fluctuations in work load and the available bandwidth, the system has to deal with an additional uncertainty, namely the distribution of service requests for sleep analysis (sleep branch) or activity analysis (awake branch). This distribution, denoted by the value p in the workflow, may change over time depending on the patient's behavior.
Quality Goals. In the evaluation, we consider three key qualities for stakeholders of the service-based system: the failure rate of service invocations, the response time, and the cost of invocations. Each service instance is characterized by a specific failure rate, response time, and cost. Hence, the overall qualities for service requests are determined by the individual service instances that are selected to handle these requests. In particular, the overall failure rate is determined by combining the failure rates associated with the selected service instances. As an example, assume we invoke two service instances, each characterized by a failure rate of 5%. The overall failure rate of a service request then corresponds to 1 − (0.95 * 0.95) = 0.0975, i.e., 9.75%. The overall response time of service requests is simply determined by the sum of the individual response times associated with these selected service instances. Similarly, the overall cost is determined as the sum of the costs associated with individual service instances.

Clearly, stakeholders want to keep the failure rate, response time, and cost as low as possible. Yet, these qualities conflict. Invoking a service with lower failure rate and/or lower response time will usually imply a higher cost. However, the selection of services is complicated by uncertainties. For instance, the cost to invoke a service of a service provider may increase when the service provider is under heavy load. Similarly, the failure rates and the response times of the provided service instances fluctuate in time.
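These composition rules can be sketched in a few lines; the response time and cost values below are illustrative:

```python
from functools import reduce

def overall_failure_rate(failure_rates):
    # One minus the product of the individual success rates.
    success = reduce(lambda acc, f: acc * (1.0 - f), failure_rates, 1.0)
    return 1.0 - success

# Example from the text: two instances, each with a 5% failure rate.
assert abs(overall_failure_rate([0.05, 0.05]) - 0.0975) < 1e-12

# Response time and cost are simply summed over the selected instances.
overall_response_time = sum([120, 80, 45])  # ms, toy values
overall_cost = sum([0.5, 0.2, 0.3])         # toy values
```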
Adaptation of the Service-Based System. Given the fluctuations in load and available bandwidth of service providers and changes of patient behavior, the selection of service instances may be changed dynamically based on the changing conditions. To that end, the system can be configured such that the requests are distributed in a particular way over different instances. In the evaluation setting, we use service types with 2 and 3 instances. For services with 2 instances the system offers 3 possible configurations: 0/100%, 50/50% and 100/0%. For services with 3 instances there are 10 possible configurations: 0/0/100%, ..., 0/33/67%, ... 100/0/0%. This way, preference can be given to services with better actual quality values, or services can even be (temporarily) avoided if necessary. In addition, the parameter α that determines which path is taken in the awake branch (distinct services for the exercise and diet tasks or a combined service) can be set to one of five values: 0%, 25%, 50%, 75% and 100%. Figure 15 shows (on top of the general workflow) an example configuration of the workflow (with concrete selections for service instances and α set to 25%).
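The configuration counts can be reproduced with a short enumeration. As a hedged side note, one way to arrive at the 13500 options mentioned earlier would be three 2-instance services, two 3-instance services, and the five settings of α (3^3 × 10^2 × 5 = 13500), although the exact composition of the workflow is not spelled out here:

```python
from itertools import product

def distributions(n_instances, n_units):
    # All ways to spread n_units load units over n_instances;
    # each unit corresponds to 100/n_units percent of the requests.
    return [c for c in product(range(n_units + 1), repeat=n_instances)
            if sum(c) == n_units]

assert len(distributions(2, 2)) == 3    # 0/100%, 50/50%, 100/0%
assert len(distributions(3, 3)) == 10   # 0/0/100%, ..., 100/0/0%
```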
Without self-adaptation, it is practically infeasible for an operator to change the service selection dynamically. Hence, the only option for an operator would be to allocate a predefined set of possible service instances to the system and perform a coarse-grained adaptation. However, this would result in a sub-optimal solution, or worse, in case particular services fail unexpectedly. To that end, we add a managing system (MAPE-based feedback loop) that monitors the changing conditions and adapts the service instances of the workflow dynamically when needed to maintain the stakeholder goals (failure rate, response time, and cost).

Evaluation Setup
We used a simulation of the setup as shown in Figure 15. We considered 30,000 service requests that are generated sequentially and processed individually by the system. Adaptation is triggered every 100 requests, resulting in 300 feedback loop iterations. The workload and available bandwidth are modelled as stochastic variables that gradually change during operation of the system (between 0 and 100%). In each step, the bandwidth and workload of the service providers increase or decrease by a value sampled from a normal distribution with a standard deviation of 1.7. The factor p is initially set at 50% and is modelled similarly to the workload and available bandwidth.
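The gradual change of the uncertainties can be mimicked with a bounded random walk. This is a minimal sketch under our own assumption that values are clamped at the bounds:

```python
import random

def step(value, sigma=1.7, lo=0.0, hi=100.0):
    """One adaptation cycle: perturb the variable and clamp it to [lo, hi]."""
    return min(hi, max(lo, value + random.gauss(0.0, sigma)))

random.seed(1)
bandwidth, trace = 50.0, []
for _ in range(300):  # one value per feedback-loop iteration
    bandwidth = step(bandwidth)
    trace.append(bandwidth)
print(len(trace), min(trace) >= 0.0, max(trace) <= 100.0)  # 300 True True
```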
Adaptation Goals. We devised two scenarios of the service-based application as illustrated in Table 18. In scenario 1 we consider three adaptation goals: two threshold goals and an optimization goal. Learning first filters out adaptation options based on the threshold goals, and subsequently orders and reduces the adaptation space according to the optimization goal. In scenario 2, learning has to deal with all three types of goals (threshold, setpoint, and optimization).
Table 18: Adaptation goals in the two evaluation scenarios. Scenario 1 — Threshold: failure rate; Threshold: response time; Optimization: the average cost should be minimized. Scenario 2 — Threshold: the average failure rate should not exceed 10%; Setpoint: the average response time should be kept at 10ms; Optimization: the average cost should be minimized.
Adaptation Settings. The adaptation space for the Service-Based System is fixed. The adaptation options are determined based on the distribution of available service instances per service type and the setting of α. Concretely, the total adaptation space comprises 5 * 10 * 10 * 3 * 3 * 3 = 13500 adaptation options. Similarly to the DeltaIoT case, we designed a MAPE feedback loop and quality models as networks of timed automata that are directly executed using ActivFORMS [34]. For the analysis of the parameterized quality models, the feedback loop applies statistical model checking at runtime using Uppaal-SMC [16].
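The size of the adaptation space follows from a standard stars-and-bars count per service type, under the same discretization assumption as before (load split in steps of 50% for 2-instance services and 33% for 3-instance services):

```python
from math import comb

def n_configs(instances, steps):
    """Number of ways to split `steps` load units over the instances."""
    return comb(steps + instances - 1, instances - 1)

alpha_settings = 5  # alpha in {0%, 25%, 50%, 75%, 100%}
# Two service types with 3 instances, three service types with 2 instances.
space = alpha_settings * n_configs(3, 3) ** 2 * n_configs(2, 2) ** 3
print(space)  # 13500
```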
Benchmark. As for DeltaIoT, we benchmark ML2ASR+ against a reference approach that analyzes the whole adaptation space without using learning, and against DLASeR [62], a state-of-the-art approach. As a sanity check, we again use an approach that selects adaptation options randomly over a set of runs to reduce variability. It is important to note that the results for the different approaches are obtained from identical configurations and parameter settings of the application. For ML2ASR+ and DLASeR the data is collected during simulation. Yet, for the reference approach the data is collected during the design stage, since analyzing the complete adaptation space for one cycle takes around 2 hours.
In the next sections, we start with the results of the design stage. Then we present the results of the runtime stage. To conclude, we summarize the machine learning activities in both stages.

Design Stage Evaluation with the Service-Based System
Data Collection, Feature Extraction, Machine Learning Model Identification. The design stage activities for the Service-Based System followed the same procedure as the activities for DeltaIoT (see Section 6.1.3). First, we collected data from the system to derive the machine learning modules for both scenarios. This data consisted of a set of feature vectors (composed of adaptation options and uncertainties in the application) and a set of quality vectors. We collected data for 100 adaptation cycles corresponding to 10,000 service requests, each adaptation cycle containing 13500 data points. Then we performed Feature extraction. During Feature selection, all 22 features were selected as relevant, e.g., all features concerning the distribution of service requests over service instances and all features concerning the load of the service providers. During Feature engineering, we determined which scaling algorithms to use to adjust feature values, following Section 5.1. Lastly, for Model evaluation and Model selection, we evaluated and selected machine learning models based on the criteria listed in Section 5.2. Table 19 summarizes the results for both scenarios, including the selected scaling algorithms, the selected classifier and regressor models, and their corresponding machine learning metric values obtained during model evaluation. We refer to Appendix B for a detailed description of the chosen machine learning models.
Exploration Rate and Warm-up Count. Finally, we selected 5% as the exploration rate (extra random adaptation options selected for verification) and 60 cycles (of 300) as the warm-up count (the number of training cycles used to initialize the learning models). For detailed results, see Appendix B.
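The exploration mechanism can be sketched as follows. This is our own simplified interpretation: exploration adds a fixed fraction of extra, randomly chosen options to the reduced set so the learner keeps receiving fresh training samples at runtime:

```python
import random

def options_to_verify(reduced, full_space, exploration_rate=0.05):
    """Reduced adaptation space plus a few randomly explored options."""
    chosen = set(reduced)
    n_extra = max(1, round(exploration_rate * len(full_space)))
    pool = [o for o in full_space if o not in chosen]
    return list(reduced) + random.sample(pool, min(n_extra, len(pool)))

random.seed(7)
full = list(range(200))
selected = options_to_verify(list(range(10)), full)
print(len(selected))  # 10 reduced + 10 explored = 20
```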

Runtime Stage Evaluation with the Service-Based System
Hypotheses. For the evaluation of the runtime stage activities of ML2ASR+ we use the same hypotheses H1 to H4 as for DeltaIoT, see Section 6.1.4. However, we test hypothesis H2 (the utility penalties when applying ML2ASR+ are not significantly higher compared to DLASeR) and H4 (the reduction of adaptation spaces with ML2ASR+ is not significantly lower compared to DLASeR, nor does ML2ASR+ require significantly more time for adaptation space reduction) only for scenario 1, as DLASeR does not support setpoint goals yet.
Granularities for Adaptation Space Reduction with an Optimization Goal. In both scenarios, ML2ASR+ has to deal with an optimization goal to keep the cost in the application minimal. After filtering out adaptation options based on the predicted satisfaction of threshold and setpoint goals in the system, ML2ASR+ further reduces the adaptation space based on the cost predictions. We evaluate two cases for each scenario: a reduction to at most 1000 adaptation options and a reduction to at most 100 adaptation options. This corresponds to granularity values 1000 and 100.
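A minimal sketch of this two-step reduction, with hypothetical stand-in functions for the learned classifier and regressor:

```python
def reduce_adaptation_space(options, goals_satisfied, predicted_cost, granularity):
    """Filter on predicted goal satisfaction, then keep at most
    `granularity` options with the lowest predicted cost."""
    kept = [o for o in options if goals_satisfied(o)]
    return sorted(kept, key=predicted_cost)[:granularity]

# Toy stand-ins for the learned models (not real predictors).
options = list(range(20))
goals_satisfied = lambda o: o % 2 == 0   # pretend even options meet the goals
predicted_cost = lambda o: 100 - o       # pretend higher ids are cheaper
print(reduce_adaptation_space(options, goals_satisfied, predicted_cost, 3))
# [18, 16, 14]
```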
Quality of the Learning Models. Table 20 shows the results for the quality of the learning models at runtime. We highlight the most important metrics. The classifier used to make predictions for both threshold goals in scenario 1 (failure rate and response time) has an F1-score of 0.841, and the classifier used to predict the failure rate threshold goal in scenario 2 has an F1-score of 0.935. The regressor used to predict the optimization goal in scenario 1 (cost) has an R2-score of 0.862. For scenario 2, the regressor used to predict the setpoint goal (response time) has an R2-score of 0.902 and the regressor used to predict the optimization goal (cost) has an R2-score of 0.913. These results confirm that the machine learning models can make accurate predictions for the quality properties of the system.
Summary of Results for Quantitative Metrics. Table 21 summarizes the evaluation results for the quantitative metrics for the Service-Based System. We discuss these results now in detail.
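For reference, the two reported metrics follow the textbook definitions, shown here in plain Python:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (counts of true/false positives,
    false negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(round(f1_score(80, 10, 20), 3))  # 0.842
print(round(r2_score([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]), 3))  # 0.98
```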
Utility Penalties. Figure 17 shows the results for the utility penalties for both scenarios when applying ML2ASR+ with a granularity value of 1000 in red and a granularity value of 100 in orange, and DLASeR in blue (only for scenario 1). Subsequently, we zoom in on the threshold goals, setpoint goal, and optimization goals. In scenario 1, we notice a utility penalty of 0.436c for DLASeR, which is slightly better than ML2ASR+. We also observe that the results of ML2ASR+ with a granularity value of 100 are slightly worse compared to a granularity value of 1000. This can be explained by the additional restriction put on the size of the reduced adaptation space: the resulting adaptation space is on average 10 times smaller compared to a granularity value of 1000, leaving fewer adaptation options to select from to apply self-adaptation.
Sanity Check with Random Approach. We compared ML2ASR+ with batches of 10 runs of an approach that randomly selects adaptation options. For the threshold goals, failure rate and response time in scenario 1 and failure rate only in scenario 2, the random approach manages to always select at least one adaptation option that satisfies the goals. For the optimization goal in both scenarios and the setpoint goal in scenario 2, the Wilcoxon signed-rank tests showed statistically significant differences between ML2ASR+ and the batches of 10 runs with the approach that randomly selects adaptation options (p-values of 4.88e-21 for cost in scenario 1, 1.44e-4 for response time in scenario 2, and 2.55e-66 for cost in scenario 2). Note that we could not identify statistically significant differences for the response time in scenario 2 for all individual runs with the approach that randomly selects adaptation options. Figure 18 shows the distributions for the optimization and setpoint goals of the Service-Based System in the two scenarios. The average cost (optimization goal) for scenario 1 is 24.58c with ML2ASR+ compared to 26.38c with random selection (Random), a difference of 1.8c (6.82%). For scenario 2, the values are 25.57c with ML2ASR+ compared to 31.51c with Random, a difference of 5.94c (18.85%). For the average response time (setpoint goal at 10ms±0.25ms), we noticed that the approach that randomly selects adaptation options violated the goal on average in 48 cycles (of a total of 300 cycles) compared to no cycles for ML2ASR+. These results show that ML2ASR+ performs substantially better for more complex adaptation scenarios compared to an approach that randomly selects adaptation options.
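For completeness, the test statistic used above can be sketched as follows. This is a simplified reimplementation with a normal approximation and no handling of tied ranks; in practice one would use a statistics library rather than this sketch:

```python
from math import erf, sqrt

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation,
    zero differences dropped, ties in |d| not rank-averaged)."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w = min(w_plus, n * (n + 1) / 2 - w_plus)  # smaller signed-rank sum
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd  # z <= 0 since w is the smaller rank sum
    p = 2 * 0.5 * (1 + erf(z / sqrt(2)))  # two-sided p-value
    return w, min(1.0, p)

w, p = wilcoxon_signed_rank([1, 0, 3, 4, 0], [0, 2, 0, 0, 5])
print(w)  # 7.0
```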
Hypotheses H1 (negligible utility penalties compared to reference approach) and H2 (utility penalties not significantly higher compared to DLASeR). The results show that the utility penalties incurred by ML2ASR+ are negligible compared to the reference approach. Specifically, the penalties for cost (optimization goal) are very low in both scenarios (at most 1.589mC and 2.653mC with granularity values 1000 and 100 respectively). A smaller granularity value reduces the adaptation space significantly but implies higher utility penalties. The satisfaction of the threshold and setpoint goals remains unaffected with ML2ASR+. The slight increase in cost is acceptable, especially considering that it is not feasible to use the reference approach in practice due to time constraints. In scenario 1, DLASeR shows slightly better results compared to ML2ASR+ with a granularity value of 1000 (a cost penalty of 0.436c vs 1.381c for ML2ASR+). However, this extra cost is acceptable considering that DLASeR, unlike ML2ASR+, does not support all types of adaptation goals yet. In conclusion, we can accept hypotheses H1 (for scenarios 1 and 2) and H2 (for scenario 1) in the Service-Based System application.
Average Adaptation Space Reduction. Figure 19 (left) shows the number of adaptation options remaining after reduction. We used a warm-up count of 60 cycles for both ML2ASR+ and DLASeR. The total number of options analyzed in these training cycles is limited by the available time for adaptation (30m for the Service-Based System). Note that the reference approach is not subject to this time restriction for the purpose of evaluation; i.e., the reference approach fully analyzes the whole adaptation space with 13500 adaptation options, which is infeasible in practice due to time constraints.
Learning Time Overhead. Figure 19 (right) shows the learning time overhead introduced by ML2ASR+ with the two evaluated granularity values, and by DLASeR (scenario 1). The overhead of ML2ASR+ is very small compared to the overall verification time (the overhead for learning is denoted in ms, while the overall verification time is denoted in s). Specifically, the overhead for both granularity values accounts for less than 0.5% of the time required to reduce and verify the adaptation space. In absolute terms, ML2ASR+'s overhead is capped at approximately 500ms during online training cycles. The learning time overhead of DLASeR in scenario 1 is substantially higher compared to ML2ASR+ with a granularity value of 1000 (with a close to equal Average Adaptation Space Reduction). Yet, the overhead remains minor compared to the overall verification time; the overhead of DLASeR is 0.30% of the overall verification time. In absolute terms, the overhead of DLASeR is capped at approximately 2200ms during the training cycles.
Overall Time Saved. Figure 19 (middle) shows the overall time used to analyze the (selected) adaptation options. We can clearly see that the overall verification time is significantly reduced, closely aligned with the corresponding Average Adaptation Space Reduction. Concretely, we observe an overall time saved of approximately 92% and 99% using ML2ASR+ with granularity values 1000 and 100 respectively. For DLASeR, we notice an overall time saved of 92.62%, which is in line with ML2ASR+ for a granularity value of 1000 in scenario 1.
Summary of Design Stage and Runtime Stage Machine Learning Activities
Table 22 summarizes, similarly to the DeltaIoT case before, the number of inputs, features, objective variables, and metrics for the Service-Based System. Note that no features are removed by Feature Extraction since all features were deemed relevant (hence the table cells being marked in gray). Abbreviations used in the table: "f vectors" = feature vectors, "q vectors" = quality vectors, "Pl" = packet loss, "Ec" = energy consumption, "La" = latency.
Hypotheses H3 (significant reduction of adaptation spaces and time gain) and H4 (adaptation space reduction comparable to DLASeR). The evaluation shows that ML2ASR+ significantly reduces the adaptation space: up to a reduction of 99% depending on the specified granularity value. Paired with this, up to 99% of the time used by the reference approach is saved, closely aligned with the average adaptation space reduction, since the overhead introduced by ML2ASR+ is minimal (constituting less than 0.5%). For scenario 1, DLASeR obtains a similar Average Adaptation Space Reduction (92.66% vs 92.59%) and overall time saved (92.62% vs 92.60%) compared to ML2ASR+ with granularity value 1000. ML2ASR+ with granularity value 1000 outperforms DLASeR on learning overhead with a value of 0.04% vs 0.30%. As such, we can accept hypothesis H3 for scenarios 1 and 2, and H4 for scenario 1 in the Service-Based System.

Discussion
In Section 2.4, we described the research question targeted in this work and we listed the desirable requirements for an approach to tackle the research question. To that end, we proposed ML2ASR+. In the evaluation, we assessed the quantitative requirements. We now discuss the remaining qualitative requirements, we answer the research question, we highlight insights obtained from this research endeavor, and conclude with a discussion of threats to validity.

Qualitative Requirements
Reusability. With reusability we refer to the ability of ML2ASR+ to be instantiated and applied in multiple application domains. To cover this requirement, we demonstrated the applicability of ML2ASR+ in the Internet-of-Things domain and the Service-Based Systems domain. In both applications we analyzed the performance of ML2ASR+ in two evaluation scenarios, while also assessing different granularity values. From the results, we can conclude that ML2ASR+ can handle both applications and the different evaluation scenarios.
Automatic Operation at Runtime. To evaluate the second requirement, we make a distinction between the design stage and the runtime stage of the approach. In the design stage, the system requires manual input from the system developer(s) to properly configure the Machine Learning Module, as described in Section 4.3. Hence, this step is not completely automated. Once the Machine Learning Module is deployed, no further input or intervention from the system operators is necessary. This is also demonstrated in the evaluation: during operation, ML2ASR+ reduces the adaptation space without any input from an operator or system developer. ML2ASR+ thus satisfies this requirement.
Modularity of Adaptation Goals. To evaluate the ability of ML2ASR+ to deal with different combinations of adaptation goals, we specifically investigated whether ML2ASR+ is able to deal with threshold, setpoint, and optimization goals. In our evaluation, we defined four scenarios that combine different goal types for two different applications. This way, we ensured that different combinations of the three types of adaptation goals are assessed. We conclude that ML2ASR+ supports all the goal types in the evaluated application scenarios.
Granularity of Adaptation Space Reduction. With granularity we refer to the degree to which ML2ASR+ is able to reduce adaptation spaces, i.e., selecting a specified number or percentage of adaptation options from the original adaptation space. ML2ASR+ allows the specification of a granularity value that constrains the size of the reduced adaptation space. We have demonstrated this for both applications in different evaluation scenarios with granularity values of 10, 25, 100, and 1000. We can thus conclude that ML2ASR+ satisfies this last requirement as well.
Answer to the research question: "How can machine learning be used to reduce large adaptation spaces of self-adaptive systems with different types of adaptation goals to perform more efficient analysis without compromising the goals?" This work demonstrates how classic supervised machine learning techniques can be used to reduce the adaptation space to a more manageable subset. After designing the Machine Learning Module, ML2ASR+ initializes the learning models, trains the models during warm-up, and then uses the models to make predictions about the satisfaction of adaptation goals for individual adaptation options. ML2ASR+ uses classification to predict the satisfaction of threshold and setpoint goals, and regression to predict the quality value associated with an adaptation option. ML2ASR+ provides the means to reduce the adaptation space to the specified granularity value; this flexibility enables the approach to adjust to the available time window to perform adaptation. Empirical evaluation shows that ML2ASR+ drastically reduces the time required for analysis with a negligible effect on the satisfaction of the adaptation goals in our evaluated systems.

Insights
We share a number of insights we obtained during the design and evaluation of ML2ASR+:
• When handling multiple adaptation goals, there is a risk that errors in learning models propagate further with each prediction. Accumulating prediction errors may ultimately reduce the efficacy of the approach.
• The overhead introduced by the learning approach directly links with the selected granularity value. Even for small granularity values, the gain in time required for analysis is significant. Yet, such a setting may also have a significant impact on the utility penalties. Hence, the right choice for setting the granularity value is important and requires experimentation.
• It is important to highlight that the use of linear machine learning models is not a "one size fits all" solution. The effectiveness of the approach depends on the underlying relation between input data (features) and the output (qualities). If this relation cannot be properly modeled in a linear way, other approaches such as DLASeR that rely on deep neural networks may be preferable as these approaches capture these intrinsic relations better. It is however important to keep in mind that other approaches may follow different workflows and carry their own drawbacks. For instance, DLASeR follows different steps in its design stage and runtime stage workflows and the approach introduces a larger learning overhead compared to ML2ASR+.
• ML2ASR+ relies on the assumption that the (formal) models that are used to estimate quality properties of the underlying system provide reliable and correct results. A quality model that cannot handle concept drift nor evolution of the system may yield data that does not capture the real system accurately. This can affect the performance of the machine learning models. Yet, extracting data directly from the real system rather than the (formal) model is not a solution as this data is inherently very limited since only a single adaptation option can be applied each cycle. However, exploiting the data retrieved from the real system to detect issues with the model and adapt or evolve the models dynamically is an option; we leave this as future work.
• For the validation of ML2ASR+, we trained the learning models both during the design stage and the runtime stage. In principle, it would be possible to generate all the data and train the machine learning models completely during the design stage. However, since the space of adaptation options combined with the uncertainties leads to a huge space of possible configurations, generating all the data and training the machine learning models upfront would be very time consuming. Therefore, we used a representative sample of the data for design stage training and then collected additional data after startup to continue the training according to the actual system configuration and uncertainties at runtime.
• In this work, we considered setpoint goals based on a small window around the setpoint, resembling the steady-state error in control-based approaches [59]. Any configuration within this window complies with the goal. An interesting option for future work is to refine this view and consider the option closest to the setpoint as the optimal one. Combined with an optimization goal, this will lead to a multi-objective optimization problem.

Threats to Validity
The empirical evaluation of ML2ASR+ is subject to threats to validity. For each threat, we discuss potential critiques of this study and we explain how we dealt with those.
Internal Validity. To make sure that we can draw causal conclusions based on the study, we took several measures. Concerning the contribution, we specified the approach formally, providing a basis to verify that the approach works as described. Concerning the evaluation, we applied the same settings of the simulator with the same settings for the application parameters when comparing ML2ASR+ with the other approaches. This is particularly relevant in settings with stochastic behavior. As such, we provide a basis for deriving the conclusions of comparing the approaches. We also provide a replication package [52] for other researchers to validate the results.
External Validity. External validity concerns the generalization of the results beyond the scope of the study. This study contributes an architectural approach for adaptation space reduction in self-adaptive systems that is centred on the Machine Learning Module with an according workflow. This approach uses classical supervised machine learning techniques to support the adaptation process. Since we have applied and evaluated ML2ASR+ to a limited set of scenarios with particular characteristics and types of uncertainties, we cannot make general claims about the efficacy of the approach in other settings. To mitigate this threat to some extent, we have evaluated the approach in two distinct domains with different challenges regarding adaptation space reduction for different combinations of adaptation goals.
Construct Validity. With construct validity we analyze whether we have obtained the right measures to answer the proposed research question. To minimize threats to construct validity we provided an explicit definition of six requirements to be evaluated. For several of these requirements we defined concrete metrics that enabled us to evaluate the performance of the approach empirically (in terms of efficiency and overhead). Several of these metrics are based on established practice for the evaluation of learning approaches. In addition, the formal specification of ML2ASR+ provides a rigorous description of how the approach works. Nevertheless, we acknowledge that other metrics may have been considered for evaluating the appropriateness of adaptation space reduction.
Conclusion Validity. Threats to conclusion validity concern reaching an incorrect conclusion about a relationship in the observations. To mitigate conclusion validity threats, we applied ML2ASR+ in different scenarios of different domains with different characteristics. Based on a set of well-defined metrics, the results confirm that ML2ASR+ is effective for adaptation space reduction in self-adaptive systems. In addition, we have made all code and experimental data publicly available [52] to reproduce the experiments and confirm the findings.

Conclusion
In this paper we presented ML2ASR+, a novel approach to analyze large adaptation spaces more effectively by exploiting classic supervised machine learning techniques to reduce adaptation spaces on the fly. ML2ASR+ extends the basic MAPE-K architecture with a Machine Learning Module that supports the Analyzer component by reducing the adaptation space to a manageable subset. In particular, the Machine Learning Module filters adaptation options that are predicted to not meet the adaptation goals in the system. We have demonstrated the effectiveness and viability of ML2ASR+ in our evaluation in two different application domains. We evaluated the effectiveness of ML2ASR+ in reducing the adaptation space as well as the overhead introduced by the approach. The results showed that the overhead introduced by ML2ASR+ is minimal compared to the time required to verify the remaining subset of filtered adaptation options. On top of this, the penalty in system qualities is negligible when choosing a new system configuration from the reduced adaptation space. In future work, we plan to investigate adaptation space reduction in decentralized self-adaptive systems where multiple feedback loops need to coordinate the analysis. In the long term, we plan to expand our study on the use of machine learning and self-adaptive systems, and investigate how evolutionary learning can be used to support self-adaptation in systems that are exposed to unanticipated changes, requiring system evolution. First ideas in this direction are reported in [64].
The second type of filter operation relates to setpoint goals in the system. We define the filter operation that filters adaptation options according to a setpoint goal S with target µ and error margin ε for any quality value q as follows:
f_{µ,ε}^S : {π_1, π_2, ..., π_n} → {π_{f_1}, π_{f_2}, ..., π_{f_m}} where m ≤ g and Σ_{i=1}^{m} |q_{f_i} − µ| = min({ Σ_{k∈K} |q_k − µ| : K ⊆ Π and |K| = m })
Lastly, the filter deals with up to one optimization goal. We define the filter operation for an optimization goal O for quality values q (to be minimized) as follows:
f_{min}^O : {π_1, π_2, ..., π_n} → {π_{f_1}, π_{f_2}, ..., π_{f_m}} where m ≤ g and Σ_{i=1}^{m} q_{f_i} = min({ Σ_{k∈K} q_k : K ⊆ Π and |K| = m })
In our research, we use filters that combine the different filter operations in a predefined order. In particular, the filter first filters out adaptation options that violate threshold goals. Next, it filters out adaptation options that violate the setpoint goals of the system. Finally, it filters the options based on a single optimization goal. We restrict filtering to a single optimization goal to avoid conflicting scenarios when multiple optimization goals are specified in the system. Equation A.1 specifies how we define the main filter operation. In case any of the types of adaptation goals is not applicable, that type is ignored by the filter.
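The two filter operations can be sketched in plain Python (our own encoding: each adaptation option is paired with its predicted quality value, and g is the granularity bound):

```python
def filter_setpoint(options, mu, g):
    """Keep the at most g options whose quality is closest to setpoint mu."""
    return sorted(options, key=lambda oq: abs(oq[1] - mu))[:g]

def filter_minimize(options, g):
    """Keep the at most g options with the lowest quality value."""
    return sorted(options, key=lambda oq: oq[1])[:g]

opts = [("a", 9.1), ("b", 10.2), ("c", 12.0), ("d", 10.0)]
print(filter_setpoint(opts, 10.0, 2))  # [('d', 10.0), ('b', 10.2)]
print(filter_minimize(opts, 2))        # [('a', 9.1), ('d', 10.0)]
```

Note that selecting the g individually closest (or cheapest) options also minimizes the sums in the definitions above, since each option contributes independently.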
Appendix B. Additional Machine Learning Material
Table B.23 and Table B.24 summarize the scalers and models selected for the evaluation scenarios in both applications. The numbers between square brackets indicate the boundaries of the evaluation metric values for the alternative options that were not selected. For both applications and scenarios, the warm-up count is selected from 30, 45, and 60 cycles, and the exploration rate is selected from 5% and 10%.
Table B.24: Design stage model selection for the Service-Based System application in the two system scenarios.