GoSafeOpt: Scalable Safe Exploration for Global Optimization of Dynamical Systems

Learning optimal control policies directly on physical systems is challenging since even a single failure can lead to costly hardware damage. Most existing model-free learning methods that guarantee safety, i.e., no failures, during exploration are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally optimal policies for high-dimensional systems while giving safety and optimality guarantees. We demonstrate the superiority of GoSafeOpt over competing model-free safe learning methods on a robot arm, a system that is prohibitive for GoSafe.


Introduction
The increasing complexity of modern dynamical systems often makes deriving mathematical models for traditional model-based control approaches forbiddingly involved and time-consuming. Model-free reinforcement learning (RL) methods [1] are a promising alternative as they learn control policies directly from data. To succeed, they need to explore the system and its environment. Without a model, this can be risky and unsafe. Since modern hardware such as robots is expensive and repairs are time-consuming, safe exploration is crucial to apply model-free RL in real-world problems. This paper proposes GoSafeOpt, a model-free learning algorithm that can search for globally optimal policies while guaranteeing safe exploration with high probability. Model-free learning poses two main challenges: (i) It typically requires large amounts of data. In learning control, such data is often gathered by conducting experiments with physical systems, which is time-consuming and wears out the hardware. (ii) Learning requires exploration, which can lead to unwarranted and unsafe behaviors.
Challenges (i) and (ii) can be addressed jointly by Bayesian optimization (BO) with constraints. BO [7] is a class of black-box global optimization algorithms that have been used in a variety of works [8][9][10][11] to optimize controllers in a sample-efficient manner. In constrained BO, there are two main classes of methods. On the one hand, approaches like [12][13][14][15] find safe solutions but allow unsafe evaluations during training. Herein, we focus on approaches that guarantee safety at all times during exploration, which is crucial when dealing with expensive hardware. SafeOpt [16] and safe learning methods that emerged from it, e.g., [17][18][19], guarantee safe exploration with high probability by exploiting properties of the constraint functions, e.g., regularity. Unfortunately, these methods are limited to exploring a safe set connected with a known initial safe policy. Therefore, they could miss the global optimum in the presence of disjoint safe regions in the policy space (see Fig. 1). Disjoint safe regions appear when learning an impedance controller for a robot arm, as we show in our experiments, and in many other applications [8; 20; 21]. To address this limitation, [21] proposes GoSafe, which can provably and safely discover the safe global optimum in the presence of disjoint safe regions under mild conditions. To achieve this, it learns safe backup policies for different states and uses them to preserve safety when evaluating policies outside of the safe set. Specifically, it switches between actively exploring local safe regions in the state and policy space and safe global exploration. However, the active exploration in the state and policy space requires a coarse discretization of the space and is infeasible for all but the simplest systems with low-dimensional state spaces; [22] argues that dimension d > 3 is already challenging.
As a result, GoSafe cannot handle most real-world dynamical systems and is restricted to impractical systems with low-dimensional state spaces. The concept of switching between two exploration stages is also pursued in the stagewise safe optimization algorithm proposed in [23]. However, [23] is also restricted to an optimum connected to a safe initialization. Lastly, the general idea of learning backup policies is related to safety filters and control barrier functions [24][25][26]. Nevertheless, these methods require either the availability or the learning of a dynamics model in addition to learning the policy and are, therefore, model-based. In this work, we focus on a model-free approach.

Contributions
This work presents GoSafeOpt, the first model-free algorithm that can globally search for optimal policies in safety-critical, real-world dynamical systems, i.e., systems with high-dimensional state spaces. GoSafeOpt neither discretizes nor actively explores the state space. Therefore, it overcomes the main shortcomings and restrictions of GoSafe, while still performing safe global exploration. This makes GoSafeOpt the first and only model-free safe global exploration algorithm for real-world dynamical systems. Crucially, GoSafeOpt leverages the Markov property of the system's state to learn backup policies, which it uses to guarantee safety when evaluating policies outside the safe set. This novel mechanism for learning backup policies does not depend on the dimension of the state space. We provide high-probability safety guarantees for GoSafeOpt, and we prove that it recovers the safe globally optimal policy under assumptions that hold in many practical cases. Finally, we validate it in both simulated and real safety-critical path following experiments on a robotic arm (see Fig. 2), which is prohibitive for GoSafe, the only competing model-free global safe search method. Further, we show that GoSafeOpt achieves considerably better performance than SafeOpt, a state-of-the-art method for local model-free safe policy search, and its high-dimensional variants. Table 1 compares GoSafeOpt to SafeOpt and GoSafe in terms of safety guarantees, scalability, global exploration, and sample efficiency. It shows that GoSafeOpt is the only method that can perform sample-efficient global exploration in high-dimensional systems while providing safety guarantees.

Problem Setting
We consider a Lipschitz-continuous system

dx(t) = z(x(t), u(t)) dt, (1)

where z(·) represents the unknown system dynamics, x(t) ∈ X ⊂ R^s is the system state, and u(t) ∈ U ⊂ R^p is the input we apply to steer the system state to follow a desired trajectory x_des(t) ∈ X for all t ≥ 0. We assume that the system starts at a known initial state x(0) = x_0.
The control input u(t) we apply for a given state x(t) is specified by a policy π : X × A → U, with u(t) = π(x(t), a) := π_a(x(t)). The policy is parameterized by a ∈ A ⊂ R^d, where A is a finite parameter space 1. We encode our goal of following the desired trajectory x_des(t) through an objective function f : A → R. Note, the trajectory of the deterministic system (1) is fully determined by its initial state x_0 and the control policy. Therefore, the objective is independent of the state space X. We seek a controller parametrization a ∈ A that optimizes f for a constant initial condition x_0. Since the dynamics of the system in Eq. (1) are unknown, so is the objective f. Nonetheless, we assume we obtain a noisy measurement of f(a) at any a ∈ A by running an experiment. We aim at optimizing f from these measurements in a sample-efficient way. Additionally, to avoid the deployment of harmful policies, we formulate safety as a set of unknown constraints over the system trajectories that must be satisfied at all times. As for f, these constraints depend only on the parameter a and hence take the form g_i : A → R for each constraint function g_i, where i ∈ {1, . . . , q} := I_g and q ∈ N. The resulting constrained optimization problem with unknown objective and constraints is:

max_{a ∈ A} f(a) subject to g_i(a) ≥ 0 for all i ∈ I_g. (2)

We represent the objective and constraints using a scalar-valued function in a higher-dimensional domain, as proposed by [18]:

h(a, i) = f(a) if i = 0, and h(a, i) = g_i(a) if i ∈ I_g, (3)

with I_g = {1, . . . , q}, I := {0, 1, . . . , q}, and i ∈ I. This representation will later help us in learning the unknown function.
In summary, our goal is to find the optimal and safe policy parameter for the system starting from the nominal initial condition x 0 . We refer to the solution of Eq. (2) as the safe global optimum a * . Note, finding the optimal policy for a fixed initial condition x 0 is a common task in episodic RL [1].
Solving this problem without a dynamics model and without incurring failures for generic systems, objectives, and constraints is hopeless. The following section introduces our assumptions to make this problem tractable.

Assumptions
To solve the problem in Eq. (2) safely, we assume that we have at least one initial safe policy to start data collection without violating constraints. This initial policy could be derived from available simulators, first-principles models, or by performing controlled experiments on the hardware directly. This policy can be conservative and sub-optimal. For instance, for mobile robots, a policy that barely moves the robot could be an initial safe policy. Assumption 2.1. A set S_0 ⊂ A of safe parameters is known. That is, for all parameters a in S_0 we have g_i(a) ≥ 0 for all i ∈ I_g.
In practice, similar policies often lead to similar outcomes. In other words, the objective and the constraints exhibit regularity properties. We capture this by assuming that the function h from Eq. (3) lives in a reproducing kernel Hilbert space (RKHS) [27] and has bounded norm in that space.
Assumption 2.2. The function h lies in an RKHS associated to a kernel k and has a bounded norm in that RKHS ∥h∥ k ≤ B. Furthermore, the objective f and constraints g i are Lipschitz continuous with known constants.
Without Assumption 2.2, the constraint and reward functions can be discontinuous making it impossible to infer the safety of a policy before evaluating it and to provide safety guarantees. In practical applications, such behavior is undesirable, and therefore rare. For further discussion on the practicality of this assumption, we refer the reader to [28].
Next, we formalize our assumptions on the measurement model.
Assumption 2.3. We obtain noisy measurements of h with the measurement noise independent and identically distributed (i.i.d.) σ-sub-Gaussian. That is, for a measurement y_i of h(·, i), we have y_i = h(a, i) + ϵ_i with ϵ_i σ-sub-Gaussian for all i ∈ I.
Assumptions 2.1, 2.2, and 2.3 are common in the safe BO literature [16][17][18]. However, these approaches treat the evaluation of a policy as a black box. In contrast, we monitor the rollout of a policy to intervene and bring the system back to safety, if necessary. This can be achieved for a Markovian [29] system, like the one we consider in Eq. (1) (see Proposition A.3 in the appendix).
To monitor the rollouts, we assume that we receive a state measurement every ∆t seconds and that in between discrete time steps the system cannot jump arbitrarily, i.e., its movement within these (typically small) time intervals is bounded. Note, for many robotic systems this assumption is valid, especially since we can choose the sampling time ∆t. However, estimating this bound can be challenging. A conservative value for the bound may be estimated by performing controlled experiments, e.g., with the safe initial policy from Assumption 2.1, directly on hardware. Simulators or first-principles models, if available, can also be leveraged.
Assumption 2.4. The state x(t) is measured after every ∆t seconds. Furthermore, for any x(t) and ρ ∈ [0, 1], the distance to x(t + ρ∆t) induced by any action is bounded by a known constant Ξ, that is, ∥x(t + ρ∆t) − x(t)∥ ≤ Ξ.
Remark: Implicitly, we assume here noise-free measurements of the state for simplicity. Our method also works in the noisy case (see Appendix A.1.1), which is typical in the real world.
Triggering a backup policy for a Markovian system is not sufficient to guarantee the safety of the whole trajectory for a generic constraint. Consider the case where safety is expressed as a constraint on a cost accumulated along the trajectory. Even if we are individually safe before and after triggering a backup policy, we might be unsafe overall. Therefore, we limit the types of constraints we consider.
Assumption 2.5. We assume that, for all i ∈ {1, . . . , q}, g_i is defined as the minimum of a state-dependent function ḡ_i along the trajectory starting in x_0 with controller π_a. Formally:

g_i(a) = min_{x ∈ ξ(0, x_0, a)} ḡ_i(x), (4)

with ξ(0, x_0, a) := {x(t) = x_0 + ∫_0^t z(x(τ), π_a(x(τ))) dτ, t ≥ 0} the trajectory of x(t) under policy parameter a starting from x_0 at time 0.
An example of such a constraint is the minimum distance of the system to an obstacle. We can now provide a formal definition of a safe experiment. Definition 2.6. An experiment is safe if, for all t ≥ 0 and all i ∈ {1, . . . , q}, ḡ_i(x(t)) ≥ 0. (5) This is a more general way of defining safety than the optimization problem in Eq. (2). In particular, where Eq. (2) only considers trajectories associated with a fixed policy parameter a, Definition 2.6 also covers the case in which different portions of the trajectory are induced by different controllers.
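For concreteness, the trajectory constraint of Assumption 2.5 can be sketched in a few lines of Python; the wall constraint ḡ and all numbers below are purely illustrative, not part of the paper's setup:

```python
def trajectory_constraint(states, g_bar):
    """Evaluate g_i(a) = min of the state-dependent constraint g_bar_i
    along a rollout, as in Eq. (4) / Assumption 2.5."""
    return min(g_bar(x) for x in states)

# Illustrative state constraint: keep the first state coordinate at least
# 0.5 m away from a wall at x = 2 (hypothetical numbers).
g_bar = lambda x: (2.0 - x[0]) - 0.5
```

An experiment is then safe in the sense of Definition 2.6 exactly when this minimum stays non-negative along the whole trajectory.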

Preliminaries
This section reviews Gaussian processes (GPs) and how to use them to construct frequentist confidence intervals, as well as relevant prior work on safe exploration (SafeOpt).

Gaussian Processes
We model our unknown objective and constraint functions using Gaussian process regression (GPR) [30]. In GPR, our prior belief is captured by a GP, which is fully determined by a prior mean function 2 and a covariance function k(a, a′). Importantly, if the observations are corrupted by i.i.d. Gaussian noise with variance σ², i.e., y_i = f(a_i) + v_i with v_i ∼ N(0, σ²), the posterior over f is also a GP whose mean and variance can be computed in closed form. Let us denote with Y_n ∈ R^n the array containing n noisy observations of f; then the posterior of f at ā is

μ_n(ā) = k_n(ā) (K_n + σ² I_n)^{-1} Y_n, (6a)
σ_n²(ā) = k(ā, ā) − k_n(ā) (K_n + σ² I_n)^{-1} k_n^T(ā). (6b)

The entry (i, j) ∈ {1, . . . , n} × {1, . . . , n} of the covariance matrix K_n ∈ R^{n×n} is k(a_i, a_j), k_n(ā) = [k(ā, a_1), . . . , k(ā, a_n)] captures the covariance between ā and the data, and I_n is the n × n identity matrix.
Eq. (6) considers the case where f is a scalar function. To model the objective f and constraints g i , we use the selector function from Eq. (3).
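The closed-form posterior of Eq. (6) can be sketched as follows; the squared-exponential kernel with unit variance and all hyperparameters are illustrative choices, not the ones used in the paper:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(a, a') between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(A_train, y_train, A_query, noise_var=0.01):
    """Posterior mean (6a) and variance (6b) at the query points."""
    K_inv = np.linalg.inv(rbf_kernel(A_train, A_train)
                          + noise_var * np.eye(len(A_train)))
    k_q = rbf_kernel(A_query, A_train)           # k_n(a_bar)
    mean = k_q @ K_inv @ y_train                 # Eq. (6a)
    # Eq. (6b); k(a, a) = 1 for the unit-variance RBF kernel
    var = 1.0 - np.einsum("ij,jk,ik->i", k_q, K_inv, k_q)
    return mean, var
```

In practice one would use a Cholesky factorization instead of an explicit inverse, and model h(a, i) jointly over all i via the selector function of Eq. (3).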

Frequentist Confidence Intervals
To avoid failures, we must determine the safety of a given policy before evaluating it. To this end, we reason about plausible worst-case values of the constraints g_i for a new policy a. We use the posterior distribution over the objective and constraints given by Eq. (6) to build frequentist confidence intervals that hold with high probability, i.e., at least 1 − δ, and are of the form:

|h(a, i) − μ_n(a, i)| ≤ β_n^{1/2} σ_n(a, i) for all a ∈ A, i ∈ I, n ≥ 1. (7)

For functions fulfilling Assumptions 2.2 and 2.3, [31; 32] derive an appropriate value for β_n. This value depends on δ, n, and the maximum information gain γ_n, cf. [33] 3.

SafeOpt for Model-Free Safe Exploration
SafeOpt leverages the confidence intervals presented in Section 3.2 to solve black-box constrained optimization problems while guaranteeing safety for all iterates with high probability. It ensures safety by limiting its evaluations to a set of provably safe inputs. In particular, SafeOpt defines the lower bound of the confidence interval l_n as l_n(a, i) = max{l_{n−1}(a, i), μ_{n−1}(a, i) − β_n^{1/2} σ_{n−1}(a, i)}, with l_0(a, i) = 0 for all a ∈ S_0, i ∈ I_g and −∞ otherwise, and the upper bound u_n as u_n(a, i) = min{u_{n−1}(a, i), μ_{n−1}(a, i) + β_n^{1/2} σ_{n−1}(a, i)} with u_0(a, i) = ∞ for all a ∈ A, i ∈ I. Given a set of safe parameters S_{n−1}, it then infers the safety of nearby parameters by combining the confidence intervals with the Lipschitz continuity of the constraints:

S_n = S_{n−1} ∪ {a ∈ A | ∃ a′ ∈ S_{n−1} : l_n(a′, i) − L_a ∥a − a′∥ ≥ 0 for all i ∈ I_g}, (8)

with L_a the joint Lipschitz constant of f(a) and g_i(a). This leads to a local expansion of the safe set. Thus, in the case of disconnected safe regions, the optimum discovered by SafeOpt may be local (see Fig. 1).
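Over a discrete parameter grid, one update of the Lipschitz-based safe set from Eq. (8) can be sketched as follows; the lower bounds l_n and the constant L_a are assumed given:

```python
import numpy as np

def expand_safe_set(params, safe_mask, lower_bounds, L_a):
    """One Lipschitz-based safe-set update, cf. Eq. (8).

    params:       (m, d) candidate parameters.
    safe_mask:    (m,) bool, current safe set S_{n-1}.
    lower_bounds: (m, q) lower confidence bounds l_n(a, i).
    L_a:          joint Lipschitz constant of the constraints.
    """
    new_mask = safe_mask.copy()
    for j in range(len(params)):
        if new_mask[j]:
            continue
        for s in np.where(safe_mask)[0]:
            dist = np.linalg.norm(params[j] - params[s])
            # a is added if some already-safe a' certifies all constraints at a
            if np.all(lower_bounds[s] - L_a * dist >= 0):
                new_mask[j] = True
                break
    return new_mask
```

Since the certificate only propagates from parameters that are already safe, the set grows locally, which is exactly why disconnected safe regions remain unreachable for SafeOpt.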

GoSafeOpt
In this section, we present our algorithm, GoSafeOpt, which combines the sample efficient local exploration of SafeOpt with global exploration to safely discover globally optimal policies for dynamical systems. To the best of our knowledge, GoSafeOpt is the first model-free algorithm that can globally search for optimal policies, guarantee safety during exploration, and is applicable to complex hardware systems.

The algorithm
GoSafeOpt consists of two alternating stages, local safe exploration (LSE) and global exploration (GE). In LSE, we explore the safe portion of the parameter space connected to our current estimate of the safe set. Crucially, we exploit the Markov property to learn backup policies for each state we visit during LSE experiments. During GE, we evaluate potentially unsafe policies in the hope of identifying new, disconnected safe regions. The safety of this step is guaranteed by triggering the backup policies learned during LSE whenever necessary. If a new disconnected safe region is identified, we switch to an LSE step. Otherwise, GoSafeOpt terminates and recommends the optimum a* = arg max_{a ∈ S_n} l_n(a, 0).
In the following, we explain the LSE and GE stages in more detail and provide their pseudocode in Algorithms 1 and 2, respectively. Algorithm 4 presents the pseudocode for the full GoSafeOpt algorithm.

Local Safe Exploration
Similar to SafeOpt, during LSE we restrict our evaluations to provably safe policies, i.e., policies in the safe set, which is initialized with the safe seed from Assumption 2.1 and is updated recursively according to Eq. (8) (line 4 in Algorithm 1). We focus our evaluations on two relevant subsets of the safe set introduced in [16]: the maximizers M_n, i.e., plausibly optimal parameters, and the expanders G_n, i.e., parameters that, if evaluated, could optimistically enlarge the safe set. For their formal definitions, see [16] or Appendix D. During LSE, we evaluate the most uncertain parameter, i.e., the parameter with the widest confidence interval, among the expanders and the maximizers: a_n = arg max_{a ∈ G_n ∪ M_n} max_{i ∈ I} w_n(a, i), where w_n(a, i) = u_n(a, i) − l_n(a, i).
As a by-product of these experiments, GoSafeOpt learns backup policies for all the states visited during these rollouts by leveraging the Markov property. Intuitively, for any state x(t) visited when deploying a safe policy a starting from x 0 , we know that the subtrajectory {x(τ )} τ ≥t is also safe because of Assumption 2.5. Moreover, this sub-trajectory is safe regardless of how we reach x(t) since the state is Markovian. Thus, a is a valid backup policy for x(t).
This means we learn backup policies for multiple states during a single LSE experiment. To make them available during GE, we introduce the set of backups B_n ⊆ A × X. After running an experiment with policy a, we collect all the discrete state measurements in the rollout R = {(a, x(k∆t))}_{k ≥ 0} and add them to the set of backups, B_{n+1} = B_n ∪ R. We perform LSE until the connected safe set is fully explored and the optimum within the safe set is discovered. Intuitively, this happens when we have learned our constraint and objective functions with high precision, i.e., when the uncertainty among the expanders and maximizers is less than ϵ, and yet the safe set does not expand any further:

max_{a ∈ G_n ∪ M_n} max_{i ∈ I} w_n(a, i) < ϵ. (10)

Note, GoSafeOpt, like SafeOpt, only explores the connected safe set in the parameter space and learns backup policies via the Markov property.
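The backup bookkeeping described above amounts to a few lines; the data structures are illustrative:

```python
def collect_backups(backup_set, policy_param, rollout_states):
    """After a safe LSE experiment with parameter `policy_param`, every
    visited state x(k*dt) becomes a state for which `policy_param` is a
    valid backup (Markov property plus Assumption 2.5)."""
    for x in rollout_states:
        backup_set.add((policy_param, tuple(x)))
    return backup_set
```

A single LSE rollout thus yields as many backup pairs as recorded state measurements, at no extra experimental cost.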

Global Exploration
GE aims at discovering new, disconnected safe regions. In particular, during a GE step, we evaluate the most uncertain parameter, i.e., with the highest value for max i∈Ig w n (a, i), outside of the safe set, a ∈ A \ S n . As this parameter is not in our safe set, it is not guaranteed to be safe. Therefore, we monitor the state during the experiment and trigger a backup policy, learned during LSE, if we cannot guarantee staying in a safe region of the state space when continuing with the current choice of policy parameters (cf. Fig. 3).
If a backup policy is triggered when evaluating the parameter a, we mark the experiment as failed. To avoid repeating the same experiment, we store a and the state x_Fail where we intervened in the sets E ⊂ A and X_Fail ⊂ X, respectively (see line 9 in Algorithm 2). Thus, during GE, we employ the following acquisition function:

a_n = arg max_{a ∈ A \ (S_n ∪ E)} max_{i ∈ I_g} w_n(a, i). (11)

This picks the most uncertain parameter, i.e., the parameter with the widest confidence interval, that is not provably safe but has not been shown to trigger a backup policy. If the experiment was run without triggering a backup, we know that a is safe. Therefore, we add the observed values for g_i and f to the dataset and the rollout R collected during the experiment to our set of backups B_n, i.e., B_{n+1} = B_n ∪ R. Furthermore, we add the parameter a to our safe set and update its lower bound, i.e., l_n(a, i) = 0 for all i ∈ I_g (see lines 12 and 13 in Algorithm 2). Then, we switch to LSE to explore the newly discovered safe area. Note, the lower bound is updated again before the LSE step, i.e., l_{n+1}(a, i) = max{l_n(a, i), μ_n(a, i) − β_{n+1}^{1/2} σ_n(a, i)} for all i ∈ I_g (see Algorithm 4, line 6). If A \ (S_n ∪ E) = ∅, there are no further safe areas we can discover and GE has converged.
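The GE acquisition rule reduces to a masked argmax over the confidence widths; the array shapes below are illustrative:

```python
import numpy as np

def ge_acquisition(widths, safe_mask, fail_mask):
    """Pick the most uncertain parameter outside S_n and E.

    widths:    (m, q) confidence widths w_n(a, i) for i in I_g.
    safe_mask: (m,) bool, current safe set S_n.
    fail_mask: (m,) bool, fail set E.
    Returns the chosen index, or None if no candidate remains.
    """
    candidates = ~(safe_mask | fail_mask)
    if not candidates.any():
        return None  # GE has converged
    score = widths.max(axis=1)
    score[~candidates] = -np.inf
    return int(np.argmax(score))
```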

Boundary Condition.
Throughout each GE experiment, we monitor the state evolution and, whenever a state measurement is received, we evaluate online a boundary condition to determine whether a backup policy should be triggered. Ideally, it must (i) guarantee safety, (ii) be fast to evaluate even for high-dimensional dynamical systems, and (iii) incorporate discrete-time measurements of the state. To fulfill requirement (i), the boundary condition leverages Lipschitz continuity of the constraints. In particular, when we are in x(t), we check if there is a point (a_s, x_s) in our set of backups B_n such that x_s is sufficiently close to x(t) to guarantee that a_s can steer the system back to safety from any state we may reach in the next time step. Boundary Condition: During iteration n, we trigger a backup policy at x if there is no point (a_s, x_s) in our set of backups B_n such that l_n(a_s, i) ≥ L_x (∥x − x_s∥ + Ξ) for all i ∈ I_g. In this case, we use the backup parameter a*_s with the highest safety margin, that is,

a*_s = arg max_{(a_s, x_s) ∈ B_n} min_{i ∈ I_g} ( l_n(a_s, i) − L_x ∥x − x_s∥ ).

Since we already calculate l_n(a_s, i) for all i ∈ I_g and a_s ∈ S_n offline to update the safe set (see Eq. (8)), we only need to evaluate ∥x − x_s∥ online, which is computationally tractable for most real-world systems (e.g., O(s) for the 2-norm, where s is the dimension of X). Thereby, the boundary condition satisfies requirement (ii) and enables the application of our algorithm to complex systems with high sampling frequencies. The boundary condition is summarized in Algorithm 3.
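A sketch of the online check, assuming the lower bounds l_n(a_s, ·) have been precomputed offline (here passed as a dict keyed by hashable parameters):

```python
import numpy as np

def boundary_condition(x, backups, lower_bounds, L_x, Xi):
    """Return (trigger, best_backup) for the current state x.

    backups:      iterable of (a_s, x_s) pairs from B_n.
    lower_bounds: a_s -> array of l_n(a_s, i), i in I_g (precomputed).
    L_x, Xi:      Lipschitz constant in x and one-step movement bound.
    """
    best_margin, best_backup = -np.inf, None
    for a_s, x_s in backups:
        # worst-case constraint margin if we switch to a_s in the next step
        margin = np.min(lower_bounds[a_s]) - L_x * (np.linalg.norm(x - x_s) + Xi)
        if margin > best_margin:
            best_margin, best_backup = margin, a_s
    trigger = bool(best_margin < 0)  # no backup certifies the next step
    return trigger, best_backup
```

Only the distances ∥x − x_s∥ are computed online, so the per-step cost is linear in |B_n| and in the state dimension.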

Algorithm 2 Global Exploration (GE)
Input: Safe set S, confidence intervals C, set of backups B, dataset D, fail sets E, X_Fail
1: Recommend global parameter a_n with Eq. (11)
2: a = a_n, x_Fail = ∅, Boundary = False
3: while Experiment not finished do // Rollout policy
4:   dx(t) = z(x(t), π(x(t); a)) dt
5:   if Not Boundary then // Not at boundary yet
6:     Boundary, a*_s = Boundary Condition(x(t), B)
7:     if Boundary then // Trigger backup policy
...
Return: S, C, B, D, E, X_Fail

Algorithm 3 Boundary Condition
Updating Fail Sets. Parameters for which the boundary condition is triggered, i.e., parameters evaluated unsuccessfully during GE, are added to the fail set E. However, when LSE is repeated after discovering a new region during GE, we can learn new backup policies, which makes the boundary condition less restrictive. Hence, for a parameter a ∈ E for which a backup policy was triggered during a previous GE step, we might no longer trigger a backup policy once the LSE step has converged in the new safe region. Thus, after learning new backup policies during LSE, we re-evaluate the boundary condition (line 3) and update E and X_Fail accordingly. The corresponding parameters may then be revisited during further GE steps.
In summary, GoSafeOpt involves two alternating stages, LSE and GE. LSE steps are similar to SafeOpt; nonetheless, they additionally leverage the Markov property of the system to learn backup policies. These backup policies are then used in GE for global exploration. The only other model-free safe exploration method that explores globally is GoSafe. However, it evaluates a completely different and expensive boundary condition, which relies on a safe set representation in the parameter and state space. This safe set is actively explored. Because of the active exploration and the expensive boundary condition, GoSafe is restricted to systems with low-dimensional state spaces.
Remark. GoSafeOpt is devised for the episodic RL setting where the initial state x_0 is fixed and known. In several applications, the initial state is not known a priori and is instead sampled i.i.d. from a state distribution ρ. Our formulation can also be extended to this setting by treating the initial state as a context variable, cf. [34]. Moreover, to guarantee safety in this setting, Assumption 2.1 has to be modified such that the parameters in the initial safe seed S_0 are safe for all initial states in the support of ρ, i.e., x′_0 ∈ supp(ρ). Then, given a context/initial state x′_0, the acquisition function for LSE or GE is optimized for that context. This is similar to the contextual SafeOpt algorithm [18]. The boundary condition can also be extended to incorporate the context. Finally, for a continuous state space, supp(ρ) can be discretized similarly to GoSafe.

Theoretical Results
This section provides safety (Section 4.2.1) and optimality (Section 4.2.2) guarantees for GoSafeOpt.

Safety Guarantees
The main safety result for our algorithm is that GoSafeOpt guarantees safety during all experiments.

Theorem 4.1. Let Assumptions 2.1 to 2.5 hold. Then, with probability at least 1 − δ, all experiments conducted by GoSafeOpt are safe in the sense of Definition 2.6.
The proof of this theorem is provided in Appendix A.1. Intuitively, we can analyze the safety of LSE and GE separately. For LSE, we can leverage the results in [18], which studies it extensively. Therefore, novel to our analysis is the safety of GE. We show that while running experiments during GE, we can guarantee that if our boundary condition triggers a backup, we are safe, and if a backup is not triggered, then the experiment is safe, i.e., we discovered a new safe parameter.

Optimality Guarantees
Next, we analyze when GoSafeOpt can find the safe global optimum a * , which is the solution to Eq. (2). During LSE, we explore the connected safe region. For each safe region we explore, we can leverage the results from [18] to prove local optimality. Furthermore, due to GE, we can discover disconnected safe regions and then repeat LSE to explore them. To this end, we define when a parameter a can be discovered by GoSafeOpt (either during LSE or during GE).
Informally, a parameter is discoverable if it lies in the largest safe set we can safely reach from A (see Eq. (A.13) in Appendix A.2 or [18; 21]).
In practice, GoSafeOpt tends to find better controllers than SafeOpt, which converges after LSE. This is formalized in the following proposition. Remark. The safety threshold δ encodes the designer's appetite for unsafe evaluations. For a large value of δ, more parameters are available for sampling at each iteration. Accordingly, the method converges faster, but it also allows more unsafe evaluations; see [18] for more detail.

Practical Modifications
In practice, we can further improve the sample and computational efficiency by introducing minor modifications. While these modifications forfeit our optimality guarantees, they yield good results in our evaluation in Section 5. Furthermore, none of them affects the safety guarantees of the method, so they can be safely applied in practice.

Fixing Iterations for Each Stage
In Algorithm 4, we perform a global search, i.e., GE, after the convergence of LSE. Nonetheless, it may be beneficial to run LSE for a fixed number of steps and then switch to GE before LSE's convergence. This heuristic allows for the early discovery of disconnected safe regions, which may improve sample efficiency. Moreover, it allows "jumping" between different safe regions of the domain that, even though they would be connected if we ran the current LSE to convergence, are currently disconnected. To this end, we apply the following heuristic scheme: (i) run LSE for n_LSE steps, (ii) run GE for n_GE steps or until we have discovered new safe parameters, and (iii) if GE discovers a new region, return to (i). Else, return to (i) after GE completion, but with reduced n_LSE. Note, the proposed scheme still retains optimality because we do not restrict the total number of interactions with the system. However, in practice, we additionally impose an upper bound on the interactions; then n_LSE and n_GE influence the budget of global and local exploration, which affects optimality (cf. Appendix C).
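The scheme (i)-(iii) can be sketched as a simple budgeted loop; `lse_step` and `ge_step` stand in for one LSE or GE experiment, and all step counts are heuristic choices:

```python
def run_gosafeopt(lse_step, ge_step, budget, n_lse=10, n_ge=5, shrink=2):
    """Alternate LSE and GE under a total experiment budget.

    lse_step() runs one LSE experiment; ge_step() runs one GE experiment
    and returns True if a new safe region was discovered.
    """
    used = 0
    while used < budget:
        for _ in range(n_lse):                 # (i) local safe exploration
            if used >= budget:
                return used
            lse_step()
            used += 1
        discovered = False
        for _ in range(n_ge):                  # (ii) global exploration
            if used >= budget:
                return used
            discovered = ge_step()
            used += 1
            if discovered:
                break                          # (iii) new region: back to LSE
        if not discovered:
            n_lse = max(1, n_lse // shrink)    # reduce the local budget
    return used
```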

Updated Boundary Condition
If required, the boundary condition can be further modified to reduce computation time by considering only a subset of the states collected from experiments. The updated boundary condition reduces the online computation time at the expense of a more conservative boundary condition. Due to this conservatism, we lose our optimality guarantees. In practice, however, we still achieve good results (see Section 5).
Definition 4.5. Consider η_l ∈ R and η_u ∈ R such that η_l < η_u. The interior set Ω_I,n and marginal set Ω_M,n are defined as

Ω_I,n = {x_s ∈ X | ∃ (a, x_s) ∈ B_n : l_n(a, i) ≥ η_u for all i ∈ I_g},
Ω_M,n = {x_s ∈ X | ∃ (a, x_s) ∈ B_n : η_l ≤ l_n(a, i) < η_u for all i ∈ I_g}.
The interior set contains the points in our set of backups B_n that are safe with high tolerance η_u, whereas the points in the marginal set are safe with a smaller tolerance η_l. We use those sets for the updated boundary condition. Updated Boundary Condition: Consider d_l ∈ R and d_u ∈ R such that d_l < d_u. We trigger a backup policy at x if there is no point x_s ∈ Ω_I,n such that ∥x − x_s∥ ≤ d_u or no point x′_s ∈ Ω_M,n such that ∥x − x′_s∥ ≤ d_l. In this case, we use the backup parameter

a*_s = arg max_{{a_s ∈ A | (a_s, x_s) ∈ B_n}} min_{i ∈ I_g} l_n(a_s, i), with x_s = arg min_{x′_s ∈ Ω_I,n ∪ Ω_M,n} ∥x − x′_s∥.

Intuitively, we define distance tolerances d_u and d_l for points in B_n based on their safety tolerances η_u and η_l. As for Theorem 4.1, we can derive appropriate values for η_u, d_u, respectively η_l, d_l, to guarantee safety.
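Taking the stated condition literally, the updated check reduces to two nearest-neighbor distance tests; in a real implementation the sets Ω_I,n and Ω_M,n would be maintained offline:

```python
import numpy as np

def updated_boundary_condition(x, interior, marginal, d_u, d_l):
    """Distance-only boundary check over the interior/marginal backup
    states of Definition 4.5, following the condition as stated in the
    text: trigger a backup unless x is within d_u of some interior point
    and within d_l of some marginal point."""
    near_interior = any(np.linalg.norm(x - xs) <= d_u for xs in interior)
    near_marginal = any(np.linalg.norm(x - xs) <= d_l for xs in marginal)
    return not (near_interior and near_marginal)
```

Compared to the original boundary condition, this trades the per-backup Lipschitz margins for fixed distance thresholds, which is cheaper to evaluate but more conservative.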

Evaluation
We evaluate GoSafeOpt in simulated and real experiments on a Franka Emika Panda seven degree of freedom (DOF) robot arm 4 (see Fig. 2 and 4).

Figure 4: Setup for our evaluation in Section 5. We consider a safety-critical path following problem where deviations from the desired path (blue) could cause the robot to hit the wall (red box) and incur damage.

The objective of our experiments is to demonstrate that GoSafeOpt (i) can be applied to systems with high-dimensional state spaces, (ii) is successful in safely tuning control parameters in common robotic tasks such as path following with manipulators, and (iii) is superior to the existing state-of-the-art method, SafeOpt, for safe control parameter tuning of real-world robotic systems.
Accordingly, in our results, we show that GoSafeOpt can scale to high-dimensional systems, jump to disconnected safe regions while guaranteeing safety, and is directly applicable to hardware tasks with high sampling frequencies. In this work, we do not consider very high-dimensional parameter spaces, which are in themselves challenging for methods such as SafeOpt. The methods in [35] and [22] alleviate this challenge and can easily be integrated with our algorithm. Thus, we concentrate on the novelty of our method: its globally safe parameter exploration, unlike SafeOpt, and its scalability to high-dimensional state spaces, unlike GoSafe. Specifically, the state space of the systems we consider in this section is too large for GoSafe, so it cannot be applied to any of our problems.
Details on the objective and constraint functions are provided in Appendix B. The hyperparameters of our experiments are listed in Appendix E.
In all experiments with the robot arm, we solely control the position and velocity of the end-effector. To this end, we consider an operational space impedance controller [36] with impedance gain K (see Appendix B). The state space for our problem is six-dimensional. This is prohibitively large for GoSafe, which struggles with state spaces of dimension greater than three [22]. Therefore, we compare our method with SafeOpt.
Impedance controllers for manipulators are usually tuned manually. This is often a tedious and time-consuming process. Accordingly, we show in our results that GoSafeOpt can be used to automate this tuning safely.

Simulation Results
We first evaluate GoSafeOpt in a simulation environment based on the Mujoco physics engine [37] 5 . For this, we consider two distinct tasks: (i) reaching a desired position, and (ii) path following. We determine the impedance gain through an approximate model of the system and perform feedback linearization [36]. For the resulting linear system, we design a linear-quadratic regulator (LQR) [38] with quadratic costs that are parameterized by matrices Q ∈ R^{n×n} and R ∈ R^{p×p}. Since the model is inaccurate, the feedback linearization will not cancel all nonlinearities and the LQR will not be optimal. Thus, our goal is to tune the cost matrices Q and R to compensate for the model mismatch. This approach is similar to [9]. We evaluate our methods over twenty independent runs of 200 iterations.

Task 1: Reaching a Desired Position
We select a target x_des ∈ R³ for the robot. For this task, we parameterize the matrices Q and R by two parameters (q_c, r) ∈ [2, 6] × [−3, 3], which trade off accurate tracking, i.e., large q_c, against small inputs, i.e., large r. We choose the objective function to encourage reaching the target as fast as possible while penalizing large end-effector velocities and control actions (see Appendix B.1 for details). Thus, in total, we have an eight-dimensional task (six-dimensional state space and two-dimensional parameter space). For analysis purposes, we run a simple grid search, which we could not run outside of simulation, to get an estimate of the safe set and the global optimum. Fig. 5 depicts the ϵ-precise (ϵ = 0.1) safe set observed via grid search. From the figure, we observe that there is a disconnected safe region. Evaluation: Fig. 5 depicts the safe sets of SafeOpt and GoSafeOpt after 200 learning iterations. We see that SafeOpt cannot discover the disconnected safe region and hence is stuck at a local optimum. On the other hand, GoSafeOpt discovers the disconnected regions and can jump between connected safe sets. The learning curves of the two methods are depicted in Fig. 7. Our method performs considerably better than SafeOpt. The optimum found by our method lies within 0.007 (less than ϵ = 0.1) of the optimum found via the grid search. SafeOpt cannot significantly improve over the initial policy. This is because the initial safe seed S_0 already contains a near-optimal policy from the connected region SafeOpt explores, i.e., max_{a∈S_0} f(a) ≈ max_{a∈R̄^c_ϵ(S_0)} f(a). Lastly, our method also achieves comparable safety to SafeOpt (on average 99.9% compared to 100%). We encounter the failures during LSE, which corresponds to the exploration SafeOpt performs; hence, one could expect similar behavior from SafeOpt if it were initialized in the upper region.
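For illustration, the two-parameter cost design above can be sketched as follows. The paper's exact mapping from (q_c, r) to Q and R is in Appendix B.1; the diagonal log10-scale mapping below is our own simplifying assumption for the sketch.

```python
def cost_matrices(q_c, r, n=6, p=3):
    """Map the two tuning parameters to LQR cost matrices. The paper's
    exact mapping is in Appendix B.1; as a simplification we assume
    diagonal matrices on a log10 scale: Q = 10**q_c * I_n (tracking
    accuracy), R = 10**r * I_p (input penalty)."""
    assert 2 <= q_c <= 6 and -3 <= r <= 3, "parameter box from the paper"
    Q = [[10.0 ** q_c if i == j else 0.0 for j in range(n)] for i in range(n)]
    R = [[10.0 ** r if i == j else 0.0 for j in range(p)] for i in range(p)]
    return Q, R
```

Larger q_c emphasizes accurate tracking, larger r penalizes control effort, matching the trade-off described above.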
Remark. We can increase β_n to encourage conservatism and avoid all unsafe evaluations. However, this also influences the algorithm's convergence rate. Hence, in practice, β_n has to be selected based on the task and the appetite for unsafe evaluations.

Task 2: Path Following Task
For this experiment, we define a parameterized path x_d(ρ(t)) for the robot arm to follow. Here, we define ρ(t) as a state that indicates progress along the trajectory, i.e., x_d(0) = x_0, x_d(1) = x_des. The evolution of ρ(t) ∈ [0, 1] is controlled by a parameter a_ρ ∈ [0, 1], that is, ρ(t) = min{t(a_ρ(1/100 − 1/500) + 1/500), 1}. The objective is to find optimal control parameters for Q, R, and a_ρ such that we progress along x_d(·) as fast as possible while ensuring that ∥x − x_d(ρ(t))∥_2 ≤ ζ. In this example, we model Q, R using three parameters, q_c, r, κ_d, where κ_d ∈ [0, 1] weighs the velocity cost relative to the positional cost of our state in the Q matrix (cf. Appendix B.1). Together with a_ρ as a parameter, this task is eleven-dimensional, with seven states (including ρ) and four parameters. This problem incorporates a challenging trade-off between fast trajectories and high tracking performance. We compare GoSafeOpt to SafeOptSwarm [35], a scalable version of SafeOpt for larger parameter spaces that uses adaptive discretization. The results are presented in Fig. 8. Our results again show that GoSafeOpt performs considerably better than SafeOpt, specifically SafeOptSwarm. Furthermore, both SafeOpt and GoSafeOpt give 100% safety over all 20 runs. We also compare our method with expected improvement with constraints (EIC) [12] in Fig. 9. EIC discourages potentially unsafe regions but allows for unsafe evaluations. Our results show that EIC and GoSafeOpt attain similar performance. However, EIC has considerably more unsafe evaluations (on average more than fifteen) than GoSafeOpt, which has none.
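The progress dynamics ρ(t) = min{t(a_ρ(1/100 − 1/500) + 1/500), 1} can be computed directly; a small sketch:

```python
def rho(t, a_rho):
    """Progress variable along the desired path x_d(rho): rho = 0 at the
    start and rho = 1 at the end. The progress rate is interpolated
    between 1/500 (slowest, a_rho = 0) and 1/100 (fastest, a_rho = 1)."""
    assert 0.0 <= a_rho <= 1.0
    rate = a_rho * (1 / 100 - 1 / 500) + 1 / 500
    return min(t * rate, 1.0)
```

With a_ρ = 1 the path is completed after roughly 100 time steps, with a_ρ = 0 after roughly 500, which is the trade-off between speed and tracking accuracy discussed above.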

Hardware Results
While the simulation results already showcased the general applicability of GoSafeOpt to high-dimensional systems and its ability to discover disconnected safe regions, we now demonstrate that it can also safely optimize policies on real-world systems. Control Task: We consider a path following task (see the experimental setup in Fig. 4) and model the impedance gain K as a diagonal matrix with entries K_x = α_x K_{r,x}, where K_{r,x} > 0 is a reference value used for Franka's impedance controller and α_x ∈ [0, 1.2] is the parameter we would like to tune (and analogously for y and z). Accordingly, α_{x,y,z} = 1 corresponds to the impedance controller provided by the manufacturer. The parameter space we consider for this task is [0, 1.2]³. We require the controller to follow the known desired path while avoiding the wall depicted in Fig. 4. Optimization Problem: We choose our objective function to encourage tracking the desired path as accurately as possible and impose a constraint on the end-effector's distance from the wall (see Appendix B.2 for more details). We receive a measurement of the state at 250 Hz and evaluate the boundary condition during GE at 100 Hz. Evaluation: The parameter space for this task is three-dimensional. Therefore, we compare our method to SafeOptSwarm [35] and run only 50 iterations for each algorithm in three independent runs. We choose a_0 = (0.6, 0.6, 0.6) as our initial policy. During our experiments, both GoSafeOpt and SafeOptSwarm provide 100% safety in all three runs. For GoSafeOpt, safety during GE is preserved by triggering a backup policy if required. One such instance is shown in Fig. 6. We see in Fig. 10 that GoSafeOpt performs considerably better than SafeOptSwarm. In particular, even though we cannot prove the existence of disconnected safe regions for this task, GoSafeOpt still finds a better policy due to GE. Interestingly, the optimal value suggested by GoSafeOpt for both α_x and α_y is 1.2.
Therefore, in the direction of our path, GoSafeOpt suggests aggressive controls to reduce the tracking error. Moreover, the controller suggested by GoSafeOpt is more aggressive than the manufacturer's reference controller (α_x = 1.0, α_y = 1.0) and tracks the trajectory better.
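The per-axis gain parameterization K_x = α_x K_{r,x} from the control task above can be sketched as follows; the reference values passed in are hypothetical, not Franka's actual gains.

```python
def impedance_gain(alpha, k_ref):
    """Per-axis impedance gains K_i = alpha_i * K_ref_i for i in (x, y, z).
    alpha = (1, 1, 1) recovers the manufacturer's reference controller;
    each alpha_i is tuned within the box [0, 1.2]^3 from the paper."""
    assert all(0.0 <= a <= 1.2 for a in alpha)
    return tuple(a * k for a, k in zip(alpha, k_ref))
```

The initial policy a_0 = (0.6, 0.6, 0.6) thus corresponds to running the arm at 60% of the reference stiffness on each axis.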

Choosing Hyperparameters
GoSafeOpt, like many safe exploration BO algorithms such as SafeOpt and GoSafe, makes assumptions on prior knowledge of the system (see Section 2.1). These assumptions are crucial for the theoretical guarantees, yet they are hard to verify in practice. Nevertheless, safe exploration BO methods have been successfully and safely applied to a large breadth of applications [23; 39-42]. In our case, we leverage the available simulator to obtain a range for the hyperparameters: kernel parameters, β_n, and the distance metric for the boundary condition. Then, with β_n fixed, we fine-tune the remaining parameters by performing controlled safe experiments on the hardware. Even though this approach gives good results, recent work from [43] investigates the hyperparameter selection problem for safe BO more systematically. In general, there are a few other works which investigate the gap between theory and practice [28; 44]. Nonetheless, given the potential of these algorithms for reliable and safe artificial intelligence (AI), we acknowledge that future research on bridging this gap is needed.

Conclusion
This work proposes GoSafeOpt, a novel model-free learning algorithm for global safe optimization of policies for complex dynamical systems with high-dimensional state spaces. We provide high-probability safety guarantees for GoSafeOpt and show that it provably performs better than SafeOpt, a state-of-the-art model-free safe exploration algorithm. We demonstrate the superiority of our algorithm over SafeOpt empirically through our experiments. GoSafeOpt can handle more complex and realistic dynamical systems than existing model-free learning methods for safe global exploration, such as GoSafe. This is due to the combination of an efficient passive discovery of backup policies that leverages the Markov property of the system with a novel and efficient boundary condition to detect when to trigger a backup policy. Future extensions could design hybrid algorithms that leverage the Markov property and actively explore the state space. Moreover, while GoSafeOpt is designed for efficient and safe controller tuning, we believe it can be applied to other dynamical systems, e.g., in legged robotics, where controller parameter tuning is a crucial component [45].

A. Proofs of Theoretical Results
In this section, we provide proofs for the theoretical results stated in the main body of the paper. In the following, we denote by k discrete time indices and by t continuous ones. This difference is important because, while we obtain state measurements at discrete times, we need to preserve safety at all times. Moreover, similarly to the notation in GoSafe [21], we denote by ξ_(t,x(t),a) the set of states visited by the trajectory that starts at state x(t) at time t under the policy with parameters a.

A.1. Safety Guarantees
In the following, we prove Theorem 4.1, which gives the safety guarantees for GoSafeOpt. Since GoSafeOpt has two stages, LSE and GE, we can study their safety separately. For LSE, [18] provides safety guarantees. Therefore, here we focus on the safety guarantees for GE and then show that combining both will guarantee the safety of the overall algorithm. To this end, we first make a hypothesis on our safe set S n and confidence bounds l n (a, i) and u n (a, i).
Hypothesis A.1. Let S_n ≠ ∅. The following properties hold for all i ∈ I_g, n ≥ 0 with probability at least 1 − δ:
∀a ∈ S_n : g_i(a, x_0) ≥ 0, (A.1)
∀a ∈ A : l_n(a, i) ≤ g_i(a, x_0) ≤ u_n(a, i). (A.2)
We leverage this hypothesis to prove that we are safe during GE, and then we show that it is satisfied for GoSafeOpt. In particular, during LSE, [18] proves that our hypothesis is fulfilled. Hence, before GE, the safe set and the confidence intervals satisfy it. In the following, we show that the updates of the safe sets and the confidence intervals implemented by GE also satisfy our hypothesis, which is sufficient to conclude that the hypothesis is satisfied for all n ≥ 0 (we will make this concrete in Lemma A.9).
During GE, we receive measurements of the state in discrete times and evaluate our boundary condition to trigger a backup policy if necessary. Therefore, we first show that even with discrete-time measurements, we can still guarantee safety in continuous time.
Lemma A.2. Let Assumptions 2.4 and 2.5 hold and let k+ ≥ k− ≥ 0 be arbitrary integers. If, for all integers k ∈ [k−, k+], there exists a_s ∈ A such that g_i(a_s, x(k)) ≥ L_x Ξ for all i ∈ I_g, then ḡ_i(x(t)) ≥ 0 for all t ∈ [k−∆t, (k+ + 1)∆t] and i ∈ I_g.
Proof. By choice of the sampling scheme, the state x(k) measured in discrete time corresponds to the state x(k∆t) in continuous time. Hence, g_i(a_s, x(k)) = g_i(a_s, x(k∆t)). Consider some k ≥ 0 and a_s ∈ A such that g_i(a_s, x(k∆t)) ≥ L_x Ξ. For any t ∈ [k∆t, (k + 1)∆t] we have
g_i(a_s, x(k∆t)) − g_i(a_s, x(t)) ≤ L_x ∥x(k∆t) − x(t)∥ (Lipschitz continuity, Assumption 2.2)
≤ L_x Ξ. (Assumption 2.4)
Now, since g_i(a_s, x(k∆t)) ≥ L_x Ξ, we have, for all t ∈ [k∆t, (k + 1)∆t] and i ∈ I_g,
g_i(a_s, x(t)) ≥ g_i(a_s, x(k∆t)) − L_x Ξ ≥ 0. (A.3)
For our choice of constraints (Assumption 2.5), this implies ḡ_i(x(t)) ≥ 0 for all i ∈ I_g and t ∈ [k∆t, (k + 1)∆t]. Finally, since this holds for all integers k with k− ≤ k ≤ k+, it also holds for all t ∈ [k−∆t, (k+ + 1)∆t]. We have thus established a condition that guarantees ḡ_i(x(t)) ≥ 0 for all i ∈ I_g over a given time interval.
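The discrete-time condition behind Lemma A.2 amounts to checking, at every sampling step, that some backup satisfies all constraints with margin L_x Ξ. A minimal sketch, with a hypothetical data layout where `g_values[k]` holds one constraint vector per available backup at step k:

```python
def margin_check(g_values, L_x, Xi):
    """Discrete-time check behind Lemma A.2: at every sampling step, some
    backup parameter must satisfy all constraints with margin L_x * Xi,
    which absorbs the worst-case drift of the state between samples.
    g_values[k] is a list of constraint vectors, one per backup, at step k."""
    threshold = L_x * Xi
    return all(
        any(min(g_vec) >= threshold for g_vec in backups_at_k)
        for backups_at_k in g_values
    )
```

If `margin_check` holds for every sampled step, the lemma guarantees ḡ_i(x(t)) ≥ 0 in continuous time over the whole interval.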
We collect parameter and state combinations during rollouts in our set of backups B n . The intuition here is that for a Markovian system, all states visited during a safe experiment are also safe. This is important as it allows GoSafeOpt to learn backup policies for multiple states without actively exploring the state space. We formalize this in the following proposition.
Proof. The system in Eq. (1) is Markovian, i.e., for any x(t_1) ∈ ξ_(0,x_0,a) and x(t_2) ∈ ξ_(0,x_0,a) with t_2 > t_1 > 0, a trajectory starting in x(t_1) will always result in the same state evolution, independent of how we arrived at x(t_1). Combining this with Assumption 2.5, we get
g_i(a, x(t_1)) = min_{t ≥ t_1} ḡ_i(x(t)) ≥ min_{t ≥ 0} ḡ_i(x(t)) = g_i(a, x_0).
In the following, we show that g_i(a_s, x_0) is a lower bound for all points (a_s, x_s) in B_n, i.e., g_i(a_s, x_s) ≥ g_i(a_s, x_0). This will play a crucial role in showing that we preserve safety whenever we trigger a backup policy.
Proof. Each point (a_s, x_s) in B_n is collected during a safe experiment (see Algorithm 1 line 3 and Algorithm 2 line 12). Therefore, x_s ∈ ξ_(0,x_0,a_s). The result then follows from Proposition A.3.
Corollary A.4 shows that l n (a s , i) is a conservative lower bound on g i (a s , x s ). Crucially, if we can observe not just the rollouts but also the constraint values g i (a s , x s ), we could model them with a GP to obtain a potentially less conservative lower bound. However, in our work, we only assume that we can measure g i (a s , x 0 ) (Assumption 2.3). Proposition A.3 and Corollary A.4 formalize how we collect our backup policies and leverage them in our boundary condition. In the following, we prove that experiments, where we trigger a backup policy, are safe. First, we show that if the boundary condition is triggered at a time step k * , then we are safe up until k * ∆t, i.e., time of trigger.
Lemma A.5. Let the assumptions from Theorem 4.1 and Hypothesis A.1 hold. If, during GE, the boundary condition from Algorithm 3 triggers a backup policy at time step k * > 0, then, for all t ≤ k * ∆t and i ∈ I g ,ḡ i (x(t)) ≥ 0 with probability at least 1 − δ.
Proof. Consider k < k * . Since the boundary condition (Algorithm 3) did not trigger a backup policy at k, we have ∃(a s , x s ) ∈ B n such that l n (a s , i) ≥ L x (∥x(k) − x s ∥ + Ξ) , ∀i ∈ I g .
(A.4) By Lipschitz continuity of g, Corollary A.4, and Hypothesis A.1, we have
g_i(a_s, x(k)) ≥ g_i(a_s, x_s) − L_x∥x(k) − x_s∥ ≥ l_n(a_s, i) − L_x∥x(k) − x_s∥ ≥ L_x Ξ
for all i ∈ I_g and k < k*. Therefore, we can use Lemma A.2 to prove the claim by choosing k− = 0 and k+ = k* − 1.
Lemma A.5 shows that up until the time we trigger our boundary condition, we are safe with enough tolerance (L_x Ξ) to guarantee safety. In the following, we show that if we trigger a safe backup policy at k*, we fulfill our constraints for all times after triggering.
Lemma A.6. Let the assumptions from Theorem 4.1 and Hypothesis A.1 hold, and assume that, during GE, the boundary condition from Algorithm 3 triggers a backup policy at time step k* ≥ 0. Then for all t ≥ k*∆t, ḡ_i(x(t)) ≥ 0 for all i ∈ I_g with probability at least 1 − δ.
Proof. We want to show that Eq. (A.6) finds a parameter a * s such that g i (a * s , x(k * )) ≥ 0. For k * = 0, this follows by definition because B n consists of safe rollouts (see Algorithm 1 line 3 and Algorithm 2 line 12) and thus, for all parameters a s in B n and i ∈ I g , we have g i (a s , x 0 ) ≥ 0.
Let us now consider any integer k* > 0. Following the same Lipschitz continuity-based arguments as in Lemma A.5, there exists (a_s, x_s) ∈ B_n such that, for all i ∈ I_g,
l_n(a_s, i) − L_x∥x_s − x(k*)∥ ≥ l_n(a_s, i) − L_x(∥x_s − x(k*−1)∥ + Ξ) ≥ 0, (A.7)
where the last inequality follows from the fact that the boundary condition was not triggered at time step k* − 1 (see Section 4.1.2). From Eq. (A.7) we can conclude that there exists a_s ∈ A such that for some x_s ∈ X, (a_s, x_s) ∈ B_n and l_n(a_s, i) − L_x∥x_s − x(k*)∥ ≥ 0 for all i ∈ I_g. Therefore, for a*_s recommended by Eq. (A.6), we have
g_i(a*_s, x(k*)) ≥ l_n(a*_s, i) − L_x∥x_s − x(k*)∥ ≥ 0.
Hence, g_i(a*_s, x(k*)) ≥ 0 for all i ∈ I_g with probability at least 1 − δ, which proves the claim.
Lemmas A.5 and A.6 show that, if we trigger a backup policy during GE, we can guarantee the safety of the experiment before and after switching to the backup policy, respectively.
Next, we prove that, if the backup policy is not triggered during GE with parameter a GE , then a GE is safe with high probability.
Lemma A.7. Let the assumptions from Theorem 4.1 and Hypothesis A.1 hold. If, during GE with parameter a GE , a backup policy is not triggered by our boundary condition, then a GE is safe with probability at least 1 − δ, that is, g i (a GE , x 0 ) ≥ 0 for all i ∈ I g .
Proof. Assume the experiment was not safe, i.e., there exists t ≥ 0 such that ḡ_i(x(t)) < 0 for some i ∈ I_g. Consider the time step k ≥ 0 such that t ∈ [k∆t, (k + 1)∆t]. Since the boundary condition was not triggered during the whole experiment, it was also not triggered at time step k. This implies (see Section 4.1.2) that there exists a point (a_s, x_s) ∈ B_n such that
l_n(a_s, i) − L_x(∥x_s − x(k)∥ + Ξ) ≥ 0, (A.8)
for all i ∈ I_g. Therefore, we have g_i(a_s, x(k)) ≥ L_x Ξ (Hypothesis A.1). Hence, from Lemma A.2 we have ḡ_i(x(t)) ≥ 0 for all i ∈ I_g. This contradicts our assumption that ḡ_i(x(t)) < 0 for some t ≥ 0 and i ∈ I_g.
The following Corollary summarizes the safety of GE.
Corollary A.8. Under the assumptions from Theorem 4.1 and Hypothesis A.1, GoSafeOpt is safe during GE, i.e., ḡ_i(x(t)) ≥ 0 for all t ≥ 0 and i ∈ I_g.
Proof. Two scenarios can occur during GE: (i) a backup policy is triggered at some time step k* ≥ 0, or (ii) the experiment is completed without triggering a backup policy. For the first case, Lemma A.5 guarantees that we are safe before we trigger the backup, and Lemma A.6 guarantees that we are safe after triggering the backup policy. For the second scenario, Lemma A.7 guarantees safety.
We have now shown that under the assumptions of Theorem 4.1 combined with Hypothesis A.1, we can guarantee that we are safe during GE, irrespective of whether we trigger a backup policy or not. We leverage this result to show that Hypothesis A.1 is satisfied for GoSafeOpt.
Lemma A.9. Let the assumptions from Theorem 4.1 hold and β n be defined as in [18]. Then, Hypothesis A.1 is satisfied for GoSafeOpt, that is, with probability at least 1 − δ for all i ∈ I g and n ≥ 0 ∀a ∈ S n : g i (a, x 0 ) ≥ 0, (A.9) ∀a ∈ A : l n (a, i) ≤ g i (a, x 0 ) ≤ u n (a, i). (A.10) Proof. We use induction on n.
Base case n = 0: By Assumption 2.1, we have, for all a ∈ S 0 , g i (a, x 0 ) ≥ 0 for all i ∈ I g . Moreover, the initialization of the confidence intervals presented in Section 3.3 is as follows: l 0 (a, i) = 0 if a ∈ S 0 and −∞ otherwise, and u 0 (a, i) = ∞ for all a ∈ A. Thus, it follows that l 0 (a, i) ≤ g i (a, x 0 ) ≤ u 0 (a, i) for all a ∈ A.
Inductive step: Our induction hypothesis is l n−1 (a, i) ≤ g i (a, x 0 ) ≤ u n−1 (a, i) and g i (a, x 0 ) ≥ 0 for all a ∈ S n−1 and for all i ∈ I g . Based on this, we prove that these relations hold for iteration n.
We start by showing that l n (a, i) ≤ g i (a, x 0 ) ≤ u n (a, i) for all a ∈ A. To this end, we distinguish between the different updates of the two stages of GoSafeOpt, LSE and GE.
During LSE, we define l_n(a, i) and u_n(a, i) as
l_n(a, i) = max(l_{n−1}(a, i), µ_n(a, i) − β_n σ_n(a, i)),
u_n(a, i) = min(u_{n−1}(a, i), µ_n(a, i) + β_n σ_n(a, i)).
We know that g_i(a, x_0) ≥ l_{n−1}(a, i) by the induction hypothesis and that g_i(a, x_0) ≥ µ_n(a, i) − β_n σ_n(a, i) with probability at least 1 − δ from [18]. This implies g_i(a, x_0) ≥ l_n(a, i). A similar argument holds for the upper bound.
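As an aside, the intersected bound update just described can be sketched directly:

```python
def update_bounds(l_prev, u_prev, mu, sigma, beta):
    """Intersected confidence bounds used during LSE:
    l_n = max(l_{n-1}, mu - beta * sigma), u_n = min(u_{n-1}, mu + beta * sigma).
    Intersecting with the previous bounds makes l_n non-decreasing and u_n
    non-increasing in n, which the induction in Lemma A.9 relies on."""
    return max(l_prev, mu - beta * sigma), min(u_prev, mu + beta * sigma)
```

Starting from the initialization l_0 = 0 on the safe seed and u_0 = ∞, repeated updates can only tighten the interval.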
During GE, we update l n (a, i) if the parameter we evaluate induces a trajectory that does not trigger a backup policy (see Algorithm 2 line 13). For this parameter, the induction hypothesis allows us to use Lemma A.7 and conclude g i (a, x 0 ) ≥ 0. Therefore, the update of the confidence intervals during GE also satisfies Eq. (A.10) for iteration n, thus completing the induction step for the confidence intervals.
As for the confidence intervals, we distinguish between the different updates of the safe set implemented by LSE and GE. In the case of GE, we update the safe set by adding the evaluated policy parameter a only if it does not trigger a backup, i.e., S_n = S_{n−1} ∪ {a}. Following the same argument as above, we can conclude g_i(a, x_0) ≥ 0 for all i ∈ I_g. Together with the induction hypothesis, this means g_i(a, x_0) ≥ 0 for all i ∈ I_g and a ∈ S_n in case of a GE update. Now we focus on LSE. We showed that Eq. (A.10) holds for n. Moreover, we know by the induction hypothesis that g_i(a, x_0) ≥ 0 for all a ∈ S_{n−1} and all i ∈ I_g with high probability. The update equation for the safe set (Eq. (8)) gives that for all a′ ∈ S_n \ S_{n−1}, there exists a ∈ S_{n−1} such that for all i ∈ I_g
l_n(a, i) − L_a∥a − a′∥ ≥ 0.
We show that this is enough to guarantee with high probability that g_i(a′, x_0) ≥ 0. Due to the Lipschitz continuity of the constraint functions and Eq. (A.11), we have
g_i(a′, x_0) ≥ g_i(a, x_0) − L_a∥a − a′∥ ≥ l_n(a, i) − L_a∥a − a′∥ ≥ 0.
Therefore, g_i(a′, x_0) ≥ 0 for all i ∈ I_g and a′ ∈ S_n with probability at least 1 − δ also in case of an LSE step. Lemma A.9 ensures that Hypothesis A.1 holds for GoSafeOpt. We also know that under the same assumptions as Theorem 4.1 and Hypothesis A.1, we are safe during GE (see Corollary A.8). Hence, we can now guarantee safety during GE. Finally, we prove Theorem 4.1, which guarantees safety for GoSafeOpt.
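The safe-set expansion rule of Eq. (8), as used in the induction above, can be sketched as follows; `lower_bound(a)` returning the vector of l_n(a, i) is our own stand-in for the GP bounds.

```python
import math

def expand_safe_set(safe_set, candidates, lower_bound, L_a):
    """Safe-set update in the spirit of Eq. (8): a candidate a' is added if
    some already-safe a certifies it, i.e. l_n(a, i) - L_a * ||a - a'|| >= 0
    for every constraint i; lower_bound(a) returns the vector of l_n(a, i)."""
    def certifies(a, a_prime):
        d = math.dist(a, a_prime)
        return all(l - L_a * d >= 0 for l in lower_bound(a))

    new = {a_p for a_p in candidates
           if any(certifies(a, a_p) for a in safe_set)}
    return set(safe_set) | new
```

Because the lower bounds are valid for g_i(·, x_0) with high probability, Lipschitz continuity transfers safety from the certifying parameter a to the new parameter a′, exactly as in the LSE step of Lemma A.9.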
Proof. We perform GoSafeOpt in two stages: LSE and GE. In Lemma A.9, we proved that for all parameters a ∈ S_n, g_i(a, x_0) ≥ 0 for all i ∈ I_g with probability at least 1 − δ. During LSE we query parameters from S_n (Eq. (9)); therefore, these experiments are safe. During GE, Corollary A.8 proves that when the assumptions from Theorem 4.1 and Hypothesis A.1 hold, we are safe for our choice of β_n. Furthermore, in Lemma A.9 we proved that Hypothesis A.1 is satisfied for GoSafeOpt. Hence, we can conclude that if the assumptions from Theorem 4.1 hold, we are safe during GE at all times.

A.1.1. Proof of Boundary Condition For Noisy Measurements
Lemma A.10. Assume that at each time step k we receive a noisy measurement of the state used to evaluate our boundary condition, i.e., y = x + ε with ε i.i.d. Specifically, assume the noise is bounded such that ∥ε∥ + ∥ε_s∥ ≤ d with probability at least 1 − δ_2, where y_s = x_s + ε_s denotes the noisy measurement stored for a backup point. Then, evaluating the boundary condition with the noisy measurements and the inflated one-step jump bound Ξ + d, i.e., requiring l_n(a_s, i) − L_x(∥y − y_s∥ + Ξ + d) ≥ 0, implies the noise-free condition l_n(a_s, i) − L_x(∥x − x_s∥ + Ξ) ≥ 0 with probability at least 1 − δ_2.
Proof. We would like to show that l_n(a_s, i) − L_x(∥y − y_s∥ + Ξ + d) ≤ l_n(a_s, i) − L_x(∥x − x_s∥ + Ξ). For this, it suffices that d ≥ ∥x − x_s∥ − ∥y − y_s∥.
Accordingly, ∥x − x s ∥ − ∥y − y s ∥ ≤ ∥x − x s − (y − y s )∥ (reverse triangle inequality) = ∥ε s − ε∥ ≤ ∥ε∥ + ∥ε s ∥ ≤ d (with probability at least 1 − δ 2 .) Following the lemma, we can come up with a more conservative boundary condition (with one step jump bound Ξ ′ = Ξ + d) which still guarantees safety. However, the price we pay for not measuring our state perfectly is the additional probability term 1 − δ 2 . Lastly, here we only look at the influence of noisy state measurements on the boundary condition. Nevertheless, if the policy π uses some form of feedback, the noise also enters the dynamics. In this work, we assume that this influence is captured by our observation model, see Assumption 2.3.
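The noise-robust condition from Lemma A.10 is a one-line check once d is fixed; a sketch with hypothetical names (`l_val` stands in for min_i l_n(a_s, i)):

```python
import math

def noisy_condition_holds(l_val, y, y_s, L_x, Xi, d):
    """Noise-robust boundary condition from Lemma A.10: with measurements
    y = x + eps and stored y_s = x_s + eps_s, inflating the one-step jump
    bound from Xi to Xi + d (where ||eps|| + ||eps_s|| <= d w.h.p.)
    implies the noise-free condition."""
    return l_val - L_x * (math.dist(y, y_s) + Xi + d) >= 0
```

The price of not measuring the state perfectly is the extra slack d, which makes the condition trigger backups earlier.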

A.2. Optimality Guarantees
In this section, we prove Theorem 4.3 which guarantees that the safe global optimum can be found with ϵ-precision if it is discoverable at some iteration n ≥ 0 (see Definition 4.2).
Then, we show in Lemma A.18 that for many practical applications, this discoverability condition is satisfied.

A.2.1. Proof of Theorem 4.3
We first define the largest region that LSE can safely explore for a given safe initialization S, and then we show that we can find the optimum with ϵ-precision within this region. To this end, we define the reachability operator R^c_ϵ(S) and the fully connected safe region R̄^c_ϵ(S) (adapted from [18; 21]). The reachability operator R^c_ϵ(S) contains the parameters we can safely explore if we know our constraint function with ϵ-precision within some safe set of parameters S. Further, (R^c_ϵ)^n(S) denotes the repeated composition of R^c_ϵ(S) with itself, and R̄^c_ϵ(S) its closure, i.e., the limit of (R^c_ϵ)^n(S) as n → ∞. Next, we derive a property of the reachability operator that we will leverage to provide optimality guarantees.
Lemma A.11. Let A ⊆ S be sets of parameters. If R^c_ϵ(S) \ S = ∅, then R̄^c_ϵ(A) \ S = ∅.
Proof. This lemma is a straightforward generalization of [18, Lem. 7.4]. Assume R^c_ϵ(S) \ S = ∅; we want to show that this implies R̄^c_ϵ(A) \ S = ∅. By definition R^c_ϵ(S) ⊇ S and therefore R^c_ϵ(S) = S. Iteratively applying R^c_ϵ to both sides, we get in the limit R̄^c_ϵ(S) = S. Furthermore, because A ⊆ S, we have R̄^c_ϵ(A) ⊆ R̄^c_ϵ(S) [18, Lem. 7.1]. Thus, we obtain R̄^c_ϵ(A) ⊆ R̄^c_ϵ(S) = S, which leads to R̄^c_ϵ(A) \ S = ∅. In the following, we prove that our LSE convergence criterion (see Eq. (10)) guarantees that for the safe initialization S, we can explore R̄^c_ϵ(S) during LSE in finite time.
Theorem A.12. Consider any ϵ > 0 and δ > 0. Let Assumptions 2.2 and 2.3 hold, β_n be defined as in [18], and S ⊆ A be an initial safe seed of parameters, i.e., g(a, x_0) ≥ 0 for all a ∈ S. Assume that the information gain γ_n grows sublinearly with n for the kernel k. Further, let n* be the smallest integer such that (cf. the convergence criterion of LSE in Eq. (10))
max_{a∈G_{n*−1}∪M_{n*−1}} max_{i∈I} w_{n*−1}(a, i) < ϵ and S_{n*−1} = S_{n*}. (A.14)
Then n* is finite and, when running LSE, the following holds with probability at least 1 − δ for all n ≥ n*:
R̄^c_ϵ(S) ⊆ S_n, (A.15)
f(â_n) ≥ max_{a∈R̄^c_ϵ(S)} f(a) − ϵ. (A.16)
Proof. We first leverage the result from [18, Thm. 4.1], which provides the following worst-case bound on n*:
n* / (β_{n*} γ_{|I|n*}) ≥ C_1 (|R̄^c_0(S)| + 1) / ϵ², (A.17)
where C_1 = 8/log(1 + σ^{−2}) and n* is the smallest integer that satisfies Eq. (A.17). Hence, n* is finite. The sublinear growth of γ_n with n is satisfied for many practical kernels, like the ones we consider in this work [31]. Next, we prove Eq. (A.15). For the sake of contradiction, assume R̄^c_ϵ(S) \ S_{n*} ≠ ∅. This implies R^c_ϵ(S_{n*}) \ S_{n*} ≠ ∅ (Lemma A.11). Therefore, there exists some a ∈ A \ S_{n*} such that for some a′ ∈ S_{n*} = S_{n*−1} (Eq. (A.14)), we have for all i ∈ I_g
0 ≤ g_i(a′, x_0) − ϵ − L_a∥a − a′∥ ≤ u_{n*−1}(a′, i) − ϵ − L_a∥a − a′∥. (Lemma A.9)
Therefore, a′ ∈ G_{n*−1} (see [18] or Appendix D Definition D.1) and accordingly w_{n*−1}(a′, i) < ϵ. Next, because w_{n*−1}(a′, i) < ϵ, we have for all i ∈ I_g
0 ≤ l_{n*−1}(a′, i) − L_a∥a − a′∥. (A.18)
This means a ∈ S_{n*} (Eq. (8)), which is a contradiction. Thus, we conclude that R̄^c_ϵ(S) ⊆ S_{n*} and, because S_{n*} ⊆ S_n for all n ≥ n* (Proposition A.13), we get R̄^c_ϵ(S) ⊆ S_n. Now we prove Eq. (A.16). Consider any n ≥ n*. Note that w_{n*−1}(a′, i) < ϵ implies w_n(a′, i) < ϵ (see Algorithm 1 line 4 or Algorithm 2 line 13). For simplicity, we denote the solution of arg max_{a∈R̄^c_ϵ(S)} f(a) by a*_S.
By definition of â_n, we have l_n(â_n, 0) ≥ l_n(a*_S, 0), and a*_S is a maximizer, i.e., a*_S ∈ M_n (see Appendix D Definition D.2), with uncertainty less than ϵ, that is, w_n(a*_S, 0) < ϵ. Now, we show that f(â_n) ≥ f(a*_S) − ϵ. For the sake of contradiction, assume f(â_n) < f(a*_S) − ϵ. Then we obtain
l_n(a*_S, 0) ≤ l_n(â_n, 0) (by definition of â_n)
≤ f(â_n) < f(a*_S) − ϵ ≤ u_n(a*_S, 0) − ϵ < l_n(a*_S, 0), (by definition of w_n(a*_S, 0))
which is a contradiction. Therefore, we have f(â_n) ≥ f(a*_S) − ϵ. Theorem A.12 states that for a given safe seed S, the convergence of LSE (Eq. (10)) implies that we have discovered its fully connected safe region R̄^c_ϵ(S) and recovered the optimum within the region with ϵ-precision.
Based on the previous results, we can show that if the safe global optimum is discoverable at some iteration n ≥ 0 (see Definition 4.2), then we can find an approximately optimal safe solution. However, to prove optimality, we also require that if a* ∈ S_n, then a* ∈ S_{n+1}.
Proposition A.13. Let the assumptions from Theorem 4.1 hold. For any n ≥ 0, the safe set satisfies
S_n ⊆ S_{n+1}. (A.20)
Proof. The safe set provably increases during LSE [18, Lem. 7.1]. During GE, the safe set is only updated if a new safe parameter is found, and the proposed update also has the non-decreasing property (see Algorithm 2, line 13). Hence, we can conclude that S_n ⊆ S_{n+1}.
Proposition A.13 shows that if the safe global optimum a* ∈ S_n, then a* ∈ S_{n+1}. Next, we prove that if a new safe region A is added to our safe set S_n, we will explore its largest reachable safe set R̄^c_ϵ(A).
Lemma A.14. Consider any integer n ≥ 0. Let S_n be the safe set of parameters explored after n iterations of GoSafeOpt and let β_n be defined as in [18]. Consider A = S_{n+1} \ S_n. If A ≠ ∅, then there exists a finite integer n̄ > n such that R̄^c_ϵ(A) ∪ R̄^c_ϵ(S_n) ⊆ S_n̄ with probability at least 1 − δ.
Proof. For the sake of contradiction, assume R̄^c_ϵ(A) \ S_n̄ ≠ ∅. This implies R^c_ϵ(S_n̄) \ S_n̄ ≠ ∅ (Lemma A.11). Since A ≠ ∅, the safe set is expanding. For GoSafeOpt, this can either happen during LSE or during GE when a new parameter is successfully evaluated, i.e., the boundary condition is not triggered. In either case, we perform LSE till convergence. Let n̄ > n be the smallest integer for which we converge during LSE, i.e., for which
max_{a∈G_{n̄−1}∪M_{n̄−1}} max_{i∈I} w_{n̄−1}(a, i) < ϵ and S_{n̄−1} = S_n̄ (A.21)
holds. From Theorem A.12, we know that n̄ is finite. Consider a ∈ R^c_ϵ(S_n̄) \ S_n̄. Then there exists a′ ∈ S_n̄ such that 0 ≤ g_i(a′, x_0) − ϵ − L_a∥a − a′∥ (see Eq. (A.12)). Furthermore, S_{n̄−1} = S_n̄ means a′ ∈ S_{n̄−1}. Hence, we also have 0 ≤ u_{n̄−1}(a′, i) − L_a∥a − a′∥, which implies that a′ ∈ G_{n̄−1} (Appendix D Definition D.1) and therefore w_{n̄−1}(a′, i) < ϵ. This implies that 0 ≤ l_{n̄−1}(a′, i) − L_a∥a − a′∥. Therefore, according to Eq. (8), a ∈ S_n̄, which is a contradiction. Hence, R̄^c_ϵ(A) \ S_n̄ = ∅. We can proceed similarly to show that R̄^c_ϵ(S_n) \ S_n̄ = ∅. Since we have R̄^c_ϵ(A) \ S_n̄ = ∅ and R̄^c_ϵ(S_n) \ S_n̄ = ∅, we can conclude that R̄^c_ϵ(A) ∪ R̄^c_ϵ(S_n) ⊆ S_n̄. In Lemma A.14 we have shown that for every set A that we add to our safe set, we will explore its fully connected safe region in finite time. This is crucial because it allows us to guarantee that when we discover a new region during GE, we explore it till convergence. Finally, we can now prove Theorem 4.3.
Theorem 4.3. Let a* be a safe global optimum. Further, let Assumptions 2.1 - 2.5 hold and β_n be defined as in [18]. Assume there exists a finite integer ñ ≥ 0 such that a* is discoverable at iteration ñ (see Definition 4.2). Then, for any ϵ > 0 and δ ∈ (0, 1), there exists a finite integer n* ≥ ñ such that, with probability at least 1 − δ,
f(â_n) ≥ f(a*) − ϵ for all n ≥ n*, with â_n = arg max_{a∈S_n} l_n(a, 0).

A.2.2. Requirements for Discovering Safe Sets with GE
In the previous section, we showed that if a safe global optimum a* is discoverable at some iteration ñ, we can find it with ϵ-precision. In this section, we show that if, for a parameter a_GE ∈ A \ S_n, we have backup policies for all the states in its trajectory, then a_GE will eventually be added to our safe set of parameters. We conclude this section by showing that for many practical cases, a* fulfills this discoverability condition. Now, we derive conditions that allow us to explore new regions/parameters during GE. To this end, we start by defining the set of safe states X^s_n, i.e., the states for which our boundary condition does not trigger a backup policy.
Definition A.15. The set of safe states X^s_n is defined as
X^s_n = {x ∈ X | ∃(a_s, x_s) ∈ B_n : l_n(a_s, i) − L_x(∥x − x_s∥ + Ξ) ≥ 0 for all i ∈ I_g}.
Intuitively, if a trajectory induced by a parameter being evaluated during GE lies in X^s_n, then the boundary condition will not be triggered for this parameter. Now we will prove that this set of safe states X^s_n is non-decreasing. This is an important property because it tells us that GoSafeOpt continues to learn backup policies for more and more states.
Lemma A.16. Let the assumptions from Theorem 4.1 hold. Then, for any n ≥ 0, X^s_n ⊆ X^s_{n+1}. (A.24)

Proof. The lower bounds l_n(a, i) are non-decreasing in n for all i ∈ I by definition (see Algorithm 1, line 4, or Algorithm 2, line 13). Additionally, because we continue to add new rollouts to our set of backups, B_n ⊆ B_{n+1} (see Algorithm 1, line 3, or Algorithm 2, line 12). For each x ∈ X^s_n, there exists (a_s, x_s) ∈ B_n such that l_n(a_s, i) − L_x (∥x − x_s∥ + Ξ) ≥ 0 for all i ∈ I_g. Because B_n ⊆ B_{n+1} and l_{n+1}(a_s, i) ≥ l_n(a_s, i), it follows that x ∈ X^s_{n+1}. □

Next, we state conditions under which a parameter a_GE ∈ A \ S_n is discovered during GE in finite time, i.e., evaluated without triggering a backup policy.
Lemma A.17. Consider any n ≥ 0. Let S_n be the safe set of parameters explored after n iterations of GoSafeOpt and let a_GE be a parameter in A \ S_n. Further, let the assumptions from Theorem 4.3 hold and β_n be defined as in [18]. If x_{a_GE}(k) ∈ X^s_n for all k ≥ 0, where x_{a_GE}(k) denotes the state at time step k of the system starting at x_0 under policy π_{a_GE}(·), then there exists a finite integer ñ > n such that a_GE ∈ S_ñ.
B. Experimental Details

We consider the operational-space dynamics of the robot arm,

Λ(q) s̈ + Γ(q, q̇) ṡ + η(q) = u, (B.1)

where s represents the end-effector position, q the joint angles, and Λ(q), Γ(q, q̇), η(q) are nonlinearities representing the mass, Coriolis, and gravity terms, respectively. The state we consider is x(t) = [sᵀ(t), ṡᵀ(t)]ᵀ. We apply an impedance controller,

u(x(t)) = −K (x − x_des(k)) + Γ(q, q̇) ṡ + η(q), (B.2)

with K the feedback gain. The torque τ applied to each of the joints is obtained via τ = Jᵀ u(x(t)), with J the Jacobian.
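As an illustration, the control law of Eq. (B.2) and the torque mapping can be sketched as follows; `Gamma`, `eta`, and `J` stand for the model terms Γ(q, q̇), η(q), and the Jacobian, which in practice come from the robot's dynamics model, and the function name and signature are ours, not the authors'.

```python
import numpy as np

def joint_torques(x, x_des, K, s_dot, Gamma, eta, J):
    """Impedance controller of Eq. (B.2) mapped to joint torques.

    x, x_des : stacked state [s; s_dot] and its desired value (6-vectors)
    K        : 3x6 feedback gain
    s_dot    : end-effector velocity (3-vector)
    Gamma    : 3x3 operational-space Coriolis matrix
    eta      : gravity term (3-vector)
    J        : 3xN end-effector Jacobian
    """
    u = -K @ (x - x_des) + Gamma @ s_dot + eta  # operational-space force, Eq. (B.2)
    return J.T @ u                              # joint torques, τ = Jᵀ u
```

With the state at its desired value, the feedback term vanishes and only the Coriolis and gravity compensation remain.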
For our experiments, we can directly measure g(a, x(k)), where k denotes a discrete time step. Therefore, instead of using l_n(a_s, i) for the boundary condition in Section 4.3.2, we take a lower bound over all the tuples in our set of backups B_n, i.e., l_n(a_s, x_s, i), which can reduce the conservatism of the boundary condition. To this end, we define a GP over the parameter and state space that contains all the points from B_n. The set B_n consists of rollouts from individual experiments; we typically add 50–100 data points from each experiment to B_n. As the number of data points grows, GP inference becomes prohibitively costly. We therefore use a subset-selection scheme that retains points (a, x) from B_n at random with probability proportional to exp(−min_{i∈I_g} l²_n(a, x, i)). Crucially, we want to retain points with a small lower bound, i.e., points close to the safety boundary, so that we maintain low uncertainty around them. We perform this subset selection once our GP has acquired more than n_max data points, and then select a subset of m < n_max points. Lastly, as described in Section 5.2.1, for the boundary condition from Section 4.3.2 we define the distances d_u and d_l via covariances κ_u and κ_l, respectively. In particular, we pick d_u such that k(d_u) ≥ κ_u for the stationary isotropic kernel k of our GP (and analogously for d_l). This makes the choice of d_u more intuitive, since it relates directly to the covariance function of our GP.
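The two heuristics above, boundary-weighted subsampling and converting a covariance threshold κ_u into a distance d_u, can be sketched as below. The RBF form of the kernel is our assumption for illustration (the text only requires a stationary isotropic kernel), and the function names are placeholders.

```python
import numpy as np

def subsample_backups(lower_bounds, m, rng=None):
    """Keep m of the GP's data points at random, with probability proportional
    to exp(-min_i l_n^2(a, x, i)); points whose smallest lower bound is close
    to zero (near the safety boundary) are retained more often.

    lower_bounds: array of shape (n_points, n_constraints)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.exp(-np.min(lower_bounds, axis=1) ** 2)
    return rng.choice(len(weights), size=m, replace=False, p=weights / weights.sum())

def distance_from_covariance(kappa, lengthscale):
    """For an RBF kernel k(d) = exp(-d^2 / (2 l^2)), the largest distance with
    k(d) >= kappa is d = l * sqrt(-2 ln kappa)."""
    return lengthscale * np.sqrt(-2.0 * np.log(kappa))
```

Choosing κ_u close to 1 thus yields a small d_u, i.e., a conservative boundary condition that only trusts backups with high covariance to the current point.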

B.1. Simulation
For the simulation task, we determine the impedance gain K using an infinite-horizon LQR parameterized via

Q = diag(Q_r, κ_d Q_r), Q_r = 10^{q_c} I_3, R = 10^{r−2} I_3, A = [0 I_3; 0 0], B = [0; I_3].
The matrices A and B are obtained assuming a feedback linearization controller [36]. However, because we instead use an impedance controller, there are nonlinearities and imprecisions in our model. The parameters q_c, r, κ_d are the tuning parameters we optimize. For the 11D task, we define the desired path x_des(k) as x_des(k) = x_d(ρ(k)), where ρ(·) parameterizes a cubic spline from x_0 to x_target. The constraints are:

g(x(t)) = (∥s(t) − s_des∥² − ∥s(0) − s_des∥²) / ∥s(0) − s_des∥² − α, α = 0.08, (8D task)
g(x(t)) = ζ − ∥x(t) − x_d(ρ(t))∥². (11D task)
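Under this parameterization, computing the gain K amounts to solving a continuous-time algebraic Riccati equation, e.g. with SciPy. This is a sketch of the construction above, not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(q_c, r, kappa_d):
    """Infinite-horizon LQR gain for the double-integrator model A, B above,
    with Q = diag(Q_r, kappa_d * Q_r), Q_r = 10^q_c I_3, R = 10^(r-2) I_3."""
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    A = np.block([[Z3, I3], [Z3, Z3]])
    B = np.vstack([Z3, I3])
    Q_r = 10.0 ** q_c * I3
    Q = np.block([[Q_r, Z3], [Z3, kappa_d * Q_r]])
    R = 10.0 ** (r - 2) * I3
    P = solve_continuous_are(A, B, Q, R)   # algebraic Riccati equation
    return np.linalg.solve(R, B.T @ P)     # K = R^{-1} B^T P
```

The resulting 3×6 gain stabilizes the nominal double integrator for any κ_d > 0, since Q is then positive definite and (A, B) is controllable; safety on the real arm is nevertheless not implied, which is exactly why the gains are tuned with GoSafeOpt.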

D. Additional Definitions
In this section, we present some of the definitions from SafeOpt for completeness.
Definition D.1. The expanders G_n are defined as G_n := {a ∈ S_n | e_n(a) > 0}, with e_n(a) = |{a′ ∈ A \ S_n : ∃ i ∈ I_g, u_n(a, i) − L_a ∥a − a′∥ ≥ 0}|.
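A direct reading of Definition D.1 as code, assuming `upper_bound(a, i)` evaluates u_n(a, i) over a finite candidate set `outside` ⊆ A \ S_n (names and signature are placeholders):

```python
import numpy as np

def expander_count(a, outside, upper_bound, L_a, constraint_ids):
    """e_n(a) from Definition D.1: the number of parameters a' outside the
    safe set that could optimistically be certified safe from a, i.e., for
    which some constraint satisfies u_n(a, i) - L_a * ||a - a'|| >= 0."""
    return sum(
        1
        for a_prime in outside
        if any(upper_bound(a, i) - L_a * np.linalg.norm(a - a_prime) >= 0
               for i in constraint_ids)
    )
```

A parameter a ∈ S_n is an expander precisely when this count is positive, i.e., when evaluating a could enlarge the safe set.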

E. Hyperparameters
The hyperparameters of our simulated and real-world experiments are provided in Table E.2.