Joint Learning of Reward Machines and Policies in Environments with Partially Known Semantics

We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions' truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from the exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions' truth value. Finally, the algorithm uses a Q-learning procedure for the states of the hypothesis reward machine to determine the policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.


Introduction
Reinforcement learning (RL) studies the problem of learning an optimal behavior for agents with unknown dynamics in potentially unknown environments. A variety of methods incorporate high-level knowledge that can help the agent explore the environment more efficiently [TS07]. This high-level knowledge is usually expressed through abstractions of the environment and sub-goal sequences [Sin92,KNST16,AALL18] or linear temporal logic [LVB17,AJK + 16]. Recently, the authors in [IKVM18] proposed the concept of reward machines in order to provide high-level information to the agent in the form of rewards. Reward machines are finite-state structures that encode a possibly non-Markovian reward function. They decompose a task into several temporally related subtasks, such as "get coffee and bring it to the office without encountering obstacles". Reward machines allow the composition of such subtasks in flexible ways, including concatenations, loops, and conditional rules, providing the agent access to high-level temporal relationships among the subtasks. These subtasks are defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables, such as "office" and "obstacle" in the aforementioned example. Intuitively, as the agent navigates in the environment, it transits between the states within a reward machine; such transitions are triggered by the truth values of the atomic propositions. After every such transition, the reward machine outputs the reward the agent should obtain according to the encoded reward function. Furthermore, [IKVM18] develops a q-learning method that learns the optimal policy associated with the reward function that is encoded by the reward machine.
In many practical situations, the reward machines encode complex temporal relationships among the subtasks and are hence too difficult to construct. Therefore, the need emerges for the learning agent to learn how to accomplish the underlying task without having a priori access to the respective reward machine. The work [XGA+20] develops an algorithm that infers both the reward machine and an optimal RL policy from the trajectories of the system.
The procedure of inferring and using a reward machine is tightly connected to the values of the atomic propositions; such values trigger the reward machine's transitions, which output the reward the agent should obtain. In the related literature, there are mainly two ways that encode the access of the learning agent to the truth values of the atomic propositions. First, the agent has access to a so-called labelling function, a map that specifies which atomic propositions are true in which areas of the environment, e.g., which room of an indoor environment has an office. Second, the agent detects the truth values of the atomic propositions from sensors as it navigates in the environment. However, many practical scenarios involve agents operating in a priori unknown environments, and hence accurate knowledge of a labelling function is an unrealistic assumption. Additionally, autonomous agents are endowed with sensors that naturally suffer from imperfections, such as misclassifications or missed object detections. Consequently, the truth values of the atomic propositions are uncertain. When the reward machine that encodes the task is a priori unknown, uncertainties in such values significantly complicate the RL problem, since they interfere with the estimation of the reward machine and, consequently, the learning of a suitable policy.
We investigate an RL problem for a task encoded by a reward machine over a set of atomic propositions. The reward machine and the labelling function associated with the atomic propositions are a priori unknown. The agent obtains the truth values of the atomic propositions via measurements from sensors, which, however, suffer from imperfections. Consequently, the truth values of the atomic propositions are uncertain. Our main contribution lies in the development of an RL algorithm that learns a policy that achieves the given task despite the a priori unknown reward machine and the atomic proposition uncertainties. In order to account for such uncertainties, the algorithm holds a probabilistic belief over the truth values of the atomic propositions and it updates this belief according to new sensory measurements that arrive from the exploration of the environment. Furthermore, the algorithm maintains a so-called hypothesis reward machine to estimate the unknown reward machine that encodes the task. It uses the rewards the agent obtains as it explores the environment and the aforementioned belief on the atomic propositions to update this hypothesis reward machine. Finally, the algorithm deploys a q-learning procedure that is based on the hypothesis reward machine and the probabilistic belief to learn a policy that accomplishes the underlying task. An illustrative diagram of the proposed algorithm is depicted in Fig. 1.
We establish theoretical guarantees on the algorithm's convergence to a policy that accomplishes the underlying task. Our guarantees rely on the following sufficient conditions. First, the belief updates lead to better estimation of the truth values of the atomic propositions. That is, there exists a finite time instant after which the probability that the estimated propositions' values match the actual ones is more than 0.5. Second, the learning agent explores the environment sufficiently well in each episode of the proposed RL algorithm. More specifically, the length of each episode is larger than a predefined constant associated with the size of the operating environment and of the reward machine that encodes the task to be accomplished. Based on the aforementioned assumptions, we prove that the proposed algorithm 1) asymptotically infers a reward machine that is equivalent to the one encoding the underlying task, and 2) asymptotically learns a policy that accomplishes this task.
Finally, we carry out experiments comparing the proposed algorithm with a baseline version that does not update the beliefs of the propositions' truth values. The experiments show that the proposed algorithm outperforms the baseline version in terms of convergence. The experiments further show the robustness of the proposed algorithm to inaccurate sensor measurements.

Related Work
Previous RL works on providing higher-level structure and knowledge about the reward function focus mostly on abstractions and hierarchical RL; in such cases, a meta-controller decides which subtasks to perform, and a controller decides which actions to take within a subtask (e.g., [Kon19, KNST16, AKL17]). Other works use temporal logic languages to express a specification, and then generate the corresponding reward functions [WET15, LVB17, SKC+14, AJK+16, HAK18, TIKVM18, WBP21, BWZP20, HPS+19, CJT20]. Inspired by [IKVM18], in this work we use the concept of reward machines, which compactly and expressively encode high-level information about the agent's reward function. Additionally, unlike the works in the related literature, we consider that the reward machine that encodes the task at hand is a priori unknown.
There exists a large variety of works dealing with perception uncertainty [GBT20, dSKL19, AMCA14, GJD13, LMF+16, LMB12, MLD17, AAB13, CCSM17, RCBP19]. However, most of these works consider the planning problem, which consists of computing an optimal policy by explicitly employing the underlying agent model. Furthermore, the considered uncertainty usually arises from probabilistic dynamics, which is resolved either by active-perception procedures, belief-state propagation, or analysis with partially observable Markov decision processes (POMDPs). Besides, the works that consider high-level reward structure (like temporal logic [GBT20, dSKL19, GJD13, LMB12]) assume full availability of the respective reward models, unlike the problem setting considered in this paper.

Figure 1: Overview of the proposed reinforcement-learning algorithm, consisting of perception updates, inference of the reward machine, and q-learning.
The recent works [ZWL15, TIWK+19, XGA+20] consider tasks encoded by a priori unknown automata-based structures, which they infer from the agent's trajectories. The authors in [XGA+20], however, do not consider uncertainty in the truth values of the atomic propositions; [ZWL15] and [TIWK+19] assume partial environment state observability by employing POMDP models, while assuming the truth values of the atomic propositions in the environment to be known. In contrast, in this work we consider uncertainty in the semantic representation of the environment, i.e., the truth values of the environment properties that define the agent's task are a priori unknown.

Problem Formulation
This section formulates the considered problem. We first describe the model of the learning agent and the environment, the reward machine that encodes the task to be accomplished, and the observation model of the sensor the agent uses to detect the values of atomic propositions.

Agent Model
We model the interaction between the agent and the environment by a Markov decision process (MDP) [Bel57,Put14], formally defined below.
Definition 1. An MDP is a tuple M = (S, s_I, A, T, R, AP, L_G, L̂, γ) consisting of the following: S is a finite state space, representing the operating environment; s_I ∈ S is an initial state; A is a finite set of actions; T : S × A × S → [0, 1] is a probabilistic transition function, modelling the agent dynamics in the environment; R : S* × A × S → ℝ is a reward function, specifying payoffs to the agent; γ ∈ [0, 1) is a discount factor; AP is a finite set of Boolean atomic propositions, each taking values in {True, False}; L_G : S → 2^AP is the ground-truth labelling function, assigning to each state s ∈ S the atomic propositions that are true in that state; L̂ : S → 2^AP is an estimate of the ground-truth labelling function L_G. We define the size of M, denoted by |M|, to be |S| (i.e., the cardinality of the set S).
The atomic propositions AP are Boolean-valued properties of the environment; we write s |= p if an atomic proposition p ∈ AP is true at state s ∈ S. The ground-truth labelling function L_G specifies which atomic propositions are true in the states of the environment, i.e., L_G(s) = P ⊆ AP is equivalent to s |= p for all p ∈ P. We consider that L_G is unknown to the agent, which detects the truth values of AP through sensor units such as cameras or range sensors. However, such sensors suffer from imperfections (e.g., noise). Therefore, the agent maintains a time-varying probabilistic belief L̂_i : S × 2^AP → [0, 1] about the truth values of AP, where i denotes a time index. More specifically, for a state s ∈ S and a subset of atomic propositions P ⊆ AP, L̂_i(s, P) represents the probability of the event that P is true at s, i.e., L̂_i(s, P) = Pr(∧_{p∈P} s |= p). Notice that at each time index i and for every state s ∈ S, it holds that Σ_{P⊆AP} L̂_i(s, P) = 1. The prior belief of the agent might be an uninformative prior distribution. We assume that the truth values of the propositions are mutually independent in each state, i.e., Pr(∧_{p∈P} s |= p) = Π_{p∈P} Pr(s |= p) for all s ∈ S and P ⊆ AP, and also between every pair of states, i.e., Pr(s |= P ∧ s′ |= P′) = Pr(s |= P) Pr(s′ |= P′) for all s, s′ ∈ S and P, P′ ⊆ AP. The agent's belief L̂_i, for some time index i ≥ 0, determines the agent's inference of the environment configuration, and hence the estimate L̂ of the labelling function. More specifically, we define L̂(L̂_i, s) = argmax_{P⊆AP} L̂_i(s, P), i.e., the most probable outcome. The following example illustrates the relation between L_G and L̂. A policy is a function that maps states in S to a probability distribution over actions in A. At state s ∈ S, an agent using policy π picks an action a with probability π(s, a), and the new state s′ is chosen with probability T(s, a, s′), defined in Def. 1.
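To make the belief concrete, the sketch below (an illustration of ours, not code from the paper; the names `belief` and `most_probable_label` are hypothetical) stores L̂_i(s, ·) as a distribution over subsets of AP and recovers the estimate L̂(L̂_i, s) as the most probable subset.

```python
from itertools import chain, combinations

AP = ["c", "o"]  # atomic propositions, e.g., coffee and office

def powerset(props):
    """All subsets of the proposition set, as frozensets."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(props, r) for r in range(len(props) + 1))]

# Belief L̂_i: for each state s, a probability for every subset P ⊆ AP.
# State s11 carries an uninformative prior; s12 is believed to satisfy "c".
subsets = powerset(AP)
belief = {"s11": {P: 1.0 / len(subsets) for P in subsets},
          "s12": {frozenset(): 0.1, frozenset({"c"}): 0.7,
                  frozenset({"o"}): 0.1, frozenset({"c", "o"}): 0.1}}

def most_probable_label(belief, s):
    """Estimate L̂(L̂_i, s) = argmax_P L̂_i(s, P)."""
    return max(belief[s], key=belief[s].get)

assert abs(sum(belief["s12"].values()) - 1.0) < 1e-9  # sums to 1 per state
print(most_probable_label(belief, "s12"))  # frozenset({'c'})
```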
A policy π and the initial state s_I together determine a stochastic process; we write S_0 A_0 S_1 … for the random trajectory of states and actions.

A trajectory is a realization of this stochastic process: a sequence of states and actions s_0 a_0 s_1 … s_k a_k s_{k+1} with s_0 = s_I. Its corresponding label sequence is ℓ_0 ℓ_1 … ℓ_k, with ℓ_j = L_G(s_j) for all j ≤ k. The performance of a policy π is measured by the expected discounted return Z(s_I) = E_π[ Σ_{k≥0} γ^k R(S_0 A_0 … S_k, A_k, S_{k+1}) ], where γ is the discount factor defined in Def. 1. Note that the definition of the reward function assumes that the reward is a function of the whole trajectory; this allows the reward function to be non-Markovian [IKVM18]. Given a trajectory s_0 a_0 s_1 … s_{k+1} and a belief L̂_i, we naturally define the observed label sequence ℓ̂_0 ℓ̂_1 … ℓ̂_k, with ℓ̂_j = L̂(L̂_i, s_j) for all j ≤ k and some i ≥ 0.

Reward Machines
A (non-Markovian) reward function can be encoded in a type of finite-state machine called a reward machine, introduced in [IKVM18] and formally defined below.
Definition 2. A reward machine is a tuple A = (V, v_I, 2^AP, R, δ, σ) that consists of the following: V is a finite set of states; v_I ∈ V is an initial state; 2^AP is an input alphabet; R is an output alphabet of rewards; δ : V × 2^AP → V is a deterministic transition function; σ : V × 2^AP → R is an output function. We define the size of A, denoted by |A|, to be |V| (i.e., the cardinality of the set V).
The run of a reward machine A on a sequence of labels ℓ_0 … ℓ_k ∈ (2^AP)* is a sequence v_0 (ℓ_0, r_0) v_1 (ℓ_1, r_1) … v_k (ℓ_k, r_k) v_{k+1} of states and label-reward pairs such that v_0 = v_I and, for all j ∈ {0, …, k}, we have δ(v_j, ℓ_j) = v_{j+1} and σ(v_j, ℓ_j) = r_j. We write A(ℓ_0 … ℓ_k) = r_0 … r_k to connect the input label sequence to the sequence of rewards produced by the reward machine A. We say that a reward machine A encodes the reward function R of an MDP on the ground truth if, for every trajectory s_0 a_0 … s_k a_k s_{k+1} and the corresponding label sequence ℓ_0 … ℓ_k, the sequence of rewards equals A(ℓ_0 … ℓ_k). Moreover, given a fixed i ≥ 0, we say that a reward machine A encodes the reward function R on L̂_i if, for every trajectory s_0 a_0 … s_k a_k s_{k+1} and the corresponding observed label sequence ℓ̂_0 … ℓ̂_k, the sequence of rewards equals A(ℓ̂_0 … ℓ̂_k). Note, however, that there might not exist a reward machine that encodes the reward function on L̂_i, as the next example shows.
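As an illustration (our own sketch, not code from the paper), a reward machine can be implemented as a deterministic Mealy machine whose run on a label sequence yields the reward sequence A(ℓ_0 … ℓ_k); the hypothetical three-state machine below encodes a "get coffee (c), then reach the office (o)" task, with labels abbreviated as strings.

```python
class RewardMachine:
    """A = (V, v_I, 2^AP, R, delta, sigma): a deterministic Mealy machine."""

    def __init__(self, v_init, delta, sigma):
        self.v_init = v_init
        self.delta = delta  # (state, label) -> next state
        self.sigma = sigma  # (state, label) -> reward

    def run(self, labels):
        """Return the reward sequence A(l_0 ... l_k) for a label sequence."""
        v, rewards = self.v_init, []
        for l in labels:
            rewards.append(self.sigma[(v, l)])
            v = self.delta[(v, l)]
        return rewards

# Hypothetical task "get coffee (c), then reach the office (o)":
delta = {("v0", ""): "v0", ("v0", "c"): "v1", ("v0", "o"): "v0",
         ("v1", ""): "v1", ("v1", "c"): "v1", ("v1", "o"): "v2",
         ("v2", ""): "v2", ("v2", "c"): "v2", ("v2", "o"): "v2"}
sigma = {k: 0.0 for k in delta}
sigma[("v1", "o")] = 1.0  # reward 1 on completing the task
A = RewardMachine("v0", delta, sigma)

print(A.run(["", "c", "", "o"]))  # [0.0, 0.0, 0.0, 1.0]
```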
Example 3 (Continued). Consider the 10-state office workspace shown in Fig. 2. Assume that the agent receives reward 1 if it brings coffee to the office without encountering states with obstacles, and zero otherwise. Such a task is encoded by the reward machine shown in Fig. 3. Consider also the state sequence s_11 s_12 s_13 s_14 s_24 of a given trajectory. As stated before, the trajectory produces, via the ground-truth labelling function, the label and reward sequence (∅, 0)(c, 0)(∅, 0)(∅, 0)(o, 1), together with the corresponding run of the reward machine. The observed label and reward sequence, based on the estimate L̂, is (∅, 0)((c, X), 0)(∅, 0)(o, 0)(o, 1), which produces a reward-machine run that clearly does not comply with the observed reward sequence. Hence, we conclude that the reward machine does not encode the reward function on L̂. In fact, note that we cannot construct a reward machine that encodes the reward function on L̂. The actual reward function of bringing coffee from s_12 to the office in s_24 cannot be expressed via the set AP with this specific L̂, because of the ambiguity in the states satisfying "o" and "c"; a single accepting trajectory does not correspond to a unique observed label sequence.

Observation Model
The probabilistic belief is updated based on an observation model O. More specifically, at each time step, the agent's perception module processes a set of sensory measurements regarding the atomic propositions AP. In the Bayesian framework, the observation model is used for the update of the agent's belief, as specified in the following definition.
Definition 3. Let Z(s_1, s_2, p) ∈ {True, False} denote the perception output of the agent, when the agent is at state s_1, for the atomic proposition p ∈ AP at state s_2. The joint observation model of the agent is a function O, where O(s_1, s_2, p, b) denotes the probability that Z(s_1, s_2, p) = True given that the truth value of p at s_2 is b ∈ {True, False}. In particular, an accurate observation model is one for which O(s_1, s_2, p, b) = 1 for b = True and O(s_1, s_2, p, b) = 0 for b = False. Nevertheless, in the absence of such an observation model, one can perform the update in a frequentist way.

Problem Statement
We consider an MDP-modeled autonomous agent, whose task is encoded by a reward machine that is unknown to the agent. The ground-truth labelling function L G , providing the truth values of the atomic propositions, is also unknown to the agent. The formal definition of the problem statement is as follows.
Problem 1. Let an MDP agent model M = (S, s_I, A, T, R, AP, L_G, L̂, γ) with unknown ground-truth labelling function L_G and transition function T. Let also an unknown reward machine A that encodes the reward function R on the ground truth, representing a task to be accomplished. Develop an algorithm that learns a policy π that maximizes Z(s_I).

Algorithm 1: JIRP Algorithm (listing omitted).

Main Results
This section gives the main results of this paper, namely a joint perception and learning algorithm. We first give an overview of the JIRP algorithm of [XGA+20], which handles the joint inference of reward machines and policies under perfect knowledge of the ground-truth labelling function L_G.
The JIRP algorithm [XGA+20] (see Algorithm 1) aims at learning the optimal policy for maximizing Z(s_I) by maintaining hypothesis reward machines. Its main component is QRM_episode(), shown in Algorithm 2 and originally proposed in [IKVM18] for accurately known reward machines. QRM_episode() maintains a set Q of q-functions, denoted by q_v, one for each state v of the reward machine. The current state v of the reward machine guides the exploration by determining which q-function is used to choose the next action. In particular, the function GetEpsilonGreedyAction(q_v, s) (line 3 of Alg. 2) selects a random action with probability ε ∈ (0, 1) and an action that maximizes q_v with probability 1 − ε, balancing exploration and exploitation. However, in each single exploration step, the q-functions corresponding to all reward-machine states are updated (lines 7 and 11 of Alg. 2). Note that the returned rewards in ρ are observed (line 6 of Alg. 2), since the reward machine encoding the actual reward function is not known. Instead, JIRP operates with a hypothesis reward machine, which is updated using the traces of QRM_episode(). In particular, the episodes of QRM are used to collect traces and update q-functions. As long as the traces are consistent with the current hypothesis reward machine, QRM explores more of the environment using the reward machine to guide the search. However, if a trace (λ, ρ) is detected that is inconsistent with the hypothesis reward machine (i.e., H(λ) ≠ ρ, line 6 of Alg. 1), JIRP stores it in a set X (line 7 of Alg. 1); the trace (λ, ρ) is called a counterexample and the set X a sample. Once the sample is updated, the algorithm re-learns a new hypothesis reward machine (line 8 of Alg. 1) and proceeds iteratively.
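The core of QRM, one q-function per reward-machine state with all of them updated in parallel from the machine's transitions, can be sketched as follows. This is a minimal illustration of ours with a hypothetical two-state machine and toy state/action names, not the paper's Algorithm 2.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down"]
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.3

# A tiny hypothetical reward machine over labels "" and "g" (goal reached).
class RM:
    states = ["v0", "v1"]
    delta = {("v0", ""): "v0", ("v0", "g"): "v1",
             ("v1", ""): "v1", ("v1", "g"): "v1"}
    sigma = {("v0", ""): 0.0, ("v0", "g"): 1.0,
             ("v1", ""): 0.0, ("v1", "g"): 0.0}

# One q-function per reward-machine state, as in QRM.
Q = {v: defaultdict(float) for v in RM.states}

def get_epsilon_greedy_action(q_v, s):
    """Random action with probability EPS, greedy w.r.t. q_v otherwise."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_v[(s, a)])

def qrm_step(s, a, s_next, label):
    """Update the q-functions of ALL reward-machine states in parallel:
    each v treats the observed environment transition as if the machine
    were currently at v, using the machine's own output sigma as reward."""
    for v in RM.states:
        v_next = RM.delta[(v, label)]
        r = RM.sigma[(v, label)]
        best_next = max(Q[v_next][(s_next, b)] for b in ACTIONS)
        Q[v][(s, a)] += ALPHA * (r + GAMMA * best_next - Q[v][(s, a)])

qrm_step("s0", "up", "s1", "g")  # the goal label rewards v0's table only
print(round(Q["v0"][("s0", "up")], 3))  # 0.1
```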

Proposed Algorithm
In this paper, we extend the JIRP algorithm to account for the unknown labelling function and the uncertainty in the truth values of the atomic propositions. The proposed algorithm holds two probabilistic beliefs, L̂_h and L̂_j. It attempts to infer a reward machine A_h that encodes the reward function on the belief L̂_h, i.e., A_h(ℓ̂_0 … ℓ̂_k) = r_0 … r_k, where ℓ̂_0 … ℓ̂_k is the label sequence generated according to the estimate L̂(L̂_h, ·). At the same time, the algorithm updates the belief L̂_j at every time step j based on the observation model defined in Def. 3. Similarly to [GBT20], if the agent's knowledge about the environment, encoded in L̂_j, has changed significantly, the algorithm replaces L̂_h with L̂_j and attempts to infer a new reward machine that encodes the reward function on L̂_h.
Algorithm 3 illustrates the aforementioned procedure. The algorithm uses a hypothesis reward machine H (line 1 of Alg. 3) to guide the learning process. Similarly to [XGA+20] and Algorithm 1, it runs multiple QRM episodes to update the q-functions, and uses the collected traces to update the counterexamples in the set X and the hypothesis reward machine H (lines 7-12 of Alg. 3). Nevertheless, in our case, H aims to approximate a reward machine A_h that encodes the reward function on the most probable outcomes, L̂(L̂_h, ·), since that is the information available to the agent; L̂_h is the "current" belief that the agent uses in its policy learning and reward-machine inference. This is implemented via a modified version of the QRM episode, called via the command QRM_episode_mod in line 7 of Alg. 3. The difference between L̂_h and L̂_j is evaluated using a divergence test, illustrated in Section 4.3.

Information Processing
The belief L̂_j is updated at every time step in a Bayesian fashion, using the observation model of Def. 3: given a perception output Z(s_1, s, p), Bayes' rule yields, up to normalization, Pr_{j+1}(s |= p) ∝ Pr(Z(s_1, s, p) | s |= p) · Pr_j(s |= p), for all s ∈ S and p ∈ AP. Depending on the truth value observed for p, L̂_j will be updated according to the corresponding likelihood term, expressed through O(s_1, s, p, ·).
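A minimal sketch of such a Bayesian update for a single state and proposition is given below; the function name and the numeric values are ours (the paper's exact update expressions are its own), and the observation model is taken to give Pr(Z = True | truth value b).

```python
def bayes_update(prior_true, o_true, o_false, z):
    """One Bayesian update of Pr(s |= p).
    o_true  = Pr(Z = True | p is True at s),  i.e., O(s1, s, p, True)
    o_false = Pr(Z = True | p is False at s), i.e., O(s1, s, p, False)
    z       = observed perception output Z(s1, s, p)."""
    like_true = o_true if z else 1.0 - o_true
    like_false = o_false if z else 1.0 - o_false
    post = like_true * prior_true
    norm = post + like_false * (1.0 - prior_true)
    return post / norm

# An inaccurate but *known* model (cf. the convergence section's example):
# O(s, s, p, True) = 0.1, O(s, s, p, False) = 0.9.
b = 0.5
b = bayes_update(b, 0.1, 0.9, z=False)  # observing False raises Pr(s |= p)
print(round(b, 2))  # 0.9
```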

Divergence Test on the Belief
If the agent's current estimate about the atomic propositions, encoded in the current belief L̂_j, is not significantly different from the estimate encoded in L̂_h, the algorithm continues attempting to infer a reward machine A_h that encodes the reward function on L̂_h. Nevertheless, if L̂_j changes significantly with respect to L̂_h, the algorithm updates the belief L̂_h and aims to infer a new reward machine (lines 13-18 of Alg. 3). We use the Jensen-Shannon divergence to quantify the change in the belief distribution between two consecutive time steps. The cumulative Jensen-Shannon divergence over the states and the propositions can be expressed as D_JS(L̂_h, L̂_j) = Σ_{s∈S} Σ_{P⊆AP} [ (1/2) L̂_h(s, P) log( L̂_h(s, P) / L̂_m(s, P) ) + (1/2) L̂_j(s, P) log( L̂_j(s, P) / L̂_m(s, P) ) ], where L̂_m = (1/2)(L̂_h + L̂_j) is the average distribution. One of the input parameters to the algorithm is a threshold γ_d on the above divergence. If γ_d is not exceeded, the algorithm uses the belief L̂_h to guide the learning and reward-machine inference. Otherwise, it replaces L̂_h with L̂_j and resets the procedure.
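The divergence test can be sketched as follows (our illustration, with hypothetical per-state distributions over subsets of AP, summed over states as in the cumulative divergence):

```python
from math import log2

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same
    finite support (here: the subsets P ⊆ AP for one state)."""
    m = {k: 0.5 * (p[k] + q[k]) for k in p}  # average distribution
    kl = lambda a, b: sum(a[k] * log2(a[k] / b[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cumulative_jsd(belief_h, belief_j):
    """Sum of per-state divergences between the beliefs L̂_h and L̂_j."""
    return sum(js_divergence(belief_h[s], belief_j[s]) for s in belief_h)

Lh = {"s1": {"{}": 0.5, "{c}": 0.5}}
Lj = {"s1": {"{}": 0.1, "{c}": 0.9}}
GAMMA_D = 1e-5  # divergence threshold gamma_d used in the experiments
print(cumulative_jsd(Lh, Lj) > GAMMA_D)  # True: replace L̂_h with L̂_j
```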

Inference of Reward Machines
The goal of reward-machine inference is to find a reward machine H that is consistent with all the counterexamples in the sample set X, i.e., such that H(λ) = ρ for all (λ, ρ) ∈ X, for the current belief L̂_h. Unfortunately, such a task is computationally hard in the sense that the corresponding decision problem "given a sample X and a natural number k > 0, does a consistent Mealy machine with at most k states exist?" is NP-complete [XGA+20, Gol78]. In this paper we follow the approach of [XGA+20], which learns minimal consistent reward machines with the help of highly-optimized SAT solvers [NJ13, HV10, Nei14]. The underlying idea is to generate a sequence of formulas φ^X_k in propositional logic for increasing values of k ∈ N (starting with k = 1) that satisfy the following two properties: • φ^X_k is satisfiable if and only if there exists a reward machine with k states that is consistent with X; • a satisfying assignment of the variables in φ^X_k contains sufficient information to derive such a reward machine.
By increasing k by one and stopping once φ^X_k becomes satisfiable (or by using a binary search), one obtains an algorithm that learns a minimal reward machine consistent with the given sample.
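The increasing-k search can be illustrated with a naive brute-force consistency check in place of the SAT encoding (a toy substitute of ours; real implementations generate the formulas φ^X_k and call a SAT solver):

```python
from itertools import product

def consistent(delta, sigma, sample):
    """Check that the machine reproduces every (labels, rewards) trace."""
    for labels, rewards in sample:
        v = 0
        for l, r in zip(labels, rewards):
            if sigma[(v, l)] != r:
                return False
            v = delta[(v, l)]
    return True

def learn_minimal_machine(sample, alphabet, rewards, max_k=4):
    """Enumerate Mealy machines with k = 1, 2, ... states and return the
    first consistent one: a stand-in for the SAT-based minimal search."""
    for k in range(1, max_k + 1):
        keys = [(v, l) for v in range(k) for l in alphabet]
        for d_vals in product(range(k), repeat=len(keys)):
            delta = dict(zip(keys, d_vals))
            for s_vals in product(rewards, repeat=len(keys)):
                sigma = dict(zip(keys, s_vals))
                if consistent(delta, sigma, sample):
                    return k, delta, sigma
    return None

# Sample X: "reward 1 the first time label g occurs" needs two states.
X = [(["", "g", "g"], [0, 1, 0]), (["g", ""], [1, 0])]
k, delta, sigma = learn_minimal_machine(X, ["", "g"], [0, 1])
print(k)  # 2
```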
As stressed in Example 1, however, a reward machine that encodes R onL h might not exist and the inference procedure might end up creating an arbitrarily large Mealy machine and yielding very high computation times (trying to compute a Mealy machine that does not exist). To prevent such cases, one can limit the maximum allowable number of states of the Mealy machine and the maximum number of episodes used to generate the counterexamples, and use the last valid inferred Mealy machine (or the initial estimate if there is no valid one). In the experimental results, such cases did not prevent the successful convergence of the proposed algorithm.

Convergence in the limit
In this section, we establish the correctness of Algorithm 3. We provide theoretical guarantees on its convergence based on two sufficient conditions. First, the estimate L̂(L̂_h, ·) becomes identical to the ground-truth labelling function L_G after a finite number of episodes. Second, the length of each episode is larger than 2^{|M|+1}(|A| + 1) − 1, where A is the unknown reward machine that encodes the task to be accomplished. The first condition is tightly connected to the observation model O of the agent (see Def. 3) and its effect on the updates (2). It can be verified that, when there is no uncertainty in the observation model, i.e., it is sufficiently accurate or inaccurate (the values O(s_1, s_2, p, b) are close to one and zero, depending on the value of b), the updates (2) lead to an estimate L̂(L̂, ·) that converges to L_G(·). Intuitively, if the agent is aware that its observation model is either accurate or inaccurate, it adjusts its updates accordingly, based on the respective perception outputs. Consider, for instance, an atomic proposition p ∈ AP satisfying s |= p for some state s ∈ S. Let also an observation model satisfying O(s, s, p, True) = 0.1 and O(s, s, p, False) = 1 − O(s, s, p, True) = 0.9, implying that the agent "knows" that the model is quite inaccurate. Then, a sensor output Z(s, s, p) = False will result in an update that, according to (2), gives a better estimate regarding the truth value of p at s. On the contrary, if O(s, s, p, True) = O(s, s, p, False) = 0.5, then the agent has no a priori information regarding the accuracy of the observation model, and the beliefs are not updated. Nevertheless, the conditions stated above are only sufficient for the convergence guarantees. The experiments of the next section illustrate that the proposed algorithm learns how to accomplish the underlying task even with random observation models.
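A quick simulation (with hypothetical numbers matching the example above: an inaccurate but known model with O(s, s, p, True) = 0.1, and the sensor drawn from the same model) illustrates the belief converging to the ground truth, whereas an uninformative model with value 0.5 leaves the belief unchanged. The update function is our own Bayesian sketch, not the paper's equation (2).

```python
import random

def bayes_step(prior, o_true, z):
    """Posterior Pr(s |= p) after observing Z = z; the model is assumed
    to satisfy O(s, s, p, False) = 1 - O(s, s, p, True), as in the example."""
    like_t = o_true if z else 1.0 - o_true
    like_f = (1.0 - o_true) if z else o_true
    post = like_t * prior
    return post / (post + like_f * (1.0 - prior))

random.seed(0)
O_TRUE = 0.1  # Pr(Z = True | p is True): inaccurate, but known to the agent
belief = 0.5  # uninformative prior; the ground truth is s |= p
for _ in range(100):
    z = random.random() < O_TRUE  # sensor output simulated from the model
    belief = bayes_step(belief, O_TRUE, z)
print(belief > 0.99)  # True: the belief converges to the ground truth

# With O(s, s, p, True) = 0.5 the likelihoods cancel and nothing is learned:
print(bayes_step(0.5, 0.5, True))  # 0.5
```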
Before proceeding with the convergence result, we introduce some necessary concepts, starting with attainable trajectories. Definition 4. Let M = (S, s_I, A, T, R, AP, L̂, γ) be an MDP and m ∈ N a natural number. We call a trajectory ζ = s_0 a_0 s_1 … s_k a_k s_{k+1} ∈ (S × A)* × S m-attainable if (i) k ≤ m and (ii) T(s_i, a_i, s_{i+1}) > 0 for each i ∈ {0, …, k}. Moreover, we say that a trajectory ζ is attainable if there exists an m ∈ N such that ζ is m-attainable.
Since the function GetEpsilonGreedyAction() (line 3 of Algorithm 4) follows an ε-greedy policy, we can show that the proposed algorithm almost surely explores every attainable trajectory in the limit, i.e., with probability 1 as the number of episodes goes to infinity. Consider now Algorithm 3 and assume that the condition of SignifChange(L̂_h, L̂_j) (line 13) is not satisfied after a certain number of episodes, i.e., L̂_h remains fixed. Then, Lemma 1 implies that Algorithm 3 almost surely explores every m-attainable label sequence on L̂_h in the limit, as formalized in the following corollary.
Corollary 1. Assume that there exists an n_r > 0 such that SignifChange(L̂_h, L̂_j) = False (line 13 of Algorithm 3) for all episodes n > n_r. Then Algorithm 3, with eplength ≥ m, almost surely explores every m-attainable label sequence on L̂_h in the limit.
Therefore, if Algorithm 3 explores sufficiently many m-attainable label sequences on some distribution L̂_h and for a large enough value of m, it is guaranteed to infer a reward machine that is "good enough" in the sense that it is equivalent to the reward machine encoding the reward function R on L̂_h and on all attainable label sequences on L̂_h, assuming that such a reward machine exists (see Example 1). This is formalized in the next lemma.
Lemma 2. Let M = (S, s_I, A, T, R, AP, L̂, γ) be an MDP and assume that there exists a reward machine A_h that encodes the reward function R on L̂(L̂_h, ·) for some belief L̂_h. Assume that there exists an n_r > 0 such that SignifChange(L̂_h, L̂_j) = False (line 13 of Alg. 3) for all episodes n > n_r. Then, Algorithm 3, with eplength ≥ 2^{|M|+1}(|A_h| + 1) − 1, almost surely learns a reward machine in the limit that is equivalent to A_h on all attainable label sequences on L̂_h.
Proof. The proof is identical to the one in [XGA + 20, Lemma 2] and is omitted.
Following Lemma 2, Algorithm 3 will eventually learn the reward machine encoding the reward function on L̂_h, when L̂_h is fixed. Intuitively, Lemma 2 suggests that, eventually, the algorithm is able to learn a reward machine that encodes the reward function on a fixed estimate L̂(L̂_h, ·), i.e., without the updates from the new observations. Nevertheless, the potential perception updates (line 14 of Algorithm 4) might prevent the algorithm from learning a reward machine, since L̂_h might keep changing after a finite number of episodes. However, if the observation model is accurate (or inaccurate) enough and the agent explores sufficiently many label sequences, intuition suggests that L̂_j will be constantly improving, leading to less frequent updates with respect to the divergence test (line 13 of Algorithm 3). Therefore, there exists a finite number of episodes after which L̂_h will be fixed to some belief L̂_f, and L̂(L̂_f, ·) will be "close" to the ground-truth function L_G. By applying Lemma 2, we then conclude that Algorithm 3 will learn the reward machine that encodes the reward function on L̂_f, and consequently converge to the q-function that defines an optimal policy. This is formalized in the next theorem.
Theorem 1. Let M = (S, s_I, A, T, R, AP, L̂, γ) be an MDP and A be a reward machine that encodes the reward function R on the ground truth. Further, assume that there exists a constant n_f > 0 such that L̂(L̂_h, ·) = L_G(·) for all episodes n > n_f of Algorithm 3. Then, Algorithm 3, with eplength ≥ 2^{|M|+1}(|A| + 1) − 1, converges almost surely to an optimal policy in the limit.
Proof. Note first that the ε-greedy action policy, imposed by the function GetEpsilonGreedyAction(), and eplength ≥ |M| imply that every state-action pair of M will be visited infinitely often. Moreover, SignifChange(L̂_h, L̂_j) will always be False for all episodes n > n_f, since L̂_h will be an accurate enough estimate. By definition, the reward machine A that encodes the reward function on the ground truth encodes the reward function on L̂_h as well. Therefore, Lemma 2 guarantees that Algorithm 3 eventually learns a reward machine H that is equivalent to A on all attainable label sequences.
According to Observation 1 of [IKVM18], the product of M and H forms an MDP M_H with a Markovian reward function such that every attainable label sequence of M_H receives the same reward as in M. Thus, an optimal policy for M_H will also be optimal for M. Due to the ε-greedy action policy, the episode length eplength ≥ |M|, and the fact that the updates are done in parallel for all states of the reward machine H, every state-action pair of the MDP M_H will be visited infinitely often. Therefore, convergence of q-learning for M_H is guaranteed [WD92]. Finally, since an optimal policy for M_H is optimal for M, Algorithm 3 converges to an optimal policy too.
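To illustrate the construction used in the proof, the q-function can be maintained over product states (s, v), where s is an MDP state and v a state of the hypothesis reward machine. The following is a minimal sketch with hypothetical names and parameter values; it is not the paper's implementation.

```python
import random
from collections import defaultdict

def q_update(Q, s, v, a, r, s2, v2, actions, alpha=0.1, gamma=0.9):
    """One q-learning update on the product state (s, v): the target
    bootstraps from the best action in the next product state (s2, v2)."""
    best_next = max(Q[(s2, v2, b)] for b in actions)
    Q[(s, v, a)] += alpha * (r + gamma * best_next - Q[(s, v, a)])

def epsilon_greedy(Q, s, v, actions, eps=0.3):
    """Epsilon-greedy action selection over the product state."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, v, a)])
```

Because the q-values are indexed by reward-machine states, updates can be performed in parallel for every state of the hypothesis machine, as the proof requires.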

Experimental Results
We test the proposed algorithm in floor plan environments based on the indoor layouts collected in the HouseExpo dataset [TDC+19]. HouseExpo consists of indoor layouts, built from 35,126 houses with a total of 252,550 rooms. There are 25 room categories, such as bedroom, garage, office, boiler room, etc. The dataset provides bounding boxes for the rooms with their corresponding categories, along with 2D images of layouts. We pre-process these layouts to generate grid-world-like floor plans, where every grid is labeled with the type of the room it is in (see Fig. 4). We use these labels as atomic propositions to define tasks encoded as reward machines. In particular, we use the set of atomic propositions AP = {kitchen, bathroom, bedroom, office, indoor}.
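As a rough sketch of this pre-processing step (the function name and bounding-box format are hypothetical, not the paper's exact pipeline), the room bounding boxes can be rasterized into a labeled grid:

```python
def label_grid(width, height, rooms):
    """Rasterize room bounding boxes into a grid of room-type labels.
    rooms: list of (label, x0, y0, x1, y1) with inclusive cell bounds.
    Cells covered by no room default to the generic label 'indoor'."""
    grid = [['indoor'] * width for _ in range(height)]
    for label, x0, y0, x1, y1 in rooms:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                grid[y][x] = label
    return grid
```

Each cell's label then serves directly as the truth assignment of the corresponding atomic proposition in AP.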
In the experiments, an episode terminates when the agent receives a non-zero reward or the maximum number of steps allowed in one episode is exceeded. This maximum number of steps per episode is 1000, whereas the maximum number of total training steps is 500,000. Furthermore, we set the Jensen-Shannon divergence threshold γ_d of Section 4.3 to 10^−5, and Algorithm 4 selects a random action with probability ε = 0.3. Our implementation utilizes the RC2 SAT solver [MDMS14] from the PySAT library [IMM18]. All experiments were conducted on a Lenovo laptop with a 2.70-GHz Intel i7 CPU and 8 GB of RAM.
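The Jensen-Shannon divergence test can be sketched in a few lines of pure Python; the function names are hypothetical, and the default threshold mirrors the γ_d = 10^−5 value above:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete
    distributions given as equal-length probability lists."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def significant_change(belief_old, belief_new, threshold=1e-5):
    """Trigger a belief update only when the two belief distributions
    have drifted past the divergence threshold."""
    return js_divergence(belief_old, belief_new) > threshold
```

With base-2 logarithms the divergence is bounded in [0, 1], so a fixed threshold such as 10^−5 is meaningful across propositions.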
We compare five experimental settings regarding the observation model and the belief updates: a time-varying Bayesian neural network (TvBNN), a fixed Bayesian neural network (FiBNN), two random observation models (Random and Random2), and a setting without belief updates. The Bayesian observation model in the first two settings is a neural network of three densely connected layers with flipout estimators [WVB+18]. The loss function is the evidence lower-bound (ELBO) loss, a combination of Kullback-Leibler divergence and categorical cross-entropy. Given the information about a grid in a floor plan, the Bayesian neural network (BNN) predicts category probabilities of the room where the grid is located. The input vector consists of the 2D coordinates of the grid, the size of the room, and the number of its neighbours. The output is the predicted probabilities of the room type. The training set consists of 3,000 pre-processed layouts from the HouseExpo dataset, which we select based on the floor plan size and the number of different room categories. After pre-processing, to eliminate variance with respect to the initial position of grids, we rotate every training layout three times by 90 degrees and add each rotated version to the training set.
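For illustration, the input encoding and the conversion of raw network outputs into room-category probabilities might look as follows. The actual model uses Bayesian layers with flipout estimators, which are omitted here; all names are hypothetical.

```python
import math

# The five atomic propositions used in the experiments.
ROOM_TYPES = ['kitchen', 'bathroom', 'bedroom', 'office', 'indoor']

def make_features(x, y, room_size, n_neighbours):
    """Input vector as described in the text: 2D grid coordinates,
    the size of the room, and its number of neighbouring rooms."""
    return [float(x), float(y), float(room_size), float(n_neighbours)]

def softmax(logits):
    """Normalize raw network outputs into room-category probabilities,
    using the max-subtraction trick for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting probability vector over ROOM_TYPES is what feeds the agent's belief over the atomic propositions at that grid.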
For the reward-machine-inference part of the algorithm, we limit the maximum allowable number of states of the hypothesis reward machine to 4. Similarly, we limit the maximum allowable size of the trace sets, i.e., the number of episodes whose traces are recorded, to 20. If the inference exceeds the allowed number of states, the process is stopped and training continues with the last valid hypothesis reward machine until the next inference step.
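This guard around the inference step can be sketched as follows, with the SAT-based inference abstracted into a hypothetical callable `infer_rm(traces, n_states)` that returns a machine or `None`:

```python
def safe_infer(infer_rm, traces, last_valid_rm, max_states=4):
    """Try to infer a reward machine of increasing size up to max_states.
    If no machine within the cap is consistent with the traces, keep the
    last valid hypothesis so training can continue uninterrupted."""
    for n_states in range(1, max_states + 1):
        rm = infer_rm(traces, n_states)
        if rm is not None:
            return rm
    return last_valid_rm  # cap exceeded: fall back to current hypothesis
```

In the actual implementation the consistency check is delegated to the RC2 MaxSAT solver; the fallback ensures that a failed inference never leaves the agent without a hypothesis machine.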
We first specify the following task to the agent: φ 1 = "Go to bedroom or office, and then to the kitchen", which is encoded by the reward machine shown in Fig. 5.
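One possible way to encode such a reward machine in code is as a transition table mapping (state, label) pairs to (next state, reward) pairs; the state names below are hypothetical and only illustrate the structure of the task in Fig. 5:

```python
# Sketch of the reward machine for
# phi_1 = "go to bedroom or office, and then to the kitchen".
# States: 'u0' (start), 'u1' (first subgoal reached), 'acc' (task done).
RM_PHI1 = {
    ('u0', 'bedroom'): ('u1', 0.0),
    ('u0', 'office'):  ('u1', 0.0),
    ('u1', 'kitchen'): ('acc', 1.0),
}

def rm_step(rm, state, label):
    """Advance the machine on one observed label; labels without an
    explicit transition self-loop with zero reward."""
    return rm.get((state, label), (state, 0.0))
```

Reaching 'acc' yields the single non-zero reward, which is also what terminates a training episode in the experimental setup above.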
We execute the proposed algorithm in 75 test floor plans. For each of the first four experimental settings, where updates according to an observation model are performed, the learning agent is trained 5 times in each test floor plan, each run corresponding to a different random seed, leading to a total of 375 runs. For the last setting, which does not include belief updates, we removed 12 test floor plans where the inference procedure took an unreasonably long time to finish, resulting in 63 test floor plans. This is attributed to the fact that the belief functions are randomly initialized and not updated throughout the training. For this last setting, the learning agent is also trained 5 times in each test floor plan, each run corresponding to a different random seed, leading to a total of 290 runs. In each of the aforementioned runs, we perform an evaluation episode every 100 training steps and record the obtained rewards. Figure 6 depicts the progression of these rewards during training, separated into the rewards between the 25th and 75th percentiles and those between the 10th and 90th percentiles. The figures also depict the median values with a solid line. Similarly, Table 1 shows the training steps required for convergence for the 25th (LP), 50th (MP), and 75th (HP) percentiles of the attained rewards. The table also depicts the ratio of successful inferences of reward machines (RS) and the average number of belief updates (BU). One concludes that, in the time-varying BNN ("TvBNN" in Table 1) setting, the algorithm achieves on average faster convergence while requiring fewer belief updates. Additionally, the results indicate that the algorithm converges more slowly in the fixed BNN ("FiBNN" in Table 1) setting, which is attributed to the fact that only one set of observation probabilities is sampled and used throughout the training, whereas in the other settings one set is sampled at the beginning of every training episode.
Moreover, the first random observation model ("Random" in Table 1) is faster than the second one ("Random2" in Table 1): since the range from which probabilities are sampled uniformly is wider, the perception outputs are less random, convergence is faster, and fewer belief updates are necessary. Finally, one concludes that all settings where belief updates are performed, with both time-varying and fixed observation models, outperform the setting without any updates ("No update" in Table 1). It is also noteworthy that, in the no-update setting, the algorithm does not converge to the optimal policy in 21 of the 63 test floor plans. We further demonstrate the progression of the belief updates throughout the training in Figure 7. Fig. 7a shows the number of training runs (y-axis) in which the belief function is updated for the k-th time (x-axis). For instance, in the time-varying BNN ("TvBNN") setting, the learning agent updates its belief for the third time in around 20 of the 375 training runs (5 runs per floor plan where a reward machine is successfully inferred). Fig. 7b shows the average training step (y-axis) at which the belief function is updated for the k-th time (x-axis), and Fig. 7c displays the average Jensen-Shannon divergence value that led to the k-th belief update. The time-varying BNN ("TvBNN") setting allows the learning agent to obtain a belief function that represents the environment well within a few steps. Sampling the observation probabilities only once, as in the fixed BNN ("FiBNN") setting, slows down the process, since the sampled observation might not be informative. Regarding the random observation models, due to the wide uniform distribution range, the agent samples probabilities that are close to zero or one, i.e., it is more "certain" that an atomic proposition will not or will be observed, respectively.
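A random observation model of the kind compared above can be sketched by sampling one observation probability per atomic proposition from a uniform range; the parametrization by a range width centred at 0.5 is our assumption for illustration, not the paper's exact construction:

```python
import random

def sample_observation_probs(props, width, rng=random):
    """Sample one observation probability per atomic proposition from
    a uniform range centred at 0.5. A wider range admits probabilities
    closer to 0 or 1, i.e., a more 'certain', less random perception."""
    lo, hi = 0.5 - width / 2, 0.5 + width / 2
    return {p: rng.uniform(lo, hi) for p in props}
```

Under this reading, the "Random" model corresponds to a wider width than "Random2", which is consistent with it producing less random perception outputs and requiring fewer belief updates.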
This is why the second random observation model performs more belief updates until the divergence score drops below the threshold. Overall, it is expected that, in the BNN settings, the algorithm performs fewer belief updates than in the random settings.
To test the capabilities of our method in a more complicated scenario, we set a second task as φ 2 = "Go to kitchen while avoiding the bathroom, and then go to the bathroom while avoiding the bedroom", which is encoded by the reward machine shown in Fig. 8.
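The avoidance structure of φ_2 can be illustrated with a transition-table encoding in which violating an avoidance condition moves the machine to an absorbing rejecting state; the state names and encoding below are hypothetical, mirroring only the structure of the task in Fig. 8:

```python
# Sketch of the reward machine for phi_2 = "go to kitchen avoiding the
# bathroom, then go to the bathroom avoiding the bedroom".
# 'rej' is an absorbing failure state with no outgoing reward.
RM_PHI2 = {
    ('u0', 'kitchen'):  ('u1', 0.0),
    ('u0', 'bathroom'): ('rej', 0.0),  # violated the first avoidance
    ('u1', 'bathroom'): ('acc', 1.0),
    ('u1', 'bedroom'):  ('rej', 0.0),  # violated the second avoidance
}

def run_labels(rm, labels, state='u0'):
    """Run a whole label sequence through the machine, returning the
    final state and the total accumulated reward. Labels without an
    explicit transition self-loop with zero reward."""
    total = 0.0
    for lab in labels:
        state, r = rm.get((state, lab), (state, 0.0))
        total += r
    return state, total
```

Because 'rej' has no outgoing transitions, any trace that touches an avoided room before its subgoal can never earn the task reward, which is what makes φ_2 harder to infer and learn than φ_1.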
For this second task, we execute the proposed algorithm in 7 test floor plans. Figs. 9 and 10 and Table 2 show the convergence results and belief updates, similar to the ones for φ_1, for the first four settings; we omit the last one (no belief updates), since training fails to yield an inferred reward machine. As in the results of Table 1, the average number of belief updates is smaller for the BNN observation settings than for the random observation settings. The TvBNN setting requires the fewest belief updates on average, whereas the Random2 setting requires the most. However, the results are more difficult to interpret and compare in terms of convergence. For example, the Random setting converges the slowest and fastest in the lower and upper quartiles, respectively, whereas the opposite holds for the Random2 setting. Nevertheless, the belief update pattern observed in the previous experiment remains unchanged. One can still conclude that the settings with belief updates outperform the final setting, where no belief updates are performed.

Conclusion and Discussion
We develop a reinforcement-learning algorithm subject to reward-machine-encoded tasks and perceptual limitations. The algorithm holds probabilistic beliefs over the truth values of the atomic propositions and uses a hypothesis reward machine to estimate the reward machine that encodes the task to be accomplished. Both the beliefs and the hypothesis reward machine are updated by using the agent's sensor measurements and the rewards obtained along the agent's trajectories. The algorithm uses the aforementioned beliefs and the hypothesis reward machine in a q-learning procedure to learn an optimal policy that accomplishes this task. Currently, the theoretical guarantees provided by Theorem 1 presuppose that the algorithm obtains a good enough estimate of the atomic propositions' values, as encoded in the ground-truth labelling function L_G, i.e., L̂(L̂_h, ·) = L_G(·). The definition of L̂(L̂_h, ·) in (1), based on the most probable atomic propositions p for state s, i.e., L̂_h(s, p) ≥ 0.5, relaxes this assumption, which the experimental results further verify to hold. Nevertheless, our future directions aim to examine how the observation model O(·), given in Def. 3, affects the estimate L̂(L̂_h, ·) and, consequently, the inferred reward machine and the learned policy. Another limitation of the proposed algorithm is that it is currently restricted to discrete and static state and action spaces. In the future, we will consider continuous spaces, possibly by deploying neural-network approximations, and cases where the ground-truth values of the atomic propositions change with time.