Discovery and learning of models with predictive state representations for dynamical systems without reset
Introduction
Modeling dynamical systems is a common problem in science and engineering, with wide application in fields such as artificial intelligence, operations research, and computer science [1]. In artificial intelligence, one central and challenging problem concerns agents operating in environments that are partially observable and stochastic, i.e., how an agent can plan and act optimally in such environments. A commonly used approach to this problem is to first build a model of the system and then solve the problem using the obtained model.
Thus far, perhaps the most general framework for modeling such controlled dynamical systems is the Partially Observable Markov Decision Process (POMDP) [2], but it is well known that the POMDP model is based on unobserved or hidden states and requires much prior knowledge to learn in practice. As an alternative, Littman, Sutton, and Singh (2002) proposed a new framework for modeling such systems, called predictive state representations (PSRs) [3], which models a system by defining and operating on the PSR state. A PSR state is a vector of predictions, or outcome probabilities, for tests that could be performed on the system, where a test is a sequence of action-observation pairs. The PSR state summarizes all information regarding the past, and using it allows one to transform the original POMDP into a Markov decision process (MDP). Unlike the POMDP model, which is based on unobserved or hidden states, the PSR model is expressed entirely in terms of observable quantities. Therefore, learning PSR models of dynamical systems from observation data should be easier and less prone to local-minimum problems than learning POMDP models, and PSR models are potentially more compact than POMDP models. At the same time, the PSR-based model is more expressive than the nth-order Markov model [3]. It has been proved that the PSR state involves only the predictions of core tests, rather than the predictions of all tests. The details are discussed in Section 2.
Controlled dynamical systems can be divided into two types: systems with a reset action and systems without one. In many real environments, a reset action is not available, i.e., there is no action that the learning algorithm can invoke to return the system to its original initial condition [4]. Therefore, the only data available for obtaining the PSR model is a continuous trace of actions and observations, and how to use such data to build a PSR model is an important research problem for PSRs.
To date, the primary method for discovery and learning of predictive state representations in dynamical systems without reset is a Monte Carlo algorithm called suffix-history [4], [5]. The idea behind suffix-history is to treat every time step of the training data as the start of a new sample, i.e., the history at every time step of the training data is treated as the null history. Here the definition of history is similar to that of test, in that both are sequences of action-observation pairs, but a history is constrained to start from the beginning of time and describes the full sequence of past events. For example, given training data such as a1o1a2o2a3o3, the suffix-history algorithm yields the training sequences a1o1a2o2a3o3, a2o2a3o3, and a3o3, and the prediction of a test at a history can then be estimated by Monte Carlo approaches. The algorithm is described in detail in Ref. [6]. A potential problem with this algorithm is that the history it identifies may not be the actual history. As a consequence, the prediction vector at the identified history may be a weighted sum of the prediction vectors at different actual histories. It is therefore possible that prediction vectors other than the initial prediction vector are incorrect, and only under some special conditions can the PSR model obtained by this algorithm reflect the true PSR model of the system [6]. In the work of Ref. [7], an interesting algorithm called the constrained gradient algorithm was also proposed to address the problem of discovering and learning the PSR model in dynamical systems without reset. This algorithm exploits several structural constraints of a valid prediction matrix to obtain the prediction of a test at a history, which differs from the majority of existing PSR discovery and learning algorithms, which use Monte Carlo prediction estimation [4], [8], [9], [10], [11], [12].
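The suffix-history sampling and Monte Carlo estimation described above can be sketched as follows (the function names and the toy trace are illustrative, not from the paper):

```python
def suffix_histories(trace):
    """Treat every time step of a trace of (action, observation)
    pairs as the start of a new sample (the suffix-history idea)."""
    return [trace[i:] for i in range(len(trace))]

def monte_carlo_prediction(samples, history, test):
    """Estimate p(test | history) by counting, over all suffix samples
    that begin with `history`, how often `test` succeeds among the
    samples in which the test's actions were actually executed."""
    h, t = len(history), len(test)
    tried = matched = 0
    for s in samples:
        if s[:h] != history:
            continue
        # only samples where the test's action sequence was executed count
        if [a for a, _ in s[h:h + t]] == [a for a, _ in test]:
            tried += 1
            if s[h:h + t] == test:
                matched += 1
    return matched / tried if tried else 0.0

trace = [('a1', 'o1'), ('a2', 'o2'), ('a1', 'o1'), ('a2', 'o2')]
samples = suffix_histories(trace)
print(monte_carlo_prediction(samples, [], [('a1', 'o1')]))  # -> 1.0
```

Note that the estimator conditions on the test's actions having been taken, which is why a blind (action-independent) policy makes the Monte Carlo estimate unbiased.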
However, as noted in that paper, when this method is used to obtain the prediction of a test at a history, the threshold parameter for the condition number can only be determined empirically, whereas it can be determined by a rule when the predictions are estimated by Monte Carlo approaches [8]. Because such a threshold parameter is a key component of the core-test discovery process and is very difficult to determine without such a rule, in this paper we do not consider the work of Ref. [7] further and instead use Monte Carlo approaches to estimate the prediction of a test at a history.
The key component of discovery and learning of PSRs in dynamical systems without reset is identifying the history at any time step in the training data, because once the history at any time step can be identified, it is straightforward to obtain the PSR model [4], [8]. In this paper, after obtaining the set of landmarks of a system and proving that PSR models that use different states as the initial state have the same model parameters, we propose a method for identifying the history at any time step in the training data, from which the PSR model can then be obtained. We start with a brief introduction to PSRs in Section 2. In Section 3, the problem of identifying the history at any time step and obtaining the PSR model is addressed. The experimental results are evaluated and compared in Section 4, and the conclusion is given in Section 5. The systems we are interested in are discrete-time, finite-observation, stochastic, and representable by a finite POMDP model.
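The landmark idea can be sketched in miniature (the function name and the flat-trace encoding are illustrative assumptions, not the paper's implementation): once a landmark, i.e., a memory that uniquely identifies the system state, has just been observed in the continuous trace, the state at that point is known, so the history at every subsequent time step is identifiable relative to that point.

```python
def landmark_anchors(flat_trace, landmarks):
    """Positions in a flat a1, o1, a2, o2, ... trace immediately after
    which the state is known, because one of the landmark memories
    (tuples of symbols) has just been observed."""
    anchors = []
    for end in range(1, len(flat_trace) + 1):
        for lm in landmarks:
            n = len(lm)
            if end >= n and tuple(flat_trace[end - n:end]) == tuple(lm):
                anchors.append(end)
                break
    return anchors

flat = ['a1', 'o1', 'a2', 'o2', 'a1', 'o1']
print(landmark_anchors(flat, [('a1', 'o1')]))  # -> [2, 6]
```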
Predictive state representations
As mentioned above, PSR-based models represent the state of a system as a vector of predictions for core tests. Given a discrete set of actions and a discrete set of observations, the prediction of a length-m test t = a1o1...amom at a length-n history h = a^1o^1...a^no^n is defined as p(t|h) = prob(o1, ..., om | h, a1, ..., am) [12]. A set of tests Q = {q1, ..., qk} constitutes a PSR if its prediction vector p(Q|h) = [p(q1|h), ..., p(qk|h)] is a sufficient statistic for the history, i.e., the prediction of every test can be computed from it.
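For concreteness, the standard linear-PSR state update from Littman et al. [3], where each test t has a weight vector m_t with p(t|h) = p(Q|h)^T m_t, can be sketched as follows (the numerical parameters here are made up for illustration):

```python
import numpy as np

def psr_update(state, m_ao, M_ao):
    """Linear PSR state update after taking action a and observing o.

    state : p(Q|h), predictions of the core tests at history h
    m_ao  : weight vector giving p(ao|h) = state @ m_ao
    M_ao  : matrix whose i-th column gives p(ao q_i | h) = state @ M_ao[:, i]
    Returns p(Q | h.ao) by Bayes' rule: p(ao q_i | h) / p(ao | h).
    """
    return (state @ M_ao) / (state @ m_ao)

# Illustrative two-core-test example (all parameters are made up).
state = np.array([0.5, 0.5])
m_ao = np.array([0.4, 0.6])            # p(ao|h) = 0.5
M_ao = np.array([[0.2, 0.1],
                 [0.3, 0.4]])          # columns: p(ao q_i | h) weights
print(psr_update(state, m_ao, M_ao))   # -> [0.5 0.5]
```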
Preliminaries
In the previous work of James et al. [11], the definitions of memory and landmark and the properties of landmarks were given.
Definition of memory: a sequence of alternating actions and observations that ends with an observation. A length-n history contains n action-observation pairs, whereas a length-n memory contains n actions and observations in total, and a memory need not start with an action or at the initial time step.
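The distinction between histories and memories can be illustrated with a small sketch (the helper name and the flattened a1, o1, a2, o2, ... trace encoding are assumptions for illustration): a memory may begin anywhere in the trace, as long as it ends with an observation.

```python
def length_n_memories(trace, n):
    """All length-n memories occurring in a flat action/observation trace.

    `trace` is the flattened sequence a1, o1, a2, o2, ...; a length-n
    memory is any n consecutive symbols that end with an observation,
    so it need not start with an action or at the initial time step.
    """
    memories = set()
    for end in range(n, len(trace) + 1):
        if end % 2 == 0:   # even slice ends land on an observation
            memories.add(tuple(trace[end - n:end]))
    return memories

trace = ['a1', 'o1', 'a2', 'o2', 'a1', 'o2']
print(sorted(length_n_memories(trace, 2)))
# -> [('a1', 'o1'), ('a1', 'o2'), ('a2', 'o2')]
```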
Experiments and results
In this section, we test the performance of our algorithm on a set of dynamical systems taken from Cassandra's POMDP page [13]; these systems are commonly used in evaluating PSRs and POMDPs. For each system, training data was generated. Existing work shows how to obtain training data using a non-blind policy [12]; however, since our objective is only to evaluate the effect of identifying the history with the proposed method, we use a blind policy, i.e., a uniform random policy over the actions.
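Data collection under a blind policy can be sketched as follows (the `env.step` simulator interface and the toy environment are assumptions for illustration, not the benchmark systems used in the experiments):

```python
import random

def generate_trace(env, actions, steps, seed=0):
    """Collect one continuous trace of action-observation pairs (no reset)
    under a blind policy: actions are drawn uniformly at random,
    independent of the history so far."""
    rng = random.Random(seed)
    trace = []
    for _ in range(steps):
        a = rng.choice(actions)   # blind: ignores all past data
        o = env.step(a)           # assumed simulator interface
        trace.append((a, o))
    return trace

# Toy deterministic environment standing in for a POMDP simulator.
class ToyEnv:
    def step(self, action):
        return 'o1' if action == 'a1' else 'o2'

trace = generate_trace(ToyEnv(), ['a1', 'a2'], steps=5)
print(trace)
```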
Conclusion
We have demonstrated in this paper that any landmark can be used as the initial state. The concept of landmark is then used to identify the history, so the history at any time step besides the null history can be identified, which was unsolved by existing PSR discovery and learning algorithms for dynamical systems without reset; the PSR model can then be obtained using only a continuous trace of actions and observations as training data. Our empirical results on a standard set of problem domains confirm the effectiveness of the proposed approach.
References (14)
- S. Singh, M. James, M. Rudary, Predictive state representations: a new theory for modeling dynamical systems, in:...
- L.P. Kaelbling, M.L. Littman, A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artif....
- M. Littman, R. Sutton, S. Singh, Predictive representation of state, in: Advances in Neural Information Processing...
- B. Wolfe, M. James, S. Singh, Learning predictive state representations in dynamical systems without reset, in:...
- D. Wingate, S. Singh, On discovery and learning of models with predictive state representations of state for agents...
- M. James, Using Predictions for Planning and Modeling in Stochastic Environments, PhD Thesis, Department of Computer...
- P. McCracken, M. Bowling, Online discovery and learning of predictive state representations, in: Advances in Neural...