
DCOB: Action space for reinforcement learning of high DoF robots


Abstract

Reinforcement learning (RL) for robot control is an important technology for future robots, since it allows a robot's behavior to be designed through a reward function. However, RL for high degree-of-freedom robot control is still an open issue. This paper proposes a discrete action space, DCOB, which is generated from the basis functions (BFs) given to approximate a value function. The key feature is that reducing the number of BFs, which enables the robot to learn the value function quickly, also reduces the size of DCOB, which further improves the learning speed. In addition, a method named WF-DCOB is proposed to enhance performance, in which wire-fitting is used to search for continuous actions around each discrete action of DCOB. We apply the proposed methods to motion learning tasks of a simulated humanoid robot and a real spider robot, and the experimental results demonstrate outstanding performance.


Notes

  1. For a vector \(\mathbf{x}=(x_1,\dots {},x_D)\), the maximum norm is defined as \(\Vert \mathbf{x}\Vert _\infty = \max _m{|x_m|}\).

  2. We do not abbreviate the trajectory based on the output of the BFs because, when the dynamics is a POMDP, using the BF output to terminate the action may further complicate the dynamics.

  3. Actually, unit division and unit deletion are implemented.

  4. Open Dynamics Engine: http://www.ode.org/

  5. We start the EM algorithm with \(200\) BFs and obtain \(202\) trained BFs.

  6. The term, \(\dot{c}_{0x}(t) e_{\mathrm{{z}}1}(t) + \dot{c}_{0y}(t) e_{\mathrm{{z}}2}(t)\), indicates the velocity of the body link projected into the \((e_{\mathrm{{z}}1},e_{\mathrm{{z}}2},0)\) direction; that is, the \(x\)\(y\) direction from the body link to the head link.

  7. A laptop PC: Pentium M \(2 \text{ GHz }\) CPU, \(512 \text{ MB }\) RAM, Debian Linux.

  8. We assume that a simple PD-controller is used as the low-level controller.

  9. \(\varvec{\varSigma }_k^\mathcal Q \) is calculated from the original covariance matrix \(\varvec{\varSigma }_k\) (on the \(\mathcal X \) space) as follows. For ease of calculation, let \(\mathbf{C}_{\mathrm{{P}}}(\mathbf{x})=\hat{\text{ C }}_\mathrm{{p}}\mathbf{x}\) where \(\hat{\text{ C }}_\mathrm{{p}}\) is a constant matrix. The converted covariance matrix is \(\varvec{\varSigma }_k^\mathcal Q = \hat{\text{ C }}_\mathrm{{p}} \varvec{\varSigma }_k \hat{\text{ C }}_\mathrm{{p}}^\top \).


Acknowledgments

Part of this work was supported by a Grant-in-Aid for JSPS Fellows (22\(\cdot {}\)9030) from the Japan Society for the Promotion of Science.

Author information

Correspondence to Akihiko Yamaguchi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mpg 23060 KB)

Appendices

Appendix A Wire-fitting

For a continuous state \(\mathbf{x}\in \mathcal X \) and a continuous action \(\mathbf{u}\in \mathcal U \), wire-fitting is defined as:

$$\begin{aligned} Q(\mathbf{x},\mathbf{u})&= \lim _{\epsilon \rightarrow 0^+} \frac{\sum _{i\in \mathcal W } (d_i+\epsilon )^{-1}q_i(\mathbf{x})}{\sum _{i\in \mathcal W } (d_i+\epsilon )^{-1}} , \end{aligned}$$
(35)
$$\begin{aligned} d_i&= \Vert \mathbf{u}-\mathbf{u}_i(\mathbf{x}) \Vert ^2 + C\bigl [\max _{i^{\prime }\in \mathcal W } (q_{i^{\prime }}(\mathbf{x})) - q_i(\mathbf{x})\bigr ]. \end{aligned}$$
(36)

Here, each pair of functions \(q_i(\mathbf{x}):\mathcal X \rightarrow \mathbb R \) and \(\mathbf{u}_i(\mathbf{x}):\mathcal X \rightarrow \mathcal U \) (\(i\in \mathcal W \)) is called a control wire; wire-fitting can be regarded as an interpolator over the set of control wires \(\mathcal W \). \(C\) is the smoothing factor of the interpolation; we choose \(C=0.001\) in the experiments. Any function approximator can be used for \(q_i(\mathbf{x})\) and \(\mathbf{u}_i(\mathbf{x})\). Regardless of the choice of approximator, the largest \(q_i(\mathbf{x})\), \(i\in \mathcal W \), is equal to \(\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})}\), and the corresponding \(\mathbf{u}_i(\mathbf{x})\) is the greedy action at \(\mathbf{x}\):

$$\begin{aligned}&\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})} = \max _{i\in \mathcal W } (q_i(\mathbf{x})), \end{aligned}$$
(37)
$$\begin{aligned}&\arg \,\max _{\mathbf{u}}{Q(\mathbf{x},\mathbf{u})} = \mathbf{u}_i(\mathbf{x})\Big |_{i=\arg \,\max _{i^{\prime }\in \mathcal W }(q_{i^{\prime }}(\mathbf{x}))}. \end{aligned}$$
(38)

Namely, the greedy action at state \(\mathbf{x}\) is calculated only by evaluating \(q_{i}(\mathbf{x})\) for \(i\in \mathcal W \).
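To make the interpolation concrete, the following Python sketch evaluates Eqs. 35–38 for an arbitrary set of control wires. The function names and the toy wires are ours (not part of the paper), and a small \(\epsilon\) stands in for the limit \(\epsilon \rightarrow 0^+\).

```python
import numpy as np

def wire_fitting_q(x, u, wires, C=0.001, eps=1e-8):
    """Interpolated action value Q(x, u) of Eqs. 35-36.

    `wires` is a list of (q_i, u_i) pairs, where q_i(x) returns a scalar
    and u_i(x) returns an action vector."""
    q = np.array([q_i(x) for q_i, _ in wires])
    d = np.array([np.sum((u - u_i(x)) ** 2) for _, u_i in wires]) + C * (q.max() - q)
    w = 1.0 / (d + eps)                        # (d_i + eps)^-1
    return float(np.dot(w, q) / np.sum(w))

def greedy_action(x, wires):
    """Greedy action of Eqs. 37-38: the wire with the largest q_i(x)."""
    q = [q_i(x) for q_i, _ in wires]
    i_star = int(np.argmax(q))
    return wires[i_star][1](x), q[i_star]

# Two toy control wires with constant actions (as in WF-DCOB's u_i(x) = U_i):
wires = [(lambda x: 1.0 - x[0]**2, lambda x: np.array([-0.5])),
         (lambda x: 0.5 * x[0],    lambda x: np.array([ 0.5]))]
x = np.array([0.3])
print(wire_fitting_q(x, np.array([0.2]), wires))   # Q(x, u)
print(greedy_action(x, wires))                     # (greedy u, max_u Q(x, u))
```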

We use NGnet for \(q_{i}(\mathbf{x})\) and a constant vector for \(\mathbf{u}_{i}(\mathbf{x})\), that is, we let \(q_i(\mathbf{x})= {\mathbf{\theta }}_i^\top {\mathbf{\phi }}(\mathbf{x})\) and \(\mathbf{u}_i(\mathbf{x})= \mathbf{U}_i\), where \({\mathbf{\phi }}(\mathbf{x})\) is the output vector of the NGnet. The parameter vector \({\mathbf{\theta }}\) is defined as \({\mathbf{\theta }}^\top = ({\mathbf{\theta }}_1^\top , \mathbf{U}_1^\top , {\mathbf{\theta }}_2^\top , \mathbf{U}_2^\top , \dots {}, {\mathbf{\theta }}_{|\mathcal W |}^\top , \mathbf{U}_{|\mathcal W |}^\top ) \), and the gradient \(\mathbf{\nabla }_{\mathbf{\theta }} Q(\mathbf{x},\mathbf{u})\) can be calculated analytically.
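For reference, the normalized-Gaussian-network output vector \({\mathbf{\phi }}(\mathbf{x})\) used above can be computed as in the following sketch; this is our own minimal version with explicit covariance matrices, whereas in the paper the BFs are obtained by the EM algorithm.

```python
import numpy as np

def ngnet_features(x, centers, covs):
    """phi_k(x) = G_k(x) / sum_j G_j(x) for Gaussians G_k(mean mu_k, cov Sigma_k).
    The common factor (2*pi)^(-D/2) cancels in the normalization."""
    g = np.array([np.exp(-0.5 * (x - mu) @ np.linalg.solve(S, x - mu))
                  / np.sqrt(np.linalg.det(S))
                  for mu, S in zip(centers, covs)])
    return g / g.sum()

# One control wire of WF-DCOB is then q_i(x) = theta_i . phi(x), u_i(x) = U_i (constant).
```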

Figure 18 shows an example of wire-fitting where both \(\mathbf{x}\in [-1,1]\) and \(\mathbf{u}\in [-1,1]\) are one-dimensional. There are two control wires (dashed lines) and three basis functions (dotted lines). The BFs (NGnet) are located at \(\mathbf{x}=(-1),(0),(1)\), and the wire-fitting parameters are \({\mathbf{\theta }}_1=(0.0, 0.6, 0.0)^\top \), \(\mathbf{U}_1=(-0.5)\), \({\mathbf{\theta }}_2=(0.0, 0.3, 0.6)^\top \), \(\mathbf{U}_2=(0.5)\). The control wires are plotted as \((\mathbf{x}, \mathbf{u}_{1}(\mathbf{x}), q_{1}(\mathbf{x}))\) and \((\mathbf{x}, \mathbf{u}_{2}(\mathbf{x}), q_{2}(\mathbf{x}))\), respectively. Each \(\times \)-mark is placed at \((\mathbf{x}, \mathbf{u}_{i^\star }(\mathbf{x}), q_{i^\star }(\mathbf{x}))\big |_{i^\star =\arg \,\max _{i}q_{i}(\mathbf{x})}\), indicating the greedy action at \(\mathbf{x}\).

Fig. 18 Example of wire-fitting

Appendix B Calculations of BFTrans

B.1 Generating trajectory

The reference trajectory \(\mathbf{q}^\mathrm{{D}}(t_n+t_a),\>{}{} t_a\in [0,T_{\mathrm{F}}]\) is designed so that the state changes from the starting state \(\mathbf{x}_n=\mathbf{x}(t_n)\) to the target \(\mathbf{q}^\mathrm{trg}\) in the time interval \(T_{\mathrm{F}}\). We represent the trajectory with a cubic function,

$$\begin{aligned} {\mathbf{q}}^{\mathrm{D}}(t_n+t_a)= \mathbf{c}_0+ \mathbf{c}_1 t_a+ \mathbf{c}_2 t_a^2+ \mathbf{c}_3 t_a^3, \end{aligned}$$
(39)

where \(\mathbf{c}_{0,\dots {},3}\) are the coefficient vectors. These coefficients are determined by the boundary conditions,

$$\begin{aligned}&\mathbf{q}^\mathrm{{D}}(t_n)=\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n) ,\quad \mathbf{q}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{q}^\mathrm{trg}, \nonumber \\&\dot{\mathbf{q}}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{0},\quad {\ddot{\mathbf{q}}}^\mathrm{{D}}(t_n+T_{\mathrm{F}})=\mathbf{0}, \end{aligned}$$
(40)

where \(\mathbf{0}\) denotes a zero vector.
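The coefficients follow directly from these four boundary conditions; a minimal Python sketch is given below (helper names are ours, and \(\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n)\) is assumed to be already evaluated to the joint-angle vector `q_start`).

```python
import numpy as np

def cubic_reference(q_start, q_trg, T_F):
    """Solve Eq. 40 for the coefficient vectors c0..c3 of Eq. 39:
    start at q_start and reach q_trg at t_a = T_F with zero velocity
    and zero acceleration.  Returns a (4, D) array."""
    q_start = np.atleast_1d(np.asarray(q_start, float))
    q_trg = np.atleast_1d(np.asarray(q_trg, float))
    T = float(T_F)
    A = np.array([[1.0, 0.0, 0.0,   0.0     ],   # q^D(t_n)         = q_start
                  [1.0, T,   T**2,  T**3    ],   # q^D(t_n + T_F)   = q_trg
                  [0.0, 1.0, 2.0*T, 3.0*T**2],   # dq^D(t_n + T_F)  = 0
                  [0.0, 0.0, 2.0,   6.0*T   ]])  # ddq^D(t_n + T_F) = 0
    B = np.vstack([q_start, q_trg, np.zeros_like(q_start), np.zeros_like(q_start)])
    return np.linalg.solve(A, B)

def eval_reference(c, t_a):
    """q^D(t_n + t_a) = c0 + c1*t_a + c2*t_a^2 + c3*t_a^3  (Eq. 39)."""
    return c[0] + c[1]*t_a + c[2]*t_a**2 + c[3]*t_a**3
```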

B.2 Abbreviating trajectory

The abbreviation is performed as follows: (1) estimate \(D_\mathrm{{N}}(\mathbf{x}_n)\), the distance between two neighboring BFs around the starting state \(\mathbf{x}_n\); and (2) calculate \(T_{\mathrm{N}}\) from the ratio of \(D_\mathrm{{N}}(\mathbf{x}_n)\) to the distance between \(\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n)\) and \(\mathbf{q}^\mathrm{trg}\).

To define \(D_\mathrm{{N}}(\mathbf{x}_n)\), we first calculate, for each BF \(k\), the distance \(d_\mathrm{{N}}(k)\) between its center \({\mathbf{\mu }}_{k}\) and the center of its nearest neighboring BF. Then, we estimate \(D_\mathrm{{N}}(\mathbf{x}_n)\) by interpolating \(\{d_\mathrm{{N}}(k)|k\in \mathcal{K }\}\) with the output of the BFs at \(\mathbf{x}_n\).

\(d_\mathrm{{N}}(k)\) is calculated by

$$\begin{aligned} k_\mathrm{{N}}(k)&= \arg \,\min _{k^{\prime }\in \mathcal{K }, k^{\prime }\ne k} \Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k^{\prime }}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty , \end{aligned}$$
(41)
$$\begin{aligned} d_\mathrm{{N}}(k)&= \max {}\bigl ( \Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_\mathrm{{N}}(k)}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty ,\>{}\text{ d }_{\min {}k} \bigr ), \end{aligned}$$
(42)

where \(\text{ d }_{\min {}k}\in \mathbb R \) is a positive constant that serves as a lower bound on \(d_\mathrm{{N}}(k)\) when \(\Vert \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_\mathrm{{N}}(k)}) - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\Vert _\infty \) is too small. For NGnet, we define it as \(\text{ d }_{\min {}k}= \sqrt{\lambda _{k}^\mathcal Q }\), where \(\lambda _{k}^\mathcal Q \) is the maximum eigenvalue of the covariance matrix \(\varvec{\varSigma }_k^\mathcal Q \) on the \(\mathcal Q \) space (see Note 9). Note that we can pre-compute \(\{d_\mathrm{{N}}(k)|k\in \mathcal{K }\}\) for fixed BFs.
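Since the BFs are fixed during learning, this pre-computation can be done as in the sketch below (our names; `mu_q[k]` is the projected center \(\mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k})\) and `d_min[k]` the constant \(\text{ d }_{\min {}k}\)).

```python
import numpy as np

def precompute_d_n(mu_q, d_min):
    """d_N(k) of Eqs. 41-42 for every BF k, using the maximum norm."""
    mu_q = np.asarray(mu_q, float)          # shape (K, dim(Q))
    K = len(mu_q)
    d_n = np.empty(K)
    for k in range(K):
        dists = np.max(np.abs(mu_q - mu_q[k]), axis=1)   # ||.||_inf to every other center
        dists[k] = np.inf                                # exclude k itself (Eq. 41)
        d_n[k] = max(dists.min(), d_min[k])              # Eq. 42
    return d_n
```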

Using the output of BFs \({\mathbf{\phi }}(\mathbf{x}_n)\), \(D_\mathrm{{N}}(\mathbf{x}_n)\) is estimated by

$$\begin{aligned} D_\mathrm{{N}}(\mathbf{x}_n) = (d_\mathrm{{N}}(1),\>{}d_\mathrm{{N}}(2),\ldots ,\>{}d_\mathrm{{N}}(|\mathcal{K }|))^\top {\mathbf{\phi }}(\mathbf{x}_n) \end{aligned}$$
(43)

Finally, \(T_{\mathrm{N}}\) is defined by

$$\begin{aligned} T_{\mathrm{N}}(\mathbf{x}_n,\mathbf{u}_n) = \min \Bigl (1,\frac{\text{ F }_\mathrm{abbrv} D_\mathrm{{N}}(\mathbf{x}_n)}{\Vert \mathbf{q}^\mathrm{trg}-\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n)\Vert _\infty }\Bigr ) T_{\mathrm{F}}. \end{aligned}$$
(44)
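Putting Eqs. 43 and 44 together, the abbreviated duration can be computed as in the following sketch (our names; `phi_x` is \({\mathbf{\phi }}(\mathbf{x}_n)\), `q_now` is \(\mathbf{C}_{\mathrm{{P}}}(\mathbf{x}_n)\), and `F_abbrv` is the factor \(\text{ F }_\mathrm{abbrv}\)).

```python
import numpy as np

def abbreviated_duration(phi_x, d_n, q_trg, q_now, T_F, F_abbrv):
    """T_N(x_n, u_n) of Eq. 44, with D_N(x_n) interpolated as in Eq. 43."""
    D_N = float(np.dot(d_n, phi_x))                  # Eq. 43
    dist = np.max(np.abs(np.asarray(q_trg) - np.asarray(q_now)))  # ||q_trg - C_P(x_n)||_inf
    return min(1.0, F_abbrv * D_N / dist) * T_F      # Eq. 44
```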

Appendix C Initialization and constraints of WF-DCOB

C.1 Initializing wire-fitting parameters

For a control wire \(i\in \mathcal W \), we use \(a_{i}^{\mathrm{dcob}}\) to denote the corresponding action in DCOB: \(a_{i}^{\mathrm{dcob}} = (g_{i}^{\mathrm{dcob}}, k_{i}^{\mathrm{dcob}})\). Let \((\text{ g }^\mathrm{{S}}_i, \text{ g }^\mathrm{{E}}_i)\) denote the range of the interval factor which includes \(g_{i}^{\mathrm{dcob}}\). For each control wire \(i\in \mathcal W \), its parameter is defined as \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\) and is initialized by

$$\begin{aligned}&g_i \leftarrow \frac{\text{ g }^\mathrm{{S}}_i + \text{ g }^\mathrm{{E}}_i}{2}, \end{aligned}$$
(45a)
$$\begin{aligned}&\mathbf{q}^\mathrm{trg}_i \leftarrow \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}}). \end{aligned}$$
(45b)

The other parameters of the control wires, \(\{{\mathbf{\theta }}_i | i\in \mathcal W \}\), are initialized to zero since, in the learning-from-scratch case, we have no prior knowledge of the action values.
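In code, the initialization of Eqs. 45a and 45b for one control wire might look like the following sketch (names are ours; `g_range` is \((\text{ g }^\mathrm{{S}}_i, \text{ g }^\mathrm{{E}}_i)\) and `mu_q_k` is \(\mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}})\)).

```python
import numpy as np

def init_wire(g_range, mu_q_k, n_bfs):
    """Initial parameters of one WF-DCOB control wire."""
    g_s, g_e = g_range
    g_i = 0.5 * (g_s + g_e)               # Eq. 45a: middle of the interval-factor range
    q_trg_i = np.asarray(mu_q_k, float)   # Eq. 45b: projected center of the DCOB action's BF
    theta_i = np.zeros(n_bfs)             # no prior knowledge of the action values
    return g_i, q_trg_i, theta_i
```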

C.2 Constraints on wire-fitting parameters

For \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\), the interval factor \(g_i\) is constrained inside \((\text{ g }^\mathrm{{S}}_i, \text{ g }^\mathrm{{E}}_i)\), and the target point \(\mathbf{q}^\mathrm{trg}_i\) is constrained inside a hypersphere of radius \(d_\mathrm{{N}}(k_{i}^{\mathrm{dcob}})\) centered at \(\mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}})\). Here, \(d_\mathrm{{N}}(k_{i}^{\mathrm{dcob}})\) denotes the distance to the nearest BF from \(k_{i}^{\mathrm{dcob}}\) defined by Eq. 42. Specifically, the parameter \(\mathbf{U}_i=(g_i,\mathbf{q}^\mathrm{trg}_i)\) of each control wire \(i\in \mathcal W \) is constrained by

$$\begin{aligned} g_i&\leftarrow \min \bigl (\max (g_i,\>{}\text{ g }^\mathrm{{S}}_i),\>{}\text{ g }^\mathrm{{E}}_i\bigr ), \nonumber \\ \mathbf{q}^\mathrm{trg}_i&\leftarrow \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}}) + \mathbf{diff}\>{}\min \Bigl (1,\>{}\frac{d_\mathrm{{N}}(k_{i}^{\mathrm{dcob}})}{\Vert \mathbf{diff}\Vert }\Bigr ), \end{aligned}$$
(46)

where

$$\begin{aligned} \mathbf{diff} \triangleq \mathbf{q}^\mathrm{trg}_i - \mathbf{C}_{\mathrm{{P}}}({\mathbf{\mu }}_{k_{i}^{\mathrm{dcob}}}). \end{aligned}$$
(47)

These constraints are applied after each update of an RL algorithm.
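A direct way to realize this projection in code is sketched below (our own helper, written from the constraint description above; variable names are not from the paper).

```python
import numpy as np

def constrain_wire(g_i, q_trg_i, g_range, mu_q_k, d_n_k):
    """Clip g_i to its interval-factor range and pull q_i^trg back into the
    hypersphere of radius d_N(k_i^dcob) centered at C_P(mu_{k_i^dcob})."""
    g_s, g_e = g_range
    g_i = min(max(g_i, g_s), g_e)
    diff = np.asarray(q_trg_i, float) - np.asarray(mu_q_k, float)   # Eq. 47
    norm = np.linalg.norm(diff)
    if norm > d_n_k:
        q_trg_i = mu_q_k + diff * (d_n_k / norm)
    return g_i, q_trg_i
```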


Cite this article

Yamaguchi, A., Takamatsu, J. & Ogasawara, T. DCOB: Action space for reinforcement learning of high DoF robots. Auton Robot 34, 327–346 (2013). https://doi.org/10.1007/s10514-013-9328-1
