Policy gradient

Given a policy $\pi$ parameterized by $\theta$ having performance $\rho$ , policy gradient methods perform iterative updates of the parameters by:

\Delta \theta \propto \frac{\partial \rho}{\partial \theta}

TODO:

Sutton et al. (1999)^[1]: use a value-approximation-function to reduce variance

Direct gradient-based methods[]

(The name comes from Baxter & Barlett (2000).^[2])

These methods estimate the gradient of expected reward without the help of value function, action-value function, TD, etc. The first representative is REINFORCE (Williams, 1992)^[3] which "learns much more slowly than RL methods using value functions and has received relatively little attention" (Sutton et al., 1999)^[1].

TODO:

Infinite horizon: Bartlett & Baxter (2002)^[4]
VAPS (value and policy search) (Baird & Moore, 1999)^[5]: generalize policy search and value-based algorithms, guaranteed to converge, work with episodic tasks.
Cao & Wan (1998)^[6]

Actor-critic[]

Konda & Tsitsiklis (1999)^[7]

Parameter space exploring[]

Rückstieß et al. (2010)^[8] reviews some techniques that perturb parameters instead of actions. The main idea was to replace parameters by a distribution (says, Gausian) of parameters. Different from REINFORCE, the policy is deterministic given parameters. The paper report experiments in which parameter-based exploration is better than action-based. However, they only consider continuous action cases. I think the approach is unlikely to work for discrete action cases because a change if parameters may make no difference in the trajectory or changing the trajectory in an unpredictable way.

References[]

↑ ^1.0 ^1.1 Sutton, R. S., Mcallester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, 1057–1063. doi:10.1.1.37.9714
↑ Baxter, J., & Bartlett, P. (2000). Direct Gradient-Based Reinforcement Learning: I. Gradient Ascent Algorithms. In Proceedings of the International Symposium on Circuits and Systems (pp. 271–274).
↑ Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.
↑ Bartlett, P. L., & Baxter, J. (2002). Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64(1), 133–150. doi:10.1006/jcss.2001.1793
↑ Baird, L., & Moore, A. (1999). Gradient Descent for General Reinforcement Learning. In Advances in Neural Information Processing Systems 11.
↑ Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. Control Systems Technology, IEEE Transactions on, 6(4), 482-494.
↑ Konda, V. R., & Tsitsiklis, J. N. (1999, November). Actor-Critic Algorithms. In NIPS (Vol. 13, pp. 1008-1014).
↑ Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., & Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1), 14–24. doi:10.2478/s13230-010-0002-4

[sutton99-1] 1.0 ^1.1 Sutton, R. S., Mcallester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, 1057–1063. doi:10.1.1.37.9714

[2] Baxter, J., & Bartlett, P. (2000). Direct Gradient-Based Reinforcement Learning: I. Gradient Ascent Algorithms. In Proceedings of the International Symposium on Circuits and Systems (pp. 271–274).

[3] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.

[4] Bartlett, P. L., & Baxter, J. (2002). Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64(1), 133–150. doi:10.1006/jcss.2001.1793

[5] Baird, L., & Moore, A. (1999). Gradient Descent for General Reinforcement Learning. In Advances in Neural Information Processing Systems 11.

[6] Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. Control Systems Technology, IEEE Transactions on, 6(4), 482-494.

[7] Konda, V. R., & Tsitsiklis, J. N. (1999, November). Actor-Critic Algorithms. In NIPS (Vol. 13, pp. 1008-1014).

[8] Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., & Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1), 14–24. doi:10.2478/s13230-010-0002-4

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

Policy gradient

Contents

Direct gradient-based methods[]

Actor-critic[]

Parameter space exploring[]

References[]

Fan Feed