Given a policy $ \pi $ parameterized by $ \theta $ having performance $ \rho $, policy gradient methods perform iterative updates of the parameters by:

TODO:

- Sutton et al. (1999)
^{[1]}: use a value-approximation-function to reduce variance

## Direct gradient-based methods Edit

(The name comes from Baxter & Barlett (2000).^{[2]})

These methods estimate the gradient of expected reward without the help of value function, action-value function, TD, etc. The first representative is REINFORCE (Williams, 1992)^{[3]} which "learns much more slowly than RL methods using value functions and has received relatively little attention" (Sutton et al., 1999)^{[1]}.

TODO:

- Infinite horizon: Bartlett & Baxter (2002)
^{[4]} - VAPS (value and policy search) (Baird & Moore, 1999)
^{[5]}: generalize policy search and value-based algorithms, guaranteed to converge, work with episodic tasks. - Cao & Wan (1998)
^{[6]}

## Actor-critic Edit

Konda & Tsitsiklis (1999)^{[7]}

## Parameter space exploring Edit

Rückstieß et al. (2010)^{[8]}
reviews some techniques that perturb parameters instead of actions. The main idea was to replace parameters by a distribution (says, Gausian) of parameters. Different from REINFORCE, the policy is deterministic given parameters. The paper report experiments in which parameter-based exploration is better than action-based. However, they only consider continuous action cases. I think the approach is unlikely to work for discrete action cases because a change if parameters may make no difference in the trajectory or changing the trajectory in an unpredictable way.

## References Edit

- ↑
^{1.0}^{1.1}Sutton, R. S., Mcallester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, 1057–1063. doi:10.1.1.37.9714 - ↑ Baxter, J., & Bartlett, P. (2000). Direct Gradient-Based Reinforcement Learning: I. Gradient Ascent Algorithms. In Proceedings of the International Symposium on Circuits and Systems (pp. 271–274).
- ↑ Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.
- ↑ Bartlett, P. L., & Baxter, J. (2002). Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64(1), 133–150. doi:10.1006/jcss.2001.1793
- ↑ Baird, L., & Moore, A. (1999). Gradient Descent for General Reinforcement Learning. In Advances in Neural Information Processing Systems 11.
- ↑ Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. Control Systems Technology, IEEE Transactions on, 6(4), 482-494.
- ↑ Konda, V. R., & Tsitsiklis, J. N. (1999, November). Actor-Critic Algorithms. In NIPS (Vol. 13, pp. 1008-1014).
- ↑ Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., & Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1), 14–24. doi:10.2478/s13230-010-0002-4