Given a policy $ \pi $ parameterized by $ \theta $ with performance measure $ \rho $, policy gradient methods iteratively update the parameters by $ \theta \leftarrow \theta + \alpha \frac{\partial \rho}{\partial \theta} $, where $ \alpha $ is a step size.
- Sutton et al. (1999): use an approximate value function in the gradient estimate to reduce variance
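For reference, the policy gradient theorem of Sutton et al. (1999) gives an exact expression for this gradient (shown here for the average-reward formulation, where $ d^\pi $ is the stationary state distribution under $ \pi $):

$$ \frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta} \, Q^\pi(s,a) $$

The paper's central result is that $ Q^\pi $ can be replaced by a learned "compatible" function approximator without biasing this gradient.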
Direct gradient-based methods
(The name comes from Baxter & Bartlett (2000).)
These methods estimate the gradient of the expected reward without the help of a value function, action-value function, TD learning, etc. The first representative is REINFORCE (Williams, 1992), which "learns much more slowly than RL methods using value functions and has received relatively little attention" (Sutton et al., 1999).
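As a concrete illustration, here is a minimal REINFORCE sketch on a hypothetical two-armed bandit (the environment, step size, and baseline schedule are illustrative choices, not taken from the cited papers). The policy is a softmax over logits $ \theta $, and each update follows $ \Delta\theta = \alpha (r - b) \nabla_\theta \log \pi(a) $:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 0 pays 1 with prob. 0.8, arm 1 with prob. 0.2.
PAYOFF = np.array([0.8, 0.2])

theta = np.zeros(2)   # one logit per action
alpha = 0.1           # step size
baseline = 0.0        # running-average baseline to reduce variance

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < PAYOFF[a])   # stochastic 0/1 reward
    # For a softmax policy, grad of log pi(a) w.r.t. theta is one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi   # REINFORCE update
    baseline += 0.05 * (r - baseline)               # track average reward

print("learned action probabilities:", softmax(theta))  # should favour arm 0
```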
- Infinite horizon: Bartlett & Baxter (2002)
- VAPS (value and policy search) (Baird & Moore, 1999): generalizes policy search and value-based algorithms; guaranteed to converge; works with episodic tasks.
- Cao & Wan (1998)
- Konda & Tsitsiklis (1999)
Parameter-space exploration
Rückstieß et al. (2010) review techniques that perturb parameters instead of actions. The main idea is to replace the point estimate of the parameters with a distribution (say, a Gaussian) over parameters; unlike REINFORCE, the policy itself is deterministic given the parameters, so all exploration comes from sampling the parameters. The paper reports experiments in which parameter-based exploration beats action-based exploration. However, they only consider continuous-action cases. I think the approach is unlikely to work for discrete-action cases, because a change in parameters may make no difference to the trajectory, or may change it in an unpredictable way.
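As a rough sketch of the idea (a PGPE-style scheme in the spirit of the methods the paper reviews; the toy fitness function below stands in for an episodic return, and all constants are illustrative assumptions): sample the parameters from a Gaussian, run the deterministic policy for a whole episode, and move the Gaussian's mean with a likelihood-ratio gradient step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an episode return: pretend a deterministic policy with
# parameters theta scores higher the closer theta is to TARGET.
TARGET = np.array([2.0, -1.0])

def episode_return(theta):
    return -np.sum((theta - TARGET) ** 2)

mu = np.zeros(2)    # mean of the Gaussian over parameters
sigma = 0.5         # fixed exploration std dev (the full method also adapts this)
alpha = 0.05
baseline = episode_return(mu)   # moving-average baseline

for it in range(1000):
    theta = mu + sigma * rng.standard_normal(2)  # perturb parameters, not actions
    R = episode_return(theta)                    # one rollout with theta held fixed
    # Likelihood-ratio gradient of E[R] w.r.t. mu for a Gaussian search distribution.
    mu += alpha * (R - baseline) * (theta - mu) / sigma**2
    baseline += 0.05 * (R - baseline)

print("mu ->", mu)  # should end up close to TARGET
```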
- ↑ Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12 (pp. 1057–1063).
- ↑ Baxter, J., & Bartlett, P. (2000). Direct Gradient-Based Reinforcement Learning: I. Gradient Ascent Algorithms. In Proceedings of the International Symposium on Circuits and Systems (pp. 271–274).
- ↑ Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.
- ↑ Bartlett, P. L., & Baxter, J. (2002). Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64(1), 133–150. doi:10.1006/jcss.2001.1793
- ↑ Baird, L., & Moore, A. (1999). Gradient Descent for General Reinforcement Learning. In Advances in Neural Information Processing Systems 11.
- ↑ Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. IEEE Transactions on Control Systems Technology, 6(4), 482–494.
- ↑ Konda, V. R., & Tsitsiklis, J. N. (1999). Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12 (pp. 1008–1014).
- ↑ Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., & Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1), 14–24. doi:10.2478/s13230-010-0002-4