Natural Language Understanding Wiki

Given a policy parameterized by having performance , policy gradient methods perform iterative updates of the parameters by:


  • Sutton et al. (1999)[1]: use a value-approximation-function to reduce variance

Direct gradient-based methods[]

(The name comes from Baxter & Barlett (2000).[2])

These methods estimate the gradient of expected reward without the help of value function, action-value function, TD, etc. The first representative is REINFORCE (Williams, 1992)[3] which "learns much more slowly than RL methods using value functions and has received relatively little attention" (Sutton et al., 1999)[1].


  • Infinite horizon: Bartlett & Baxter (2002)[4]
  • VAPS (value and policy search) (Baird & Moore, 1999)[5]: generalize policy search and value-based algorithms, guaranteed to converge, work with episodic tasks.
  • Cao & Wan (1998)[6]


Konda & Tsitsiklis (1999)[7]

Parameter space exploring[]

Rückstieß et al. (2010)[8] reviews some techniques that perturb parameters instead of actions. The main idea was to replace parameters by a distribution (says, Gausian) of parameters. Different from REINFORCE, the policy is deterministic given parameters. The paper report experiments in which parameter-based exploration is better than action-based. However, they only consider continuous action cases. I think the approach is unlikely to work for discrete action cases because a change if parameters may make no difference in the trajectory or changing the trajectory in an unpredictable way.


  1. 1.0 1.1 Sutton, R. S., Mcallester, D., Singh, S., & Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, 1057–1063. doi:
  2. Baxter, J., & Bartlett, P. (2000). Direct Gradient-Based Reinforcement Learning: I. Gradient Ascent Algorithms. In Proceedings of the International Symposium on Circuits and Systems (pp. 271–274).
  3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8:229–256.
  4. Bartlett, P. L., & Baxter, J. (2002). Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning. Journal of Computer and System Sciences, 64(1), 133–150. doi:10.1006/jcss.2001.1793
  5. Baird, L., & Moore, A. (1999). Gradient Descent for General Reinforcement Learning. In Advances in Neural Information Processing Systems 11.
  6. Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. Control Systems Technology, IEEE Transactions on, 6(4), 482-494.
  7. Konda, V. R., & Tsitsiklis, J. N. (1999, November). Actor-Critic Algorithms. In NIPS (Vol. 13, pp. 1008-1014).
  8. Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., & Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1), 14–24. doi:10.2478/s13230-010-0002-4