TODO: cover Langford & Zadrozny (2004)[1], Blatt & Hero (2006)[2], Lagoudakis & Parr (2003)[3], Farahmand et al. (2014)[4], and Joseph et al. (2014)[5].

From Farahmand et al.:

"... classification-based RL algorithms, e.g., Lagoudakis and Parr [44], Fern et al. [32], Li et al. [47], Lazaric et al. [45]. These methods use Monte Carlo trajectories to roughly estimate the action-value function of the current policy (i.e., the value of choosing a particular action at the current state and then following the policy) at several states. This approach is called a rollout-based estimate by Tesauro and Galperin [61] and is closely related, but not equivalent, to the rollout algorithms of Bertsekas [10]. In these methods, the rollout estimates at several points in the state space define a set of (noisy) greedy actions (positive examples) as well as non-greedy actions (negative examples), which are then fed to a classifier. The classifier “generalizes” the greedy action choices over the entire state space. The procedure is repeated.

Classification-based methods can be interpreted as variants of Approximate Policy Iteration (API) that use rollouts to estimate the action-value function (policy evaluation step) and then project the greedy policy obtained at those points onto the predefined space of controllers (policy improvement step).

In many problems, this approach is helpful for three main reasons. First, good policies are sometimes simpler to represent and learn than good value functions. Second, even a rough estimate of the value function is often sufficient to separate the best action from the rest, especially when the gap between the value of the greedy actions and the rest is large. And finally, even if the best action estimates are noisy (due to value function imprecision), one can take advantage of powerful classification methods to smooth out the noise."
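
To make the loop in the quoted passage concrete, here is a minimal Python sketch of rollout-based classification policy iteration on a toy chain MDP: estimate Q of the current policy at sampled states by Monte Carlo rollouts, label each state with its (noisy) greedy action, fit a classifier to those labels, and repeat. The environment, the classifier choice, and every hyperparameter below are illustrative assumptions, not details taken from any of the cited papers.

 # Rollout-based classification policy iteration (illustrative sketch).
 # Toy chain MDP: states 0..N-1, action 1 moves right, action 0 moves left;
 # reward 1 only for landing on the rightmost state.
 import numpy as np
 from sklearn.tree import DecisionTreeClassifier
 
 N_STATES, ACTIONS, GAMMA, HORIZON = 20, (0, 1), 0.95, 30
 
 def step(s, a):
     s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
     return s2, float(s2 == N_STATES - 1)
 
 def rollout_q(s, a, policy, n_rollouts=5):
     """Monte Carlo estimate of Q^pi(s, a): take a once, then follow pi."""
     total = 0.0
     for _ in range(n_rollouts):
         s2, r = step(s, a)
         g, disc = r, GAMMA
         for _ in range(HORIZON):
             s2, r = step(s2, policy(s2))
             g += disc * r
             disc *= GAMMA
         total += g
     return total / n_rollouts
 
 policy = lambda s: int(np.random.choice(ACTIONS))    # initial random policy
 
 for _ in range(5):                                   # the API loop
     X, y = [], []
     for s in range(N_STATES):                        # states we roll out from
         q = [rollout_q(s, a, policy) for a in ACTIONS]
         X.append([s])
         y.append(int(np.argmax(q)))                  # noisy greedy label
     clf = DecisionTreeClassifier(max_depth=3).fit(np.array(X), np.array(y))
     # Policy improvement: the classifier *is* the next policy, i.e., the
     # greedy choices are projected onto depth-3 decision trees over states.
     policy = lambda s, clf=clf: int(clf.predict([[s]])[0])
 
 print([policy(s) for s in range(N_STATES)])          # expect mostly 1s

Note how the two API steps of the quoted passage appear explicitly: rollout_q is the rollout-based policy evaluation, and fitting clf to the greedy labels is the projection of the greedy policy onto a predefined space of controllers (here, shallow decision trees).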

References

  1. Langford, J., & Zadrozny, B. (2004). Reducing T-step reinforcement learning to classification.
  2. Blatt, D., & Hero, A. O. (2006). From weighted classification to policy search. Advances in Neural Information Processing Systems 18, 139–146.
  3. Lagoudakis, M. G., & Parr, R. (2003). Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning (pp. 424–431).
  4. Farahmand, A. M., Precup, D., da Motta Salles Barreto, A., & Ghavamzadeh, M. (2014). Classification-based approximate policy iteration: Experiments and extended discussions. CoRR, abs/1407.0, 1–17.
  5. Joseph, J., Velez, J., & Roy, N. (2014). Structural return maximization for reinforcement learning, 1–18. Retrieved from http://arxiv.org/abs/1405.2606