TODO: Langford & Zadrozny (2004)^{[1]},
Blatt & Hero (2006)^{[2]},
Lagoudakis & Parr (2003)^{[3]},
Farahmand et al. (2014)^{[4]},
Joseph et al. (2014)^{[5]}.

Kakade & Langford (2002)^{[6]}: reduce RL to regression in the presence of a "muy-reset".

Bagnell et al. (2003)^{[7]},
Fern et al. (2003)^{[8]}.

From Farahmand et al.:

"... classification-based RL algorithms, e.g., Lagoudakis and Parr [44], Fern et al. [32], Li et al. [47], Lazaric et al. [45]. These methods use Monte Carlo trajectories to roughly estimate the action-value function of the current policy (i.e., the value of choosing a particular action at the current state and then following the policy) at several states. This approach is called a rollout-based estimate by Tesauro and Galperin [61] and is closely related, but not equivalent, to the rollout algorithms of Bertsekas [10]. In these methods, the rollout estimates at several points in the state space define a set of (noisy) greedy actions (positive examples) as well as non-greedy actions (negative examples), which are then fed to a classifier. The classifier “generalizes” the greedy action choices over the entire state space. The procedure is repeated.

Classification-based methods can be interpreted as variants of Approximate Policy Iteration (API) that use rollouts to estimate the action-value function (policy evaluation step) and then project the greedy policy obtained at those points onto the predefined space of controllers (policy improvement step).

In many problems, this approach is helpful for three main reasons. First, good policies are sometimes simpler to represent and learn than good value functions. Second, even a rough estimate of the value function is often sufficient to separate the best action from the rest, especially when the gap between the value of the greedy actions and the rest is large. And finally, even if the best action estimates are noisy (due to value function imprecision), one can take advantage of powerful classification methods to smooth out the noise."

TODO: global vs local guarantee on optimality? (Langford & Zadrozny, 2003)

## Reductions by Langford & Zadrozny (2003) Edit

Langford & Zadrozny proposed 2 types of reduction, both rely on importance-weighted classification. Departing from normal conceptualization, this type of classification assign a weight *w* to each example (*x*, *y*) to indicate the relative importance of getting it right. The problem can be converted into classification via subsampling though being "slightly tricky"^{[9]}.

Given a T-step reinforcement learning problem, the policy is constructed from *T* classifiers. The authors propose to generate weighted classification examples based on the agent's experience, train the *T* classifiers on those examples to arrive at a better policy.

### RLGenn Edit

Assume that we can fully manipulate the environment, RLGen works by generating the full trajectory tree (read: every single trajectories reachable from a starting state), identify the optimal trajectory, and use its actions as training examples. Those examples are weighted proportionally to how much one might lose classifying it wrongly. The authors showed proof that the learned policy is not worse than the optimal policy by more than $ \frac{(T+1)}{2}\epsilon $ where $ \epsilon $ is an upper bound for error rate of each classifier (each action is rewarded either 0 or 1).

Obviously, this reduction is neither effective nor efficient. For long sequences of actions, the lower bound for episode reward decreases quickly to meaningless values. At the same time, the number of trajectories to inspect increase exponentially to practical infinity.

### RLTrace Edit

I don't really undearstand this reduction. Classifiers were explicitly defined as deterministic functions $ c_t: X \rightarrow Y $. It necessarily follows that all policies based on them are deterministic. However the authors somehow can work on the probability of taking action *y* at state *x*.

Set aside impenetrable formulae, their algorithm looks like a batched version of REINFORCE although supposedly also works with indifferentiable classifiers.

## Classification-based Approximate Policy Iteration (CAPI) Edit

A good deterministic policy should always choose actions with higher action-value (*Q*). CAPI strives to improve this characteristic in an action-gap sensitive way. Like other policy iteration methods, it runs these two steps repeatedly: policy evaluation (compute estimated action-value function $ \hat{Q} $) and policy improvement (change policy to better align with $ \hat{Q} $). Action-gap quantify the distance between the best action (in terms of *Q*) and the rest. A large gap means the price of a mistake is high but it is also less likely while a small gap indicate a mistake is likely but also doesn't matter much. Training examples are weighted proportionately to action-gap.

## Learning to search Edit

Learning to search also makes use of classifiers internally. The research community of learning to search seems to different from that of the above methods and they rarely cross-cite.

## References Edit

- ↑ Langford, J., & Zadrozny, B. (2004). Reducing T-step Reinforcement learning to classification.
- ↑ Blatt, D., & Hero, A. O. (2006). From weighted classification to policy search. Advances in Neural Information Processing Systems 18, 18, 139–146.
- ↑ Lagoudakis, M. G., & Parr, R. (2003). Reinforcement Learning as Classification: Leveraging Modern Classifiers. In in Proceedings of the Twentieth International Conference on Machine Learning (pp. 424–431).
- ↑ Farahmand, A. M., Precup, D., da Motta Salles Barreto, A., & Ghavamzadeh, M. (2014). Classification-based Approximate Policy Iteration: Experiments and Extended Discussions. CoRR, abs/1407.0, 1–17.
- ↑ Joseph, J., Velez, J., & Roy, N. (2014). Structural Return Maximization for Reinforcement Learning, 1–18. Retrieved from http://arxiv.org/abs/1405.2606
- ↑ Kakade, S., & Langford, J. (2002, July). Approximately optimal approximate reinforcement learning. In ICML (Vol. 2, pp. 267-274).
- ↑ Bagnell, J. Andrew, et al. "Policy search by dynamic programming." Advances in neural information processing systems. 2003.
- ↑ Fern, A., Yoon, S. W., & Givan, R. (2003, December). Approximate Policy Iteration with a Policy Language Bias. In NIPS (pp. 847-854).
- ↑ Zadrozny, B., Langford, J., & Abe, N. (2003, November). Cost-sensitive learning by cost-proportionate example weighting. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on (pp. 435-442). IEEE.