- Deterministic vs. randomized policy
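
The distinction can be illustrated with a minimal sketch. The state name `s0`, the two actions, and the `PREFERENCES` table are hypothetical and chosen only for illustration: a deterministic policy maps a state to a single action, while a randomized policy samples an action from a distribution over actions.

```python
import random

# Hypothetical toy setup: one state, two actions, and a preference
# (probability) for each action. None of these names come from a library.
PREFERENCES = {"s0": {"left": 0.2, "right": 0.8}}


def deterministic_policy(state):
    """Always return the single highest-preference action for the state."""
    prefs = PREFERENCES[state]
    return max(prefs, key=prefs.get)


def randomized_policy(state, rng):
    """Sample an action with probability proportional to its preference."""
    prefs = PREFERENCES[state]
    actions = list(prefs)
    weights = [prefs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    print(deterministic_policy("s0"))
    print([randomized_policy("s0", rng) for _ in range(5)])
```

Calling `deterministic_policy` repeatedly on the same state always gives the same action, whereas `randomized_policy` may return either action from one call to the next.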

- Optimal Learning Trajectories: Some algorithms assume that the Optimal Learning Trajectories (OLTs) are known for all learning examples. An OLT is a sequence of actions that, given an input, leads from the initial state to the correct output.
- Optimal Learning Policy: Some algorithms assume that, for each learning example, we know an Optimal Learning Policy (OLP). The OLP is a procedure that returns the best decision to take in any state of the prediction space.
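
To make the difference between the two assumptions concrete, here is a minimal sketch on a toy sequence-labeling task. Everything in it is an assumption made for illustration: a state is the tuple of labels predicted so far, an action appends one label, and the loss is the number of wrong labels (Hamming loss), so the best decision in any state is the gold label for the next position. The function names are hypothetical.

```python
def optimal_trajectory(gold_output):
    """OLT: the action sequence that builds the correct output from the
    initial (empty) state. It only covers states on the correct path."""
    return list(gold_output)


def optimal_policy(state, gold_output):
    """OLP: the best next action from ANY state, including states reached
    after earlier mistakes. Under the Hamming-loss assumption this is the
    gold label at the next position."""
    return gold_output[len(state)]


def rollout(actions):
    """Apply a sequence of actions starting from the empty initial state."""
    state = ()
    for action in actions:
        state = state + (action,)
    return state


if __name__ == "__main__":
    gold = ("DET", "NOUN", "VERB")

    # Following the OLT reproduces the correct output.
    print(rollout(optimal_trajectory(gold)))

    # The OLP is still defined after a wrong second label, a state the
    # OLT never visits.
    off_path_state = ("DET", "VERB")
    print(optimal_policy(off_path_state, gold))
```

The OLT says nothing about what to do once the learner has left the correct path, which is why knowing an OLP is the stronger assumption.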

In many tasks, there is **not enough training data** for supervised learning to succeed. See also Yoshua Bengio's argument for unsupervised learning^{[note 1]}.

**Anthropomorphic argument** (albeit a weak one): children learn from a small amount of "labeled" data. Humans of all ages learn by trial and error, by simulating their environment, and so on.

- TensorForce: deep integration with TensorFlow, good amount of documentation
  - GitHub repo: https://github.com/reinforceio/tensorforce
  - Blog: https://reinforce.io/blog/

- Keras-rl: simplistic, comes with some documentation
- Google's Dopamine: research-oriented framework
- Facebook's Horizon: production-ready framework based on PyTorch
- Gorilla
- RL-Glue: old framework, unlikely to scale
  - Reference: Tanner and White (2009)^{[1]}
  - Website: https://sites.google.com/a/rl-community.org/rl-glue/Home?authuser=0

- A list of many frameworks is here

- ↑ Note that Bengio means non-standard unsupervised learning, in which an agent can also interact with its environment.

