Alternative/related terms: bandit structured prediction, imitation learning, learning to search, response-based/response-driven learning, learning over constrained latent representations (LCLR).

TODO: Chang et al. (2010)[1], Chen et al. (2017), Fang et al. (2017)[2], Gu et al. (2017)[3].

Motivation

Cope with the violated i.i.d. assumption of machine learning:

We face a sequential prediction problem where future observations (visited states) depend on previous actions. This is challenging because it violates the common i.i.d. assumptions made in statistical learning. For example, naively training the agent on the gold labels alone would unrealistically teach the agent to make decisions under the assumption that all previous decisions were correct, potentially causing it to over-rely on information from past actions. (Clark 2015)[4]
TODO: reduce error propagation

TODO: "Training a fully joint model from scratch is also unrealistic because it requires text that is annotated with all the tasks, thus making joint training implausible from a learning theoretic perspective (See Punyakanok et al. (2005) for a discussion about the learning theoretic requirements of joint training.)"

Alleviate "exposure bias" -- from Ranzato et al. (2016)[5]: "exposure bias which occurs when a model is only exposed to the training data distribution, instead of its own predictions"

Algorithms

TODO: Bengio et al. (2015)[6]

  • Approximate Policy Gradient (Le and Fokkens, 2016)[7], Shen et al. (2016)[8]
  • "Learning to search" algorithm family
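
To make the policy-gradient idea behind the first item concrete, here is a minimal REINFORCE-style sketch with a softmax policy over a few actions and a single scalar reward per sampled action (bandit-style feedback). It is not the approximate policy gradient of Le and Fokkens or the minimum risk training of Shen et al.; the names `reinforce` and `reward_fn`, the running-average baseline, and all hyperparameters are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(reward_fn, num_actions, steps=2000, lr=0.1, seed=0):
    """Train one logit per action from sampled rewards only (hypothetical helper)."""
    rng = random.Random(seed)
    theta = [0.0] * num_actions
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(num_actions), weights=probs)[0]
        reward = reward_fn(action)
        advantage = reward - baseline
        for k in range(num_actions):
            # d/d theta_k of log pi(action) for a softmax policy
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log
        baseline += 0.05 * (reward - baseline)
    return softmax(theta)


# Toy reward (an assumption for illustration): only action 2 is "correct".
final_probs = reinforce(lambda a: 1.0 if a == 2 else 0.0, num_actions=3)
print(final_probs)
```

The learner never sees which action is correct, only the reward of the action it sampled, which is what distinguishes this setting from training on gold labels.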

Formalisms

TODO: Sokolov et al. (2016)[9]: a good list of formalisms

See also: Sokolov et al. (2016b)[10]

Applications

Multiclass text classification

Sokolov et al. (2016)[9]

Chunking

Sokolov et al. (2016)[9]

Named-entity recognition

Lao et al. (2019)[11]

Syntactic parsing

From Goldberg and Nivre (2012)[12] "Deterministic classifier-based dependency parsing is an instance of independent sequential classification-based structured prediction. [...] Several methods have been developed to cope with error propagation in sequential classification, including stacked sequential learning (Cohen and Carvalho, 2005), LaSO (Daumé III and Marcu, 2005), Searn (Daumé III et al., 2009) and its followers SMILe (Ross and Bagnell, 2010) and DAgger (Ross et al., 2011)."

Jiang et al. (2012)[13]: imitation learning for agenda-based parsing

Zhang and Chan (2009)[14]: energy-based value function, dependency parsing

Chang et al. (2015)[15]: a general (?) framework for dependency parsing using L2S, also does labeled parsing.

Le and Fokkens (2016)[16]: RL reduces error propagation

Semantic role labeling

Wolfe et al. (2016)[17] tried several imitation learning approaches but obtained very low results.

Semantic parsing

Berant & Liang (2015)[18] apply imitation learning to the training of semantic parsers.

AMR parsing: Rao et al. (2015)[19]

TODO: supervision signals for semantic parsing

  • Demonstrations
  • Distant supervision
  • Conversations

TODO: Clarke et al. (2010)[20]

Text-based games

TODO: RL for text games (He et al., 2016)[21]

Machine translation

Used for simultaneous (real-time) machine translation: Grissom et al. (2014)[22]

Wiseman and Rush (2016)[23]: seq2seq model

Sokolov et al. (2015)[24], Sokolov et al. (2016)[9]

Language modeling

TODO: Bengio et al. (2015)[25] feed the model's own predicted tokens back in as inputs when predicting the next word during training (scheduled sampling).
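
A rough sketch of the scheduled sampling idea: at each position the next input is the gold token with probability epsilon and the model's own prediction otherwise, with epsilon decaying over training. The inverse sigmoid schedule is one of the decay schedules described in the paper; `predict_fn` is a hypothetical stand-in for a real model, and the function names and the `<s>` start symbol are assumptions made for illustration.

```python
import math
import random


def inverse_sigmoid_decay(step, k=100.0):
    """Probability of feeding the gold token at a given training step.

    Note: math.exp overflows for very large step / k; fine for a sketch.
    """
    return k / (k + math.exp(step / k))


def scheduled_sampling_inputs(gold, predict_fn, epsilon, rng, bos="<s>"):
    """Return the previous-token inputs the model would be conditioned on."""
    prev = bos
    inputs = []
    for gold_tok in gold:
        inputs.append(prev)
        predicted = predict_fn(prev)  # model's guess for the current position
        # Coin flip: gold token with probability epsilon, else own prediction.
        prev = gold_tok if rng.random() < epsilon else predicted
    return inputs


# Toy "model" (an assumption): predicts the upper-cased previous token.
gold = ["a", "b", "c"]
toy_predict = lambda tok: tok.upper()
print(scheduled_sampling_inputs(gold, toy_predict, 1.0, random.Random(0)))
print(scheduled_sampling_inputs(gold, toy_predict, 0.0, random.Random(0)))
```

With epsilon fixed at 1.0 this reduces to ordinary training on gold histories; with epsilon at 0.0 the model only ever sees its own predictions.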

Coreference resolution

TODO: Ma et al. (2014)[26]

Clark (2015)[27]: "naively training the agent on the gold labels alone would unrealistically teach the agent to make decisions under the assumption that all previous decisions were correct, potentially causing it to over-rely on information from past actions. This is especially problematic in coreference, where the error rate is quite high.

Imitation learning, where expert demonstrations of good behavior are used to teach the agent, has proven very useful in practice for this sort of problem [1]. I use imitation learning to train the agent to classify whether an action (merge or do not merge the current pair of clusters) matches an expert policy. In particular, I use the DAgger imitation learning algorithm [32]."
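
A minimal sketch of the DAgger loop (Ross et al., 2011) named in the quote above. It is not Clark's coreference setup: the one-dimensional toy environment, the lookup-table learner, the `0.5 ** i` expert-mixing schedule, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter, defaultdict


def fit_table(dataset):
    """Supervised step: pick the majority expert label for each visited state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


def dagger(expert, step, actions, start_state, horizon, iterations=5, seed=0):
    rng = random.Random(seed)
    dataset = []  # aggregated (state, expert_action) pairs across iterations
    policy = {}   # learned table: state -> action
    for i in range(iterations):
        beta = 0.5 ** i  # chance of following the expert; 1.0 on the first pass
        state = start_state
        for _ in range(horizon):
            # Always record what the expert would do in the visited state.
            dataset.append((state, expert(state)))
            if rng.random() < beta:
                action = expert(state)
            elif state in policy:
                action = policy[state]
            else:
                action = rng.choice(actions)  # unseen state: act randomly
            state = step(state, action)
        policy = fit_table(dataset)  # retrain on everything collected so far
    return policy


GOAL = 5  # toy target position (assumption)


def toy_expert(state):
    return 1 if state < GOAL else -1


def toy_step(state, action):
    return state + action


learned = dagger(toy_expert, toy_step, actions=[-1, 1], start_state=0, horizon=8)
print(learned)
```

The key point is that states are collected by running the (partly) learned policy while labels still come from the expert, so the training data covers states the learner actually reaches.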

From Wiseman et al. (2016)[28]: "We also experimented with training approaches and model variants that expose the model to its own predictions (Daume III et al., 2009; Ross et al., 2011; Bengio et al., 2015), but found that these yielded a negligible performance improvement." In Table 2, they give compelling evidence that error propagation is not a big problem for their model.

From Clark and Manning (2016)[29]: "imitation learning algorithms such as SEARN (Daume III et al., 2009) have been used to train coreference resolvers (Daume III, 2006; Ma et al., 2014; Clark and Manning, 2015). These algorithms also directly optimize for coreference evaluation metrics, but they require an expert policy to learn from instead of relying on rewards alone." See also Le and Titov (2017)[30] for a non-reinforcement learning (?) approach.

Martschat and Strube (2015)[31]: "In principle we would like to directly optimize for the evaluation metric we are interested in. Unfortunately, the evaluation metrics used in coreference do not allow for efficient optimization based on mention pairs, since they operate on the entity level. For example, the CEAFe metric (Luo, 2005) needs to compute optimal entity alignments between gold and system entities. These alignments do not factor with respect to mention pairs."

Text generation

Word ordering: Wiseman and Rush (2016)[23]

Summarization

Paulus et al. (2017)[32]

Optical character recognition

Sokolov et al. (2016)[9]

Techniques

Memory and experience replay

See also

References

  1. Chang, M.-W., Goldwasser, D., Roth, D., & Srikumar, V. (2010). Discriminative Learning over Constrained Latent Representations. Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT 2010), (June), 429–437.
  2. Fang, Meng, Yuan Li, and Trevor Cohn. "Learning how to Active Learn: A Deep Reinforcement Learning Approach." arXiv preprint arXiv:1708.02383 (2017).
  3. Gu, Jiatao, Kyunghyun Cho, and Victor OK Li. "Trainable greedy decoding for neural machine translation." arXiv preprint arXiv:1702.02429 (2017).
  4. Clark, K. (2015). Neural Coreference Resolution.
  5. Ranzato, M., Chopra, S., Auli, M., & Zaremba, W. (2016). Sequence Level Training with Recurrent Neural Networks. ICLR, 1–15.
  6. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
  7. Minh Lê and Antske Fokkens. Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing. Proceedings of the European chapter of the Association for Computational Linguistics (EACL 2017). Valencia, Spain.
  8. Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum Risk Training for Neural Machine Translation. arXiv:1512.02433v3 [cs.CL], pages 1–9.
  9. Sokolov, Artem, Julia Kreutzer, Christopher Lo, and Stefan Riezler. "Learning structured predictors from bandit feedback for interactive NLP." ACL, 2016.
  10. Sokolov, Artem, Julia Kreutzer, and Stefan Riezler. "Stochastic structured prediction under bandit feedback." Advances In Neural Information Processing Systems. 2016.
  11. Lao, Y., Xu, J., Gao, S., Guo, J., & Wen, J.-R. (2019). Name Entity Recognition with Policy-Value Networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1245–1248). New York, NY, USA: Association for Computing Machinery.
  12. Goldberg, Yoav, and Joakim Nivre. "A Dynamic Oracle for Arc-Eager Dependency Parsing." In COLING, pp. 959-976. 2012.
  13. J. Jiang, A. Teichert, J. Eisner, and H. Daume. 2012. Learned prioritization for trading off accuracy and speed. In Advances in Neural Information Processing Systems (NIPS).
  14. Lidan Zhang and Kwok Ping Chan. 2009. Dependency Parsing with Energy-based Reinforcement Learning. In IWPT 2009, pages 234–237. ACL. 
  15. Chang, Kai-Wei, He He, Hal Daumé III, and John Langford. "Learning to search for dependencies." arXiv preprint arXiv:1503.05615 (2015).
  16. Minh Lê and Antske Fokkens. Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing. Proceedings of the European chapter of the Association for Computational Linguistics (EACL 2017). Valencia, Spain.
  17. Wolfe, Travis, Mark Dredze, and Benjamin Van Durme. "A Study of Imitation Learning Methods for Semantic Role Labeling." EMNLP 2016 (2016): 44.
  18. Berant, J., & Liang, P. (2015). Imitation Learning of Agenda-based Semantic Parsers. Transactions of the Association for Computational Linguistics (TACL), 3, 545–558.
  19. Rao, Sudha, Yogarshi Vyas, Hal Daume III, and Philip Resnik. "Parser for abstract meaning representation using learning to search." arXiv preprint arXiv:1510.07586 (2015).
  20. Clarke, James, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. "Driving Semantic Parsing from the World’s Response." CoNLL-2010 51 (2010): 18.
  21. Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, Mari Ostendorf. 2016. Deep Reinforcement Learning with a Natural Language Action Space. ACL
  22. Grissom II, Alvin, He He, Jordan L. Boyd-Graber, John Morgan, and Hal Daumé III. "Don't Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation." In EMNLP, pp. 1342-1352. 2014.
  23. Sam Wiseman, Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization.
  24. Artem Sokolov, Stefan Riezler, and Tanguy Urvoy. 2015. Bandit structured prediction for learning from user feedback in statistical machine translation. In MT Summit XV, Miami, FL
  25. Bengio, Samy, et al. "Scheduled sampling for sequence prediction with recurrent neural networks." Advances in Neural Information Processing Systems. 2015.
  26. Chao Ma, Janardhan Rao Doppa, J Walker Orr, Prashanth Mannem, Xiaoli Fern, Tom Dietterich, and Prasad Tadepalli. 2014. Prune-and-score: Learning for greedy coreference resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  27. Clark, K. (2015). Neural Coreference Resolution.
  28. Wiseman, S., Rush, A. M., & Shieber, S. M. (2016). Learning Global Features for Coreference Resolution. In NAACL-2016 (pp. 994–1004).
  29. Clark, Kevin, and Christopher D. Manning. "Deep reinforcement learning for mention-ranking coreference models." arXiv preprint arXiv:1609.08667 (2016).
  30. Le, Phong, and Ivan Titov. "Optimizing Differentiable Relaxations of Coreference Evaluation Metrics." arXiv preprint arXiv:1704.04451 (2017).
  31. Martschat, S., & Strube, M. (2015). Latent Structures for Coreference Resolution. Transactions of the Association for Computational Linguistics, 3(0), 405–418.
  32. Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A Deep Reinforced Model for Abstractive Summarization.