Natural Language Understanding Wiki

TODO: What's in a p-value in NLP?, Yeh (2000)[1], Rayson (2003)[2], Berg-Kirkpatrick et al. (2012)[3], Nivre (2001)[4], Goutte and Gaussier (2005)[5]

Stats in ACL 2018: only 1/5 papers do it right

Hitchhiker's guide: Dror et al. (2018)[6]


Stratified shuffling-based randomization test (Yeh 2000)[1] seems very common.
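
A minimal sketch of a paired approximate randomization test in the spirit of the stratified shuffling described by Yeh (2000). It assumes each system has one numeric score per test item and uses the difference in mean score as the test statistic; both are simplifications chosen for illustration (a corpus-level metric such as F-score would have to be recomputed on every shuffle instead). The function name and parameters are hypothetical.

```python
import random


def approximate_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test on the mean score difference.

    scores_a and scores_b hold per-item scores of two systems on the same items
    (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so swap them on this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Usage: `approximate_randomization_test(per_item_a, per_item_b)` returns an estimated p-value; small values suggest the observed difference is unlikely if the two systems were interchangeable.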

Lee et al. (2015)[7] use "two-sided bootstrap resampling statistical significance tests (Graham et al., 2014)"

Bugert et al. (2017)[8], Zhou et al. (2015)[9], and Nivre et al. (2009)[10] use McNemar's test.
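
For reference, a sketch of the exact (binomial) form of McNemar's test for two classifiers evaluated on the same items. Which variant the papers above used (exact, chi-square, with or without continuity correction) is not recorded here, so the choice of the exact form is an assumption, as are the function name and the input layout (parallel lists of gold labels and predictions).

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar test; returns a p-value.

    Only discordant items matter: those where exactly one system is correct.
    """
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(only_a, only_b)
    # Under the null, each discordant item favours either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test ignores items on which both systems are right or both are wrong, so it applies to per-item correctness (accuracy-style metrics) and not directly to corpus-level scores such as F-score or BLEU.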

I have seen some papers use Koehn's bootstrap resampling procedure (Koehn 2004)[11].
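
A rough sketch of paired bootstrap resampling in the style of Koehn (2004): draw test sets with replacement, score both systems on each draw, and count how often system A comes out ahead. Koehn works with BLEU on resampled sets of translated sentences; the version below assumes per-item scores that can simply be summed, which is a simplification, and the function name is hypothetical.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap test sets on which system A strictly beats system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Resample item indices with replacement; both systems share the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

If A wins on at least 95% of the resampled sets, the usual reading is that A is better than B at the 95% confidence level.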

Zapirain et al. (2013)[12]: "we checked for statistical significance using bootstrap resampling (100 samples) coupled with one-tailed paired t-test (Noreen 1989)."

Bengtson & Roth (2008)[13]: "paired non-parametric bootstrapping percentile test".
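
One possible reading of a paired percentile bootstrap test, sketched under assumptions: resample the per-item score differences with replacement, collect the mean difference of each resample, and report the percentile interval; the difference is called significant at level alpha if the interval excludes zero. The exact procedure of Bengtson & Roth is not described on this page, so this is an illustration of the general technique, not their implementation. The function name and parameters are hypothetical.

```python
import random


def bootstrap_percentile_interval(scores_a, scores_b, samples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference (A minus B)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    means = []
    for _ in range(samples):
        # Each resample draws n paired differences with replacement.
        means.append(sum(deltas[rng.randrange(n)] for _ in range(n)) / n)
    means.sort()
    lower = means[int((alpha / 2) * samples)]
    upper = means[min(samples - 1, int((1 - alpha / 2) * samples))]
    return lower, upper
```

Usage: `lower, upper = bootstrap_percentile_interval(per_item_a, per_item_b)`; if `lower > 0` or `upper < 0`, the interval excludes zero.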

Batchkarov et al. (2016)[14] use "bootstrapping" to estimate variance and later hint at statistical significance.

Gerber and Chai (2012)[15]: "We used a bootstrap resampling technique similar to those developed by Efron and Tibshirani (1993) to test the significance of the performance difference between various systems."

Random variation

TODO: Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

Deep Reinforcement Learning that Matters (


  1. Yeh, Alexander. "More accurate tests for the statistical significance of result differences." Proceedings of the 18th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 2000.
  2. Rayson, Paul. Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Diss. Lancaster University, 2003.
  3. Berg-Kirkpatrick, Taylor, David Burkett, and Dan Klein. "An empirical investigation of statistical significance in NLP." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
  4. Nivre, Joakim. "On statistical methods in natural language processing." Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001). 2001.
  5. Goutte, Cyril, and Eric Gaussier. "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation." European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2005.
  6. Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1383-1392).
  7. Lee, K., Artzi, Y., Choi, Y., & Zettlemoyer, L. (2015). Event Detection and Factuality Assessment with Non-Expert Supervision. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1643–1648.
  8. Bugert, M., Puzikov, Y., Rücklé, A., Eckle-Kohler, J., Martin, T., Martínez-Cámara, E., Sorokin, D., Peyrard, M., & Gurevych, I. (2017). LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test. The 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics (LSDSem 2017), 56–61.
  9. Zhou, Mengfei, Anette Frank, Annemarie Friedrich, and Alexis Palmer. “Semantically Enriched Models for Modal Sense Classification.” In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), p. 44. 2015.
  10. Nivre, J., Kuhlmann, M., & Hall, J. (2009). An Improved Oracle for Dependency Parsing with Online Reordering. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09) (pp. 73–76). Paris, France: Association for Computational Linguistics.
  11. Koehn, P. (2004). Statistical significance tests for machine translation evaluation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 4, 388–395.
  12. Zapirain, B., Agirre, E., Màrquez, L., & Surdeanu, M. (2013). Selectional Preferences for Semantic Role Classification. Computational Linguistics, 39(3).
  13. Bengtson, E., & Roth, D. (2008). Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, 51(October), 294.
  14. Batchkarov et al. (2016). ACL 2016.
  15. Gerber, M. S., & Chai, J. Y. (2012). Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38(4), 755–798.