TODO: What's in a p-value in NLP?, Yeh (2000)[1], Rayson (2003)[2], Berg-Kirkpatrick et al. (2012)[3], Nivre (2001)[4], Cyril and Gaussier (2005)[5]

Stats in ACL 2018: only 1/5 papers do it right

Hitchhiker's guide: Dror et al. (2018)[6]

Usage

Stratified shuffling-based randomization test (Yeh 2000)[1] seems very common.
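As a rough illustration of the idea behind this kind of test, here is a minimal sketch of a paired approximate randomization test. It assumes that each system produces one output or score per test instance and that the metric can be recomputed from a list of such items. The function name `approximate_randomization`, the default mean metric, the number of trials, and the smoothed estimate `(count + 1) / (trials + 1)` are choices made for this sketch, not details taken from the cited paper.

```python
import random


def approximate_randomization(items_a, items_b, metric=None, trials=10000, seed=0):
    """Two-sided paired randomization test (illustrative sketch).

    items_a, items_b: per-instance outputs or scores of two systems,
    aligned so that position i refers to the same test instance.
    metric: function mapping a list of items to a single number
    (assumption: defaults to the arithmetic mean).
    """
    if len(items_a) != len(items_b) or not items_a:
        raise ValueError("need two non-empty, aligned lists")
    if metric is None:
        metric = lambda xs: sum(xs) / len(xs)

    rng = random.Random(seed)
    observed = abs(metric(items_a) - metric(items_b))

    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        # Swap the two systems' outputs instance by instance with probability 0.5.
        for a, b in zip(items_a, items_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_extreme += 1

    # Smoothed p-value estimate, so the result is never exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

For corpus-level metrics such as F1, `metric` would have to recompute the score from the shuffled system outputs rather than average per-instance numbers.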

Lee et al. (2015)[7] use "two-sided bootstrap resampling statistical significance tests (Graham et al., 2014)"

Bugert et al. (2017)[8], Zhou et al. (2015)[9], and Nivre et al. (2009)[10] use McNemar's test.
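For reference, a minimal sketch of McNemar's test in its exact (binomial) form, assuming per-instance correctness flags are available for both systems. Only the instances on which the two systems disagree enter the test. The function name `mcnemar_exact` and the input format are assumptions of this sketch; the cited papers may use the chi-squared approximation instead.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags (sketch).

    correct_a, correct_b: sequences of booleans, aligned by test instance.
    Returns the p-value under the null hypothesis that both systems
    are equally likely to be the only one that is correct.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("inputs must be aligned and of equal length")

    # Discordant pairs: exactly one of the two systems is correct.
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree

    # Under the null, only_a ~ Binomial(n, 0.5); double the smaller tail.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The exact form is usually preferred when the number of discordant pairs is small, where the chi-squared approximation is unreliable.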

I saw some paper(s) use Koehn's subsampling procedure (Koehn 2004)[11]

Zapirain et al. (2013)[12]: "we checked for statistical significance using bootstrap resampling (100 samples) coupled with one-tailed paired t-test (Noreen 1989)."

Bengtson & Roth (2008)[13]: "paired non-parametric bootstrapping percentile test".

Batchkarov et al. (2016)[14] use "bootstrapping" to estimate variance and later hint at statistical significance.

Gerber and Chai (2012)[15]: "We used a bootstrap resampling technique similar to those developed by Efron and Tibshirani (1993) to test the significance of the performance difference between various systems."
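The bootstrap-based procedures quoted above differ in their details, and the papers do not all spell them out. As one plausible reading, here is a minimal sketch of a paired bootstrap on per-instance scores: it resamples test instances with replacement, records the mean score difference in each resample, and reports a percentile interval together with the fraction of resamples in which system A comes out ahead. The function name `paired_bootstrap`, the defaults (1000 samples, 95% interval), and the use of a mean over per-instance scores are assumptions of this sketch, not the method of any specific cited paper.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, alpha=0.05, seed=0):
    """Paired bootstrap over test instances (illustrative sketch).

    Returns (low, high, frac_a_wins): a percentile interval for the mean
    difference A - B, and the fraction of resamples where A beats B.
    """
    n = len(scores_a)
    if n == 0 or n != len(scores_b):
        raise ValueError("need two non-empty, aligned lists")

    rng = random.Random(seed)
    deltas = []
    for _ in range(samples):
        # Resample instance indices with replacement; the same indices are
        # used for both systems, which keeps the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)

    deltas.sort()
    low = deltas[int((alpha / 2) * samples)]
    high = deltas[min(samples - 1, int((1 - alpha / 2) * samples))]
    frac_a_wins = sum(d > 0 for d in deltas) / samples
    return low, high, frac_a_wins
```

If the interval excludes zero, the difference would be called significant at level `alpha` under this percentile reading.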

Random variation

TODO: Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

Deep Reinforcement Learning that Matters

