Natural Language Understanding Wiki

TODO: What's in a p-value in NLP?, Yeh (2000)[1], Rayson (2003)[2], Berg-Kirkpatrick et al. (2012)[3], Nivre (2001)[4], Goutte and Gaussier (2005)[5]

Statistical testing at ACL 2018: only 1 in 5 papers does it right (https://twitter.com/catherinehavasi/status/1019030669828112384?s=21)

Usage

Stratified shuffling-based randomization test (Yeh 2000)[1] seems very common.
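
A minimal sketch of a paired randomization test of this kind, to make the idea concrete. It assumes each system has a per-item score on the same test items (for example 1 for a correct output, 0 otherwise) and compares mean scores; the function name, the `trials` default, and the add-one smoothing are illustrative choices, not details taken from Yeh (2000). For metrics that are not simple sums over items, such as F-score, the metric would have to be recomputed on each shuffled pair of outputs instead.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean score.

    scores_a and scores_b hold per-item scores of two systems on the
    same test items, in the same order (hypothetical input format).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs for an item are
            # exchangeable, so swap them with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Calling `approximate_randomization(scores_a, scores_b)` returns a value in (0, 1]; a small value suggests the observed difference is unlikely if the two systems were interchangeable.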

Lee et al. (2015)[6] use "two-sided bootstrap resampling statistical significance tests (Graham et al., 2014)"

Bugert et al. (2017)[7], Zhou et al. (2015)[8], and Nivre et al. (2009)[9] use McNemar's test
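
As a rough illustration of McNemar's test for two classifiers evaluated on the same items, here is a sketch of the exact (binomial) variant. It is not taken from any of the papers above; the function name and the input format (parallel lists of gold labels and predictions) are assumptions. Only the discordant items, where exactly one system is correct, enter the test.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two systems on the same items."""
    if not (len(gold) == len(pred_a) == len(pred_b)):
        raise ValueError("gold, pred_a and pred_b must have equal length")
    # Count discordant items: exactly one of the two systems is correct.
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    # Under the null hypothesis each discordant item favours either
    # system with probability 0.5, so the smaller count is Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With many discordant items, a chi-squared approximation is often used instead of the exact binomial tail; the exact form is shown here because it needs no external library.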

Some papers use Koehn's paired bootstrap resampling procedure (Koehn 2004)[10]
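
A simplified sketch of paired bootstrap resampling in the spirit of Koehn (2004): resample the test set with replacement many times and count how often system A outscores system B on the resampled set. The sketch assumes per-item scores that can be summed; for a corpus-level metric such as BLEU, the metric would instead be recomputed on each resampled set. The function name and the `samples` default are illustrative assumptions.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap resamples on which system A beats system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n item indices with replacement; the same indices are
        # used for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A win rate of at least 0.95 is commonly read as system A being better at the 95% level.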

Zapirain et al. (2013)[11]: "we checked for statistical significance using bootstrap resampling (100 samples) coupled with one-tailed paired t-test (Noreen 1989)."

Bengtson & Roth (2008)[12]: "paired non-parametric bootstrapping percentile test".

Batchkarov et al. (2016)[13] use "bootstrapping" to estimate variance and later hint at statistical significance.
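
For illustration only, bootstrapping can estimate the variability of an evaluation score as follows. This is a generic sketch, not the procedure of Batchkarov et al. (2016); the function name and the assumption that the score is a mean of per-item values are mine.

```python
import random
import statistics


def bootstrap_std_error(scores, samples=1000, seed=0):
    """Bootstrap estimate of the standard error of the mean score."""
    if len(scores) == 0:
        raise ValueError("need at least one score")
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(samples):
        # Resample the items with replacement and recompute the mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)
```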

Gerber and Chai (2012)[14]: "We used a bootstrap resampling technique similar to those developed by Efron and Tibshirani (1993) to test the significance of the performance difference between various systems."

Random variation

TODO: Reimers and Gurevych (2017), "Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging"

Henderson et al. (2017), "Deep Reinforcement Learning that Matters" (https://arxiv.org/abs/1709.06560)
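
Both works concern how much results vary across training runs. One simple way to act on that, sketched here under the assumption that the same model has been trained several times with different random seeds, is to report the distribution of scores instead of a single number. The helper name and the returned fields are hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Summarize evaluation scores obtained from several training runs."""
    if len(scores) == 0:
        raise ValueError("need at least one run")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        # The sample standard deviation is undefined for a single run.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```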

References

  1. Yeh, Alexander. "More accurate tests for the statistical significance of result differences." Proceedings of the 18th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 2000.
  2. Rayson, Paul. Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Diss. Lancaster University, 2003.
  3. Berg-Kirkpatrick, Taylor, David Burkett, and Dan Klein. "An empirical investigation of statistical significance in NLP." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
  4. Nivre, Joakim. "On statistical methods in natural language processing." Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001). 2001.
  5. Goutte, Cyril, and Eric Gaussier. "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation." European Conference on Information Retrieval. Springer, Berlin, Heidelberg, 2005.
  6. Lee, K., Artzi, Y., Choi, Y., & Zettlemoyer, L. (2015). Event Detection and Factuality Assessment with Non-Expert Supervision. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1643–1648.
  7. Bugert, M., Puzikov, Y., Andreas, R., Eckle-Kohler, J., Martin, T., & Mart, E. (2017). LSDSem 2017: Exploring Data Generation Methods for the Story Cloze Test. The 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-Level Semantics (LSDSEM 2017), 56–61.
  8. Zhou, Mengfei, Anette Frank, Annemarie Friedrich, and Alexis Palmer. “Semantically Enriched Models for Modal Sense Classification.” In Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem), p. 44. 2015.
  9. Nivre, J., Kuhlmann, M., & Hall, J. (2009). An Improved Oracle for Dependency Parsing with Online Reordering. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09) (pp. 73–76). Paris, France: Association for Computational Linguistics.
  10. Koehn, P. (2004). Statistical significance tests for machine translation evaluation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 4, 388–395. http://doi.org/10.1145/2063576.2063688
  11. Zapirain, B., Agirre, E., Màrquez, L., & Surdeanu, M. (2013). Selectional Preferences for Semantic Role Classification. Computational Linguistics, 39(3).
  12. Bengtson, E., & Roth, D. (2008). Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, 51(October), 294. http://doi.org/10.3115/1613715.1613756
  13. ACL 2016. http://sro.sussex.ac.uk/62044/1/acl2016.pdf
  14. Gerber, M. S., & Chai, J. Y. (2012). Semantic role labeling of implicit arguments for nominal predicates. Computational Linguistics, 38(4), 755–798. http://doi.org/10.1162/COLI_a_00110