There are concerns that evaluation metrics in NLP are not rigorous enough. Systems that attain good performance scores still routinely make basic mistakes.


  1. Levy et al. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? NAACL 2015
  2. Moosavi & Strube Lexical Features in Coreference Resolution: To be Used With Caution ACL 2017
  3. Agrawal et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset 2017
  4. Yatskar et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition CVPR 2017
  5. Jia & Liang Adversarial Examples for Evaluating Reading Comprehension Systems EMNLP 2017
  6. Levy et al. Zero-Shot Relation Extraction via Reading Comprehension CoNLL 2017
  7. Belinkov & Bisk Synthetic and Natural Noise Both Break Neural Machine Translation ICLR 2018
  8. Ettinger et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task EMNLP Wksp 2017

Build It Break It shared task papers:

BIBI System Description: Building with CNNs and Breaking with Deep Reinforcement Learning  Yitong Li, Trevor Cohn and Timothy Baldwin

Strawman: an Ensemble of Deep Bag-of-Ngrams for Sentiment Analysis  Kyunghyun Cho

Breaking Sentiment Analysis of Movie Reviews  Ieva Staliūnaitė and Ben Bonfil

Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems  Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille and Michael White

ACTSA: Annotated Corpus for Telugu Sentiment Analysis  Sandeep Sricharan Mukku

Analysing Errors of Open Information Extraction Systems  Rudolf Schneider, Tom Oberhauser, Tobias Klatt, Felix A. Gers and Alexander Löser

An Adaptable Lexical Simplification Architecture for Major Ibero-Romance Languages  Daniel Ferrés, Horacio Saggion and Xavier Gómez Guinovart

Cross-genre Document Retrieval: Matching between Conversational and Formal Writings  Tomasz Jurczyk and Jinho D. Choi

Massively Multilingual Neural Grapheme-to-Phoneme Conversion  Ben Peters, Jon Dehdari and Josef van Genabith