There are concerns that evaluation metrics in NLP are not rigorous enough. Systems that achieve good scores on these metrics still make basic mistakes all the time.


