How you create the evaluation dataset can affect both the result and its validity and reliability.
For example, randomly assigning individual sentences to training and testing will put "close" sentences (from the same document, or even the same paragraph) on both sides of the split, which makes the task easier than it would be on genuinely unseen text.
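One way to avoid this kind of leakage is to split at the document level rather than the sentence level. The snippet below is only a minimal sketch: the function name `split_by_document`, the `(doc_id, text)` pair layout, and the toy corpus are assumptions made for illustration, not something taken from a particular toolkit.

```python
import random


def split_by_document(sentences, test_fraction=0.2, seed=0):
    """Split (doc_id, text) pairs so that no document appears on both sides.

    Hypothetical helper: whole documents, not single sentences, are
    assigned to either the training set or the test set.
    """
    # Collect the distinct document ids and shuffle them reproducibly.
    doc_ids = sorted({doc_id for doc_id, _ in sentences})
    rng = random.Random(seed)
    rng.shuffle(doc_ids)

    # Reserve a fraction of the documents (at least one) for testing.
    n_test = max(1, round(len(doc_ids) * test_fraction))
    test_docs = set(doc_ids[:n_test])

    train = [s for s in sentences if s[0] not in test_docs]
    test = [s for s in sentences if s[0] in test_docs]
    return train, test


# Toy corpus (assumed data): 10 documents with 5 sentences each.
corpus = [(f"doc{d}", f"sentence {i} of doc{d}") for d in range(10) for i in range(5)]
train, test = split_by_document(corpus)

train_docs = {doc_id for doc_id, _ in train}
test_docs = {doc_id for doc_id, _ in test}
```

Because the shuffle is applied to document ids, sentences from the same document (and therefore the same paragraph) always stay together. If paragraphs from different documents can also be near-duplicates, a coarser grouping key would be needed.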
The way the test set itself is sampled matters too: in Koehn (2004), building test sets from blocks of consecutive sentences increases the variation in the measured scores, which decreases reliability.
- Koehn, Philipp. "Statistical Significance Tests for Machine Translation Evaluation." EMNLP. 2004.