Natural Language Understanding Wiki

Error propagation, cascading errors, compounding errors[1] or accumulation of errors is the tendency of a system to commit more errors once the first error(s) was made. That is the first error encourages the second, the first two errors encourage the third, and so on.


A pipeline architecture means that a module work on the output of previous modules, including mistakes they might have made. These mistakes might have negative impact on the current module's performance.

Textual entailment[]

Clark et al. (2007)[2] performed a "thought experiment" (authors' words) to identify necessary information to successfully solve RTE3.


Song et al. (2012)[3] found minor improvement in chunking by using a cost-sensitive POS tagger.

Syntactic parsing[]

Dridan and Oepen (2013)[4] show that using predicted sentence boundaries and tokenization reduce syntactic parsing accuracy slightly (still around 90%).

From McDonald & Nirve (2007)

McDonald and Nirve (2007)[5] are the first to point out that transition-based dependency parsing suffers from error propagation. To make the point, they showed that parsing accuracy decreases as sentences get longer (see photo).

Kummerfeld et al. (2013)[6] show that "improvements in Chinese POS tagging can make a substantial difference for some error types" in Chinese parsing and gold POS tags increase F1 from 81.10% to 86.77%.

Kong & Smith (2014)[7]: "all methods perform better with better POS tags". The effect of Berkeley POS tags vs. Stanford POS tags can be as big as 1 percentage point in LAS.

Nivre et al. (2009)[8] showed that shorter transition sequences lead to better results for SwapStandard parsing.

Favre et al. (2010)[9] showed that performance in syntactic parsing and POS are correlated.

Dell'Orletta (2009)[10]: "a POS tagging error may cause error propagation to the following steps of natural language processing. For instance Watson [11] and Yoshida et al. [12] show the impact of part-of speech errors in parsing task."

Semantic role labeling[]

From Carreras & Màrques (2005)[13]: "all systems experienced a severe drop in performance (about 10 F1 points), even though the base-line on the Brown test set is higher than that of the WSJ test set. As already said in previous sections, all the linguistic processors, from PoS tagging to full parsing, showed a much lower performance than in the WSJ test set... Error propagation and amplification through the chained modules make the final output generalize very badly when changing the domain of application."

From Zhou and Xu (2015)[14]: "The analysis in (Pradhan et al., 2005)[15] found that the major source of the incorrect predictions was the syntactic parser. Combination of different syntactic parsers was proposed to address this problem, from both feature level and model level (Surdeanu et al., 2007; Koomen et al., 2005; Pradhan et al., 2005)."

Punyakanok et al. (2008)[16] shows that syntactic parsing has a big effect on SRL performance. They do so by comparing shallow and full parsing. However they didn't quantify the effect of parsing errors on SRL directly. "the quality of the pruning stage cannot be solely determined based on its recall and precision. Instead, it depends on the characteristics of the output candidates that determine the difficulty of the downstream problems"

Gildea & Palmer (2002)[17] studied the effect of parsing accuracy on Propbank-style SRL. They show that "accuracy of semantic role prediction for known boundaries" increases from 79.2% to 82.8% and for unknown boundaries from 57.7% to 71.1%.

"F-measures on SRL drop more than 10% when switching from gold parses to automatic parses (for instance, from 91.2 to 80.0 for the joint model of Toutanova (2005)[18]."

Favre et al. (2010)[9] showed that performance in semantic role labeling and syntactic parsing are correlated.

From Marcheggiani et al. (2016)[19]: "Roth and Lapata (2016)[20] argue that syntactic features are necessary and show that performance of their model degrades dramatically if syntactic paths between arguments and predicates are not provided as an input."

Timeline construction[]

Caselli et al. (2015)[21] directly quantify error propagation and show that:

  • Errors in lower tasks (tokenization, POS tagging) don't have much effect
  • Errors in higher tasks (SRL, entity tasks) affect subsequent tasks very much

Machine translation[]

Han et al. (2012)[22] gave some examples where POS tagging and parsing errors negatively affect translation performance but didn't quantify the effect.

Han et al. (2013)[23]: "We observed relatively small effects on reordering quality in response of parsing errors. However, reordering quality affect word alignments, which in turn affect the quality of bilingual phrases that are extracted."

Plot from Quirk & Corston-Oliver (2006)

Quirk & Corston-Oliver (2006)[24]: "We vary parse quality by vary-ing the amount of data used to train the parser. [...] is a treelet SMT system sensitive to parse quality? We have shown that such a system is sensitive to the quality of the input syntactic analyses. With the less accurate parsers that result from training on extremely small numbers of sentences, performance is com-parable to state-of-the-art phrasal SMT systems. As the amount of data used to train the parser in-creases, both English-to-German and English-to-Japanese treelet SMT improve, and produce results that are statistically significantly better than the phrasal baseline." (see photo)

Excerpt from David Vandas[25]

Green (2001)[26]: Without proper noun phrase structure in PENN Treebank, researchers have resorted to right branching which leads to wrong interpretation in downstream modules. "This paper examines this noun phrase structure’s effect on dependency parsing, in English, with a maximum spanning tree parser and shows a 2.43%, 0.23 Bleu score, improvement for English to Czech machine translation."

Opinion extraction/Sentiment analysis[]

TODO: How important is syntactic parsing accuracy? An empirical evaluation on rule-based sentiment analysis

Yang & Cardie (2013)[27]: joint "extraction of opinion entities and opinion relations" --> "We can see that in general, using merged 10-best CRF outputs boosts the recall while sacrificing precision. This is expected since merging the 10-best CRF outputs favors candidates that are believed to be more accurate by the CRF predictor. If CRF makes mistakes, the mistakes will propagate to the relation extraction step. The poor performance on precision further confirms the error propagation problem in the pipeline approaches. In contrast, our joint-inference method success-fully boosts the recall while maintaining reason-able precision. This demonstrates that joint inference can effectively leverage the advantage of individual predictors and limit error propagation."

Path querying in knowledge bases (embedding models)[]

Demonstration of cascading errors in Guu et al. (2015)

See Guu et al. (2015)[28]

Noisy text[]

Intuitively, noise (e.g. via OCR or speech recognition) introduces errors that promote further errors down the pipeline. Lopresti (2008)[29] studies error propagation throughout 3 tasks: sentence boundary detection, tokenization and POS tagging.

Coreferene resolution[]

Lee et al. (2013, table 10)[30] show the impact of wrong named-entity and syntax analysis to (entity) coreference resolution. Compared to gold, predicted syntax analysis decreases CR performance by 1-2% and named-entity recognition by a smaller margin.

Peng et al. (2015)[31]: "Table 1: Performance gaps between using gold mentions and predicted mentions for three state-of-the-art corefer- ence resolution systems. Performance gaps are always larger than 10%."

Bengtson and Roth (2008)[32] show that using predicted mention boundaries and types in a end-to-end system decreases the performance by ~2.5% B3 F.

Ng and Cardie (2002)[33]: erroneous anaphoricity identification reduces recall of coreference resolution.

Haghighi and Klein (2010)[34]: "Performance on the A05RA dataset is generally lower because it includes articles from blogs and web forums where parser quality is significantly degraded."


TODO: Yu and Lam (2010)[35]: "Several closely related approaches have been proposed. Wellner et al. (2004) proposed an integrated model based on CRFs for citation matching, but the N-best list inference is a restrictive approximation for the full distribution. Zhu et al. (2007) and Yu et al. (2009) integrated two submodels together, but they are only loosely coupled in that they performed parameter estimation separately and inference information can only flow in one direction, similar as (Finkel, Manning, and Ng 2006). Hollingshead and Roark (2007) proposed pipeline iteration, using output from later stages of a pipeline to constrain earlier stages of the same pipeline, but it lacks the ability to model internal tight dependencies between stages. And these models often have complex structures and involve considerable engineering. For example, Zhu et al. (2008) used variational optimization approach for more learnable parameters in dynamic hierarchical modeling, and conducted extensive feature engineering. As can be seen, our approach outperforms previous integrated or joint models (Poon and Domingos 2007; Sutton, McCallum, and Rohanimanesh 2007), and is considerably easier to build and requires much less engineering. It is also general and can be easily applied to a wide range of probabilistic models and other real-world IE tasks."


See Easy first processing

Working on raw text[]

From Marcheggiani et al. (2016)[19]: "Syntactic parsers are unreliable on out-of-domain data, so standard (i.e. syntactically-informed) SRL models are hindered when tested in this setting."

Collobert et al. (2011)[36]: SENNA can be seen as a solution to error propagation.

Joint inference (integration)[]

TODO: integer linear programming (ILP)

Singh et al. (2013)[37]: "joint inference is an effective approach to avoid cascading of errors when inferring multiple natural language tasks", combined entity typing, entity relations and coreference resolution.

Peng et al. (2015)[31]: Joint Coreference Resolution and Mention Head Detection.

Li et al (2011)[38]: joint POS + parsing for Chinese, explicitly address error propagation: "Current research usually models POS tagging and dependency parsing independently. This may suffer from error propagation problem. [...] To solve this issue, this paper proposes a solution by jointly optimizing POS tagging and dependency parsing in a unique model."

Bayesian inference[]

Finkel et al. (2006)[39]

k-best lists[]

"A common improvement on this architecture is to pass k-best lists between processing stages, for example (Sutton and McCallum, 2005; Wellner et al., 2004). Passing on a k-best list gives useful improvements (e.g., in Koomen et al. (2005)), but efficiently enumerating k-best lists often requires very substantial cognitive and engineering effort, e.g., in (Huang and Chiang, 2005; Toutanova et al., 2005)."

Ambiguous output[]

Yoshida et al. (2007)[40]: "for each word in a sentence, POS taggers output a set of candidate POS tags, and the obtained tags are used to restrict parses in the parsing model."

Weighted finite state transducer[]

"If restricting oneself to weighted finite state transducers (WFSTs), a framework applicable to a number of NLP applications (as outlined in Karttunen (2000)), a pipeline can be compressed down into a single WFST, giving outputs equivalent to propagating the entire distribution through the pipeline."

compressing exponential parsing labelings: "for some models, such as most parsing models, these exponential labelings can be compactly represented in a packed form, e.g., (Maxwell and Kaplan, 1995; Crouch, 2005), and subsequent stages can be reengineered to work over these packed representations, e.g., (Geman and Johnson, 2002)."

Second pass[]

Hollingshead & Roark (2007)[41] use a second pass through the pipeline to let later modules inform earlier modules.

Probabilistic features or intermediate results[]

Bunescu (2008)[42] uses probabilistic features to alleviate error propagation, including POS tags and dependency. For dependency features, however, they marginalize the probability of each edge, effectively assume their dependency. TODO: there exists other attempts to use POS tags distribution to improve later statges.

Raman et al. (2013)[43] propose adaptive beam search using blackbox components.

Henestroza Anguiano & Candito, M. (2012)[44]: "We use a richer framework that allows for probabilistic generalization, with a word represented as a probability distribution over a space of generalized classes: lemmas, clusters, or synsets. Probabilistic lexical information is introduced into parser feature vectors by modifying the weights of lexical features."

Reinforcement learning[]

See also: Reinforcement learning for NLP

Imperatively-Defined Factor Graphs[]

TODO: Singh et al. (2009)[45] for joint entity resolution and (entity) segmentation.


  1. Used in this lecture:
  2. Clark, P., Harrison, P., Thompson, J., Murray, W., Hobbs, J., Fellbaum, C.: On the role of lexical and world knowledge in RTE3. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Prague (June 2007)
  3. Song, H.-J., Son, J.-W., Noh, T.-G., Park, S.-B., & Lee, S.-J. (2012). A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1025–1034). Jeju Island, Korea: Association for Computational Linguistics.
  4. Dridan, Rebecca, and Stephan Oepen. "Document Parsing: Towards Realistic Syntactic Analysis." In Proceedings of the 13th International Conference on Parsing Technologies. 2013.
  5. McDonald, Ryan T., and Joakim Nivre. "Characterizing the Errors of Data-Driven Dependency Parsing Models." EMNLP-CoNLL. 2007.
  6. Kummerfeld, J. K., Tse, D., Curran, J. R., & Klein, D. (2013). An Empirical Examination of Challenges in Chinese Parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 98–103). Sofia, Bulgaria: Association for Computational Linguistics.
  7. Kong, Lingpeng and Noah A. Smith (2014). An Empirical Comparison of Parsing Methods for Stanford Dependencies. PDF
  8. Joakim Nivre, Marco Kuhlmann, and Johan Hall. 2009. An improved oracle for dependency parsing with online reordering. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 73–76, Paris, France, October. Association for Computational Linguistics. PDF
  9. 9.0 9.1 Favre, Benoit, Bernd Bohnet, and Dilek Hakkani-Tür. "Evaluation of semantic role labeling and dependency parsing of automatic speech recognition output." 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010.
  10. Dell’Orletta, F. (2009). Ensemble system for Part-of-Speech tagging. Proceedings of EVALITA 2009.
  11. Watson, R.: Part-of-speech tagging models for parsing. In: Proceedings of the 9th Annual CLUK Colloquium. Open University, Milton Keynes, UK (2006)
  12. Yoshida, K., Tsuruoka, Y., Miyao, Y., Tsujii, J.: Ambiguous part-of-speech tagging for improving accuracy and domain portability of syntactic parsers. In: Proceed- ings of the 20th international joint conference on Artifical intelligence (IJCAI’07). Hyderabad, India (2007)
  13. Carreras, X., & Màrques, L. (2005). Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. Proceedings of CoNLL-2005, 152–164.
  14. Zhou, Jie, and Wei Xu. "End-to-end learning of semantic role labeling using recurrent neural networks." Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.
  15. Sameer Pradhan, Kadri Hacioglu, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Semantic role chunking combining complementary syntactic views. In Proceedings of the 9th Conference on Computational Natural Language Learning, CONLL ’05, pages 217–220, Stroudsburg, PA, USA. Association for Computational Linguistics.
  16. Punyakanok, V., Roth, D., & Yih, W. (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics.
  17. Gildea, D., & Palmer, M. (2002). The Necessity of Parsing for Predicate Argument Recognition. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (pp. 239–246). Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. doi:10.3115/1073083.1073124
  18. Kristina Toutanova. 2005. Effective statistical models for syn- tactic and semantic disambiguation. Ph.D. thesis, Stan- ford University.
  19. 19.0 19.1 Marcheggiani, D., Frolov, A., & Titov, I. (2016). A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling.
  20. Michael Roth and Mirella Lapata. 2016. Neural se- mantic role labeling with dependency path embed- dings. In Proceedings of the 54th Annual Meet- ing of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 1192–1202, Berlin, Germany, August. Association for Compu- tational Linguistics. Michael
  21. Caselli, T., Vossen, P., van Erp, M., Fokkens, A., Ilievski, F., Izquierdo, R., … Postma, M. (2015). When it’s all piling up: investigating error propagation in an NLP pipeline. In Proceedings of the Workshop on NLP Applications: Completing the Puzzle, {WNACP} 2015, co-located with the 20th International Conference on Applications of Natural Language to Information Systems {(NLDB} 2015), Passau, Germany, June 17-19, 2015.
  22. Han, D., Sudoh, K., Wu, X., Duh, K., Tsukada, H., & Nagata, M. (2012). Head Finalization Reordering for Chinese-to-Japanese Machine Translation. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (pp. 57–66). Jeju, Republic of Korea: Association for Computational Linguistics.
  23. Han, D., Mart\inez-Gómez, P., Miyao, Y., Sudoh, K., & Nagata, M. (2013). Effects of parsing errors on pre-reordering performance for Chinese-to-Japanese SMT. 27th Pacific Asia Conference on Language, Information, and Computation, 267–276.
  24. Quirk, C., & Corston-Oliver, S. (2006). The impact of parse quality on syntactically-informed statistical machine translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 62–69). Sydney, Australia: Association for Computational Linguistics.
  25. Vadas, D., & Curran, J. R. (2007). Adding noun phrase structure to the Penn Treebank. 45th Annual Meeting of the Association of Computational Linguistics, (June), 240–247. Retrieved from
  26. Nathan Green. 2011. Effects of noun phrase bracketing in dependency parsing and machine translation. In Proc. of ACL-HLT, Student Ses- sion, pages 69–74.
  27. Yang, B., & Cardie, C. (2013). Joint Inference for Fine-grained Opinion Extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1640–1649). Sofia, Bulgaria: Association for Computational Linguistics.
  28. Guu, K., Miller, J., & Liang, P. (2015). Traversing knowledge graphs in vector space. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP ’15), (September), 318–327. Retrieved from
  29. Lopresti, D. (2009). Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition (IJDAR), 12(3), 141–151.
  30. Lee, Heeyoung, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. "Deterministic coreference resolution based on entity-centric, precision-ranked rules." Computational Linguistics 39, no. 4 (2013): 885-916.
  31. 31.0 31.1 Peng, H., Chang, K.-W., & Roth, D. (2015). A Joint Framework for Coreference Resolution and Mention Head Detection. CoNLL, 12–21.
  32. Bengtson, E., & Roth, D. (2008). Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP ’08, 51(October), 294.
  33. Ng, V., & Cardie, C. (2002). Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. Proceedings of the 19th International Conference on Computational Linguistics -, 1(1987), 1–7.
  34. Haghighi, A., & Klein, D. (2010). Coreference resolution in a modular, entity-centered model. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, (June), 385–393.
  35. Yu, X. (2001). Bidirectional Integration of Pipeline Models. Artificial Intelligence, 1045–1050.
  36. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493--2537.
  37. Singh, S., Riedel, S., Martin, B., Zheng, J., & McCallum, A. (2013). Joint Inference of Entities, Relations, and Coreference. Automated Knowledge Base Construction (AKBC CIKM 2013), 1–6.
  38. Li, Z., Zhang, M., Che, W., Liu, T., Chen, W., & Li, H. (2011). Joint Models for Chinese POS Tagging and Dependency Parsing. EMNLP 2011, 22(1), 1180–1191.
  39. Finkel, J. R., Manning, C. D., & Ng, A. Y. (2006). Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 618–626). Sydney, Australia: Association for Computational Linguistics.
  40. Yoshida, K., Tsuruoka, Y., Miyao, Y., & Tsujii, J. (2007). Ambiguous part-of-speech tagging for improving accuracy and domain portability of syntactic parsers. IJCAI International Joint Conference on Artificial Intelligence, 1783–1788.
  41. Hollingshead, K., & Roark, B. (2007). Pipeline Iteration. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 952–959). Prague, Czech Republic: Association for Computational Linguistics.
  42. Bunescu, R. C. (2008). Learning with Probabilistic Features for Improved Pipeline Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 670–679). Stroudsburg, PA, USA: Association for Computational Linguistics.
  43. Raman, K., Swaminathan, A., Gehrke, J., & Joachims, T. (2013). Beyond Myopic Inference in Big Data Pipelines. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 86–94). New York, NY, USA: ACM. doi:10.1145/2487575.2487588
  44. Henestroza Anguiano, E., & Candito, M. (2012). Probabilistic Lexical Generalization for French Dependency Parsing. In Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (pp. 1–11). Jeju, Republic of Korea: Association for Computational Linguistics.
  45. Singh, S., Schultz, K., & McCallum, A. (2009). Bi-directional joint inference for entity resolution and segmentation using imperatively-defined factor graphs. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5782 LNAI(PART 2), 414–429.