Natural Language Understanding Wiki

From Strange et al. (2014)[1]: "Some studies of the effect of OCR errors [Lopresti 2008][2] [Walker et al. 2010][3] [Stein et al. 2006][4] have conducted comparisons by analysing two corpora, identical except for corrections of individual words."

TODO: Subramaniam et al. (2009)[5]

From Lopresti (2008)[2]: "Miller, et al.[6] study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output [17], although speech is their primary interest. They found that by training their system on both clean and noisy input material, performance degraded linearly as a function of word error rates."
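
The quote above relates extraction performance to word error rate (WER). As a point of reference, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, insertions, deletions) between a clean reference and the noisy OCR/ASR output, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; this is not the evaluation setup used by Miller et al.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting, with no
    case folding or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    clean = "the quick brown fox"
    noisy = "the qu1ck brown fax"  # two OCR-style substitutions
    print(word_error_rate(clean, noisy))
```

Because insertions are counted, WER computed this way can exceed 1.0 when the noisy output is much longer than the reference.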

TODO: Michel Généreux, Diego Spano. NLP challenges in dealing with OCR-ed documents of derogated quality


"Noisy text dataset" by Lopresti:


  1. Strange, C., McNamara, D., Wodak, J., & Wood, I. (2014). Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers. Digital Humanities Quarterly, 8(1), 1–17.
  2. Lopresti, D. (2009). Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition (IJDAR), 12(3), 141–151.
  3. Walker, D. D., Lund, W. B., & Ringger, E. K. (2010). Evaluating Models of Latent Document Semantics in the Presence of OCR Errors. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 240–250. Boston: Association for Computational Linguistics.
  4. Stein, S. S., Argamon, S., & Frieder, O. (2006). The effect of OCR errors on stylistic text classification. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
  5. Subramaniam, L. V., Roy, S., Faruquie, T. A., & Negi, S. (2009). A Survey of Types of Text Noise and Techniques to Handle Noisy Text. Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, (January), 115–122.
  6. Miller, D., Boisen, S., Schwartz, R., Stone, R., & Weischedel, R. (2000). Named entity extraction from noisy input: Speech and OCR. Proceedings of the 6th Applied Natural Language Processing Conference, 316–324. Seattle, WA.