From Strange et al. (2014)[1]: "Some studies of the effect of OCR errors [Lopresti 2008][2] [Walker et al. 2010][3] [Stein et al. 2006][4] have conducted comparisons by analysing two corpora, identical except for corrections of individual words."

TODO: Subramaniam et al. (2009)[5]

From Lopresti (2008)[2]: "Miller, et al.[6] study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output [17], although speech is their primary interest. They found that by training their system on both clean and noisy input material, performance degraded linearly as a function of word error rates."
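
The passage above relates extraction performance to word error rate. For reference, here is a minimal sketch of how that quantity is commonly computed, assuming whitespace tokenisation and unit costs for substitutions, insertions and deletions. The function name `word_error_rate` is hypothetical, and this is not necessarily the exact metric used by Miller et al.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    clean = "the quick brown fox"
    noisy = "the qu1ck brown"  # one substitution and one dropped word
    print(word_error_rate(clean, noisy))
```

Because the count is normalised by the reference length, the value can exceed 1.0 when the noisy text contains many spurious words.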

TODO: Michel Généreux, Diego Spano. NLP challenges in dealing with OCR-ed documents of derogated quality

Datasets

"Noisy text dataset" by Lopresti:

