Triage of OCR results using confidence scores
18 December 2001
Proceedings Volume 4670, Document Recognition and Retrieval IX; (2001) https://doi.org/10.1117/12.450730
Event: Electronic Imaging, 2002, San Jose, California, United States
Abstract
We describe a technique for modeling the character recognition accuracy of an OCR system -- treated as a black box -- on a particular page of printed text based on an examination only of the output top-choice character classifications and, for each, a confidence score such as is supplied by many commercial OCR systems. Latent conditional independence (LCI) models perform better on this task, in our experience, than naive uniform thresholding methods. Given a sufficiently large and representative dataset of OCR (errorful) output and manually proofed (correct) text, we can automatically infer LCI models that exhibit a useful degree of reliability. A collaboration between a PARC research group and a Xerox legacy conversion service bureau has demonstrated that such models can significantly improve the productivity of human proofing staff by triaging -- that is, selecting to bypass manual inspection -- pages whose estimated OCR accuracy exceeds a threshold chosen to ensure that a customer-specified per-page accuracy target will be met with sufficient confidence. We report experimental results on over 1400 pages. Our triage software tools are running in production and will be applied to more than 5 million pages of multi-lingual text.
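As a rough illustration of the triage decision, the following sketch implements one possible reading of the "naive uniform thresholding" baseline mentioned above. It is not the LCI model, whose details are not given in this abstract. The function names, the per-character threshold, and the page-level threshold are all hypothetical, and the confidence scores are made-up values chosen only to show the control flow.

```python
from typing import Dict, Sequence


def estimate_page_accuracy(confidences: Sequence[float], char_threshold: float) -> float:
    """Estimate page accuracy as the fraction of characters whose OCR
    confidence is at least char_threshold.

    Assumption: this is one simple reading of a uniform-threshold baseline,
    not the method described in the paper.
    """
    if not confidences:
        # An empty page gets an estimate of 0.0 so it is never bypassed.
        return 0.0
    accepted = sum(1 for c in confidences if c >= char_threshold)
    return accepted / len(confidences)


def triage_page(confidences: Sequence[float], char_threshold: float,
                page_threshold: float) -> bool:
    """Return True if the page may bypass manual inspection."""
    return estimate_page_accuracy(confidences, char_threshold) >= page_threshold


# Illustrative (invented) per-character confidence scores for two pages.
pages: Dict[str, Sequence[float]] = {
    "page_a": [0.99, 0.98, 0.97, 0.99],
    "page_b": [0.99, 0.40, 0.95, 0.55],
}

bypass = {name: triage_page(scores, char_threshold=0.9, page_threshold=0.99)
          for name, scores in pages.items()}
```

In practice the page-level threshold would have to be calibrated on proofed ground-truth pages so that the customer-specified accuracy target is met with the required confidence; the values used here are placeholders.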
© (2001) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Prateek Sarkar, Henry S. Baird, and John Henderson "Triage of OCR results using confidence scores", Proc. SPIE 4670, Document Recognition and Retrieval IX, (18 December 2001); https://doi.org/10.1117/12.450730
KEYWORDS
Optical character recognition
Systems modeling
Classification systems
Data modeling
Inspection
Reliability