21 December 2000 Pattern matching techniques for correcting low-confidence OCR words in a known context
Glenn Ford, Susan E. Hauser, Daniel X. Le, George R. Thoma
Proceedings Volume 4307, Document Recognition and Retrieval VIII; (2000) https://doi.org/10.1117/12.410842
Event: Photonics West 2001 - Electronic Imaging, 2001, San Jose, CA, United States
Abstract
A commercial OCR system is a key component of a system developed at the National Library of Medicine for the automated extraction of bibliographic fields from biomedical journals. This 5-engine OCR system, while exhibiting high performance overall, does not reliably convert very small characters, especially those in italics. As a result, the 'affiliations' field, which contains such characters in most journals, is not captured accurately and requires a disproportionately large amount of manual input. To correct this problem, dictionaries have been created from words occurring in this field (e.g., university, department, street addresses, names of cities) in 230,000 articles already processed. The OCR output corresponding to the affiliation field is then matched against these dictionary entries by approximate string-matching techniques, and the ranked matches are presented to operators for verification. This paper outlines the techniques employed and the results of a comparative evaluation.
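The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which approximate string-matching techniques were evaluated, so this example assumes plain Levenshtein edit distance as a stand-in. The function names (`edit_distance`, `rank_candidates`) and the small word list are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rank_candidates(ocr_word, dictionary, max_results=3):
    """Return dictionary entries ordered by increasing distance to the OCR word.

    Comparison is case-insensitive; ties are broken alphabetically.
    """
    word = ocr_word.lower()
    scored = [(edit_distance(word, entry.lower()), entry) for entry in dictionary]
    scored.sort(key=lambda pair: (pair[0], pair[1]))
    return [entry for _, entry in scored[:max_results]]


# Hypothetical dictionary; the real one was built from 230,000 processed articles.
AFFILIATION_WORDS = ["University", "Department", "Medicine", "Bethesda", "Maryland"]

if __name__ == "__main__":
    # "Unlversity" imitates a typical OCR confusion of "i" and "l".
    for candidate in rank_candidates("Unlversity", AFFILIATION_WORDS):
        print(candidate)
```

In a workflow like the one described, the ranked list would be shown to an operator, who confirms or rejects the top suggestion. A production system would likely restrict candidates by length or by an index before computing distances, since scanning a large dictionary word by word is slow.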
© (2000) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Glenn Ford, Susan E. Hauser, Daniel X. Le, and George R. Thoma "Pattern matching techniques for correcting low-confidence OCR words in a known context", Proc. SPIE 4307, Document Recognition and Retrieval VIII, (21 December 2000); https://doi.org/10.1117/12.410842
CITATIONS
Cited by 11 scholarly publications.
KEYWORDS
Optical character recognition
Associative arrays
Mars
Medicine
Algorithm development
Biomedical optics
Databases

