Paper
1 April 1998 Document image cleanup and binarization
Victor Wu, Raghaven Manmatha
Author Affiliations +
Proceedings Volume 3305, Document Recognition V; (1998) https://doi.org/10.1117/12.304638
Event: Photonics West '98 Electronic Imaging, 1998, San Jose, CA, United States
Abstract
Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technology do not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass filter. The smoothing operation enhances the text relative to any background texture. This is because background texture normally has higher frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of source: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91 percent of the characters and 86 percent of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of fonts which were OCR recognizable. The recognition rate was 84 percent for the characters and 77 percent for the words.
© (1998) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Victor Wu and Raghaven Manmatha "Document image cleanup and binarization", Proc. SPIE 3305, Document Recognition V, (1 April 1998); https://doi.org/10.1117/12.304638
Lens.org Logo
CITATIONS
Cited by 28 scholarly publications.
Advertisement
Advertisement
RIGHTS & PERMISSIONS
Get copyright permission  Get copyright permission on Copyright Marketplace
KEYWORDS
Optical character recognition

Detection and tracking algorithms

Linear filtering

Algorithm development

Image processing

Indium nitride

Photography

Back to Top