Using the web to validate document recognition results: experiments with business cards
17 January 2005
Clemens Oertel, Shauna O'Shea, Adam Bodnar, Dorothea Blostein
Proceedings Volume 5676, Document Recognition and Retrieval XII; (2005) https://doi.org/10.1117/12.588717
Event: Electronic Imaging 2005, San Jose, California, United States
Abstract
The World Wide Web is a vast information resource which can be useful for validating the results produced by document recognizers. Three computational steps are involved, all of them challenging: (1) use the recognition results in a Web search to retrieve Web pages that contain information similar to that in the document, (2) identify the relevant portions of the retrieved Web pages, and (3) analyze these relevant portions to determine what corrections (if any) should be made to the recognition result. We have conducted exploratory implementations of steps (1) and (2) in the business-card domain: we use fields of the business card to retrieve Web pages and identify the most relevant portions of those Web pages. In some cases, this information appears suitable for correcting OCR errors in the business card fields. In other cases, the approach fails due to stale information: when business cards are several years old and the business-card holder has changed jobs, then websites (such as the home page or company website) no longer contain information matching that on the business card. Our exploratory results indicate that in some domains it may be possible to develop effective means of querying the Web with recognition results, and to use this information to correct the recognition results and/or detect that the information is stale.
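The three steps named in the abstract can be illustrated with a small offline sketch. This is not the authors' implementation: the function names, the thresholds, the sample card, and the use of `difflib` string similarity are all assumptions made for illustration, and the "retrieved passages" are supplied by hand rather than fetched from a search engine.

```python
from difflib import SequenceMatcher

def build_query(card):
    """Step 1 (sketch): turn recognized card fields into a quoted search query."""
    return " ".join(f'"{v}"' for v in card.values() if v.strip())

def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for a real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_passage(value, passages):
    """Step 2 (sketch): return (score, passage) for the closest passage."""
    return max(((similarity(value, p), p) for p in passages),
               key=lambda pair: pair[0])

def propose_corrections(card, passages, accept=0.8):
    """Step 3 (sketch): suggest a passage that is close to a field but not identical.

    The 0.8 threshold is an arbitrary assumption, not a value from the paper.
    """
    if not passages:
        return {}
    suggestions = {}
    for name, value in card.items():
        score, passage = best_passage(value, passages)
        if accept <= score < 1.0:
            suggestions[name] = passage
    return suggestions

def possibly_stale(card, passages, floor=0.5):
    """Flag the card as possibly stale when no field resembles any passage."""
    if not passages:
        return True
    return all(best_passage(v, passages)[0] < floor for v in card.values())

# Hypothetical OCR output containing one misrecognized character ("l" for "i").
card = {"name": "Jane Smlth", "org": "Acme Widgets Inc"}
# Hypothetical text fragments taken from retrieved Web pages.
passages = ["Jane Smith", "Acme Widgets Inc", "Contact us"]

query = build_query(card)
corrections = propose_corrections(card, passages)
```

Here the `org` field matches a passage exactly and is left alone, while the `name` field is close to "Jane Smith" and is offered as a candidate correction. When no passage resembles any field, `possibly_stale` returns `True`, which loosely mirrors the stale-information failure case the abstract describes.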
© (2005) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Clemens Oertel, Shauna O'Shea, Adam Bodnar, and Dorothea Blostein "Using the web to validate document recognition results: experiments with business cards", Proc. SPIE 5676, Document Recognition and Retrieval XII, (17 January 2005); https://doi.org/10.1117/12.588717
CITATIONS
Cited by 2 scholarly publications.
KEYWORDS
Optical character recognition
Computer science
Analytical research
Databases
Electrical engineering
Mining
Associative arrays
