Title extraction and generation from OCR'd documents
29 January 2007
Kazem Taghva, Allen Condit, Steve Lumos, Julie Borsack, Thomas Nartker
Proceedings Volume 6500, Document Recognition and Retrieval XIV; 65000R (2007) https://doi.org/10.1117/12.712264
Event: Electronic Imaging 2007, San Jose, CA, United States
Abstract
Extracting metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, and organization. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories, such as the National Library of Medicine (NLM) [1] or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, metadata extraction is generally a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC) [2], have made automatic extraction comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extracting several fields from millions of OCR'd and electronic documents. Because this project is time-sensitive, automatic extraction is the only viable solution. More than a dozen fields associated with each document require extraction. In this paper, we report on the extraction and generation of the title field.
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Kazem Taghva, Allen Condit, Steve Lumos, Julie Borsack, and Thomas Nartker "Title extraction and generation from OCR'd documents", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000R (29 January 2007); https://doi.org/10.1117/12.712264
KEYWORDS
Optical character recognition
Data modeling
Databases
Lanthanum
Visualization
Medicine
Motion models