Title extraction and generation from OCR'd documents

Kazem Taghva; Allen Condit; Steve Lumos; Julie Borsack; Thomas Nartker

doi:10.1117/12.712264

29 January 2007

Proceedings Volume 6500, Document Recognition and Retrieval XIV; 65000R (2007) https://doi.org/10.1117/12.712264
Event: Electronic Imaging 2007, San Jose, CA, United States

Abstract

Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, and organization. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories, such as the National Library of Medicine (NLM)¹ or university libraries, the extraction task is a daily process that spans decades. Although some automation is used during the extraction process, metadata extraction is generally a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC),² have produced results comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extraction of several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction turns out to be the only viable solution. There are more than a dozen fields associated with each document that require extraction. In this paper, we report on the extraction and generation of the title field.

Citation

Kazem Taghva, Allen Condit, Steve Lumos, Julie Borsack, and Thomas Nartker "Title extraction and generation from OCR'd documents", Proc. SPIE 6500, Document Recognition and Retrieval XIV, 65000R (29 January 2007); https://doi.org/10.1117/12.712264
