19 January 2009
Enriching a document collection by integrating information extraction and PDF annotation
Brett Powley, Robert Dale, Ilya Anisimoff
Proceedings Volume 7247, Document Recognition and Retrieval XVI; 724707 (2009) https://doi.org/10.1117/12.805548
Event: IS&T/SPIE Electronic Imaging, 2009, San Jose, California, United States
Abstract
Modern digital libraries offer all the hyperlinking possibilities of the World Wide Web: when a reader finds a citation of interest, in many cases she can click on a link to be taken to the cited work. This paper presents work aimed at providing the same ease of navigation for legacy PDF document collections that were created before integrating hyperlinks into documents was ever considered. To achieve this goal, we need to carry out two tasks: first, identify citations and references in the text and link them with high reliability; and second, determine the physical PDF page locations of these elements. We demonstrate the use of a high-accuracy citation extraction algorithm that significantly improves on previously reported techniques, together with a technique for integrating PDF processing with a conventional text-stream-based information extraction pipeline. We demonstrate these techniques on a particular document collection, the ACL Anthology, but the same approach can be applied to other document sets.
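The first task named in the abstract, linking in-text citations to reference list entries, can be illustrated with a minimal sketch. This is not the paper's algorithm: the regular expression, the function names `find_citations` and `link_citations`, and the matching rule (same leading surname, same year) are all assumptions chosen for illustration, and the sample names in the usage note are invented placeholders.

```python
import re

# Hypothetical pattern (an assumption, not the paper's method) for
# author-year citations such as "Doe and Roe (2001)" or "Smith et al., 2004".
CITATION_RE = re.compile(
    r"(?P<author>[A-Z][A-Za-z'-]+)"                          # first author surname
    r"(?:\s+(?:and|&)\s+[A-Z][A-Za-z'-]+|\s+et\s+al\.?)?"    # optional co-author or "et al."
    r",?\s*\(?(?P<year>(?:19|20)\d{2})[a-z]?\)?"             # year, optionally parenthesised
)


def find_citations(text):
    """Return (surname, year, start, end) for each author-year citation found."""
    return [
        (m.group("author"), m.group("year"), m.start(), m.end())
        for m in CITATION_RE.finditer(text)
    ]


def link_citations(text, references):
    """Pair each citation span with the index of the first reference entry
    that starts with the same surname and mentions the same year."""
    links = []
    for surname, year, start, end in find_citations(text):
        for idx, ref in enumerate(references):
            if ref.startswith(surname) and year in ref:
                links.append((start, end, idx))
                break
    return links
```

For example, given the placeholder text "As argued by Doe and Roe (2001), and later (Smith et al., 2004), linking helps." and a reference list whose entries begin "Smith, J. 2004." and "Doe, A. and Roe, B. 2001.", `link_citations` pairs the first citation with the second entry and the second citation with the first. The character offsets it returns are positions in the text stream only; mapping them to physical PDF page locations is the second task described in the abstract and is not covered by this sketch.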
© (2009) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Brett Powley, Robert Dale, and Ilya Anisimoff "Enriching a document collection by integrating information extraction and PDF annotation", Proc. SPIE 7247, Document Recognition and Retrieval XVI, 724707 (19 January 2009); https://doi.org/10.1117/12.805548
KEYWORDS
Detection and tracking algorithms
Digital libraries
Lithium
Mining
Reliability
Internet
Prototyping