Dealing with extreme data diversity: extraction and fusion from the growing types of document formats

Peter David; Nichole Hansen; James J. Nolan; Pedro Alcocer

doi:10.1117/12.2184171

15 May 2015

Proceedings Volume 9499, Next-Generation Analyst III; 94990Q (2015) https://doi.org/10.1117/12.2184171
Event: SPIE Sensing Technology + Applications, 2015, Baltimore, MD, United States

Abstract

The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents, such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F₁ score than a baseline random classifier.
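As a rough illustration of the kind of pipeline the abstract describes, the sketch below derives a few geometry features from a page region's bounding box and scores predicted component labels with an F₁ measure. It is not the authors' implementation: the `Region` type, the feature names, the use of macro-averaged F₁, and the reading of "60% greater" as a relative improvement over the baseline are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical page region: bounding box in pixels plus a component label."""
    x: float
    y: float
    w: float
    h: float
    label: str


def geometry_features(r: Region, page_w: float, page_h: float) -> dict:
    """Page-normalized geometry features (an assumed, illustrative feature set)."""
    return {
        "rel_x_center": (r.x + r.w / 2) / page_w,
        "rel_y_top": r.y / page_h,
        "rel_width": r.w / page_w,
        "rel_height": r.h / page_h,
        "aspect_ratio": r.w / r.h,
    }


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-label F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(model_f1: float, baseline_f1: float) -> float:
    """Relative improvement of the model's F1 over the baseline's F1."""
    return (model_f1 - baseline_f1) / baseline_f1
```

Under these assumptions, a classifier with an F₁ of 0.40 measured against a random baseline at 0.25 gives `relative_gain(0.40, 0.25)` of about 0.6, which is one way a "60% greater" result could arise. The numbers are illustrative only and do not come from the paper.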

Citation

Peter David, Nichole Hansen, James J. Nolan, and Pedro Alcocer "Dealing with extreme data diversity: extraction and fusion from the growing types of document formats", Proc. SPIE 9499, Next-Generation Analyst III, 94990Q (15 May 2015); https://doi.org/10.1117/12.2184171

PROCEEDINGS
7 PAGES

KEYWORDS

Visualization

Image segmentation

Visual analytics

Data fusion

Image processing

Mathematics

Control systems
