15 May 2015
Dealing with extreme data diversity: extraction and fusion from the growing types of document formats
Peter David, Nichole Hansen, James J. Nolan, Pedro Alcocer
Abstract
The growth in text data available online is accompanied by a growth in the diversity of available documents. Corpora with extreme heterogeneity in terms of file formats, document organization, page layout, text style, and content are common. The absence of meaningful metadata describing the structure of online and open-source data leads to text extraction results that contain no information about document structure and are cluttered with page headers and footers, web navigation controls, advertisements, and other items that are typically considered noise. We describe an approach to document structure and metadata recovery that uses visual analysis of documents to infer the communicative intent of the author. Our algorithm identifies the components of documents such as titles, headings, and body content, based on their appearance. Because it operates on an image of a document, our technique can be applied to any type of document, including scanned images. Our approach to document structure recovery considers a finer-grained set of component types than prior approaches. In this initial work, we show that a machine learning approach to document structure recovery using a feature set based on the geometry and appearance of images of documents achieves a 60% greater F1 score than a baseline random classifier.
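As a reading aid for the closing comparison, the following is a minimal sketch of how an F1 score for component labels (titles, headings, body content) and a relative gain over a baseline could be computed. It is not the authors' evaluation code. The abstract does not say how F1 was averaged across component types or whether "60% greater" is a relative or an absolute difference; macro averaging and a relative difference are assumptions here, and the names `f1_for_label`, `macro_f1`, and `relative_gain` are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one component type, e.g. "title" or "heading"."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def relative_gain(f1_model, f1_baseline):
    """Relative improvement of a model's F1 over a baseline's F1 (assumed reading)."""
    return (f1_model - f1_baseline) / f1_baseline


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    truth = ["title", "heading", "body", "body"]
    predicted = ["title", "body", "body", "body"]
    print(macro_f1(truth, predicted))
    print(relative_gain(0.8, 0.5))
```

Under the relative reading assumed above, a model F1 of 0.8 against a baseline F1 of 0.5 would count as a 60% gain; these two values are illustrative and are not reported in the abstract.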
© (2015) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Peter David, Nichole Hansen, James J. Nolan, and Pedro Alcocer "Dealing with extreme data diversity: extraction and fusion from the growing types of document formats", Proc. SPIE 9499, Next-Generation Analyst III, 94990Q (15 May 2015); https://doi.org/10.1117/12.2184171
KEYWORDS
Visualization

Image segmentation

Visual analytics

Data fusion

Image processing

Mathematics

Control systems
