The Lehigh Steel Collection: a new open dataset for document recognition research
24 March 2014
Barri Bruno, Daniel Lopresti
Proceedings Volume 9021, Document Recognition and Retrieval XXI; 90210O (2014) https://doi.org/10.1117/12.2042615
Event: IS&T/SPIE Electronic Imaging, 2014, San Francisco, California, United States
Abstract
Document image analysis is a data-driven discipline. For a number of years, research focused on small, homogeneous datasets such as the University of Washington corpus of scanned journal pages. More recently, library digitization efforts have raised many interesting problems concerning historical documents and their recognition. In this paper, we present the Lehigh Steel Collection (LSC), a new open dataset we are currently assembling that will be, in many ways, unique to the field. LSC is an extremely large, heterogeneous set of documents dating from the 1960s through the 1990s and relating to the wide-ranging research activities of Bethlehem Steel, a now-bankrupt company that was once the second-largest steel producer and the largest shipbuilder in the United States. As a result of the bankruptcy process and the disposition of the company's assets, an enormous quantity of documents (we estimate hundreds of thousands of pages) was left abandoned in buildings recently acquired by Lehigh University. Rather than see this history destroyed, we stepped in to preserve a portion of the collection via digitization. Here we provide an overview of LSC, including our efforts to collect and scan the documents, a preliminary characterization of what the collection contains, and our plans to make this data available to the research community for non-commercial purposes.
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Barri Bruno and Daniel Lopresti "The Lehigh Steel Collection: a new open dataset for document recognition research", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 90210O (24 March 2014); https://doi.org/10.1117/12.2042615
KEYWORDS
Buildings
Image segmentation
Optical character recognition
Manufacturing
Visualization
Algorithm development
Document image analysis
