Clustering header categories extracted from web tables

George Nagy; David W. Embley; Mukkai Krishnamoorthy; Sharad Seth

doi:10.1117/12.2076209

8 February 2015

Proceedings Volume 9402, Document Recognition and Retrieval XXII; 94020M (2015) https://doi.org/10.1117/12.2076209
Event: SPIE/IS&T Electronic Imaging, 2015, San Francisco, California, United States

Abstract

Revealing related content among heterogeneous web tables is part of our long-term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented, and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube view of a table are extracted by factoring the (often multi-row/column) headers. To reveal commonalities between tables from diverse sources, the Jaccard distances between pairs of category headers (and also table titles) are computed. We show how about one-third of our heterogeneous collection can be clustered into a dozen groups that exhibit table-title and header similarities that can be exploited for queries.
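
As a rough illustration of the distance measure named in the abstract, the sketch below computes the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|, between two sets of header words. The abstract does not say how headers are tokenized or how the distances feed the clustering, so the `tokens` helper, the lowercase word-level tokenization, and the sample headers are all assumptions made for this example and not the authors' procedure.

```python
import re


def tokens(header: str) -> set[str]:
    """Split a header string into lowercase alphanumeric words.

    This tokenization is an assumption for illustration only.
    """
    return set(re.findall(r"[a-z0-9]+", header.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |A & B| / |A | B|.

    Two empty sets are treated as identical (distance 0.0) by convention here.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical category headers from two tables (not taken from the paper).
h1 = tokens("Year, Quarter, Region")
h2 = tokens("Region, Year, Industry")

# The sets share 2 of 4 distinct words, so the distance is 1 - 2/4.
distance = jaccard_distance(h1, h2)
```

A distance of 0 means the two headers use exactly the same words and 1 means they share none. Pairwise distances of this kind could then be passed to any standard clustering routine.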

Citation

George Nagy, David W. Embley, Mukkai Krishnamoorthy, and Sharad Seth "Clustering header categories extracted from web tables", Proc. SPIE 9402, Document Recognition and Retrieval XXII, 94020M (8 February 2015); https://doi.org/10.1117/12.2076209
