Finding keywords amongst noise: automatic text classification without parsing
15 June 2007
Andrew G. Allison, Charles E. M. Pearce, Derek Abbott
Proceedings Volume 6601, Noise and Stochastics in Complex Systems and Finance; 660113 (2007) https://doi.org/10.1117/12.724655
Event: SPIE Fourth International Symposium on Fluctuations and Noise, 2007, Florence, Italy
Abstract
The amount of text stored on the Internet, and in our libraries, continues to expand at an exponential rate. There is a great practical need to locate relevant content. This requires quick automated methods for classifying textual information, according to subject. We propose a quick statistical approach, which can distinguish between 'keywords' and 'noisewords', like 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic, which compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics and we subject this model to a number of tests.
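The idea behind the WRI approach can be sketched in a few lines of code. The snippet below is an illustrative reconstruction, not the paper's exact statistic: it computes each word's recurrence intervals (gaps between successive occurrences) and forms a variance-ratio statistic against a simple null in which occurrences are placed at random (so intervals are approximately geometric). The function names and the geometric null are assumptions for illustration; keywords tend to cluster, inflating the observed interval variance relative to the null, while noisewords like 'the' do not.

```python
# Illustrative sketch of keyword detection via Word Recurrence Intervals
# (WRIs). Assumed for illustration: a geometric-interval null model and
# a simple variance-ratio (F-like) statistic; the paper's exact test may
# differ.
import re
from collections import defaultdict

def recurrence_intervals(text):
    """Map each repeated word to its list of gaps between occurrences.

    Returns (intervals_by_word, total_word_count).
    """
    words = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    intervals = {w: [b - a for a, b in zip(p, p[1:])]
                 for w, p in positions.items() if len(p) >= 2}
    return intervals, len(words)

def variance_ratio(intervals, n_words, count):
    """F-like statistic: observed WRI variance divided by the variance
    expected if the word's `count` occurrences fell at random among the
    `n_words` tokens (geometric intervals with rate p = count/n_words)."""
    mean = sum(intervals) / len(intervals)
    var = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    p = count / n_words
    null_var = (1 - p) / p ** 2   # variance of a geometric interval
    return var / null_var if null_var else 0.0
```

In use, one would rank words by the ratio: words whose observed WRI variance greatly exceeds the geometric null are candidate keywords, while function words score near (or below) the null. No parsing or part-of-speech tagging is required, only token positions.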
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Andrew G. Allison, Charles E. M. Pearce, and Derek Abbott "Finding keywords amongst noise: automatic text classification without parsing", Proc. SPIE 6601, Noise and Stochastics in Complex Systems and Finance, 660113 (15 June 2007); https://doi.org/10.1117/12.724655
CITATIONS
Cited by 9 scholarly publications.
KEYWORDS
Statistical modeling, Statistical analysis, Data modeling, Computer simulations, Switching, Internet, Logic
