Finding keywords amongst noise: automatic text classification without parsing
15 June 2007
Andrew G. Allison, Charles E. M. Pearce, Derek Abbott
Proceedings Volume 6601, Noise and Stochastics in Complex Systems and Finance; 660113 (2007) https://doi.org/10.1117/12.724655
Event: SPIE Fourth International Symposium on Fluctuations and Noise, 2007, Florence, Italy
Abstract
The amount of text stored on the Internet, and in our libraries, continues to expand at an exponential rate. There is a great practical need to locate relevant content. This requires quick automated methods for classifying textual information, according to subject. We propose a quick statistical approach, which can distinguish between 'keywords' and 'noisewords', like 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic, which compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics and we subject this model to a number of tests.
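The idea behind the WRI approach can be sketched in a few lines of code. The snippet below is an illustrative reconstruction, not the paper's exact statistic: it computes each word's recurrence intervals (gaps between successive occurrences) and forms a variance-ratio statistic against a simple null in which occurrences are placed at random (so intervals are approximately geometric). The function names and the geometric null are assumptions for illustration; keywords tend to cluster, inflating the observed interval variance relative to the null, while noisewords like 'the' do not.

```python
# Illustrative sketch of keyword detection via Word Recurrence Intervals
# (WRIs). Assumed for illustration: a geometric-interval null model and
# a simple variance-ratio (F-like) statistic; the paper's exact test may
# differ.
import re
from collections import defaultdict

def recurrence_intervals(text):
    """Map each repeated word to its list of gaps between occurrences.

    Returns (intervals_by_word, total_word_count).
    """
    words = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    intervals = {w: [b - a for a, b in zip(p, p[1:])]
                 for w, p in positions.items() if len(p) >= 2}
    return intervals, len(words)

def variance_ratio(intervals, n_words, count):
    """F-like statistic: observed WRI variance divided by the variance
    expected if the word's `count` occurrences fell at random among the
    `n_words` tokens (geometric intervals with rate p = count/n_words)."""
    mean = sum(intervals) / len(intervals)
    var = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    p = count / n_words
    null_var = (1 - p) / p ** 2   # variance of a geometric interval
    return var / null_var if null_var else 0.0
```

In use, one would rank words by the ratio: words whose observed WRI variance greatly exceeds the geometric null are candidate keywords, while function words score near (or below) the null. No parsing or part-of-speech tagging is required, only token positions.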
© (2007) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Andrew G. Allison, Charles E. M. Pearce, and Derek Abbott "Finding keywords amongst noise: automatic text classification without parsing", Proc. SPIE 6601, Noise and Stochastics in Complex Systems and Finance, 660113 (15 June 2007); https://doi.org/10.1117/12.724655
CITATIONS
Cited by 9 scholarly publications.
KEYWORDS
Statistical modeling, Statistical analysis, Data modeling, Computer simulations, Switching, Internet, Logic
