Contextual feature selection for text classification

  • Authors: Francois Paradis; Jian-Yun Nie
  • Affiliations: DIRO, Université de Montréal, Montreal, Que., Canada (both authors)
  • Venue: Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia
  • Year: 2007

Abstract

We present a simple approach for the classification of "noisy" documents using bigrams and named entities. The approach combines conventional feature selection with a contextual approach to filter out passages around selected features. Originally designed for call-for-tender documents, the method can be useful for other web collections that also contain non-topical content. Experiments are conducted on our in-house collection as well as on the 4-Universities data set, Reuters-21578, and 20 Newsgroups. We find a significant improvement on our collection and on the 4-Universities data set (10.9% and 4.1%, respectively). Although the best results are obtained by combining bigrams and named entities, the impact of the latter is not found to be significant.
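
The two-stage idea in the abstract (select features, then use their context to filter passages) can be illustrated with a minimal sketch. Nothing below is taken from the paper itself: the scoring function is a simple document-frequency contrast standing in for whatever "conventional feature selection" the authors used, the reading that passages around selected features are the ones retained is an assumption, and the names `select_features`, `contextual_filter`, and the `window` parameter are hypothetical. Named entities are left out.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list as strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def select_features(docs, labels, k):
    """Pick the k bigrams whose document frequency differs most between
    one class and the rest (an assumed stand-in for a conventional
    feature-selection criterion, not the paper's own)."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    df = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        df[label].update(set(bigrams(tokens)))  # count each bigram once per doc
    vocab = set().union(*df.values())
    total = len(docs)

    def score(feat):
        best = 0.0
        for c in classes:
            inside = df[c][feat] / n_docs[c]
            rest = total - n_docs[c]
            outside_df = sum(df[o][feat] for o in classes if o != c)
            outside = outside_df / rest if rest else 0.0
            best = max(best, inside - outside)
        return best

    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(vocab, key=lambda f: (-score(f), f))
    return set(ranked[:k])


def contextual_filter(tokens, selected, window=2):
    """Keep only tokens within `window` positions of a selected bigram
    (assumption: the context of selected features is what is retained)."""
    keep = [False] * len(tokens)
    for i, bg in enumerate(bigrams(tokens)):
        if bg in selected:
            lo = max(0, i - window)
            hi = min(len(tokens), i + 2 + window)  # bigram spans positions i, i+1
            for j in range(lo, hi):
                keep[j] = True
    return [tok for tok, kept in zip(tokens, keep) if kept]


# Toy data: topical bigrams mixed with navigation noise.
docs = [
    ["road", "repair", "click", "here"],
    ["road", "repair", "sign", "in"],
    ["office", "supplies", "click", "here"],
    ["office", "supplies", "sign", "in"],
]
labels = ["construction", "construction", "supplies", "supplies"]

selected = select_features(docs, labels, k=2)
noisy = "click here to login road repair services needed contact us".split()
print(contextual_filter(noisy, selected, window=1))
```

The filtered token list would then be passed to any standard classifier in place of the full document. The window size controls how much surrounding text survives and would need tuning per collection.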