Document-Base Extraction for Single-Label Text Classification

  • Authors:
  • Yanbo J. Wang; Robert Sanderson; Frans Coenen; Paul Leng

  • Affiliations:
  • Department of Computer Science, The University of Liverpool, Liverpool, UK L69 3BX (all four authors)

  • Venue:
  • DaWaK '08: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery
  • Year:
  • 2008

Abstract

Many text mining applications, particularly those investigating Text Classification (TC), require experiments to be performed on common text-collections so that results can be compared with those of alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes contain only a very few documents; some documents are labeled with a single class whereas others have multiple classes; and some documents contain little or no actual text-content. In this paper, we propose a standard approach for automatically extracting "qualified" document-bases from a given textual data-source, so that they can be used more effectively and reliably in single-label TC experiments. Experimental results demonstrate that document-bases extracted using our approach can be used effectively in single-label TC experiments.
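
The abstract does not spell out the extraction procedure itself, so the following is only a minimal sketch of one plausible filter that addresses the listed limitations: it drops multi-label and near-empty documents and keeps only the most populated classes. The function name extract_document_base, its parameters top_k_classes and min_words, and the (text, labels) input layout are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def extract_document_base(docs, top_k_classes=2, min_words=3):
    """Hypothetical filter; docs is an iterable of (text, labels) pairs.

    Returns a list of (text, label) pairs for single-label TC experiments.
    """
    # Keep documents with exactly one label and at least min_words words,
    # which removes multi-label, unlabeled and near-empty documents.
    single = [
        (text, labels[0])
        for text, labels in docs
        if len(labels) == 1 and len(text.split()) >= min_words
    ]
    # Keep only the top_k_classes most populated classes, which reduces
    # both the number of classes and the overall data volume.
    counts = Counter(label for _, label in single)
    kept = {label for label, _ in counts.most_common(top_k_classes)}
    return [(text, label) for text, label in single if label in kept]
```

Thresholds such as top_k_classes and min_words would need to be chosen per collection; the values above are placeholders.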