Towards practical genre classification of web documents

  • Authors:
  • George Ferizis;Peter Bailey

  • Affiliations:
  • CSIRO ICT Centre, ANU Canberra, Australia;CSIRO ICT Centre, ANU Canberra, Australia

  • Venue:
  • Proceedings of the 15th international conference on World Wide Web
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Classification of documents by genre is typically done either using linguistic analysis or term frequency based techniques. The former provides better classification accuracy than the latter but at the cost of two orders of magnitude more computation time. While term frequency analysis requires much less computational resources than linguistic analysis,it returns poor classification accuracy when the genres are not sufficiently distinct. A method that removes or approximates the expensive portions of linguistic analysis is presented.The accuracy and computation time of this method then compared with both linguistic analysis and term frequency analysis. The results in this paper show that this method can significantly reduce the computation of both time of linguistic analysis and term frequency analysis, while retaining an accuracy that is higher than that of term frequency analysis.