WordNet-based text document clustering

  • Authors:
  • Julian Sedding; Dimitar Kazakov

  • Affiliations:
  • University of York, Heslington, York, United Kingdom; University of York, Heslington, York, United Kingdom

  • Venue:
  • ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data
  • Year:
  • 2004

Abstract

Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms for this task exist, but they are only as good as the data they work on. Two problems are ambiguity and synonymy: the former allows erroneous groupings, and the latter causes similarities between documents to go unnoticed. In this research, naïve, syntax-based disambiguation is attempted by assigning each word a part-of-speech tag, and the 'bag-of-words' data representation often used for document clustering is enriched with synonyms and hypernyms from WordNet.
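
The enrichment idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details, so everything here is an assumption made for illustration. The dictionary TOY_LEXICON is a hypothetical, hand-written stand-in for WordNet, the part-of-speech tags are supplied by hand rather than by a tagger, and the function name enrich and the parameter hypernym_depth are invented for this example.

```python
from collections import Counter

# Hypothetical miniature lexicon standing in for WordNet (NOT real WordNet data).
# Keys are (lemma, part-of-speech) pairs; each value is a shared concept id
# for the synonym set plus a chain of hypernyms, nearest first.
TOY_LEXICON = {
    ("car", "n"): ("automobile.n", ["motor_vehicle.n", "vehicle.n"]),
    ("automobile", "n"): ("automobile.n", ["motor_vehicle.n", "vehicle.n"]),
    ("truck", "n"): ("truck.n", ["motor_vehicle.n", "vehicle.n"]),
    ("book", "n"): ("book.n", ["publication.n"]),
    ("book", "v"): ("reserve.v", ["request.v"]),
}


def enrich(tagged_tokens, hypernym_depth=1):
    """Build a bag of concepts from (word, pos) pairs.

    The part-of-speech tag selects among lexicon entries (a naive,
    syntax-based form of disambiguation). Synonyms collapse onto one
    concept id, and up to `hypernym_depth` hypernyms are added as
    extra features.
    """
    bag = Counter()
    for word, pos in tagged_tokens:
        key = (word.lower(), pos)
        if key in TOY_LEXICON:
            concept, hypernyms = TOY_LEXICON[key]
            bag[concept] += 1
            for hypernym in hypernyms[:hypernym_depth]:
                bag[hypernym] += 1
        else:
            # Unknown words are kept as plain tagged tokens.
            bag[f"{word.lower()}/{pos}"] += 1
    return bag


# Two toy documents, already tagged by hand (an assumption of this sketch).
doc_a = enrich([("car", "n"), ("book", "v")])
doc_b = enrich([("automobile", "n"), ("book", "n")])

# Features the two documents have in common after enrichment.
shared = set(doc_a) & set(doc_b)
print(sorted(shared))
```

In this toy setting, "car" and "automobile" share no surface form yet contribute the same features, which addresses synonymy, while the verb and noun readings of "book" are kept apart by their tags, which addresses one kind of ambiguity. The resulting counts could then be passed to any standard clustering algorithm.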