Clustering a very large number of textual unstructured customers' reviews in English

  • Authors:
  • Jan Žižka;Karel Burda;František Dařena

  • Affiliations:
  • Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic;Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic;Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic

  • Venue:
  • AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications
  • Year:
  • 2012

Abstract

When a very large volume of unstructured text documents expresses different opinions and it is not known which document belongs to which category, clustering can help reveal the classes. The presented research dealt with almost two million opinions concerning customers' (dis)satisfaction with hotel services all over the world. The experiments investigated the automatic building of clusters representing positive and negative opinions. For the given high-dimensional sparse data, the aim was to find a clustering algorithm together with its best set of parameters: the similarity and clustering-criterion functions, the word representation, and the role of stemming. Because the data carried information about membership in the positive or negative class, it was possible to verify the efficiency of various algorithms and parameters. From the entropy viewpoint, the best results were obtained with k-means using the binary representation with the cosine similarity, idf, and the H2 criterion function, while stemming played no role.
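
The ingredients named in the abstract (binary word representation weighted by idf, cosine similarity, and entropy as the quality measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, the idf formula log(N/df) and the size-weighted, normalized entropy are common textbook definitions assumed here because the abstract does not state the exact ones used, and the k-means step and the H2 criterion function are omitted.

```python
import math
from collections import Counter


def binary_idf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} dicts.

    Assumption: a term's weight is 1 (present) times idf = log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: math.log(n / df[t]) for t in set(doc)} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clustering_entropy(clusters, labels):
    """Size-weighted average of per-cluster class entropy.

    clusters[i] is the cluster id of document i, labels[i] its known class.
    Each cluster's entropy is divided by log(q), q being the number of
    classes, so 0 means pure clusters and 1 means fully mixed clusters.
    """
    n = len(labels)
    q = len(set(labels))
    if q < 2:
        return 0.0
    total = 0.0
    for cluster_id in set(clusters):
        members = [lab for c, lab in zip(clusters, labels) if c == cluster_id]
        size = len(members)
        ent = -sum((k / size) * math.log(k / size)
                   for k in Counter(members).values()) / math.log(q)
        total += (size / n) * ent
    return total
```

With known positive and negative labels, as in the described data, a lower value of `clustering_entropy` indicates clusters that separate the two opinion classes more cleanly.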