Clustering a very large number of textual unstructured customers' reviews in English

Authors:
Jan Žižka;Karel Burda;František Dařena
Affiliations:
Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic;Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic;Department of Informatics, FBE, Mendel University in Brno, Brno, Czech Republic
Venue:
AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications
Year:
2012

Abstract

When a very large volume of unstructured text documents expresses different opinions and it is not known which document belongs to which category, clustering can help reveal the classes. The presented research dealt with almost two million opinions concerning customers' (dis)satisfaction with hotel services all over the world. The experiments investigated the automatic building of clusters representing positive and negative opinions. For the given high-dimensional sparse data, the aim was to find a clustering algorithm together with its best set of parameters (the similarity measure, the clustering-criterion function, and the word representation) and to determine the role of stemming. Because the data already carried the information on whether each opinion belonged to the positive or negative class, it was possible to verify the efficiency of the various algorithms and parameters. In terms of entropy, the best results were obtained with k-means using the binary representation with cosine similarity, idf, and the H2 criterion function, while stemming played no role.
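
The abstract evaluates clusterings by entropy against the known positive/negative labels but does not state the formula. The following is a minimal sketch, assuming the commonly used definition: the entropy of the class distribution inside each cluster, normalized by the logarithm of the number of classes and averaged over clusters weighted by cluster size. The function name `clustering_entropy` and its arguments are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, true_labels):
    """Size-weighted average of per-cluster class entropy, normalized to [0, 1].

    Assumed definition (not confirmed by the abstract):
      E(S_r) = -(1 / log q) * sum_i (n_r_i / n_r) * log(n_r_i / n_r)
      E      = sum_r (n_r / n) * E(S_r)
    where q is the number of classes, n_r the size of cluster r, and
    n_r_i the number of documents of class i in cluster r.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")

    n = len(cluster_ids)
    q = len(set(true_labels))
    if n == 0 or q < 2:
        # With fewer than two classes there is nothing to mix.
        return 0.0

    # Count how many documents of each class fall into each cluster.
    class_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        class_counts[cluster][label] += 1

    total = 0.0
    for counts in class_counts.values():
        n_r = sum(counts.values())
        h = -sum((k / n_r) * math.log(k / n_r) for k in counts.values())
        total += (n_r / n) * (h / math.log(q))
    return total


# Two pure clusters versus two evenly mixed clusters.
pure = clustering_entropy([0, 0, 1, 1], ["pos", "pos", "neg", "neg"])
mixed = clustering_entropy([0, 0, 1, 1], ["pos", "neg", "pos", "neg"])
```

Under this definition, lower values are better: a clustering whose clusters each contain only one class scores 0, and clusters that mix the classes evenly score 1.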