Mining significant words from customer opinions written in different natural languages

Authors:
Jan Žižka;František Dařena
Affiliations:
Department of Informatics, SoNet Research Center, Faculty of Business and Economics, Mendel University in Brno, Brno, Czech Republic;Department of Informatics, SoNet Research Center, Faculty of Business and Economics, Mendel University in Brno, Brno, Czech Republic
Venue:
TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Year:
2011

Citing 4

Machine learning in automated text categorization. ACM Computing Surveys (CSUR)
Cross-Language Information Retrieval
Automatic sentiment analysis using the textual pattern content similarity in natural language. TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner

Cited 1

Clustering a very large number of textual unstructured customers' reviews in English. AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications

Abstract

Opinions expressed in free-text documents written in various natural languages are a valuable source of knowledge hidden in large datasets. The presented research describes a text-mining method for discovering words that are significant for expressing different opinions (positive and negative). The method applies simple, unified data pre-processing to all languages, producing a bag-of-words representation in which words are represented by their frequencies in the data. The frequencies are then used by an algorithm that generates decision trees. The decision nodes of a tree contain the words that are significant for expressing the opinions. The position of a word in the tree represents its degree of significance, with the most significant word in the root node. The resulting list of relevant words can be used to create a dictionary containing only relevant information. The described method was tested on very large sets of customer reviews of on-line hotel room booking. Several million reviews were available, covering more than 15 languages. The resulting dictionaries included only about 200 significant words.
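
The pipeline outlined in the abstract (word frequencies, then a decision tree whose splitting words are ranked by their depth) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy reviews are invented, the function names are hypothetical, and the tree is a plain entropy-based learner written from scratch, not necessarily the decision-tree algorithm the authors used.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word frequencies (no language-specific steps)."""
    return Counter(re.findall(r"\w+", text.lower()))

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, word, threshold) for the split with the highest information gain.

    Ties are resolved in favour of the alphabetically first word.
    """
    base = entropy(labels)
    best = (0.0, None, None)
    vocabulary = sorted({word for row in rows for word in row})
    for word in vocabulary:
        for threshold in sorted({row[word] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[word] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[word] > threshold]
            if not left or not right:
                continue
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0] + 1e-12:
                best = (gain, word, threshold)
    return best

def significant_words(rows, labels, depth=0, found=None):
    """Grow a tree recursively and record each splitting word with its shallowest depth."""
    found = {} if found is None else found
    gain, word, threshold = best_split(rows, labels)
    if word is None:  # pure node, or no split improves on the current node
        return found
    found[word] = min(depth, found.get(word, depth))
    for keep in (lambda count: count <= threshold, lambda count: count > threshold):
        subset = [(row, y) for row, y in zip(rows, labels) if keep(row[word])]
        significant_words([r for r, _ in subset], [y for _, y in subset], depth + 1, found)
    return found

# Invented toy reviews, for illustration only (not data from the paper).
reviews = [
    ("great location friendly staff", "positive"),
    ("great breakfast clean room", "positive"),
    ("great value friendly staff", "positive"),
    ("nice view clean room", "positive"),
    ("dirty room rude staff", "negative"),
    ("dirty bathroom noisy street", "negative"),
    ("noisy room rude reception", "negative"),
    ("great location but dirty room", "negative"),
]
rows = [bag_of_words(text) for text, _ in reviews]
labels = [label for _, label in reviews]

# Words sorted by depth: the root word (depth 0) is the most significant one.
for word, depth in sorted(significant_words(rows, labels).items(), key=lambda item: item[1]):
    print(depth, word)
```

In this sketch the dictionary of significant words is simply the set of words that appear in decision nodes, ordered by depth. Words that never serve as a split are left out, which is one way a small dictionary could emerge from a large vocabulary.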