Distributional Similarity Model for Multi-modality Clustering in Social Media

Authors:
Donahue C. M. Sze;Tak-chung Fu;Fu-lai Chung;Robert W. P. Luk
Affiliations:
-;-;-;-
Venue:
WI-IATW '07 Proceedings of the 2007 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Workshops
Year:
2007

Citing 6
Cited 0

Text classification using string kernels

The Journal of Machine Learning Research
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Distributional similarity models: clustering vs. nearest neighbors

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Multi-way distributional clustering via pairwise interactions

ICML '05 Proceedings of the 22nd international conference on Machine learning
Recognizing contextual polarity in phrase-level sentiment analysis

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Interactive clustering of text collections according to a user-specified criterion

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

User Generated Content (UGC) has become the fastest growing sector of the WWW. Data mining from UGC presents challenges not typically found in text mining from documents. UGC can be semi-structured and its content can be very short and informal, containing relatively little content similar to a chat or an email conversation. In addition, UGC can be viewed as a multi-modality data. These characteristics pose big challenges and research questions for scholars to cope with. To cluster UGC data, we can construct multiple contingency tables of modalities and employ the multi-way distributional clustering (MDC) algorithm. However, by considering a contingency table which summarizes the co-occurrence statistics of two modalities, it is not robust to represent the information entropy between two modalities in UGC data. In this paper, we propose a novel similarity measurement, called Distributional Similarity Model (DSM), to solidify the graph model in the MDC algorithm to deal with the unique characteristics of the UGC data.