CluChunk: clustering large scale user-generated content incorporating chunklet information

Authors:
Yu Cheng;Yusheng Xie;Kunpeng Zhang;Ankit Agrawal;Alok Choudhary
Affiliations:
Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL
Venue:
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Year:
2012



Abstract

The exponential growth of online content in blogs, microblogs, forums, and multimedia sharing sites has created an urgent demand for efficient, high-quality text clustering algorithms that organize documents so users can navigate and browse them quickly. For several kinds of user-generated content, it is much easier to obtain the input in small sets, where the data in each set belongs to the same class but the class label is unknown. Such data is viewed as weakly labeled, and the inherent chunklet information is very useful for improving clustering performance. In this paper, we propose CluChunk (clustering chunklet data), a system that clusters unlabeled web data by incorporating chunklet information. We aim to transform the original feature space with a discriminatively learned linear transformation so that simple unsupervised learning techniques (such as K-Means) can achieve good clustering accuracy in the transformed space. Using large-scale data from web applications (social media and online forums), we demonstrate that clustering performance improves significantly by: 1) incorporating the inherent weakly labeled information into the clustering framework; 2) enriching the representation of short text with additional features extracted from the chunklet subset. The proposed approach can be applied to other mining tasks on large-scale user-generated content, such as product review summarization and blog content clustering and classification.
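
The abstract does not spell out how the linear transformation is learned, so the following is only a minimal sketch of the general idea, not CluChunk's actual method. It assumes a diagonal simplification of a Relevant Component Analysis style whitening: each feature is rescaled by the inverse square root of its within-chunklet variance, and a plain K-Means with deterministic farthest-first initialization then runs in the rescaled space. All function names, the `eps` parameter, and the toy data are hypothetical and chosen for illustration.

```python
import math

def fit_chunklet_scales(points, chunklets, eps=1e-9):
    """Per-feature scale 1/sqrt(within-chunklet variance).

    `chunklets` is a list of index lists; points in one list are assumed to
    share an (unknown) class. This diagonal rescaling is an assumption made
    for illustration, not the transformation used in the paper.
    """
    dim = len(points[0])
    var = [0.0] * dim
    total = 0
    for members in chunklets:
        mean = [sum(points[i][d] for i in members) / len(members)
                for d in range(dim)]
        for i in members:
            for d in range(dim):
                var[d] += (points[i][d] - mean[d]) ** 2
        total += len(members)
    return [1.0 / math.sqrt(v / total + eps) for v in var]

def apply_scales(points, scales):
    """Apply the diagonal linear transformation to every point."""
    return [[x * s for x, s in zip(p, scales)] for p in points]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100):
    """Plain K-Means with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: feature 0 separates the two classes, feature 1 is large noise.
points = [[0.0, -10.0], [0.1, 10.0], [0.0, 5.0], [0.1, -5.0],
          [1.0, -10.0], [1.1, 10.0], [1.0, 5.0], [1.1, -5.0]]
# Hypothetical chunklets: each pair is known to share a class.
chunklets = [[0, 1], [2, 3], [4, 5], [6, 7]]

scales = fit_chunklet_scales(points, chunklets)
raw_labels = kmeans(points, 2)
chunklet_labels = kmeans(apply_scales(points, scales), 2)
```

On this toy data, K-Means on the raw features splits along the noisy second feature, while K-Means after the chunklet-derived rescaling groups the first four and last four points separately. The noisy feature varies a lot inside each chunklet, so it is shrunk, and the feature that stays stable within chunklets is stretched.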