On profiling blogs with representative entries

Authors:
Jinfeng Zhuang; Steven C. H. Hoi; Aixin Sun
Affiliations:
Nanyang Technological University, Singapore; Nanyang Technological University, Singapore; Nanyang Technological University, Singapore
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Abstract

With the explosive growth of blogs, information seeking in the blogosphere has become increasingly challenging. One example task is to find the topical blogs most relevant to a given query or to an existing blog. Such a task requires a concise representation of each blog for effective and efficient searching and matching. In this paper, we investigate a new problem: profiling a blog by choosing a set of the m most representative entries from the blog, where m is a predefined, application-dependent number. With the selected representative entries, blog applications avoid handling the hundreds or even thousands of entries (or posts) associated with each blog, which are updated frequently and are often noisy. To guide the selection of the most representative entries, we propose three principles: anomaly, representativeness, and diversity. Based on these principles, we propose a greedy yet very efficient entry selection algorithm. To evaluate entry selection algorithms, we adapt an extrinsic evaluation methodology from document summarization research. Specifically, we evaluate the proposed entry selection algorithms by the blog classification accuracy they yield. Across a number of different classification methods, our empirical results show that using fewer than 20 representative entries per blog achieves classification accuracy comparable to that of using all entries.
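The abstract names the three principles and a greedy selection procedure but does not give the scoring functions. The following is a minimal sketch of how such a selection might look, not the authors' algorithm. It rests on several assumptions: entries are compared by cosine similarity over plain term-frequency vectors; "anomaly" is read as skipping entries whose mean similarity to the rest of the blog is very low; "representativeness" is mean similarity to the other entries; and "diversity" is handled with an MMR-style penalty for similarity to entries already chosen. The names `select_entries`, `anomaly_threshold`, and `trade_off` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector, L2-normalised (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def select_entries(entries, m, anomaly_threshold=0.1, trade_off=0.7):
    """Greedily pick up to m entry indices (illustrative scoring, not the paper's)."""
    vectors = [tf_vector(e) for e in entries]
    n = len(vectors)
    if n == 0 or m <= 0:
        return []

    # Representativeness (assumed): mean similarity to every other entry.
    rep = []
    for i in range(n):
        sims = [cosine(vectors[i], vectors[j]) for j in range(n) if j != i]
        rep.append(sum(sims) / len(sims) if sims else 1.0)

    # Anomaly (assumed reading): entries unlike the rest of the blog are noise.
    candidates = [i for i in range(n) if rep[i] >= anomaly_threshold]

    selected = []
    while candidates and len(selected) < m:
        def gain(i):
            # Diversity (assumed): penalise similarity to already chosen entries.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return trade_off * rep[i] - (1.0 - trade_off) * redundancy

        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    blog = [
        "python machine learning tutorial",
        "machine learning with python code",
        "deep learning python models",
        "holiday photos beach sunset",
        "python machine learning tutorial",
    ]
    print(select_entries(blog, m=2))
```

In this toy blog the off-topic holiday post is filtered out by the anomaly step, and the duplicated tutorial post is not picked twice because the diversity penalty lowers its gain once its twin has been selected. Each greedy step costs time proportional to the number of candidates times the number of entries already chosen, which stays small when m is around 20.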