PostRank: a new algorithm for incremental finding of persian blog representative words

Authors:
Mohsen Sayyadiharikandeh;Mohammad Ghodsi;Mohammad Naghibi
Affiliations:
Sharif University of Technology, Tehran;Sharif University of Technology and Institute for Research in Fundamental Sciences (IPM), Tehran;Sharif University of Technology, Tehran
Venue:
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Year:
2012

Abstract

Dimension reduction techniques for text documents can be used in the preprocessing phase of blog mining, but they are more effective when they account for the nature of blogs. In this paper we propose PostRank, a novel algorithm that uses a shallow approach to identify the theme of a blog, that is, its representative words, in order to reduce the blog's dimensionality. PostRank uses a graph-based syntactic representation of the weblog that takes some of its structural features into account. In the first step, it models the blog as a complete graph, treating the theme of the blog as a query submitted to a search engine such as Google and each post as a search result. It then ranks the posts using a Markov chain model similar to Google's PageRank, under the assumption that the top-ranked nodes contain the blog's best representative words. Next, it groups the posts according to their scores. Finally, it analyzes the first group with statistical methods (such as TF-IDF) to identify the blog's representative words. The other groups are candidates for containing the blog's theme after a change of theme occurs. When new posts arrive, we update the blog graph, set the initial scores of the old nodes in the Markov chain to their final scores from the previous run, and continue the PostRank iterations until convergence. If half of the representative words have changed, we say that the theme of the weblog has changed. We evaluated our method on the Persianblog dataset and obtained promising results. Human annotators assigned ten representative words to each blog, and the results of PostRank were compared against these words and against the results of earlier related algorithms in this area.
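
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions because the abstract does not specify them: the edge weights (cosine similarity between posts here), the damping factor, the grouping step (simplified to "take the top-scored posts"), and the smoothed TF-IDF formula. The function names `rank_posts`, `representative_words`, and `theme_changed` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_posts(posts, init=None, damping=0.85, tol=1e-9, max_iter=1000):
    """PageRank-style power iteration over a complete graph of posts.

    posts: list of token lists, oldest first.
    init:  final scores of the older posts from a previous run (warm start);
           posts without a previous score start at 1/n (an assumption).
    """
    n = len(posts)
    bags = [Counter(p) for p in posts]
    # Assumed edge weight: cosine similarity between post term vectors.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]
    old = list(init) if init else []
    scores = old + [1.0 / n] * (n - len(old))
    total = sum(scores)
    scores = [s / total for s in scores]
    for _ in range(max_iter):
        new = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out[i] > 0:
                    rank += scores[i] * w[i][j] / out[i]
                else:
                    rank += scores[i] / n  # isolated post: spread uniformly
            new.append((1 - damping) / n + damping * rank)
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def representative_words(posts, scores, top_posts=3, k=10):
    """Top-k TF-IDF terms of the highest-scored posts.

    Taking a fixed number of top posts stands in for the paper's grouping
    step, which the abstract does not detail.
    """
    n = len(posts)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_posts]
    df = Counter(t for p in posts for t in set(p))
    tf = Counter(t for i in top for t in posts[i])
    # Smoothed IDF (an assumption) so that frequent theme words keep weight.
    tfidf = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
    return [t for t, _ in sorted(tfidf.items(), key=lambda x: (-x[1], x[0]))[:k]]


def theme_changed(old_words, new_words):
    """True if at least half of the old representative words are gone."""
    old = set(old_words)
    return 2 * len(old - set(new_words)) >= len(old)
```

In an incremental run under these assumptions, the scores returned for the existing posts would be passed as `init` when `rank_posts` is called again on the extended post list, and `theme_changed` would compare the word lists obtained before and after the update.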