Blog post and comment extraction using information quantity of web format

Authors:
Donglin Cao;Xiangwen Liao;Hongbo Xu;Shuo Bai
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, Beijing and Graduate School, the Chinese Academy of Sciences, Beijing and Dept. of Cognitive Science, Xiamen University, Xiamen, P.R ...;Institute of Computing Technology, Chinese Academy of Sciences, Beijing and Graduate School, the Chinese Academy of Sciences, Beijing;Institute of Computing Technology, Chinese Academy of Sciences, Beijing;Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Venue:
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Year:
2008


Abstract

As research on the blogosphere develops, extracting the post and its comments from a blog page becomes increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine the advantages of visual information and effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, this method achieves good extraction performance and improves the performance of blog search.
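
The abstract does not give the formula behind the "information quantity of separator", so the following is only a minimal sketch of how the second stage might look, not the authors' implementation. It assumes that the main text has already been split into blocks, that each separator is represented by a string (for example, the markup between two blocks), and that information quantity is the Shannon self-information -log2 p(s) of a separator among all separators in the main text. The function names, the input representation, and the rule "the rarest separator marks the boundary" are all illustrative assumptions.

```python
import math
from collections import Counter


def separator_information(separators):
    """Return the self-information -log2 p(s), in bits, of each distinct separator.

    Assumption: p(s) is the relative frequency of s among all separators
    found in the main text of one page.
    """
    counts = Counter(separators)
    total = len(separators)
    return {sep: -math.log2(count / total) for sep, count in counts.items()}


def split_post_and_comments(blocks, separators):
    """Split blocks into (post, comments) at the most informative separator.

    separators[i] is the markup between blocks[i] and blocks[i + 1].
    Assumption: the rarest separator marks the post/comment boundary;
    on ties, the first such separator is used.
    """
    if len(separators) != len(blocks) - 1:
        raise ValueError("expected exactly one separator between adjacent blocks")
    if not separators:
        return list(blocks), []
    info = separator_information(separators)
    boundary = max(range(len(separators)), key=lambda i: info[separators[i]])
    return blocks[: boundary + 1], blocks[boundary + 1:]


# Hypothetical page: two post paragraphs followed by three comments.
blocks = ["Post paragraph 1", "Post paragraph 2",
          "Comment A", "Comment B", "Comment C"]
separators = ["<br>", "<div id=comments>", "<br>", "<br>"]

post, comments = split_post_and_comments(blocks, separators)
print(post)      # the two post paragraphs
print(comments)  # the three comments
```

In this toy input the `<br>` separator occurs three times and carries little information, while the single `<div id=comments>` separator carries 2 bits, so the split falls there. A real page would need a more robust separator representation and tie handling than this sketch provides.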