Automatic Extraction of Blog Post from Diverse Blog Pages

Authors:
Chia-Hui Chang;Jhih-Ming Chen
Affiliations:
-;-
Venue:
WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2012

Abstract

Blog post extraction is essential for research on the blogosphere. In this paper, we address the problem of extracting blog posts from diverse blog pages, where the goal is to locate each blog post automatically and precisely. Most previous research has focused on extracting the main content from news pages, but the problem becomes more complex for blog pages. Our approach combines maximum scoring subsequence with text-to-tag ratio to develop algorithms suited to blog pages. The first method we propose, PTR Scoring, combines post-to-tag ratio with maximum scoring subsequence. The second, CRF Scoring, uses a Conditional Random Field to train a sequence labeling model and then applies maximum scoring subsequence to improve extraction accuracy. Experimental results show that CRF Scoring achieves the best F-measure among the compared methods, 91.9%.
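
The two building blocks named in the abstract can be illustrated with a minimal sketch. It is not the authors' PTR Scoring or CRF Scoring, neither of which is specified here. It assumes a generic per-line text-to-tag ratio (non-tag characters divided by tag count), a hypothetical threshold of 10.0 that turns ratios into positive and negative scores, and a search for only the single best contiguous run of lines. All function names are illustrative.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </div>, <a href="...">.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Non-tag characters divided by the number of tags (count floored at 1)."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)


def max_scoring_subsequence(scores):
    """Return (start, end, total) of the highest-sum contiguous run.

    `end` is exclusive. If no run has a positive sum, the result is (0, 0, 0.0).
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; start a new run here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_main_block(html_lines, threshold=10.0):
    """Score each line as ratio minus threshold (an assumed value), keep the best run."""
    scores = [text_to_tag_ratio(line) - threshold for line in html_lines]
    start, end, _ = max_scoring_subsequence(scores)
    return html_lines[start:end]


page = [
    '<div><a href="/">Home</a><a href="/about">About</a></div>',
    "<p>This is the first long paragraph of the blog post, with plenty of text.</p>",
    "<p>And here is a second paragraph that also carries a lot of text content.</p>",
    '<div><a href="/p">Prev</a><a href="/n">Next</a></div>',
]
main_block = extract_main_block(page)
```

On this toy page the link-heavy navigation lines score below the threshold and the two text-heavy paragraphs form the selected run. A page with several posts would need more than one run, which this single-run sketch does not attempt.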