Blog post extraction is essential for research on the blogosphere. In this paper, we address the problem of extracting blog posts from diverse blog pages, where the goal is to locate each blog post automatically and precisely. Most previous research has focused on extracting the main content from news pages, but the problem becomes more complex for blog pages. Our approach combines maximum scoring subsequence with text-to-tag ratio to develop algorithms suited to blog pages. The first method we propose is PTR Scoring, which combines post-to-tag ratio with maximum scoring subsequence. The second is CRF Scoring, which uses a Conditional Random Field to train a sequence labeling model and then applies maximum scoring subsequence to improve extraction accuracy. Experimental results show that CRF Scoring achieves the best F-measure, 91.9%, among the methods compared.
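As a rough illustration of the two building blocks named above, the following Python sketch scores each line of an HTML page by a simple text-to-tag ratio and then picks the contiguous run of lines with the largest total score. It is not the PTR Scoring or CRF Scoring method of the paper. The function names, the regular expression used to detect tags, and the `threshold` value of 10.0 are assumptions made for illustration, and only the single best run is returned rather than all maximal scoring subsequences.

```python
import re

# Assumption: a tag is anything between '<' and '>' on one line.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line):
    """Number of non-tag characters divided by the number of tags (at least 1)."""
    tags = TAG_RE.findall(line)
    text_len = len(TAG_RE.sub("", line).strip())
    return text_len / max(len(tags), 1)


def max_scoring_subsequence(scores):
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no score is positive, the empty run (0, 0, 0.0)
    is returned.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_block(lines, threshold=10.0):
    """Keep the run of lines whose (ratio - threshold) scores sum highest.

    `threshold` is a hypothetical tuning value: lines with a ratio below it
    get a negative score and count against the run.
    """
    scores = [text_to_tag_ratio(line) - threshold for line in lines]
    start, end, _ = max_scoring_subsequence(scores)
    return lines[start:end]


page = [
    '<div><a href="/">Home</a><a href="/about">About</a></div>',
    "<p>This is the first long paragraph of the blog post body text.</p>",
    "<br>",
    "<p>A second paragraph continues the same post with more words.</p>",
    '<div><a href="#">Share</a><a href="#">Login</a></div>',
]
post_lines = extract_block(page)
```

In this toy page, the low-scoring `<br>` line sits between two text-heavy paragraphs, so the best run spans all three lines while the navigation and widget lines at either end are left out. Subtracting a threshold is what lets tag-heavy lines act as penalties; without it every score would be non-negative and the whole page would be selected.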