As research on the blogosphere grows, extracting the post and the comments from a blog page has become increasingly important for improving search performance. In this paper, we present a two-stage method. First, we combine visual information with effective text information to locate the main text, which represents the theme of the blog page. Second, we use the information quantity of separators to detect the boundary between the post and the comments. In our experiments, this method achieves good extraction performance and improves the performance of blog search.
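
The second stage can be illustrated with a minimal sketch. The abstract does not define "information quantity", so the code below assumes it means the self-information -log2(p) of a separator, where p is the separator's relative frequency within the main text. Under that assumption, separators that repeat between comments carry little information, while the rare separator between the post and the first comment carries the most. The function names, the block list, and the separator strings are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
from collections import Counter


def separator_information(separators):
    """Return the self-information -log2(p) of each separator.

    p is the relative frequency of the separator in the sequence
    (an assumed reading of "information quantity").
    """
    counts = Counter(separators)
    total = len(separators)
    return [-math.log2(counts[s] / total) for s in separators]


def split_post_and_comments(blocks, separators):
    """Split blocks at the separator that carries the most information.

    separators[i] is the separator between blocks[i] and blocks[i + 1].
    On ties, the earliest separator wins.
    """
    if len(separators) != len(blocks) - 1:
        raise ValueError("expected one separator between each pair of blocks")
    if not separators:
        return list(blocks), []
    info = separator_information(separators)
    boundary = max(range(len(info)), key=info.__getitem__)
    return blocks[: boundary + 1], blocks[boundary + 1:]


# Hypothetical main-text blocks and the separators found between them.
blocks = ["post 1", "post 2", "post 3", "comment A", "comment B", "comment C"]
separators = ["<p>", "<p>", "<h3>", "<hr>", "<hr>"]

post, comments = split_post_and_comments(blocks, separators)
print(post)
print(comments)
```

In this toy input the `<h3>` separator occurs once among five separators, so it has the highest self-information and is chosen as the boundary. A real page would need a tie-breaking rule and a separator definition suited to its markup, neither of which the abstract specifies.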