Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Using the structure of Web sites for automatic segmentation of tables
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Deriving marketing intelligence from online discussion
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
2D Conditional Random Fields for Web information extraction
ICML '05 Proceedings of the 22nd international conference on Machine learning
Machine Learning
Simultaneous record detection and attribute labeling in web data extraction
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Entity Resolution with Markov Logic
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Collective information extraction with relational Markov networks
ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Expertise networks in online communities: structure and algorithms
Proceedings of the 16th international conference on World Wide Web
Joint optimization of wrapper generation and template detection
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
Exploring traversal strategy for web forum crawling
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Finding question-answer pairs from online forums
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Discriminative training of Markov logic networks
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Joint inference in information extraction
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 1
Wrapper maintenance: a machine learning approach
Journal of Artificial Intelligence Research
Content integration in digital libraries
AMC '09 Proceedings of the 2009 workshop on Ambient media computing
Automatic extraction of web data records containing user-generated content
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Covertly probing underground economy marketplaces
DIMVA'10 Proceedings of the 7th international conference on Detection of intrusions and malware, and vulnerability assessment
Exploiting content redundancy for web information extraction
Proceedings of the VLDB Endowment
Shopping for top forums: discovering online discussion for product research
Proceedings of the First Workshop on Social Media Analytics
From one tree to a forest: a unified solution for structured web data extraction
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Web information extraction using markov logic networks
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Unsupervised user-generated content extraction by dependency relationships
WISE'11 Proceedings of the 12th international conference on Web information system engineering
FoCUS: learning to crawl web forums
Proceedings of the 21st international conference companion on World Wide Web
Extracting multiple news attributes based on visual features
Journal of Intelligent Information Systems
Complete-Thread extraction from web forums
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Automatically extracting user reviews from forum sites
Computers & Mathematics with Applications
A simple approach to the design of site-level extractors using domain-centric principles
Proceedings of the 21st ACM international conference on Information and knowledge management
An unsupervised method for author extraction from web pages containing user-generated content
Proceedings of the 21st ACM international conference on Information and knowledge management
Collective information extraction with context-specific consistencies
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
Scalable and noise tolerant web knowledge extraction for search task simplification
Decision Support Systems
Hi-index | 0.00 |
Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages is still a challenging task due to both complex page layout designs and unrestricted user created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our target is to find a solution as general as possible to extract structured data, such as post title, post author, post time, and post content from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. The experimental results on 20 forums show a very encouraging information extraction performance, and demonstrate the ability of the proposed approach on various forums. We also show that the performance is limited if only page-level knowledge is used, while when incorporating the site-level knowledge both precision and recall can be significantly improved.