Web forums have become important information sources on the Web because of the rich content that millions of Internet users contribute every day. Extracting data from Web pages is a key step in data analysis, but it is cumbersome because it typically requires significant human intervention. Web forums have fairly regular page structures, which makes it possible to generate extraction rules automatically from the paths of their page elements. In this paper, we introduce formal expressions for XPath patterns and pattern mapping rules, and propose machine learning methods that generate extraction rules for automatic data extraction from Web forums. Experimental results on real-life Web forums show that the approach is feasible and accurate for forum data.
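To make the idea of path-based extraction rules concrete, the following is a minimal sketch, not the method of the paper. It assumes that a few example XPaths pointing at the same kind of forum field (say, post bodies) are already available, and it merges them into one pattern by dropping positional indexes wherever the examples disagree. The function names (parse_xpath, generalize, to_string, matches) and the sample paths are hypothetical, and the simple merge stands in for the learning step, which the abstract does not specify.

```python
import re

# One XPath step: a tag name with an optional positional index, e.g. div[2].
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def parse_xpath(path):
    """Split an absolute path such as /html/body/div[2] into (tag, index) steps."""
    steps = []
    for raw in path.strip("/").split("/"):
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"unsupported step: {raw!r}")
        tag, index = match.group(1), match.group(2)
        steps.append((tag, int(index) if index else None))
    return steps


def generalize(paths):
    """Merge example paths into one pattern (assumed heuristic, not the paper's).

    A step keeps its index only if every example agrees on it; otherwise the
    index is dropped (None), which in XPath selects every position.
    """
    parsed = [parse_xpath(p) for p in paths]
    tags = [tag for tag, _ in parsed[0]]
    for other in parsed[1:]:
        if [tag for tag, _ in other] != tags:
            raise ValueError("examples must share the same tag sequence")
    pattern = []
    for column in zip(*parsed):
        indexes = {index for _, index in column}
        pattern.append((column[0][0], indexes.pop() if len(indexes) == 1 else None))
    return pattern


def to_string(pattern):
    """Render a pattern back into XPath syntax."""
    parts = [tag if index is None else f"{tag}[{index}]" for tag, index in pattern]
    return "/" + "/".join(parts)


def matches(pattern, path):
    """Check whether a concrete path is covered by the pattern."""
    steps = parse_xpath(path)
    if len(steps) != len(pattern):
        return False
    return all(
        tag == p_tag and (p_index is None or index == p_index)
        for (tag, index), (p_tag, p_index) in zip(steps, pattern)
    )


# Hypothetical paths to two post bodies on the same forum page.
examples = [
    "/html/body/div[2]/table[1]/tr/td[2]",
    "/html/body/div[2]/table[3]/tr/td[2]",
]
rule = generalize(examples)
print(to_string(rule))
```

In this sketch only the table step varies between the examples, so the resulting rule keeps div[2] and td[2] fixed and covers every table on the page. A practical system would also need to handle examples with different tag sequences and attribute predicates, which this sketch deliberately leaves out.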