The World Wide Web is becoming the largest information resource, which makes information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they might contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
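As a rough illustration of how a statistical domain model might repair weak annotations, the sketch below counts how often each child label appears under each parent label and then uses those counts to choose among candidate labels for an ambiguous node. It is not the algorithm from the paper: the function names `build_domain_model` and `relabel`, the example labels, and the simple frequency-based scoring are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_domain_model(pairs):
    """Count (parent label, child label) co-occurrences in weakly annotated data.

    `pairs` is an iterable of (parent, child) label tuples; this input format
    is a hypothetical simplification of a hierarchical document.
    """
    counts = defaultdict(Counter)
    for parent, child in pairs:
        counts[parent][child] += 1
    return counts


def relabel(parent, candidates, model):
    """Return the candidate label seen most often under `parent`.

    Returns None when the model has no evidence for any candidate.
    """
    seen = model.get(parent, Counter())
    best = max(candidates, key=lambda label: seen[label], default=None)
    if best is None or seen[best] == 0:
        return None
    return best


# Hypothetical weak annotations: mostly right, with one noisy pair.
weak_pairs = [
    ("Faculty", "Name"), ("Faculty", "Name"), ("Faculty", "Name"),
    ("Faculty", "Email"), ("Faculty", "Email"),
    ("Course", "Title"), ("Course", "Title"),
    ("Faculty", "Title"),  # noisy annotation
]
model = build_domain_model(weak_pairs)

# An ambiguous node under "Faculty" could be a "Title" or a "Name";
# the aggregate statistics favor "Name".
print(relabel("Faculty", ["Title", "Name"], model))
```

The point of the sketch is that individual annotations may be wrong, yet counts aggregated over many pages can still indicate which label is most plausible in a given context. A fuller model would presumably use smoothed probabilities and richer context than a single parent label.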