Automated data extraction from the web with conditional models

Authors:
Xuan-Hieu Phan;Susumu Horiguchi;Tu-Bao Ho
Affiliations:
Graduate School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), 1-1, Asahidai, Nomi, Ishikawa 923-1292, Japan.;Graduate School of Information Sciences, Tohoku University, Aoba 6-3-09, Sendai 980-8579, Japan.;Graduate School of Knowledge Science, Japan Advanced Institute of Science and Technology (JAIST), 1-1, Asahidai, Nomi, Ishikawa 923-1292, Japan
Venue:
International Journal of Business Intelligence and Data Mining
Year:
2005

Abstract

Extracting data from the Web is an important information extraction task. Most existing approaches rely on wrappers, which require human knowledge and user interaction during extraction. This paper proposes conditional models as an alternative solution to this task. Drawing on the strengths of conditional models such as maximum entropy and maximum entropy Markov models, our method offers three major advantages: full automation, the ability to incorporate various non-independent, overlapping features of different hypertext representations, and the ability to deal with missing and disordered data fields. Experimental results on a wide range of e-commerce websites with different layouts show that our method achieves a satisfactory trade-off between automation and accuracy and provides a practical application of automated data extraction from the Web.