DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
This paper presents a new extraction pattern, called modified Document Type Definition (mDTD), which relies on analytical interpretation to identify extraction targets in the contents of Web documents. Starting from the conventional DTD used in XML documents, we develop two major extensions: first, we introduce an extended content model with type-specific operators and keywords; second, we refine the way conventional DTD rules are interpreted. As a result of these two extensions, mDTD can freely represent HTML structures and extraction targets. The goal of mDTD is to overcome two major current barriers in information extraction: domain portability (with minimal human intervention) and high performance. Human experts compose an mDTD as a set of seed rules, and our system then uses it to automatically extract a set of instances from structured documents on the Web. The extracted instances serve as input to the Sequential mDTD Learner (SmL), which generates new mDTD rules based on part-of-speech tags and lexical-similarity features. This process does not require any hand-annotated corpus. We experimented with 330 Korean and 220 English Web documents from audio and video shopping sites. The average extraction precision is 91.3% for Korean and 81.9% for English.
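The seed-then-learn loop described above can be illustrated with a minimal Python sketch. It is not the mDTD formalism or the SmL algorithm: regular expressions stand in for seed rules, a coarse character-shape feature stands in for the part-of-speech and lexical-similarity features, and every name (`SEED_RULES`, `extract_with_seeds`, `shape`, `learn_shapes`, `label`, `precision`) is hypothetical.

```python
import re

# Hypothetical stand-ins for seed rules written by a human expert.
# Real mDTD rules use an extended DTD content model, not regexes.
SEED_RULES = {
    "title": re.compile(r"<b>Title:</b>\s*([^<]+)"),
    "price": re.compile(r"<b>Price:</b>\s*([^<]+)"),
}


def extract_with_seeds(documents, rules):
    """Apply seed rules to structured documents; return (field, value) pairs."""
    instances = []
    for doc in documents:
        for field, pattern in rules.items():
            for match in pattern.finditer(doc):
                instances.append((field, match.group(1).strip()))
    return instances


def shape(value):
    """Coarse lexical feature: digit runs become '9', letter runs become 'a'."""
    out = []
    for ch in value:
        if ch.isdigit():
            c = "9"
        elif ch.isalpha():
            c = "a"
        else:
            c = ch
        # Collapse consecutive identical symbols into one.
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def learn_shapes(instances):
    """Derive new 'rules' (here: sets of value shapes per field) from instances."""
    learned = {}
    for field, value in instances:
        learned.setdefault(field, set()).add(shape(value))
    return learned


def label(value, learned):
    """Assign a field to an unlabeled value using the learned shapes, or None."""
    matches = sorted(f for f, shapes in learned.items() if shape(value) in shapes)
    return matches[0] if matches else None


def precision(predicted, gold):
    """Fraction of predicted (field, value) pairs that appear in the gold set."""
    if not predicted:
        return 0.0
    correct = sum(1 for pair in predicted if pair in gold)
    return correct / len(predicted)


docs = [
    "<li><b>Title:</b> Abbey Road <b>Price:</b> $12.99</li>",
    "<li><b>Title:</b> Kind of Blue <b>Price:</b> $8.50</li>",
]
seed_instances = extract_with_seeds(docs, SEED_RULES)
learned_rules = learn_shapes(seed_instances)
```

In this sketch no document is annotated by hand: the only human input is the seed rules, and the learned shapes can then label values in pages that lack the `Title:` and `Price:` markers, for example `label("$5.25", learned_rules)`.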