Unsupervised learning of mDTD extraction patterns for web text mining

Authors:
Dongseok Kim;Hanmin Jung;Gary Geunbae Lee
Affiliations:
Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja, Pohang 790-784, South Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja, Pohang 790-784, South Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja, Pohang 790-784, South Korea
Venue:
Information Processing and Management: an International Journal
Year:
2003

Abstract

This paper presents a new extraction pattern, called modified Document Type Definition (mDTD), which relies on analytical interpretation to identify extraction targets in the contents of Web documents. Starting from the conventional DTD used in XML documents, we develop two major extensions: first, we introduce an extended content model with type-specific operators and keywords, and second, we refine the way conventional DTD rules are interpreted. As a result of these two extensions, mDTD can freely represent HTML structures and extraction targets. The goal of mDTD is to overcome the current major barriers in information extraction, namely domain portability (with minimal human intervention) and high performance. Human experts compose an mDTD as seed rules, and our system then automatically extracts a set of instances from structured Web documents using that mDTD. We use the extracted instances as inputs to the Sequential mDTD Learner (SmL), which generates new mDTD rules based on part-of-speech tags and lexical-similarity features. This process does not require any hand-annotated corpus. We experimented with 330 Korean and 220 English Web documents from audio and video shopping sites. The average extraction precision is 91.3% for Korean and 81.9% for English.
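
The seed-then-learn loop described in the abstract can be illustrated with a small sketch. This is not the authors' mDTD or SmL implementation: the function names, the "Label: value" line format, the similarity threshold, and the use of Python's `difflib.SequenceMatcher` as the lexical-similarity measure are all assumptions made for illustration, and part-of-speech features are omitted entirely.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's lexical features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_with_rules(lines, rules):
    """Extract (field, value) pairs from 'Label: value' lines.

    `rules` maps a field name to the set of lowercase labels that mark it.
    This plays the role of hand-written seed rules (hypothetical format).
    """
    instances = []
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Label: value' line
        label = label.strip().lower()
        for field, labels in rules.items():
            if label in labels:
                instances.append((field, value.strip()))
    return instances


def learn_new_rules(lines, rules, threshold=0.8):
    """Return a copy of `rules` extended with unseen labels that are
    lexically close to a known label of some field."""
    learned = {field: set(labels) for field, labels in rules.items()}
    for line in lines:
        label, sep, _ = line.partition(":")
        if not sep:
            continue
        label = label.strip().lower()
        for field, labels in rules.items():
            if label in labels:
                continue  # already covered by an existing rule
            if any(lexical_similarity(label, known) >= threshold for known in labels):
                learned[field].add(label)
    return learned


# Hypothetical seed rules and page content.
seed_rules = {"title": {"title"}, "price": {"price"}}
page = ["Title: Abbey Road", "Price: 12.99", "Prices: 9.99", "Artist: The Beatles"]

seed_instances = extract_with_rules(page, seed_rules)
learned_rules = learn_new_rules(page, seed_rules)
all_instances = extract_with_rules(page, learned_rules)
```

In this toy run the seed rules miss the "Prices" line, while the learned rules pick it up because "prices" is lexically close to the seed label "price". No annotated corpus is involved, which mirrors the unsupervised setting the abstract describes, though the real system learns from far richer features.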