Learning Object Models from Semistructured Web Documents

Authors:
Shiren Ye;Tat-Seng Chua
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Abstract

This paper presents an automated approach to learning object models from object data extracted from data-intensive semistructured web documents such as product descriptions. Modeling data-intensive web content involves three phases. First, we identify the object region, which covers the descriptions of object data, by excluding irrelevant content from the web documents. Second, we partition the contents of the different object data that appear in the object region and construct object data as hierarchical XML output. Third, we induce the abstract object model from analogous object data. This model will match the corresponding object data from a web site more precisely and comprehensively than existing handcrafted ontologies. The main contribution of this study is a fully automated approach to extracting object data and object models from semistructured web documents using kernel-based matching and View Syntax interpretation. Our system, OnModer, automatically constructs object data and induces object models from complicated web documents, such as the technical descriptions of personal computers and digital cameras downloaded from manufacturers' and vendors' sites. A comparison with available handcrafted ontologies and tests on an open corpus demonstrate that the framework is effective in extracting meaningful and comprehensive models.
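
To make the third phase more concrete, the following is a minimal sketch of inducing a shared model from several analogous object data records. It is an illustration only and does not reproduce OnModer's kernel-based matching or View Syntax interpretation: it substitutes a simple frequency count over attribute paths. The function names (`attribute_paths`, `induce_model`), the `min_support` parameter, and the sample camera records are all hypothetical.

```python
from collections import Counter


def attribute_paths(obj, prefix=()):
    """Yield every attribute path in a nested dict, e.g. ('sensor', 'type')."""
    for key, value in obj.items():
        path = prefix + (key,)
        yield path
        if isinstance(value, dict):
            yield from attribute_paths(value, path)


def induce_model(objects, min_support=0.5):
    """Keep attribute paths present in at least min_support of the objects.

    Assumption: a plain frequency threshold stands in for the paper's
    induction step, which is not specified in the abstract.
    """
    counts = Counter()
    for obj in objects:
        # Count each path once per object, however often it occurs.
        counts.update(set(attribute_paths(obj)))
    threshold = min_support * len(objects)
    return sorted(path for path, n in counts.items() if n >= threshold)


# Hypothetical hierarchical object data for three products.
cameras = [
    {"sensor": {"type": "CCD", "resolution": "5 MP"}, "lens": {"zoom": "3x"}},
    {"sensor": {"type": "CMOS", "resolution": "8 MP"}, "weight": "300 g"},
    {"sensor": {"resolution": "6 MP"}, "lens": {"zoom": "4x"}},
]

model = induce_model(cameras, min_support=0.6)
for path in model:
    print("/".join(path))
```

With this sample data, attributes shared by at least two of the three records survive, while `weight`, which appears in only one record, is dropped from the induced model.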