Much work has been done on template-independent web data extraction. However, existing approaches either handle attribute value extraction and annotation in separate phases or are constrained to a predefined set of attributes, which limits their effectiveness. In this paper, we perform attribute extraction and annotation simultaneously by extracting each attribute name and value pair at the same time. Our approach uses a co-training algorithm with a naive Bayes classifier to identify candidate attribute name-value pairs in unlabeled pages. These candidate pairs are then used to detect the specification block of the product in each web page. Finally, all attribute name-value pairs in the specification block are discovered. We conduct experiments on three types of products and obtain promising results.
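To make the co-training step concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes that each candidate pair has two views (the tokens of the attribute name and the tokens of the attribute value), that a pair is labeled `True` when it is a genuine attribute name-value pair, and that each round simply promotes the most confident prediction of each classifier. The names `NaiveBayes`, `co_train`, `rounds`, and `per_round`, as well as the toy data, are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes with Laplace smoothing over token lists."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        v = len(self.vocab) + 1  # one extra bucket for unseen tokens
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            score = math.log(self.prior[c] / n)
            for t in tokens:
                score += math.log((self.counts[c][t] + 1) / (total + v))
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}


def fit_views(labeled):
    """Train one classifier per view (0 = name tokens, 1 = value tokens)."""
    ys = [y for _, y in labeled]
    return tuple(NaiveBayes().fit([x[v] for x, _ in labeled], ys) for v in (0, 1))


def co_train(labeled, unlabeled, rounds=2, per_round=1):
    """Assumed simplification: each view promotes its top-confidence examples."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picks = {}
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, x in enumerate(pool):
                proba = clf.predict_proba(x[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], i, label))
            for _, i, label in sorted(scored, reverse=True)[:per_round]:
                picks.setdefault(i, label)  # first view to claim an example wins
        labeled += [(pool[i], label) for i, label in picks.items()]
        pool = [x for i, x in enumerate(pool) if i not in picks]
    return fit_views(labeled)


# Toy data (hypothetical): ((name tokens, value tokens), is_attribute_pair)
labeled = [
    ((["weight"], ["1.2", "kg"]), True),
    ((["screen", "size"], ["15", "inch"]), True),
    ((["add", "to"], ["cart"]), False),
    ((["customer"], ["reviews"]), False),
]
unlabeled = [
    (["battery", "weight"], ["0.3", "kg"]),
    (["display", "size"], ["13", "inch"]),
    (["add", "to"], ["wishlist"]),
    (["customer"], ["service"]),
]

name_clf, value_clf = co_train(labeled, unlabeled)
print(name_clf.predict_proba(["weight"]))
print(value_clf.predict_proba(["cart"]))
```

Under these assumptions, each classifier labels unlabeled pairs for the other, so evidence from the name view can extend the training set seen by the value view and vice versa. A fuller version would typically promote a fixed number of positive and negative examples per round rather than only the single most confident one.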