Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

Authors: Tak-Lam Wong; Wai Lam
Affiliations: City University of Hong Kong, Kowloon, Hong Kong; The Chinese University of Hong Kong, Shatin, Hong Kong
Venue: ACM Transactions on Internet Technology (TOIT)
Year: 2007

Abstract

We develop a novel framework that aims to automatically adapt previously learned information extraction knowledge from a source Web site to a new, unseen target site in the same domain. Two kinds of features related to the text fragments from the Web documents are investigated. The first type is called a site-invariant feature. These features are likely to remain unchanged in Web pages from different sites in the same domain. The second type is called a site-dependent feature. These features differ across Web pages collected from different Web sites but are similar across Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and from the items previously collected or extracted from the source Web site. The derived site-invariant features are then exploited to automatically seek a new set of training examples in the unseen target site. Both the site-dependent and the site-invariant features of these automatically discovered training examples are considered when learning new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, given training examples from just one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
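
The adaptation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes that a site-invariant feature can be approximated by token overlap with items already extracted from the source site, and that a site-dependent feature can be approximated by a layout label attached to each text fragment. All function names, the layout labels, and the threshold are hypothetical. As a simplification, the final step uses only the layout label, whereas the framework described above considers both feature types when learning.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens of a text fragment, as a set."""
    return set(text.lower().split())


def invariant_score(fragment, source_items):
    """Best Jaccard overlap between a fragment and any source-site item.

    Assumption: content similarity to known items stands in for a
    site-invariant feature.
    """
    frag = tokens(fragment)
    best = 0.0
    for item in source_items:
        item_tokens = tokens(item)
        union = frag | item_tokens
        if union:
            best = max(best, len(frag & item_tokens) / len(union))
    return best


def discover_examples(target_fragments, source_items, threshold=0.5):
    """Pick target-site fragments that resemble source-site items."""
    return [
        (context, text)
        for context, text in target_fragments
        if invariant_score(text, source_items) >= threshold
    ]


def learn_layout_context(examples):
    """Most frequent layout label among the discovered examples.

    Assumption: the layout label stands in for a site-dependent feature.
    """
    counts = Counter(context for context, _ in examples)
    return counts.most_common(1)[0][0] if counts else None


def adapt_and_extract(target_fragments, source_items, threshold=0.5):
    """Discover examples on the target site, then extract by learned layout."""
    examples = discover_examples(target_fragments, source_items, threshold)
    context = learn_layout_context(examples)
    if context is None:
        return []
    return [text for ctx, text in target_fragments if ctx == context]


# Hypothetical data: titles known from the source site, and
# (layout label, text) fragments taken from an unseen target site.
source_items = ["Introduction to Algorithms", "The Art of Computer Programming"]
target_fragments = [
    ("td.title", "Introduction to Algorithms"),
    ("td.title", "Learning Python"),
    ("td.price", "$59.99"),
    ("div.nav", "Home"),
]
print(adapt_and_extract(target_fragments, source_items))
```

In this toy data only one target fragment matches a known title, but its layout label is enough to recover the other title on the target site, which the content-based score alone would have missed.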