Extracting Web Data Using Instance-Based Learning

Authors:
Yanhong Zhai;Bing Liu
Affiliations:
Department of Computer Science, University of Illinois at Chicago, Chicago, USA 60607;Department of Computer Science, University of Illinois at Chicago, Chicago, USA 60607
Venue:
World Wide Web
Year:
2007



Abstract

This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction and automated methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance to be extracted with labeled instances. The key advantage of our method is that, unlike wrapper induction, it does not require an initial set of labeled pages from which to learn extraction rules. Instead, the algorithm can start extraction from a single labeled instance, and a new instance needs labeling only when it cannot be extracted. This avoids unnecessary page labeling and solves a major problem with inductive learning (or wrapper induction), namely that the set of labeled instances may not be representative of all other instances. The instance-based approach is natural because structured data on the Web usually follow fixed templates, so pages that share a template can usually be extracted based on a single page instance of that template. We propose a novel technique that matches a new instance against a manually labeled instance and, in the process, extracts the required data items from the new instance. The technique is also very efficient. Experimental results on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method, which also significantly outperforms existing state-of-the-art systems.
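
The extract-or-label loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the matching technique of the paper, which the abstract does not specify. Here a labeled instance is assumed to be the few HTML tokens surrounding the target item, and a new page is "extracted" when that context matches exactly once. All names (`tokenize`, `LabeledInstance`, `label`, `try_extract`, `extract_all`, `ask_user`) are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: a page is modeled as a flat sequence of tag and text tokens.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


@dataclass
class LabeledInstance:
    prefix: list  # tokens immediately before the labeled item
    suffix: list  # tokens immediately after the labeled item


def label(html, target, k=2):
    """Stand-in for manual labeling: keep k tokens of context around target."""
    tokens = tokenize(html)
    i = tokens.index(target)
    return LabeledInstance(tokens[max(0, i - k):i], tokens[i + 1:i + 1 + k])


def try_extract(html, inst):
    """Return the item whose context matches inst exactly once, else None."""
    tokens = tokenize(html)
    p, s = len(inst.prefix), len(inst.suffix)
    hits = [
        tokens[i]
        for i in range(p, len(tokens) - s)
        if not tokens[i].startswith("<")
        and tokens[i - p:i] == inst.prefix
        and tokens[i + 1:i + 1 + s] == inst.suffix
    ]
    return hits[0] if len(hits) == 1 else None


def extract_all(pages, ask_user):
    """Extract from each page; request a label only when no stored instance matches."""
    store, results = [], []
    for html in pages:
        value = None
        for inst in store:
            value = try_extract(html, inst)
            if value is not None:
                break
        if value is None:
            # No labeled instance covers this page, so it must be labeled.
            value = ask_user(html)
            store.append(label(html, value))
        results.append(value)
    return results, store
```

In this sketch, `ask_user` is a callback that returns the target text for a page, so the number of calls to it equals the number of pages that had to be labeled. Pages sharing a template with an already labeled page are handled without any further labeling, which mirrors the behavior the abstract describes.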