Structured Data Extraction from the Web Based on Partial Tree Alignment

Authors:
Yanhong Zhai;Bing Liu
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006


Abstract

This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide some languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human labeled examples. Manual labeling is time-consuming and is hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, multiple pages that conform to a common schema are usually needed as the input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
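The partial alignment idea in step 2 can be illustrated with a small sketch. This is not the DEPTA implementation: it assumes each record has already been flattened into a sequence of item labels (the paper works on trees), it uses Python's `difflib.SequenceMatcher` as a stand-in for the matching step, and the function name `partial_align` is hypothetical. What it shows is the "no commitment" rule: an unmatched item is merged into the seed only when its position is unambiguous, and is otherwise deferred.

```python
from difflib import SequenceMatcher


def partial_align(seed, record):
    """Merge `record` into `seed`, committing only to unambiguous positions.

    Assumption: records are flat lists of hashable labels, not trees.
    Returns (new_seed, deferred), where `deferred` holds the record items
    whose position relative to the seed could not be determined.
    """
    matcher = SequenceMatcher(None, seed, record, autojunk=False)
    # Expand matching blocks into (seed_index, record_index) pairs.
    pairs = [
        (block.a + k, block.b + k)
        for block in matcher.get_matching_blocks()
        for k in range(block.size)
    ]
    # Sentinels let the gaps before the first and after the last match
    # be handled the same way as interior gaps.
    anchors = [(-1, -1)] + pairs + [(len(seed), len(record))]

    inserts = {}   # seed position -> record items to insert before it
    deferred = []  # record items left unaligned for now
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        gap = record[j0 + 1:j1]
        if not gap:
            continue
        if i1 - i0 == 1:
            # The two anchors are adjacent in the seed, so the gap items
            # can only go between them.
            inserts[i1] = gap
        else:
            # The seed has its own unmatched items here; the relative
            # order is unknown, so make no commitment.
            deferred.extend(gap)

    new_seed = []
    for i in range(len(seed) + 1):
        new_seed.extend(inserts.get(i, []))
        if i < len(seed):
            new_seed.append(seed[i])
    return new_seed, deferred
```

For example, merging `["name", "image", "price"]` into the seed `["name", "price"]` inserts `image` between the two matched items, whereas merging it into `["name", "rating", "price"]` defers `image`, because it could belong either before or after `rating`. In this sketch, deferred items would simply be retried after the seed has grown from other records.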