Using structured tokens to identify webpages for data extraction

Authors:
Ling Lin;Lizhu Zhou;Qi Guo;Gang Li
Affiliations:
Tsinghua University, Beijing, PRC;Tsinghua University, Beijing, PRC;Tsinghua University, Beijing, PRC;Tsinghua University, Beijing, PRC
Venue:
APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Year:
2007

Abstract

As the web grows, more and more data has become available from webpages, such as product items served from back-end databases. Data extraction plays an important role in providing efficient access to the data objects contained in these pages. However, identifying suitable webpages to feed into data extraction is a prerequisite and a non-trivial task. As a result, there is an increasing need for methods that can automatically identify the target pages on unknown websites. In this paper, we address the problem by exploiting structured-token features of the webpage content and applying a decision-tree-based classification algorithm to induce the structure information. Furthermore, a preliminary recognition of data objects is obtained to efficiently initiate the subsequent data extraction. We evaluate our approach on real-world data and achieve promising results.
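
The abstract does not specify which structured tokens are used or which decision tree algorithm induces the classifier. As a rough illustration only, the sketch below assumes three hypothetical token features (row-like tags, links, and price-like strings) and substitutes a depth-one decision tree (a decision stump) for full tree induction. All names, features, and sample pages are assumptions for illustration, not the authors' method.

```python
import re
from html.parser import HTMLParser


class TokenCounter(HTMLParser):
    """Counts a few hypothetical structured-token features in one page."""

    def __init__(self):
        super().__init__()
        self.counts = {"rows": 0, "links": 0, "prices": 0}

    def handle_starttag(self, tag, attrs):
        # Row-like tags hint at repeated data records; links hint at navigation.
        if tag in ("tr", "li"):
            self.counts["rows"] += 1
        elif tag == "a":
            self.counts["links"] += 1

    def handle_data(self, data):
        # Price-like strings such as "$9.99" (an assumed token type).
        self.counts["prices"] += len(re.findall(r"\$\d+(?:\.\d{2})?", data))


def extract_features(html):
    """Return the token counts for one HTML string."""
    parser = TokenCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def train_stump(samples, labels):
    """Pick the (feature, threshold) split with the fewest training errors.

    This is a depth-one decision tree, used here as a stand-in for a
    full decision tree learner.
    """
    best = None
    for feature in samples[0]:
        for threshold in sorted({s[feature] for s in samples}):
            errors = sum(
                (s[feature] >= threshold) != y for s, y in zip(samples, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    _, feature, threshold = best
    return feature, threshold


def predict(stump, sample):
    """Classify a feature dict as a target page (True) or not (False)."""
    feature, threshold = stump
    return sample[feature] >= threshold


# Two made-up training pages: a product listing and an "about" page.
LIST_PAGE = (
    "<table><tr><td>Widget</td><td>$9.99</td></tr>"
    "<tr><td>Gadget</td><td>$19.50</td></tr></table>"
)
ABOUT_PAGE = "<p>About us. <a href='/contact'>Contact</a> <a href='/jobs'>Jobs</a></p>"

stump = train_stump(
    [extract_features(LIST_PAGE), extract_features(ABOUT_PAGE)],
    [True, False],
)
```

A practical classifier would need many labeled pages and a deeper tree; the stump only shows how token counts can drive a threshold-based split between target and non-target pages.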