An effective method supporting data extraction and schema recognition on deep web

Authors:
Wei Liu;Derong Shen;Tiezheng Nie
Affiliations:
Department of Computer, Northeastern University, Shenyang, China;Department of Computer, Northeastern University, Shenyang, China;Department of Computer, Northeastern University, Shenyang, China
Venue:
APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Year:
2008

Abstract

With the rapid development of the Internet, data sources on the deep web store a large amount of high-quality structured data, which calls for structured data extraction methods. However, existing methods focus on data rather than structure, and some of them are difficult to maintain. To resolve these problems, this paper proposes a complete and effective method supporting both data extraction and schema recognition. To extract data, a novel clustering-based algorithm is adopted, which remains effective when faced with complex data and excessive noise. A simple extraction rule model is defined to address the maintenance problem. In addition, the method performs in-depth mining for result schema recognition. Finally, experiments show satisfactory results.