Extracting XML data from the web

Authors:
Ngo Sy Viet Phu;Toshiyuki Amagasa;Hiroyuki Kitagawa
Affiliations:
University of Tsukuba, Japan;University of Tsukuba, Japan;University of Tsukuba, Japan
Venue:
Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
Year:
2010

Abstract

Information extraction (IE) is a technique for extracting structured information (records) from unstructured documents such as Web pages. However, existing techniques mostly aim at extracting simple records, such as binary relationships like "(company, location)" or named entities like "(organization)". In this paper, we propose an algorithm for extracting complex records such as XML data by utilizing an existing IE technique. Given a set of seed records in the form of XML data (XML records), we first infer the schema information from the XML records. Then, we transform the XML records into a set of relational records consisting of several tables. The obtained relational tables are decomposed into a set of binary relations, which are forwarded to a record extraction system. We reconstruct XML data from the results returned by the record extraction system. We point out that a naive implementation does not work well, and propose an improved scheme for more efficient XML record extraction. We evaluate the effectiveness of the proposed algorithm experimentally.
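
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of its data-shaping steps: inferring a schema from seed XML records, flattening the records into relational rows, decomposing the rows into binary relations, and rebuilding XML from binary relations. It is an illustration under stated assumptions, not the authors' algorithm. The seed data, the function names, the flat record shape, and the choice of `name` as the joining key are all hypothetical, and the step that forwards the binary relations to an external record extraction system is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical seed records; the paper's actual seed data is not given here.
SEED_XML = """
<records>
  <company><name>Acme</name><location>Tokyo</location><ceo>Sato</ceo></company>
  <company><name>Globex</name><location>Osaka</location><ceo>Suzuki</ceo></company>
</records>
"""


def infer_schema(records):
    """Collect the (parent tag, child tag) pairs observed in the seed records."""
    schema = set()
    for rec in records:
        for parent in rec.iter():
            for child in parent:
                schema.add((parent.tag, child.tag))
    return schema


def to_rows(records):
    """Flatten each record into one row mapping leaf tag -> text value."""
    return [
        {el.tag: (el.text or "").strip() for el in rec.iter() if len(el) == 0}
        for rec in records
    ]


def to_binary_relations(rows, key):
    """Decompose rows into binary relations keyed by (key, attribute).

    Assumption: one attribute (here `key`) identifies a record, so every
    other attribute can be paired with it.
    """
    relations = {}
    for row in rows:
        for attr, value in row.items():
            if attr != key:
                relations.setdefault((key, attr), []).append((row[key], value))
    return relations


def rebuild_xml(relations, key, record_tag):
    """Join binary relations on the key value and rebuild XML records."""
    grouped = {}
    for (_, attr), pairs in relations.items():
        for key_value, value in pairs:
            grouped.setdefault(key_value, {})[attr] = value
    rebuilt = []
    for key_value, attrs in grouped.items():
        rec = ET.Element(record_tag)
        ET.SubElement(rec, key).text = key_value
        for attr, value in attrs.items():
            ET.SubElement(rec, attr).text = value
        rebuilt.append(rec)
    return rebuilt


records = list(ET.fromstring(SEED_XML))
schema = infer_schema(records)
rows = to_rows(records)
relations = to_binary_relations(rows, key="name")
# In the described pipeline, the binary relations would be sent to a record
# extraction system here; this sketch simply rebuilds XML from the seeds.
rebuilt = rebuild_xml(relations, key="name", record_tag="company")
```

In this sketch each binary relation, such as `("name", "location")`, has the same shape as the simple "(company, location)" records that existing IE techniques handle, which is why the decomposition lets an existing extractor be reused. Joining every relation on a single key is the simplest reconstruction strategy and is only one possible design; nested or repeated elements would need a richer schema than the flat one assumed here.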