Efficient schema extraction from a large collection of XML documents

Authors:
Guangming Xing; Vijayeandra Parthepan
Affiliations:
Western Kentucky University, Bowling Green, KY; Western Kentucky University, Bowling Green, KY
Venue:
Proceedings of the 49th Annual Southeast Regional Conference
Year:
2011

Abstract

XML is becoming the standard format for data exchange on the Internet. In this paper, we present a system that effectively extracts schema information from a large collection of XML documents. Building on Xtract, we propose using the cost of an NFA simulation to compute the Minimum Description Length (MDL). We also study the use of frequencies in the sample inputs to improve the effectiveness of schema extraction. Experimental studies on synthesized XML data sets suggest that our approach is an efficient and effective solution for schema inference.
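
The abstract does not spell out how an NFA simulation yields a description-length cost, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes a candidate content model is given as an NFA over child-element names, and that encoding a sequence costs log2(k) bits at each step, where k is the number of choices available in the current state (outgoing transitions, plus one "stop" choice if the state is accepting). The function `encoding_cost`, the transition-table layout, and the two example automata are all hypothetical and introduced here for illustration.

```python
import math

def encoding_cost(nfa, start, accepting, sequence):
    """Cheapest number of bits needed to encode `sequence` as a path
    through `nfa`, or math.inf if the NFA rejects the sequence.

    nfa: dict mapping state -> list of (symbol, next_state) pairs.
    Assumption: each step costs log2(number of choices in that state).
    """
    def choices(state):
        # Outgoing transitions, plus a "stop" choice in accepting states.
        return len(nfa.get(state, [])) + (1 if state in accepting else 0)

    # Simulate the NFA on a set of states, keeping the best cost per state.
    costs = {start: 0.0}
    for symbol in sequence:
        next_costs = {}
        for state, cost in costs.items():
            n = choices(state)
            if n == 0:
                continue  # dead state: nothing can be encoded from here
            step = math.log2(n)
            for label, target in nfa.get(state, []):
                if label == symbol and cost + step < next_costs.get(target, math.inf):
                    next_costs[target] = cost + step
        costs = next_costs
        if not costs:
            return math.inf  # no run of the NFA survives this symbol

    # Pay for the final "stop" choice in the cheapest accepting state.
    best = math.inf
    for state, cost in costs.items():
        if state in accepting:
            best = min(best, cost + math.log2(choices(state)))
    return best

# Hypothetical content models for an element with children named a and b.
# Specific model:  a b*
SPECIFIC = {0: [("a", 1)], 1: [("b", 1)]}
# General model:   (a | b)*
GENERAL = {0: [("a", 0), ("b", 0)]}

children = ["a", "b", "b"]
specific_bits = encoding_cost(SPECIFIC, 0, {1}, children)
general_bits = encoding_cost(GENERAL, 0, {0}, children)
print(specific_bits, general_bits)
```

Under these assumptions the more specific model encodes the sample in fewer bits than the fully general one. A complete MDL score would also add the cost of describing the model itself, which this sketch leaves out.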