Extraction of tag tree patterns with contractible variables from irregular semistructured data

Authors:
Tetsuhiro Miyahara;Yusuke Suzuki;Takayoshi Shoudai;Tomoyuki Uchida;Sachio Hirokawa;Kenichi Takahashi;Hiroaki Ueda
Affiliations:
Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan;Department of Informatics, Kyushu University, Kasuga, Japan;Department of Informatics, Kyushu University, Kasuga, Japan;Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan;Computing and Communications Center, Kyushu University, Fukuoka, Japan;Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan;Faculty of Information Sciences, Hiroshima City University, Hiroshima, Japan
Venue:
PAKDD'03 Proceedings of the 7th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2003

Abstract

Information extraction from semistructured data is becoming increasingly important. To extract meaningful or interesting content from semistructured data, we need to extract common structured patterns from it. Much semistructured data has irregularities such as missing or erroneous data. A tag tree pattern is an edge-labeled tree with ordered children that has tree structures of tags and structured variables. An edge label is a tag, a keyword, or a wildcard, and a variable can be substituted by an arbitrary tree. In particular, a contractible variable matches any subtree, including a singleton vertex. A tag tree pattern is therefore well suited to representing common tree-structured patterns in irregular semistructured data. We present a new method for extracting characteristic tag tree patterns from irregular semistructured data by using an algorithm for finding a least generalized tag tree pattern that explains the given data. We report experiments in which this method is applied to extract characteristic tag tree patterns from irregular semistructured data.
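
As a rough illustration of why contractible variables help with irregular data, the following is a minimal sketch, not the algorithm from the paper. It assumes a simplified model: a data tree is an ordered list of (edge label, subtree) pairs, a pattern is an ordered list of labeled edges and variables, and a variable stands for a run of consecutive child subtrees (at least one for an ordinary variable, possibly none for a contractible one). The names `Edge`, `Var`, `WILDCARD`, and `matches` are hypothetical and chosen only for this example; the paper's formal definitions differ in detail.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

WILDCARD = "?"  # assumed marker for a wildcard edge label

# A data tree node is an ordered list of (edge_label, child_node) pairs.
Tree = List[Tuple[str, "Tree"]]


@dataclass
class Edge:
    label: str          # a tag or keyword, or WILDCARD
    child: "Pattern"    # pattern that the child subtree must match


@dataclass
class Var:
    contractible: bool = False  # True: may also stand for nothing at all


Pattern = List[Union[Edge, Var]]


def matches(pattern: Pattern, tree: Tree) -> bool:
    """Check whether the ordered children of `tree` can be assigned,
    left to right, to the items of `pattern` (simplified model)."""

    def go(i: int, j: int) -> bool:
        # i indexes pattern items, j indexes children of the data tree.
        if i == len(pattern):
            return j == len(tree)
        item = pattern[i]
        if isinstance(item, Var):
            # A variable absorbs a run of children: zero or more if
            # contractible, one or more otherwise.
            least = 0 if item.contractible else 1
            return any(go(i + 1, k) for k in range(j + least, len(tree) + 1))
        if j == len(tree):
            return False
        label, child = tree[j]
        if item.label != WILDCARD and item.label != label:
            return False
        return matches(item.child, child) and go(i + 1, j + 1)

    return go(0, 0)


# Two records with the same overall shape; the second lacks "phone".
record_a: Tree = [("name", []), ("phone", []), ("email", [])]
record_b: Tree = [("name", []), ("email", [])]

flexible = [Edge("name", []), Var(contractible=True), Edge("email", [])]
strict = [Edge("name", []), Var(contractible=False), Edge("email", [])]

print(matches(flexible, record_a), matches(flexible, record_b))  # True True
print(matches(strict, record_a), matches(strict, record_b))      # True False
```

In this toy setting, the pattern with a contractible variable covers both the complete record and the one with a missing field, while the pattern with an ordinary variable rejects the incomplete record.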