Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Authors:
Tetsuhiro Miyahara;Yusuke Suzuki;Takayoshi Shoudai;Tomoyuki Uchida;Kenichi Takahashi;Hiroaki Ueda
Affiliations:
-;-;-;-;-;-
Venue:
PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2002


Abstract

Many Web documents, such as HTML and XML files, have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree-structured patterns in semistructured Web documents, using a tag tree pattern as a hypothesis. A tag tree pattern is an edge-labeled tree with ordered children that has structured variables. An edge label is a tag or a keyword occurring in such Web documents, and a variable can be replaced by an arbitrary tree. A tag tree pattern is therefore well suited to representing tree-structured patterns in such Web documents. First, we show that computing the optimum frequent tag tree pattern is hard. We therefore present an algorithm that generates all maximally frequent tag tree patterns and show its correctness. Finally, we report experimental results for the algorithm. Although the algorithm is not efficient, the experiments show that it can extract characteristic tree-structured patterns from those data.
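
The idea of a tag tree pattern can be illustrated with a small sketch. This is not the algorithm from the paper; it is a deliberately simplified model written for illustration. The assumptions are: a tree is a nested list of (edge label, subtree) pairs in child order; the marker `VAR` and the functions `matches` and `frequency` are hypothetical names; a variable may only appear on an edge to a leaf of the pattern; and a variable is matched by exactly one edge of the document together with whatever subtree hangs below it. The substitution of variables described in the abstract is more general than this.

```python
# Simplified illustration only; names and matching rule are assumptions,
# not the definitions used in the paper.

VAR = "?"  # hypothetical marker for a variable edge in a pattern


def matches(pattern, tree):
    """Return True if `tree` is an instance of `pattern`.

    Both arguments are lists of (edge_label, subtree) pairs in child order.
    A VAR edge (allowed only above a pattern leaf) matches any single edge
    and the arbitrary subtree below it.
    """
    if len(pattern) != len(tree):
        return False
    for (p_label, p_sub), (t_label, t_sub) in zip(pattern, tree):
        if p_label == VAR:
            if p_sub:
                raise ValueError("this sketch only supports variables above leaves")
            continue  # any label and any subtree are accepted here
        if p_label != t_label or not matches(p_sub, t_sub):
            return False
    return True


def frequency(pattern, documents):
    """Fraction of documents that the pattern matches."""
    return sum(matches(pattern, d) for d in documents) / len(documents)


# Three toy documents whose edge labels are tags.
doc1 = [("html", [("title", []), ("body", [("p", [])])])]
doc2 = [("html", [("title", []), ("body", [("ul", [("li", [])])])])]
doc3 = [("html", [("body", [])])]

# Pattern: html with a title and a body, where the body content is a variable.
pattern = [("html", [("title", []), ("body", [(VAR, [])])])]
```

Under these assumptions the pattern matches `doc1` and `doc2` but not `doc3`, so `frequency(pattern, [doc1, doc2, doc3])` is two thirds. A pattern would be called frequent when this fraction reaches a chosen minimum threshold; the threshold itself is a parameter of the mining task and is not fixed by this sketch.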