Extracting Characteristic Structures among Words in Semistructured Documents

Authors:
Kazuyoshi Furukawa;Tomoyuki Uchida;Kazuya Yamada;Tetsuhiro Miyahara;Takayoshi Shoudai;Yasuaki Nakamura
Venue:
PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2002

Abstract

The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W1, W2, ..., Wk) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W1, W2, ..., Wk) as a sequence ⟨t1; t2; ...; tk-1⟩ of labeled rooted trees such that, for i = 1, 2, ..., k - 1, (1) ti consists of only one node having the pair of words Wi and Wi+1 as its label, or (2) ti is a labeled rooted tree which has exactly two leaves, labeled with Wi and Wi+1, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that the algorithm is effective for extracting structural characteristics in semistructured documents.
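
The definition of a tree-association pattern in the abstract can be illustrated with a small validity check. The following is only a sketch under stated assumptions, not the authors' implementation: the `Node` class, the function names, and the tuple encoding of a pair label are hypothetical, internal node labels are left unconstrained because the abstract does not specify them, and the two leaves in case (2) are accepted in either order.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A label is either a single word/tag or, for case (1), a pair of words.
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """Hypothetical labeled rooted tree node (not from the paper)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaf_labels(tree: Node) -> List[Label]:
    """Collect the labels of all leaves of the tree."""
    if not tree.children:
        return [tree.label]
    labels: List[Label] = []
    for child in tree.children:
        labels.extend(leaf_labels(child))
    return labels


def is_valid_component(tree: Node, w_i: str, w_next: str) -> bool:
    """Check conditions (1) and (2) of the definition for one tree t_i."""
    if not tree.children:
        # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
        return tree.label == (w_i, w_next)
    # Case (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Assumption: leaf order is not checked.
    return Counter(leaf_labels(tree)) == Counter([w_i, w_next])


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that trees = <t_1; ...; t_{k-1}> is a pattern on words = (W_1, ..., W_k)."""
    k = len(words)
    if k < 2 or list(words) != sorted(words):
        return False  # need k >= 2 and lexicographically sorted words
    if len(trees) != k - 1:
        return False
    return all(
        is_valid_component(trees[i], words[i], words[i + 1]) for i in range(k - 1)
    )


# Example: k = 3, so the pattern consists of two trees.
words = ["data", "mining", "web"]
t1 = Node(("data", "mining"))                               # case (1)
t2 = Node("p", [Node("mining"), Node("b", [Node("web")])])  # case (2)
pattern_ok = is_tree_association_pattern(words, [t1, t2])
```

In this sketch, lexicographical order is approximated by Python's default string comparison, and the tag names `p` and `b` are illustrative placeholders only. Counting how often such a pattern occurs in a document collection, which the paper's mining algorithm addresses, is outside the scope of this example.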