Electronic documents such as SGML/HTML/XML files and LaTeX files have been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k): it is a sequence ⟨t_1; t_2; ...; t_{k-1}⟩ of labeled rooted trees such that, for i = 1, 2, ..., k-1, either (1) t_i consists of only one node having the pair of words W_i and W_{i+1} as its label, or (2) t_i is a labeled rooted tree which has exactly two leaves, labeled with W_i and W_{i+1}, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that the algorithm is effective for extracting structural characteristics of semistructured documents.
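To make the definition of a tree-association pattern concrete, the following minimal Python sketch checks whether a sequence of labeled rooted trees satisfies conditions (1) and (2) for a given word list. It is an illustration of the definition only, not the mining algorithm of the paper. The names `Node`, `leaves`, `is_valid_component`, and `is_tree_association_pattern` are hypothetical, and two points the abstract leaves open are assumed here: internal nodes may carry any string label (for example a tag name), and the left-to-right order of the two leaves is ignored.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A node label is either a single string or, for case (1), a pair of words.
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted tree (hypothetical representation)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaves(tree: Node) -> List[Node]:
    """Return the leaves of the tree in left-to-right order."""
    if not tree.children:
        return [tree]
    result: List[Node] = []
    for child in tree.children:
        result.extend(leaves(child))
    return result


def is_valid_component(tree: Node, w_left: str, w_right: str) -> bool:
    """Check conditions (1) and (2) for one tree t_i and words W_i, W_{i+1}."""
    # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
    if not tree.children:
        return tree.label == (w_left, w_right)
    # Case (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Assumption: leaf order is not significant.
    leaf_labels = [leaf.label for leaf in leaves(tree)]
    return leaf_labels in ([w_left, w_right], [w_right, w_left])


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that trees = <t_1; ...; t_{k-1}> is a pattern on words = (W_1, ..., W_k)."""
    k = len(words)
    # The word list must have k >= 2 entries in lexicographical order,
    # and there must be exactly k - 1 trees.
    if k < 2 or list(words) != sorted(words) or len(trees) != k - 1:
        return False
    return all(
        is_valid_component(tree, words[i], words[i + 1])
        for i, tree in enumerate(trees)
    )


# Usage: a pattern on ("data", "mining", "tree") with one tree of each kind.
words = ["data", "mining", "tree"]
t1 = Node(("data", "mining"))                               # case (1)
t2 = Node("p", [Node("mining"), Node("b", [Node("tree")])])  # case (2)
print(is_tree_association_pattern(words, [t1, t2]))  # True
```

Counting how often such a pattern occurs in a document collection requires a matching relation between patterns and documents, which the abstract does not specify, so it is left out of this sketch.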