Efficient Text Mining with Optimized Pattern Discovery

Authors:
Hiroki Arimura
Affiliations:
-
Venue:
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Year:
2002

Citing 13
Cited 3

Combinatorial pattern discovery for scientific data: some preliminary results

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Toward Efficient Agnostic Learning

Machine Learning - Special issue on computational learning theory, COLT'92
Advances in knowledge discovery and data mining

Advances in knowledge discovery and data mining
Context-sensitive learning methods for text categorization

ACM Transactions on Information Systems (TOIS)
Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Learning to construct knowledge bases from the World Wide Web

Artificial Intelligence - Special issue on Intelligent internet systems
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Extracting Partial Structures from HTML Documents

Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference
A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

ALT '98 Proceedings of the 9th International Conference on Algorithmic Learning Theory
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications

CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
On Classification and Regression

DS '98 Proceedings of the First International Conference on Discovery Science
Mining Semi-structured Data by Path Expressions

DS '01 Proceedings of the 4th International Conference on Discovery Science

Discovering instances of poetic allusion from anthologies of classical Japanese poems

Theoretical Computer Science
Discovering characteristic expressions in literary works

Theoretical Computer Science
A UML profile for the conceptual modelling of structurally complex data: Easing human effort in the KDD process

Information and Software Technology

Abstract

The rapid progress of computer and network technologies makes it easy to collect and store large amounts of unstructured or semi-structured text such as Web pages, HTML/XML archives, e-mails, and text files. These text data can be thought of as large-scale text databases, and thus it becomes important to develop efficient tools to discover interesting knowledge from such text databases.

There is a large body of data mining research on discovering interesting rules or patterns from well-structured data such as transaction databases with boolean or numeric attributes [1,8,13]. However, it is difficult to directly apply traditional data mining technologies to the text or semi-structured data mentioned above, since these text databases consist of (i) heterogeneous and (ii) huge collections of (iii) unstructured or semi-structured data. Consequently, there have so far been only a small number of studies on text mining, e.g., [4,5,12,17].

Our research goal is to devise an efficient semi-automatic tool that supports human discovery from large text databases. To this end, we require a fast pattern discovery algorithm that runs in, e.g., O(n) to O(n log n) time, so that it can respond in real time on an unstructured data set of total size n. Furthermore, such an algorithm has to be robust in the sense that it can work on a large amount of noisy and incomplete data without the assumption of an unknown hypothesis class.