Communications of the ACM
Combinatorial pattern discovery for scientific data: some preliminary results
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Information extraction from HTML: application of a general machine learning approach
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Learning page-independent heuristics for extracting data from Web pages
WWW '99 Proceedings of the eighth international conference on World Wide Web
Data on the Web: from relations to semistructured data and XML
Data on the Web: from relations to semistructured data and XML
Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
Learning to construct knowledge bases from the World Wide Web
Artificial Intelligence - Special issue on Intelligent internet systems
Machine Learning
Machine Learning
Identification of Tree Translation Rules from Examples
ICGI '00 Proceedings of the 5th International Colloquium on Grammatical Inference: Algorithms and Applications
A Unifying Approach to HTML Wrapper Representation and Learning
DS '00 Proceedings of the Third International Conference on Discovery Science
Polynomial time approximation schemes for Euclidean TSP and other geometric problems
FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science
Knowledge Discovery from Semistructured Texts
Progress in Discovery Science, Final Report of the Japanese Discovery Science Project
Efficient Text Mining with Optimized Pattern Discovery
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages, labeled by a labelling function, each page is divided into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail. The label of an element node is called a tag, and the label of a text node is called a text. The goal of the mining algorithm is to find an interesting pattern, called an association path, which is a pair (t, w) of a tag-sequence t and a word-sequence w represented by a word-association pattern [1]. An association path (t, w) agrees with the labelling function on a path p if the following equivalence holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p belongs to a positive example. The importance of an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of the model by experiments.
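The agreement measure described above can be illustrated with a minimal Python sketch. It is not the paper's mining algorithm; it only scores one candidate association path against labelled paths. Several things here are assumptions made for illustration: the names `is_subsequence`, `matches`, and `agreement` are hypothetical, a path is modelled as a pair of a tag tuple and a text string, and the word-association pattern of [1] is simplified to "the words of w occur in the text in the given order", which may be weaker or stronger than the matching rule actually used there.

```python
from typing import Iterable, Sequence, Tuple

# Assumed representation: a path is (tag sequence, text at the tail).
Path = Tuple[Tuple[str, ...], str]
# Assumed representation: an association path is (tag-sequence t, word-sequence w).
AssociationPath = Tuple[Tuple[str, ...], Tuple[str, ...]]


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if pattern occurs in sequence in order, not necessarily contiguously."""
    it = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must appear after earlier ones.
    return all(item in it for item in pattern)


def matches(assoc: AssociationPath, path: Path) -> bool:
    """t must be a subsequence of the path's tags and w must match its text.

    Simplifying assumption: w matches when its words occur in order
    among the whitespace-separated, lower-cased words of the text.
    """
    t, w = assoc
    tags, text = path
    return is_subsequence(t, tags) and is_subsequence(w, text.lower().split())


def agreement(assoc: AssociationPath, labelled: Iterable[Tuple[Path, bool]]) -> int:
    """Count the paths on which the pattern's match result equals the label."""
    return sum(1 for path, positive in labelled if matches(assoc, path) == positive)


if __name__ == "__main__":
    # Toy data invented for this sketch, not taken from the paper.
    labelled_paths = [
        ((("html", "body", "table", "tr", "td"), "price 10 dollars"), True),
        ((("html", "body", "p"), "price list"), False),
        ((("html", "body", "table", "tr", "td"), "contact us"), False),
    ]
    candidate = (("table", "td"), ("price",))
    print(agreement(candidate, labelled_paths))
```

Under these assumptions, a mining procedure would search over candidate pairs (t, w) and keep those with the highest agreement; how that search is organised efficiently is the subject of the paper and is not reproduced here.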