Applying Pattern Mining to Web Information Extraction

Authors:
Chia-Hui Chang;Shao-Chen Lui;Yen-Chin Wu
Affiliations:
-;-;-
Venue:
PAKDD '01 Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining
Year:
2001

Citing 7
Cited 2

New indices for text: PAT Trees and PAT arrays

Information retrieval
PAT-tree-based keyword extraction for Chinese information retrieval

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
A scalable comparison-shopping agent for the World-Wide Web

AGENTS '97 Proceedings of the first international conference on Autonomous agents
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Record-boundary discovery in Web documents

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Tag tree template for Web information and schema extraction

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction aim to solve this problem by applying machine learning to automatically generate extractors. For example, WIEN, Stalker, Softmealy, etc. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel idea to IE, by repeated pattern mining and multiple pattern alignment. The discovery of repeated patterns are realized through a data structure call PAT tree. In addition, incomplete patterns are further revised by pattern alignment to comprehend all pattern instances. This new track to IE involves no human effort and content-dependent heuristics. Experimental results show that the constructed extraction rules can achieves 97 percent extraction over fourteen popular search engines.