New indices for text: PAT Trees and PAT arrays
Information retrieval
PAT-tree-based keyword extraction for Chinese information retrieval
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
A scalable comparison-shopping agent for the World-Wide Web
AGENTS '97 Proceedings of the first international conference on Autonomous agents
A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Record-boundary discovery in Web documents
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents
PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Tag tree template for Web information and schema extraction
Expert Systems with Applications: An International Journal
Hi-index | 0.00 |
Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work in wrapper induction aim to solve this problem by applying machine learning to automatically generate extractors. For example, WIEN, Stalker, Softmealy, etc. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel idea to IE, by repeated pattern mining and multiple pattern alignment. The discovery of repeated patterns are realized through a data structure call PAT tree. In addition, incomplete patterns are further revised by pattern alignment to comprehend all pattern instances. This new track to IE involves no human effort and content-dependent heuristics. Experimental results show that the constructed extraction rules can achieves 97 percent extraction over fourteen popular search engines.