Template-based information mining from HTML documents

Authors:
Jane Yung-jen Hsu;Wen-tau Yih
Affiliations:
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.;Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
Venue:
AAAI'97/IAAI'97 Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on Innovative applications of artificial intelligence
Year:
1997

Abstract

Tools for mining information from data can create added value for the Internet. As the majority of electronic documents available over the network are in unstructured textual form, extracting useful information from a document usually involves information retrieval techniques or manual processing. This paper presents a novel approach to mining information from HTML documents using tree-structured templates. In addition to syntactic and semantic descriptions, each template is designed to capture the logical structure of a class of documents. Experiments have been conducted to extract FAQ information automatically from over one hundred HTML documents collected from the Web. Using two basic templates, the prototype FAQ Miner has accurately analyzed 65% of the collection of FAQ documents. With additional processing to handle "near-passes", the success rate is approximately 75%. The preliminary results have demonstrated the utility of structural templates for mining information from semi-structured text-based documents.