A Personal Web Information/Knowledge Retrieval System

Authors:
Hao Han; Takehiro Tokuda
Affiliations:
{han, tokuda}@tt.cs.titech.ac.jp, Department of Computer Science, Tokyo Institute of Technology, Meguro, Tokyo 152-8552, Japan
Venue:
Proceedings of the 2008 conference on Information Modelling and Knowledge Bases XIX
Year:
2008



Abstract

The Web is the richest source of information and knowledge. Unfortunately, the current structure of Web pages makes it difficult for users to retrieve that information or knowledge systematically. In this paper, we propose a personal Web information/knowledge retrieval system that uses a tree-based approach to extract structured parts from Web pages. First, we obtain the layout pattern and the paths of the parts to be extracted from a typical Web page in the target sites. Then, we use the recorded layout pattern and paths to extract the structured parts from the remaining Web pages in those sites. We demonstrate the usefulness of our approach with the results of extracting structured parts from notable Web pages.
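
The two steps described in the abstract (record paths on a typical page, then reuse them on other pages of the same site) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes well-formed XHTML input so that the standard library's xml.etree.ElementTree can parse it, it covers only the recorded paths and not the layout pattern, and the names path_to, record_paths, extract, TYPICAL and OTHER are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_to(root, target):
    """Return an ElementTree path from root to target, e.g. './body[1]/div[2]/h1[1]'."""
    def walk(node, steps):
        if node is target:
            return steps
        counts = {}  # position of each child among siblings with the same tag
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, steps + [f"{child.tag}[{counts[child.tag]}]"])
            if found is not None:
                return found
        return None

    steps = walk(root, [])
    if steps is None:
        return None
    return "./" + "/".join(steps) if steps else "."


def record_paths(typical_page, wanted):
    """Step 1 (assumed form): map each field name to the path of the first
    element on the typical page whose text equals the user-selected sample."""
    root = ET.fromstring(typical_page)
    paths = {}
    for field, text in wanted.items():
        for element in root.iter():
            if (element.text or "").strip() == text:
                paths[field] = path_to(root, element)
                break
    return paths


def extract(page, paths):
    """Step 2 (assumed form): apply the recorded paths to another page."""
    root = ET.fromstring(page)
    result = {}
    for field, path in paths.items():
        element = root.find(path)
        result[field] = None if element is None else (element.text or "").strip()
    return result


# Hypothetical pages from the same site, sharing one layout.
TYPICAL = """<html><body>
<div>Site menu</div>
<div><h1>First headline</h1><p>First body text</p></div>
</body></html>"""

OTHER = """<html><body>
<div>Site menu</div>
<div><h1>Second headline</h1><p>Second body text</p></div>
</body></html>"""

paths = record_paths(TYPICAL, {"title": "First headline", "body": "First body text"})
print(paths)
print(extract(OTHER, paths))
```

Real Web pages are rarely well-formed XML, so a practical system would need a tolerant HTML parser and some check that a page actually matches the recorded layout before the paths are applied. The sketch leaves both out.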