Task-oriented world wide web retrieval by document type classification

  • Authors:
  • Katsushi Matsuda;Toshikazu Fukushima

  • Affiliations:
  • Human Media Res. Labs., NEC 8916-47, Takayama-cho, Ikoma, Nara, 630-0101 Japan;Human Media Res. Labs., NEC 8916-47, Takayama-cho, Ikoma, Nara, 630-0101 Japan

  • Venue:
  • Proceedings of the eighth international conference on Information and knowledge management
  • Year:
  • 1999

Abstract

This paper proposes a novel approach to accurately searching Web pages for information relevant to problem solving, in which the user specifies a Web document category in place of the task itself. Accessing information from World Wide Web pages as an approach to problem solving has become commonplace. However, such a search is difficult with current search services, since these services provide only keyword-based search methods, which are equivalent to narrowing down the target references according to domains. Problem solving, however, usually involves both a domain and a task. Accordingly, our approach is based on problem solving tasks. To specify a user's problem solving task, we introduce the concept of document types that directly relate to problem solving tasks; with this approach, users can easily designate their tasks. We implemented the PageTypeSearch system based on our approach. The classifier of PageTypeSearch classifies Web pages into the document types by comparing the pages with the typical structural characteristics of each type. In experiments, we compare PageTypeSearch, which uses document type indices, with a conventional keyword-based search system. The average precision of the document type-based search is 88.9%, while the average precision of the keyword-based search is 31.2%. Moreover, the number of irrelevant references gathered by our system is about one-thirteenth that of traditional keyword-based search systems. By introducing the viewpoint of tasks, our approach achieves higher performance and has practical advantages for problem solving.
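
The idea of classifying pages by comparing them with the typical structural characteristics of each document type can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not list the document types or the structural features PageTypeSearch uses, so the type names (`link_collection`, `catalog`, `inquiry_form`), the tag-count thresholds, and the `classify` helper are hypothetical and should not be read as the authors' method.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags, a crude stand-in for structural features."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


# Hypothetical document types, each with the minimum tag counts treated as
# "typical" for that type. Illustrative assumptions, not the paper's profiles.
TYPE_PROFILES = {
    "link_collection": {"a": 5},
    "catalog": {"table": 1, "img": 2},
    "inquiry_form": {"form": 1, "input": 2},
}


def classify(html):
    """Return the sorted list of document types whose profile the page meets."""
    parser = TagCounter()
    parser.feed(html)
    return sorted(
        doc_type
        for doc_type, profile in TYPE_PROFILES.items()
        if all(parser.counts[tag] >= minimum for tag, minimum in profile.items())
    )
```

Under these assumptions, a search restricted to one document type would only consider pages for which `classify` returns that type, which is one way a type index could cut down irrelevant references compared with matching keywords alone.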