Clustering web pages about persons and organizations

  • Authors:
  • Shiren Ye; Tat-Seng Chua; Jeremy R. Kei

  • Affiliations:
  • School of Computing, National University of Singapore, Singapore; School of Computing, National University of Singapore, Singapore; School of Computing, National University of Singapore, Singapore

  • Venue:
  • Web Intelligence and Agent Systems
  • Year:
  • 2005

Abstract

One of the most frequent Web surfing tasks is searching for persons and organizations by name. Such names are often common and not unique, so a single name may map to several distinct target entities. This paper describes a new methodology for clustering the web pages returned by a search engine so that pages belonging to different entities fall into different groups. The algorithm uses a combination of named entities, link-based information, and structure-based information as features to partition the document set into direct and indirect pages by means of a decision-tree model. It then chooses suitably distinctive direct pages as seeds around which to cluster the document set. The algorithm has been found to be effective for web-based information retrieval applications.
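
The two-stage idea in the abstract (separate direct from indirect pages, then cluster around distinctive direct pages) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the feature definitions, the learned tree, or the similarity measure. The `Page` fields, the hand-written `is_direct` rule standing in for a trained decision tree, the thresholds, and the use of Jaccard similarity over named-entity sets are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One search result. All fields are hypothetical illustrative features."""
    url: str
    entities: frozenset   # named entities found on the page (assumed pre-extracted)
    name_in_title: bool   # structure-based feature: query name appears in the title
    inlinks: int          # link-based feature: links from other results to this page


def is_direct(page):
    # Hand-written stand-in for a learned decision tree; thresholds are invented.
    if page.name_in_title:
        return True
    return page.inlinks >= 3 and len(page.entities) >= 3


def jaccard(a, b):
    # Set overlap in [0, 1]; defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_seeds(direct_pages, max_sim=0.3):
    # Keep a direct page as a seed only if it is dissimilar to every earlier seed.
    seeds = []
    for page in direct_pages:
        if all(jaccard(page.entities, s.entities) < max_sim for s in seeds):
            seeds.append(page)
    return seeds


def cluster(pages, max_sim=0.3):
    # Assign every page to the seed whose entity set it overlaps most.
    seeds = pick_seeds([p for p in pages if is_direct(p)], max_sim)
    if not seeds:
        return {}
    clusters = {s.url: [] for s in seeds}
    for page in pages:
        best = max(seeds, key=lambda s: jaccard(page.entities, s.entities))
        clusters[best.url].append(page.url)
    return clusters


# Toy data: two different entities sharing one name (entirely made up).
pages = [
    Page("a.example/home", frozenset({"NUS", "Singapore", "computing"}), True, 5),
    Page("b.example/bio", frozenset({"Acme", "Boston", "finance"}), True, 2),
    Page("c.example/news", frozenset({"NUS", "Singapore"}), False, 0),
    Page("d.example/list", frozenset({"Acme", "finance", "London"}), False, 1),
]
result = cluster(pages)
```

On this toy input the two homepage-like pages become seeds, and each of the remaining pages joins the seed it shares the most named entities with. A real system would replace `is_direct` with a decision tree trained on labeled pages and would likely need a fallback for pages that match no seed.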