Document Categorization and Query Generation on the World Wide Web Using WebACE

  • Authors:
  • Daniel Boley; Maria Gini; Robert Gross; Eui-Hong (Sam) Han; Kyle Hastings; George Karypis; Vipin Kumar; Bamshad Mobasher; Jerome Moore

  • Affiliations:
  • Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CSci Building, 200 Union Street SE, Minneapolis, MN 55455, USA (all authors)

  • Venue:
  • Artificial Intelligence Review - Special issue on data mining on the Internet
  • Year:
  • 1999

Quantified Score

Hi-index 0.02

Abstract

We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.