Performing Binary-Categorization on Multiple-Record Web Documents Using Information Retrieval Models and Application Ontologies

  • Authors:
  • Linus W. Kwong; Yiu-Kai Ng

  • Affiliations:
  • Department of Computer Science, Brigham Young University, Provo, Utah 84602, USA, kwongl@cs.byu.edu; Department of Computer Science, Brigham Young University, Provo, Utah 84602, USA, ng@cs.byu.edu

  • Venue:
  • World Wide Web
  • Year:
  • 2003

Abstract

To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility that lets users search for desired documents by keyword. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document to determine which ones are of interest. We observe that two kinds of problems arise in the retrieval of Web documents: (1) an inappropriate selection of keywords by the user; and (2) poor precision in the retrieved Web documents. To address these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which often appear in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze Web documents and cull them to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares that vector with the predefined ontology vector in the same multi-dimensional space using VSM, taking into account both the magnitudes of the vectors and the angle between them. Our experimental results show an average of 90% recall and 97% precision in recognizing Web documents belonging to the same category (i.e., domain of interest). Thus our categorization discards very few documents it should have kept and keeps very few it should have discarded.
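
The three culling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the term weighting, the form of the ontology vector, or the acceptance thresholds, so this sketch assumes raw term counts per record, an ontology vector holding the expected count of each term per record, and two hypothetical thresholds (`min_cosine`, `min_magnitude_ratio`). All function names are illustrative.

```python
import math


def record_vector(record_tokens, vocabulary):
    """Count how often each ontology term occurs in one record (assumed weighting)."""
    return [float(record_tokens.count(term)) for term in vocabulary]


def centroid(vectors):
    """Component-wise mean of the record vectors: the 'virtual centroid'."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    mu, mv = magnitude(u), magnitude(v)
    if mu == 0.0 or mv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (mu * mv)


def is_relevant(records, ontology, min_cosine=0.8, min_magnitude_ratio=0.5):
    """Binary decision: keep the document only if its centroid is close to the
    ontology vector in both angle and magnitude.

    records  -- list of records, each a list of tokens
    ontology -- dict mapping term -> expected count per record (assumed form)
    Both thresholds are hypothetical values chosen for illustration.
    """
    if not records or not ontology:
        return False
    vocabulary = sorted(ontology)
    ontology_vec = [float(ontology[term]) for term in vocabulary]
    onto_mag = magnitude(ontology_vec)
    if onto_mag == 0.0:
        return False
    cen = centroid([record_vector(r, vocabulary) for r in records])
    angle_ok = cosine(cen, ontology_vec) >= min_cosine
    size_ok = magnitude(cen) / onto_mag >= min_magnitude_ratio
    return angle_ok and size_ok
```

Checking the magnitude as well as the angle matters because a document that mentions only one ontology term once can still point in a plausible direction while carrying far too little ontology-related content per record; the magnitude ratio filters out such cases.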