Fuzzy combinations of criteria: an application to web page representation for clustering

  • Authors:
  • Alberto Pérez García-Plaza;Víctor Fresno;Raquel Martínez

  • Affiliations:
  • NLP & IR Group, UNED, Madrid, Spain;NLP & IR Group, UNED, Madrid, Spain;NLP & IR Group, UNED, Madrid, Spain

  • Venue:
  • CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part II
  • Year:
  • 2012

Abstract

Document representation is an essential step in web page clustering. Web pages are usually written in HTML, which offers useful information for selecting the most important features to represent them. In this paper we investigate the use of nonlinear combinations of criteria, implemented by means of a fuzzy system, to find those important features. Our starting point is a term weighting function called Fuzzy Combination of Criteria (fcc), which relies on term frequency, document title, emphasis and term positions in the text. Next, we analyze its drawbacks and explore the possibility of adding contextual information extracted from the anchor texts of inlinks, proposing an alternative way of combining criteria based on our experimental results. Finally, we apply a statistical significance test to compare the original representation with our proposal.
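
To make the idea of a nonlinear, rule-based combination of criteria concrete, the following is a minimal sketch of how a fuzzy system could turn the four criteria named above into a single term weight. It is not the fcc rule base from the paper: the linear membership functions, the four rules, the output levels and the function name `fuzzy_term_weight` are all assumptions chosen for illustration, and every input is assumed to be already normalized to [0, 1].

```python
def clamp01(x):
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def low(x):
    """Illustrative linear membership in the fuzzy set 'low'."""
    return 1.0 - x


def high(x):
    """Illustrative linear membership in the fuzzy set 'high'."""
    return x


def fuzzy_term_weight(frequency, title, emphasis, position):
    """Combine four normalized criteria into one weight in [0, 1].

    Hypothetical rule base (an assumption, not taken from the paper):
    each rule pairs a firing strength (min acts as fuzzy AND) with a
    crisp output level, and the result is the strength-weighted
    average of those levels.
    """
    f, t, e, p = (clamp01(v) for v in (frequency, title, emphasis, position))
    rules = [
        (min(high(t), high(f)), 1.0),        # in title AND frequent
        (min(high(e), high(f)), 0.8),        # emphasized AND frequent
        (min(high(f), high(p)), 0.6),        # frequent AND well positioned
        (min(low(f), low(t), low(e)), 0.0),  # rare, not in title, not emphasized
    ]
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return 0.0  # no rule fires: treat the term as unimportant
    return sum(strength * level for strength, level in rules) / total


if __name__ == "__main__":
    # A frequent title term versus a rare, unmarked term.
    print(fuzzy_term_weight(0.9, 1.0, 0.0, 0.0))
    print(fuzzy_term_weight(0.1, 0.0, 0.0, 0.0))
```

Because each rule takes the minimum over its conditions, a criterion such as title presence only contributes when the term is also frequent, which is the kind of interaction a linear weighted sum of the criteria cannot express.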