Clustering abstracts of scientific texts using the transition point technique

  • Authors:
  • David Pinto; Héctor Jiménez-Salazar; Paolo Rosso

  • Affiliations:
  • Faculty of Computer Science, BUAP, Puebla, Mexico; Faculty of Computer Science, BUAP, Puebla, Mexico; Department of Information Systems and Computation, UPV, Valencia, Spain

  • Venue:
  • CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2006

Abstract

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Current keyword-based techniques fail on narrow domain-oriented libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure to cluster abstracts that consists of applying the transition point technique during the term selection process. This technique indexes the documents with mid-frequency terms because these terms have high semantic content. In our experiments, the transition point approach was compared with well-known unsupervised term selection techniques. The transition point technique showed that it is possible to obtain better performance than traditional methods. Moreover, we propose an approach to analyse the stability of the transition point term selection method.
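
The abstract does not give the formula for the transition point, so the following is only a minimal sketch under stated assumptions. It uses the form commonly cited in the literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms occurring exactly once, and it selects terms whose frequency lies in a neighbourhood around TP. The function names and the `width` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Transition point from term frequencies (assumed formula, see lead-in)."""
    # I1: number of terms that occur exactly once in the text.
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep terms whose frequency falls within +/- width * TP of TP.

    `width` is a hypothetical neighbourhood parameter chosen for illustration.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy text: three terms occur once, "d" twice, "e" five times.
tokens = "a b c d d e e e e e".split()
selected = select_mid_frequency_terms(tokens)
```

In this toy text the very frequent term and the terms that occur only once are both discarded, and only the mid-frequency term remains as an index term. Clustering would then operate on document vectors built from the selected terms.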