Text mining without document context

  • Authors:
  • Eric SanJuan;Fidelia Ibekwe-SanJuan

  • Affiliations:
  • LITA, Université de Metz & URI, Ile du Saulcy, Metz Cedex, France;URSIDOC-ENSSIB & Université de Lyon, Lyon Cedex, France

  • Venue:
  • Information Processing and Management: an International Journal - Special issue: Informetrics
  • Year:
  • 2006

Abstract

We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the Term-Watch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning (k-means, k-medoids) methods. This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
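
To make the idea of a term representation without document context more concrete, the following is a minimal sketch, not the representation used in the paper. It assumes each multi-word term is described only by the words it contains (a bag-of-words vector) and compares terms by cosine similarity. The function names and the three example terms are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import sqrt


def term_vector(term):
    """Bag-of-words vector for one multi-word term (word -> count).

    Assumption: a term is represented only by its own component words,
    since no document co-occurrence information is available.
    """
    return Counter(term.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example terms, not taken from the paper's test corpus.
terms = ["t cell activation", "t cell proliferation", "gene expression"]
vectors = {t: term_vector(t) for t in terms}

for i, a in enumerate(terms):
    for b in terms[i + 1:]:
        print(f"{a!r} vs {b!r}: {cosine(vectors[a], vectors[b]):.3f}")
```

Under this assumed representation, each term has only a handful of non-zero entries, so the resulting term-by-word matrix is very sparse, and terms that share no words have similarity zero regardless of how related they are. Vectors of this kind could then be passed to a partitioning method such as k-means.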