Text mining without document context

Authors:
Eric SanJuan;Fidelia Ibekwe-SanJuan
Affiliations:
LITA, Université de Metz & URI, Ile du Saulcy, Metz Cedex, France;URSIDOC-ENSSIB & Université de Lyon, Lyon Cedex, France
Venue:
Information Processing and Management: an International Journal - Special issue: Informetrics
Year:
2006

Citing 13
Cited 14

Bootstrap technique in cluster analysis

Pattern Recognition
Word association norms, mutual information, and lexicography

Computational Linguistics
Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Document clustering with committees

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Chameleon: Hierarchical Clustering Using Dynamic Modeling

Computer
CLARANS: A Method for Clustering Objects for Spatial Data Mining

IEEE Transactions on Knowledge and Data Engineering
Accurate methods for the statistics of surprise and coincidence

Computational Linguistics - Special issue on using large corpora: I
Retrieving collocations from text: Xtract

Computational Linguistics - Special issue on using large corpora: I
Terminological variation, a means of identifying research topics from texts

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
The clustering power of low frequency words in academic Webs: Brief Communication

Journal of the American Society for Information Science and Technology
Combining full text and bibliometric information in mapping scientific disciplines

Information Processing and Management: an International Journal - Special issue: Infometrics
Introduction to the bio-entity recognition task at JNLPBA

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
A symbolic approach to automatic multiword term structuring

Computer Speech and Language

Mining knowledge from natural language texts using fuzzy associated concept mapping

Information Processing and Management: an International Journal
A Bounded Index for Cluster Validity

MLDM '07 Proceedings of the 5th international conference on Machine Learning and Data Mining in Pattern Recognition
Decomposition of terminology graphs for domain knowledge acquisition

Proceedings of the 17th ACM conference on Information and knowledge management
A comprehensive validity index for clustering

Intelligent Data Analysis
The landscape of information science: 1996-2008

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
A generic construct based workload model for web search

Information Processing and Management: an International Journal
Rapid Ontology Development

Proceedings of the 2010 conference on Information Modelling and Knowledge Bases XXI
Graph decomposition approaches for terminology graphs

MICAI'07 Proceedings of the 6th Mexican international conference on Advances in artificial intelligence
Facilitating Ontology Development with Continuous Evaluation

Informatica
A multi-faceted and automatic knowledge elicitation system (MAKES) for managing unstructured information

Expert Systems with Applications: An International Journal
A clustering study of a 7000 EU document inventory using MDS and SOM

Expert Systems with Applications: An International Journal
A Semantic-based Intellectual Property Management System (SIPMS) for supporting patent analysis

Engineering Applications of Artificial Intelligence
Topic detection and multi-word terms extraction for Arabic unvowelized documents

AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Combining vector space model and multi word term extraction for semantic query expansion

NLDB'07 Proceedings of the 12th international conference on Applications of Natural Language to Information Systems

Abstract

We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the Term-Watch system. We compared CPCL with other existing clustering algorithms, namely hierarchical and partitioning methods (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
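
To make the out-of-context setting concrete, the following minimal Python sketch shows one way multi-word terms could be represented for statistical clustering using only their own component words, and how a clustering could be scored against a hand-built taxonomy. It is an illustration under stated assumptions, not the representation or the metrics used in the paper: the binary word-vector encoding, the function names term_vectors and purity, the sample terms, and the class labels are all hypothetical, and purity is a common homogeneity-style measure swapped in here because the abstract does not define the paper's homogeneity score or its refined editing distance.

```python
from collections import Counter


def term_vectors(terms):
    """Encode each multi-word term as a binary vector over the words
    that occur in the term list itself (no document context needed)."""
    vocab = sorted({word for term in terms for word in term.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {}
    for term in terms:
        vec = [0] * len(vocab)
        for word in term.lower().split():
            vec[index[word]] = 1
        vectors[term] = vec
    return vocab, vectors


def purity(clusters, gold):
    """Share of terms that belong to the majority gold class of their
    cluster. Assumes every cluster is non-empty."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        Counter(gold[term] for term in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return majority / total


# Hypothetical sample terms and taxonomy classes, for illustration only.
terms = ["human t cell", "t cell", "t cell receptor", "b cell receptor"]
gold = {
    "human t cell": "cell_type",
    "t cell": "cell_type",
    "t cell receptor": "protein",
    "b cell receptor": "protein",
}

vocab, vectors = term_vectors(terms)

# A hypothetical clustering that groups terms sharing the words "t cell".
clusters = [["human t cell", "t cell", "t cell receptor"], ["b cell receptor"]]
score = purity(clusters, gold)
```

In this toy encoding, longer terms switch on more vector components than shorter ones, which hints at why vector-based methods can be sensitive to term length, as the abstract reports for k-means and k-medoids.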