We consider a challenging clustering task: clustering multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology that takes as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the Term-Watch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning algorithms (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field that comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.
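To make the out-of-context setting concrete, the following is a minimal sketch, not the representation or the metric used by CPCL or the Term-Watch system. It assumes that each multi-word term is encoded as a binary vector over its component words (one plausible input for k-means or k-medoids) and that cluster homogeneity is measured as a simple majority-class share against a reference taxonomy. The function names `term_vectors` and `homogeneity`, the toy terms, and the class labels are all hypothetical.

```python
from collections import Counter


def term_vectors(terms):
    """Encode each multi-word term as a binary vector over its component words.

    Assumption: a bag-of-component-words encoding, chosen only for illustration.
    """
    vocab = sorted({w for t in terms for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {}
    for t in terms:
        v = [0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] = 1
        vectors[t] = v
    return vocab, vectors


def homogeneity(clusters, gold):
    """Share of terms that belong to the majority reference class of their cluster.

    Assumption: a purity-style score, which may differ from the paper's own measure.
    """
    total = sum(len(c) for c in clusters)
    agree = sum(
        Counter(gold[t] for t in c).most_common(1)[0][1] for c in clusters if c
    )
    return agree / total if total else 0.0


# Toy data: invented terms and invented reference classes.
terms = ["T cell", "B cell", "T cell receptor", "NF-kappa B", "NF-kappa B activation"]
gold = {
    "T cell": "class_1",
    "B cell": "class_1",
    "T cell receptor": "class_2",
    "NF-kappa B": "class_2",
    "NF-kappa B activation": "class_2",
}
clusters = [["T cell", "B cell", "T cell receptor"], ["NF-kappa B", "NF-kappa B activation"]]

vocab, vectors = term_vectors(terms)
score = homogeneity(clusters, gold)
print(len(vocab), sum(vectors["T cell receptor"]), score)
```

Under this encoding a longer term has more non-zero components than a shorter one, so term length directly affects distances between vectors. That is one possible reason a vector-based method could be sensitive to term length, although the abstract does not state the cause.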