An approach to clustering abstracts

Authors:
Mikhail Alexandrov;Alexander Gelbukh;Paolo Rosso
Affiliations:
Center for Computing Research, National Polytechnic Institute, Mexico;Center for Computing Research, National Polytechnic Institute, Mexico;Polytechnic University of Valencia, Spain
Venue:
NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Year:
2005

Abstract

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., when some documents are devoted to sport, others to medicine, and others to politics. However, they fail on narrow domain-oriented libraries, e.g., those whose documents are all on physics, all on geology, or all on computational linguistics. Yet such data sets are precisely the most frequent and most interesting ones. We propose a simple procedure for clustering abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein's MajorClust method to cluster both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in the search, classification, clustering, selection, and proper referencing of papers.
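
To make the clustering step concrete, here is a minimal sketch of a MajorClust-style procedure applied to documents described by keyword-group counts. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure or how keyword groups are formed, so the cosine measure, the tie-breaking rule, the toy documents, and all names (`cosine`, `majorclust`, `docs`) are hypothetical. The core idea follows the commonly described MajorClust scheme: every item starts in its own cluster and repeatedly adopts the label that attracts it most strongly, i.e., the label with the largest total edge weight among its neighbours.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (an assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def majorclust(sim, max_iter=100):
    """MajorClust-style label propagation over a symmetric similarity matrix.

    Each node starts in its own cluster and moves to the cluster whose
    members have the largest summed similarity to it. Tie handling
    (keep the current label, else take the smallest label) is an assumption.
    """
    n = len(sim)
    labels = list(range(n))
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            # Total attraction of node i towards each neighbouring cluster.
            attraction = defaultdict(float)
            for j in range(n):
                if j != i and sim[i][j] > 0:
                    attraction[labels[j]] += sim[i][j]
            if not attraction:
                continue  # isolated node keeps its own label
            best = max(attraction.values())
            if attraction.get(labels[i], 0.0) == best:
                continue  # current cluster is already (one of) the strongest
            labels[i] = min(c for c, a in attraction.items() if a == best)
            changed = True
        if not changed:
            break
    return labels


# Toy abstracts represented by counts over two hypothetical keyword groups.
docs = [
    {"syntax": 3, "retrieval": 1},
    {"syntax": 2},
    {"syntax": 1, "retrieval": 3},
    {"retrieval": 2},
]
sim = [[cosine(a, b) for b in docs] for a in docs]
labels = majorclust(sim)
print(labels)
```

On this toy input the first two documents end up sharing one label and the last two another. The same `majorclust` function could in principle be run first on a keyword-by-keyword similarity matrix to form the keyword groups, and then on the document matrix, which mirrors the two uses of MajorClust mentioned in the abstract.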