A Competitive Term Selection Method for Information Retrieval

Authors:
Franco Rojas López;Héctor Jiménez-Salazar;David Pinto
Affiliations:
Faculty of Computer Science, BUAP, Puebla, 72570 Ciudad Universitaria, Mexico;Faculty of Computer Science, BUAP, Puebla, 72570 Ciudad Universitaria, Mexico;Faculty of Computer Science, BUAP, Puebla, 72570 Ciudad Universitaria, Mexico and Department of Information Systems and Computation, UPV, Valencia 46022, Camino de Vera s/n, Spain
Venue:
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

Term selection is a necessary component of most natural language processing tasks. Although different unsupervised techniques have been proposed, the best results come at a high computational cost, as with methods based on entropy. The aim of this paper is to propose an unsupervised term selection technique based on a bigram-enriched version of the transition point. Our approach first reduces the corpus vocabulary size using the transition point technique and then expands the reduced corpus with bigrams obtained from the same corpus, i.e., without external knowledge sources. This approach provides a considerable dimensionality reduction of the TREC-5 collection and has also been shown to improve precision for some entropy-based methods.
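
The two steps named in the abstract (vocabulary reduction around the transition point, then bigram enrichment from the same corpus) can be illustrated with a minimal sketch. The abstract gives no formulas or parameters, so everything concrete here is an assumption: the transition point is computed with the commonly cited expression TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms occurring exactly once; the neighbourhood `width` and the rule for choosing bigrams (keep any adjacent pair containing a selected term) are hypothetical choices, not the authors' method.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(8 * I1 + 1) - 1) / 2,
    with I1 the number of terms that occur exactly once."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_terms(tokens, width=0.4):
    """Keep terms whose frequency lies within +/- width * TP of TP.
    The neighbourhood width is a hypothetical parameter."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


def enrich_with_bigrams(tokens, selected):
    """Collect adjacent token pairs that contain at least one selected term.
    This pairing rule is an illustrative assumption."""
    return {
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a in selected or b in selected
    }


if __name__ == "__main__":
    tokens = "x x y y y z w v".split()
    selected = select_terms(tokens)
    print(selected)
    print(enrich_with_bigrams(tokens, selected))
```

In this sketch only frequency counts over the corpus itself are needed, which matches the abstract's point that no external knowledge source is used; a real index would apply the same selection to each document or to the whole collection, depending on the design.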