A study on automatically extracted keywords in text categorization

Authors:
Anette Hulth; Beáta B. Megyesi
Affiliations:
Uppsala University, Sweden; Uppsala University, Sweden
Venue:
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Year:
2006

Abstract

This paper presents a study of whether and how automatically extracted keywords can be used to improve text categorization. In summary, we show that higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, represented either as unigrams or intact. Of these two experiments, the unigram representation performs better, although neither performs as well as using headlines only.
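
The combination step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: the function name `boosted_term_weights`, the use of raw term frequency as the base weight, the multiplicative `boost` factor of 2.0, and the choice to split multi-word keywords into unigrams before matching.

```python
from collections import Counter


def boosted_term_weights(tokens, keywords, boost=2.0):
    """Return term-frequency weights, multiplied by `boost` for keyword terms.

    Hypothetical illustration only: the base weight (raw term frequency)
    and the boost factor are assumptions, not the paper's settings.
    """
    counts = Counter(tokens)
    # Split multi-word keywords into unigrams so they can match tokens.
    keyword_terms = {w for kw in keywords for w in kw.lower().split()}
    return {
        term: count * (boost if term in keyword_terms else 1.0)
        for term, count in counts.items()
    }


# Toy document and toy keywords (invented for this example).
doc = "the court ruled on the merger of the two oil companies".split()
weights = boosted_term_weights(doc, keywords=["oil companies", "merger"])
```

In this toy example, `merger`, `oil`, and `companies` receive twice the weight of other words that occur once, while words not extracted as keywords keep their plain frequency. The resulting dictionary could then be fed to any categorizer that accepts weighted bag-of-words features.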