Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms

Authors:
Arash Joorabchi; Abdulhussain E. Mahdi
Venue:
Journal of Information Science
Year:
2013

Abstract

Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that uses Wikipedia as a thesaurus for selecting candidate phrases from a document's content. We have devised a set of 20 statistical, positional and semantic features for candidate phrases that capture the properties of the candidates most likely to be keyphrases. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We evaluated both methods on a third-party dataset of research papers. The experimental results show that the performance of the proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and exceeds that of rival supervised and unsupervised methods.
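
The general idea of tuning a keyphrase ranker with a genetic algorithm can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring function, the genetic operators or the exact consistency measure, so the weighted-sum score, the one-point crossover, the Gaussian mutation, the overlap measure 2C/(A+B) and the three toy feature values per candidate (standing in for the paper's 20 features) are all hypothetical choices.

```python
import random

# Toy data (hypothetical): each document is (candidates, gold keyphrases),
# and each candidate is (phrase, [feature values in 0..1]).
TOY_DOCS = [
    (
        [
            ("genetic algorithm", [0.9, 0.8, 0.1]),
            ("keyphrase extraction", [0.8, 0.9, 0.2]),
            ("this paper", [0.1, 0.2, 0.9]),
            ("results", [0.2, 0.1, 0.8]),
        ],
        {"genetic algorithm", "keyphrase extraction"},
    ),
]


def score(features, weights):
    """Assumed scoring function: weighted sum of a candidate's features."""
    return sum(w * f for w, f in zip(weights, features))


def top_k(candidates, weights, k):
    """Return the set of the k highest-scoring candidate phrases."""
    ranked = sorted(candidates, key=lambda c: score(c[1], weights), reverse=True)
    return {phrase for phrase, _ in ranked[:k]}


def fitness(weights, docs, k):
    """Mean overlap 2C/(A+B) between selected and gold keyphrases."""
    total = 0.0
    for candidates, gold in docs:
        picked = top_k(candidates, weights, k)
        total += 2 * len(picked & gold) / (len(picked) + len(gold))
    return total / len(docs)


def evolve(docs, n_features, k=2, pop_size=30, generations=40, seed=0):
    """Evolve a weight vector; needs n_features >= 2 and pop_size >= 4."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half unchanged (elitism).
        pop.sort(key=lambda w: fitness(w, docs, k), reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_features)  # mutate one weight, clamp to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, docs, k))
```

Calling `evolve(TOY_DOCS, n_features=3)` returns one weight vector. In this sketch the unsupervised counterpart would simply rank candidates with fixed, hand-chosen weights, while the supervised step searches for weights that maximize agreement with the human-assigned keyphrases in the training documents.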