DIKEA: domain-independent keyphrase extraction algorithm

Authors:
David X. Wang;Xiaoying Gao;Peter Andreae
Affiliations:
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, New Zealand;School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
Venue:
AI'12 Proceedings of the 25th Australasian joint conference on Advances in Artificial Intelligence
Year:
2012

Abstract

This paper introduces a new domain-independent keyphrase extraction system, DIKEA. Keyphrase extraction, the task of automatically extracting or assigning keyphrases to documents, is a challenging problem, and solving it can benefit many research areas such as information retrieval, particularly indexing, clustering, and summarization. The landmark system KEA (Keyphrase Extraction Algorithm) formulated the task as a supervised machine learning problem and successfully applied a Naïve Bayes model to it; the approach showed great promise, but its performance is not satisfactory. Its state-of-the-art extension, KEA++, performs significantly better but relies on a domain-specific vocabulary, which is often unavailable or incomplete. This paper introduces a novel domain-independent approach and makes three main contributions: utilising the largest online knowledge source, Wikipedia, for keyphrase candidate selection; presenting new features for keyphrase evaluation, including a Wikipedia-based feature, link probability; and evaluating a number of different learning algorithms, including multilayer perceptrons, for keyphrase selection. Experiments show that our system clearly outperforms KEA and closely matches the performance of KEA++, without requiring any domain-specific knowledge such as KEA++'s vocabulary list.
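
The abstract names link probability as a feature but does not define it. A minimal sketch follows, assuming the definition commonly used in Wikipedia-based keyphrase work: the fraction of articles mentioning a phrase that also use it as link anchor text. The function name `link_probability`, the dictionary layout of an article, and the toy corpus are hypothetical illustrations, not the authors' implementation.

```python
from typing import Dict, Iterable, List, Union

Article = Dict[str, Union[str, List[str]]]


def link_probability(phrase: str, articles: Iterable[Article]) -> float:
    """Share of articles mentioning `phrase` that also use it as link anchor text.

    Assumed definition (not taken from the abstract):
        articles where phrase is an anchor / articles where phrase appears at all
    """
    target = phrase.lower()
    mentioned = 0  # articles whose text contains the phrase
    linked = 0     # of those, articles where the phrase is also an anchor
    for article in articles:
        if target in article["text"].lower():
            mentioned += 1
            anchors = {a.lower() for a in article["anchors"]}
            if target in anchors:
                linked += 1
    # A phrase that never occurs gets probability 0 rather than raising.
    return linked / mentioned if mentioned else 0.0


# Hypothetical toy corpus standing in for Wikipedia articles.
toy_articles = [
    {"text": "Machine learning is used in keyphrase extraction.",
     "anchors": ["machine learning"]},
    {"text": "We apply machine learning to text.",
     "anchors": []},
    {"text": "An article about cooking.",
     "anchors": ["cooking"]},
    {"text": "Keyphrase extraction with machine learning.",
     "anchors": ["keyphrase extraction", "machine learning"]},
]

if __name__ == "__main__":
    for candidate in ("machine learning", "keyphrase extraction", "cooking"):
        print(candidate, link_probability(candidate, toy_articles))
```

Under this assumed definition, a phrase that Wikipedia editors frequently choose to link scores close to 1, while a phrase that appears often but is rarely linked scores close to 0. That is one plausible reason such a feature could help rank keyphrase candidates without a domain-specific vocabulary.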