DIKEA: domain-independent keyphrase extraction algorithm

  • Authors:
  • David X. Wang; Xiaoying Gao; Peter Andreae

  • Affiliations:
  • School of Engineering and Computer Science, Victoria University of Wellington, New Zealand (all three authors)

  • Venue:
  • AI'12: Proceedings of the 25th Australasian Joint Conference on Advances in Artificial Intelligence
  • Year:
  • 2012

Abstract

This paper introduces DIKEA, a new domain-independent keyphrase extraction system. Keyphrase extraction, the task of automatically extracting or assigning keyphrases to documents, is a challenging problem, and solving it can benefit many research areas such as information retrieval, particularly indexing, clustering, and summarization. The landmark system KEA (Keyphrase Extraction Algorithm) formulated the task as a supervised machine learning problem and applied a Naïve Bayes model to it; the approach showed great promise, but its performance is not satisfactory. Its state-of-the-art extension, KEA++, performs significantly better but relies on a domain-specific vocabulary, which is often unavailable or incomplete. This paper introduces a novel domain-independent approach and makes three main contributions: (1) utilising the largest online knowledge source, Wikipedia, for keyphrase candidate selection; (2) presenting new features for keyphrase evaluation, including a Wikipedia-based feature, link probability; and (3) evaluating a number of different learning algorithms, including multilayer perceptrons, for keyphrase selection. Experiments show that our system clearly outperforms KEA and closely matches the performance of KEA++, without requiring any domain-specific knowledge such as KEA++'s vocabulary list.
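The abstract names link probability without defining it. In Wikipedia-based keyphrase work the term usually means the fraction of articles containing a phrase in which that phrase appears as link anchor text. The sketch below illustrates that common definition only; whether DIKEA computes it in exactly this way is an assumption, and the function names, the threshold, and all counts are hypothetical.

```python
from typing import Dict, List

def link_probability(phrase: str,
                     anchor_doc_counts: Dict[str, int],
                     total_doc_counts: Dict[str, int]) -> float:
    """Share of articles containing `phrase` where it is used as link anchor text.

    Assumed definition (common in Wikipedia-based work), not taken from the paper.
    """
    total = total_doc_counts.get(phrase, 0)
    if total == 0:
        # Phrase never seen in the corpus: treat as having no evidence.
        return 0.0
    return anchor_doc_counts.get(phrase, 0) / total

# Made-up counts for illustration; real values would come from a Wikipedia dump.
anchor_docs = {"machine learning": 800, "the problem": 2}
total_docs = {"machine learning": 1000, "the problem": 4000}

def select_candidates(phrases: List[str], threshold: float = 0.05) -> List[str]:
    """Keep phrases whose link probability reaches a (hypothetical) threshold."""
    return [p for p in phrases
            if link_probability(p, anchor_docs, total_docs) >= threshold]

if __name__ == "__main__":
    for p in ["machine learning", "the problem", "unseen phrase"]:
        print(p, link_probability(p, anchor_docs, total_docs))
```

With these toy counts, a term that is frequently linked scores high, while a generic phrase that is almost never linked scores near zero. In a supervised setting such a score would typically be one feature among several passed to the learner, rather than a hard filter.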