KPSpotter: a flexible information gain-based keyphrase extraction system

  • Authors:
  • Min Song;Il-Yeol Song;Xiaohua Hu

  • Affiliations:
  • Drexel University, Philadelphia, PA;Drexel University, Philadelphia, PA;Drexel University, Philadelphia, PA

  • Venue:
  • WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

To tackle the issue of information overload, we present an Information Gain-based KeyPhrase Extraction System, called KPSpotter. KPSpotter is a flexible web-enabled keyphrase extraction system, capable of processing various formats of input data, including web data, and generating the extraction model as well as the list of keyphrases in XML. In KPSpotter, the following two features were selected for training and extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections were processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure the system performance, the keyphrases extracted by KPSpotter are compared with the ones that the authors assigned. Our experiments show that the performance of KPSpotter was evaluated to be equivalent to KEA, a well-known keyphrase extraction system. KPSpotter, however, is differentiated from other extraction systems in the followings: First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure and several Natural Language Processing techniques such as stemming and case-folding. Second, KPSpotter is able to process various types of input data such as XML, HTML, and unstructured text data and generate XML output. Third, the user can provide input data and execute KPSpotter through the Internet. Fourth, for efficiency and performance reason, KPSpotter stores candidate keyphrases and its related information such as frequency and stemmed form into an embedded database management system.