Extracting named entities using support vector machines

  • Authors:
  • Yu-Chieh Wu; Teng-Kai Fan; Yue-Shi Lee; Show-Jane Yen

  • Affiliations:
  • Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taoyuan County, Taiwan, R.O.C.; Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taoyuan County, Taiwan, R.O.C.; Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan County, Taiwan, R.O.C.; Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan County, Taiwan, R.O.C.

  • Venue:
  • KDLL'06: Proceedings of the 2006 International Conference on Knowledge Discovery in Life Science Literature
  • Year:
  • 2006

Abstract

Identifying proper names such as gene, DNA, and protein names helps researchers mine textual information. Learning to extract proper names from natural language text is the named entity recognition (NER) task. Previous studies have focused on combining large numbers of hand-crafted rules and trigger words to enhance system performance. However, these methods require domain experts to build the rules and word sets, which takes considerable human effort. In this paper, we present a robust named entity recognition system based on support vector machines (SVM). By integrating a rich feature set with the proposed mask method, the system performs satisfactorily on the MUC-7 and biology named entity recognition tasks, outperforming well-known machine learning-based methods such as the hidden Markov model (HMM) and the maximum entropy model (MEM). We compare our method with previous systems evaluated on the same data sets. The experiments show that our system achieves an F(β=1) rate of 86.4 when trained on the MUC-7 data set and 81.57 on the biology corpus. In addition, our named entity system can handle real-time processing applications: the turnaround time on a 63K-word document set is less than 30 seconds.
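
The F(β=1) rate reported in the abstract is the standard weighted harmonic mean of precision and recall. The sketch below shows how such a score is computed. The function name `f_beta` and the sample precision and recall values are illustrative assumptions; they are not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, the weighted harmonic mean of precision and recall.

    With beta = 1 the two quantities are weighted equally (the F1 score).
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical precision and recall, chosen only to illustrate the call.
example_score = f_beta(precision=0.80, recall=0.90)
print(f"F(beta=1) = {100 * example_score:.2f}")
```

Scores such as 86.4 and 81.57 are this quantity expressed on a 0 to 100 scale.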