Learning to extract chemical names based on random text generation and incomplete dictionary

  • Authors:
  • Su Yan;W. Scott Spangler;Ying Chen

  • Affiliations:
  • IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA

  • Venue:
  • Proceedings of the 11th International Workshop on Data Mining in Bioinformatics
  • Year:
  • 2012

Abstract

Automatically extracting chemical names from text has significant value to biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality training set with which to train a reliable entity extraction model. Leveraging well-studied random text generation techniques based on formal grammars, we explore the idea of automatically creating training sets for the task of chemical named entity extraction. Assuming the availability of an incomplete list of chemical names, we are able to generate well-controlled, random, yet realistic chemical-like training documents. Compared with state-of-the-art models learned from manually labeled data and rule-based systems using real-world data, our solutions achieve comparable or better results with minimal human effort.
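
The general idea of grammar-driven training-set generation can be illustrated with a minimal sketch. It is not the authors' system: the toy grammar, the four-entry dictionary, the BIO label names, and the functions expand and generate_corpus are all assumptions made for illustration. The sketch expands a small context-free grammar at random and, whenever it reaches a chemical slot, inserts a name from the dictionary, so that every generated token carries a label without manual annotation.

```python
import random

# Hypothetical, incomplete dictionary of chemical names (illustrative only).
CHEMICAL_NAMES = ["acetic acid", "benzene", "sodium chloride", "ethanol"]

# Hypothetical toy grammar: each nonterminal maps to alternative expansions.
# "CHEM" is a special slot filled from the dictionary above.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "N"], ["CHEM"], ["the", "N", "of", "CHEM"]],
    "VP": [["V", "NP"], ["V", "CHEM", "with", "CHEM"]],
    "N": [["solution"], ["mixture"], ["sample"]],
    "V": [["contains"], ["dissolves"], ["yields"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of (token, label) pairs."""
    if symbol == "CHEM":
        # Fill the slot with a dictionary name and label it in BIO style.
        tokens = rng.choice(CHEMICAL_NAMES).split()
        return [(tok, "B-CHEM" if i == 0 else "I-CHEM")
                for i, tok in enumerate(tokens)]
    if symbol not in GRAMMAR:
        # Terminal word: it is outside any chemical name.
        return [(symbol, "O")]
    pairs = []
    for child in rng.choice(GRAMMAR[symbol]):
        pairs.extend(expand(child, rng))
    return pairs


def generate_corpus(n_sentences, seed=0):
    """Generate n_sentences labeled sentences, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_corpus(3, seed=42):
        print(" ".join(f"{tok}/{label}" for tok, label in sentence))
```

The labeled sentences produced this way could then be passed to any sequence-labeling learner. A realistic setup would need a far richer grammar and vocabulary than this toy one.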