Automatically extracting chemical names from text is of significant value to biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, high-quality dataset for training a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we can generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze how chemical names are constructed based on the incomplete dictionary, and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution achieves better or comparable results when annotating real-world data, with less human effort. Moreover, we report an interesting observation about the language of chemical names: both their structural and semantic components follow a Zipfian distribution, resembling many natural languages.
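The Zipfian claim can be checked empirically: if component frequencies follow Zipf's law, the log-log rank-frequency plot is roughly linear with slope near -1. A minimal sketch of that test, using a hypothetical set of name fragments (the fragment list and its counts are illustrative, not data from the paper):

```python
from collections import Counter
import math

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank).

    A slope close to -1 is the classic signature of a Zipfian
    distribution; markedly flatter or steeper slopes suggest
    a different frequency law.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical fragments ("morphemes") split out of chemical names,
# with made-up counts chosen only to illustrate the fit.
fragments = (["methyl"] * 50 + ["ethyl"] * 25 + ["chloro"] * 17 +
             ["hydroxy"] * 12 + ["benzo"] * 10 + ["amino"] * 8)
print(zipf_slope(fragments))
```

On this toy sample the fitted slope comes out close to -1, consistent with a Zipf-like rank-frequency curve; on real corpora one would segment actual chemical names into components before counting.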