Learning to extract chemical names based on random text generation and incomplete dictionary

Authors:
Su Yan;W. Scott Spangler;Ying Chen
Affiliations:
IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA;IBM Almaden Research Lab, San Jose, CA
Venue:
Proceedings of the 11th International Workshop on Data Mining in Bioinformatics
Year:
2012

Abstract

Automatically extracting chemical names from text has significant value to biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, high-quality training set with which to train a reliable entity extraction model. Leveraging well-studied random text generation techniques based on formal grammars, we explore the idea of automatically creating training sets for the task of chemical named entity extraction. Assuming the availability of an incomplete list of chemical names, we are able to generate well-controlled, random, yet realistic chemical-like training documents. Compared with state-of-the-art models learned from manually labeled data and with rule-based systems using real-world data, our solutions show comparable or better results with minimal human effort.
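
The general idea of grammar-driven training-set generation can be illustrated with a minimal sketch. It is not the authors' system: the toy grammar `GRAMMAR`, the four-entry dictionary `CHEMICAL_NAMES`, the BIO label scheme, and the function names are all assumptions made for illustration. The sketch randomly expands a small context-free grammar and, wherever the hypothetical `CHEM` symbol appears, inserts a dictionary name whose tokens are labeled automatically, so no manual annotation is needed.

```python
import random

# Hypothetical stand-in for an incomplete dictionary of chemical names.
CHEMICAL_NAMES = ["acetic acid", "benzene", "sodium chloride", "ethanol"]

# Hypothetical toy grammar: each nonterminal maps to alternative productions.
# Symbols that are neither keys here nor "CHEM" are treated as terminal words.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "N"], ["CHEM"], ["a", "solution", "of", "CHEM"]],
    "VP": [["V", "NP"], ["V", "NP", "with", "CHEM"]],
    "N": [["mixture"], ["sample"], ["compound"]],
    "V": [["dissolves"], ["reacts", "with"], ["contains"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of (token, label) pairs."""
    if symbol == "CHEM":
        # Insert a dictionary entry and label its tokens with BIO tags.
        words = rng.choice(CHEMICAL_NAMES).split()
        return [(w, "B-CHEM" if i == 0 else "I-CHEM")
                for i, w in enumerate(words)]
    if symbol not in GRAMMAR:
        # Terminal word: labeled as outside any chemical name.
        return [(symbol, "O")]
    pairs = []
    for child in rng.choice(GRAMMAR[symbol]):
        pairs.extend(expand(child, rng))
    return pairs


def generate_document(n_sentences, seed=0):
    """Generate n_sentences labeled sentences; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_document(3, seed=42):
        print(" ".join(f"{token}/{label}" for token, label in sentence))
```

Each generated sentence is a list of token and label pairs, a shape that sequence labelers such as conditional random fields commonly take as training input. A realistic setup would need a far richer grammar and dictionary than this toy example.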