Recognizing names in biomedical texts: a machine learning approach

Authors:
Guodong Zhou;Jie Zhang;Jian Su;Dan Shen;Chewlim Tan
Affiliations:
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613;Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613;Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613;Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613;School of Computing, National University of Singapore, Singapore 119610
Venue:
Bioinformatics
Year:
2004

Abstract

Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists gather and make use of the knowledge encoded in text documents. To make organized and structured information available, automatic recognition of biomedical entity names becomes critical for information retrieval, information extraction and automated knowledge acquisition.

Results: In this paper, we present a named entity recognition system for the biomedical domain, called PowerBioNE. To deal with the special naming conventions of the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-Nearest Neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step that automatically extracts rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system that deals with the cascaded entity name phenomenon.

Evaluation shows that our system achieves an F-measure of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves an F-measure of 75.8 on the 'protein' class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries. The evaluation also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9.

Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0 and in particular 86.2 on the 'protein' class, without the help of any dictionaries. We believe that an F-measure of 90 on the 23 classes of GENIA V3.0, and in particular 92 on the 'protein' class, can be achieved by refining the annotation scheme in the GENIA corpus (for example, a more flexible annotation scheme and better annotation consistency) and by including a reasonable biomedical dictionary.

Availability: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm. A technology license is available upon bilateral agreement.
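
The F-measure figures quoted in the abstract can be read against the usual entity-level definition. The sketch below is only an illustration and is not taken from the paper: it assumes the balanced F-measure (equal weight on precision and recall), exact matching of both entity boundaries and class, and a percentage scale. The function name `f_measure` and the `(start, end, label)` tuple layout are hypothetical choices made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (start token index, end token index, class label).
Entity = Tuple[int, int, str]


def f_measure(gold: Iterable[Entity],
              predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) on a 0-100 scale.

    Assumption: an entity counts as correct only if its boundaries and
    its class label both match a gold annotation exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)

    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0

    # Balanced F-measure: harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f


if __name__ == "__main__":
    gold = [(0, 2, "protein"), (5, 6, "DNA"), (8, 9, "cell_type")]
    predicted = [(0, 2, "protein"), (5, 6, "protein"),
                 (8, 9, "cell_type"), (11, 12, "DNA")]
    p, r, f = f_measure(gold, predicted)
    print(f"precision={p:.1f} recall={r:.1f} F={f:.1f}")
```

Under exact matching, a prediction with the right boundaries but the wrong class (the second item above) counts as an error for both precision and recall, which is one reason a strict annotation scheme can depress the reported score.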