Developing a robust part-of-speech tagger for biomedical text

Authors:
Yoshimasa Tsuruoka;Yuka Tateishi;Jin-Dong Kim;Tomoko Ohta;John McNaught;Sophia Ananiadou;Jun’ichi Tsujii
Affiliations:
CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;CREST, JST (Japan Science and Technology Agency), Saitama, Japan;School of Informatics, University of Manchester, Manchester, UK;School of Computing, Science and Engineering, Salford University, Salford, Greater Manchester, UK;University of Tokyo, Tokyo, Japan
Venue:
PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Year:
2005

Abstract

This paper presents a part-of-speech tagger that is specifically tuned for biomedical text. We built the tagger with maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing both newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus show that adding training data from a different domain does not hurt the performance of the tagger, and that our tagger achieves very good precision (97% to 98%) on all of these corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.
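
The idea of training one maximum entropy tagger on text from two domains can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the tagging algorithm or the feature set, so the toy sentences, the `features` function, the learning rate, and the greedy per-token decoding below are all assumptions made for illustration. The sketch fits a multinomial logistic regression (the maximum entropy model for classification) by stochastic gradient ascent on a pooled "newswire plus biomedical" training set.

```python
import math
from collections import defaultdict

# Hypothetical toy data standing in for the two training domains.
NEWS = [[("the", "DT"), ("market", "NN"), ("fell", "VBD")],
        [("the", "DT"), ("bank", "NN"), ("rose", "VBD")]]
BIO = [[("the", "DT"), ("protein", "NN"), ("binds", "VBZ")],
       [("IL-2", "NN"), ("activates", "VBZ"), ("the", "DT"), ("gene", "NN")]]


def features(words, i):
    """Assumed local features: word identity, 2-char suffix, previous word, hyphen."""
    w = words[i]
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return ["bias", "w=" + w.lower(), "suf2=" + w[-2:].lower(),
            "prev=" + prev, "hyphen" if "-" in w else "nohyphen"]


def predict_proba(weights, tags, feats):
    """Softmax over per-tag sums of feature weights (the maxent distribution)."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def train(sentences, epochs=300, lr=0.1):
    """Fit weights by stochastic gradient ascent on the conditional log-likelihood."""
    tags = sorted({t for s in sentences for _, t in s})
    data = []
    for s in sentences:
        words = [w for w, _ in s]
        data.extend((features(words, i), t) for i, (_, t) in enumerate(s))
    weights = defaultdict(float)  # (feature, tag) -> weight
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, tags, feats)
            for t in tags:
                # Gradient: observed indicator minus expected probability.
                grad = (1.0 if t == gold else 0.0) - probs[t]
                for f in feats:
                    weights[(f, t)] += lr * grad
    return weights, tags


def tag(weights, tags, words):
    """Greedy decoding: pick the most probable tag for each token independently."""
    result = []
    for i in range(len(words)):
        probs = predict_proba(weights, tags, features(words, i))
        result.append(max(tags, key=lambda t: probs[t]))
    return result


def accuracy(weights, tags, sentences):
    """Per-token accuracy of the tagger on gold-annotated sentences."""
    correct = total = 0
    for s in sentences:
        predicted = tag(weights, tags, [w for w, _ in s])
        correct += sum(p == g for p, (_, g) in zip(predicted, s))
        total += len(s)
    return correct / total


if __name__ == "__main__":
    weights, tags = train(NEWS + BIO)  # pool both domains into one training set
    print(tag(weights, tags, ["the", "gene"]))
    print(accuracy(weights, tags, BIO))
```

Pooling is done simply by concatenating the two sentence lists before training. Per-domain accuracy can then be checked by calling `accuracy` on `NEWS` and `BIO` separately, which mirrors the kind of cross-corpus comparison the abstract reports, although a real evaluation would use held-out data rather than the training sentences.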