In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes at: http://github.com/ninjin/simsem
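The feature idea described above, one value per lexical resource giving how close a textual span is to its nearest entry, can be illustrated with a minimal sketch. This is not the implementation from the repository: the function names, the toy resources, and the choice of cosine similarity over character trigrams are all assumptions made for illustration, and the brute-force scan stands in for the fast approximate string matching algorithm the abstract refers to. The sketch also reports similarity (higher is closer) rather than distance.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a bag of character n-grams, padded so short strings still yield grams."""
    pad = "$" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram bags (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def closest_match_features(span, resources):
    """Map each resource name to the similarity of its closest entry to `span`.

    `resources` is a hypothetical dict of resource name -> list of lexical items.
    The linear scan is an assumption for clarity; it would not scale to
    millions of entries.
    """
    span_grams = char_ngrams(span)
    return {
        name: max((cosine(span_grams, char_ngrams(entry)) for entry in entries),
                  default=0.0)
        for name, entries in resources.items()
    }


if __name__ == "__main__":
    # Toy resources, invented for this example only.
    toy_resources = {
        "proteins": ["interleukin-2", "NF-kappa B"],
        "diseases": ["influenza"],
    }
    # A spelling variant ("interleukin 2") still scores high against "proteins".
    print(closest_match_features("interleukin 2", toy_resources))
```

The resulting dictionary could be passed to a classifier as real-valued features, one per resource, so that a spelling variant contributes evidence even when no exact dictionary match exists.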