SimSem: fast approximate string matching in relation to semantic category disambiguation

Authors:
Pontus Stenetorp;Sampo Pyysalo;Jun'ichi Tsujii
Affiliations:
The University of Tokyo, Tokyo, Japan;The University of Tokyo, Tokyo, Japan;Microsoft Research Asia, Beijing, People's Republic of China
Venue:
BioNLP '11 Proceedings of BioNLP 2011 Workshop
Year:
2011

Citing 13
Cited 0

Boosting precision and recall of dictionary-based protein name recognition

BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
LIBLINEAR: A Library for Large Linear Classification

The Journal of Machine Learning Research
Introduction to the bio-entity recognition task at JNLPBA

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
How to make the most of NE dictionaries in statistical NER

BioNLP '08 Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing
Overview of BioNLP'09 shared task on event extraction

BioNLP '09 Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task
Design challenges and misconceptions in named entity recognition

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
A dictionary to identify small molecules and drugs in free text

Bioinformatics
Scaling up biomedical event extraction to the entire PubMed

BioNLP '10 Proceedings of the 2010 Workshop on Biomedical Natural Language Processing
Disease mention recognition with specific features

BioNLP '10 Proceedings of the 2010 Workshop on Biomedical Natural Language Processing
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Overview of Genia event task in BioNLP Shared Task 2011

BioNLP Shared Task '11 Proceedings of the BioNLP Shared Task 2011 Workshop
Overview of the epigenetics and post-translational modifications (EPI) task of BioNLP Shared Task 2011

BioNLP Shared Task '11 Proceedings of the BioNLP Shared Task 2011 Workshop
Overview of the infectious diseases (ID) task of BioNLP Shared Task 2011

BioNLP Shared Task '11 Proceedings of the BioNLP Shared Task 2011 Workshop

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this study we investigate the merits of fast approximate string matching to address challenges relating to spelling variants and to utilise large-scale lexical resources for semantic class disambiguation. We integrate string matching results into machine learning-based disambiguation through the use of a novel set of features that represent the distance of a given textual span to the closest match in each of a collection of lexical resources. We collect lexical resources for a multitude of semantic categories from a variety of biomedical domain sources. The combined resources, containing more than twenty million lexical items, are queried using a recently proposed fast and efficient approximate string matching algorithm that allows us to query large resources without severely impacting system performance. We evaluate our results on six corpora representing a variety of disambiguation tasks. While the integration of approximate string matching features is shown to substantially improve performance on one corpus, results are modest or negative for others. We suggest possible explanations and future research directions. Our lexical resources and implementation are made freely available for research purposes at: http://github.com/ninjin/simsem