Microblog-genre noise and impact on semantic annotation accuracy

Authors:
Leon Derczynski;Diana Maynard;Niraj Aswani;Kalina Bontcheva
Affiliations:
University of Sheffield, Sheffield, UK;University of Sheffield, Sheffield, UK;University of Sheffield, Sheffield, UK;University of Sheffield, Sheffield, UK
Venue:
Proceedings of the 24th ACM Conference on Hypertext and Social Media
Year:
2013

Abstract

Using semantic technologies for mining and intelligent information access to microblogs is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Semantic annotation of tweets is typically performed in a pipeline, comprising successive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition and entity disambiguation (e.g. with respect to DBpedia). Consequently, errors are cumulative, and earlier-stage problems can severely reduce the performance of final stages. This paper presents a characterisation of genre-specific problems at each semantic annotation stage and the impact on subsequent stages. Critically, we evaluate impact on two high-level semantic annotation tasks: named entity detection and disambiguation. Our results demonstrate the importance of making approaches specific to the genre, and indicate a diminishing returns effect that reduces the effectiveness of complex text normalisation.
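
The claim that errors are cumulative across the pipeline can be illustrated with a minimal sketch. The stage names follow the order given in the abstract; the function name `cumulative_accuracy`, the assumption that stage errors are independent, and the per-stage accuracy of 0.9 are all hypothetical choices made for illustration and are not taken from the paper or its results.

```python
# Stage names follow the pipeline order given in the abstract.
STAGES = [
    "language identification",
    "tokenisation",
    "part-of-speech tagging",
    "named entity recognition",
    "entity disambiguation",
]


def cumulative_accuracy(per_stage):
    """Return the running end-to-end accuracy after each stage.

    Simplifying assumptions (not from the paper): a stage can only be
    correct when every earlier stage was correct, and stage errors are
    independent, so accuracies multiply.
    """
    running = []
    acc = 1.0
    for stage in STAGES:
        acc *= per_stage[stage]
        running.append((stage, acc))
    return running


# Hypothetical per-stage accuracies, NOT results reported in the paper.
example = {stage: 0.9 for stage in STAGES}
result = cumulative_accuracy(example)

for stage, acc in result:
    print(f"{stage:28s} {acc:.3f}")
```

Under these assumptions, the running accuracy can only fall as stages are added, which is one simple way to see why a problem at an early stage such as tokenisation limits what named entity recognition and disambiguation can achieve downstream.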