Named entity normalization in user generated content

Authors:
Valentin Jijkoun;Mahboob Alam Khalid;Maarten Marx;Maarten de Rijke
Affiliations:
University of Amsterdam, Amsterdam;University of Amsterdam, Amsterdam;University of Amsterdam, Amsterdam;University of Amsterdam, Amsterdam
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Abstract

Named entity recognition is important for semantically oriented retrieval tasks such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. Many of these tasks also require accurate normalization of the recognized entities, i.e., mapping surface forms to unambiguous references to real-world entities. In the context of structured databases, this task (known as record linkage or data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task has recently attracted considerable attention. We consider the task in the challenging context of user-generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature, which normalizes surface forms to Wikipedia pages, performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% on a Dutch-language data set and from 94% to 77% on an English-language one. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity, and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, arriving at a language-independent NEN system that achieves overall accuracy of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of the improved NEN algorithm, and we conclude with an error analysis on both Dutch- and English-language UGC. The NEN system is computationally efficient and has very modest resource requirements.
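
To make the task definition and the accuracy figures concrete, here is a minimal sketch of normalization as a lookup from surface forms to page titles, scored by accuracy against gold labels. It is not the baseline or the improved system described in the abstract. The alias table, the function names `normalize` and `accuracy`, and the example entries are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table: lowercased surface form -> Wikipedia page title.
# The entries are illustrative only and are not taken from the paper's data.
ALIAS_TO_PAGE: Dict[str, str] = {
    "university of amsterdam": "University of Amsterdam",
    "uva": "University of Amsterdam",
    "amsterdam": "Amsterdam",
    "a'dam": "Amsterdam",
}


def normalize(surface_form: str) -> Optional[str]:
    """Map a surface form to a page title, or None if it is unknown."""
    # Lowercase and collapse whitespace so trivial variants share one key.
    key = " ".join(surface_form.lower().split())
    return ALIAS_TO_PAGE.get(key)


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (surface form, gold title) pairs normalized correctly."""
    if not pairs:
        return 0.0
    correct = sum(1 for surface, gold in pairs if normalize(surface) == gold)
    return correct / len(pairs)
```

A plain lookup like this handles the "multiple ways of referring to the same entity" error source only for aliases already in the table. It does nothing for ambiguous references, which need context to resolve.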