Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. Many of these tasks also require accurate normalization of the recognized entities, i.e., mapping surface forms to unambiguous references to real-world entities. In the context of structured databases, this task (known as record linkage or data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task has recently attracted considerable attention. We consider the task in the challenging context of user-generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature, which normalizes surface forms to Wikipedia pages, performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% on a Dutch-language data set and from 94% to 77% on an English one. We identify several sources of error: entity recognition errors, multiple ways of referring to the same entity, and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, arriving at a language-independent NEN system that achieves overall accuracy of 90% on the English data set and 89% on the Dutch data set. We show that each improvement contributes to the overall score of the improved NEN algorithm, and we conclude with an error analysis on both Dutch- and English-language UGC. The NEN system is computationally efficient, with very modest resource requirements.
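To make the idea of normalizing surface forms to Wikipedia pages concrete, the following is a minimal sketch and not the system described above. It assumes a hypothetical alias table (`ALIASES`) that maps lower-cased surface forms to candidate page titles with made-up counts, and it resolves each surface form to its most frequent candidate. The names `normalize` and `accuracy`, the table contents, and the counts are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical alias table: surface form -> candidate page titles with
# made-up counts. A real table might be derived from redirects or anchor
# texts; that is an assumption here, not something the abstract specifies.
ALIASES = {
    "obama": Counter({"Barack Obama": 95, "Michelle Obama": 5}),
    "barack obama": Counter({"Barack Obama": 100}),
    "big apple": Counter({"New York City": 40}),
    "nyc": Counter({"New York City": 60}),
}


def normalize(surface_form):
    """Map a surface form to its most frequent candidate title, or None."""
    # Lower-case and collapse whitespace so trivial variants share one key.
    key = " ".join(surface_form.lower().split())
    candidates = ALIASES.get(key)
    if not candidates:
        return None  # unknown surface form: leave it unnormalized
    return candidates.most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold-standard titles."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    mentions = ["Obama", "NYC", "Big  Apple", "Gotham"]
    gold = ["Barack Obama", "New York City", "New York City", "New York City"]
    predicted = [normalize(m) for m in mentions]
    print(predicted)
    print(accuracy(predicted, gold))
```

A most-frequent-candidate lookup like this illustrates two of the error sources named above: an ambiguous form such as "Obama" always resolves to the dominant candidate, and an unseen variant such as "Gotham" is not resolved at all.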