Entity disambiguation with hierarchical topic models

Authors:
Saurabh S. Kataria;Krishnan S. Kumar;Rajeev R. Rastogi;Prithviraj Sen;Srinivasan H. Sengamedu
Affiliations:
Pennsylvania State University, State College, PA, USA;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Citing 11
Cited 6

Learning to classify text from labeled and unlabeled documents

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Latent dirichlet allocation

The Journal of Machine Learning Research
Pachinko allocation: DAG-structured mixture models of topic correlations

ICML '06 Proceedings of the 23rd international conference on Machine learning
Mixtures of hierarchical topics with Pachinko allocation

Proceedings of the 24th international conference on Machine learning
Wikify!: linking documents to encyclopedic knowledge

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Proceedings of the 17th international conference on World Wide Web
Modeling online reviews with multi-grain topic models

Proceedings of the 17th international conference on World Wide Web
Learning to link with wikipedia

Proceedings of the 17th ACM conference on Information and knowledge management
A Latent Topic Model for Complete Entity Resolution

ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Collective annotation of Wikipedia entities in web text

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1

An entity-topic model for entity linking

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Using semi-structured data for assessing research paper similarity

Information Sciences: an International Journal
Fast and accurate incremental entity resolution relative to an entity knowledge base

Proceedings of the 21st ACM international conference on Information and knowledge management
Entity Disambiguation with Freebase

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Mining evidences for named entity disambiguation

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
A hybrid approach for spotting, disambiguating and annotating places in user-generated text

Proceedings of the 22nd international conference on World Wide Web companion

Quantified Score

Hi-index	0.00

Visualization

Abstract

Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model} (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.