Learning to classify text from labeled and unlabeled documents
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
The Journal of Machine Learning Research
Pachinko allocation: DAG-structured mixture models of topic correlations
ICML '06 Proceedings of the 23rd international conference on Machine learning
Mixtures of hierarchical topics with Pachinko allocation
Proceedings of the 24th international conference on Machine learning
Wikify!: linking documents to encyclopedic knowledge
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Proceedings of the 17th international conference on World Wide Web
Modeling online reviews with multi-grain topic models
Proceedings of the 17th international conference on World Wide Web
Learning to link with wikipedia
Proceedings of the 17th ACM conference on Information and knowledge management
A Latent Topic Model for Complete Entity Resolution
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Collective annotation of Wikipedia entities in web text
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
An entity-topic model for entity linking
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Using semi-structured data for assessing research paper similarity
Information Sciences: an International Journal
Fast and accurate incremental entity resolution relative to an entity knowledge base
Proceedings of the 21st ACM international conference on Information and knowledge management
Entity Disambiguation with Freebase
WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Mining evidences for named entity disambiguation
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
A hybrid approach for spotting, disambiguating and annotating places in user-generated text
Proceedings of the 22nd international conference on World Wide Web companion
Hi-index | 0.00 |
Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model} (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.